PySpark
framework3.utils.pyspark
PySparkMapReduce
Bases: MapReduceStrategy
A MapReduce strategy implementation using PySpark for distributed computing.
This class provides methods to perform map and reduce operations on large datasets using Apache Spark's distributed computing capabilities.
Key Features
- Initializes a Spark session with configurable parameters
- Supports map, flatMap, and reduce operations
- Allows for parallel processing of data across multiple workers
- Provides a method to stop the Spark context when processing is complete
Usage
```python
from framework3.utils.pyspark import PySparkMapReduce

# Initialize the PySparkMapReduce
spark_mr = PySparkMapReduce(app_name="MySparkApp", master="local[*]", num_workers=4)

# Perform map operation
data = [1, 2, 3, 4, 5]
mapped_data = spark_mr.map(data, lambda x: x * 2)

# Perform reduce operation
result = spark_mr.reduce(lambda x, y: x + y)
print(result)  # Output: 30

# Stop the Spark context
spark_mr.stop()
```
Attributes:

Name | Type | Description
---|---|---
sc | SparkContext | The Spark context used for distributed computing.
Methods:

Name | Description
---|---
map | map(data: Any, map_function: Callable[..., Any], numSlices: int \| None = None) -> Any: Applies a map function to the input data in parallel.
flatMap | flatMap(data: Any, map_function: Callable[..., Any], numSlices: int \| None = None) -> Any: Applies a flatMap function to the input data in parallel.
reduce | reduce(reduce_function: Callable[..., Any]) -> Any: Applies a reduce function to the mapped data.
stop | Stops the Spark context.
sc = spark.sparkContext
instance-attribute
__init__(app_name, master='local', num_workers=4)
Initialize the PySparkMapReduce with a Spark session.
Parameters:

Name | Type | Description | Default
---|---|---|---
app_name | str | The name of the Spark application. | required
master | str | The Spark master URL. Defaults to "local". | 'local'
num_workers | int | The number of worker instances. Defaults to 4. | 4
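A quick illustration of the constructor signature above; the application name, master URL, and worker count are placeholder values, not recommendations:

```python
from framework3.utils.pyspark import PySparkMapReduce

# Placeholder configuration: any valid Spark master URL works here,
# e.g. "local[*]" for all local cores or "spark://host:7077" for a cluster.
spark_mr = PySparkMapReduce(
    app_name="ExampleApp",
    master="local[*]",
    num_workers=8,
)
```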
flatMap(data, map_function, numSlices=None)
Apply a flatMap function to the input data in parallel.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | Any | The input data to be processed. | required
map_function | Callable[..., Any] | The function to apply to each element of the data. | required
numSlices | int \| None | The number of partitions to create. Defaults to None. | None
Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the flatMap operation as a PySpark RDD.
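A minimal sketch of flatMap based on the parameters above; the sentences and split function are illustrative, and collect() is a standard PySpark RDD method used here only to inspect the result, not part of this class:

```python
from framework3.utils.pyspark import PySparkMapReduce

spark_mr = PySparkMapReduce(app_name="FlatMapExample")

# Each sentence yields several words, so the resulting RDD is flattened
# into one element per word rather than one list per sentence.
sentences = ["spark makes this simple", "map and reduce"]
words_rdd = spark_mr.flatMap(sentences, lambda s: s.split(), numSlices=2)

print(words_rdd.collect())  # ['spark', 'makes', 'this', 'simple', 'map', 'and', 'reduce']
spark_mr.stop()
```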
map(data, map_function, numSlices=None)
Apply a map function to the input data in parallel.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | Any | The input data to be processed. | required
map_function | Callable[..., Any] | The function to apply to each element of the data. | required
numSlices | int \| None | The number of partitions to create. Defaults to None. | None
Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the map operation as a PySpark RDD.
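A minimal sketch of map under the same assumptions (illustrative data and lambda; collect() is standard PySpark, not part of this class):

```python
from framework3.utils.pyspark import PySparkMapReduce

spark_mr = PySparkMapReduce(app_name="MapExample")

# Square each element; numSlices=4 asks Spark to split the input into 4 partitions.
squares_rdd = spark_mr.map([1, 2, 3, 4, 5], lambda x: x * x, numSlices=4)

print(squares_rdd.collect())  # [1, 4, 9, 16, 25]
spark_mr.stop()
```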
reduce(reduce_function)
Apply a reduce function to the mapped data.
Parameters:

Name | Type | Description | Default
---|---|---|---
reduce_function | Callable[..., Any] | The function to reduce the mapped data. | required
Returns:

Name | Type | Description
---|---|---
Any | Any | The result of the reduce operation.
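A minimal sketch of reduce, mirroring the class-level Usage example above (the data and lambdas are illustrative):

```python
from framework3.utils.pyspark import PySparkMapReduce

spark_mr = PySparkMapReduce(app_name="ReduceExample")

# reduce() takes no data argument: it folds the RDD produced by the
# preceding map()/flatMap() call, so one of those must run first.
spark_mr.map([1, 2, 3, 4, 5], lambda x: x * 2)
total = spark_mr.reduce(lambda x, y: x + y)

print(total)  # 30
spark_mr.stop()
```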
stop()
Stop the Spark context.
This method should be called when you're done with Spark operations to release resources.
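One way to make sure stop() always runs is to wrap the job in try/finally; the data and functions below are illustrative:

```python
from framework3.utils.pyspark import PySparkMapReduce

spark_mr = PySparkMapReduce(app_name="SafeShutdown")
try:
    spark_mr.map([1, 2, 3], lambda x: x + 1)
    print(spark_mr.reduce(lambda x, y: x + y))  # 9
finally:
    # Release the Spark context even if the job above raises.
    spark_mr.stop()
```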