Reduce and Fold in Spark
We have two commonly used RDD functions reduce
and fold
in Spark, and this article mainly talks about their similaritiy and difference, and under what scenarios should they be used.
Looking at the signatures of two methods, we find that both reduce
and fold
accept a function, but fold
in addition take a zeroValue
to be used for the initiali call of each partition.
Similarity
Most of the time, we do not really get bothered by the difference between the two.
The extra parameter required in reduce
is zeroValue
. However, zeroValue
has different values in different scenarios. In general, it is an identify element for your operation. That is, for addition, any number can be added by 0
and remains the same; for multiplication, any number can be multiplied by 1
and remains the same.
As long as we are careful with the choice of zeroValue
, reduce
and fold
can be used interchangeably.
Hmm, what if we are not careful with zeroValue
?
Difference
For addition zeroValue
is 0
. What if we make a mistake and choose 1
for addition? Let’s see an example.
Note that the results by reduce
still remain the same while the results by fold
is obviously wrong. But where does it get wrong? 2 + 3 + 4 = 9
from reduce
, and how does 2 + 3 + 4 + ? = 14
from fold
? Let’s first see how many partitions there are for the RDD.
Now that we know there are 4 partitions in the RDD, let’s slip in some print statements to the operation that we pass to RDD function, to get a glimpse of what happens at each step.
They may seen a bit messy at the first sight. Let’s illustrate with two flowcharts.
Reduce
Fold
Now the mystery is solved! The operation reduce
never has any type of zeroValue
- it simply adds up all values within each partition, and then adds up the totals of each partition.
Instead, fold
has a zeroValue
for each partition, which is 1
. Each partition will use zeroValue
as a starting point, and add it with all numbers within this partition - even when this partition is empty. Finally, when adding the totals of each partition, it will have another zeroValue
as a starting point in the driver program. A simple formula to illustrate such situations of addition can be expressed as
Similarly, for multiplication, we can deduce the following formula
When zeroValue=0
, rdd.fold(zeroValue, add)==rdd.reduce(add)
. When zeroValue=1
, rdd.fold(zeroValue, multiply)==rdd.reduce(multiply)
. Now you also see why zeroValue
is 0
for addition and 1
for multiplication.
Special usage for reduce
Most of the time, we would be comfortable with either reduce
and fold
, as long as we do not make mistakes on zeroValue
. However, there are some special cases where fold
is desired.
Empty RDD
If you run reduce
on an empty RDD, you will come across the following error
But you can safely run fold
on it with a correct zeroValue
In this case, zeroValue
is playing the role of a default value for an empty RDD.
Bar raiser
If you are running an auction for your old iPhone 6 on eBay, you would like to sell it to the buyer with the highest bid. However, you also have a threshold under which you would rather keep it yourself. Let’s say, your threshold is $300. You will finally sell your lovely iPhone 6 to the buyer whose bid price is the highest and also larger than $300, or set the price to $300 if no one bids more than $300.
You run the program above, and sadly find that no one bids more than $300. You finally set the price to your threshold which is $300 and hope someone would take it someday.
Conclusion
We have illustrates the difference between reduce
and fold
. There are also lots of RDD functions in Spark worth exploring. Just remember that most of the time in the Spark world, computation happens within each partition first, and then gets combined and shuffled. Keep this in mind and you will be able to understand how Spark works and solve lots of mysteries yourself.
An exercise: draw the flowcharts of reduce
and fold
for multiplication :)