Spark RDD
Last updated
Was this helpful?
Last updated
Was this helpful?
Resilient Distributed DataSet = a collection of fault-tolerant operational elements that run in parallel.
2 operation = [1] transformation, [2] action
For parallel processing, Spark uses shared variables = 先把A copy of shared variable送到每個node, 計算時可在每個node直接用.
A shared variable is cached on all nodes. (not sent on machines with tasks)
Like counter accumulated and shared across several nodes.