Spark RDD
https://ithelp.ithome.com.tw/articles/10192489
RDD
Resilient Distributed Dataset = a fault-tolerant collection of elements, split into partitions, that can be operated on in parallel.
1. Read only = operations do not alter the original data; each one produces a new RDD
2. Many operations = more than map & reduce
3. Parallel = each partition is processed locally on the node that holds it
4. Fault tolerant = DAG and lineage (if a node holding a partition fails, another node recomputes that partition from its lineage.)
2 kinds of operations = [1] transformation (lazy, returns a new RDD), [2] action (triggers the computation and returns a result)
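The read-only and transformation/action points can be sketched with a toy model in plain Python. This is not the Spark API: ToyRDD and its methods are hypothetical names made up for illustration, and the only assumption is that a transformation records a step while an action runs the recorded steps.

```python
# Toy model of an RDD in plain Python (NOT the Spark API; ToyRDD is hypothetical).
class ToyRDD:
    def __init__(self, partitions, lineage=()):
        # Source data is stored as tuples so it is never mutated.
        self._partitions = [tuple(p) for p in partitions]
        # Recorded transformations, in the order they were requested.
        self._lineage = tuple(lineage)

    def map(self, fn):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute(self, partition):
        # Apply every recorded step to one partition.
        items = list(partition)
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # Action: runs the recorded steps partition by partition.
        out = []
        for p in self._partitions:
            out.extend(self._compute(p))
        return out


base = ToyRDD([[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = evens_squared.collect()   # work happens only here
```

In this sketch, base is unchanged after the two transformations, which is the "read only" property in the list above.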
Simpler than Hadoop MapReduce
1. Spark can keep data in memory between steps = fast
2. Complete recovery of lost partitions using the lineage graph
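Recovery through lineage can be illustrated with another plain-Python toy. It is a sketch under stated assumptions, not Spark internals: source, lineage, cache, and get_partition are hypothetical names, and the model assumes the input data itself is durable while computed results live only in memory.

```python
# Toy model of lineage-based recovery (NOT Spark internals; all names hypothetical).
source = {0: [1, 2, 3], 1: [4, 5, 6]}            # durable input, one list per partition
lineage = [lambda x: x + 1, lambda x: x * 10]    # transformations, applied in order


def compute_partition(pid):
    # Rebuild one partition by replaying the lineage on its source data.
    items = source[pid]
    for step in lineage:
        items = [step(x) for x in items]
    return items


# In-memory results, as if held by two nodes.
cache = {pid: compute_partition(pid) for pid in source}

# Simulate losing the node that held partition 1.
del cache[1]


def get_partition(pid):
    # Only the missing partition is recomputed; partition 0 is untouched.
    if pid not in cache:
        cache[pid] = compute_partition(pid)
    return cache[pid]


recovered = get_partition(1)
```

The point of the sketch is that nothing has to be replicated up front: the recipe (lineage) plus the source data is enough to rebuild a lost partition.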
Broadcast & Accumulator
For parallel processing, Spark uses shared variables = a copy of the shared variable is first sent to each node, so tasks can use it directly on that node during computation.
Broadcast
A read-only shared variable that is cached once on every node, instead of being shipped again with each task.
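A rough plain-Python model of that caching behavior follows. It is not the Spark API (in PySpark the corresponding calls are SparkContext.broadcast(value) and reading .value inside tasks); Node, get_broadcast, and the lookup table are hypothetical names used only for illustration.

```python
# Toy model of a broadcast variable (NOT the Spark API; all names hypothetical).
class Node:
    def __init__(self, name):
        self.name = name
        self.broadcast_cache = {}   # broadcast id -> cached value
        self.transfers = 0          # how many times a value was shipped to this node

    def get_broadcast(self, bid, fetch):
        # Fetch the value only the first time this node needs it.
        if bid not in self.broadcast_cache:
            self.broadcast_cache[bid] = fetch()
            self.transfers += 1
        return self.broadcast_cache[bid]


country_names = {"TW": "Taiwan", "JP": "Japan"}   # the shared lookup table
nodes = [Node("node-a"), Node("node-b")]

# Three tasks: two scheduled on node-a, one on node-b.
tasks = [(nodes[0], ["TW", "JP"]), (nodes[0], ["JP"]), (nodes[1], ["TW"])]

results = []
for node, codes in tasks:
    table = node.get_broadcast("countries", lambda: dict(country_names))
    results.append([table[c] for c in codes])

transfers = [n.transfers for n in nodes]   # one transfer per node, not per task
```

Here node-a runs two tasks but receives the table only once, which is the saving a broadcast variable is meant to give.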
Accumulator
Like a counter: tasks on several nodes add to it, and the combined value is read back on the driver.
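A minimal sketch of the counter idea in plain Python, assuming each task keeps a local count that is merged into one total afterwards. It is not the Spark API (in PySpark the corresponding calls are SparkContext.accumulator(0), acc.add(n) inside tasks, and acc.value on the driver); ToyAccumulator and run_task are hypothetical names.

```python
# Toy model of an accumulator (NOT the Spark API; all names hypothetical).
class ToyAccumulator:
    def __init__(self, initial=0):
        self.value = initial

    def merge(self, partial):
        # The driver folds each task's local count into the total.
        self.value += partial


def run_task(records):
    # One task: parse its partition and count the records it could not parse.
    local_bad = 0
    parsed = []
    for r in records:
        try:
            parsed.append(int(r))
        except ValueError:
            local_bad += 1
    return parsed, local_bad


partitions = [["1", "x", "3"], ["4", "", "oops"]]
bad_records = ToyAccumulator()
parsed_all = []

for part in partitions:
    parsed, local_bad = run_task(part)
    parsed_all.extend(parsed)
    bad_records.merge(local_bad)   # tasks only add; the total is read at the end
```

A typical use is this kind of side count (bad records, skipped lines) that should not change the main result of the job.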