Spark RDD

https://ithelp.ithome.com.tw/articles/10192489

RDD

Resilient Distributed DataSet = a collection of fault-tolerant operational elements that run in parallel.

1. Read only = not alter original data
2. Many operations = more than map & reduce
3. Parallel = process at each cluster locally
4. Fault tolerant = DAG and lineage (a node holding the partition fails, the other node retrieves its data.)

2 operation = [1] transformation, [2] action

Simple than hadoop

1. Spark stores data in-memory = fast
2. complete recovery using lineage graph

Broadcast & Accumulator

For parallel processing, Spark uses shared variables = 先把A copy of shared variable送到每個node, 計算時可在每個node直接用.

Broadcast

A shared variable is cached on all nodes. (not sent on machines with tasks)

Accumulator

Like counter accumulated and shared across several nodes.

Last updated

Was this helpful?