Spark RDD
RDD
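1. Read-only = transformations never alter the original data; they return a new RDD
2. Many operations = more than just map & reduce (e.g. filter, join, groupByKey)
3. Parallel = each node in the cluster processes its own partitions locally
4. Fault tolerant = DAG and lineage (if a node holding a partition fails, another node recomputes that partition from its lineage)
Simpler than Hadoop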
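The read-only and lineage ideas can be sketched with a toy model in plain Python. This is not the Spark API: `ToyRDD` and its fields are hypothetical names made up for illustration. Each dataset is immutable and only remembers its parent and the operation that derived it, so a lost result can always be rebuilt by replaying that chain.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)  # frozen = read-only, like an RDD
class ToyRDD:
    """Hypothetical stand-in for an RDD (assumption, not Spark code)."""
    source: Optional[Tuple[int, ...]] = None        # base data, set only on the root
    parent: Optional["ToyRDD"] = None               # lineage: where the data came from
    op: Optional[Callable[[tuple], tuple]] = None   # lineage: how it was derived

    def map(self, f):
        # Returns a NEW dataset; the parent is left untouched.
        return ToyRDD(parent=self, op=lambda rows: tuple(f(x) for x in rows))

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: tuple(x for x in rows if pred(x)))

    def collect(self):
        # Nothing is stored: every call replays the lineage from the root,
        # which is also how a lost partition would be rebuilt.
        if self.parent is None:
            return list(self.source)
        return list(self.op(tuple(self.parent.collect())))


base = ToyRDD(source=(1, 2, 3, 4))
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(even_squares.collect())  # [4, 16]
print(base.collect())          # [1, 2, 3, 4] (original data unchanged)
```

In real Spark the lineage is tracked per partition, so only the lost partitions are recomputed rather than the whole dataset.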
1. Spark keeps intermediate data in memory = faster than Hadoop MapReduce, which writes to disk between stages
2. Complete recovery using the lineage graph
Broadcast & Accumulator
Broadcast
Accumulator
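Broadcast variables and accumulators are Spark's two kinds of shared variables: a broadcast variable is a read-only value cached on each worker instead of being shipped with every task, and an accumulator is a value that tasks can only add to and that the driver reads back. The sketch below models those semantics with the Python standard library only. It is a toy, not Spark code; `country_names`, `partitions` and `process_partition` are hypothetical names chosen for illustration. In PySpark the corresponding calls are `SparkContext.broadcast` and `SparkContext.accumulator`.

```python
from concurrent.futures import ThreadPoolExecutor
from types import MappingProxyType

# "Broadcast": a read-only lookup table visible to every task.
country_names = MappingProxyType({"DE": "Germany", "FR": "France"})

# Two "partitions" of input records (made-up sample data).
partitions = [["DE", "FR", "XX"], ["FR", "??", "DE"]]


def process_partition(codes):
    """Task run once per partition.

    Reads the shared table and returns the resolved names together with
    a local count of unknown codes (the task's accumulator contribution).
    """
    names = []
    unknown = 0
    for code in codes:
        if code in country_names:
            names.append(country_names[code])
        else:
            unknown += 1  # tasks only add, they never read the total
    return names, unknown


with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_partition, partitions))

resolved = [name for names, _ in results for name in names]
# "Accumulator": the driver merges the per-task contributions.
unknown_total = sum(count for _, count in results)

print(resolved)       # ['Germany', 'France', 'France', 'Germany']
print(unknown_total)  # 2
```

The split mirrors the two roles: data flows one way from the driver to tasks through the broadcast value, and one way back from tasks to the driver through the accumulator.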