apache/spark
5 articles
Prerequisites
- Basic understanding of distributed systems concepts
- Familiarity with Scala syntax (case classes, traits, pattern matching)
- General knowledge of build tools (Maven or SBT)
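The level of Scala fluency assumed is roughly this: a hypothetical sketch (the `Plan`/`Scan`/`Filter` names are illustrative, not taken from Spark's codebase) showing a sealed trait, case classes, and pattern matching — the idioms Spark's internals use everywhere.

```scala
// Illustrative only: a sealed trait hierarchy deconstructed with pattern matching.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

def describe(p: Plan): String = p match {
  case Scan(t)          => s"scan of $t"
  case Filter(c, child) => s"filter ($c) over " + describe(child)
}
```

If `describe(Filter("age > 21", Scan("users")))` reads naturally to you, the articles' code walkthroughs should too.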
Navigating Apache Spark's Codebase: Architecture and Module Map
A comprehensive mental model of Apache Spark's monorepo structure, its ~40 Maven modules, key entry points, and the Classic vs Connect architectural split.
The Boot Sequence: SparkContext, SparkEnv, and the Scheduling Stack
Traces the complete initialization path from spark-submit to a running Spark application, dissecting SparkContext's init sequence, SparkEnv, and the two-level scheduling stack.
The Catalyst Query Pipeline: From SQL Text to Optimized Plan
Deep-dive into Spark SQL's Catalyst optimizer: TreeNode abstraction, RuleExecutor framework, and the complete query pipeline from parsing to physical planning.
From Plan to Execution: RDDs, Stages, Tasks, and the Shuffle
How physical SparkPlans become distributed computation: RDD properties, stage creation, the shuffle system, BlockManager, and Adaptive Query Execution.
Spark Connect and the Extensibility Architecture
The client-server decoupling via gRPC/Protobuf, plus extensibility patterns: SparkSessionExtensions, pluggable cluster managers, ShuffleManager, and Data Source API V2.