apache/spark
5 articles
Prerequisites
- Basic understanding of distributed systems concepts
- Familiarity with Scala syntax (case classes, traits, pattern matching)
- General knowledge of build tools (Maven or SBT)
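The level of Scala fluency assumed is roughly this: a hypothetical sketch (the `Plan`/`Scan`/`Filter` names are illustrative, not taken from Spark's codebase) showing a sealed trait, case classes, and pattern matching — the idioms Spark's internals use everywhere.

```scala
// Illustrative only: a sealed trait hierarchy deconstructed with pattern matching.
sealed trait Plan
case class Scan(table: String) extends Plan
case class Filter(condition: String, child: Plan) extends Plan

def describe(p: Plan): String = p match {
  case Scan(t)          => s"scan of $t"
  case Filter(c, child) => s"filter ($c) over " + describe(child)
}
```

If `describe(Filter("age > 21", Scan("users")))` reads naturally to you, the articles' code walkthroughs should too.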
Navigating Apache Spark's Codebase: Architecture and Module Map
A comprehensive mental model of Apache Spark's monorepo structure, its ~40 Maven modules, key entry points, and the Classic vs Connect architectural split.
The Boot Sequence: SparkContext, SparkEnv, and the Scheduling Stack
Traces the complete initialization path from spark-submit to a running Spark application, dissecting SparkContext's init sequence, SparkEnv, and the two-level scheduling stack.
The Catalyst Query Pipeline: From SQL Text to Optimized Plan
Deep-dive into Spark SQL's Catalyst optimizer: TreeNode abstraction, RuleExecutor framework, and the complete query pipeline from parsing to physical planning.
From Plan to Execution: RDDs, Stages, Tasks, and the Shuffle
How physical SparkPlans become distributed computation: RDD properties, stage creation, the shuffle system, BlockManager, and Adaptive Query Execution.
Spark Connect and the Extensibility Architecture
The client-server decoupling via gRPC/Protobuf, plus extensibility patterns: SparkSessionExtensions, pluggable cluster managers, ShuffleManager, and Data Source API V2.