Navigating Apache Spark's Codebase: Architecture and Module Map
Prerequisites
- Basic understanding of distributed systems concepts
- Familiarity with Scala syntax (case classes, traits, pattern matching)
- General knowledge of build tools (Maven or SBT)
Apache Spark is one of the largest and most influential open-source data processing engines in existence. Its codebase spans roughly 40 Maven modules, supports four programming languages, and orchestrates computation across thousands of machines. For a newcomer, cracking open this repository can feel overwhelming — where do you even start?
This article provides you with a mental map. We'll survey the monorepo structure, trace the dependency layers from low-level utilities up through the SQL optimizer, follow the path from spark-submit to your running application, and understand the major architectural split between Classic and Connect execution modes introduced in Spark 4.x. By the end, you'll know which directory to look in for any given concern, and how the pieces fit together.
Monorepo Structure and Module Organization
Spark's repository is a classic monorepo: everything lives in one Git repository, built by a single root POM. The root pom.xml enumerates every module:
```mermaid
graph TD
    subgraph "Common Utilities"
        A[common/sketch]
        B[common/kvstore]
        C[common/network-common]
        D[common/network-shuffle]
        E[common/unsafe]
        F[common/utils]
        G[common/variant]
    end
    subgraph "Core Engine"
        H[core]
    end
    subgraph "SQL Stack"
        I[sql/api]
        J[sql/catalyst]
        K[sql/core]
        L[sql/hive]
        M[sql/pipelines]
    end
    subgraph "Connect"
        N[sql/connect/common]
        O[sql/connect/server]
        P[sql/connect/client/jvm]
    end
    subgraph "Connectors"
        Q[connector/kafka-0-10-sql]
        R[connector/avro]
        S[connector/protobuf]
    end
    subgraph "Resource Managers"
        T[resource-managers/kubernetes]
        U[resource-managers/yarn]
    end
    C --> H
    E --> H
    H --> J
    I --> K
    J --> K
    K --> O
    K --> L
```
The modules fall into six logical groups:
| Group | Directory | Purpose |
|---|---|---|
| Common | common/* | Low-level utilities: networking, memory management, sketches, key-value store |
| Core | core/ | RDD abstraction, scheduling (DAGScheduler, TaskScheduler), storage (BlockManager), RPC |
| SQL | sql/api, sql/catalyst, sql/core, sql/hive | The DataFrame/Dataset API, Catalyst optimizer, query execution engine |
| Connect | sql/connect/* | Client-server architecture: Protobuf protocol, gRPC server, JVM/JDBC clients |
| Connectors | connector/* | Data source integrations: Kafka, Avro, Protobuf |
| Resource Managers | resource-managers/* | Cluster manager plugins: Kubernetes, YARN |
There are also language bindings (python/, R/), legacy modules (streaming/, graphx/, mllib/), and build infrastructure (assembly/, launcher/, examples/, repl/).
Tip: The `common/unsafe` module is Spark's off-heap memory layer. It provides `Platform` for raw memory access and `UnsafeRow` for the binary row format used throughout the SQL engine — understanding it is key to performance work.
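As a quick illustration, here is a minimal sketch of that raw-memory layer. The method names match `Platform`'s static API as I understand it; running this assumes the spark-unsafe artifact is on the classpath.

```scala
// Minimal sketch of common/unsafe's Platform (assumes spark-unsafe on the classpath).
import org.apache.spark.unsafe.Platform

object OffHeapDemo {
  def main(args: Array[String]): Unit = {
    val addr = Platform.allocateMemory(4) // raw off-heap allocation, invisible to the GC
    Platform.putInt(null, addr, 42)       // base = null means "treat offset as an absolute address"
    println(Platform.getInt(null, addr))
    Platform.freeMemory(addr)             // manual lifecycle: nothing frees this for you
  }
}
```

`UnsafeRow` builds on exactly these primitives: fixed-width fields live at known offsets in one contiguous buffer, so field access is pointer arithmetic rather than object dereferencing.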
The Three-Layer Architecture
Spark's codebase follows a strict layered dependency model. Think of it as three concentric rings:
```mermaid
flowchart TB
    subgraph Layer1["Layer 1: Core Engine"]
        direction LR
        RDD["RDD Abstraction"]
        SCHED["DAGScheduler + TaskScheduler"]
        STORE["BlockManager + Shuffle"]
        RPC["RpcEnv"]
    end
    subgraph Layer2["Layer 2: SQL/Catalyst"]
        direction LR
        PARSE["Parser (ANTLR4)"]
        ANALYZE["Analyzer"]
        OPT["Optimizer"]
        PLAN["Physical Planner"]
    end
    subgraph Layer3["Layer 3: User-Facing API"]
        direction LR
        SESSION["SparkSession"]
        DF["DataFrame / Dataset"]
        CONNECT["Spark Connect (gRPC)"]
    end
    Layer3 --> Layer2
    Layer2 --> Layer1
```
Layer 1 — Core Engine (core/) provides the distributed compute primitives. The RDD (Resilient Distributed Dataset) is the fundamental abstraction: an immutable, partitioned collection with lineage-based fault tolerance. The DAGScheduler breaks computation into stages, the TaskScheduler ships tasks to executors, and BlockManager handles storage.
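The Layer 1 primitives are easiest to see in a few lines of code. This is a hedged sketch, not a Spark internals excerpt — it assumes spark-core on the classpath, and `local[2]` keeps everything in one JVM for illustration:

```scala
// Layer 1 in miniature (assumes spark-core on the classpath).
import org.apache.spark.{SparkConf, SparkContext}

object RddDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-demo"))
    val rdd = sc.parallelize(1 to 100, numSlices = 4) // immutable, partitioned collection
    val doubled = rdd.map(_ * 2)                      // lazy: only lineage is recorded
    val sum = doubled.reduce(_ + _)                   // action: DAGScheduler builds stages,
                                                      // TaskScheduler ships tasks to executors
    println(sum)                                      // 10100
    sc.stop()
  }
}
```

Nothing runs until `reduce` — the two transformations only extend the lineage graph, which is also what makes recomputation-based fault tolerance possible.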
Layer 2 — SQL/Catalyst (sql/catalyst/, sql/core/) is where Spark SQL lives. The Catalyst optimizer transforms SQL or DataFrame operations through a pipeline: parsing → analysis → optimization → physical planning. The output is a physical plan (SparkPlan) whose execute() method produces an RDD[InternalRow] — bridging back down to Layer 1.
Layer 3 — User-Facing APIs (sql/api/, sql/connect/) provides the entry points developers actually use: SparkSession, DataFrame, Dataset. In Spark 4.x, this layer splits into two implementations: Classic (in-process JVM) and Connect (gRPC client-server).
This layering is enforced by Maven module dependencies. sql/catalyst depends on core, but core has zero knowledge of SQL. This clean separation is what allows the RDD engine to remain general-purpose while the SQL layer provides higher-level abstractions.
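To make Layer 2's rewriting model concrete, here is a toy, self-contained imitation of Catalyst's rule-based transformation — plain Scala, none of Spark's real classes, but the same bottom-up pattern as `TreeNode.transformUp`:

```scala
// Toy imitation of Catalyst's rule-based rewriting: an expression tree
// plus a constant-folding rule, applied bottom-up like TreeNode.transformUp.
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object MiniOptimizer {
  // Rewrite children first, then try the rule at this node (bottom-up).
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(rewritten, identity[Expr])
  }

  // Constant folding, the classic optimizer rule.
  val constantFold: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
  }

  def main(args: Array[String]): Unit = {
    val plan = Add(Add(Lit(1), Lit(2)), Lit(3))
    println(transformUp(plan)(constantFold)) // Lit(6)
  }
}
```

The real optimizer is this idea at scale: hundreds of partial-function rules, organized into batches and applied to `LogicalPlan` trees until a fixed point is reached.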
CLI Entry Points: From Shell to JVM
When you run spark-submit my-app.jar, what actually happens? Let's trace the path.
```mermaid
sequenceDiagram
    participant User
    participant spark_submit as bin/spark-submit
    participant spark_class as bin/spark-class
    participant Launcher as o.a.s.launcher.Main
    participant SparkSubmit as SparkSubmit.doSubmit()
    participant App as User's main()
    User->>spark_submit: spark-submit --class MyApp app.jar
    spark_submit->>spark_class: exec spark-class o.a.s.deploy.SparkSubmit "$@"
    spark_class->>Launcher: java -cp ... o.a.s.launcher.Main SparkSubmit ...
    Launcher-->>spark_class: Returns JVM command with full classpath
    spark_class->>SparkSubmit: exec java -cp <full-classpath> SparkSubmit args...
    SparkSubmit->>SparkSubmit: prepareSubmitEnvironment()
    SparkSubmit->>App: Reflectively invoke main()
```
Step 1: bin/spark-submit is a thin shell script — just 8 lines of real code. It locates SPARK_HOME, disables Python hash randomization, and delegates to bin/spark-class with the class name org.apache.spark.deploy.SparkSubmit.
Step 2: bin/spark-class does the heavy lifting on the shell side. It finds Java, assembles the initial classpath from $SPARK_HOME/jars/*, and invokes org.apache.spark.launcher.Main — a small Java program that outputs the final java command with the correct classpath, memory settings, and JVM options. The output is parsed and exec'd.
Step 3: SparkSubmit.doSubmit() parses arguments and dispatches on the action — SUBMIT, KILL, REQUEST_STATUS, or PRINT_VERSION. For SUBMIT, it calls prepareSubmitEnvironment(), the real workhorse: it determines the cluster manager (YARN, Kubernetes, Standalone, or Local), resolves the deploy mode (client or cluster), and downloads dependency JARs; the resolved main class is then invoked reflectively.
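That final reflective step can be sketched in a few self-contained lines. `MyApp` is a hypothetical stand-in for the user's application class; SparkSubmit's real code additionally manages classloaders and `SparkApplication` wrappers:

```scala
// Self-contained sketch of how a main class is launched by name via reflection.
object MyApp {
  def main(args: Array[String]): Unit =
    println(s"Hello from ${args.mkString(", ")}")
}

object ReflectiveLaunch {
  def main(args: Array[String]): Unit = {
    // A Scala object compiles to a class with a static main forwarder,
    // so plain Java reflection finds it.
    val klass = Class.forName("MyApp")
    val mainMethod = klass.getMethod("main", classOf[Array[String]])
    // Cast to AnyRef so the array is passed as ONE reflective argument
    // rather than being spread across Method.invoke's varargs.
    mainMethod.invoke(null, Array("alpha", "beta").asInstanceOf[AnyRef])
  }
}
```

This is why a typo in `--class` surfaces as a `ClassNotFoundException` only after all the classpath assembly has already succeeded.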
Tip: When debugging submission issues, add `--verbose` to your `spark-submit` command. This sets `appArgs.verbose` in `doSubmit()` and prints the resolved classpath, deploy mode, and all configuration before launching your application.
SparkSession: The User-Facing Entry Point
If you've written any Spark SQL code, you've used SparkSession. In Spark 4.x, this is an abstract class defined in sql/api:
sql/api/src/main/scala/org/apache/spark/sql/SparkSession.scala#L63-L72
```mermaid
classDiagram
    class SparkSession {
        <<abstract>>
        +sparkContext: SparkContext*
        +version: String
        +sql(query: String): DataFrame
        +read: DataFrameReader
        +createDataFrame(): DataFrame
    }
    class ClassicSparkSession {
        +sparkContext: SparkContext
        -sharedState: SharedState
        -sessionState: SessionState
        +extensions: SparkSessionExtensions
    }
    class ConnectSparkSession {
        -client: SparkConnectClient
        -channel: ManagedChannel
    }
    SparkSession <|-- ClassicSparkSession
    SparkSession <|-- ConnectSparkSession
```
The abstract SparkSession defines the user-facing API — sql(), read, createDataFrame(), and so on. It has two concrete implementations:
Classic (sql/core/src/main/scala/org/apache/spark/sql/classic/SparkSession.scala) runs everything in-process. It wraps a SparkContext, holds a SessionState (which contains the parser, analyzer, optimizer, and planner), and executes queries directly in the driver JVM. This is the traditional Spark execution model.
Connect runs as a thin gRPC client. DataFrame operations are serialized as Protobuf messages and sent to a remote Spark Connect server for execution. The client never sees an RDD or SparkContext — it just receives Arrow-formatted results back over the wire.
This split is a major architectural decision in Spark 4.x. The motivation: Classic's in-process model ties the user's application to the Spark runtime. If you upgrade Spark, you must recompile your application. With Connect, the client is decoupled — you can upgrade the server independently, and lightweight clients in Python, Go, or Rust can talk to Spark without embedding the JVM.
Note the @ClassicOnly annotations on methods like sparkContext and sharedState — these are only available in Classic mode. Code that uses them won't work with Connect.
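From the user's side, the split looks like this. A hedged sketch: it assumes Spark 4.x artifacts on the classpath (spark-sql for Classic, the Connect JVM client for Connect), and the `sc://` endpoint is a placeholder for a running Spark Connect server on its default port:

```scala
// Obtaining each kind of session (sketch; not runnable without a Spark runtime).
import org.apache.spark.sql.SparkSession

// Classic: in-process — creates a SparkContext inside this JVM.
val classic = SparkSession.builder()
  .master("local[*]")
  .appName("classic-demo")
  .getOrCreate()

// Connect: a thin gRPC client — plans go to the server as Protobuf,
// results come back as Arrow batches.
val connect = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

// The user-facing API is the same from here on:
classic.sql("SELECT 1").show()
connect.sql("SELECT 1").show()
```

The symmetry of the last two lines is the point of the abstract class: code written against `SparkSession` need not know which backend it is running on.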
Build System and Developer Navigation
Spark supports two build systems, each with a distinct role:
```mermaid
flowchart LR
    subgraph Maven["Maven (CI / Releases)"]
        MPOM[pom.xml] --> MBUILD[mvn package -DskipTests]
        MBUILD --> MJARS[assembly/target/jars/]
    end
    subgraph SBT["SBT (Development)"]
        SBUILD[project/SparkBuild.scala] --> SCOMPILE["build/sbt compile"]
        SCOMPILE --> SINCR[Incremental compilation]
    end
```
Maven is the canonical build, used in CI and for official releases. The root pom.xml defines all module dependencies and plugin configurations. Use it when you need a full, reproducible build.
SBT is preferred for development. As the AGENTS.md guide states: "Prefer SBT over Maven for faster incremental compilation." The SBT build is defined in project/SparkBuild.scala, which maps Maven module names to SBT project names. For example, sql/core becomes the SBT project sql, and sql/catalyst becomes catalyst.
Key commands for navigating the codebase:
| Task | Command |
|---|---|
| Compile a single module | build/sbt catalyst/compile |
| Run tests by wildcard | build/sbt 'catalyst/testOnly *AnalyzerSuite' |
| Run a specific test | build/sbt 'catalyst/testOnly *AnalyzerSuite -- -z "test name"' |
| Interactive SBT shell | build/sbt then project catalyst |
Tip: When exploring Spark's code, start from the entry points and trace inward. `SparkSession` is the best starting point for SQL-related code, while `SparkContext` is the starting point for core engine internals. Don't try to read the entire codebase — follow the execution path for the feature you care about.
What's Next
With this map in hand, you know the territory. In the next article, we'll zoom into what happens when your application actually starts: the carefully orchestrated initialization sequence inside SparkContext, the SparkEnv runtime container, and the two-level scheduling stack that turns your code into distributed tasks running across a cluster.