Read OSS

Navigating Apache Spark's Codebase: Architecture and Module Map

Intermediate

Prerequisites

  • Basic understanding of distributed systems concepts
  • Familiarity with Scala syntax (case classes, traits, pattern matching)
  • General knowledge of build tools (Maven or SBT)

Apache Spark is one of the largest and most influential open-source data processing engines in existence. Its codebase spans roughly 40 Maven modules, supports four programming languages, and orchestrates computation across thousands of machines. For a newcomer, cracking open this repository can feel overwhelming — where do you even start?

This article provides you with a mental map. We'll survey the monorepo structure, trace the dependency layers from low-level utilities up through the SQL optimizer, follow the path from spark-submit to your running application, and understand the major architectural split between Classic and Connect execution modes introduced in Spark 4.x. By the end, you'll know which directory to look in for any given concern, and how the pieces fit together.

Monorepo Structure and Module Organization

Spark's repository is a classic monorepo: everything lives in one Git repository, built by a single root POM. The root pom.xml enumerates every module:

graph TD
    subgraph "Common Utilities"
        A[common/sketch]
        B[common/kvstore]
        C[common/network-common]
        D[common/network-shuffle]
        E[common/unsafe]
        F[common/utils]
        G[common/variant]
    end

    subgraph "Core Engine"
        H[core]
    end

    subgraph "SQL Stack"
        I[sql/api]
        J[sql/catalyst]
        K[sql/core]
        L[sql/hive]
        M[sql/pipelines]
    end

    subgraph "Connect"
        N[sql/connect/common]
        O[sql/connect/server]
        P[sql/connect/client/jvm]
    end

    subgraph "Connectors"
        Q[connector/kafka-0-10-sql]
        R[connector/avro]
        S[connector/protobuf]
    end

    subgraph "Resource Managers"
        T[resource-managers/kubernetes]
        U[resource-managers/yarn]
    end

    C --> H
    E --> H
    H --> J
    I --> K
    J --> K
    K --> O
    K --> L
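
In the POM itself, that enumeration is just a flat list of `<module>` entries. An abridged sketch (only a handful of the ~40 modules shown, in illustrative order):

```xml
<!-- Abridged sketch of the <modules> section of Spark's root pom.xml;
     the real file lists all ~40 modules. -->
<modules>
  <module>common/sketch</module>
  <module>common/utils</module>
  <module>common/unsafe</module>
  <module>core</module>
  <module>sql/api</module>
  <module>sql/catalyst</module>
  <module>sql/core</module>
  <module>sql/hive</module>
  <module>resource-managers/yarn</module>
  <module>resource-managers/kubernetes</module>
</modules>
```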

The modules fall into six logical groups:

| Group | Directory | Purpose |
|---|---|---|
| Common | common/* | Low-level utilities: networking, memory management, sketches, key-value store |
| Core | core/ | RDD abstraction, scheduling (DAGScheduler, TaskScheduler), storage (BlockManager), RPC |
| SQL | sql/api, sql/catalyst, sql/core, sql/hive | The DataFrame/Dataset API, Catalyst optimizer, query execution engine |
| Connect | sql/connect/* | Client-server architecture: Protobuf protocol, gRPC server, JVM/JDBC clients |
| Connectors | connector/* | Data source integrations: Kafka, Avro, Protobuf |
| Resource Managers | resource-managers/* | Cluster manager plugins: Kubernetes, YARN |

There are also language bindings (python/, R/), legacy modules (streaming/, graphx/, mllib/), and build infrastructure (assembly/, launcher/, examples/, repl/).

Tip: The common/unsafe module is Spark's off-heap memory layer. It provides Platform for raw memory access and UnsafeRow for the binary row format used throughout the SQL engine — understanding it is key to performance work.
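
The core idea Platform wraps is raw off-heap access via sun.misc.Unsafe. A minimal stand-alone sketch of that idea (this is not Spark's actual code — Platform adds alignment and bounds helpers on top):

```scala
import java.lang.reflect.Field

// Toy sketch of what common/unsafe's Platform is built on: reflectively
// grabbing the JVM's sun.misc.Unsafe and using it for off-heap memory.
object OffHeapDemo {
  private val unsafe: sun.misc.Unsafe = {
    val f: Field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    f.setAccessible(true)                 // Unsafe is not public API
    f.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Write a long to freshly allocated off-heap memory and read it back.
  def roundTrip(v: Long): Long = {
    val addr = unsafe.allocateMemory(8)   // 8 bytes outside the GC heap
    try {
      unsafe.putLong(addr, v)
      unsafe.getLong(addr)
    } finally unsafe.freeMemory(addr)     // manual lifetime management
  }

  def main(args: Array[String]): Unit =
    println(roundTrip(42L))               // prints 42
}
```

Memory allocated this way is invisible to the garbage collector, which is exactly why UnsafeRow can pack rows into a compact binary format without GC pressure — and why every allocateMemory needs a matching freeMemory.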

The Three-Layer Architecture

Spark's codebase follows a strict layered dependency model. Think of it as three concentric rings:

flowchart TB
    subgraph Layer1["Layer 1: Core Engine"]
        direction LR
        RDD["RDD Abstraction"]
        SCHED["DAGScheduler + TaskScheduler"]
        STORE["BlockManager + Shuffle"]
        RPC["RpcEnv"]
    end

    subgraph Layer2["Layer 2: SQL/Catalyst"]
        direction LR
        PARSE["Parser (ANTLR4)"]
        ANALYZE["Analyzer"]
        OPT["Optimizer"]
        PLAN["Physical Planner"]
    end

    subgraph Layer3["Layer 3: User-Facing API"]
        direction LR
        SESSION["SparkSession"]
        DF["DataFrame / Dataset"]
        CONNECT["Spark Connect (gRPC)"]
    end

    Layer3 --> Layer2
    Layer2 --> Layer1

Layer 1 — Core Engine (core/) provides the distributed compute primitives. The RDD (Resilient Distributed Dataset) is the fundamental abstraction: an immutable, partitioned collection with lineage-based fault tolerance. The DAGScheduler breaks computation into stages, the TaskScheduler ships tasks to executors, and BlockManager handles storage.

Layer 2 — SQL/Catalyst (sql/catalyst/, sql/core/) is where Spark SQL lives. The Catalyst optimizer transforms SQL or DataFrame operations through a pipeline: parsing → analysis → optimization → physical planning. The output is a physical plan (SparkPlan) whose execute() method produces an RDD[InternalRow] — bridging back down to Layer 1.

Layer 3 — User-Facing APIs (sql/api/, sql/connect/) provides the entry points developers actually use: SparkSession, DataFrame, Dataset. In Spark 4.x, this layer splits into two implementations: Classic (in-process JVM) and Connect (gRPC client-server).

This layering is enforced by Maven module dependencies. sql/catalyst depends on core, but core has zero knowledge of SQL. This clean separation is what allows the RDD engine to remain general-purpose while the SQL layer provides higher-level abstractions.
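
Catalyst's central trick — plans and expressions are immutable trees, and optimizer rules are pattern-matching rewrites over them — can be illustrated with a self-contained toy. The names here (Expr, Lit, Attr, constantFold) are illustrative, not Spark's real classes; the genuine versions live under sql/catalyst in o.a.s.sql.catalyst.expressions:

```scala
// A toy model of a Catalyst-style rule: fold constant additions bottom-up.
sealed trait Expr
case class Lit(value: Int)              extends Expr
case class Attr(name: String)           extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object MiniOptimizer {
  // Recurse into children first, then fold Add(Lit, Lit) into one literal.
  def constantFold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (constantFold(l), constantFold(r)) match {
        case (Lit(a), Lit(b)) => Lit(a + b)   // both sides constant: fold
        case (fl, fr)         => Add(fl, fr)  // otherwise rebuild the node
      }
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // col + (1 + 2)  ==>  col + 3
    val plan = Add(Attr("col"), Add(Lit(1), Lit(2)))
    println(constantFold(plan))   // Add(Attr(col),Lit(3))
  }
}
```

Spark's real optimizer is a batch of many such rules applied repeatedly until the tree stops changing (fixed point), but each individual rule has essentially this shape.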

CLI Entry Points: From Shell to JVM

When you run spark-submit my-app.jar, what actually happens? Let's trace the path.

sequenceDiagram
    participant User
    participant spark_submit as bin/spark-submit
    participant spark_class as bin/spark-class
    participant Launcher as o.a.s.launcher.Main
    participant SparkSubmit as SparkSubmit.doSubmit()
    participant App as User's main()

    User->>spark_submit: spark-submit --class MyApp app.jar
    spark_submit->>spark_class: exec spark-class o.a.s.deploy.SparkSubmit "$@"
    spark_class->>Launcher: java -cp ... o.a.s.launcher.Main SparkSubmit ...
    Launcher-->>spark_class: Returns JVM command with full classpath
    spark_class->>SparkSubmit: exec java -cp <full-classpath> SparkSubmit args...
    SparkSubmit->>SparkSubmit: prepareSubmitEnvironment()
    SparkSubmit->>App: Reflectively invoke main()

Step 1: bin/spark-submit is a thin shell script — just 8 lines of real code. It locates SPARK_HOME, disables Python hash randomization, and delegates to bin/spark-class with the class name org.apache.spark.deploy.SparkSubmit.

Step 2: bin/spark-class does the heavy lifting on the shell side. It finds Java, assembles the initial classpath from $SPARK_HOME/jars/*, and invokes org.apache.spark.launcher.Main — a small Java program that outputs the final java command with the correct classpath, memory settings, and JVM options. The output is parsed and exec'd.

Step 3: SparkSubmit.doSubmit() parses arguments and dispatches based on the action — SUBMIT, KILL, REQUEST_STATUS, or PRINT_VERSION. For SUBMIT, it calls prepareSubmitEnvironment() which is the real workhorse: it determines the cluster manager (YARN, Kubernetes, Standalone, or Local), resolves the deploy mode (client or cluster), downloads dependency JARs, and finally reflectively invokes the user's main class.
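
The hand-off in step 2 — the launcher printing the final command, the shell parsing and exec'ing it — can be sketched stand-alone. Here the launcher output is faked with printf so the parsing loop runs by itself; the classpath and class names are illustrative, not the real launcher's output:

```shell
#!/usr/bin/env bash
# Sketch of how bin/spark-class consumes the launcher's output: the real
# script runs `java ... org.apache.spark.launcher.Main "$@"`, which emits
# the final command as NUL-separated tokens.
fake_launcher_output() {
  printf '%s\0' java -cp "/opt/spark/jars/*" \
    org.apache.spark.deploy.SparkSubmit --class MyApp app.jar
}

CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")                      # one token per NUL-delimited read
done < <(fake_launcher_output)

# The real script would now run: exec "${CMD[@]}"
printf 'token: %s\n' "${CMD[@]}"
```

NUL-separated output is what makes this robust: classpaths and JVM options can contain spaces, quotes, and globs, and none of them need shell escaping on the way through.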

Tip: When debugging submission issues, add --verbose to your spark-submit command. This sets appArgs.verbose, and doSubmit() prints the resolved classpath, deploy mode, and all configuration before launching your application.

SparkSession: The User-Facing Entry Point

If you've written any Spark SQL code, you've used SparkSession. In Spark 4.x, this is an abstract class defined in sql/api:

sql/api/src/main/scala/org/apache/spark/sql/SparkSession.scala#L63-L72

classDiagram
    class SparkSession {
        <<abstract>>
        +sparkContext: SparkContext*
        +version: String
        +sql(query: String): DataFrame
        +read: DataFrameReader
        +createDataFrame(): DataFrame
    }

    class ClassicSparkSession {
        +sparkContext: SparkContext
        -sharedState: SharedState
        -sessionState: SessionState
        +extensions: SparkSessionExtensions
    }

    class ConnectSparkSession {
        -client: SparkConnectClient
        -channel: ManagedChannel
    }

    SparkSession <|-- ClassicSparkSession
    SparkSession <|-- ConnectSparkSession

The abstract SparkSession defines the user-facing API — sql(), read, createDataFrame(), and so on. It has two concrete implementations:

Classic (sql/core/src/main/scala/org/apache/spark/sql/classic/SparkSession.scala) runs everything in-process. It wraps a SparkContext, holds a SessionState (which contains the parser, analyzer, optimizer, and planner), and executes queries directly in the driver JVM. This is the traditional Spark execution model.

Connect runs as a thin gRPC client. DataFrame operations are serialized as Protobuf messages and sent to a remote Spark Connect server for execution. The client never sees an RDD or SparkContext — it just receives Arrow-formatted results back over the wire.

This split is a major architectural decision in Spark 4.x. The motivation: Classic's in-process model ties the user's application to the Spark runtime. If you upgrade Spark, you must recompile your application. With Connect, the client is decoupled — you can upgrade the server independently, and lightweight clients in Python, Go, or Rust can talk to Spark without embedding the JVM.
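
The shape of that decoupling can be illustrated with a toy (this is not Spark's actual protocol — Spark encodes plans as Protobuf messages over gRPC; plain case classes stand in here): the client only composes a description of the computation, and only the "server" side ever evaluates it.

```scala
// Toy illustration of Connect's client/server split.
sealed trait Plan
case class MakeRange(n: Int)             extends Plan
case class Filter(child: Plan, min: Int) extends Plan

object ToyConnect {
  // "Server side": the single place a plan is actually executed.
  def execute(plan: Plan): Seq[Int] = plan match {
    case MakeRange(n)   => 0 until n
    case Filter(c, min) => execute(c).filter(_ >= min)
  }

  def main(args: Array[String]): Unit = {
    // "Client side": build messages only — no SparkContext, no RDDs.
    val plan = Filter(MakeRange(10), min = 7)
    println(execute(plan).toList)   // List(7, 8, 9)
  }
}
```

Because the client holds nothing but message-building code, it can be reimplemented in any language with a Protobuf/gRPC stack — which is exactly how the thin Python, Go, and Rust clients work.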

Note the @ClassicOnly annotations on methods like sparkContext and sharedState — these are only available in Classic mode. Code that uses them won't work with Connect.

Build System and Developer Navigation

Spark supports two build systems, each with a distinct role:

flowchart LR
    subgraph Maven["Maven (CI / Releases)"]
        MPOM[pom.xml] --> MBUILD[mvn package -DskipTests]
        MBUILD --> MJARS[assembly/target/jars/]
    end

    subgraph SBT["SBT (Development)"]
        SBUILD[project/SparkBuild.scala] --> SCOMPILE["build/sbt compile"]
        SCOMPILE --> SINCR[Incremental compilation]
    end

Maven is the canonical build, used in CI and for official releases. The root pom.xml defines all module dependencies and plugin configurations. Use it when you need a full, reproducible build.

SBT is preferred for development. As the AGENTS.md guide states: "Prefer SBT over Maven for faster incremental compilation." The SBT build is defined in project/SparkBuild.scala, which maps Maven module names to SBT project names. For example, sql/core becomes the SBT project sql, and sql/catalyst becomes catalyst.

Key commands for navigating the codebase:

| Task | Command |
|---|---|
| Compile a single module | build/sbt catalyst/compile |
| Run tests by wildcard | build/sbt 'catalyst/testOnly *AnalyzerSuite' |
| Run a specific test | build/sbt 'catalyst/testOnly *AnalyzerSuite -- -z "test name"' |
| Interactive SBT shell | build/sbt, then project catalyst |

Tip: When exploring Spark's code, start from the entry points and trace inward. SparkSession is the best starting point for SQL-related code, while SparkContext is the starting point for core engine internals. Don't try to read the entire codebase — follow the execution path for the feature you care about.

What's Next

With this map in hand, you know the territory. In the next article, we'll zoom into what happens when your application actually starts: the carefully orchestrated initialization sequence inside SparkContext, the SparkEnv runtime container, and the two-level scheduling stack that turns your code into distributed tasks running across a cluster.