The Scheduler Class Hierarchy: How Linux Decides What Runs Next
Prerequisites
- Article 1: architecture-and-directory-map
- Article 2: boot-sequence-and-initialization
- Understanding of C function pointers and struct-based polymorphism
- Basic concurrency concepts (spinlocks, per-CPU data)
The Linux scheduler must answer a deceptively simple question millions of times per second across a busy machine: which task should run next? The answer involves a polymorphic class hierarchy implemented via C struct function pointers, a priority ordering trick using linker sections, per-CPU run queues designed with cache-line precision, and a recently added framework for writing scheduling policies in BPF. This article dissects all of it.
As we saw in Article 2, sched_init() is called early in start_kernel() to bring the scheduler online. Here we explore what that initialization creates, and how the scheduling machinery operates at runtime.
struct task_struct: The Process Descriptor
Every process and thread in the system is represented by a struct task_struct. At over 800 lines of conditionally-compiled fields, it's one of the largest structures in the kernel.
include/linux/sched.h#L820-L919
The scheduling-relevant portion starts right after thread_info:
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
        struct thread_info              thread_info;
#endif
        unsigned int                    __state;
        unsigned int                    saved_state;

        randomized_struct_fields_start

        void                            *stack;
        ...
        int                             on_rq;

        int                             prio;
        int                             static_prio;
        int                             normal_prio;
        unsigned int                    rt_priority;

        struct sched_entity             se;
        struct sched_rt_entity          rt;
        struct sched_dl_entity          dl;
#ifdef CONFIG_SCHED_CLASS_EXT
        struct sched_ext_entity         scx;
#endif
        const struct sched_class        *sched_class;
        ...
};
classDiagram
class task_struct {
+unsigned int __state
+int prio
+int static_prio
+int on_rq
+sched_entity se
+sched_rt_entity rt
+sched_dl_entity dl
+sched_ext_entity scx
+sched_class* sched_class
}
class sched_entity {
+u64 vruntime
+u64 deadline
+struct load_weight load
}
class sched_rt_entity {
+struct list_head run_list
+unsigned int time_slice
}
class sched_dl_entity {
+u64 dl_deadline
+u64 dl_period
+u64 dl_runtime
}
task_struct --> sched_entity
task_struct --> sched_rt_entity
task_struct --> sched_dl_entity
Several things stand out. First, randomized_struct_fields_start — this is a GCC plugin that randomizes the layout of subsequent fields to make exploits harder. Second, every task carries scheduling entities for multiple scheduling classes simultaneously (se for CFS, rt for real-time, dl for deadline, scx for sched_ext), but only one is active as indicated by the sched_class pointer. Third, the scx field is gated on CONFIG_SCHED_CLASS_EXT.
Tip: When reading task_struct, don't try to understand every field. Focus on the fields relevant to the subsystem you're studying — scheduling, memory, signals, or credentials.
struct sched_class: The Scheduling Vtable
The scheduler's extensibility comes from struct sched_class — a table of function pointers that each scheduling policy implements. This is the "C vtable" pattern that permeates the kernel.
kernel/sched/sched.h#L2497-L2658
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
        int uclamp_enabled;
#endif

        void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
        bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task)(struct rq *rq);

        void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

        struct task_struct *(*pick_task)(struct rq *rq);

        void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
        void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

        int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

        void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

        void (*update_curr)(struct rq *rq);
        ...
};
The comments in the source document the locking context for each operation (e.g., whether rq->lock or task_rq_lock must be held). This is essential information — calling these hooks in the wrong context causes deadlocks or data corruption.
The core operations are:
- enqueue_task/dequeue_task — Add/remove a task from the run queue
- pick_task — Select the next task to run from this class's queue
- wakeup_preempt — Decide if a waking task should preempt the current one
- put_prev_task/set_next_task — Manage the context-switch bookkeeping
- task_tick — Called on every timer tick for accounting
The Class Hierarchy and Linker Section Trick
Linux has six scheduling classes, and their priority ordering is critical: a real-time task must always preempt a CFS task, and a deadline task must preempt a real-time task. This ordering is achieved through a clever linker section trick.
kernel/sched/sched.h#L2706-L2709
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
__aligned(__alignof__(struct sched_class)) \
__section("__" #name "_sched_class")
Each scheduling class uses DEFINE_SCHED_CLASS(name) to place its sched_class instance into a named linker section. The linker script then orders these sections:
include/asm-generic/vmlinux.lds.h#L153-L162
__sched_class_highest = .;
*(__stop_sched_class)
*(__dl_sched_class)
*(__rt_sched_class)
*(__fair_sched_class)
*(__ext_sched_class)
*(__idle_sched_class)
__sched_class_lowest = .;
The comment at line 2702 warns: "CAREFUL they are laid out in REVERSE order!!!" This means __sched_class_highest points to stop_sched_class (the highest priority), and iteration goes from highest to lowest by incrementing the pointer.
flowchart LR
stop["stop<br/>(highest)"] --> dl["deadline"]
dl --> rt["real-time"]
rt --> fair["CFS/fair"]
fair --> ext["sched_ext"]
ext --> idle["idle<br/>(lowest)"]
The iteration macro uses this layout:
kernel/sched/sched.h#L2746-L2747
#define for_each_active_class(class) \
for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
The word "active" is key. The next_active_class() function dynamically skips classes:
kernel/sched/sched.h#L2725-L2735
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
        class++;
#ifdef CONFIG_SCHED_CLASS_EXT
        if (scx_switched_all() && class == &fair_sched_class)
                class++;
        if (!scx_enabled() && class == &ext_sched_class)
                class++;
#endif
        return class;
}
When scx_switched_all() is true, CFS is entirely bypassed — sched_ext has taken over all fair scheduling. When sched_ext is disabled, the ext class is skipped. This dynamic class skipping is what makes sched_ext's runtime hot-swapping possible.
The __schedule() Context Switch Path
The core scheduling function is __schedule():
static void __sched notrace __schedule(int sched_mode)
{
        struct task_struct *prev, *next;
        bool preempt = sched_mode > SM_NONE;
        struct rq *rq;
        int cpu;
        ...
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
        prev = rq->curr;
        ...
The flow is:
flowchart TD
A["__schedule(sched_mode)"] --> B["Get current CPU's rq"]
B --> C["Lock rq->__lock"]
C --> D["Handle prev task state<br/>(sleeping? yielding?)"]
D --> E["pick_next_task()<br/>iterates active classes"]
E --> F{"next != prev?"}
F -->|Yes| G["context_switch(rq, prev, next)"]
F -->|No| H["Unlock and return"]
G --> I["Switch stack, registers, address space"]
The pick_next_task() function iterates the active scheduler classes from highest to lowest priority. The first class that has a runnable task wins. This means a single runnable deadline task will always be picked before any CFS tasks, regardless of how many CFS tasks are waiting.
The context_switch() at the end performs the actual CPU context switch: saving and restoring registers, switching the memory mapping (if processes have different address spaces), and updating the rq->curr pointer.
CFS and sched_ext
CFS: Completely Fair Scheduler
CFS is the default policy for normal (SCHED_OTHER) tasks. Its core idea is virtual runtime (vruntime): each task accumulates virtual time proportional to its actual CPU time divided by its weight. The task with the smallest vruntime is always chosen next, ensuring proportional fairness.
The CFS operations table, defined with DEFINE_SCHED_CLASS(fair), is one of the largest in the kernel:
kernel/sched/fair.c#L13955-L14001
DEFINE_SCHED_CLASS(fair) = {

        .enqueue_task           = enqueue_task_fair,
        .dequeue_task           = dequeue_task_fair,
        .yield_task             = yield_task_fair,

        .wakeup_preempt         = check_preempt_wakeup_fair,

        .pick_task              = pick_task_fair,
        .pick_next_task         = __pick_next_task_fair,
        .put_prev_task          = put_prev_task_fair,
        .set_next_task          = set_next_task_fair,

        .select_task_rq         = select_task_rq_fair,
        .task_tick              = task_tick_fair,
        .update_curr            = update_curr_fair,
        ...
};
sched_ext: BPF-Extensible Scheduling
sched_ext is a revolutionary addition: it allows scheduling policies to be written as BPF programs and loaded at runtime. When active, it can replace CFS entirely (hence scx_switched_all() in the class iteration) or run alongside it.
The scheduling class hierarchy lets sched_ext slot in at a specific priority level between CFS and idle. BPF programs implement the pick_task, enqueue_task, and other operations, allowing researchers and operators to experiment with custom scheduling policies without recompiling the kernel.
Tip: Check out the tools/sched_ext/ directory for example BPF schedulers, including scx_simple and scx_layered. They demonstrate how to write scheduling policies that can be loaded with a single command.
Per-CPU Run Queues and Cache-Line Layout
Every CPU has its own struct rq, and the layout of this structure is meticulously designed for cache performance:
kernel/sched/sched.h#L1121-L1170
struct rq {
        /* Hot read-mostly data loaded in update_sg_lb_stats() */
        unsigned int            nr_running;
        ...
        unsigned long           cpu_capacity;

        struct task_struct __rcu *curr;
        struct task_struct      *idle;

        /* Next cacheline: the hot lock */
        u64                     nr_switches ____cacheline_aligned;
        raw_spinlock_t          __lock;
        ...
#ifdef CONFIG_UCLAMP_TASK
        struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
#endif
        struct cfs_rq           cfs;
        struct rt_rq            rt;
        struct dl_rq            dl;
#ifdef CONFIG_SCHED_CLASS_EXT
        struct scx_rq           scx;
#endif
        ...
};
flowchart LR
subgraph CL1["Cache Line 1 (read-mostly)"]
nr["nr_running"]
cap["cpu_capacity"]
curr["curr / donor"]
idle["idle"]
end
subgraph CL2["Cache Line 2 (lock + hot writes)"]
sw["nr_switches"]
lock["__lock (spinlock)"]
nohz["nohz fields"]
end
subgraph CL3["Cache Line 3+"]
cfs["cfs_rq"]
rt["rt_rq"]
dl["dl_rq"]
end
CL1 -.-> CL2 -.-> CL3
The comment at line 1123 explains the design: fields like nr_running and cpu_capacity are "loaded together, without holding the rq->lock, in an extremely hot loop." They're placed on a dedicated cache line separate from the lock to avoid false sharing. The lock itself is on the next cache line, separated by ____cacheline_aligned.
This is not premature optimization — the scheduler runs on every CPU, on every context switch, on every timer tick. Reducing a single cache miss here can translate to measurable throughput gains across the system.
What's Next
We now understand how the scheduler decides what to run. In the next article, we'll trace the path in the opposite direction — from userspace into the kernel — following a system call from the x86-64 SYSCALL instruction through assembly entry, C dispatch, and into the VFS layer where the kernel's most important abstraction pattern operates at scale.