The Scheduler Class Hierarchy: How Linux Decides What Runs Next
Prerequisites
- Article 1: architecture-and-directory-map
- Article 2: boot-sequence-and-initialization
- Understanding of C function pointers and struct-based polymorphism
- Basic concurrency concepts (spinlocks, per-CPU data)
The Linux scheduler must answer a deceptively simple question millions of times per second across a busy machine: which task should run next? The answer involves a polymorphic class hierarchy implemented via C struct function pointers, a priority ordering trick using linker sections, per-CPU run queues designed with cache-line precision, and a recently added framework for writing scheduling policies in BPF. This article dissects all of it.
As we saw in Article 2, sched_init() is called early in start_kernel() to bring the scheduler online. Here we explore what that initialization creates, and how the scheduling machinery operates at runtime.
struct task_struct: The Process Descriptor
Every process and thread in the system is represented by a struct task_struct. At over 800 lines of conditionally-compiled fields, it's one of the largest structures in the kernel.
include/linux/sched.h#L820-L919
The scheduling-relevant portion starts right after thread_info:
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
        struct thread_info              thread_info;
#endif
        unsigned int                    __state;
        unsigned int                    saved_state;

        randomized_struct_fields_start

        void                            *stack;
        ...
        int                             on_rq;

        int                             prio;
        int                             static_prio;
        int                             normal_prio;
        unsigned int                    rt_priority;

        struct sched_entity             se;
        struct sched_rt_entity          rt;
        struct sched_dl_entity          dl;
#ifdef CONFIG_SCHED_CLASS_EXT
        struct sched_ext_entity         scx;
#endif
        const struct sched_class        *sched_class;
        ...
};
classDiagram
class task_struct {
+unsigned int __state
+int prio
+int static_prio
+int on_rq
+sched_entity se
+sched_rt_entity rt
+sched_dl_entity dl
+sched_ext_entity scx
+sched_class* sched_class
}
class sched_entity {
+u64 vruntime
+u64 deadline
+struct load_weight load
}
class sched_rt_entity {
+struct list_head run_list
+unsigned int time_slice
}
class sched_dl_entity {
+u64 dl_deadline
+u64 dl_period
+u64 dl_runtime
}
task_struct --> sched_entity
task_struct --> sched_rt_entity
task_struct --> sched_dl_entity
Several things stand out. First, randomized_struct_fields_start — this is a GCC plugin that randomizes the layout of subsequent fields to make exploits harder. Second, every task carries scheduling entities for multiple scheduling classes simultaneously (se for CFS, rt for real-time, dl for deadline, scx for sched_ext), but only one is active as indicated by the sched_class pointer. Third, the scx field is gated on CONFIG_SCHED_CLASS_EXT.
Tip: When reading task_struct, don't try to understand every field. Focus on the fields relevant to the subsystem you're studying — scheduling, memory, signals, or credentials.
struct sched_class: The Scheduling Vtable
The scheduler's extensibility comes from struct sched_class — a table of function pointers that each scheduling policy implements. This is the "C vtable" pattern that permeates the kernel.
kernel/sched/sched.h#L2497-L2658
struct sched_class {

#ifdef CONFIG_UCLAMP_TASK
        int uclamp_enabled;
#endif

        void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
        bool (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
        void (*yield_task)(struct rq *rq);

        void (*wakeup_preempt)(struct rq *rq, struct task_struct *p, int flags);

        struct task_struct *(*pick_task)(struct rq *rq);

        void (*put_prev_task)(struct rq *rq, struct task_struct *p, struct task_struct *next);
        void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

        int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);

        void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

        void (*update_curr)(struct rq *rq);
        ...
};
The comments in the source document the locking context for each operation (e.g., whether rq->lock or task_rq_lock must be held). This is essential information — calling these hooks in the wrong context causes deadlocks or data corruption.
The core operations are:
- enqueue_task/dequeue_task — Add/remove a task from the run queue
- pick_task — Select the next task to run from this class's queue
- wakeup_preempt — Decide if a waking task should preempt the current one
- put_prev_task/set_next_task — Manage the context-switch bookkeeping
- task_tick — Called on every timer tick for accounting
The Class Hierarchy and Linker Section Trick
Linux has six scheduling classes, and their priority ordering is critical: a real-time task must always preempt a CFS task, and a deadline task must preempt a real-time task. This ordering is achieved through a clever linker section trick.
kernel/sched/sched.h#L2706-L2709
#define DEFINE_SCHED_CLASS(name) \
const struct sched_class name##_sched_class \
__aligned(__alignof__(struct sched_class)) \
__section("__" #name "_sched_class")
Each scheduling class uses DEFINE_SCHED_CLASS(name) to place its sched_class instance into a named linker section. The linker script then orders these sections:
include/asm-generic/vmlinux.lds.h#L153-L162
__sched_class_highest = .;
*(__stop_sched_class)
*(__dl_sched_class)
*(__rt_sched_class)
*(__fair_sched_class)
*(__ext_sched_class)
*(__idle_sched_class)
__sched_class_lowest = .;
The comment at line 2702 warns: "CAREFUL they are laid out in REVERSE order!!!" This means __sched_class_highest points to stop_sched_class (the highest priority), and iteration goes from highest to lowest by incrementing the pointer.
flowchart LR
stop["stop<br/>(highest)"] --> dl["deadline"]
dl --> rt["real-time"]
rt --> fair["CFS/fair"]
fair --> ext["sched_ext"]
ext --> idle["idle<br/>(lowest)"]
The iteration macro uses this layout:
kernel/sched/sched.h#L2746-L2747
#define for_each_active_class(class) \
for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
The word "active" is key. The next_active_class() function dynamically skips classes:
kernel/sched/sched.h#L2725-L2735
static inline const struct sched_class *next_active_class(const struct sched_class *class)
{
        class++;
#ifdef CONFIG_SCHED_CLASS_EXT
        if (scx_switched_all() && class == &fair_sched_class)
                class++;
        if (!scx_enabled() && class == &ext_sched_class)
                class++;
#endif
        return class;
}
When scx_switched_all() is true, CFS is entirely bypassed — sched_ext has taken over all fair scheduling. When sched_ext is disabled, the ext class is skipped. This dynamic class skipping is what makes sched_ext's runtime hot-swapping possible.
The __schedule() Context Switch Path
The core scheduling function is __schedule():
static void __sched notrace __schedule(int sched_mode)
{
        struct task_struct *prev, *next;
        bool preempt = sched_mode > SM_NONE;
        struct rq *rq;
        int cpu;
        ...
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
        prev = rq->curr;
        ...
The flow is:
flowchart TD
A["__schedule(sched_mode)"] --> B["Get current CPU's rq"]
B --> C["Lock rq->__lock"]
C --> D["Handle prev task state<br/>(sleeping? yielding?)"]
D --> E["pick_next_task()<br/>iterates active classes"]
E --> F{"next != prev?"}
F -->|Yes| G["context_switch(rq, prev, next)"]
F -->|No| H["Unlock and return"]
G --> I["Switch stack, registers, address space"]
The pick_next_task() function iterates the active scheduler classes from highest to lowest priority. The first class that has a runnable task wins. This means a single runnable deadline task will always be picked before any CFS tasks, regardless of how many CFS tasks are waiting.
The context_switch() at the end performs the actual CPU context switch: saving and restoring registers, switching the memory mapping (if processes have different address spaces), and updating the rq->curr pointer.
CFS and sched_ext
CFS: Completely Fair Scheduler
CFS is the default policy for normal (SCHED_OTHER) tasks. Its core idea is virtual runtime (vruntime): each task accumulates virtual time proportional to its actual CPU time divided by its weight. The task with the smallest vruntime is always chosen next, ensuring proportional fairness.
The CFS operations table, defined with DEFINE_SCHED_CLASS(fair), is one of the largest in the kernel:
kernel/sched/fair.c#L13955-L14001
DEFINE_SCHED_CLASS(fair) = {

        .enqueue_task           = enqueue_task_fair,
        .dequeue_task           = dequeue_task_fair,
        .yield_task             = yield_task_fair,

        .wakeup_preempt         = check_preempt_wakeup_fair,

        .pick_task              = pick_task_fair,
        .pick_next_task         = __pick_next_task_fair,
        .put_prev_task          = put_prev_task_fair,
        .set_next_task          = set_next_task_fair,

        .select_task_rq         = select_task_rq_fair,
        .task_tick              = task_tick_fair,
        .update_curr            = update_curr_fair,
        ...
};
sched_ext: BPF-Extensible Scheduling
sched_ext is a revolutionary addition: it allows scheduling policies to be written as BPF programs and loaded at runtime. When active, it can replace CFS entirely (hence scx_switched_all() in the class iteration) or run alongside it.
The scheduling class hierarchy lets sched_ext slot in at a specific priority level between CFS and idle. BPF programs implement the pick_task, enqueue_task, and other operations, allowing researchers and operators to experiment with custom scheduling policies without recompiling the kernel.
Tip: Check out the tools/sched_ext/ directory for example BPF schedulers, including scx_simple and scx_layered. They demonstrate how to write scheduling policies that can be loaded with a single command.
Per-CPU Run Queues and Cache-Line Layout
Every CPU has its own struct rq, and the layout of this structure is meticulously designed for cache performance:
kernel/sched/sched.h#L1121-L1170
struct rq {
        /* Hot read-mostly data loaded in update_sg_lb_stats() */
        unsigned int            nr_running;
        ...
        unsigned long           cpu_capacity;

        struct task_struct __rcu *curr;
        struct task_struct      *idle;

        /* Next cacheline: the hot lock */
        u64                     nr_switches ____cacheline_aligned;
        raw_spinlock_t          __lock;
        ...
#ifdef CONFIG_UCLAMP_TASK
        struct uclamp_rq        uclamp[UCLAMP_CNT] ____cacheline_aligned;
#endif
        struct cfs_rq           cfs;
        struct rt_rq            rt;
        struct dl_rq            dl;
#ifdef CONFIG_SCHED_CLASS_EXT
        struct scx_rq           scx;
#endif
        ...
};
flowchart LR
subgraph CL1["Cache Line 1 (read-mostly)"]
nr["nr_running"]
cap["cpu_capacity"]
curr["curr / donor"]
idle["idle"]
end
subgraph CL2["Cache Line 2 (lock + hot writes)"]
sw["nr_switches"]
lock["__lock (spinlock)"]
nohz["nohz fields"]
end
subgraph CL3["Cache Line 3+"]
cfs["cfs_rq"]
rt["rt_rq"]
dl["dl_rq"]
end
CL1 -.-> CL2 -.-> CL3
The comment at line 1123 explains the design: fields like nr_running and cpu_capacity are "loaded together, without holding the rq->lock, in an extremely hot loop." They're placed on a dedicated cache line separate from the lock to avoid false sharing. The lock itself is on the next cache line, separated by ____cacheline_aligned.
This is not premature optimization — the scheduler runs on every CPU, on every context switch, on every timer tick. Reducing a single cache miss here can translate to measurable throughput gains across the system.
What's Next
We now understand how the scheduler decides what to run. In the next article, we'll trace the path in the opposite direction — from userspace into the kernel — following a system call from the x86-64 SYSCALL instruction through assembly entry, C dispatch, and into the VFS layer where the kernel's most important abstraction pattern operates at scale.