From Power-On to PID 1: How the Linux Kernel Boots

Intermediate

Prerequisites

  • Article 1: architecture-and-directory-map
  • Basic understanding of C function pointers and linker sections

Every running Linux system was once a single instruction executing on a freshly powered CPU. Between that instruction and the login: prompt lies an intricate initialization sequence — roughly 80 function calls in strict order, a self-registering driver mechanism spanning eight priority levels, and the birth of the two most important processes in the system. Understanding this boot path reveals not just how Linux starts, but why its initialization architecture scales to thousands of drivers without a central registry.

As we saw in Part 1, the kernel source is split into architecture-specific and portable layers. The boot sequence is where these layers meet: architecture-specific assembly bootstraps the hardware, then hands off to portable C code that brings every subsystem online.

Architecture Entry to start_kernel()

When an x86-64 system boots, the bootloader (GRUB, systemd-boot, etc.) loads the compressed kernel image into memory and jumps to its entry point. The kernel decompresses itself, sets up preliminary page tables, and eventually reaches the architecture-specific startup code that prepares the environment for C execution.

The key transition is the jump to start_kernel() — the C-language entry point in init/main.c. Before this call, the assembly code has:

  1. Set up a valid kernel stack
  2. Enabled basic paging
  3. Initialized the BSS section to zero
  4. Set up the GDT (Global Descriptor Table)

flowchart LR
    A["Bootloader<br/>(GRUB)"] --> B["Decompress<br/>kernel"]
    B --> C["Arch-specific<br/>assembly setup"]
    C --> D["start_kernel()<br/>in init/main.c"]
    D --> E["rest_init()<br/>creates PID 1 & 2"]

The start_kernel() function signature itself tells you something important — it's decorated with multiple attributes:

init/main.c#L1007-L1008

asmlinkage __visible __init __no_sanitize_address __noreturn __no_stack_protector
void start_kernel(void)

The __init annotation is crucial — it places this function in the .init.text section, which will be freed after boot completes. The asmlinkage tells the compiler this is called from assembly with arguments on the stack. The __noreturn means exactly what it says: this function never returns.

Walking Through start_kernel()

The body of start_kernel() is a linear sequence of approximately 80 initialization calls. Their order is not arbitrary — each call depends on the subsystems initialized before it.

init/main.c#L1008-L1211

Here's the logical grouping:

sequenceDiagram
    participant SK as start_kernel()
    participant HW as Hardware Setup
    participant MM as Memory Mgmt
    participant SCHED as Scheduler
    participant IRQ as Interrupts/Timers
    participant VFS as VFS & Processes

    SK->>HW: set_task_stack_end_magic()
    SK->>HW: setup_arch(&command_line)
    SK->>MM: mm_core_init()
    SK->>SCHED: sched_init()
    SK->>IRQ: init_IRQ(), timers_init()
    SK->>IRQ: hrtimers_init(), softirq_init()
    SK->>VFS: vfs_caches_init()
    SK->>VFS: fork_init(), signals_init()
    SK->>SK: rest_init()

Phase 1: Early Hardware. setup_arch() is the big architecture-specific call. On x86, it identifies the CPU, parses memory maps from the BIOS/UEFI, and sets up the initial memory layout. Everything before setup_arch() uses minimal, architecture-independent code.

Phase 2: Memory. mm_core_init() brings up the page allocator, slab allocator, and vmalloc. After this, the kernel can kmalloc().

Phase 3: Scheduling. sched_init() initializes the per-CPU run queues and creates the idle task. The scheduler is functional after this, though SMP isn't up yet.

Phase 4: Interrupts and Timers. init_IRQ(), tick_init(), timers_init(), hrtimers_init(), and softirq_init() bring the interrupt and timer subsystems online. After local_irq_enable() at line 1138, interrupts are running.

Phase 5: Core Subsystems. vfs_caches_init() creates the dentry and inode caches. fork_init() sets up the process creation machinery. signals_init(), proc_root_init(), and dozens more.

The very last call is rest_init() — and despite its modest name, it's where the most important thing happens.

Tip: If you're debugging a boot hang, add initcall_debug to the kernel command line. It timestamps every initialization call, making it obvious which one is stuck.

The Initcall Mechanism

Before we see rest_init() create the first processes, we need to understand the mechanism those processes use to initialize drivers: the initcall system.

The problem is this: the kernel has thousands of drivers, filesystems, and subsystems that need initialization. A centralized list would be unmaintainable. Instead, each subsystem declares its own init function and self-registers it into a linker section.

include/linux/init.h#L293-L309

#define pure_initcall(fn)           __define_initcall(fn, 0)
#define core_initcall(fn)           __define_initcall(fn, 1)
#define postcore_initcall(fn)       __define_initcall(fn, 2)
#define arch_initcall(fn)           __define_initcall(fn, 3)
#define subsys_initcall(fn)         __define_initcall(fn, 4)
#define fs_initcall(fn)             __define_initcall(fn, 5)
#define rootfs_initcall(fn)         __define_initcall(fn, rootfs)
#define device_initcall(fn)         __define_initcall(fn, 6)
#define late_initcall(fn)           __define_initcall(fn, 7)

Each level has a _sync variant (e.g., core_initcall_sync) that acts as a barrier — all initcalls at the previous level must complete before the sync call runs.

| Level | Name | Typical Users |
|-------|------|---------------|
| 0 | pure_initcall | Static variable initialization only |
| 1 | core_initcall | Core kernel infrastructure (IRQ, DMA) |
| 2 | postcore_initcall | Bus types (PCI, USB bus registration) |
| 3 | arch_initcall | Architecture-specific setup |
| 4 | subsys_initcall | Subsystem init (networking, block layer) |
| rootfs | rootfs_initcall | Root filesystem setup |
| 6 | device_initcall | Most device drivers (the default module_init) |
| 7 | late_initcall | Anything that depends on everything else |
| 5 | fs_initcall | Filesystem registration |

The __define_initcall macro places a function pointer into a named linker section like .initcall1.init. The linker script orders these sections numerically, creating an array of function pointers sorted by priority — without any driver needing to know about any other driver.

Birth of PID 1 and PID 2

rest_init() creates the kernel's first two processes:

init/main.c#L714-L743

static noinline void __ref __noreturn rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
    ...
    pid = kernel_thread(kthreadd, NULL, NULL, CLONE_FS | CLONE_FILES);
    ...
}

PID 1 (kernel_init) is the ancestor of all userspace processes. PID 2 (kthreadd) is the kernel thread daemon — every kernel thread in the system is ultimately created by kthreadd.

The comment explains a subtle ordering constraint: PID 1 is created first (to get pid 1), but it must not be scheduled until kthreadd (PID 2) exists, because kernel_init will need to create kernel threads as part of its work.

flowchart TD
    rest_init["rest_init()"] --> pid1["PID 1: kernel_init<br/>(user_mode_thread)"]
    rest_init --> pid2["PID 2: kthreadd<br/>(kernel_thread)"]
    pid1 --> kif["kernel_init_freeable()"]
    kif --> dbs["do_basic_setup()"]
    dbs --> di["driver_init()"]
    dbs --> dic["do_initcalls()"]
    dic --> lvl0["Level 0: pure"]
    dic --> lvl1["Level 1: core"]
    dic --> lvl6["..."]
    dic --> lvl7["Level 7: late"]
    kif --> sinit["Search for /sbin/init"]
    pid2 --> kt["Manages all<br/>kernel threads"]

The kernel_init function calls kernel_init_freeable(), which calls do_basic_setup():

init/main.c#L1473-L1480

static void __init do_basic_setup(void)
{
    cpuset_init_smp();
    driver_init();
    init_irq_proc();
    do_ctors();
    do_initcalls();
}

And do_initcalls() iterates all eight priority levels:

init/main.c#L1447-L1464

static void __init do_initcalls(void)
{
    int level;
    ...
    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++) {
        strcpy(command_line, saved_command_line);
        do_initcall_level(level, command_line);
    }
    ...
}

This is the moment when every driver, filesystem, and subsystem that was compiled into the kernel gets initialized — not by explicit registration, but by iterating linker-section-placed function pointers.

The Search for /sbin/init

After all initcalls complete, kernel_init makes the transition from kernel space to userspace by exec()-ing an init program:

init/main.c#L1573-L1646

The search order is telling:

  1. ramdisk_execute_command — typically /init from an initramfs
  2. execute_command — whatever was passed as init= on the kernel command line
  3. CONFIG_DEFAULT_INIT — compile-time default
  4. /sbin/init, /etc/init, /bin/init, /bin/sh — fallback search

If none succeed: panic("No working init found.").

This is the boundary between kernel initialization and userspace. Once kernel_execve() succeeds, PID 1 is a userspace process (systemd, OpenRC, or whatever init system you use). The kernel's job is now to serve system calls, manage memory, and schedule processes.

__init and Memory Reclamation

Everything we've discussed — start_kernel(), the initcall functions, the init search logic — is annotated with __init:

include/linux/init.h#L45-L48

#define __init      __section(".init.text") __cold __latent_entropy __no_kstack_erase
#define __initdata  __section(".init.data")
#define __initconst __section(".init.rodata")

The __init macro places functions in .init.text and __initdata places data in .init.data. After boot, free_initmem() releases these pages back to the allocator. On a typical system, this reclaims several megabytes of RAM that was only needed during boot.

init/main.c#L1568-L1571

void __weak free_initmem(void)
{
    free_initmem_default(POISON_FREE_INITMEM);
}

The kernel prints a message like Freeing unused kernel memory: 2048K during boot — that's free_initmem() at work. The __init pattern is so important that the build system actively checks for "section mismatches" — non-init code referencing init code would be a use-after-free bug.

Tip: If you write a kernel module that has initialization-only code, mark it with __init. But be careful: loadable modules can be loaded and unloaded at any time, so __init in modules is freed after module_init() runs, not after system boot.

What's Next

We've now traced the kernel from assembly entry through the birth of PID 1. In the next article, we'll dive into the scheduler — the subsystem initialized by that sched_init() call we saw in start_kernel(). We'll explore how the sched_class vtable pattern enables six pluggable scheduling policies, how __schedule() picks the next task, and how the novel sched_ext framework lets you write scheduler policies in BPF.