Extensible Scheduler
Linux kernel 6.12 introduced sched_ext (extensible scheduler) as a new scheduling class that allows pluggable CPU schedulers via eBPF.
It enables implementing and dynamically loading thread schedulers without recompiling the kernel or rebooting.
The sched-ext/scx project is a collection of sched_ext schedulers and tools. Schedulers in scx range from simple demonstrative policies to production-oriented ones tailored for specific use cases:
- scx_simple : basic FIFO or least-run-time policy
- scx_nest : places tasks on high-frequency cores
- scx_lavd : optimized for gaming workloads
- scx_rusty : partitions CPUs by last-level cache to improve locality
- scx_bpfland : threads that block frequently (i.e. perform many voluntary context switches per second) are assumed to be interactive, and thus prioritized
Each scheduler in scx implements the required sched_ext hooks via eBPF programs and can be selected at runtime. The default Linux scheduler can always be restored if needed.
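The scx_bpfland heuristic from the list above (threads that block often are treated as interactive) can be pictured with a small user-space model. This is only a sketch and not the scheduler's actual code: the threshold `NVCSW_THRESHOLD`, the function names, and the sample numbers are all assumptions chosen for illustration.

```python
# Toy model of an "interactive if it blocks often" heuristic.
# NVCSW_THRESHOLD and every name here are hypothetical, not taken from scx_bpfland.

NVCSW_THRESHOLD = 10.0  # assumed cutoff: voluntary context switches per second


def voluntary_switch_rate(nvcsw_delta, interval_s):
    """Voluntary context switches per second over a sampling interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return nvcsw_delta / interval_s


def is_interactive(nvcsw_delta, interval_s, threshold=NVCSW_THRESHOLD):
    """Classify a thread as interactive when it blocks at least `threshold` times per second."""
    return voluntary_switch_rate(nvcsw_delta, interval_s) >= threshold


# A thread that blocked 50 times in 2 s (25/s) versus a CPU-bound one (1/s).
editor_thread = is_interactive(nvcsw_delta=50, interval_s=2.0)
batch_thread = is_interactive(nvcsw_delta=2, interval_s=2.0)
```

With the assumed threshold of 10 switches per second, the first thread is classified as interactive and the second is not.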
eBPF
High-level architecture:
flowchart LR
A[START] --> B{user space}
B -- eBPF --> C[eBPF bytecode]
C --> D[sched-ext framework]
D --> E[core scheduler]
E --> F[CPU]
Architecture
- Task wakeup
  - A task becomes runnable, e.g. you press a key, a timer fires, or a blocked thread gets unblocked.
  - Linux must now decide where this task should run.
- ops.select_cpu()
  - Your BPF scheduler is asked which CPU the task should run on.
  - You return a CPU number.
- Direct dispatch?
  - The next question is whether the task should go straight to that CPU.
  - If YES: the scheduler calls scx_bpf_dispatch() from ops.select_cpu().
    - This hands the task directly to that CPU's local DSQ (dispatch queue).
    - ops.enqueue() is skipped: no extra queue, no waiting.
  - If NO: we go to ops.enqueue().
    - It puts the task into some queue.
- Enqueue options (where can the task go?)
  - Three options:
    - The built-in per-CPU local DSQ
    - The built-in global DSQ
    - A queue you define yourself
  - This is where sched_ext becomes powerful.
  - You can build a CFS-like scheduler, an RT-like one, or a completely experimental one.
- CPU becomes ready
  - First check: does the CPU have tasks in its local DSQ?
  - If YES: run from there.
  - If NO: check the global DSQ and run from there.
  - If that is empty too: go to the custom dispatch logic.
- ops.dispatch()
  - If neither built-in queue had a task, the kernel calls your ops.dispatch().
  - Now you decide which task this CPU should run.
  - You have these options:
    - Call scx_bpf_dispatch() to push a task to the CPU.
    - Call scx_bpf_consume() to pull a task from some DSQ.
  - The kernel then loops again: if the local DSQ has a task, it runs.
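The steps above can be traced with a toy user-space model. It is a sketch under stated assumptions, not kernel or BPF code: `ToyScheduler`, its method names, the "dispatch directly only if the CPU is idle" rule, and the "batch" name prefix are all hypothetical. Only the order of the steps follows the description.

```python
from collections import deque

# Toy user-space model of the wakeup and dispatch flow described above.
# ToyScheduler, its method names and the placement policy are hypothetical;
# only the order of the steps mirrors the text.


class ToyScheduler:
    def __init__(self, nr_cpus):
        self.local = [deque() for _ in range(nr_cpus)]  # per-CPU local DSQs
        self.global_dsq = deque()                       # built-in global DSQ
        self.custom_dsq = deque()                       # a user-defined DSQ
        self.idle = [True] * nr_cpus

    # Policy side: stand-ins for ops.select_cpu, ops.enqueue, ops.dispatch.
    def select_cpu(self, task, prev_cpu):
        # Assumed policy: stay on prev_cpu, dispatch directly only if it is idle.
        return prev_cpu, self.idle[prev_cpu]

    def enqueue(self, task):
        # Assumed policy: "batch" tasks wait in the custom DSQ, the rest go global.
        if task.startswith("batch"):
            self.custom_dsq.append(task)
        else:
            self.global_dsq.append(task)

    def dispatch(self, cpu):
        # Reached only when the local and global DSQs are both empty.
        if self.custom_dsq:
            self.local[cpu].append(self.custom_dsq.popleft())

    # Framework side: what happens around the policy callbacks.
    def wakeup(self, task, prev_cpu):
        cpu, direct = self.select_cpu(task, prev_cpu)
        if direct:
            self.local[cpu].append(task)  # direct dispatch, enqueue is skipped
            self.idle[cpu] = False
        else:
            self.enqueue(task)

    def pick_next(self, cpu):
        if not self.local[cpu] and self.global_dsq:
            self.local[cpu].append(self.global_dsq.popleft())
        if not self.local[cpu]:
            self.dispatch(cpu)
        if self.local[cpu]:
            self.idle[cpu] = False
            return self.local[cpu].popleft()
        self.idle[cpu] = True
        return None


sched = ToyScheduler(nr_cpus=1)
sched.wakeup("ui", prev_cpu=0)         # CPU 0 is idle: direct dispatch
sched.wakeup("web", prev_cpu=0)        # CPU 0 is busy: global DSQ
sched.wakeup("batch-job", prev_cpu=0)  # CPU 0 is busy: custom DSQ
order = [sched.pick_next(0) for _ in range(4)]
```

In this run the CPU picks "ui" from its local DSQ, then "web" from the global DSQ, then "batch-job" through the custom dispatch path, and finally goes idle (`None`).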
Why This Design?
Because SCX separates two phases:
- Enqueue phase: when a task becomes runnable
- Dispatch phase: when a CPU needs work
This is powerful because:
- You can control placement (select_cpu)
- You can control queuing
- You can control dispatch
- You can build any scheduling algorithm
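As one illustration of the enqueue/dispatch split, the sketch below models a weighted, virtual-time ordered queue in plain Python. It is a toy under stated assumptions rather than a real sched_ext policy: `VtimeQueue`, `simulate`, the weights, and the slice length are hypothetical. The enqueue side only decides where a task sits in the order; the dispatch side only takes the task with the smallest virtual time.

```python
import heapq

# Toy vtime-ordered queue. VtimeQueue, the weights and the slice length are
# hypothetical; this runs in user space and is not a sched_ext scheduler.


class VtimeQueue:
    def __init__(self):
        self._heap = []  # entries are (vtime, seq, name)
        self._seq = 0    # tie-breaker so equal vtimes stay in FIFO order

    def enqueue(self, name, vtime):
        """Enqueue phase: record the task's position in the order."""
        heapq.heappush(self._heap, (vtime, self._seq, name))
        self._seq += 1

    def dispatch(self):
        """Dispatch phase: hand out the task with the smallest vtime."""
        if not self._heap:
            return None
        vtime, _, name = heapq.heappop(self._heap)
        return name, vtime


def simulate(weights, slice_ms, steps):
    """Run `steps` slices; a task's vtime advances by slice_ms / weight."""
    queue = VtimeQueue()
    for name in weights:
        queue.enqueue(name, 0.0)
    trace = []
    for _ in range(steps):
        name, vtime = queue.dispatch()
        trace.append(name)
        queue.enqueue(name, vtime + slice_ms / weights[name])
    return trace


trace = simulate({"heavy": 2, "light": 1}, slice_ms=10, steps=6)
```

Over six slices the task with weight 2 runs four times and the task with weight 1 runs twice, because a higher weight makes virtual time advance more slowly.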
Hooks
Instead of CFS picking tasks via pick_next_task_fair(), sched_ext allows pick_next_task_ext() to be delegated to BPF.
sched_ext exposes the following hooks:
- enqueue
- dequeue
- pick_next_task
- dispatch
- task_tick
- task_exit
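The idea of delegating scheduling decisions through hooks can be sketched as a table of optional callbacks. This is a user-space toy, not the sched_ext interface: `HookTable`, `Framework`, the callback signatures, and the FIFO fallback are assumptions made for illustration, and only the `enqueue` and `dispatch` hooks from the list are modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Toy model of a hook table: the framework calls a hook when the loaded
# scheduler provides one and falls back to a built-in default otherwise.
# HookTable, Framework and the fallback policy are hypothetical.


@dataclass
class HookTable:
    name: str
    enqueue: Optional[Callable[[List[str], str], None]] = None
    dispatch: Optional[Callable[[List[str]], Optional[str]]] = None


class Framework:
    def __init__(self, hooks: HookTable):
        self.hooks = hooks
        self.queue: List[str] = []

    def task_runnable(self, task: str) -> None:
        if self.hooks.enqueue is not None:
            self.hooks.enqueue(self.queue, task)  # delegated to the scheduler
        else:
            self.queue.append(task)               # assumed default: plain FIFO

    def cpu_needs_work(self) -> Optional[str]:
        if self.hooks.dispatch is not None:
            return self.hooks.dispatch(self.queue)
        return self.queue.pop(0) if self.queue else None


def lifo_enqueue(queue: List[str], task: str) -> None:
    """Example policy: the newest task runs first."""
    queue.insert(0, task)


def run_all(fw: Framework, tasks: List[str]) -> List[str]:
    """Make every task runnable, then drain the queue and return the run order."""
    for task in tasks:
        fw.task_runnable(task)
    order = []
    while True:
        task = fw.cpu_needs_work()
        if task is None:
            return order
        order.append(task)


fifo_order = run_all(Framework(HookTable(name="default")), ["a", "b", "c"])
lifo_order = run_all(Framework(HookTable(name="lifo", enqueue=lifo_enqueue)), ["a", "b", "c"])
```

Swapping in a single hook changes the run order from a, b, c to c, b, a while the framework code stays the same, which is the property that lets a scheduler be replaced at runtime.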