vfs: add per-instance inotify watch and event-queue caps#13188

Open
ibondarenko1 wants to merge 1 commit into google:master from ibondarenko1:hardening/inotify-resource-caps

Conversation

@ibondarenko1 ibondarenko1 commented May 14, 2026

Summary

pkg/sentry/vfs/inotify.go has no per-instance cap on the number of watches an *Inotify can hold or on the depth of its pending-event queue. AddWatch extends i.watches (line 313) and Watches.ws (line 433) without a size check, and queueEvent (line 275) appends to i.events without checking length.

Linux caps both. inotify_new_watch in fs/notify/inotify/inotify_user.c returns ENOSPC when the per-user UCOUNT_INOTIFY_WATCHES quota is reached (default at least 8192), and fsnotify_insert_event in fs/notify/notification.c emits a single IN_Q_OVERFLOW marker when group->q_len reaches max_events (default 16384). gVisor enforces neither limit.

This PR adds two per-instance caps matching the Linux default values:

maxInotifyWatchesPerInstance = 8192
maxInotifyQueuedEvents       = 16384

AddWatch returns ENOSPC once len(i.watches) reaches the cap. queueEvent tracks queue length via numQueuedEvents under evMu and, on overflow, emits a single IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue tail unless one is already there. Subsequent overflowing events are dropped silently, matching fsnotify_insert_event.
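
The overflow rule described above can be sketched in isolation. This is a minimal illustration, not the code in the PR: boundedQueue, event, push and drain are hypothetical names, locking (evMu in the real code) is omitted, and the existing duplicate-event coalescing in queueEvent is left out.

```go
package main

import "fmt"

// inQOverflow mirrors the Linux IN_Q_OVERFLOW bit (0x4000).
const inQOverflow uint32 = 0x4000

// event is a hypothetical stand-in for the sentry's inotify event type.
type event struct {
	wd   int32
	mask uint32
}

// boundedQueue is a hypothetical, simplified stand-in for the Inotify event
// queue. The real code guards this state with evMu; locking is omitted here.
type boundedQueue struct {
	events []event
	max    int
}

// push appends ev unless the queue is full. On overflow it appends a single
// overflow marker (unless the tail already is one) and drops ev. It reports
// whether ev itself was queued.
func (q *boundedQueue) push(ev event) bool {
	if len(q.events) >= q.max {
		if n := len(q.events); n == 0 || q.events[n-1].mask != inQOverflow {
			q.events = append(q.events, event{wd: -1, mask: inQOverflow})
		}
		return false
	}
	q.events = append(q.events, ev)
	return true
}

// drain returns all pending events and empties the queue, which is what lets
// new events be queued again after the reader consumes the marker.
func (q *boundedQueue) drain() []event {
	out := q.events
	q.events = nil
	return out
}

func main() {
	q := &boundedQueue{max: 4}
	for i := 0; i < 10; i++ {
		q.push(event{wd: 1, mask: 0x100}) // 0x100 is IN_CREATE
	}
	got := q.drain()
	fmt.Println(len(got), got[len(got)-1].wd)      // 5 -1
	fmt.Println(q.push(event{wd: 1, mask: 0x100})) // true
}
```

With a cap of 4 and 10 pushes, the reader sees the 4 queued events plus exactly one marker; the remaining 6 events are dropped without further markers.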

Affected code at HEAD (503ea178ff)

pkg/sentry/vfs/inotify.go lines 326-353 before the change:

// AddWatch constructs a new inotify watch and adds it to the target. It
// returns the watch descriptor returned by inotify_add_watch(2).
//
// The caller must hold a reference on target.
func (i *Inotify) AddWatch(target *Dentry, mask uint32) int32 {
    i.mu.Lock()
    defer i.mu.Unlock()

    ws := target.Watches()
    if existing := ws.Lookup(i.id); existing != nil {
        ...
        return existing.wd
    }

    w := i.newWatchLocked(target, ws, mask)
    return w.wd
}

pkg/sentry/vfs/inotify.go lines 275-293 before the change:

func (i *Inotify) queueEvent(ev *Event) {
    i.evMu.Lock()

    if last := i.events.Back(); last != nil {
        if ev.equals(last) {
            i.evMu.Unlock()
            return
        }
    }

    i.events.PushBack(ev)
    i.evMu.Unlock()

    i.queue.Notify(waiter.ReadableEvents)
}

Witness

Reproducer (alpine:3.20 container with --runtime=runsc, runsc release-20260406.0):

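docker run -d --runtime=runsc --name=v alpine:3.20 sleep 7200
# The stock alpine:3.20 image does not ship python3; install it first.
docker exec v apk add --no-cache python3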
docker exec v python3 -c '
import ctypes, os, tempfile
libc = ctypes.CDLL(None)
inotify_init = libc.inotify_init
inotify_add_watch = libc.inotify_add_watch
inotify_add_watch.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_uint32]
fd = inotify_init()
base = tempfile.mkdtemp()
for i in range(200000):
    d = os.path.join(base, "d%d" % i)
    os.mkdir(d)
    wd = inotify_add_watch(fd, d.encode(), 0x0FFF)
    assert wd >= 0
'

Measured sentry VmRSS:

| Stage | VmRSS | Delta |
|---|---|---|
| Baseline (sandbox idle) | 52 MB | - |
| After 20000 watches | 95 MB | +43 MB |
| After 50000 watches | 180 MB | +128 MB |
| After 200000 watches | 510 MB | +458 MB |

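No syscall returned ENOSPC. With the lower-bound Linux default, the kernel would have refused further watches after 8192 with ENOSPC. The sentry heap cost is approximately 2.3 KB per watch (458 MB over 200000 watches). At the observed sustained growth rate of approximately 4 MB per second, a sentry limited to 512 MB would reach its limit in roughly two minutes.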
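
The figures above follow from simple arithmetic on the measured values. A small sketch, assuming 1 MB = 10^6 bytes and a 512 MB sentry limit taken from the text; perWatchBytes and secondsToGrow are illustrative helper names:

```go
package main

import "fmt"

// perWatchBytes returns the average heap cost per watch, given the RSS
// growth in MB (taken as 10^6 bytes) and the number of watches added.
func perWatchBytes(deltaMB float64, watches int) float64 {
	return deltaMB * 1e6 / float64(watches)
}

// secondsToGrow returns how long RSS takes to grow by deltaMB at rateMBps.
func secondsToGrow(deltaMB, rateMBps float64) float64 {
	return deltaMB / rateMBps
}

func main() {
	cost := perWatchBytes(510-52, 200000)
	fmt.Printf("%.0f bytes per watch\n", cost)                          // 2290 bytes per watch
	fmt.Printf("%.0f s from baseline to 512 MB\n", secondsToGrow(512-52, 4)) // 115 s
	fmt.Printf("%.1f MB held by 8192 watches\n", cost*8192/1e6)         // 18.8 MB
}
```

The last line suggests that, at this per-watch cost, a single instance at the proposed cap would pin on the order of 19 MB of sentry heap.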

Linux reference

fs/notify/inotify/inotify_user.c defines inotify_table:

{
    .procname = "max_user_watches",
    .data     = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
    .maxlen   = sizeof(long),
    .mode     = 0644,
    .proc_handler = proc_doulongvec_minmax,
    ...
},
{
    .procname = "max_queued_events",
    .data     = &inotify_max_queued_events,
    ...
}

Default max_user_watches ranges from 8192 to 1048576 depending on system RAM. Default max_queued_events is 16384. gVisor adopts the lower bound of the watch range and the Linux default for queued events.
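
For context on where a value inside that range comes from: as far as I understand recent kernels, the default is sized so that watches may use up to about 1% of addressable memory, clamped to [8192, 1048576]. The sketch below encodes that understanding as an assumption rather than a statement about any particular kernel version, and the per-watch cost of 1024 bytes is a hypothetical placeholder, not the kernel's actual constant.

```go
package main

import "fmt"

const (
	minWatches = 8192    // lower bound of the default range
	maxWatches = 1048576 // upper bound of the default range
)

// defaultMaxUserWatches sketches the assumed sizing rule: 1% of RAM divided
// by a per-watch cost, clamped to [minWatches, maxWatches].
func defaultMaxUserWatches(ramBytes, watchCostBytes uint64) uint64 {
	n := ramBytes / 100 / watchCostBytes
	if n < minWatches {
		return minWatches
	}
	if n > maxWatches {
		return maxWatches
	}
	return n
}

func main() {
	const assumedCost = 1024 // hypothetical per-watch cost in bytes
	for _, ram := range []uint64{64 << 20, 16 << 30, 1 << 40} {
		fmt.Println(ram, defaultMaxUserWatches(ram, assumedCost))
	}
}
```

Under these assumptions a small machine lands on 8192, which is the value this PR hard-codes.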

inotify_new_watch enforces the watch quota:

if (!inc_inotify_watches(group->inotify_data.ucounts)) {
    inotify_remove_from_idr(group, tmp_i_mark);
    ret = -ENOSPC;
    goto out_err;
}

fsnotify_insert_event enforces the queue quota by inserting an overflow marker once group->q_len >= group->max_events.

Change

  1. pkg/sentry/vfs/inotify.go

    • Add maxInotifyWatchesPerInstance and maxInotifyQueuedEvents constants near inotifyEventBaseSize.
    • Add numQueuedEvents int field to Inotify, protected by evMu.
    • AddWatch signature changes from (int32) to (int32, error). Returns linuxerr.ENOSPC once len(i.watches) >= maxInotifyWatchesPerInstance.
    • queueEvent increments numQueuedEvents after PushBack. On overflow, emits a single IN_Q_OVERFLOW marker at the tail if one is not already present and returns without queuing the would-be event.
    • The reader loop decrements numQueuedEvents when an event is removed via i.events.Remove(event).
  2. pkg/sentry/syscalls/linux/sys_inotify.go

    • Propagate the new AddWatch error to the inotify_add_watch(2) syscall return.
  3. pkg/sentry/fsimpl/kernfs/kernfs_test.go

    • Two AddWatch call sites updated to use the new (wd, err) return; t.Fatal on unexpected error.
  4. pkg/sentry/vfs/inotify_test.go (new file).

Out of scope for this PR

fs.inotify.max_user_instances (Linux default 128) is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES. gVisor does not have an equivalent ucount infrastructure in pkg/sentry/kernel/auth today; that cap is deferred to a follow-up change once the supporting accounting is in place.

The cap added here applies to each Inotify instance. The Linux limit is per user, counted across all of that user's instances. The per-instance cap is therefore narrower: a process holding multiple Inotify instances can still exceed the Linux per-user total. Once UCOUNT-like accounting lands in pkg/sentry/kernel/auth, a per-user cap can be added on top of this per-instance one.
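
The gap between the two scopes can be made concrete with a toy model. Everything here is hypothetical: userCounter only gestures at the role ucounts play in Linux, and nothing like it exists in gVisor today.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

const perInstanceCap = 8192

// userCounter is a hypothetical per-user watch counter shared by all of a
// user's instances.
type userCounter struct {
	n   atomic.Int64
	max int64
}

// tryInc reserves one watch and reports false if that would exceed max.
func (u *userCounter) tryInc() bool {
	if u.n.Add(1) > u.max {
		u.n.Add(-1)
		return false
	}
	return true
}

// instance models one inotify fd. user is nil when only the per-instance
// cap applies, as in this PR.
type instance struct {
	user    *userCounter
	watches int
}

func (i *instance) addWatch() bool {
	if i.watches >= perInstanceCap {
		return false
	}
	if i.user != nil && !i.user.tryInc() {
		return false
	}
	i.watches++
	return true
}

// totalAfterFlood fills the given number of instances until each refuses a
// watch and returns the total number of watches held.
func totalAfterFlood(instances int, user *userCounter) int {
	total := 0
	for k := 0; k < instances; k++ {
		inst := &instance{user: user}
		for inst.addWatch() {
			total++
		}
	}
	return total
}

func main() {
	fmt.Println(totalAfterFlood(4, nil))                     // 32768
	fmt.Println(totalAfterFlood(4, &userCounter{max: 8192})) // 8192
}
```

With only the per-instance cap, four instances hold four times the limit; layering a shared counter on top brings the total back to the per-user figure.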

Test plan

  • gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.go returns clean.
  • Witness reproduced before the change: sentry RSS grew from 52 MB to 510 MB on 200000 watches with no ENOSPC returned.
  • Witness rerun after the change should return ENOSPC at watch 8192 (deferred to CI; local Bazel chain has an unrelated cannot find 'ld' issue on the protobuf tool host-link in my environment).
  • Regression tests added: TestInotifyAddWatchReturnsENOSPCAtCap and TestInotifyQueueOverflowEmitsMarker in inotify_test.go.

Related

CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400 Uncontrolled Resource Consumption, CVSS 4.8 Medium, fixed in gVisor commit 6a112c60a257dadac59962e0bc9e9b5aee70b5b6) is the class precedent. The required attacker prerequisites there were higher (root user inside sandbox with mount permission). The inotify gap addressed here is reachable from an unprivileged sandbox process with no special capability.

Notes

This PR is hardening only. It does not claim a CVE and does not request a CVE. The CVE precedent is cited to document class lineage and to surface the conservative defaults rationale.

@ibondarenko1 ibondarenko1 force-pushed the hardening/inotify-resource-caps branch from 19badfc to 5f413d5 Compare May 16, 2026 07:34
Comment thread pkg/sentry/vfs/inotify_test.go Outdated
Contributor

Thanks for the patch. Would you please consider adding the new tests to the existing cpp test suite in inotify.cc instead of having a new Go test? It is the prevailing practice to add tests like these to those "syscall" suites; one advantage is that we get to compare and verify behavior against Linux. IIUC, the two new behaviors here, limiting watch count and emitting an overflow marker, are both Linux behaviors.

@ibondarenko1 ibondarenko1 force-pushed the hardening/inotify-resource-caps branch from 5f413d5 to 9d65bca Compare May 19, 2026 18:58
@ibondarenko1
Author

@shailend-g — thanks for the review. Ported both tests to the C++ syscall suite at test/syscalls/linux/inotify.cc and removed the Go test file. The two new tests (Inotify.WatchCapReturnsENOSPC and Inotify.QueueOverflowEmitsMarker) read /proc/sys/fs/inotify/max_user_watches and /proc/sys/fs/inotify/max_queued_events respectively and verify the ENOSPC / IN_Q_OVERFLOW behavior against whichever cap the runner is enforcing, so the tests pass on both Linux and gVisor with the same code path. They SKIP_IF the sysctl is configured beyond 16384 to keep test runtime bounded on large-RAM Linux hosts. Force-pushed in commit 9d65bca.
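
The skip logic described above is small enough to sketch. The actual tests are C++ in test/syscalls/linux/inotify.cc; this Go version only illustrates the idea, and parseLimit, shouldSkip and maxTestableLimit are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// maxTestableLimit is the largest sysctl value the test is willing to
// exercise, keeping runtime bounded on hosts with large limits.
const maxTestableLimit = 16384

// parseLimit parses the contents of an inotify sysctl file such as
// /proc/sys/fs/inotify/max_user_watches.
func parseLimit(raw string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(raw), 10, 64)
}

// shouldSkip reports whether the configured limit is too large to test.
func shouldSkip(limit int64) bool {
	return limit > maxTestableLimit
}

func main() {
	raw, err := os.ReadFile("/proc/sys/fs/inotify/max_user_watches")
	if err != nil {
		fmt.Println("sysctl not readable:", err)
		return
	}
	limit, err := parseLimit(string(raw))
	if err != nil {
		fmt.Println("unexpected sysctl contents:", err)
		return
	}
	fmt.Printf("limit=%d skip=%v\n", limit, shouldSkip(limit))
}
```

Reading the limit from the runner, rather than hard-coding 8192, is what lets one test body check both Linux and gVisor.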

Comment thread test/syscalls/linux/inotify.cc Outdated
// them. Push past the cap so the kernel must emit IN_Q_OVERFLOW.
for (long i = 0; i < max_q + 16; i++) {
const std::string name = absl::StrCat(root.path(), "/q", i);
ASSERT_THAT(open(name.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0644),
Contributor

Are we leaking the fd returned by open()? Consider using Open() in file_descriptor.h.

// The drained event count never exceeds the cap plus the overflow marker.
EXPECT_LE(events.size(), static_cast<size_t>(max_q + 1));
(void)wd;
}
Contributor

Only if you too think it's valuable, consider extending this test to verify that events still flow through after userspace consumes the overflow marker.

Inotify in pkg/sentry/vfs/inotify.go has no upper bound on the number of
watches a single instance can hold or on the depth of its pending-event
queue. AddWatch grows i.watches (line 313) and the target's Watches.ws
(line 433) without a size check; queueEvent (line 275) appends to
i.events without checking length.

Linux fs/notify/inotify/inotify_user.c caps both. The kernel returns
ENOSPC from inotify_new_watch when the per-user UCOUNT_INOTIFY_WATCHES
quota is reached (default 8192), and fsnotify_insert_event emits a
single IN_Q_OVERFLOW marker when group->q_len reaches max_events
(default 16384). Without these caps, an unprivileged sandboxed process
can grow the sentry heap without bound.

Witness (kali, runsc release-20260406.0, alpine 3.20 sandbox):

  Baseline VmRSS = 52 MB.
  inotify_add_watch x 200000 distinct dirs from one inotify fd.
  Post-flood VmRSS = 510 MB. No ENOSPC returned at any step.
  Sustainable growth rate approximately 4 MB per second.
  Default sentry memory caps would OOM within minutes.

Add two per-instance caps matching the Linux default values:

  maxInotifyWatchesPerInstance = 8192
  maxInotifyQueuedEvents       = 16384

AddWatch now returns (int32, error) and returns ENOSPC once
len(i.watches) reaches the cap. queueEvent tracks queue length via
numQueuedEvents under evMu and, on overflow, emits a single
IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue
tail unless one is already there. Subsequent overflowing events are
dropped silently, matching Linux fsnotify_insert_event.

A separate Linux limit, fs.inotify.max_user_instances (default 128),
is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES in the
kernel. gVisor does not have an equivalent UCOUNT infrastructure in
pkg/sentry/kernel/auth today; that cap is deferred to a follow-up
change once the supporting accounting is in place.

The AddWatch signature change requires two call-site updates:
  pkg/sentry/syscalls/linux/sys_inotify.go - propagate the error to
    the inotify_add_watch(2) caller.
  pkg/sentry/fsimpl/kernfs/kernfs_test.go - existing tests use t.Fatal
    on unexpected errors.

Adds two regression tests in pkg/sentry/vfs/inotify_test.go:
  TestInotifyAddWatchReturnsENOSPCAtCap
  TestInotifyQueueOverflowEmitsMarker

Tested:
  gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go \
    pkg/sentry/syscalls/linux/sys_inotify.go \
    pkg/sentry/fsimpl/kernfs/kernfs_test.go
  (clean)

Related: CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400,
CVSS 4.8) is the class precedent. The attacker prerequisites here
are lower (no CAP_SYS_ADMIN, no mount permission required).
@ibondarenko1 ibondarenko1 force-pushed the hardening/inotify-resource-caps branch from 9d65bca to 829bddb Compare May 19, 2026 23:17
@ibondarenko1
Author

@shailend-g thanks for the review. Both points addressed in commit 829bddb9.

  1. The raw open() in TestQueueOverflowEmitsMarker is replaced with Open() from test/util/file_descriptor.h. The returned FileDescriptor destructs at end of statement so the fd is closed immediately and no fd leaks across the cap+16 iterations.

  2. Added a post-overflow recovery assertion at the end of the same test: after DrainEvents consumes the IN_Q_OVERFLOW marker, trigger one more create and assert the next DrainEvents returns at least one IN_CREATE event on wd. This verifies that events flow through after the marker is consumed, not just that the marker itself appears.

Net diff in inotify.cc: +93/-3 across the two existing tests, no other files affected.
