vfs: add per-instance inotify watch and event-queue caps#13188
vfs: add per-instance inotify watch and event-queue caps#13188ibondarenko1 wants to merge 1 commit into
Conversation
19badfc to
5f413d5
Compare
There was a problem hiding this comment.
Thanks for the patch. Would you please consider adding the new tests to the existing cpp test suite in inotify.cc instead of having a new Go test? It is the prevailing practice to add tests like these to those "syscall" suites; one advantage is that we get to compare and verify behavior against Linux. IIUC, the two new behaviors here, limiting watch count and emitting an overflow marker, are both Linux behaviors.
5f413d5 to
9d65bca
Compare
|
@shailend-g — thanks for the review. Ported both tests to the C++ syscall suite at test/syscalls/linux/inotify.cc and removed the Go test file. The two new tests ( |
| // them. Push past the cap so the kernel must emit IN_Q_OVERFLOW. | ||
| for (long i = 0; i < max_q + 16; i++) { | ||
| const std::string name = absl::StrCat(root.path(), "/q", i); | ||
| ASSERT_THAT(open(name.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0644), |
There was a problem hiding this comment.
Are we leaking the fd returned by open()? Consider using Open() in file_descriptor.h.
| // The drained event count never exceeds the cap plus the overflow marker. | ||
| EXPECT_LE(events.size(), static_cast<size_t>(max_q + 1)); | ||
| (void)wd; | ||
| } |
There was a problem hiding this comment.
Only if you too think its valuable, consider continuing this test to include a verification for the fact that events still flow through after userspace consumes the overflow marker.
Inotify in pkg/sentry/vfs/inotify.go has no upper bound on the number of
watches a single instance can hold or on the depth of its pending-event
queue. AddWatch grows i.watches (line 313) and the target's Watches.ws
(line 433) without a size check; queueEvent (line 275) appends to
i.events without checking length.
Linux fs/notify/inotify/inotify_user.c caps both. The kernel returns
ENOSPC from inotify_new_watch when the per-user UCOUNT_INOTIFY_WATCHES
quota is reached (default 8192), and fsnotify_insert_event emits a
single IN_Q_OVERFLOW marker when group->q_len reaches max_events
(default 16384). Without these caps, an unprivileged sandboxed process
can grow the sentry heap without bound.
Witness (kali, runsc release-20260406.0, alpine 3.20 sandbox):
Baseline VmRSS = 52 MB.
inotify_add_watch x 200000 distinct dirs from one inotify fd.
Post-flood VmRSS = 510 MB. No ENOSPC returned at any step.
Sustainable growth rate approximately 4 MB per second.
Default sentry memory caps would OOM within minutes.
Add two per-instance caps matching the Linux default values:
maxInotifyWatchesPerInstance = 8192
maxInotifyQueuedEvents = 16384
AddWatch now returns (int32, error) and returns ENOSPC once
len(i.watches) reaches the cap. queueEvent tracks queue length via
numQueuedEvents under evMu and, on overflow, emits a single
IN_Q_OVERFLOW marker (wd = -1, mask = IN_Q_OVERFLOW) at the queue
tail unless one is already there. Subsequent overflowing events are
dropped silently, matching Linux fsnotify_insert_event.
A separate Linux limit, fs.inotify.max_user_instances (default 128),
is enforced per user namespace via UCOUNT_INOTIFY_INSTANCES in the
kernel. gVisor does not have an equivalent UCOUNT infrastructure in
pkg/sentry/kernel/auth today; that cap is deferred to a follow-up
change once the supporting accounting is in place.
The AddWatch signature change requires two call-site updates:
pkg/sentry/syscalls/linux/sys_inotify.go - propagate the error to
the inotify_add_watch(2) caller.
pkg/sentry/fsimpl/kernfs/kernfs_test.go - existing tests use t.Fatal
on unexpected errors.
Adds two regression tests in pkg/sentry/vfs/inotify_test.go:
TestInotifyAddWatchReturnsENOSPCAtCap
TestInotifyQueueOverflowEmitsMarker
Tested:
gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.go
(clean)
Related: CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400,
CVSS 4.8) is the class precedent. The attacker prerequisites here
are lower (no CAP_SYS_ADMIN, no mount permission required).
9d65bca to
829bddb
Compare
|
@shailend-g thanks for the review. Both points addressed in commit
Net diff in |
Summary
pkg/sentry/vfs/inotify.gohas no per-instance cap on the number of watches an*Inotifycan hold or on the depth of its pending-event queue.AddWatchextendsi.watches(line 313) andWatches.ws(line 433) without a size check, andqueueEvent(line 275) appends toi.eventswithout checking length.Linux
fs/notify/inotify/inotify_user.ccaps both. The kernel returnsENOSPCfrominotify_new_watchwhen the per-userUCOUNT_INOTIFY_WATCHESquota is reached (default 8192).fsnotify_insert_eventemits a singleIN_Q_OVERFLOWmarker whengroup->q_lenreachesmax_events(default 16384). gVisor accepts both without bound.This PR adds two per-instance caps matching the Linux default values:
AddWatchreturnsENOSPConcelen(i.watches)reaches the cap.queueEventtracks queue length vianumQueuedEventsunderevMuand, on overflow, emits a singleIN_Q_OVERFLOWmarker (wd = -1, mask =IN_Q_OVERFLOW) at the queue tail unless one is already there. Subsequent overflowing events are dropped silently, matchingfsnotify_insert_event.Affected code at HEAD (
503ea178ff)pkg/sentry/vfs/inotify.golines 326-353 before the change:pkg/sentry/vfs/inotify.golines 275-293 before the change:Witness
Reproducer (alpine:3.20 container with
--runtime=runsc, runsc release-20260406.0):Measured sentry
VmRSS:No syscall returned
ENOSPC. Linux would have stopped at watch #8192 withENOSPC. Approximately 2.3 KB sentry heap per watch. Sustainable consumption rate approximately 4 MB per second; a 512 MB sentry caps within minutes.Linux reference
fs/notify/inotify/inotify_user.cdefinesinotify_table:{ .procname = "max_user_watches", .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES], .maxlen = sizeof(long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, ... }, { .procname = "max_queued_events", .data = &inotify_max_queued_events, ... }Default
max_user_watchesranges from 8192 to 1048576 depending on system RAM. Defaultmax_queued_eventsis 16384. gVisor adopts the lower-bound conservative defaults.inotify_new_watchenforces the watch quota:fsnotify_insert_eventenforces the queue quota by inserting an overflow marker oncegroup->q_len >= group->max_events.Change
pkg/sentry/vfs/inotify.gomaxInotifyWatchesPerInstanceandmaxInotifyQueuedEventsconstants nearinotifyEventBaseSize.numQueuedEvents intfield toInotify, protected byevMu.AddWatchsignature changes from(int32)to(int32, error). Returnslinuxerr.ENOSPConcelen(i.watches) >= maxInotifyWatchesPerInstance.queueEventincrementsnumQueuedEventsafterPushBack. On overflow, emits a singleIN_Q_OVERFLOWmarker at the tail if one is not already present and returns without queuing the would-be event.numQueuedEventswhen an event is removed viai.events.Remove(event).pkg/sentry/syscalls/linux/sys_inotify.goAddWatcherror to theinotify_add_watch(2)syscall return.pkg/sentry/fsimpl/kernfs/kernfs_test.goAddWatchcall sites updated to use the new(wd, err)return;t.Fatalon unexpected error.pkg/sentry/vfs/inotify_test.go(new file).Out of scope for this PR
fs.inotify.max_user_instances(Linux default 128) is enforced per user namespace viaUCOUNT_INOTIFY_INSTANCES. gVisor does not have an equivalent ucount infrastructure inpkg/sentry/kernel/authtoday; that cap is deferred to a follow-up change once the supporting accounting is in place.The cap added here is per-Inotify-instance. Linux is per-user across all instances of a user. The per-instance cap covers a narrower axis than Linux's per-user cap; a process holding multiple Inotify instances can still exceed the Linux equivalent total. Once UCOUNT-like accounting lands in pkg/sentry/kernel/auth, a per-user cap can be added on top of this per-instance one.
Test plan
gofmt -l pkg/sentry/vfs/inotify.go pkg/sentry/vfs/inotify_test.go pkg/sentry/syscalls/linux/sys_inotify.go pkg/sentry/fsimpl/kernfs/kernfs_test.goreturns clean.ENOSPCreturned.ENOSPCat watch fix(sec): upgrade github.com/opencontainers/runc to 1.1.2 #8192 (deferred to CI; local Bazel chain has an unrelatedcannot find 'ld'issue on the protobuf tool host-link in my environment).TestInotifyAddWatchReturnsENOSPCAtCapandTestInotifyQueueOverflowEmitsMarkerininotify_test.go.Related
CVE-2023-7258 (gVisor mount-point ref-counting DoS, CWE-400 Uncontrolled Resource Consumption, CVSS 4.8 Medium, fixed in gVisor commit
6a112c60a257dadac59962e0bc9e9b5aee70b5b6) is the class precedent. The required attacker prerequisites there were higher (root user inside sandbox with mount permission). The inotify gap addressed here is reachable from an unprivileged sandbox process with no special capability.Notes
This PR is hardening only. It does not claim a CVE and does not request a CVE. The CVE precedent is cited to document class lineage and to surface the conservative defaults rationale.