decrypt API lacks context.Context support: slow KMS round-trips cannot be cancelled mid-call #2179

@trendvidia

Summary

decrypt.Data(data []byte, format string) ([]byte, error) (and the related decrypt.File) doesn't accept a context.Context. Consumers that bind sops into a request path or a boot sequence can't interrupt a stuck KMS round-trip — the call blocks until the underlying provider's own internal timeout fires (often 30s+).

Reproduction

ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
defer cancel()

// We can check ctx before calling…
if err := ctx.Err(); err != nil {
    return err
}

// …but once we're in here, the ctx is irrelevant.
plaintext, err := decrypt.Data(ciphertext, "yaml")
// If the KMS provider is hung, this blocks for ~30s regardless
// of the 100ms deadline we set above.

Affects every key service that calls out to a remote provider: AWS KMS, GCP KMS, Azure Key Vault, HashiCorp Vault. The underlying provider SDKs (aws-sdk-go-v2, cloud.google.com/go/kms, etc.) accept contexts; sops just doesn't thread one through.

Real-world impact

We hit this in chameleon, a layered-config library that wraps sops/decrypt for encrypted layer files. A hung GCP KMS round-trip during application boot blocks initialization indefinitely instead of failing fast. Our only workaround is to check ctx.Err() before invoking sops; once we're inside the call, we're stuck. This is documented as a known limitation on our side: chameleon README ("Sops API has no ctx.Context hook").

Proposed API

Add ctx-aware variants alongside the existing functions (backward compatible):

// New, in package decrypt:
func DataWithContext(ctx context.Context, data []byte, format string) ([]byte, error)
func FileWithContext(ctx context.Context, path, format string) ([]byte, error)

The existing context-less functions can delegate to the ctx variants with context.Background(). Internally, the ctx is threaded through to each key-service call (the ctx argument that aws-sdk-go-v2 operations take, GCP's existing ctx-first methods, etc.).
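To make the delegation pattern concrete, here is a minimal standalone sketch. It is not sops code: slowUnwrap is a hypothetical stand-in for a remote KMS call, the function bodies are illustrative only, and timedOut is a helper invented for the demo. Only the two signatures mirror the proposal above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// simulatedHang is how long the pretend provider takes to answer.
const simulatedHang = 200 * time.Millisecond

// slowUnwrap is a hypothetical stand-in for a remote KMS round-trip.
// It returns early with ctx.Err() if the context ends first.
func slowUnwrap(ctx context.Context, ciphertext []byte) ([]byte, error) {
	select {
	case <-time.After(simulatedHang):
		return ciphertext, nil // pretend the payload was decrypted
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// DataWithContext has the proposed signature; the body is illustrative.
func DataWithContext(ctx context.Context, data []byte, format string) ([]byte, error) {
	return slowUnwrap(ctx, data)
}

// Data keeps the existing context-less signature and delegates.
func Data(data []byte, format string) ([]byte, error) {
	return DataWithContext(context.Background(), data, format)
}

// timedOut reports whether a call with the given deadline (in ms)
// fails with context.DeadlineExceeded.
func timedOut(deadlineMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(deadlineMs)*time.Millisecond)
	defer cancel()
	_, err := DataWithContext(ctx, []byte("x"), "yaml")
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("short deadline cancels the call:", timedOut(50))
	fmt.Println("generous deadline cancels the call:", timedOut(2000))
}
```

With this shape, existing callers of Data see no behaviour change, while new callers get the deadline honoured as soon as the innermost call observes ctx.Done().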

For sops v4 / next major: replace the existing signatures.

Alternatives considered

  • Goroutine + select-on-channel — works but leaks goroutines on cancel (no way to interrupt the blocked provider call). Worse than no fix.
  • runtime.Goexit from a watchdog goroutine — actively destructive: leaks resources held by the provider SDK (connections, mutexes).
  • Documented "don't call from latency-sensitive paths" — true but unhelpful for boot-time use.
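The goroutine leak in the first alternative can be illustrated with a small sketch. Everything here is hypothetical: blockingDecrypt stands in for a context-less decrypt call, and the inFlight counter exists only to make the leaked goroutine observable.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// blockingDecrypt is a hypothetical stand-in for a context-less call
// that cannot be interrupted once it has started.
func blockingDecrypt(hang time.Duration) []byte {
	time.Sleep(hang)
	return []byte("plaintext")
}

// decryptWithWatchdog wraps the blocking call in a goroutine and
// selects on ctx.Done(). The caller is unblocked on cancel, but the
// goroutine keeps running until blockingDecrypt returns on its own.
func decryptWithWatchdog(ctx context.Context, hang time.Duration, inFlight *int32) ([]byte, error) {
	done := make(chan []byte, 1) // buffered so the goroutine can exit later
	atomic.AddInt32(inFlight, 1)
	go func() {
		out := blockingDecrypt(hang)
		atomic.AddInt32(inFlight, -1)
		done <- out
	}()
	select {
	case out := <-done:
		return out, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// leakedAfterCancel returns how many wrapped calls are still running
// at the moment the wrapper returns to its caller.
func leakedAfterCancel(timeoutMs, hangMs int) int {
	var inFlight int32
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	_, _ = decryptWithWatchdog(ctx, time.Duration(hangMs)*time.Millisecond, &inFlight)
	return int(atomic.LoadInt32(&inFlight))
}

func main() {
	fmt.Println("still running after cancel:", leakedAfterCancel(20, 400))
	fmt.Println("still running after success:", leakedAfterCancel(400, 10))
}
```

The wrapper returns promptly on cancel, yet the underlying call (and whatever connections it holds) stays alive in the background, which is why this only hides the problem from the caller.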

Scope estimate

Mechanical: every kms.Decrypt(...) / kv.Decrypt(...) call site inside the key-service implementations under keyservice/ already accepts a context, so threading one through from a new DataWithContext entry point is a one-pass change. Tests need a fake key service that respects a cancel signal.
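A rough sketch of what such a fake could look like, assuming a simplified Decrypt(ctx, ciphertext) method. The fakeKeyService type, its fields, and the cancelUnblocks helper are hypothetical and do not match the real sops keyservice interface.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fakeKeyService is a hypothetical test double. Decrypt blocks until
// either release is closed or the caller's context ends.
type fakeKeyService struct {
	release chan struct{} // close to let Decrypt succeed
	entered chan struct{} // receives a value once Decrypt is in flight
}

func (f *fakeKeyService) Decrypt(ctx context.Context, ciphertext []byte) ([]byte, error) {
	select {
	case f.entered <- struct{}{}: // tell the test the call has started
	default: // nobody is listening, or already signalled
	}
	select {
	case <-f.release:
		return ciphertext, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// cancelUnblocks reports whether cancelling the context makes an
// in-flight Decrypt return context.Canceled within one second.
func cancelUnblocks() bool {
	fake := &fakeKeyService{
		release: make(chan struct{}),
		entered: make(chan struct{}, 1),
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	errc := make(chan error, 1)
	go func() {
		_, err := fake.Decrypt(ctx, []byte("enc"))
		errc <- err
	}()

	<-fake.entered // wait until the call is actually blocked
	cancel()

	select {
	case err := <-errc:
		return errors.Is(err, context.Canceled)
	case <-time.After(time.Second):
		return false // the fake ignored the cancel signal
	}
}

func main() {
	fmt.Println("cancel unblocks the fake:", cancelUnblocks())
}
```

Waiting on entered before cancelling keeps the test deterministic: the cancel is guaranteed to arrive while the call is in flight, not before it starts.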

Happy to put up a PR if there's appetite — would appreciate a maintainer comment first on the API shape (separate WithContext functions vs. breaking change).
