feat(metering): deterministic S3 keys + skip-if-exists for billing export#6242
Conversation
…port
The hourly cron was re-uploading the same (day, metric) data 24 times
per day, relying on Orb's idempotency_key dedup to absorb the
redundancy. This change makes each (day, metric) upload exactly once:
- S3 object key drops the runTimestamp prefix and becomes
`{day-compact}/{eventName}.jsonl` — one canonical name per
(day, metric).
- Before INSERT INTO FUNCTION s3, the cron does an S3 HeadObject; if
the file already exists for this (day, metric) it logs and skips.
Later ticks on the same UTC day are now self-healing no-ops.
- Cron schedule moves from `*/N * * * *` (top of every hour) to a
fixed `15 * * * *` — gives ClickHouse a 15-min buffer to ingest
late events for the previous UTC day before we snapshot it.
- Adds @aws-sdk/client-s3 to metering and a BILLING_EVENTS_S3_REGION
env var (defaults to us-west-2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f14f4de7f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (status === 404 || (err as Error).name === 'NotFound') { | ||
| return false; | ||
| } | ||
| throw err; |
There was a problem hiding this comment.
Handle 403 on missing-key HEAD checks
The new objectExists helper only treats 404/NotFound as “missing”, but HeadObject commonly returns 403 for a non-existent key when the caller lacks s3:ListBucket on the bucket. In that IAM setup, every first-time export key is treated as an error, so the cron logs a per-metric failure and never reaches the ClickHouse INSERT for that metric. This is a regression introduced by the preflight HEAD path and will block exports unless the runtime credentials are broadened.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
but HeadObject commonly returns 403 for a non-existent key when the caller lacks s3:ListBucket on the bucket
This one will make sure that the cron has access, so we get a 404 if the key/object does not exist
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f14f4de7f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| BILLING_EVENTS_S3_BUCKET: z.string().optional(), | ||
| BILLING_EVENTS_S3_WRITER_ROLE_ARN: z.string().optional(), | ||
| BILLING_EVENTS_S3_EVENT_NAME_SUFFIX: z.string().optional(), | ||
| BILLING_EVENTS_S3_REGION: z.string().optional().default('us-west-2'), |
There was a problem hiding this comment.
Avoid defaulting S3 region to a fixed value
BILLING_EVENTS_S3_REGION now defaults to us-west-2, and exec() always builds new S3Client({ region }) before the preflight HeadObject check. In any deployment where the export bucket is in a different region and this env var is not explicitly set, the HEAD call can fail with PermanentRedirect/region mismatch, causing each metric export to error out before INSERT INTO FUNCTION s3 runs. This is a regression from the previous flow (which did not require an SDK-side S3 region), so the region should be derived reliably (or redirect-following enabled) instead of hard-coding a default.
Useful? React with 👍 / 👎.
A failed (day, metric) export can now mean two distinct things: HeadObject blew up (e.g. IAM regression → 403) or the ClickHouse INSERT did (e.g. CH timeout). Both used to share the same success=false tag, so dashboards couldn't tell them apart without inspecting log lines. Tracks an explicit `step: 'head' | 'export'` variable through the try block; the catch tags the result metric with the in-flight step. The error log also includes the step for easier triage.
More meaningful when reading dashboards / log lines than the HTTP-verb shorthand.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: af4742d658
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| BILLING_EVENTS_S3_BUCKET: z.string().optional(), | ||
| BILLING_EVENTS_S3_WRITER_ROLE_ARN: z.string().optional(), | ||
| BILLING_EVENTS_S3_EVENT_NAME_SUFFIX: z.string().optional(), | ||
| BILLING_EVENTS_S3_REGION: z.string().optional().default('us-west-2'), |
There was a problem hiding this comment.
Remove hard-coded fallback for billing export S3 region
Defaulting BILLING_EVENTS_S3_REGION to us-west-2 makes the new HeadObject preflight run against a fixed region whenever the env var is unset, and S3Client is not configured with followRegionRedirects. In deployments where the billing export bucket is in another region, the existence check can fail (e.g., redirect/region mismatch), so each metric is marked failed before the ClickHouse export runs. This is a regression introduced by adding the SDK-side preflight check.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
1 issue found across 4 files
Confidence score: 3/5
- There is a concrete reliability risk in
packages/metering/lib/crons/billingEventsS3Export.ts: when S3 returns 403 for a missing object, the export can fail instead of treating the object as absent. - Given the issue’s higher severity/confidence (7/10, 8/10) and direct user-facing impact on billing event exports, this is a moderate merge risk rather than a merge-blocker.
- Pay close attention to
packages/metering/lib/crons/billingEventsS3Export.ts- missing-key handling should account for S3 403 behavior (or ensures3:ListBucketis granted).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/metering/lib/crons/billingEventsS3Export.ts">
<violation number="1" location="packages/metering/lib/crons/billingEventsS3Export.ts:226">
P1: Handle S3's 403-on-missing-object case, or ensure the writer role also has `s3:ListBucket`. Right now a missing key can fail the export instead of being treated as absent.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
| return true; | ||
| } catch (err) { | ||
| const status = (err as { $metadata?: { httpStatusCode?: number }; name?: string }).$metadata?.httpStatusCode; | ||
| if (status === 404 || (err as Error).name === 'NotFound') { |
There was a problem hiding this comment.
P1: Handle S3's 403-on-missing-object case, or ensure the writer role also has s3:ListBucket. Right now a missing key can fail the export instead of being treated as absent.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/metering/lib/crons/billingEventsS3Export.ts, line 226:
<comment>Handle S3's 403-on-missing-object case, or ensure the writer role also has `s3:ListBucket`. Right now a missing key can fail the export instead of being treated as absent.</comment>
<file context>
@@ -191,11 +211,25 @@ export async function exec(): Promise<void> {
+ return true;
+ } catch (err) {
+ const status = (err as { $metadata?: { httpStatusCode?: number }; name?: string }).$metadata?.httpStatusCode;
+ if (status === 404 || (err as Error).name === 'NotFound') {
+ return false;
+ }
</file context>
There was a problem hiding this comment.
Thanks for the update!
| // late events from the previous UTC day before we snapshot it. Each (day, metric) | ||
| // is uploaded exactly once (skip-if-exists below); later ticks on the same UTC | ||
| // day no-op, providing only self-healing if the first attempt failed. | ||
| const CRON_SCHEDULE = '15 * * * *'; |
There was a problem hiding this comment.
This should be an environment variable with a default?
| logger.error(`Clickhouse client not configured`); | ||
| return; | ||
| } | ||
| const s3 = new S3Client({ region }); |
There was a problem hiding this comment.
I think it would be better to construct an s3 client outside of the function scope and skip the destroy.
| logger.info(`✅ done`); | ||
| } finally { | ||
| await client.close(); | ||
| s3.destroy(); |
There was a problem hiding this comment.
See previous comments. I would only call destroy if we were shutting down.
Address two review points on the skip-if-exists PR: - Hardcoded `15 * * * *` schedule is now driven by CRON_BILLING_EVENTS_S3_EXPORT_SCHEDULE (defaults to `15 * * * *`). - S3Client moves from per-`exec()` instantiation to a module-level const, reused across cron ticks. The matching `s3.destroy()` in the exec() finally block is removed — the client's lifecycle matches the process, not a single run, so the SDK can pool TCP connections. `objectExists` no longer takes the client as a parameter; it reads the module-level handle.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0ef8c94f02
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| async function objectExists(key: string): Promise<boolean> { | ||
| try { | ||
| await s3.send(new HeadObjectCommand({ Bucket: bucket, Key: key })); |
There was a problem hiding this comment.
Use writer role for preflight object checks
The new existence gate can block exports even when the configured writer role is valid, because HeadObject here runs with the metering service’s own AWS credentials, not the BILLING_EVENTS_S3_WRITER_ROLE_ARN used by ClickHouse for the actual INSERT INTO FUNCTION s3(...) write path. In environments where the service identity lacks s3:GetObject/s3:ListBucket on this bucket (while the writer role has access), every metric will fail at step=s3_check and never execute the export query. Fresh evidence in this commit is that the preflight check is done via s3.send(...) in app code, while the role ARN is only injected later in SQL for ClickHouse.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
2 issues found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/metering/lib/crons/billingEventsS3Export.ts">
<violation number="1" location="packages/metering/lib/crons/billingEventsS3Export.ts:160">
P2: Validate the cron expression before scheduling; an invalid env value can now break cron registration at startup.</violation>
<violation number="2" location="packages/metering/lib/crons/billingEventsS3Export.ts:226">
P1: Handle S3's 403-on-missing-object case, or ensure the writer role also has `s3:ListBucket`. Right now a missing key can fail the export instead of being treated as absent.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
| } | ||
|
|
||
| cron.schedule(`*/${cronMinutes} * * * *`, () => { | ||
| cron.schedule(cronSchedule, () => { |
There was a problem hiding this comment.
P2: Validate the cron expression before scheduling; an invalid env value can now break cron registration at startup.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/metering/lib/crons/billingEventsS3Export.ts, line 160:
<comment>Validate the cron expression before scheduling; an invalid env value can now break cron registration at startup.</comment>
<file context>
@@ -157,7 +157,7 @@ export function billingEventsS3ExportCron(): void {
}
- cron.schedule(CRON_SCHEDULE, () => {
+ cron.schedule(cronSchedule, () => {
exec().catch((err: unknown) => {
logger.error('Cron tick failed unexpectedly', err);
</file context>
| CRON_BILLING_EVENTS_S3_EXPORT_MINUTES: z.coerce.number().optional().default(0), | ||
| // Fires every hour at :15 by default — the 15-min skew gives ClickHouse a buffer | ||
| // to ingest late events from the previous UTC day before we snapshot it. | ||
| CRON_BILLING_EVENTS_S3_EXPORT_SCHEDULE: z.string().optional().default('15 * * * *'), |
There was a problem hiding this comment.
I think we should keep this the same as the others - define minutes instead of cron pattern - in fact there is already one there CRON_BILLING_EVENTS_S3_EXPORT_MINUTES? Why can't that be used?
We also should not have cron value by default - they are usually set to zero so they don't run in self-hosted environments - and then we provide a value in the environment.
Per review — the existing CRON_BILLING_EVENTS_S3_EXPORT_MINUTES env var (defaults to 0 = disabled) is already the kill switch, no need for a separate _SCHEDULE string variable. Drops the _SCHEDULE env var added on the previous commit and goes back to `*/N * * * *`. The :15 minute offset (15-min late-event ingest buffer) is lost; if we end up needing it we can add it as a separate offset env var, but the hourly retries already self-heal.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ee420b1c4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (await objectExists(key)) { | ||
| logger.info(`Skipping ${eventName} for day=${day} (already in s3://${bucket}/${key})`); | ||
| continue; |
There was a problem hiding this comment.
Delay skip-if-exists until late-ingest window has passed
With the new deterministic key, objectExists short-circuits all subsequent exports for that (day, metric) after the first successful write, so any ClickHouse rows that arrive later are permanently dropped for billing. The previous implementation could absorb lag because it re-exported hourly to new keys, but this branch now makes correctness depend on the first run time while the scheduler still uses */${cronMinutes} (top-of-hour when set to 60), so a run near UTC midnight can lock in an incomplete snapshot.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/metering/lib/crons/billingEventsS3Export.ts">
<violation number="1" location="packages/metering/lib/crons/billingEventsS3Export.ts:160">
P2: Validate the cron expression before scheduling; an invalid env value can now break cron registration at startup.</violation>
<violation number="2" location="packages/metering/lib/crons/billingEventsS3Export.ts:226">
P1: Handle S3's 403-on-missing-object case, or ensure the writer role also has `s3:ListBucket`. Right now a missing key can fail the export instead of being treated as absent.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
The previous revert dropped the schedule string env var and went back
to `*/${cronMinutes} * * * *`, which fires the export cron at HH:00.
That loses the 15-min buffer we want for ClickHouse to ingest the
tail end of the previous UTC day before we snapshot it.
Reintroduce the offset as a numeric env var (per reviewer's
"should be minutes" feedback). Defaults to 0 (top of the hour);
deployment sets it to 15 to get HH:15 firing. MINUTES stays the
kill switch.
…o 15 Default to 15 so the cron fires at HH:15 out of the box (matching the original design intent: a ~15min buffer for CH to ingest the previous UTC day's tail before we snapshot it). 0 is still a valid setting if someone wants the cron at the top of the hour. Kill switch is unchanged: CRON_BILLING_EVENTS_S3_EXPORT_MINUTES <= 0 returns before cron.schedule(), regardless of OFFSET_MINUTES.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/metering/lib/crons/billingEventsS3Export.ts">
<violation number="1" location="packages/metering/lib/crons/billingEventsS3Export.ts:163">
P1: Pin this cron to UTC; otherwise the offset is evaluated in the host timezone and the export window can drift away from the intended UTC day boundary.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
| cron.schedule(`${cronOffsetMinutes} * * * *`, () => { | ||
| exec().catch((err: unknown) => { | ||
| logger.error('Cron tick failed unexpectedly', err); | ||
| }); |
There was a problem hiding this comment.
P1: Pin this cron to UTC; otherwise the offset is evaluated in the host timezone and the export window can drift away from the intended UTC day boundary.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/metering/lib/crons/billingEventsS3Export.ts, line 163:
<comment>Pin this cron to UTC; otherwise the offset is evaluated in the host timezone and the export window can drift away from the intended UTC day boundary.</comment>
<file context>
@@ -156,7 +157,10 @@ export function billingEventsS3ExportCron(): void {
+ // Hourly cron with a configurable minute-of-the-hour offset
+ // (CRON_BILLING_EVENTS_S3_EXPORT_OFFSET_MINUTES — default 0). The MINUTES var
+ // stays the kill switch (0 = disabled, anything > 0 enables).
+ cron.schedule(`${cronOffsetMinutes} * * * *`, () => {
exec().catch((err: unknown) => {
logger.error('Cron tick failed unexpectedly', err);
</file context>
Drops CRON_BILLING_EVENTS_S3_EXPORT_MINUTES and the just-added
CRON_BILLING_EVENTS_S3_EXPORT_OFFSET_MINUTES in favour of one
variable that does both jobs:
CRON_BILLING_EVENTS_S3_HOURLY_EXPORT_MINUTE
-1 (default) -> cron disabled, no schedule registered
0..59 -> hourly cron firing at that minute of the hour
Cadence is implicitly hourly (we don't need finer than that for the
"export yesterday's daily_* totals" job). Lock TTL is now hardcoded
to 30min since it no longer needs to derive from the interval.
Deploy-side rename in nango-environments lands separately, brief
dev breakage between deploys is acceptable.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be84687bb4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // cron fires on (0–59). -1 (default) disables the cron entirely. 15 gives | ||
| // ClickHouse a ~15min buffer to ingest the previous UTC day's tail before we | ||
| // snapshot it. | ||
| CRON_BILLING_EVENTS_S3_HOURLY_EXPORT_MINUTE: z.coerce.number().min(-1).max(59).optional().default(-1), |
There was a problem hiding this comment.
Preserve the existing cron enablement env var
In deployments that currently enable this export with CRON_BILLING_EVENTS_S3_EXPORT_MINUTES=60, replacing that schema entry with the new CRON_BILLING_EVENTS_S3_HOURLY_EXPORT_MINUTE means the old variable is ignored by parseEnvs, the new value defaults to -1, and billingEventsS3ExportCron() returns before scheduling anything. Unless every environment is updated in lockstep with this app deploy, billing event S3 exports will silently stop; keep the old kill-switch/interval env or derive the new minute from it during the transition.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="packages/utils/lib/environment/parse.ts">
<violation number="1" location="packages/utils/lib/environment/parse.ts:88">
P1: Defaulting this cron to `-1` makes the billing export stop running whenever the env var is unset. If the new schedule should be on by default, use `15` here and keep `-1` only as an explicit opt-out.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Fix all with cubic | Re-trigger cubic
| // cron fires on (0–59). -1 (default) disables the cron entirely. 15 gives | ||
| // ClickHouse a ~15min buffer to ingest the previous UTC day's tail before we | ||
| // snapshot it. | ||
| CRON_BILLING_EVENTS_S3_HOURLY_EXPORT_MINUTE: z.coerce.number().min(-1).max(59).optional().default(-1), |
There was a problem hiding this comment.
P1: Defaulting this cron to -1 makes the billing export stop running whenever the env var is unset. If the new schedule should be on by default, use 15 here and keep -1 only as an explicit opt-out.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At packages/utils/lib/environment/parse.ts, line 88:
<comment>Defaulting this cron to `-1` makes the billing export stop running whenever the env var is unset. If the new schedule should be on by default, use `15` here and keep `-1` only as an explicit opt-out.</comment>
<file context>
@@ -81,11 +81,11 @@ export const ENVS = z.object({
+ // cron fires on (0–59). -1 (default) disables the cron entirely. 15 gives
+ // ClickHouse a ~15min buffer to ingest the previous UTC day's tail before we
+ // snapshot it.
+ CRON_BILLING_EVENTS_S3_HOURLY_EXPORT_MINUTE: z.coerce.number().min(-1).max(59).optional().default(-1),
// Persist
</file context>
Changes
{day-compact}/{eventName}.jsonl(norunTimestampprefix) — one canonical key per (day, metric). The hourly cron does anS3 HeadObjectbeforeINSERT INTO FUNCTION s3and skips if the file already exists, so each (day, metric) lands in S3 exactly once instead of 24× per day.*/N * * * *) to15 * * * *so the first run of each day has a ~15-min buffer for ClickHouse to ingest any late events from the previous UTC day before we snapshot it. Later ticks on the same UTC day are pure self-healing no-ops.This one is taking care of the permissions https://github.com/NangoHQ/nango-infra/pull/132
Trade-off worth noting
With deterministic + skip-if-exists, the first run on day D+1 wins — late ClickHouse events arriving after that first snapshot don't make it into Orb. In practice the day boundary plus the 15-min skew should cover ingest lag, but we lose the "accidental absorb" property the old re-upload-every-hour pattern had.
Out of scope
FORCEenv var or backfill runbook — manualaws s3 rm+ wait-for-next-tick if we need to re-export a day.BILLING_USAGE_CLICKHOUSE_S3_EXPORT_RESULT.Test plan
npm run ts-build)Exported …log per (day, metric) on day D+1 at HH:15Skipping … (already in s3://…)logs and no S3 writesLinear: https://linear.app/nango/issue/NAN-5315