Reproducible · Live from main · Last run 2026-05-15 01:28 UTC · refreshes every push

Public benchmark scoreboard

A deterministic regression harness over the AiSOC substrate — the keyword extractors, the in-harness fusion grouping (a faithful re-implementation of the production Tier 1/2/3 logic in services/fusion, minus the DB-backed dedup and ML scoring), the report and response templates, and the offline judges that grade them. The dataset, the harness, and the CI gate are in the repo. The numbers on this page are pulled from eval/results/latest.json on the eval-results branch and refresh on every push to main.

Read this first: the harness does not exercise the live LLM agent. It runs deterministic substrate code against synthetic data so the CI gate can run in milliseconds. Three of the four metrics measure internal consistency of that substrate, not agent accuracy. The sections below describe what each suite measures and what it does not.

This page is the live snapshot of the latest CI run. For the append-only weekly history (substrate and wet-eval rows side-by-side with a MITRE-accuracy trend chart), see the public scoreboard in the docs.

2026 published KPI bar

The four buyer-side targets AiSOC commits to. The two derivable from the open harness are checked live below; the other two are tenant-runtime metrics surfaced in the in-app SLA dashboard.

Source: services/api/app/services/sla.py
False-positive rate
≤ 5% · in-app

A live tenant metric from `services/api`'s SLA service — surfaced in the in-app SLA dashboard, not on the public harness. The harness can't measure FP rate against synthetic data.

Alert-to-incident ratio
≥ 50:1 · Live: 4.0:1

Derived from the alert-reduction suite: a 1,000-alert noisy stream is fed through the in-harness fusion grouping, and the ratio is 1,000 ÷ incidents_out.

MITRE technique tagging
≥ 85% · Live: 96.4%

Per-template macro accuracy across 55 incident templates is the closest open-harness proxy. In-tenant data uses the live ECS technique-tag rate.

MITRE sub-technique tagging
≥ 60% · in-app

Live tenant metric. The synthetic harness labels at the tactic level; sub-technique coverage is reported from live detection content in the in-app SLA dashboard.

Latest results

Four metrics, four CI gates. A regression on any gate blocks the build. Each card shows the per-case mean and, where applicable, the per-template macro — an equal-weight average across 55 distinct incident templates that surfaces a single weak template the per-case mean would mask. Numbers come from latest.json on the most recent successful run on main.
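The difference between the per-case mean and the per-template macro can be sketched in a few lines. This is an illustrative computation, not the harness's code; the field names (`template`, `correct`) are invented for the example.

```python
from collections import defaultdict

def per_case_mean(results):
    """Mean over every incident, regardless of template."""
    return sum(r["correct"] for r in results) / len(results)

def per_template_macro(results):
    """Equal-weight mean over templates: each template's accuracy is
    computed first, then the template scores are averaged, so one weak
    template cannot hide behind the rest."""
    by_template = defaultdict(list)
    for r in results:
        by_template[r["template"]].append(r["correct"])
    scores = [sum(v) / len(v) for v in by_template.values()]
    return sum(scores) / len(scores)

# One weak template (t2) out of three: the macro drops much harder
# than the per-case mean, because t2 gets a full one-third weight.
results = (
    [{"template": "t0", "correct": 1}] * 90
    + [{"template": "t1", "correct": 1}] * 90
    + [{"template": "t2", "correct": 0}] * 10
    + [{"template": "t2", "correct": 1}] * 10
)
print(per_case_mean(results))                   # 0.95
print(round(per_template_macro(results), 3))    # 0.833
```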

Reduction ratio

Alert reduction

Pass
75.3% · target ≥70%
+5.3 pts above gate
Real measurement

A 1,000-alert noisy stream (duplicates, near-duplicates, rule storms, low-score chatter) is fed into an in-harness re-implementation of the production Tier 1 / 2 / 3 grouping rules — same logic, no DB-backed dedup or ML scorer. The number is whatever the code produces; a regression in the grouping rules moves it.

Alerts in
1000
Incidents out
247
Storms
16
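The card's two headline figures, and the KPI bar's live alert-to-incident ratio above, are direct arithmetic over the two counts on this card:

```python
alerts_in = 1000
incidents_out = 247

# Reduction ratio: fraction of alerts that did not become incidents.
reduction_ratio = 1 - incidents_out / alerts_in
# Alert-to-incident ratio reported in the KPI bar.
alert_to_incident = alerts_in / incidents_out

print(f"{reduction_ratio:.1%}")       # 75.3%
print(f"{alert_to_incident:.1f}:1")   # 4.0:1
```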

Tactic accuracy (per-case)

MITRE tactic accuracy

Pass
97.0% · target ≥80%
+17.0 pts above gate
Substrate self-consistency

Each synthetic incident is generated with a labeled tactic and a description written to include keywords the hand-curated extractor recognizes. The headline number mostly checks that dataset and extractor agree — useful as a regression sentinel for the extractor, not a measure of LLM agent accuracy.

Per-template macro

Equal-weight average across 55 distinct incident templates — surfaces a single weak template that the per-case mean would mask.

96.4%

target ≥80%

Regressions: outlook-auto-forward-rule, compromised-ci-runner

Incidents
200
Correct
194
F1 (per-case)
0.77

Mean keyword coverage

Investigation completeness

Pass
94.3% · target ≥85%
+9.3 pts above gate
Substrate self-consistency

The simulator wraps each incident's description in a Markdown report; the judge then looks for evidence keywords drawn from that same description. Close to a string-copy tautology — it confirms the report template includes the description and the judge can find keywords inside it. Catches template breakage, not LLM quality.

Per-template macro

Equal-weight average across 55 distinct incident templates — surfaces a single weak template that the per-case mean would mask.

94.3%

target ≥80%

Incidents
200
Fully covered
134 (67%)
Judge
Offline keyword

Mean rubric score

Response-plan quality

Pass
100.0% · target ≥80%
+20.0 pts above gate
Substrate self-consistency

The synthesizer embeds the expected MITRE techniques and first evidence keyword directly into the templated plan, then a 5-criterion rubric checks for them. By construction the score is ~1.000. Catches a broken templating pipeline; it is not a grade of LLM-written plans.

Per-template macro

Equal-weight average across 55 distinct incident templates — surfaces a single weak template that the per-case mean would mask.

100.0%

target ≥75%

Incidents
200
Criteria
5 (all hit by template)
Judge
Offline keyword

What each suite measures

Alert reduction (75.3%)

Real measurement

A 1,000-alert noisy stream with duplicates, near-duplicates, rule storms, and benign chatter is fabricated deterministically, then passed through fuse_alerts — an in-harness re-implementation of the same Tier 1 / 2 / 3 merge windows and score floor that the production fusion service runs. The grouping logic is identical; the harness skips the DB-backed deduplicator and ML scorer that ride on top in production. The reduction ratio is whatever the harness code emits. This is a legitimate measurement of the grouping logic, and a regression in those rules will move the number.
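The merge-window-plus-score-floor idea can be sketched as follows. This is a toy stand-in, not the repo's fuse_alerts: the field names, the 0.2 score floor, and the 5-minute window are invented for illustration; the real values live in services/fusion and its harness re-implementation.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    rule_id: str
    entity: str    # host or user the alert fired on
    ts: float      # epoch seconds
    score: float   # detection confidence, 0..1

SCORE_FLOOR = 0.2      # hypothetical: drop low-score chatter outright
MERGE_WINDOW = 300.0   # hypothetical: merge same-key alerts within 5 min

def fuse_alerts(alerts: list[Alert]) -> list[list[Alert]]:
    """Toy grouping: sort by time, drop sub-floor alerts, and merge
    alerts sharing (rule_id, entity) inside the merge window."""
    incidents: list[list[Alert]] = []
    last_seen: dict[tuple[str, str], tuple[float, list[Alert]]] = {}
    for a in sorted(alerts, key=lambda x: x.ts):
        if a.score < SCORE_FLOOR:
            continue                      # chatter never opens an incident
        key = (a.rule_id, a.entity)
        if key in last_seen and a.ts - last_seen[key][0] <= MERGE_WINDOW:
            last_seen[key][1].append(a)   # extend the open incident
            last_seen[key] = (a.ts, last_seen[key][1])
        else:
            group: list[Alert] = [a]      # open a new incident
            incidents.append(group)
            last_seen[key] = (a.ts, group)
    return incidents

# A storm of 3 duplicates plus one sub-floor alert collapses to 1 incident.
storm = [Alert("r1", "host-1", t, 0.9) for t in (0, 30, 60)]
noise = [Alert("r9", "host-2", 5, 0.05)]
print(len(fuse_alerts(storm + noise)))  # 1
```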

MITRE tactic accuracy (97.0%)

Substrate self-consistency

Each synthetic incident is generated with a tactic label, and its description is written to include keywords that the hand-curated extractor recognizes. The 97% is therefore largely a check that the dataset and the extractor agree with each other, not a measure of LLM-agent accuracy. The gate still has value as a regression sentinel: a misnamed tactic, a typo in the keyword table, or a lost tactic will fail it.

Investigation completeness (94.3%)

Substrate self-consistency

The simulator wraps the incident description in a Markdown report, and the judge looks for evidence keywords inside it. Those evidence keywords are drawn from the description, so the gate confirms that the report template includes the description and that the judge can find the keywords. It catches drops in the report template (for example a missing Summary section) but does not grade an LLM-written investigation.
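The near-tautology is easy to see in code. A minimal sketch of a keyword-coverage judge (the function name and report shape are assumptions, not the harness's actual judge):

```python
def keyword_coverage(report_md: str, evidence_keywords: list[str]) -> float:
    """Fraction of evidence keywords found anywhere in the report."""
    text = report_md.lower()
    hits = sum(1 for kw in evidence_keywords if kw.lower() in text)
    return hits / len(evidence_keywords) if evidence_keywords else 1.0

# The tautology: the report template embeds the description, and the
# evidence keywords are drawn from that same description, so coverage
# is ~1.0 whenever the template is intact.
description = "Beaconing from host-42 to 203.0.113.9 over port 8443"
report = f"## Summary\n{description}\n## Evidence\n..."
print(keyword_coverage(report, ["host-42", "203.0.113.9", "port 8443"]))  # 1.0
```

Dropping the Summary section from the template is exactly the kind of breakage this catches: coverage falls to 0.0 and the gate fails.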

Response-plan quality (1.000)

Substrate self-consistency

The synthesizer embeds the expected MITRE techniques and the first evidence keyword directly into the templated plan, and the rubric judge checks for them. The score is ~1.000 by construction. This catches a broken templating pipeline (for example, the synthesizer silently dropping an action class) but is not a grade of LLM output. The 1.000 is a green regression-gate signal, not a quality measurement.
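A sketch of what a 5-criterion keyword rubric looks like. The criteria here are invented for illustration; only the count (five) and the keyword-matching style come from the description above:

```python
def rubric_score(plan_md: str,
                 expected_techniques: list[str],
                 first_keyword: str) -> float:
    """Fraction of pass/fail criteria the plan satisfies
    (hypothetical criteria; the real rubric lives in the harness)."""
    text = plan_md.lower()
    checks = [
        all(t.lower() in text for t in expected_techniques),  # techniques named
        first_keyword.lower() in text,                        # evidence cited
        "contain" in text,                                    # containment step
        "eradicat" in text,                                   # eradication step
        "recover" in text,                                    # recovery step
    ]
    return sum(checks) / len(checks)

# Because the templated plan embeds the techniques and the first
# evidence keyword, every criterion passes by construction.
plan = ("Contain host-7, eradicate the persistence, recover services. "
        "Techniques: T1059, T1021. Evidence: lsass dump.")
print(rubric_score(plan, ["T1059", "T1021"], "lsass dump"))  # 1.0
```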

The next milestone is an online eval: nightly runs that drive the real LangGraph agent against the same dataset, with an LLM-as-judge gated by OPENAI_API_KEY. That is the run where actual agent accuracy is measured. Tracking issue: github.com/beenuar/AiSOC/issues.

Reproduce these numbers

No Docker, no API key, no GPU, no LLM call. The harness is deterministic and runs in roughly 25 ms.

git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py --json --out report.json

Expected output:

============================================================================
  AiSOC Pillar-1 Eval - 200-incident synthetic benchmark
============================================================================
  [PASS] mitre_accuracy               accuracy               0.970  (target >= 0.80)
  [PASS] alert_reduction              reduction_ratio        0.753  (target >= 0.70)
  [PASS] investigation_completeness   mean_keyword_coverage  0.943  (target >= 0.85)
  [PASS] response_quality             mean_rubric_score      1.000  (target >= 0.80)
============================================================================
  ALL GATES PASSED

For machine-readable output, pass --json or --ci --out report.json (the latter also exits non-zero on regression).
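A downstream consumer can re-apply the gates from report.json itself. The schema below (a top-level "suites" list with "name", "value", "target" fields) is an assumption inferred from the console output, not documented harness output; check the real report.json before relying on it:

```python
import json

def failing_gates(report: dict) -> list[str]:
    """Names of suites whose metric is below its target
    (empty list means all gates pass)."""
    return [s["name"] for s in report["suites"] if s["value"] < s["target"]]

if __name__ == "__main__":
    with open("report.json") as f:
        report = json.load(f)
    failed = failing_gates(report)
    for name in failed:
        print(f"REGRESSION: {name}")
    # Mirror --ci behavior: non-zero exit on any regression.
    raise SystemExit(1 if failed else 0)
```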

Comparison to other AI SOC offerings

Where a vendor publishes a number or a verifiable capability, it is cited. Where a vendor does not, the row is marked absent.

| Product | Alert reduction | MITRE accuracy gate | Decision audit | Self-host | Reproducible harness |
| --- | --- | --- | --- | --- | --- |
| AiSOC (open) | 75.3% (measured on fixed noisy stream) | 97% (substrate regression gate) | Per-step ledger | Yes (MIT) | Yes — every PR to main / develop |
| Closed-source AI SOC (closed) | Vendor claim, no harness | Not published | Vendor portal | No (cloud only) | No |
| Closed-source SOAR (closed) | N/A (SOAR) | Not applicable | Run history | On-prem option | No published harness |

A self-hostable, MIT-licensed agent with a published regression harness can be reviewed directly by an auditor. Vendor cloud agents typically cannot be reviewed at the same level.

What this is not

  • No LLM agent runs here. The harness exercises deterministic substrate code: extractors, fusion, templates, and keyword judges. The live LangGraph orchestrator (services/agents/app/investigator/) is not invoked. An online eval that drives it nightly is on the roadmap.
  • The dataset is synthetic. 200 incidents are enough to flag substrate regressions but not enough to claim production parity. Federated, opt-in real-customer evaluation is on the roadmap.
  • The judges are keyword-based. They can be gamed by template-stuffing. In three of the four suites the templates already include the keywords the judge looks for, which is why those suites are labeled substrate self-consistency rather than agent quality. The LLM-as-judge variant is the follow-up.
  • “Public eval harness” means this harness, not a third-party leaderboard. No outside body grades AiSOC. The dataset, the code, and the gates are open and CI-enforced, and anyone can run, audit, or extend the harness.

Community submissions

The dataset and the harness are MIT and reproducible. Any third party — another open-source project, a vendor, or an internal team — can run the same suite against the same 200 incidents and submit a result. Accepted entries appear here.

Submit a run

Submission rules

Same fixed dataset
Run against the deterministic 200-incident dataset at services/agents/tests/eval_data/synthetic_incidents.json on the commit you submit. No private fixtures.
Same harness
Run scripts/run_evals.py --json --out report.json with no flags that disable gates. Attach the full report.json so per-template macros are auditable.
Open agent or label as closed
If your agent code is open, link it. If it is closed, the entry is still accepted but is labeled "closed-source" so reviewers know they cannot reproduce internals.
No template-stuffing
The three substrate self-consistency suites (MITRE, completeness, response) are gameable by stuffing keywords into reports. Submissions caught doing this are rejected; the alert-reduction measurement is not gameable in the same way.

No accepted community submissions yet.

The leaderboard fills up as runs are merged. AiSOC's own numbers appear in the cards above.

Contributing to the harness

New fixtures for missed tactics or fusion edge cases, replacements for tautological judges, and the online LLM-as-judge variant are all in scope for contributions.