Public benchmark scoreboard
A deterministic regression harness over the AiSOC substrate — the keyword extractors, the in-harness fusion grouping (a faithful re-implementation of the production Tier 1/2/3 logic in services/fusion, minus the DB-backed dedup and ML scoring), the report and response templates, and the offline judges that grade them. The dataset, the harness, and the CI gate are in the repo. The numbers on this page are pulled from eval/results/latest.json on the eval-results branch and refresh on every push to main.
This page is the live snapshot of the latest CI run. For the append-only weekly history (substrate and wet-eval rows side-by-side with a MITRE-accuracy trend chart), see the public scoreboard in the docs.
2026 published KPI bar
The four buyer-side targets AiSOC commits to. The two derivable from the open harness are checked live below; the other two are tenant-runtime metrics surfaced in the in-app SLA dashboard.
| KPI | 2026 target | Live value |
|---|---|---|
| False-positive rate | ≤ 5% | In-app |
| Alert-to-incident ratio | ≥ 50:1 | 4.0:1 |
| MITRE technique tagging | ≥ 85% | 96.4% |
| MITRE sub-technique tagging | ≥ 60% | In-app |

- False-positive rate: a live tenant metric from `services/api`'s SLA service, surfaced in the in-app SLA dashboard rather than on the public harness. The harness can't measure FP rate against synthetic data.
- Alert-to-incident ratio: derived from the alert-reduction suite. A 1,000-alert noisy stream is fed through the in-harness fusion grouping, and the ratio is 1,000 ÷ incidents_out; a back-of-envelope check appears just below.
- MITRE technique tagging: per-template macro accuracy across 55 incident templates is the closest open-harness proxy. In-tenant data uses the live ECS technique-tag rate.
- MITRE sub-technique tagging: a live tenant metric. The synthetic harness labels at the tactic level; sub-technique coverage is reported from live detection content in the in-app SLA dashboard.
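The 4.0:1 figure follows directly from the 75.3% alert-reduction headline. A minimal back-of-envelope sketch (illustrative only, not harness code):

```python
# A 75.3% reduction on 1,000 alerts leaves ~247 incidents,
# which is the ~4:1 alert-to-incident ratio quoted above.
alerts_in = 1_000
reduction = 0.753                                    # alert-reduction headline from latest.json
incidents_out = round(alerts_in * (1 - reduction))   # 247
print(f"{alerts_in / incidents_out:.1f}:1")          # -> 4.0:1
```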
Latest results
Four metrics, four CI gates. A regression on any gate blocks the build. Each card shows the per-case mean and, where applicable, the per-template macro — an equal-weight average across 55 distinct incident templates that surfaces a single weak template the per-case mean would mask. Numbers come from latest.json on the most recent successful run on main.
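To make the macro-vs-mean distinction concrete, here is a small illustrative computation (template names and scores are invented, not taken from latest.json):

```python
from collections import defaultdict

def per_case_mean(results):
    """results: (template_name, score) pairs, one per incident case."""
    return sum(score for _, score in results) / len(results)

def per_template_macro(results):
    """Equal-weight average of each template's own mean score."""
    by_template = defaultdict(list)
    for name, score in results:
        by_template[name].append(score)
    return sum(sum(v) / len(v) for v in by_template.values()) / len(by_template)

# Toy data: 54 healthy templates with 4 cases each, plus one broken template with a single case.
results = [(f"template-{i:02d}", 1.0) for i in range(54) for _ in range(4)]
results += [("weak-template", 0.0)]

print(f"{per_case_mean(results):.3f}")       # 0.995: the broken template is nearly invisible
print(f"{per_template_macro(results):.3f}")  # 0.982: it costs a full 1/55 of the macro
```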
Alert reduction (reduction ratio)
A 1,000-alert noisy stream (duplicates, near-duplicates, rule storms, low-score chatter) is fed into an in-harness re-implementation of the production Tier 1 / 2 / 3 grouping rules — same logic, no DB-backed dedup or ML scorer. The number is whatever the code produces; a regression in the grouping rules moves it.

MITRE tactic accuracy (tactic accuracy, per-case)
Each synthetic incident is generated with a labeled tactic and a description written to include keywords the hand-curated extractor recognizes. The headline number mostly checks that dataset and extractor agree — useful as a regression sentinel for the extractor, not a measure of LLM agent accuracy. Per-template macro (equal-weight across the 55 templates): target ≥ 80%. Regressions: outlook-auto-forward-rule, compromised-ci-runner.

Investigation completeness (mean keyword coverage)
The simulator wraps each incident's description in a Markdown report; the judge then looks for evidence keywords drawn from that same description. Close to a string-copy tautology — it confirms the report template includes the description and the judge can find keywords inside it. Catches template breakage, not LLM quality. Per-template macro target ≥ 80%.

Response-plan quality (mean rubric score)
The synthesizer embeds the expected MITRE techniques and first evidence keyword directly into the templated plan, then a 5-criterion rubric checks for them. By construction the score is ~1.000. Catches a broken templating pipeline; it is not a grade of LLM-written plans. Per-template macro target ≥ 75%.
What each suite measures
Alert reduction (75.3%)
Real measurement. A 1,000-alert noisy stream with duplicates, near-duplicates, rule storms, and benign chatter is fabricated deterministically, then passed through fuse_alerts — an in-harness re-implementation of the same Tier 1 / 2 / 3 merge windows and score floor that the production fusion service runs. The grouping logic is identical; the harness skips the DB-backed deduplicator and ML scorer that ride on top in production. The reduction ratio is whatever the harness code emits. This is a legitimate measurement of the grouping logic, and a regression in those rules will move the number.
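For readers who want a feel for what "merge windows plus a score floor" means, here is a minimal sketch. The field names (rule_id, entity, ts, score) and the window and floor values are assumptions for illustration; this does not reproduce the actual fuse_alerts implementation or the Tier 1 / 2 / 3 rules in services/fusion.

```python
MERGE_WINDOW_S = 300   # hypothetical merge window, in seconds
SCORE_FLOOR = 0.2      # hypothetical score floor

def group_alerts(alerts):
    """Group an alert stream into incidents by (rule, entity) within a time window."""
    incidents = []
    open_group = {}  # (rule_id, entity) -> index of the incident currently open for that key
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        if alert["score"] < SCORE_FLOOR:
            continue  # low-score chatter is dropped rather than promoted
        key = (alert["rule_id"], alert["entity"])
        idx = open_group.get(key)
        if idx is not None and alert["ts"] - incidents[idx][-1]["ts"] <= MERGE_WINDOW_S:
            incidents[idx].append(alert)       # merge into the open incident
        else:
            open_group[key] = len(incidents)   # open a new incident for this key
            incidents.append([alert])
    return incidents

# reduction_ratio = 1 - len(group_alerts(stream)) / len(stream)
```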
MITRE tactic accuracy (97.0%)
Substrate self-consistency. Each synthetic incident is generated with a tactic label, and its description is written to include keywords that the hand-curated extractor recognizes. The 97% is therefore largely a check that the dataset and the extractor agree with each other, not a measure of LLM-agent accuracy. The gate still has value as a regression sentinel: a misnamed tactic, a typo in the keyword table, or a lost tactic will fail it.
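The shape of a keyword-table extractor is roughly the following; the table entries here are invented for illustration, and the real hand-curated table lives in the repo.

```python
# Invented keyword table; the real one is hand-curated in the repo.
TACTIC_KEYWORDS = {
    "credential-access": ["password spray", "kerberoast", "lsass dump"],
    "persistence": ["scheduled task", "registry run key", "auto-forward rule"],
    "exfiltration": ["dns tunnel", "large outbound transfer"],
}

def extract_tactics(description: str) -> set[str]:
    """Return every tactic whose keywords appear in the incident description."""
    text = description.lower()
    return {tactic for tactic, keywords in TACTIC_KEYWORDS.items()
            if any(kw in text for kw in keywords)}
```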
Investigation completeness (94.3%)
Substrate self-consistency. The simulator wraps the incident description in a Markdown report, and the judge looks for evidence keywords inside it. Those evidence keywords are drawn from the description, so the gate confirms that the report template includes the description and that the judge can find the keywords. It catches a dropped section in the report template (for example a missing Summary section) but does not grade an LLM-written investigation.
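The judging style is plain keyword coverage, along these lines (names are illustrative, not the harness's actual function):

```python
def keyword_coverage(report_markdown: str, evidence_keywords: list[str]) -> float:
    """Fraction of expected evidence keywords found anywhere in the report text."""
    text = report_markdown.lower()
    hits = sum(1 for kw in evidence_keywords if kw.lower() in text)
    return hits / len(evidence_keywords) if evidence_keywords else 1.0
```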
Response-plan quality (1.000)
Substrate self-consistency. The synthesizer embeds the expected MITRE techniques and the first evidence keyword directly into the templated plan, and the rubric judge checks for them. The score is ~1.000 by construction. This catches a broken templating pipeline (for example, the synthesizer silently dropping an action class) but is not a grade of LLM output. The 1.000 is a green regression-gate signal, not a quality measurement.
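A presence-based rubric of this kind can be sketched as follows; the five criteria shown are assumptions for illustration, not the harness's actual rubric.

```python
def rubric_score(plan_text: str, expected_techniques: list[str], first_keyword: str) -> float:
    """Score a templated response plan against simple presence checks (0.0 to 1.0)."""
    text = plan_text.lower()
    criteria = [
        all(t.lower() in text for t in expected_techniques),  # expected MITRE techniques cited
        first_keyword.lower() in text,                        # first evidence keyword present
        "contain" in text,                                    # containment actions mentioned
        "eradicat" in text,                                    # eradication actions mentioned
        "recover" in text,                                     # recovery actions mentioned
    ]
    return sum(criteria) / len(criteria)
```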
The next milestone is an online eval: nightly runs that drive the real LangGraph agent against the same dataset, with an LLM-as-judge gated by OPENAI_API_KEY. That is the run where actual agent accuracy is measured. Tracking issue: github.com/beenuar/AiSOC/issues.
Reproduce these numbers
No Docker, no API key, no GPU, no LLM call. The harness is deterministic and runs in roughly 25 ms.
```
git clone https://github.com/beenuar/AiSOC && cd AiSOC
python3 scripts/run_evals.py --json --out report.json
```

Expected output:

```
============================================================================
AiSOC Pillar-1 Eval - 200-incident synthetic benchmark
============================================================================
[PASS] mitre_accuracy accuracy 0.970 (target >= 0.80)
[PASS] alert_reduction reduction_ratio 0.753 (target >= 0.70)
[PASS] investigation_completeness mean_keyword_coverage 0.943 (target >= 0.85)
[PASS] response_quality mean_rubric_score 1.000 (target >= 0.80)
============================================================================
ALL GATES PASSED
```

For machine-readable output, pass --json or --ci --out report.json (the latter also exits non-zero on regression).
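If you want to consume report.json from your own pipeline instead of relying on the --ci exit code, a sketch like the following works. The field names used here (suites, name, value, target, passed) are assumptions about the report's shape, not a documented schema; inspect your own report.json before wiring this into CI.

```python
import json
import sys

with open("report.json") as fh:
    report = json.load(fh)

# Field names are assumed for illustration; adjust to the actual report.json layout.
failed = [s for s in report.get("suites", []) if not s.get("passed", False)]
for suite in failed:
    print(f"REGRESSION: {suite.get('name')} {suite.get('value')} below target {suite.get('target')}")

sys.exit(1 if failed else 0)
```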
Comparison to other AI SOC offerings
Where a vendor publishes a number or a verifiable capability, it is cited. Where a vendor does not, the row is marked absent.
| Product | Alert reduction | MITRE accuracy gate | Decision audit | Self-host | Reproducible harness |
|---|---|---|---|---|---|
| AiSOC (open) | 75.3% (measured on fixed noisy stream) | 97% (substrate regression gate) | Per-step ledger | Yes (MIT) | Yes — every PR to main / develop |
| Closed-source AI SOC (closed) | Vendor claim, no harness | Not published | Vendor portal | No (cloud only) | No |
| Closed-source SOAR (closed) | N/A (SOAR) | Not applicable | Run history | On-prem option | No published harness |
A self-hostable, MIT-licensed agent with a published regression harness can be reviewed directly by an auditor. Vendor cloud agents typically cannot be reviewed at the same level.
What this is not
- No LLM agent runs here. The harness exercises deterministic substrate code: extractors, fusion, templates, and keyword judges. The live LangGraph orchestrator (services/agents/app/investigator/) is not invoked. An online eval that drives it nightly is on the roadmap.
- The dataset is synthetic. 200 incidents are enough to flag substrate regressions but not enough to claim production parity. Federated, opt-in real-customer evaluation is on the roadmap.
- The judges are keyword-based. They can be gamed by template-stuffing. In three of the four suites the templates already include the keywords the judge looks for, which is why those suites are labelled substrate self-consistency rather than agent quality. The LLM-as-judge variant is the follow-up.
- “Public eval harness” means this harness, not a third-party leaderboard. No outside body grades AiSOC. The dataset, the code, and the gates are open and CI-enforced, and anyone can run, audit, or extend the harness.
Community submissions
The dataset and the harness are MIT and reproducible. Any third party — another open-source project, a vendor, or an internal team — can run the same suite against the same 200 incidents and submit a result. Accepted entries appear here.
Submission rules
- Same fixed dataset: run against the deterministic 200-incident dataset at services/agents/tests/eval_data/synthetic_incidents.json on the commit you submit. No private fixtures.
- Same harness: run scripts/run_evals.py --json --out report.json with no flags that disable gates. Attach the full report.json so per-template macros are auditable.
- Open agent, or label as closed: if your agent code is open, link it. If it is closed, the entry is still accepted but is labeled "closed-source" so reviewers know they cannot reproduce internals.
- No template-stuffing: the three substrate self-consistency suites (MITRE, completeness, response) are gameable by stuffing keywords into reports. Submissions caught doing this are rejected; the alert-reduction measurement is not gameable in the same way.
No accepted community submissions yet.
The leaderboard fills up as runs are merged. AiSOC's own numbers appear in the cards above.
Contributing to the harness
New fixtures for missed tactics or fusion edge cases, replacements for tautological judges, and the online LLM-as-judge variant are all in scope for contributions.