A structured library of 121,245 machine-learning paper analyses covering arXiv publications from 2023–2025. Each paper is distilled into a standardised markdown report so you can survey findings, compare mechanisms, and spot trends without reading thousands of PDFs.
| Year | Analyses | Unique papers |
|---|---|---|
| 2023 | 29,961 | 29,961 |
| 2024 | 39,185 | 38,027 |
| 2025 | 52,099 | 51,517 |
Every file follows the same template. Here's a trimmed example (`2511.21730_a-benchmark-for-procedural-memory-retrieval_…md`):
```markdown
# frontmatter
arxiv_id: '2511.21730'
core_contribution: >
  Introduces the first benchmark for evaluating procedural
  memory retrieval in language agents, isolating retrieval
  from execution …
tags: [procedural, retrieval, memory, …]  # ⚠ see caveat below

## Quick Facts — arXiv link, authors, headline numbers
## Executive Summary — what the paper does in one paragraph
## Method Summary — experimental setup, models, data
## Key Results — quantitative findings
## Mechanism Analysis — *why* the approach works (multiple sub-sections)
## Reproduction Notes — hyperparameters, compute, data details
## Limitations & Confidence
```

Tag caveat: auto-generated `tags` are noisy (every record shares a long generic tail). Prefer searching `core_contribution`, titles, and body text. Tag regeneration is planned.
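As a minimal sketch of that advice (the function name and regex are my own, not part of `scripts/`), the following scans a year folder and matches a keyword against each file's `core_contribution` front-matter field rather than its tags:

```python
import re
from pathlib import Path

# Matches the core_contribution value in the YAML front matter: the text after
# "core_contribution:" (optionally a ">" folded-scalar marker) up to the next
# top-level "key:" line or end of file.
CORE_RE = re.compile(r"core_contribution:\s*>?\s*(.*?)(?=\n\w+:|\Z)", re.S)

def files_mentioning(root: str, keyword: str):
    """Yield markdown files under `root` whose core_contribution mentions `keyword`."""
    needle = keyword.lower()
    for path in Path(root).rglob("*.md"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        match = CORE_RE.search(text)
        if match and needle in match.group(1).lower():
            yield path
```

This is a heuristic string scan, not a YAML parser; for anything beyond quick filtering, `scripts/search_topic.py` (described below) is the supported route.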
Pick a year folder and search with ripgrep:
```shell
# Find all papers mentioning mixture-of-experts
rg -l "mixture of experts|MoE" ml_research_analysis_2025/

# Full-text search across every year
rg -n "speculative decoding" ml_research_analysis_202*/
```

The search script works around noisy tags by matching across title, `core_contribution`, and filename:
```shell
python scripts/search_topic.py --topic "mixture of experts" --alias moe
python scripts/search_topic.py --topic "reinforcement learning" --alias rl --limit 25 --json
```

`analysis_outputs/research_index.sqlite` indexes the 2025 bucket (52,099 rows) with columns: title, arxiv_id, core_contribution, tags, filename, file_size.
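Since the index is plain SQLite, it can also be queried from Python's standard-library `sqlite3` module. This helper is a sketch (the function name is mine; the `papers` table and its columns are those listed above):

```python
import sqlite3

def papers_mentioning(db_path: str, keyword: str, limit: int = 10):
    """Return (arxiv_id, title) rows whose core_contribution mentions `keyword`.

    A parameterised query avoids quoting/injection issues; note that any
    "%" or "_" inside `keyword` still acts as a LIKE wildcard.
    """
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT arxiv_id, title FROM papers "
            "WHERE core_contribution LIKE ? LIMIT ?",
            (f"%{keyword}%", limit),
        )
        return cur.fetchall()
    finally:
        conn.close()

# e.g. papers_mentioning("analysis_outputs/research_index.sqlite", "distillation")
```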
```shell
# papers whose core contribution mentions "distillation"
sqlite3 analysis_outputs/research_index.sqlite \
  "SELECT arxiv_id, title FROM papers WHERE core_contribution LIKE '%distillation%' LIMIT 10"

# look up a specific paper
sqlite3 analysis_outputs/research_index.sqlite \
  "SELECT filename FROM papers WHERE arxiv_id = '1706.03762'"
```

The `spot_analyses/` directory and the `spot_analysis_paper_groups` table contain deep-dive clusters across eight research themes:
| Group | Theme |
|---|---|
| `test_time_compute_scaling` | Scaling compute at inference |
| `reasoning_distillation` | Distilling reasoning capabilities |
| `multi_agent_debate` | Multi-agent argumentation |
| `process_reward_models` | Step-level reward modelling |
| `agentic_workflow_pipeline_design` | LLM agent architectures |
| `adaptive_compute_allocation` | Dynamic compute budgets |
| `test_time_adaptation` | Adapting models at test time |
| `continual_online_tta` | Continual / online TTA |
The website/ directory contains a static site with full-text search. See website/README.md for build and deploy instructions.
```
ml_research_analysis_2023/   Per-paper markdown analyses
ml_research_analysis_2024/
ml_research_analysis_2025/
analysis_outputs/            SQLite index, digests, assessment outputs
scripts/                     index_frontmatter.py, search_topic.py
spot_analyses/               Curated topic deep-dives (8 groups, 1,824 papers)
website/                     Static browse/search UI
docs/                        Internal reference documents
archive/                     Superseded v1 analyses
```
A three-phase FlatAgents pipeline produces each report:
- Prep — download arXiv PDF, extract text, match against ML terminology corpus
- Expensive — parallel LLM calls for mechanism analysis, reproduction notes, and open questions
- Wrap — limitations/confidence, tagging, report assembly, quality judge + auto-repair
The 2025 batch used GLM-5 (pony-alpha) for the expensive phase; 2023–2024 used Trinity Large throughout. Pipeline code, configs, and execution databases live in the pipeline repo — this repository is output only.
- ~190 permanent failures across all years: PDF 404s (~106), context overflow >256k (~60), provider errors (~9), PDF parse errors (~15). No pending retries.
- Tags are unreliable: the tail of every tag list contains generic terms. Use `core_contribution` and full-text search instead.
- Duplicate filenames exist where papers were rerun (1,158 in 2024, 582 in 2025). The SQLite index and filenames are deduplicated by `(arxiv_id, timestamp)`.
After adding or removing analysis files, rebuild the SQLite index:
```shell
python scripts/index_frontmatter.py ml_research_analysis_2025
python scripts/index_frontmatter.py ml_research_analysis_2025 --prune  # also remove deleted files
```