Public version: 79,505 audited articles across 3 periods
Source-audit subset: 51,589 audited articles across 2 post-appointment periods
This repository contains an open research project about IMI's public reasoning for excluding and later re-including Ukrinform in the White List.
The project now keeps both research layers visible:
- `docs/` and `data/explorer_data.json`: main public comparison across three periods:
  - P0: before Matsuka, when Ukrinform was still in the White List
  - P1: Matsuka period, when Ukrinform was excluded
  - P2: later period before re-inclusion
- `data/corpus_fast.csv` and `dashboard/`: two-period audited source-analysis subset:
  - P1: Matsuka period
  - P2: later period before re-inclusion
Method corrections made on 2026-04-30 are documented in CORRECTIONS.md.
Results without ATO:

| Period | Audited | Parket | Balance |
|---|---|---|---|
| P0: 2023-05-01 -> 2023-10-31 | 18,369 | 5.80% | 6.60% |
| P1: 2023-11-09 -> 2024-04-25 | 18,375 | 4.97% | 5.77% |
| P2: 2025-07-01 -> 2025-12-15 | 20,855 | 4.14% | 4.87% |
Pairwise parket comparison without ATO:
- P0 vs P1: p = 0.00047, Cohen's h = 0.0365
- P0 vs P2: p = 3.27e-14, Cohen's h = 0.0766
- P1 vs P2: p = 7.08e-05, Cohen's h = 0.0401
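
For readers unfamiliar with Cohen's h, the sketch below recomputes the effect sizes from the rounded parket shares in the without-ATO table. It is an illustration, not the repository's own code: the function names and the `PARKET` dictionary are assumptions made for this example. Because the inputs are rounded to two decimals, the results should land close to the reported values but need not match them exactly.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Parket shares without ATO, copied from the table (rounded to two decimals).
PARKET = {"P0": 0.0580, "P1": 0.0497, "P2": 0.0414}


def pairwise_h(shares: dict[str, float]) -> dict[tuple[str, str], float]:
    """Return Cohen's h for every ordered pair (earlier period, later period)."""
    names = sorted(shares)
    return {
        (a, b): cohens_h(shares[a], shares[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    for (a, b), h in pairwise_h(PARKET).items():
        print(f"{a} vs {b}: h = {h:.4f}")
```

All three values fall well below 0.2, the threshold Cohen conventionally labels a "small" effect, so the differences are statistically detectable at this sample size but small in magnitude.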
Results with ATO (79,505 audited articles in total):

| Period | Audited | Parket | Balance |
|---|---|---|---|
| P0: 2023-05-01 -> 2023-10-31 | 27,916 | 5.87% | 6.72% |
| P1: 2023-11-09 -> 2024-04-25 | 26,342 | 5.12% | 6.01% |
| P2: 2025-07-01 -> 2025-12-15 | 25,247 | 4.67% | 5.57% |
| Scenario | P1 Parket | P2 Parket | P1 Balance | P2 Balance |
|---|---|---|---|---|
| Without ATO | 4.97% | 4.14% | 5.77% | 4.87% |
| With ATO | 5.12% | 4.67% | 6.01% | 5.57% |
The current canonical rebuild fixes three repo-wide problems:
- Official sources in article text are now classified against Ukrainian/Cyrillic entity markers, not transliterated URL markers.
- Official URL classification now uses word boundaries and known prefixes instead of naive substring matching.
- `parket` and `balance` are no longer computed with the same formula.
The shared implementation lives in canonical_metrics.py.
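
The difference between naive substring matching and word-boundary matching on URLs can be shown with a small sketch. This is not the code from `canonical_metrics.py`: the marker `rada`, both function names, and the example URLs are hypothetical and chosen only to show the failure mode. The "known prefixes" part of the fix is not modelled here.

```python
import re
from urllib.parse import urlparse

# Hypothetical marker list, for illustration only.
OFFICIAL_MARKERS = ("rada",)


def is_official_naive(url: str) -> bool:
    """Old-style check: the marker may appear anywhere in the URL."""
    return any(marker in url.lower() for marker in OFFICIAL_MARKERS)


def is_official_bounded(url: str) -> bool:
    """Stricter check: the marker must be a whole word in the hostname."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        re.search(rf"\b{re.escape(marker)}\b", host) is not None
        for marker in OFFICIAL_MARKERS
    )


if __name__ == "__main__":
    for url in ("https://www.rada.gov.ua/news", "https://parada-news.example/story"):
        print(url, is_official_naive(url), is_official_bounded(url))
```

The naive check flags `parada-news.example` because `rada` is a substring of `parada`; the bounded check does not. Note that `\b` also treats a hyphen as a boundary, so a real implementation would likely need extra rules beyond this sketch.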
| File | Scope |
|---|---|
| `docs/index.html` | Main three-period public page |
| `docs/explorer_data.json` | Explorer dataset for 79,505 audited articles |
| `docs/graph_data.json` | Monthly/rubric aggregates for 3 periods |
| `data/corpus_fast.csv` | Canonical two-period audited corpus (51,667 rows, 51,589 audited) |
| `data/statistical_tests_v3.json` | Canonical two-period statistical summary |
| `data/statistical_tests_v3_three_periods.json` | Canonical three-period statistical summary |
| `data/period_zero/p0_audited.json` | Audited period-zero source data |
| `dashboard/index.html` | Two-period dashboard view kept for transparency |

    git clone https://github.com/alexmazuka/ukrinform.git
    cd ukrinform
    python3 scripts/rebuild_public_assets.py

To re-run collectors/parsers instead of only rebuilding published assets:

    python3 scripts/recover_missing.py
    python3 scripts/audit_full_corpus.py
    python3 scripts/reparse_improved.py
    python3 scripts/fix_official_classification.py
    python3 scripts/collect_p0_pre_matsuka.py
    python3 scripts/rebuild_public_assets.py

- `parket`: official URL classification + `source_count <= 1` + `non_official_source_count == 0`
- `balance`: official URL classification + `non_official_source_count == 0`
- `source_count`: extracted cited sources in article text using the improved parser
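
The two metric definitions above can be read as boolean predicates over one article. The sketch below is one such reading, not the code from `canonical_metrics.py`: the `Article` class, the `is_official` field (standing in for the result of the official URL classification), and the function names are assumptions, and `+` is interpreted as logical AND.

```python
from dataclasses import dataclass


@dataclass
class Article:
    is_official: bool               # assumed: outcome of official URL classification
    source_count: int               # cited sources extracted from the article text
    non_official_source_count: int  # cited sources not classified as official


def is_parket(a: Article) -> bool:
    """Official, at most one cited source, and no non-official sources."""
    return a.is_official and a.source_count <= 1 and a.non_official_source_count == 0


def is_balance(a: Article) -> bool:
    """Official and no non-official sources, regardless of source count."""
    return a.is_official and a.non_official_source_count == 0


def share(articles: list[Article], predicate) -> float:
    """Fraction of articles satisfying the predicate (0.0 for an empty list)."""
    if not articles:
        return 0.0
    return sum(1 for a in articles if predicate(a)) / len(articles)


if __name__ == "__main__":
    sample = [
        Article(True, 1, 0),   # parket and balance
        Article(True, 3, 0),   # balance only
        Article(True, 2, 1),   # neither
        Article(False, 0, 0),  # neither
    ]
    print(share(sample, is_parket), share(sample, is_balance))
```

Under this reading every `parket` article also counts toward `balance`, which is consistent with the balance share being at least as large as the parket share in every row of the result tables.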
- `data/corpus_v1_backup.csv`, `data/corpus_v2_parsed.csv`, and older public claims remain in Git history and backup files for auditability.
- `data/corpus_fast.csv` is still a two-period corpus; the public three-period layer is assembled from that corpus plus `data/period_zero/p0_audited.json`.