Research Knowledge Gaps Mitigation (RKGM)

Automatically identifies what you need to know before reading a research paper, then generates a personalised, chronologically ordered learning document — sourced from the paper's own reference hierarchy.

Based on: Rahman et al. (2022), An Article Recommendation Technique from a Multi-Layer Reference Article Graph for Facilitating Chronological Learning, IEEE ICCIT. DOI: 10.1109/ICCIT57492.2022.10103286

What It Does

Phase A — Give it any paper (arXiv ID, DOI, Semantic Scholar ID or PDF). It fetches the paper's 3-level reference graph, detects knowledge gaps using structured LLM prompting with self-consistency, and returns a ranked list of candidate papers per gap with rationale.
Human checkpoint — Review the gap list, toggle off concepts you already know, add your own. The system auto-fetches open-access PDFs via Unpaywall and arXiv.
Phase B — Ingests the PDFs, retrieves the most relevant passages per gap (hybrid dense + BM25 + cross-encoder reranking), generates a grounded explanation for each gap with inline citations, orders them chronologically, and assembles a complete learning document.

Architecture

LLM Provider Support

| Provider | Models | Notes |
|---|---|---|
| Groq (default) | Llama 3.3 70B (heavy) · Llama 3.1 8B (light) | Free tier, fast; recommended |
| OpenAI | GPT-4o (heavy) · GPT-4o-mini (light) | Higher quality, pay-per-use |
| Auto | Groq if `GROQ_API_KEY` is set, else OpenAI | Automatic fallback |
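
The Auto row amounts to a small selection rule. The sketch below is illustrative only: `resolve_provider` is a hypothetical helper name, not a function confirmed to exist in this repository, and the project's real logic may differ.

```python
import os


def resolve_provider(requested: str = "auto", env=None) -> str:
    """Hypothetical helper: turn an LLM_PROVIDER-style value into a provider name."""
    env = os.environ if env is None else env
    requested = requested.strip().lower()
    if requested in ("groq", "openai"):
        return requested
    if requested != "auto":
        raise ValueError(f"unknown provider: {requested!r}")
    # Auto: prefer Groq when its key is present, otherwise fall back to OpenAI.
    return "groq" if env.get("GROQ_API_KEY") else "openai"
```

Passing the environment in as a plain mapping keeps the rule easy to test without touching real API keys.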

Retrieval Pipeline

Gap query
  → Dense (ChromaDB cosine, top-20)  ─┐
  → Sparse (BM25, top-20)            ─┤→ RRF merge → Cross-encoder rerank → Top-5
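
The RRF merge step in the diagram can be sketched with standard Reciprocal Rank Fusion. This is a minimal illustration, not the project's code: the function name `rrf_merge`, the passage IDs, and the constant `k=60` (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of passage IDs with Reciprocal Rank Fusion.

    Each passage scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical passage IDs from the dense and sparse retrievers.
dense = ["p3", "p1", "p7"]
sparse = ["p1", "p9", "p3"]
fused = rrf_merge([dense, sparse])
```

Because RRF uses only ranks, the cosine and BM25 scores never need to be put on a common scale before merging; the fused list is then what a cross-encoder would rerank.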

Ordering Strategy

Stage 1 (deterministic): Foundation → Development → Frontier, then by trendscore within layer
Stage 2 (LLM): Refine within-layer order using dependency sentences
Cycle resolution: Mutual dependency → higher trendscore paper goes first
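
Stage 1 and the cycle rule above can be sketched as a plain sort plus a tie-break. The `Paper` class and function names are hypothetical, and sorting by descending trendscore within a layer is an assumption (the text does not state the direction; descending matches the cycle rule).

```python
from dataclasses import dataclass

# Layer precedence as described: Foundation, then Development, then Frontier.
LAYER_ORDER = {"Foundation": 0, "Development": 1, "Frontier": 2}


@dataclass(frozen=True)
class Paper:
    """Hypothetical record; field names are illustrative."""
    paper_id: str
    layer: str
    trendscore: float


def stage1_order(papers):
    """Deterministic pass: earlier layer first, then higher trendscore (assumed)."""
    return sorted(
        papers,
        key=lambda p: (LAYER_ORDER[p.layer], -p.trendscore, p.paper_id),
    )


def resolve_cycle(a, b):
    """For a mutual dependency, place the higher-trendscore paper first."""
    return (a, b) if a.trendscore >= b.trendscore else (b, a)


papers = [
    Paper("c", "Frontier", 0.9),
    Paper("a", "Foundation", 0.2),
    Paper("b", "Foundation", 0.8),
    Paper("d", "Development", 0.5),
]
ordered_ids = [p.paper_id for p in stage1_order(papers)]
```

The Stage 2 LLM refinement is not shown; it would only reorder papers inside a layer.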

Gap Detection

Runs 3× at temperatures [0.1, 0.35, 0.6]
Keeps concepts appearing in ≥ 2/3 runs (self-consistency)
Validates each gap traces back to a specific passage in the paper
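
The ≥ 2/3 voting rule can be illustrated as follows. This is a sketch under simplifying assumptions: concepts are matched by lower-cased exact string, whereas a real pipeline would likely need fuzzier matching, and `self_consistent_gaps` is a hypothetical name.

```python
from collections import Counter


def self_consistent_gaps(runs, min_votes=2):
    """Keep concepts that were named in at least `min_votes` separate runs."""
    votes = Counter()
    for run in runs:
        # Normalise and de-duplicate inside a run so each run votes once per concept.
        votes.update({concept.strip().lower() for concept in run})
    return sorted(c for c, n in votes.items() if n >= min_votes)


# Hypothetical outputs of three runs at different temperatures.
runs = [
    ["Knowledge graph", "SPARQL", "attention"],
    ["knowledge graph", "graph neural network"],
    ["SPARQL", "knowledge graph ", "attention", "attention"],
]
kept = self_consistent_gaps(runs)
```

Here "graph neural network" is dropped because only one run proposed it.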

Quick Start

1. Clone & install

git clone https://github.com/sharukhn32/Research-Knowledge-Gaps.git
cd Research-Knowledge-Gaps

# Using uv (recommended)
uv venv && source .venv/bin/activate && uv pip install -r requirements.txt

# Or plain pip
pip install -r requirements.txt

2. Configure API keys

Copy .env.example to .env and fill in your key(s):

cp .env.example .env

# Use Groq (free) — recommended
GROQ_API_KEY="gsk_..."

# Or OpenAI
OPENAI_API_KEY="sk-..."

# Provider: "groq" | "openai" | "auto" (default: auto)
LLM_PROVIDER="auto"

Get a free Groq key at console.groq.com.
Get an OpenAI key at platform.openai.com/api-keys.

3. Run

Web UI (recommended):

python app.py

Opens the Streamlit interface automatically in your browser. The provider and API keys can also be set interactively from the sidebar — no .env required.

CLI:

# Minimal — paper ID only, abstract-level gap detection
python run.py --paper 2405.20139 --depth 0 --out ./output/

# With PDF for richer gap detection
python run.py --paper 2405.20139 --pdf /path/to/paper.pdf

# Control depth and add custom gaps
python run.py --paper 2405.20139 --pdf paper.pdf --depth 2 \
              --gaps "knowledge graph" "SPARQL"

# Phase A only (gaps + candidate list, skip generation)
python run.py --paper 2405.20139 --pdf paper.pdf --phase-a-only

# Custom output directory
python run.py --paper 2405.20139 --pdf paper.pdf --out ./results/

# Verbose logging
python run.py --paper 2405.20139 --pdf paper.pdf --verbose

# Clear API cache (force fresh Semantic Scholar calls)
python run.py --paper 2405.20139 --clear-cache

The CLI writes its outputs to the --out directory (default: current directory):

| File | Contents |
|---|---|
| `learning_roadmap_<id>.md` | The generated learning document |
| `phase_a_state_<id>.json` | Saved Phase A state (re-usable) |
| `candidates_<id>.bib` | BibTeX for all candidate papers |
| `candidates_<id>.csv` | PDF availability status |

Optional: Better PDF Parsing

For academic two-column layouts, install marker-pdf:

pip install marker-pdf
# Downloads ~1.5 GB of models on first use

Falls back to PyMuPDF automatically if marker-pdf is unavailable.

Configuration

All tunable parameters live in core/config.py. The most useful ones:

# ── Reference graph ────────────────────────────────────────────────────────
FOUNDATION_TOP_K     = 10     # how many papers qualify as Foundation layer
FRONTIER_YEAR_CUTOFF = 2022   # papers from this year onwards → Frontier

# ── Gap detection ──────────────────────────────────────────────────────────
SELF_CONSISTENCY_RUNS = 3     # number of LLM runs per gap detection pass
SELF_CONSISTENCY_MIN  = 2     # gap must appear in ≥ N runs to be kept
CONFIDENCE_THRESHOLD  = 0.5   # discard ungrounded gaps below this

# ── Candidate scoring ──────────────────────────────────────────────────────
ALPHA = 0.5    # semantic similarity weight
BETA  = 0.3    # trendscore weight
GAMMA = 0.2    # layer-match weight

# ── Generation ────────────────────────────────────────────────────────────
MAX_MULTIHOP_DEPTH = 2        # max recursion depth for sub-gap detection
MAX_EVAL_LOOPS     = 2        # Writing ↔ Eval agent iterations per gap
FAITHFULNESS_GATE  = 0.70     # RAGAS faithfulness minimum before retry
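
The three candidate-scoring weights sum to 1.0, which suggests a weighted sum of the three signals. The sketch below assumes exactly that, with each input scaled to [0, 1] and layer match treated as a boolean; the actual formula in the code may differ, and `candidate_score` is a hypothetical name.

```python
# Weights copied from the configuration excerpt above.
ALPHA = 0.5  # semantic similarity weight
BETA = 0.3   # trendscore weight
GAMMA = 0.2  # layer-match weight


def candidate_score(similarity, trendscore, layer_match):
    """Assumed scoring rule: a weighted sum of the three signals.

    `similarity` and `trendscore` are assumed to lie in [0, 1];
    `layer_match` is assumed to be a boolean.
    """
    return ALPHA * similarity + BETA * trendscore + GAMMA * (1.0 if layer_match else 0.0)


# 0.5 * 0.8 + 0.3 * 0.5 + 0.2 * 1.0
score = candidate_score(0.8, 0.5, True)
```

Under this reading, raising BETA relative to ALPHA would favour influential papers over closely matching ones.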

Supported Paper Input Formats

| Format | Example |
|---|---|
| arXiv ID | `2405.20139` or `arXiv:2405.20139` |
| DOI | `10.1109/ICCIT57492.2022.10103286` |
| Semantic Scholar ID | `649def34f8be52c8b66281af98ae884c09aef38b` |
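
The three identifier shapes above can be told apart with simple patterns. This is an illustrative sketch, not the project's parser: `classify_paper_id` and the regular expressions are assumptions, and old-style arXiv IDs (for example those with a subject prefix) are not handled.

```python
import re

# Assumed patterns for the three formats listed above.
ARXIV_RE = re.compile(r"^(?:arxiv:)?(\d{4}\.\d{4,5})(?:v\d+)?$", re.IGNORECASE)
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")
S2_RE = re.compile(r"^[0-9a-f]{40}$", re.IGNORECASE)


def classify_paper_id(raw):
    """Return (kind, normalised_id) for an arXiv ID, DOI, or Semantic Scholar ID."""
    text = raw.strip()
    match = ARXIV_RE.match(text)
    if match:
        # Drop the optional "arXiv:" prefix and any version suffix.
        return ("arxiv", match.group(1))
    if DOI_RE.match(text):
        return ("doi", text)
    if S2_RE.match(text):
        return ("s2", text.lower())
    raise ValueError(f"unrecognised paper identifier: {raw!r}")
```

For example, `classify_paper_id("arXiv:2405.20139")` yields the bare ID `2405.20139` tagged as `arxiv`.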

Citation

If you use this project, please cite the original paper it extends:

@inproceedings{rahman2022article,
  title     = {An Article Recommendation Technique from a Multi-Layer Reference
               Article Graph for Facilitating Chronological Learning},
  author    = {Rahman, Sharukh and Emad, Kazi Hasnayeen and Azad, Saiful
               and Mahmud, Mufti and Kaiser, M. Shamim},
  booktitle = {2022 25th International Conference on Computer and Information
               Technology (ICCIT)},
  year      = {2022},
  publisher = {IEEE},
  doi       = {10.1109/ICCIT57492.2022.10103286}
}
