Qgentic-AI is a research & development automation stack aimed at iterating on Kaggle competitions with minimal human intervention. Two collaborating LLM-driven agents, a Researcher and a Developer, take a competition bundle, explore the data, produce a technical plan, generate code, run it locally, analyse the results, and keep refining the solution. Guardrails and supporting tools keep the loop grounded, reproducible, and safe.
[2025/11/23] 🎉 Qgentic-AI achieves 0.68 Public LB for the CSIRO-Biomass competition, placing 20th of 1400 teams (top 1.5% as of today)!
[2025/11/13] 🎉 Qgentic-AI achieves a 94.82% weighted mean percentile across 7 MLE-Bench competitions, outperforming FM-Agent (93.91%), InternAgent (76.25%), MLE-STAR-PRO (72.74%), Operand (69.81%), and R&D-Agent (62.46%). Results are reported as percentiles with 74:1 weighting to match MLE-Bench's 75-competition benchmark.
| Kaggle Competition | LB score | Ranking | Notebook |
|---|---|---|---|
| csiro-biomass | 0.68 | Top 1.5% (20/1400) - Nov 23, 2025 | Released after deadline |
| playground-series-s5e11 | 0.92684 | Top 6% (40/700) - Best Single Model Public Notebook 🏆 - Nov 4, 2025 | Notebook 1 |
| playground-series-s5e10 | 0.05576 | Top 15% (606/4082) - Oct 31, 2025 | Notebook 1, Notebook 2 |
The competitions were evaluated on 4x A100 80GB GPUs (each split into two 40GB MIG instances, giving 8 instances in total). The average runtime is 8-9 hours per competition (1 hour researcher + 3 hours baseline + 4 hours ensemble + buffers).
Results are reported as raw score (percentile%) where percentile = % of leaderboard submissions beaten. The weighted mean uses a 74:1 ratio (weight 12.33 for each of 6 competitions, weight 1.0 for Statoil) to match MLE-Bench's 75-competition benchmark and prevent the Statoil outlier from dominating the overall metric.
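As a sanity check, the weighted mean for the Qgentic-AI column can be recomputed from the per-competition percentiles in the table below; this is a minimal sketch of the arithmetic, not the benchmark's grading code.

```python
# Recompute the 74:1 weighted mean for the Qgentic-AI column:
# weight 12.33 for each non-Statoil competition, 1.0 for Statoil.
percentiles = {
    "us-patent-phrase-to-phrase-matching": 74.01,
    "learning-agency-lab-automated-essay-scoring-2": 100.00,
    "tabular-playground-series-dec-2021": 100.00,
    "statoil-iceberg-classifier-challenge": 83.18,
    "denoising-dirty-documents": 98.15,
    "whale-categorization-playground": 97.73,
    "google-quest-challenge": 100.00,
}
weights = {slug: 1.0 if "statoil" in slug else 12.33 for slug in percentiles}

mean = sum(percentiles[s] * weights[s] for s in percentiles) / sum(weights.values())
print(f"{mean:.2f}%")  # -> 94.82%
```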
Note: All results are based on commit 5b81ee0.
| Kaggle Competition | Difficulty | Type | Metric | Qgentic-AI | FM Agent | InternAgent | Operand | R&D-Agent | MLE-STAR-PRO |
|---|---|---|---|---|---|---|---|---|---|
| us-patent-phrase-to-phrase-matching | Medium | Information Retrieval | PCC (higher) | 0.858 (74.01%) | 0.862 (90.26%) 🏅 | 0.868 (98.99%) 🏅 | 0.692 (15.19%) | 0.801 (21.81%) | 0.753 (17.31%) |
| learning-agency-lab-automated-essay-scoring-2 | Medium | Text | QWK (higher) | 0.847 (100.00%) 🏅 | 0.845 (100.00%) 🏅 | 0.830 (63.18%) | 0.830 (64.40%) | 0.825 (46.90%) | 0.832 (68.91%) |
| tabular-playground-series-dec-2021 | Easy | Tabular | Accuracy % (higher) | 0.963 (100.00%) 🏅 | 0.960 (100.00%) 🏅 | 0.963 (100.00%) 🏅 | 0.963 (100.00%) 🏅 | 0.963 (100.00%) 🏅 | 0.963 (100.00%) 🏅 |
| statoil-iceberg-classifier-challenge | Medium | Image Classification | Logloss (lower) | 0.160 (83.18%) | 1.258 (3.54%) | 0.203 (50.30%) | Failed (0.00%) | Failed (0.00%) | 0.246 (33.69%) |
| denoising-dirty-documents | Medium | Computer Vision | RMSE (lower) | 0.012 (98.15%) 🏅 | 0.020 (93.21%) 🏅 | 0.023 (89.51%) 🏅 | 0.023 (89.51%) 🏅 | 0.011 (98.15%) 🏅 | 0.011 (98.15%) 🏅 |
| whale-categorization-playground | Medium | Computer Vision | MAP@5 (higher) | 0.557 (97.73%) 🏅 | 0.466 (93.01%) 🏅 | 0.183 (10.40%) | 0.361 (60.11%) | 0.262 (14.37%) | 0.360 (60.11%) |
| google-quest-challenge | Medium | Text | Spearman Correlation (higher) | 0.445 (100.00%) 🏅 | 0.394 (94.34%) 🏅 | 0.409 (97.52%) 🏅 | 0.398 (95.29%) 🏅 | 0.415 (98.60%) 🏅 | 0.396 (95.10%) 🏅 |
| Weighted Mean Percentile | | | | 94.82% (4🥇 1🥈 0🥉) | 93.91% (2🥇 2🥈 2🥉) | 76.25% (1🥇 3🥈 0🥉) | 69.81% (1🥇 2🥈 0🥉) | 62.46% (2🥇 1🥈 0🥉) | 72.74% (2🥇 1🥈 0🥉) |

🏅 marks a medal-winning result; per-agent gold/silver/bronze counts appear in the final row.
- Python 3.12.
- CUDA-enabled GPU.
```bash
conda create --name qgentic-ai python=3.12 -y
conda activate qgentic-ai
git clone https://github.com/bogoconic1/Qgentic-AI.git
cd Qgentic-AI
pip install uv
bash install.sh
```
Add your `kaggle.json` file to the Qgentic-AI directory. If you want to download MLE-Bench data for another competition, modify `TASK_NAME` in `install.sh` and execute only the `prepare_data` and `copy_task_data` steps.
Create a .env file in the project root (or export directly):
```bash
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
ANTHROPIC_API_KEY=...
FIRECRAWL_API_KEY=...
HF_TOKEN=...
GOOGLE_CLOUD_PROJECT=...
GOOGLE_CLOUD_LOCATION=global
GOOGLE_GENAI_USE_VERTEXAI=True
KAGGLE_USERNAME=
KAGGLE_KEY=
```
These keys are loaded via python-dotenv. Adjust the environment variables listed in
config.yaml if you need custom names or endpoints.
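For reference, the loading mechanism is the standard python-dotenv pattern; this snippet is illustrative, not code from the repository.

```python
# Standard python-dotenv usage: pull the variables from .env into
# os.environ before any client libraries read their keys.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```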
To build competition metadata, download and extract the Meta Kaggle dataset:

```bash
sudo apt-get install unzip
curl -L -o /workspace/meta-kaggle.zip https://www.kaggle.com/api/v1/datasets/download/kaggle/meta-kaggle
unzip /workspace/meta-kaggle.zip -d /workspace/meta-kaggle
```
Then run:

```bash
python create_metadata.py --competition-slug "enter slug"
```
You will see something like this:

```
task/
└── "enter slug"/
    ├── description.md
    ├── public_insights.md
    ├── sample_submission.csv
    ├── comp_metadata.yaml
    └── train files/test files
```
Before running the agent, create these files in `task/<slug>/` (illustrative sketches of both follow below):
- `cv_splits.json`: Cross-validation fold indices for reproducible train/val splits
- `metric.py`: Competition-specific evaluation metric implementation
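The exact schemas are defined by the repository, so the sketches below are only plausible shapes: the JSON field names (`folds`, `train_idx`, `val_idx`), the train file name, and the `score` function signature are all assumptions to adapt.

```python
# Hypothetical sketch: generate task/<slug>/cv_splits.json with scikit-learn.
# The field names ("folds", "train_idx", "val_idx") are assumptions; match
# whatever schema the agent actually reads.
import json
import pandas as pd
from sklearn.model_selection import KFold

train = pd.read_csv("task/<slug>/train.csv")  # assumed train file name

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = [
    {"fold": i, "train_idx": tr.tolist(), "val_idx": va.tolist()}
    for i, (tr, va) in enumerate(kf.split(train))
]

with open("task/<slug>/cv_splits.json", "w") as f:
    json.dump({"folds": folds}, f)
```

And a minimal `metric.py` for an RMSE-scored competition (function name and signature assumed):

```python
# Hypothetical metric.py sketch for an RMSE competition; lower is better.
import numpy as np

def score(y_true, y_pred) -> float:
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```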
Optionally, configure Human-In-The-Loop (HITL) instructions in `config.yaml` (see the sketch after this list):
- `researcher.hitl_instructions`: Guide research direction
- `model_recommender.hitl_models`: Override model selection with specific models
- `developer.hitl_instructions`: Guide code implementation
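As a concrete example, the HITL fields can be filled in by hand-editing `config.yaml`; the programmatic equivalent (assuming PyYAML, with the key nesting documented in the Configuration section below) is a minimal sketch like this:

```python
# Sketch: set the HITL fields in config.yaml programmatically (assumes
# PyYAML). Example values are taken from the configuration docs below.
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

config.setdefault("researcher", {})["hitl_instructions"] = [
    "Focus on time-series cross-validation",
]
config.setdefault("model_recommender", {})["hitl_models"] = ["xgboost", "lightgbm"]
config.setdefault("developer", {})["hitl_instructions"] = [
    "Use gradient clipping to prevent exploding gradients",
]

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f)
```

Note that a programmatic rewrite drops any comments in `config.yaml`, so hand-editing is usually the safer route.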
Launch the agent with:

```bash
python launch_agent.py --slug "enter slug" --iteration 1
```

Time limits are controlled via `config.yaml` (`runtime.baseline_time_limit`).
- `researcher.txt`/`developer.txt` capture detailed logs for each iteration.
- `code_{iteration}_v{version}.py` are the generated scripts; corresponding logs sit under `code_{iteration}_v{version}.txt`.
- Weights & Biases and Weave projects are initialised in `launch_agent.py`; supply `--wandb-entity`/`--wandb-project`, export `WANDB_ENTITY`/`WANDB_PROJECT`, or define them in `config.yaml` under `tracking.wandb`.
Key settings live in `config.yaml` (merged with `project_config.py` defaults):
- `llm`: Model configurations for different components:
  - `developer_model`: Main Developer agent code generation (gpt-5)
  - `developer_tool_model`: Developer tools - red flags, SOTA suggestions, debug (gemini-3-pro-preview)
  - `researcher_model`: Main Researcher agent planning (claude-sonnet-4-5)
  - `researcher_tool_offline_model`: EDA tool execution (gemini-3-pro-preview)
  - `researcher_tool_online_model`: External dataset search (gemini-3-pro-preview)
  - `starter_model`: Starter agent for initial exploration (gemini-3-pro-preview)
  - `model_recommender_model`: Model recommendation agent (gemini-3-pro-preview)
  - `model_selector_model`: Model selection (gemini-3-pro-preview)
  - `paper_summary_model`: Research paper summarization (gemini-3-pro-preview)
  - `ensembler_model`: Ensemble agent (gemini-3-pro-preview)
  - `leakage_review_model`: Data leakage guardrail (gemini-3-pro-preview)
- `runtime`: Execution parameters (a time-budget sketch follows this configuration list):
  - `baseline_time_limit`: Total time budget for the baseline iteration in seconds (default: 432000 = 5 days)
  - `ensemble_time_limit`: Total time budget for the ensemble iteration in seconds (default: 432000 = 5 days)
  - `baseline_code_timeout`: Timeout for baseline code execution in seconds (default: 43200 = 12 hours)
  - `ensemble_code_timeout`: Timeout for ensemble code execution in seconds (default: 43200 = 12 hours)
  - `ask_eda_max_attempts`: Max retry attempts for EDA/A/B test code generation (default: 5)
  - `download_datasets_max_attempts`: Max retry attempts per query phrasing for external dataset discovery (default: 1)
  - `researcher_max_steps`: Max steps for researcher exploration (default: 512)
  - `llm_max_retries`: Max retries for LLM calls (default: 3)
  - `directory_listing_max_files`: Max files shown in directory listings (default: 10)
  - `researcher_parallel_runs`: Number of parallel researcher runs (default: 1)
  - `baseline_max_parallel_workers`: Max parallel baseline workers when MIG is disabled (default: 3)
  - `enable_mig`: Enable NVIDIA MIG GPU isolation (auto-detects worker count from available MIG instances)
  - `enable_multi_gpu`: Enable multi-GPU parallelism across physical GPUs (default: true)
  - `allowed_gpu_ids`: Restrict to specific GPU IDs; empty uses all detected GPUs (default: [])
  - `enable_cpu_affinity`: Enable CPU core pinning for parallel processes (default: true)
  - `reset_conda_envs_per_run`: Reset conda environments before each run (default: true)
  - `use_validation_score`: Use the validation score from logs instead of MLE-bench grading (default: true)
  - `patch_mode_enabled`: Experimental diff-based workflow (default: false)
- `paths`: Root directories and naming templates for generated artifacts.
- `guardrails`: Security and quality checks before code execution (an illustrative logging-order check follows this configuration list):
  - `enable_code_safety`: LLM-based critical security check using Gemini 2.5 Flash (default: true)
  - `leakage_review`: LLM-based data leakage detection (default: true)
  - `logging_basicconfig_order`: AST-based logging order check (default: true)
- `researcher`: Researcher agent settings:
  - `hitl_instructions`: Human-In-The-Loop instructions list. If non-empty, these instructions are added to the researcher's system prompt to guide research direction (default: `[]`). Example: `["Focus on time-series cross-validation", "Analyze seasonality patterns", "Consider external weather data"]`
- `model_recommender`: Model recommendation settings:
  - `hitl_models`: Human-In-The-Loop model list. If non-empty, these models are used instead of LLM-based dynamic selection (default: `[]`). Example: `["xgboost", "lightgbm", "deberta-v3-large"]`
  - `enable_web_search`: Enable web search for SOTA strategies (default: true)
- `developer`: Developer agent settings:
  - `hitl_instructions`: Human-In-The-Loop instructions list. If non-empty, these instructions are added to the developer's system prompt to guide code implementation (default: `[]`). Example: `["Use gradient clipping to prevent exploding gradients", "Implement mixed precision training", "Focus on domain-specific data augmentation"]`
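To connect the runtime budgets to the evaluation setup quoted in the results section (1 hour researcher + 3 hours baseline + 4 hours ensemble), the conversion into the seconds the `runtime.*` keys expect looks like the sketch below; whether the published runs used exactly these overrides, rather than the 5-day defaults, is an assumption.

```python
# Illustrative conversion of hour-level budgets into the seconds expected by
# the runtime.* keys. Merge the resulting values under `runtime:` in
# config.yaml. These specific overrides are an assumption, not the defaults.
HOUR = 3600
runtime_overrides = {
    "baseline_time_limit": 3 * HOUR,     # 10800 s (default: 432000 = 5 days)
    "ensemble_time_limit": 4 * HOUR,     # 14400 s (default: 432000 = 5 days)
    "baseline_code_timeout": 12 * HOUR,  # 43200 s (this one is the default)
}
print(runtime_overrides)
```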
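As an illustration of what the `logging_basicconfig_order` guardrail checks, a minimal AST-based version could look like the following; this is a sketch of the idea, not the repository's implementation.

```python
# Illustrative AST check: logging.basicConfig must be called before any
# other logging.* call in the generated script. Not the repo's actual code.
import ast

def basicconfig_first(source: str) -> bool:
    basic, others = [], []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id == "logging"):
            (basic if node.func.attr == "basicConfig" else others).append(node.lineno)
    if not others:  # no other logging calls, so nothing to violate
        return True
    return bool(basic) and min(basic) < min(others)

print(basicconfig_first("import logging\nlogging.basicConfig()\nlogging.info('ok')"))  # True
print(basicconfig_first("import logging\nlogging.info('ok')"))                         # False
```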
Patch Mode: The developer supports a token-efficient diff workflow. Toggle `runtime.patch_mode_enabled: true` to request unified diffs (with line numbers). Note that this feature is unstable and can break!
Parallel Execution: Supports flexible GPU configurations (Multi-GPU, Single-GPU, and MIG); a sketch of the MIG isolation pattern follows this list:
- MIG Mode (`enable_mig: true`): Auto-detects available MIG instances and isolates each baseline on a dedicated GPU partition. An H100 80GB supports up to 7 MIG instances (1g.10gb profile).
- Multi-GPU Mode (`enable_multi_gpu: true`): Distributes baselines across multiple physical GPUs.
- Single-GPU Mode: Sequential execution, with `baseline_max_parallel_workers` controlling concurrency.
- Each baseline runs in an isolated conda environment (`qgentic-model-{idx}`) with dedicated CPU cores when `enable_cpu_affinity: true`.
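For context, per-worker MIG isolation typically works by listing the MIG device UUIDs and exposing exactly one slice to each process via `CUDA_VISIBLE_DEVICES`; the sketch below shows the pattern and is not the repository's actual code (`run_baseline.py` is a hypothetical entry point).

```python
# Illustrative MIG isolation: enumerate MIG UUIDs via `nvidia-smi -L`, then
# give each worker exactly one slice through CUDA_VISIBLE_DEVICES.
import os
import re
import subprocess

listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
mig_uuids = re.findall(r"UUID: (MIG-[^)]+)", listing)

for idx, uuid in enumerate(mig_uuids):
    worker_env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
    # e.g. subprocess.Popen(["python", "run_baseline.py"], env=worker_env)
    print(f"worker {idx} -> {uuid}")
```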
MIT