This folder contains all the tools and resources needed for participants to start hacking Biology.
- **Train set ground truth and test set data:** The data should already be present in your instance at `/home/ec2-user/SageMaker/data/`.
- **Install dependencies:** Create a virtual environment and install all dependencies (via uv):

  ```bash
  bash install.sh
  ```
- **Start the vLLM server (once per team):** Start your own inference endpoint:

  ```bash
  bash start_vllm_docker.sh
  ```

  This starts a local LLM endpoint for you and your team, which you can access at e.g. `localhost:8000` (check the contents of `start_vllm_docker.sh`). A minimal querying sketch follows this list.
- **Optional:** Create a HF token for your account if you plan to use a private model or one with a required user agreement.
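If you want to query the endpoint outside the notebooks, here is a minimal sketch using the OpenAI-compatible API that vLLM exposes. The port is an assumption; check `start_vllm_docker.sh` and `vllm_demo.ipynb` for the exact values:

```python
# Minimal sketch: query the team vLLM endpoint via its OpenAI-compatible API.
# The base_url assumes the default port from start_vllm_docker.sh.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Ask the server which model it is serving rather than hard-coding a name
model_id = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": "Is G3BP1 druggable with monoclonal antibodies? Options: A) No, B) Yes"}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```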
- `vllm_demo.ipynb` - Demo notebook showing how to use the vLLM server
- `smolagents_demo.ipynb` - Demo notebook for using SmolAgents. This might be a good solution to call Python code and use the output to answer Q&A questions (see the sketch after this list).
- `generate_answers_demo.ipynb` - Demo notebook for using the vLLM server to generate a submission
- `upload_answers.py` - Script to validate and upload JSONL files to S3
- `pyproject.toml` - Python dependencies
- `install.sh` - Installation script
- `start_vllm_docker.sh` - Script to start the vLLM Docker container
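As a taste of what the SmolAgents notebook covers, here is a minimal sketch of a `CodeAgent` backed by the local vLLM endpoint. The model id is a placeholder, and `smolagents_demo.ipynb` remains the canonical reference:

```python
# Minimal sketch: a SmolAgents CodeAgent that writes and runs Python code,
# backed by the local vLLM endpoint. The model id is a placeholder.
from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(
    model_id="Qwen/Qwen3-8B",             # placeholder; use the served model
    api_base="http://localhost:8000/v1",  # the team vLLM endpoint
    api_key="EMPTY",
)
agent = CodeAgent(tools=[], model=model)
print(agent.run("Compute the GC content of the sequence ATGGCCATTGTAATG."))
```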
The expected format of the answers is a JSONL file with the following fields:
```
{
  "question": "Is G3BP1 druggable with monoclonal antibodies?",  # From the original data
  "options": "{\"A\": \"No\", \"B\": \"Yes\"}",                  # From the original data
  "answer_letter": "A",                                          # Response extracted from the model's raw_response
  "raw_response": "<think>...</think><answer>A</answer>"         # Optional, but better to include it
}
```

One line per question.
You need to provide the question and options exactly as they appear in the original datasets, as they are used to look up the correct answer (the order of the questions does not matter).
Including the raw_response is optional for appearing on the leaderboard, but we will ask the winning team to provide it. In other words, we will only consider full submissions that include the raw_response for the leaderboard prize.
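As a sketch, a submission file can be assembled along these lines; the `records` list is a hypothetical stand-in for your model's outputs:

```python
# Minimal sketch: write the submission JSONL, one object per line.
# "question" and "options" are carried over verbatim from the data.
import json

records = [
    {
        "question": "Is G3BP1 druggable with monoclonal antibodies?",
        "options": "{\"A\": \"No\", \"B\": \"Yes\"}",
        "answer_letter": "A",
        "raw_response": "<think>...</think><answer>A</answer>",
    },
]

with open("test_answers.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```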
You can use the following script to upload your answers to the test set:
```bash
# Validate only
uv run python upload_answers.py test_answers.jsonl --team-name "Team1" --validate-only

# Upload
uv run python upload_answers.py test_answers.jsonl --team-name "Team1"

# Upload with team name and tag (e.g., model name)
uv run python upload_answers.py test_answers.jsonl --team-name "Team1" --tag "qwen3_8b_no_tooling"
```

The leaderboard is available at https://d18bag07vdubnx.cloudfront.net/
The total score is the percentage of test-set questions whose answer is both correctly formatted and correct.
The provided data has the following format:

```
{
"question": "Is G3BP1 druggable with monoclonal antibodies?",
"options": "{\"A\": \"No\", \"B\": \"Yes\"}",
"answer": "A",
"question_type": "antibody",
"metadata": "{\"target_protein\": \"G3BP1\", \"original_question\": \"Target X can be targeted by Monoclonal Ab ?\", \"original_answer\": 0, \"answer_type\": \"binary\", \"question_category\": \"subquestion 6\", \"template_used\": \"Is {target} druggable with monoclonal antibodies?\", \"data_row_index\": 70}",
"dataset_name": "Therapeutic Target Profiling"
}
```

```
{
"question": "Based on phenylbutazone (computed as the average activity of: CYP2C19, NR1I2, CYP2D6, CYP3A4, TP53, ESR2, EHMT2, CYP2C9, MCL1, PTGS2, and 6 more genes) signature activity patterns from bulk RNA-seq data, which cancer type is more similar to Pheochromocytoma and Paraganglioma?",
"options": "{\"A\": \"Ovarian serous cystadenocarcinoma\", \"B\": \"Prostate adenocarcinoma\"}",
"question_type": "cancer_similarity_binary",
"metadata": "{'options': array(['OV', 'PRAD'], dtype=object), 'signature': 'phenylbutazone', 'split': 'test', 'subject': 'PCPG'}",
"dataset_name": "TCGA Cancer Similarity"
}
```
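A minimal loading sketch, assuming the training file lives in the data directory above and contains one JSON record per line (adjust if it is a single JSON array). Note that `options` is itself a JSON-encoded string, and `metadata` is sometimes a Python repr rather than valid JSON, so parse it defensively:

```python
# Minimal sketch: load the provided data. The path combines the data
# directory and filename mentioned elsewhere in this README (an assumption).
import json

records = []
with open("/home/ec2-user/SageMaker/data/hackathon-train.json") as f:
    for line in f:
        if line.strip():
            records.append(json.loads(line))

for rec in records[:3]:
    options = json.loads(rec["options"])  # second parse: e.g. {"A": "No", "B": "Yes"}
    print(rec["question"], options, rec.get("answer"))
```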
- **vLLM Server Not Running**
  ```bash
  # Check if Docker is running
  docker ps

  # Check the GPU usage
  nvidia-smi

  # Start the server
  ./start_vllm_docker.sh
  ```
  If you want to start fresh, you can kill all Docker processes and all Python processes running on the GPU. Note that this will kill the processes for everyone on your team:
  ```bash
  # Kill all docker processes
  pkill -f "docker"

  # Kill all processes running on the GPU (querying the PIDs directly is
  # more robust than grepping nvidia-smi's table output, whose columns
  # vary across driver versions)
  nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -n1 kill -9

  # Check again: the GPUs should now be free
  nvidia-smi
  ```
- **File Not Found Errors**
  - Ensure `hackathon-train.json` is in the current directory
  - Check file paths in the notebook
- **Answer Format Issues**
  - You can ask the LLM to format its answer as `<answer>[letter]</answer>`, where letter is A, B, C, D, etc., then use a regex like the one shown in `generate_answers_demo.ipynb` to extract the answer (see the sketch after this list). Bear in mind that the LLM might not always follow the format, so you may need to do some prompt engineering.
  - You just need to include the answer letter in the JSONL responses file. Make sure that it corresponds to one of the original options.
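A minimal extraction sketch; the regex in `generate_answers_demo.ipynb` may differ, so treat this as an assumption based on the `<answer>[letter]</answer>` convention above:

```python
# Minimal sketch: pull the answer letter out of a raw model response.
import re

def extract_answer(raw_response: str) -> str | None:
    """Return the letter inside <answer>...</answer>, or None if absent."""
    match = re.search(r"<answer>\s*([A-Z])\s*</answer>", raw_response)
    return match.group(1) if match else None

print(extract_answer("<think>...</think><answer>A</answer>"))  # -> A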
- **Encourage test-time compute:** Several studies have shown that LLMs can improve their performance with more test-time compute. You can encourage this by adding a prompt like "Think through the question step by step" or "First define each biological concept, then answer the question".
- **Choosing the right model:**
  - You do not have to use vLLM for inference; you can use any system that you like. Feel free to query your favorite assistant for a list of models that you can run on the available resources (8x L4 GPUs, 192 GB VRAM).
  - Some models are already finetuned on adjacent domains, e.g. MedGemma.
  - You can also look into larger quantized models published on Hugging Face (e.g. https://huggingface.co/models?other=base_model:quantized:openai/gpt-oss-120b).
- **Prompt optimization:** Careful prompting, in both the system prompt and the user requests, is key to good performance. This is where you can inject expert knowledge and guide the model's reasoning. You can also try some of the promising prompt optimization frameworks on the training set: GEPA, GAAPO, and Promptomatix are examples.
- **Post-training:** You can use the training set to fine-tune open-weights models, probably up to ~32B parameters with LoRA (see the sketch below). Unfortunately, the resources will not be enough for reinforcement learning.
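A minimal LoRA sketch using the `peft` library, not an official recipe; the base model and hyperparameters are placeholders:

```python
# Minimal sketch: wrap an open-weights model with LoRA adapters before
# supervised fine-tuning. Model name and hyperparameters are placeholders.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any open-weights model that fits in VRAM
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank updates
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train on the training set with your favorite trainer
# (e.g. trl's SFTTrainer) and load the adapters for inference.
```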
Good luck with your submissions! 🧬