A Streamlit-based portal for reviewing Terminal Bench tasks with integrated validator and LLM review comparison.
./start_portal.sh
- Upload task ZIP files or provide Google Drive links
- Run validator script on tasks
- View validation results with detailed feedback
- Step 1: Upload Task
- Step 2: Run Validator
- Step 3: Review task.yaml (with LLM comparison)
- Step 4: Review solution.sh (with LLM comparison)
- Step 5: Review Dockerfile (with LLM comparison)
- Step 6: Compare Model Test Results
- Step 7: Export Final Review Report
You can now directly paste Google Sheets URLs to load LLM reviews!
Example:
https://docs.google.com/spreadsheets/d/1VW0CrLLgjPRGs7fWlYwW78OEZ5S5MMuURwh7jCQC_SQ/edit?usp=sharing
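A shared Sheets link like the one above can usually be fetched as CSV by rewriting it to the sheet's `export?format=csv` endpoint. The sketch below shows one way to do that rewrite; the function name `sheets_csv_url` is hypothetical, and the portal's own loader may work differently.

```python
import re
from typing import Optional

# Spreadsheet ID as it appears in a standard Google Sheets URL.
_SHEET_ID = re.compile(r"/spreadsheets/d/([a-zA-Z0-9_-]+)")
# Optional tab identifier, e.g. "#gid=0" or "&gid=123".
_GID = re.compile(r"[#?&]gid=(\d+)")


def sheets_csv_url(url: str) -> Optional[str]:
    """Return a CSV export URL for a Google Sheets link, or None if no ID is found."""
    match = _SHEET_ID.search(url)
    if match is None:
        return None
    export = (
        f"https://docs.google.com/spreadsheets/d/{match.group(1)}/export?format=csv"
    )
    gid = _GID.search(url)
    if gid:
        # Keep the selected tab when the original link names one.
        export += f"&gid={gid.group(1)}"
    return export
```

The export URL only returns data when the sheet is shared with anyone who has the link.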
How to Format Your Sheet:
- Column A: Section name (e.g., task.yaml, solution.sh, Dockerfile)
- Column B: LLM review content
Or use any format - the parser will try to intelligently extract sections!
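As an illustration of the two-column layout, the sketch below reads CSV text (for example, an exported Sheet) into a dictionary keyed by section name. The function name `parse_review_csv`, the lowercased keys, and the skipping of a "Section" header row are assumptions for this example, not the behaviour of llm_review_parser.py.

```python
import csv
import io


def parse_review_csv(text: str) -> dict[str, str]:
    """Map section names (column A) to review text (column B)."""
    reviews: dict[str, str] = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # ignore blank or single-column rows
        section, review = row[0].strip(), row[1].strip()
        if not section or not review:
            continue
        key = section.lower()
        if key == "section":
            continue  # assumed header row such as "Section,Review"
        # Repeated sections are appended rather than overwritten.
        reviews[key] = f"{reviews[key]}\n\n{review}" if key in reviews else review
    return reviews
```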
- Upload Files: .txt, .md, .json, .docx, .csv
- Google Drive Links: Direct file links
- Direct URLs: Any accessible document URL
- View task files on the left
- View LLM reviews on the right
- Edit LLM reviews inline
- Add manual review notes
- Combined review with your notes and LLM feedback
- JSON format for easy processing
- Includes validation results
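To make the export concrete, here is a minimal sketch of how reviewer notes, LLM feedback, and validation results could be combined into one JSON document. The field names (`task`, `generated_at`, `validation`, `reviews`) and the function `build_report` are hypothetical; the portal's actual report schema is not documented here.

```python
import json
from datetime import datetime, timezone


def build_report(task_name: str, notes: dict, llm_reviews: dict, validation: dict) -> str:
    """Combine notes, LLM reviews, and validation results into a JSON string."""
    # Include every section that has either a manual note or an LLM review.
    sections = sorted(set(notes) | set(llm_reviews))
    report = {
        "task": task_name,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "validation": validation,
        "reviews": {
            name: {
                "reviewer_notes": notes.get(name, ""),
                "llm_review": llm_reviews.get(name, ""),
            }
            for name in sections
        },
    }
    return json.dumps(report, indent=2)
```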
- Go to "Upload" tab
- Upload ZIP file or paste Google Drive link
- Wait for extraction
- In the sidebar, find "LLM Review Document"
- Option A: Paste Google Sheets URL
- Option B: Upload file (.txt, .md, .json, .docx, .csv)
- Option C: Paste Google Drive file link
- Click "Load from URL" (uploaded files load automatically)
- Navigate through steps using buttons or sidebar
- View file content with syntax highlighting
- View LLM review side-by-side
- Add your review notes
- Mark steps as complete
- Go to "Final Review" step
- Click "Export Review Report"
- Download JSON report with all reviews
pip install -r requirements.txt
- streamlit>=1.28.0 - Web framework
- gdown>=4.7.1 - Google Drive downloads
- PyYAML>=6.0 - YAML parsing
- python-docx>=0.8.11 - DOCX support
- API clients (optional, for future LLM features)
- Default: Uses tb_validator_v6.py in the same directory
- Upload a different validator in the sidebar if needed
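The default-or-override rule could be expressed roughly as below. This is a sketch under assumptions: `resolve_validator` and its parameters are hypothetical names, and how the portal actually invokes the validator is not covered.

```python
from pathlib import Path
from typing import Optional

DEFAULT_VALIDATOR = "tb_validator_v6.py"


def resolve_validator(app_dir: Path, uploaded: Optional[Path] = None) -> Path:
    """Prefer an uploaded validator; otherwise use the default next to the app."""
    candidate = uploaded if uploaded is not None else app_dir / DEFAULT_VALIDATOR
    if not candidate.is_file():
        raise FileNotFoundError(f"Validator script not found: {candidate}")
    return candidate
```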
The parser automatically detects Terminal Bench review documents with sections like:
A) PROMPT
- Analysis of task.yaml...
- PASS/FAIL indicators...
B) TESTS
- Test coverage analysis...
E) DOCKERFILE
- Dockerfile review...
G) SOLUTION.SH
- Solution implementation review...
These sections are automatically mapped to:
- A) PROMPT → task.yaml review
- B) TESTS → tests review
- E) DOCKERFILE → dockerfile review
- G) SOLUTION → solution.sh review
- H) DOCKER-COMPOSE → docker_compose review
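The mapping from lettered headers to review keys can be sketched as a small line-based splitter. The table below mirrors the mapping listed in this README, but `split_sections`, the header regex, and the choice to drop unmapped sections are assumptions for illustration rather than the parser's real implementation.

```python
import re

# Header keyword -> review key, following the mapping described in this README.
SECTION_MAP = {
    "PROMPT": "task.yaml",
    "TESTS": "tests",
    "DOCKERFILE": "dockerfile",
    "SOLUTION": "solution.sh",
    "DOCKER-COMPOSE": "docker_compose",
}

# Matches headers such as "A) PROMPT" or "G) SOLUTION.SH".
_HEADER = re.compile(r"^([A-Z])\)\s+([A-Z][A-Z.\-]*)\s*$")


def split_sections(text: str) -> dict[str, str]:
    """Group the lines of a review document under their mapped section keys."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = _HEADER.match(line.strip())
        if match:
            # "SOLUTION.SH" is looked up as "SOLUTION".
            keyword = match.group(2).split(".")[0]
            current = SECTION_MAP.get(keyword)
            if current is not None:
                sections.setdefault(current, [])
            continue
        if current is not None:
            sections[current].append(line)
    return {key: "\n".join(lines).strip() for key, lines in sections.items()}
```

Lines under a header that is not in the table are ignored in this sketch.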
task.yaml Review
----------------
Good structure, clear description...
solution.sh Review
------------------
Correct implementation...
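A plain-text document in this simpler layout could be split on its dash-underlined headings, as in the sketch below. The function names and the rule for deriving a key (lowercase the heading and drop a trailing " Review") are assumptions made for this example.

```python
def _is_rule(line: str) -> bool:
    """True for an underline made only of dashes, e.g. '-----'."""
    stripped = line.strip()
    return len(stripped) >= 3 and set(stripped) == {"-"}


def parse_underlined(text: str) -> dict[str, str]:
    """Split text into sections introduced by a heading line plus a dash rule."""
    lines = text.splitlines()
    sections: dict[str, list[str]] = {}
    current = None
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        next_is_rule = i + 1 < len(lines) and _is_rule(lines[i + 1])
        if line and not _is_rule(line) and next_is_rule:
            name = line.lower()
            if name.endswith(" review"):
                name = name[: -len(" review")]
            current = name
            sections[current] = []
            i += 2  # skip the heading and its underline
            continue
        if current is not None:
            sections[current].append(lines[i])
        i += 1
    return {key: "\n".join(body).strip() for key, body in sections.items()}
```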
Just export your Sheet as CSV and upload, or paste the Sheets URL directly!
- For Google Sheets: Make sure the sheet is shared (at least "Anyone with link can view")
- Section Names: Use filenames like task.yaml, solution.sh, dockerfile for automatic matching
- Rich Formatting: Markdown is supported in LLM reviews (headers, lists, bold, etc.)
- Multiple Reviews: You can load multiple review documents - they will merge
- Check that the sheet is shared publicly or with anyone who has the link
- Verify the URL format is correct
- Try exporting as CSV and uploading instead
- Check the success message shows parsed sections
- Verify your section names match file names
- Try a simpler format (two columns: Section, Review)
- Ensure the task ZIP structure is correct
- Check that the validator script is present
- Look at the detailed error message in the output
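When the validator fails on an upload, a quick structural check of the ZIP can narrow things down. The sketch below reports which of the files reviewed in the portal (task.yaml, solution.sh, Dockerfile) are absent from an archive. Treating exactly these three as required is an assumption; the validator may expect more, and `missing_task_files` is a hypothetical helper.

```python
import io
import zipfile

# Files the portal reviews; assumed here to be the minimum a task ZIP needs.
EXPECTED_FILES = ("task.yaml", "solution.sh", "Dockerfile")


def missing_task_files(zip_bytes: bytes) -> list[str]:
    """Return the expected files that do not appear anywhere in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Compare on base names so files inside a top-level folder still count.
        basenames = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in EXPECTED_FILES if name not in basenames]
```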
Review Portal/
├── review_portal_v2_llm.py    # Main portal application
├── llm_review_parser.py       # Review document parser
├── tb_validator_v6.py         # Task validator script
├── requirements.txt           # Python dependencies
├── start_portal.sh            # Launch script
└── README.md                  # This file
Current: v2.1 with Google Sheets Support
What's New:
- Direct Google Sheets URL support
- CSV parsing for exported Sheets
- Improved file upload (added CSV)
- Better error messages
- Enhanced download handling
- Direct LLM API integration (GPT-4, Claude, Gemini)
- Batch review processing
- Review templates
- Automated scoring
Need Help? Check the error messages in the portal - they're designed to be helpful!