Stars
WeChat-to-Codex bridge for running Codex app-server from chat, with threads, slash commands, approvals, agents, automation, uploads, and assistant records.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
AI generates a real, editable PowerPoint from any document — native shapes & animations, speaker notes voiced as audio narration, and the option to follow your own .pptx template, not slide images …
Convert any URL to an LLM-friendly input with a simple prefix https://r.jina.ai/
Supercharge Your LLM Application Evaluations 🚀
List of papers on hallucination detection in LLMs.
ACL2023 - AlignScore, a metric for factual consistency evaluation.
Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while control…
A comprehensive guide to Generative Engine Optimization (GEO) — optimizing content for AI-driven search engines like ChatGPT, Gemini, and Perplexity. #AEO
Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.
Open Images is a dataset of ~9 million images that have been annotated with image-level labels and bounding boxes spanning thousands of classes.
Unified Automated Evaluation for Hallucination Detection and Fact Verification
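Official implementation of T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
"Stones from other hills may serve to polish jade": a series on large language model evaluation and governance released by the Fudan JADE team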
This is the repository of HaluEval, a large-scale hallucination evaluation benchmark for Large Language Models.
GitHub repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023)
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
The simplest, fastest repository for training/finetuning medium-sized GPTs.
🔥[ACL 2025 Findings] DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Easily use and train state-of-the-art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease of use, backed by research.
RefChecker provides an automatic checking pipeline and a benchmark dataset for detecting fine-grained hallucinations generated by Large Language Models.