SoMe is a comprehensive benchmark designed to evaluate the capabilities of Large Language Model (LLM)-based agents in realistic social media scenarios. This benchmark provides a standardized framework for testing and comparing social media agents across multiple dimensions of performance.
SoMe comprises a diverse collection of:
- 8 social media agent tasks
- 9,164,284 posts from various social media platforms
- 6,591 user profiles with rich behavioral data
- 25,686 reports from external websites
- 17,869 meticulously annotated task queries
Figure 1: An example of agentic task in SoMe
- [2026.01] π§ We release MailMind: An AI-powered Email System that Can Do Your Job
- [2025.11] π Our paper is accepted by AAAI 2026!
SoMe benchmark evaluates social media agents across 8 key tasks, covering diverse aspects of social media intelligence:
| Task Category | Task Name | Description |
|---|---|---|
| Post-centered | π¨ Realtime Event Detection (RED) | Identify and track emerging events in real-time |
| Post-centered | π Streaming Event Summary (SES) | Summarize ongoing events from streaming data |
| Post-centered | π« Misinformation Detection (MID) | Identify and flag potentially false or misleading information |
| User-centered | π― User Behavior Prediction (UBP) | Predict user interactions with social media content |
| User-centered | π User Emotion Analysis (UEA) | Analyze user emotions towards social media content |
| User-centered | π¬ User Comment Simulation (UCS) | Simulate realistic user comments |
| Comprehensive | π± Media Content Recommendation (MCR) | Recommend relevant media content based on user interests |
| Comprehensive | β Social Media Question-Answering (SMQ) | Accurately answer questions about social media content |
The SoMe benchmark includes comprehensive datasets for each task, with the following statistics:
| Task | # Query | # Data | Data Type |
|---|---|---|---|
| π¨ Real-time Event Detection | 568 | 476,611 | Posts |
| π Streaming Event Summary | 154 | 7,898,959 | Posts |
| π« Misinformation Detection | 1,451 | 27,137 | Posts & Knowledge |
| π― User Behavior Prediction | 3,000 | 840,200 | Posts & Users |
| π User Emotion Analysis | 2,696 | 840,200 | Posts & Users |
| π¬ User Comment Simulation | 4,000 | 840,200 | Posts & Users |
| π± Media Content Recommendation | 4,000 | 840,200 | Posts & Users |
| β Social Media Question-Answering | 2,000 | 8,651,759 | Posts & Users |
| Total | 17,869 | 9,242,907 | All |
We evaluated various agentic LLMs on the SoMe benchmark. Below are the comprehensive evaluation results across all 8 tasks:
Figure 2: Performance comparison of different agentic models across SoMe benchmark tasks
Social-Media-Agent/
βββ π€ agent.py # Main social media agent implementation
βββ π§ qwen_agent/ # Qwen-Agent library
βββ π tasks/ # Task-specific modules
β βββ π± media_content_recommend/
β βββ π« misinformation_detection/
β βββ π¨ realtime_event_detection/
β βββ β social_media_question_answering/
β βββ π streaming_event_summary/
β βββ π¬ user_comment_simulation/
β βββ π user_emotion_analysis/
β βββ π― user_behavior_prediction/
βββ π οΈ tools/ # Tools for social media analysis
βββ π§ͺ test_*.py # Test scripts for each task
βββ π eval_scripts/ # Evaluation scripts for scoring
βββ π results/ # Directory for storing results
βββ π datasets/ # Dataset directory
βββ πΎ database/ # Database directory
- Python 3.12+ installed on your system
- Git installed for repository cloning
- Sufficient disk space for data (recommended: 50GB+)
-
π₯ Clone the repository
git clone https://github.com/LivXue/SoMe.git cd SoMe -
π¦ Install dependencies
pip install -r requirements.txt
-
π₯ Download test data
- Hugging Face Dataset: Download Link
- Google Drive: Download Link
- Baidu Disk: Download Link (Password: SoMe)
After downloading, unzip the data into the
databasedirectory.
Each task can be evaluated using its corresponding test script:
# π¨ Realtime Event Detection
python test_realtime_event_detection.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π Streaming Event Summary
python test_streaming_event_summary.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π« Misinformation Detection
python test_misinformation_detection.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π― User Behavior Prediction
python test_user_behavior_prediction.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π User Emotion Analysis
python test_user_emotion_analysis.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π¬ User Comment Simulation
python test_user_comment_simulation.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# π± Media Content Recommendation
python test_media_content_recommend.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY
# β Social Media Question Answering
python test_social_media_question_answering.py --model MODEL_NAME --base_url MODEL_SERVER_URL --api_key API_KEY| Argument | Description | Example |
|---|---|---|
--model |
The model name to use | "deepseek-chat" |
--base_url |
The base URL for the model server | "https://api.deepseek.com" |
--api_key |
The API key for the model server | Your actual API key |
--output_path |
Output path for results | "results/my_experiment" |
After running the test scripts, evaluate the results using the provided evaluation scripts:
# Option 1: For tasks with LLM-based answer extraction
python eval_scripts/[TASK]_extraction.py
python eval_scripts/[TASK]_compute_score.py
# Option 2: For tasks with LLM-as-judge scoring
python eval_scripts/[TASK]_scoring.py
python eval_scripts/[TASK]_compute_score.pyNote: The LLM settings for evaluation are configured in
eval_scripts/settings.json
The benchmark supports various LLM models through OpenAI-compatible API endpoints:
- π§© Qwen series models (Qwen2.5, Qwen3, etc.)
- π OpenAI models (GPT-4, GPT-5, etc.)
- π Third-party models with OpenAI-compatible APIs (DeepSeek, Claude, etc.)
- π¦ Local models served with OpenAI-compatible wrappers (vLLM, Ollama, etc.)
If you use this benchmark in your research, please cite our paper:
@inproceedings{some2026,
title={SoMe: A Realistic Benchmark for LLM-based Social Media Agents},
author={Dizhan Xue and Jing Cui and Shengsheng Qian and Chuanrui Hu and Changsheng Xu},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026}
}We welcome contributions to improve the benchmark! Here's how you can help:
- π Report bugs by opening issues with detailed descriptions
- π‘ Suggest features for new tasks or improvements
- π§ Submit code via pull requests for bug fixes or enhancements
- π Add datasets to expand the benchmark coverage
- π Improve documentation for better usability
Please see our Contributing Guidelines for more details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
We would like to express our gratitude to:
- The Qwen team for their excellent Qwen-Agent framework, which forms the foundation of this benchmark
- All contributors who have helped develop and improve SoMe
- The social media platforms and data providers that make this research possible
- The AAAI 2026 reviewers for their valuable feedback
Early agentic LLMs each had their own tool calling formats, and they could not properly follow prompts to use Qwen's calling format. Therefore, I added support for other tool calling formats in Qwen-Agent. If you are using Qwen models or relatively new agentic LLMs, you can directly use the original Qwen-Agent repository.
For questions or inquiries about the benchmark, please contact:
- Dizhan Xue: xuedizhan17@mails.ucas.ac.cn
Visit our GitHub repository for the latest updates and discussions.