A fork of huggingface/skills — a collection of AI skills, agents, and evaluations.
This repository contains:
- Skills: Modular AI capabilities that can be composed into agents
- Agents: Autonomous AI systems built from skills
- Evals: Benchmarks and leaderboards for measuring skill performance
- Marketplace: Discoverable plugins for Claude and Cursor
```
.
├── .claude-plugin/                        # Claude AI plugin configuration
│   ├── plugin.json                        # Plugin metadata and entry points
│   └── marketplace.json                   # Marketplace listing
├── .cursor-plugin/                        # Cursor IDE plugin configuration
│   ├── plugin.json                        # Plugin metadata
│   └── marketplace.json                   # Marketplace listing
├── .github/
│   └── workflows/
│       ├── generate-agents.yml            # CI: auto-generate agent configs
│       ├── push-evals-leaderboard.yml     # CI: update evals leaderboard
│       └── push-hackers-leaderboard.yml   # CI: update hackers leaderboard
└── skills/                                # Core skill implementations
```
- Python 3.10+
- `pip` or `uv` for package management
To install from source:

```bash
git clone https://github.com/your-org/skills.git
cd skills
pip install -e .
```

Run evals with:

```bash
python -m skills.evals run --skill <skill-name>
```

Install via the Claude marketplace or load `.claude-plugin/plugin.json` directly in your Claude environment.

Install via the Cursor marketplace or load `.cursor-plugin/plugin.json` directly in your Cursor IDE.
- Fork the repository
- Create a feature branch (`git checkout -b feat/my-skill`)
- Add your skill under `skills/`
- Add corresponding evals under `evals/`
- Open a pull request
See CONTRIBUTING.md for detailed guidelines.
See .github/workflows/SECURITY.md for our security policy.
Apache 2.0 — see LICENSE for details.
Personal fork notes: I'm using this repo to experiment with building custom skills for my own workflows. Main areas of interest: text summarization and code review skills. Not intended for production use.
TODO:
- Build a summarization skill that handles long documents (>10k tokens) by chunking
- Experiment with a code review skill focused on Python style/type hints
- Compare eval results against upstream once I have a baseline
- Look into adding a `--dry-run` flag to the evals runner so I can test without writing results (see the argparse sketch after this list)
- Try running evals against `gpt-4o-mini` as a cheaper baseline before committing to full runs
- Chunking strategy: overlap adjacent chunks so context isn't lost at boundaries (see the chunking sketch below). Tested this and it works well; settled on 15% overlap since ~10% occasionally dropped a sentence at a boundary
- Set up local dev environment with `uv` instead of `pip`; noticeably faster for resolving deps
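
A minimal sketch of the chunking-with-overlap idea from the TODO above. The function names and the word-count heuristic are my own scaffolding, not part of the upstream skills API; a real skill would count tokens with the model's tokenizer instead of splitting on whitespace.

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: float = 0.15) -> list[str]:
    """Split text into word-based chunks with fractional overlap between neighbours.

    chunk_size is measured in whitespace-separated words as a rough stand-in for
    tokens; overlap=0.15 means each chunk repeats the trailing ~15% of the previous
    one so sentences at chunk boundaries are not lost.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]

    step = max(1, int(chunk_size * (1 - overlap)))  # advance by ~85% of a chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


def summarize_long_document(text: str, summarize) -> str:
    """Map-reduce style summarization: summarize each chunk, then the joined summaries.

    `summarize` is a placeholder callable (e.g. a model call) supplied by the caller.
    """
    partial = [summarize(chunk) for chunk in chunk_text(text)]
    return summarize("\n\n".join(partial)) if len(partial) > 1 else partial[0]
```

With the defaults above, a ~10k-word document splits into about six overlapping chunks, which lines up with the ">10k tokens" threshold in the first TODO item.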
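
And a rough sketch of the `--dry-run` idea, assuming the evals runner exposes an argparse-based CLI. The function names, result shape, and output path here are hypothetical stand-ins, not the actual `skills.evals` internals.

```python
import argparse
import json


def run_eval(skill: str) -> dict:
    """Placeholder for the real eval run; returns a fake result dict for illustration."""
    return {"skill": skill, "score": 0.0}


def main() -> None:
    parser = argparse.ArgumentParser(prog="skills.evals")
    parser.add_argument("--skill", required=True, help="name of the skill to evaluate")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="run the eval but do not write results to disk",
    )
    args = parser.parse_args()

    results = run_eval(args.skill)
    if args.dry_run:
        print(json.dumps(results, indent=2))  # inspect locally, persist nothing
    else:
        with open(f"results-{args.skill}.json", "w") as f:  # hypothetical output path
            json.dump(results, f, indent=2)


if __name__ == "__main__":
    main()
```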