# DeepEval Security Testing Starter
A comprehensive testing framework for evaluating LLM responses on API security topics using DeepEval and OpenAI.
## Overview
This project provides a complete testing suite for validating LLM-generated security advice and responses. It focuses on:
- **API Security**: Authentication, authorization, and common vulnerabilities
- **Accuracy Testing**: Ensuring responses are relevant and factually correct
- **Hallucination Detection**: Preventing fabricated or misleading security advice
- **RAG Evaluation**: Testing retrieval-augmented generation quality
- **Prompt Regression**: Comparing prompt versions and preventing regressions
- **Inference Provider**: Uses OpenAI for LLM inference
## Project Structure
```
startDeepEval/
├── .github/
│   └── workflows/
│       └── deepeval.yml            # CI/CD pipeline
├── datasets/
│   ├── golden_dataset.json         # Golden test cases for accuracy
│   └── rag_dataset.json            # RAG test cases with retrieval context
├── src/
│   ├── __init__.py
│   ├── llm_client.py               # OpenAI client for security responses
│   ├── rag_client.py               # OpenAI RAG client with knowledge base
│   └── prompt_versions.py          # Prompt version management
├── tests/
│   ├── __init__.py
│   ├── conftest.py                 # Pytest fixtures
│   ├── test_accuracy.py            # Accuracy and relevancy tests
│   ├── test_hallucination.py       # Hallucination detection tests
│   ├── test_rag.py                 # RAG retrieval and generation tests
│   └── test_prompt_regression.py   # Prompt version regression tests
├── deepeval_results/               # Test results output directory
├── .env.example                    # Environment variables template
├── requirements.txt                # Python dependencies
├── pyproject.toml                  # Project configuration
├── pytest.ini                      # Pytest configuration
└── README.md                       # This file
```
## Getting Started
### Prerequisites
- Python 3.9+
- OpenAI API key
### Installation
1. **Clone the repository**:

   ```bash
   git clone <your-repo-url>
   cd startDeepEval
   ```

2. **Install Python dependencies**:

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**:

   ```bash
   cp .env.example .env
   # Edit .env and add your OpenAI API key:
   # OPENAI_API_KEY=your_api_key_here
   ```
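Before running the suite, it can help to confirm the key is actually picked up. A minimal sanity-check sketch, assuming `python-dotenv` is installed; `check_env.py` is a hypothetical helper, not part of the starter:

```python
# check_env.py -- hypothetical helper, not shipped with the starter.
# Verifies that the variables from .env are visible to Python.
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

key = os.getenv("OPENAI_API_KEY")
if not key:
    raise SystemExit("OPENAI_API_KEY is missing -- edit your .env file first")

print(f"API key loaded (...{key[-4:]}), model: {os.getenv('OPENAI_MODEL', 'gpt-4o-mini')}")
```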
## Running Tests

**Run all tests**:

```bash
pytest
```

**Run specific test categories**:

```bash
# Accuracy tests
pytest tests/test_accuracy.py

# Hallucination detection
pytest tests/test_hallucination.py

# RAG tests
pytest tests/test_rag.py

# Prompt regression
pytest tests/test_prompt_regression.py
```

**Run with markers**:

```bash
# Run only security tests
pytest -m security

# Run everything except slow tests
pytest -m "not slow"
```

## Test Metrics

The test suite uses the following DeepEval metrics (a combined usage sketch follows the list):

**Accuracy and relevancy**:

- **AnswerRelevancyMetric**: Measures how relevant the response is to the query
- **FaithfulnessMetric**: Ensures responses are grounded in provided context
- **ContextualRelevancyMetric**: Validates context relevance to the query

**Hallucination and bias**:

- **HallucinationMetric**: Detects fabricated information not supported by context
- **BiasMetric**: Identifies biased or unfair recommendations

**RAG retrieval quality**:

- **ContextualPrecisionMetric**: Measures precision of retrieved context
- **ContextualRecallMetric**: Evaluates completeness of retrieved context
- **ContextualRelevancyMetric**: Assesses overall retrieval quality

**Custom and regression checks**:

- **GEval**: Custom criteria-based evaluation for comprehensiveness and quality
- **Version Comparison**: Ensures new prompts don't regress on key metrics
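As a rough illustration of how several of these metrics can be attached to one test case (a hedged sketch: the query, output, and context strings below are invented for the example and are not taken from the project's datasets):

```python
from deepeval import assert_test
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase

# Illustrative test case; the metrics call an LLM judge, so OPENAI_API_KEY must be set.
test_case = LLMTestCase(
    input="How should I store API keys?",
    actual_output=(
        "Keep API keys in a secrets manager or environment variables, "
        "and never commit them to source control."
    ),
    # retrieval_context feeds FaithfulnessMetric; context feeds HallucinationMetric.
    retrieval_context=["API keys must never be committed to version control."],
    context=["API keys must never be committed to version control."],
)

# assert_test fails if any metric does not meet its threshold.
assert_test(
    test_case,
    [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
        HallucinationMetric(threshold=0.5),
    ],
)
```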
## Configuration

`pytest.ini` controls the following (an illustrative sketch follows the list):

- Test discovery patterns
- Output formatting
- Custom markers for organizing tests
- Logging configuration
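A minimal sketch of what such a `pytest.ini` might look like; the `security` and `slow` markers come from the commands above, while the remaining values are assumptions rather than the starter's actual configuration:

```ini
[pytest]
# Test discovery patterns
testpaths = tests
python_files = test_*.py

# Output formatting
addopts = -v --tb=short

# Custom markers for organizing tests (used with `pytest -m ...`)
markers =
    security: security-focused test cases
    slow: long-running tests, excluded with -m "not slow"

# Logging configuration
log_cli = true
log_cli_level = INFO
```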
Environment variables are read from `.env`:

```bash
# Required: OpenAI API key
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini

# DeepEval configuration
DEEPEVAL_TELEMETRY_OPT_OUT=true  # Optional
CONFIDENCE_THRESHOLD=0.7         # Optional: default threshold
```
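For orientation, here is a rough sketch of how a client such as `src/llm_client.py` might consume these variables; the class name and prompt are assumptions, not the starter's actual implementation:

```python
import os

from openai import OpenAI


class SecurityLLMClient:
    """Illustrative only; the real src/llm_client.py may be organized differently."""

    def __init__(self) -> None:
        # Same variables as documented in .env.example
        self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
        self.model = os.getenv("OPENAI_MODEL", "gpt-4o-mini")

    def generate_security_response(self, query: str) -> str:
        # Single-turn completion with a security-focused system prompt.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are an API security expert."},
                {"role": "user", "content": query},
            ],
        )
        return response.choices[0].message.content
```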
## Writing Custom Tests

An accuracy test against the security client:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def test_my_security_feature(llm_client):
    query = "How do I secure my API?"
    response = llm_client.generate_security_response(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=response
    )

    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
```

A RAG test that also checks the retrieved context:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric


def test_rag_retrieval(rag_client):
    query = "How do I prevent SQL injection?"
    result = rag_client.generate_rag_response(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=result["response"],
        retrieval_context=result["retrieval_context"]
    )

    metric = ContextualRelevancyMetric(threshold=0.6)
    assert_test(test_case, [metric])
```
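Both examples rely on the `llm_client` and `rag_client` fixtures from `tests/conftest.py`. A sketch of how such fixtures could be wired up (the class names are assumptions; the starter's actual `conftest.py` may differ):

```python
# tests/conftest.py (illustrative sketch)
import pytest

from src.llm_client import SecurityLLMClient   # assumed class name
from src.rag_client import SecurityRAGClient   # assumed class name


@pytest.fixture(scope="session")
def llm_client():
    # Reuse one client across the session to keep API usage down.
    return SecurityLLMClient()


@pytest.fixture(scope="session")
def rag_client():
    return SecurityRAGClient()
```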
## CI/CD Integration

The project includes a GitHub Actions workflow (`.github/workflows/deepeval.yml`) that:
- Runs on push to main/develop branches
- Runs on pull requests
- Executes weekly on Sunday (for regression detection)
- Requires the OpenAI API key configured as a GitHub secret (a sketch of such a workflow follows)
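A sketch of what such a workflow could look like; the cron time, action versions, and Python version are assumptions, not the committed `deepeval.yml`:

```yaml
name: DeepEval Tests

on:
  push:
    branches: [main, develop]
  pull_request:
  schedule:
    - cron: "0 6 * * 0"  # weekly on Sunday (time is an assumption)

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```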
Add your OpenAI API key as a repository secret:
- Go to your repository settings
- Navigate to **Secrets and variables > Actions**
- Add a new repository secret:
  - **Name**: `OPENAI_API_KEY`
  - **Value**: Your OpenAI API key
Then push to trigger automated testing:
```bash
git push origin dev
```

## Use Cases

- **Security Chatbot Validation**: Ensure your security chatbot provides accurate advice
- **Documentation QA**: Validate generated security documentation
- **Prompt Engineering**: Test and compare different prompt versions
- **Compliance**: Verify responses align with security standards (OWASP, NIST)
- **Regression Testing**: Catch quality degradation in model updates
- **Privacy-Conscious Testing**: Run tests using OpenAI with appropriate handling of sensitive data
## Contributing

- Fork the repository
- Create a feature branch
- Add tests for new features
- Ensure all tests pass
- Submit a pull request
## License

MIT License - feel free to use this starter template for your projects.