# DeepEval Security Testing Starter

A comprehensive testing framework for evaluating LLM responses on API security topics using DeepEval and OpenAI.

## πŸ” Overview

This project provides a complete testing suite for validating LLM-generated security advice and responses. It focuses on:

- **API Security**: Authentication, authorization, and common vulnerabilities
- **Accuracy Testing**: Ensuring responses are relevant and factually correct
- **Hallucination Detection**: Preventing fabricated or misleading security advice
- **RAG Evaluation**: Testing retrieval-augmented generation quality
- **Prompt Regression**: Comparing prompt versions and preventing regressions
- **Inference Provider**: Uses OpenAI for LLM inference

## πŸ“ Project Structure

```
startDeepEval/
β”œβ”€β”€ .github/
β”‚   └── workflows/
β”‚       └── deepeval.yml              # CI/CD pipeline
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ golden_dataset.json           # Golden test cases for accuracy
β”‚   └── rag_dataset.json              # RAG test cases with retrieval context
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ llm_client.py                 # OpenAI client for security responses
β”‚   β”œβ”€β”€ rag_client.py                 # OpenAI RAG client with knowledge base
β”‚   └── prompt_versions.py            # Prompt version management
β”œβ”€β”€ tests/
β”‚   β”œβ”€β”€ __init__.py
β”‚   β”œβ”€β”€ conftest.py                   # Pytest fixtures
β”‚   β”œβ”€β”€ test_accuracy.py              # Accuracy and relevancy tests
β”‚   β”œβ”€β”€ test_hallucination.py         # Hallucination detection tests
β”‚   β”œβ”€β”€ test_rag.py                   # RAG retrieval and generation tests
β”‚   └── test_prompt_regression.py     # Prompt version regression tests
β”œβ”€β”€ deepeval_results/                 # Test results output directory
β”œβ”€β”€ .env.example                      # Environment variables template
β”œβ”€β”€ requirements.txt                  # Python dependencies
β”œβ”€β”€ pyproject.toml                    # Project configuration
β”œβ”€β”€ pytest.ini                        # Pytest configuration
└── README.md                         # This file
```


## πŸš€ Getting Started

### Prerequisites

- Python 3.9+
- OpenAI API key

### Installation

1. **Clone the repository**:
   ```bash
   git clone <your-repo-url>
   cd startDeepEval
   ```

2. **Install Python dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables**:
   ```bash
   cp .env.example .env
   # Edit .env and add your OpenAI API key:
   # OPENAI_API_KEY=your_api_key_here
   ```

**Run all tests**:

```bash
pytest
```

**Run specific test categories**:

```bash
# Accuracy tests
pytest tests/test_accuracy.py

# Hallucination detection
pytest tests/test_hallucination.py

# RAG tests
pytest tests/test_rag.py

# Prompt regression
pytest tests/test_prompt_regression.py
```

**Run with markers**:

```bash
# Run only security tests
pytest -m security

# Run everything except slow tests
pytest -m "not slow"
```
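Markers are applied to individual tests with the standard `pytest.mark` decorator. A minimal sketch (the test names here are hypothetical; the real tests under `tests/` may use different markers and fixtures):

```python
import pytest


@pytest.mark.security  # selected by `pytest -m security`
def test_security_marker_example():
    assert True


@pytest.mark.slow  # excluded by `pytest -m "not slow"`
def test_slow_marker_example():
    assert True
```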

## πŸ“Š Test Metrics

### Accuracy Tests

- **AnswerRelevancyMetric**: Measures how relevant the response is to the query
- **FaithfulnessMetric**: Ensures responses are grounded in provided context
- **ContextualRelevancyMetric**: Validates context relevance to the query

### Hallucination Tests

- **HallucinationMetric**: Detects fabricated information not supported by context
- **BiasMetric**: Identifies biased or unfair recommendations
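A hallucination check supplies the source `context` on the test case so DeepEval can compare the output against it. A minimal sketch (the query, context text, and threshold are illustrative, not taken from the repo's tests):

```python
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase


def test_no_fabricated_oauth_advice(llm_client):
    query = "What does OAuth 2.0 issue to grant clients access?"
    context = [
        "OAuth 2.0 authorization servers issue access tokens that clients "
        "present to resource servers."
    ]

    response = llm_client.generate_security_response(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=response,
        context=context,  # the ground truth the output is checked against
    )
    # For HallucinationMetric, lower scores are better; the threshold is the
    # maximum hallucination score allowed to pass.
    assert_test(test_case, [HallucinationMetric(threshold=0.5)])
```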

### RAG Tests

- **ContextualPrecisionMetric**: Measures precision of retrieved context
- **ContextualRecallMetric**: Evaluates completeness of retrieved context
- **ContextualRelevancyMetric**: Assesses overall retrieval quality

### Prompt Regression Tests

- **GEval**: Custom criteria-based evaluation for comprehensiveness and quality
- **Version Comparison**: Ensures new prompts don't regress on key metrics
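As a rough illustration of how a `GEval` criterion for prompt comparison might be defined (the criteria text and threshold below are assumptions, not the project's actual settings):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

comprehensiveness = GEval(
    name="Comprehensiveness",
    criteria=(
        "Determine whether the actual output covers the main mitigations "
        "for the security issue raised in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)
```

Scoring the same metric against responses generated by two prompt versions, then comparing the scores, is one way to catch regressions.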

## πŸ”§ Configuration

### Pytest Configuration (`pytest.ini`)

- Test discovery patterns
- Output formatting
- Custom markers for organizing tests
- Logging configuration

### Environment Variables (`.env`)

```
# Required: OpenAI API Key
OPENAI_API_KEY=your_key_here
OPENAI_MODEL=gpt-4o-mini

# DeepEval configuration
DEEPEVAL_TELEMETRY_OPT_OUT=true  # Optional
CONFIDENCE_THRESHOLD=0.7         # Optional: default threshold
```
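How these variables reach the clients and tests depends on the project's fixtures; a minimal sketch, assuming `python-dotenv` (or an equivalent loader) is available:

```python
import os

from dotenv import load_dotenv

# Load .env into the process environment before building clients or metrics.
load_dotenv()

OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
CONFIDENCE_THRESHOLD = float(os.getenv("CONFIDENCE_THRESHOLD", "0.7"))
```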

πŸ“ Writing Tests

Basic Test Structure

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_my_security_feature(llm_client):
    query = "How do I secure my API?"
    response = llm_client.generate_security_response(query)
    
    test_case = LLMTestCase(
        input=query,
        actual_output=response
    )
    
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])
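The `llm_client` argument is a pytest fixture provided by `tests/conftest.py`. A sketch of what such a fixture might look like (the `LLMClient` class name is hypothetical; the real implementation lives in `src/llm_client.py`):

```python
# tests/conftest.py (illustrative sketch)
import pytest

from src.llm_client import LLMClient  # hypothetical class name


@pytest.fixture(scope="session")
def llm_client():
    # One client per test session keeps OpenAI usage (and cost) down.
    return LLMClient()
```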

### RAG Test Structure

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

def test_rag_retrieval(rag_client):
    query = "How do I prevent SQL injection?"
    result = rag_client.generate_rag_response(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=result["response"],
        retrieval_context=result["retrieval_context"]
    )

    metric = ContextualRelevancyMetric(threshold=0.6)
    assert_test(test_case, [metric])
```
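`ContextualPrecisionMetric` and `ContextualRecallMetric` additionally require an `expected_output` on the test case. A sketch of such a test (the expected answer text is illustrative):

```python
from deepeval import assert_test
from deepeval.metrics import ContextualPrecisionMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase


def test_rag_precision_and_recall(rag_client):
    query = "How do I prevent SQL injection?"
    result = rag_client.generate_rag_response(query)

    test_case = LLMTestCase(
        input=query,
        actual_output=result["response"],
        expected_output="Use parameterized queries and validate all user input.",
        retrieval_context=result["retrieval_context"],
    )

    assert_test(
        test_case,
        [ContextualPrecisionMetric(threshold=0.6), ContextualRecallMetric(threshold=0.6)],
    )
```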

## πŸ€– CI/CD Integration

The project includes a GitHub Actions workflow (`.github/workflows/deepeval.yml`) that:

- Runs on push to main/develop branches
- Runs on pull requests
- Executes weekly on Sunday (for regression detection)
- Requires an OpenAI API key configured as a GitHub secret

### Setting Up GitHub Actions

Add your OpenAI API key as a repository secret:

1. Go to your repository settings
2. Navigate to **Secrets and variables > Actions**
3. Add a new repository secret:
   - Name: `OPENAI_API_KEY`
   - Value: Your OpenAI API key

Then push to trigger automated testing:

```bash
git push origin dev
```

## 🎯 Use Cases

1. **Security Chatbot Validation**: Ensure your security chatbot provides accurate advice
2. **Documentation QA**: Validate generated security documentation
3. **Prompt Engineering**: Test and compare different prompt versions
4. **Compliance**: Verify responses align with security standards (OWASP, NIST)
5. **Regression Testing**: Catch quality degradation in model updates
6. **Privacy-Conscious Testing**: Run tests using OpenAI with appropriate handling of sensitive data


## 🀝 Contributing

1. Fork the repository
2. Create a feature branch
3. Add tests for new features
4. Ensure all tests pass
5. Submit a pull request

## πŸ“„ License

MIT License - feel free to use this starter template for your projects.
