Copyright (C) 2025 Amadeus S.A.S. See the end of the file for license conditions.

SIFT (Smart Intelligent Finding Triaging)

SIFT (Smart Intelligent Finding Triaging) is an AI-powered tool that automatically analyzes GitLeaks security findings using Large Language Models (LLMs). It helps reduce false positives by providing an intelligent analysis of each potential secret detected in your codebase.

🚀 Overview

SIFT leverages the power of LLMs to analyze secrets detected by GitLeaks, providing:

  • Intelligent Analysis: AI-powered classification of true vs. false positives
  • Multi-LLM Support: choose between single-LLM and multi-LLM analysis modes
  • Context-Aware: considers file paths, content context, and code patterns
  • Confidence Scoring: provides a confidence level for each analysis
  • Consensus Analysis: multi-LLM mode uses reviewer consensus for improved reliability
  • Batch Processing: processes multiple findings efficiently in one run

📋 Prerequisites

1. Install GitLeaks

First, install GitLeaks to scan your repositories for secrets:

# On Linux/macOS using Homebrew
brew install gitleaks

# On Linux using curl
curl -sSfL https://raw.githubusercontent.com/gitleaks/gitleaks/master/scripts/install.sh | sh -s -- -b /usr/local/bin

# On Windows using Chocolatey
choco install gitleaks

# Or download from GitHub releases
# Visit: https://github.com/gitleaks/gitleaks/releases

Verify the installation:

gitleaks version

2. Install Ollama

Install Ollama on your system:

# On Linux
curl -fsSL https://ollama.com/install.sh | sh

# On macOS
brew install ollama

# Or visit https://ollama.com/download for other installation methods

3. Pull Language Models

Download suitable models for analysis. Choose the model(s) based on your preferred analysis mode:

For Single-LLM Mode (Default)

# Recommended model
ollama pull mistral-small

# Alternative models
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mistral:7b

For Multi-LLM Mode (Enhanced Accuracy)

# Recommended combination for multi-LLM analysis
ollama pull mistral-small      # Analyzer 1
ollama pull llama3.1:8b        # Analyzer 2  
ollama pull qwen2.5:14b        # Reviewer (larger model for final decision)

# Alternative combinations
# Fast processing
ollama pull mistral-small && ollama pull phi3:3.8b

# Balanced performance
ollama pull mistral-small && ollama pull gemma2:9b

4. Start Ollama Server

Start the Ollama server (usually starts automatically after installation):

ollama serve

The server will run on http://localhost:11434 by default.
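
To confirm the server is reachable, you can query its REST API; the /api/tags endpoint lists the models available locally:

# Should return a JSON document listing the models you have pulled
curl http://localhost:11434/api/tags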

5. Install Python Dependencies

This project uses UV for dependency management:

# Install UV if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh

# Or for macOS
brew install uv

# Install project dependencies
uv sync

🛠️ Usage

Step 1: Generate GitLeaks SARIF Report

First, run GitLeaks on your repository to generate a SARIF report using the custom report template (sift.tmpl):

# Scan a repository and generate an output in SARIF format
gitleaks detect --source /path/to/your/repo --report-format template --report-template sift.tmpl --report-path findings.sarif
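
If you have jq installed, you can sanity-check the report before moving on; assuming the template emits standard SARIF, the findings live under runs[].results:

# Count the findings in the SARIF report
jq '.runs[0].results | length' findings.sarif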

Step 2: Parse SARIF File

Extract findings from the SARIF file and generate prompts:

# Using UV
uv run parse_gitleaks_sarif.py findings.sarif --output-dir ./prompts

# Or directly with Python
python parse_gitleaks_sarif.py findings.sarif --output-dir ./prompts

This will create .prompt files in the ./prompts directory, one for each secret finding.
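
The filenames appear to encode the rule, file path, and line number of each finding. An illustrative listing (exact names depend on your findings):

ls ./prompts
# aws-access-token_config_production.env_12.prompt
# generic-api-key_docs_README.md_7.prompt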

Step 3: Configure Analysis Mode

Create a configuration file to control analysis behavior:

# Copy default configuration
cp config.yaml my_config.yaml

# Edit my_config.yaml to customize settings
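
As a rough sketch, the settings you will most often touch look like this. The analysis key names below are illustrative, so treat the shipped config.yaml as the authoritative schema; the ollama block matches the excerpt shown later in this README:

# my_config.yaml (key names under "analysis" are illustrative)
analysis:
  mode: single              # or "multi" for the consensus workflow
ollama:
  url: "http://localhost:11434"
  timeout: 120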

Step 4: Analyze with LLM

Send the prompts to Ollama for AI analysis:

Single-LLM Mode (Default)

# Using configuration file
uv run sift.py ./prompts --config my_config.yaml

# Or with command line override
uv run sift.py ./prompts --analysis-mode single

Multi-LLM Mode (Enhanced Accuracy)

# Enable multi-LLM mode from the command line
uv run sift.py ./prompts --analysis-mode multi

# With verbose output to see individual analyses
uv run sift.py ./prompts --analysis-mode multi --verbose

🎯 Demo

Try SIFT with the provided demo examples to see how it works in practice.

Quick Demo

The examples/ directory contains sample data to demonstrate SIFT's capabilities:

# Navigate to the project directory
cd /path/to/sift

# Analyze the demo prompts (generated from example GitLeaks findings)
uv run sift.py examples/demo_prompts --output-dir ./demo_output

Demo Examples

The demo includes three realistic scenarios:

1. Production Secret (True Positive)

File: config/production.env
Finding: AWS Access Token AKIA1234567890ABCDEF

AWS_ACCESS_KEY_ID=AKIA1234567890ABCDEF

SIFT Analysis: ✅ TRUE POSITIVE (Confidence: 85%)

"This appears to be a legitimate secret. The file path and context suggest this is production configuration, and the secret format appears to be well-formed with high entropy."

2. Documentation Example (False Positive)

File: docs/README.md
Finding: Generic API Key sk-1234567890abcdef1234567890abcdef

export API_KEY=sk-1234567890abcdef1234567890abcdef

SIFT Analysis: ❌ FALSE POSITIVE (Confidence: 95%)

"This appears to be a documentation example or placeholder value. The file path contains 'docs/' and the secret format suggests this is documentation rather than a real secret."

3. Configuration Template (False Positive)

File: examples/config.example.js
Finding: Generic API Key YOUR_API_KEY_HERE

apiKey: 'YOUR_API_KEY_HERE',

SIFT Analysis: ❌ FALSE POSITIVE (Confidence: 95%)

"This appears to be a documentation example or placeholder value. The file path contains 'examples/' and the secret value looks like a placeholder."

Complete Demo Workflow

Follow this complete workflow to experience SIFT from SARIF input to final analysis:

# 1. Start with the example SARIF file
ls examples/example_gitleaks_sarif.json

# 2. Parse the SARIF file to generate prompts
uv run parse_gitleaks_sarif.py examples/example_gitleaks_sarif.json --output-dir ./demo_prompts

# 3. Analyze with SIFT (single-LLM mode)
uv run sift.py ./demo_prompts --output-dir ./demo_results

# 4. Or try multi-LLM mode for enhanced accuracy
uv run sift.py ./demo_prompts --analysis-mode multi --output-dir ./demo_results_multi

# 5. View the results
ls -la ./demo_results/
cat ./demo_results/*.json

Expected Output Structure

Each analysis produces a JSON result like this:

{
  "timestamp": "2025-07-21T13:46:42.184983",
  "prompt_file": "aws-access-token_config_production.env_12.prompt",
  "model_name": "mistral-small",
  "success": true,
  "analysis": {
    "result": "true",
    "reasons": "This appears to be a legitimate secret...",
    "confidence": 85
  }
}
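
With jq installed, you can tally the verdicts across all result files at once:

# Count how many findings were classified true vs. false positive
jq -r '.analysis.result' ./demo_results/*.json | sort | uniq -c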

Try the demo to see how SIFT can dramatically reduce false positives in your security scanning workflow!

Advanced Usage

Custom Ollama Server

If your Ollama server is running on a different host/port, update your configuration file:

# Edit your config.yaml
ollama:
  url: "http://your-server:11434"
  timeout: 120

Then run with the custom configuration:

uv run sift.py ./prompts --config my_config.yaml

Verbose Output

For detailed processing information:

uv run sift.py ./prompts --verbose

πŸ“ Output Format

The analysis results are saved as JSON files with the following structure:

{
  "timestamp": "2024-01-15T10:30:00.123456",
  "prompt_file": "finding_123.prompt",
  "model_name": "mistral-small",
  "success": true,
  "analysis": {
    "result": "false",
    "reasons": "This appears to be a sample API key in documentation. The file path 'docs/examples/config.md' and the placeholder-like format 'EXAMPLE_KEY_12345' indicate this is documentation rather than a real secret.",
    "confidence": 95
  }
}

Analysis Fields

  • result: "true" for true positive (potential real secret), "false" for false positive
  • reasons: Detailed explanation of the analysis
  • confidence: Confidence level as a percentage (0-100)
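
Given this structure, a jq one-liner can surface just the findings classified as real secrets, together with their confidence (adjust the path to wherever your results were written):

# List suspected real secrets with their confidence scores
jq -r 'select(.analysis.result == "true") | "\(.prompt_file): \(.analysis.confidence)%"' ./results/*.json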

🧠 How It Works

Single-LLM Mode

  1. SARIF Parsing: Extracts secret findings from GitLeaks SARIF output
  2. Prompt Generation: Creates structured prompts for each finding with context
  3. LLM Analysis: Sends prompts to Ollama with a specialized system prompt
  4. Result Processing: Parses and structures the AI analysis results

Multi-LLM Mode (Advanced)

  1. SARIF Parsing: Same as single-LLM mode
  2. Prompt Generation: Same as single-LLM mode
  3. Dual Analysis: Two different LLMs independently analyze each finding
  4. Reviewer Consensus: A third "reviewer" LLM synthesizes the analyses
  5. Final Decision: Provides enhanced accuracy through consensus

🤖 Multi-LLM Analysis Benefits

Why Use Multiple LLMs?

  • Reduced Bias: Different models have different training and biases
  • Higher Accuracy: Consensus approach reduces individual model errors
  • Transparency: See how different models analyze the same data
  • Reliability: Triple-check approach for critical security decisions

Analysis Flow

Security Alert
     |
     v
┌─────────────────┐    ┌─────────────────┐
│   Analyzer 1    │    │   Analyzer 2    │
│ (e.g. Mistral)  │    │ (e.g. Llama)    │
└─────────────────┘    └─────────────────┘
     |                          |
     v                          v
┌─────────────────────────────────────────┐
│            Reviewer LLM                 │
│    (e.g. Qwen - Larger Model)           │
│     Resolves conflicts & decides        │
└─────────────────────────────────────────┘
     |
     v
Final Recommendation

🔧 Configuration

System Prompt

The AI analysis behavior is controlled by the system.prompt file. You can modify this file to:

  • Adjust analysis criteria
  • Add domain-specific rules
  • Customize output format
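
For example, you could append a domain-specific rule; the wording below is purely illustrative, so adapt it to your environment:

# Hypothetical rule: treat a company-internal test pattern as a placeholder
cat >> system.prompt <<'EOF'
Values matching the pattern TEST-ONLY-* are internal test fixtures and must be classified as false positives.
EOF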

🧪 Testing Your Setup

Quick Setup Test

Run the setup test script to verify your configuration:

# Test your current setup
uv run validate_setup.py

This will check:

  • Dependencies installation
  • Ollama server connection
  • Required models availability
  • Configuration file validity
  • System prompt files
  • Basic analysis functionality

Example Configurations

Use pre-configured examples for common scenarios:

# Fast single-LLM analysis
cp examples/config-examples/single-fast.yaml my_config.yaml

# Balanced multi-LLM analysis
cp examples/config-examples/multi-balanced.yaml my_config.yaml

Running Tests

Run the test suite to verify functionality:

# Run tests
uv run -m pytest --cov=sift tests/

πŸ” Troubleshooting

Common Issues

Ollama Connection Errors:

# Check if Ollama is running
ollama list

# Restart Ollama service
ollama serve

Model Not Found:

# List available models
ollama list

# Pull the required model
ollama pull mistral-small

Memory Issues:

  • Use smaller models (mistral:7b instead of larger ones)
  • Process fewer files at once
  • Ensure sufficient RAM (8GB+ recommended)

Debug Mode

Run with verbose output to see detailed processing:

uv run sift.py ./prompts --verbose

🤝 Contributing

We welcome contributions to this project! If you have an idea for a new feature, bug fix, or improvement, please open an issue or submit a pull request. Before contributing, please read our contributing guidelines.

License

Copyright 2025 Amadeus S.A.S.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

📚 Additional Resources

  • GitLeaks: https://github.com/gitleaks/gitleaks
  • Ollama: https://ollama.com

Happy Secret Hunting! 🔍🤖
