A Python library for detecting file types using multiple inference strategies, including path-based extraction, magic number detection, and AI-powered content analysis.
- Multiple Inference Methods: Choose from lexical, magic-based, AI-powered, or cascading inference strategies
- Type-Safe API: Type hints and type-safe inference method selection
- Flexible Input: Supports both Path objects and string paths
- Performance Optimized: Cascading inferencer intelligently combines methods for optimal performance
- Well-Tested: Comprehensive test suite with logging support
- Extensible: Base class architecture for custom inferencer implementations
pip install filetype-detector
Or using rye:
rye sync
Important: MagicInferencer and CascadingInferencer require the libmagic system library to be installed.
Debian/Ubuntu:
sudo apt-get update
sudo apt-get install libmagic1
Fedora/RHEL:
sudo dnf install file-libs
# or for older versions:
# sudo yum install file-libs
Arch Linux:
sudo pacman -S file
Using Homebrew:
brew install libmagic
Using MacPorts:
sudo port install file
Windows users need to use python-magic-bin as an alternative:
pip install python-magic-bin
Or download the libmagic DLL manually from file.exe releases.
Alpine Linux:
apk add --no-cache file
After installation, verify libmagic is available:
file --version
If the command works, libmagic is properly installed.
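The same check can be approximated from Python, which is convenient in setup scripts or CI. The sketch below uses only the standard library; the helper name libmagic_status is hypothetical and not part of filetype-detector. A result of None is only a hint: on Windows with python-magic-bin, for example, the DLL ships inside the package and may not be found this way.

```python
import ctypes.util
import shutil


def libmagic_status() -> dict:
    """Report whether libmagic and the `file` CLI appear to be discoverable."""
    return {
        # find_library returns a library name or path, or None if nothing is found
        "libmagic": ctypes.util.find_library("magic"),
        # shutil.which returns the full path of the `file` executable, or None
        "file_cli": shutil.which("file"),
    }


if __name__ == "__main__":
    for name, location in libmagic_status().items():
        print(f"{name}: {location or 'not found'}")
```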
Recommended: Use CascadingInferencer for the best balance of performance and accuracy:
from filetype_detector.mixture_inferencer import CascadingInferencer
inferencer = CascadingInferencer()
extension = inferencer.infer("document.pdf")  # Returns: '.pdf'
For more examples and usage patterns, see the User Guide.
Choose the right inferencer based on your needs:
| Inferencer | Avg. Time (per file) | Memory | Throughput | Best For |
|---|---|---|---|---|
| LexicalInferencer | < 0.001ms | Minimal | 50,000+ files/sec | Trusted extensions |
| MagicInferencer | ~1-5ms | Low | 200-500 files/sec | Content-based detection |
| MagikaInferencer | ~5-10ms* | High** | 100-200 files/sec | Highest accuracy (text) |
| CascadingInferencer | ~1-6ms | Medium | 150-400 files/sec | ⭐ Recommended default |
* After initial model load (~100-200ms one-time overhead)
** Model loaded into memory (~50-100MB)
For most use cases: Use CascadingInferencer. It runs Magic on every file and applies Magika only to files identified as text, which gives the best balance of performance and accuracy.
For specific needs:
- Maximum speed: LexicalInferencer (when extensions are trusted)
- Content-based detection: MagicInferencer (general purpose, binary files)
- Highest accuracy: MagikaInferencer (text files, confidence scores)
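The timings in the table will vary with hardware, file sizes, and storage, so it can help to measure on your own data before choosing. Below is a minimal timing harness, offered as a sketch: measure_throughput is a hypothetical helper, and the stand-in callable uses pathlib rather than a real inferencer so the example runs without the library installed. To benchmark a real inferencer, pass its infer method instead.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def measure_throughput(infer: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Time one pass of `infer` over `paths` and return simple statistics."""
    path_list = list(paths)
    start = time.perf_counter()
    for path in path_list:
        infer(path)
    elapsed = time.perf_counter() - start
    count = len(path_list)
    return {
        "files": count,
        "seconds": elapsed,
        "avg_ms_per_file": (elapsed / count) * 1000 if count else 0.0,
        "files_per_sec": count / elapsed if elapsed > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in for inferencer.infer; it only looks at the path string.
    stand_in = lambda path: Path(path).suffix
    sample_paths = [f"file_{i}.txt" for i in range(10_000)]
    print(measure_throughput(stand_in, sample_paths))
```

For MagikaInferencer, consider timing a warm-up call separately, since the table notes a one-time model load.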
Fastest method - extracts file extensions directly from paths without reading file contents.
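To make "path-based extraction" concrete, here is a standard-library sketch of the idea. It is an illustration, not the library's implementation: lexical_extension is a hypothetical name, and details such as case handling or multi-part suffixes may differ in LexicalInferencer.

```python
from pathlib import Path
from typing import Union


def lexical_extension(file_path: Union[Path, str]) -> str:
    """Return the final suffix of a path, or '' when there is none."""
    # No file is opened; only the path string is inspected.
    return Path(file_path).suffix


print(lexical_extension("document.pdf"))            # .pdf
print(lexical_extension(Path("archive.tar.gz")))    # .gz
print(repr(lexical_extension("file_without_ext")))  # ''
```

Because nothing is read from disk, the result is only as trustworthy as the file name itself.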
from filetype_detector.lexical_inferencer import LexicalInferencer
inferencer = LexicalInferencer()
extension = inferencer.infer("document.pdf") # Returns: '.pdf'
extension = inferencer.infer("file_without_ext")  # Returns: ''
Uses python-magic (libmagic) to detect file types based on magic numbers and file signatures.
from filetype_detector.magic_inferencer import MagicInferencer
inferencer = MagicInferencer()
extension = inferencer.infer("file.dat")  # Returns actual type based on content
System Requirements: Requires the libmagic system library. See the Installation section.
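For intuition about what magic-number detection involves, the sketch below compares a file's leading bytes against a handful of well-known signatures. It is a deliberately tiny illustration using only the standard library: SIGNATURES and sniff_extension are hypothetical names, and libmagic consults a far larger rule database than this table.

```python
import tempfile
from pathlib import Path
from typing import Union

# A small, illustrative signature table: (leading bytes, extension).
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"PK\x03\x04", ".zip"),
    (b"\x1f\x8b", ".gz"),
]


def sniff_extension(file_path: Union[Path, str]) -> str:
    """Guess an extension from the first bytes of a file; '' if unknown."""
    with open(file_path, "rb") as handle:
        header = handle.read(16)
    for magic, extension in SIGNATURES:
        if header.startswith(magic):
            return extension
    return ""


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The misleading name does not matter; only the content is inspected.
        sample = Path(tmp) / "file.dat"
        sample.write_bytes(b"%PDF-1.7\n")
        print(sniff_extension(sample))
```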
AI-powered detection with confidence scores. Especially effective for text files.
from filetype_detector.magika_inferencer import MagikaInferencer
inferencer = MagikaInferencer()
extension = inferencer.infer("script.py") # Returns: '.py'
# With confidence score
extension, score = inferencer.infer_with_score("data.json")  # Returns: ('.json', 0.98)
Smart two-stage approach: uses Magic for all files, then Magika for text files.
from filetype_detector.mixture_inferencer import CascadingInferencer
inferencer = CascadingInferencer()
# Text file - uses Magic then Magika
extension = inferencer.infer("script.py") # Returns: '.py' (from Magika)
# Binary file - uses Magic only
extension = inferencer.infer("document.pdf")  # Returns: '.pdf' (from Magic)
System Requirements: Requires the libmagic system library. See the Installation section.
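The two-stage flow can be sketched with plain callables. This is an assumption-laden illustration rather than the library's code: cascade, TEXT_LIKE, fake_magic, and fake_magika are all hypothetical, and how CascadingInferencer actually decides that a file is text is not specified here. The stubs look at the file name only so the example runs without libmagic or Magika installed.

```python
from typing import Callable

Inferencer = Callable[[str], str]

# Assumption: first-stage results that are treated as "text, worth a second look".
TEXT_LIKE = {".txt", ""}


def cascade(file_path: str, first: Inferencer, second: Inferencer) -> str:
    """Run `first` on every file; consult `second` only for text-like results."""
    first_guess = first(file_path)
    if first_guess not in TEXT_LIKE:
        return first_guess  # binary file: the first stage is enough
    # Text file: prefer the second stage, fall back to the first guess.
    return second(file_path) or first_guess


def fake_magic(path: str) -> str:
    """Stand-in for a magic-based stage (decides by name, for the demo only)."""
    return ".pdf" if path.endswith(".pdf") else ".txt"


def fake_magika(path: str) -> str:
    """Stand-in for a content-model stage (decides by name, for the demo only)."""
    return ".py" if path.endswith(".py") else ""


print(cascade("script.py", fake_magic, fake_magika))     # .py
print(cascade("document.pdf", fake_magic, fake_magika))  # .pdf
```

Skipping the second stage for binary files is what keeps the average cost close to that of the first stage alone.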
- ✅ Multiple inference strategies - Choose the right method for your use case
- ✅ Type-safe API - Full type hints and type-safe method selection
- ✅ Flexible input - Supports both Path objects and string paths
- ✅ Performance optimized - Cascading inferencer intelligently combines methods
- ✅ Well-tested - Comprehensive test suite
- ✅ Extensible - Base class architecture for custom implementations
For detailed usage examples, error handling, and advanced patterns, see the User Guide.
Run the test suite:
pytest tests/ -v
With logging (using loguru):
pytest tests/ -v -s
Run specific test files:
pytest tests/test_cascading_inferencer.py -v
pytest tests/test_magic_inferencer.py -v
pytest tests/test_magika_inferencer.py -v
pytest tests/test_lexical_inferencer.py -v
All inferencers inherit from BaseInferencer, which defines a common interface:
from abc import ABC, abstractmethod
from typing import Union
from pathlib import Path

class BaseInferencer(ABC):
    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        """Infer file format from path."""
        raise NotImplementedError

You can create custom inferencers by subclassing BaseInferencer:
from filetype_detector.base_inferencer import BaseInferencer
from typing import Union
from pathlib import Path

class CustomInferencer(BaseInferencer):
    def infer(self, file_path: Union[Path, str]) -> str:
        # Your custom logic here
        return ".custom"

📚 Full documentation available at: https://filetype-detector.readthedocs.io
- Getting Started - Installation and basic usage
- User Guide - Comprehensive guide with examples and performance tips
- API Reference - Complete API documentation
- python-magic>=0.4.27: For magic number-based file detection
- magika>=1.0.1: Google's AI-powered file type detection
- pytest>=8.4.2: Testing framework
- loguru>=0.7.3: Logging (used in tests)
- Python >= 3.8
This project is open source. See LICENSE file for details.
Contributions are welcome! Please ensure:
- All tests pass: pytest tests/ -v
- Code follows the existing style
- New features include appropriate tests
- Documentation is updated
- python-magic for libmagic bindings
- Google Magika for AI-powered file type detection