filetype-detector

A Python library for detecting file types using multiple inference strategies, including path-based extraction, magic number detection, and AI-powered content analysis.

Features

Multiple Inference Methods: Choose from lexical, magic-based, AI-powered, or cascading inference strategies
Type-Safe API: Type hints and type-safe inference method selection
Flexible Input: Supports both Path objects and string paths
Performance Optimized: Cascading inferencer runs Magic on every file and calls the slower Magika model only for text files
Well-Tested: Comprehensive test suite with logging support
Extensible: Base class architecture for custom inferencer implementations

Installation

Python Package

pip install filetype-detector

Or, when working from a clone of this repository, using rye:

rye sync

System Dependencies

Important: MagicInferencer and CascadingInferencer require the libmagic system library to be installed.

Ubuntu/Debian

sudo apt-get update
sudo apt-get install libmagic1

Fedora/RHEL/CentOS

sudo dnf install file-libs
# or for older versions:
# sudo yum install file-libs

Arch Linux

sudo pacman -S file

macOS

Using Homebrew:

brew install libmagic

Using MacPorts:

sudo port install file

Windows

On Windows, install python-magic-bin, which ships the libmagic binaries alongside the Python bindings:

pip install python-magic-bin

Alternatively, download the libmagic DLL manually from the file.exe releases.

Alpine Linux (Docker)

apk add --no-cache file

Verification

After installation, verify libmagic is available:

file --version

If the command prints a version number, the file utility and the libmagic library it depends on are installed.
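The file command confirms the command-line tool, but it does not show that Python can load the library (and a Windows setup based on python-magic-bin may have no file command at all). A minimal sketch of a Python-side check follows, assuming the python-magic package is installed under its usual module name magic. The helper name python_magic_usable is hypothetical and not part of this library.

```python
def python_magic_usable() -> bool:
    """Return True if python-magic imports and can classify a small buffer."""
    try:
        # python-magic raises ImportError at import time when it cannot
        # locate the libmagic shared library.
        import magic

        # from_buffer is python-magic's call for classifying in-memory bytes.
        magic.from_buffer(b"%PDF-1.4\n", mime=True)
    except Exception:
        # Any failure (missing package, missing library, or an unrelated
        # module that happens to be named "magic") counts as "not usable".
        return False
    return True


if __name__ == "__main__":
    print("python-magic usable:", python_magic_usable())
```

If this prints False while file --version works, the Python bindings are probably missing or cannot find the shared library.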

Quick Start

Recommended: Use CascadingInferencer for the best balance of performance and accuracy:

from filetype_detector.mixture_inferencer import CascadingInferencer

inferencer = CascadingInferencer()
extension = inferencer.infer("document.pdf")  # Returns: '.pdf'

For more examples and usage patterns, see the User Guide.

Performance Comparison

Choose the right inferencer based on your needs:

| Inferencer | Avg. Time (per file) | Memory | Throughput | Best For |
|---|---|---|---|---|
| LexicalInferencer | < 0.001ms | Minimal | 50,000+ files/sec | Trusted extensions |
| MagicInferencer | ~1-5ms | Low | 200-500 files/sec | Content-based detection |
| MagikaInferencer | ~5-10ms* | High** | 100-200 files/sec | Highest accuracy (text) |
| CascadingInferencer | ~1-6ms | Medium | 150-400 files/sec | ⭐ Recommended default |

* After initial model load (~100-200ms one-time overhead)
** Model loaded into memory (~50-100MB)
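
Timings like these depend on hardware, file sizes, and storage, so it can be worth measuring on your own files. Below is a rough timing sketch that assumes only a callable taking a path and returning an extension, such as the infer method of any inferencer in this README. The helper average_ms_per_file is hypothetical and not part of the library, and the stand-in callable in the usage section is just pathlib, not a real inferencer.

```python
import time
from pathlib import Path
from typing import Callable, Iterable


def average_ms_per_file(infer: Callable[[str], str], paths: Iterable[str]) -> float:
    """Call infer once per path and return the mean wall-clock time in ms."""
    path_list = list(paths)
    if not path_list:
        raise ValueError("paths must not be empty")
    start = time.perf_counter()
    for path in path_list:
        infer(path)
    elapsed_seconds = time.perf_counter() - start
    return elapsed_seconds * 1000.0 / len(path_list)


if __name__ == "__main__":
    # Stand-in for inferencer.infer; swap in a real inferencer to benchmark it.
    def suffix_only(path: str) -> str:
        return Path(path).suffix

    sample = ["document.pdf", "script.py", "file_without_ext"] * 1000
    print(f"{average_ms_per_file(suffix_only, sample):.6f} ms per file")
```

For MagikaInferencer, run one warm-up call before timing so the one-time model load is not counted in the average.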

Recommendation

For most use cases: Use CascadingInferencer. It runs Magic on every file and then calls Magika only for files detected as text, which gives the best balance of performance and accuracy.

For specific needs:

Maximum speed: LexicalInferencer (when extensions are trusted)
Content-based detection: MagicInferencer (general purpose, binary files)
Highest accuracy: MagikaInferencer (text files, confidence scores)

Available Inferencers

LexicalInferencer

Fastest method - extracts file extensions directly from paths without reading file contents.

from filetype_detector.lexical_inferencer import LexicalInferencer

inferencer = LexicalInferencer()
extension = inferencer.infer("document.pdf")  # Returns: '.pdf'
extension = inferencer.infer("file_without_ext")  # Returns: ''
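
To illustrate what path-only extraction means, here is a minimal sketch built on the standard library's pathlib. It is not the library's implementation, and the function name lexical_extension is hypothetical. How LexicalInferencer treats multi-part suffixes such as .tar.gz is not stated in this README, so the sketch simply follows pathlib's behaviour.

```python
from pathlib import Path
from typing import Union


def lexical_extension(file_path: Union[Path, str]) -> str:
    """Return the final suffix of the path without touching the file system."""
    # Path.suffix is purely lexical: the file does not need to exist.
    return Path(file_path).suffix


if __name__ == "__main__":
    for candidate in ("document.pdf", "file_without_ext", Path("archive.tar.gz")):
        print(repr(candidate), "->", repr(lexical_extension(candidate)))
```

Because nothing is read from disk, a renamed file (for example a PNG saved as report.pdf) is reported by its name, not by its content.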

MagicInferencer

Uses python-magic (libmagic) to detect file types based on magic numbers and file signatures.

from filetype_detector.magic_inferencer import MagicInferencer

inferencer = MagicInferencer()
extension = inferencer.infer("file.dat")  # Returns actual type based on content

System Requirements: Requires libmagic system library. See Installation section.
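
The idea behind magic-number detection can be shown with a deliberately tiny sketch that uses only the standard library. This is an illustration of the concept, not how MagicInferencer or libmagic work internally: libmagic relies on a much larger rule database. The names SIGNATURES, sniff_extension, and sniff_file are hypothetical.

```python
from typing import Optional

# A few well-known leading byte sequences and a conventional extension each.
SIGNATURES = (
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    # ZIP containers also cover formats such as .docx and .jar, which is one
    # reason real detectors look further than the first bytes.
    (b"PK\x03\x04", ".zip"),
)


def sniff_extension(header: bytes) -> Optional[str]:
    """Return an extension if the header starts with a known signature."""
    for magic_bytes, extension in SIGNATURES:
        if header.startswith(magic_bytes):
            return extension
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first bytes of a file and match them against SIGNATURES."""
    with open(path, "rb") as handle:
        return sniff_extension(handle.read(8))  # longest signature is 8 bytes
```

The file name plays no part in the decision, which is why content-based detection can identify a file such as file.dat.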

MagikaInferencer

AI-powered detection with confidence scores. Especially effective for text files.

from filetype_detector.magika_inferencer import MagikaInferencer

inferencer = MagikaInferencer()
extension = inferencer.infer("script.py")  # Returns: '.py'

# With confidence score
extension, score = inferencer.infer_with_score("data.json")  # Returns: ('.json', 0.98)
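
One way to use the confidence score is to accept a result only above a threshold. The sketch below assumes nothing beyond the (extension, score) tuple shown above. The helper infer_with_threshold, its min_score default of 0.9, and the empty-string fallback are all hypothetical choices for illustration, and the stand-in function in the usage section is not a real Magika call.

```python
from typing import Callable, Tuple


def infer_with_threshold(
    infer_with_score: Callable[[str], Tuple[str, float]],
    path: str,
    min_score: float = 0.9,
    fallback: str = "",
) -> str:
    """Return the detected extension only when its score reaches min_score."""
    extension, score = infer_with_score(path)
    return extension if score >= min_score else fallback


if __name__ == "__main__":
    # Stand-in for MagikaInferencer().infer_with_score.
    def fake_infer_with_score(path: str) -> Tuple[str, float]:
        return (".json", 0.98)

    print(infer_with_threshold(fake_infer_with_score, "data.json"))
```

With a real inferencer, pass inferencer.infer_with_score as the first argument.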

CascadingInferencer ⭐ Recommended

Smart two-stage approach: uses Magic for all files, then Magika for text files.

from filetype_detector.mixture_inferencer import CascadingInferencer

inferencer = CascadingInferencer()

# Text file - uses Magic then Magika
extension = inferencer.infer("script.py")  # Returns: '.py' (from Magika)

# Binary file - uses Magic only
extension = inferencer.infer("document.pdf")  # Returns: '.pdf' (from Magic)

System Requirements: Requires libmagic system library. See Installation section.
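
The two-stage control flow can be sketched with plain callables. This is an assumption-laden illustration, not the library's code: it assumes the first stage reports a MIME type and that text files are recognised by a text/ prefix, neither of which this README states. The names cascade_infer, magic_stage, and magika_stage are hypothetical.

```python
from typing import Callable, Tuple


def cascade_infer(
    path: str,
    magic_stage: Callable[[str], Tuple[str, str]],
    magika_stage: Callable[[str], str],
) -> str:
    """Run the cheap stage on every file and the costly stage on text only."""
    # Stage 1 (assumed shape): returns (mime_type, extension) for any file.
    mime_type, magic_extension = magic_stage(path)
    if mime_type.startswith("text/"):
        # Stage 2: refine text files, where a generic "text/plain" answer
        # says little about the actual format.
        return magika_stage(path)
    # Binary files keep the stage 1 answer; stage 2 is never called.
    return magic_extension


if __name__ == "__main__":
    def fake_magic(path: str) -> Tuple[str, str]:
        if path.endswith(".py"):
            return ("text/plain", ".txt")
        return ("application/pdf", ".pdf")

    def fake_magika(path: str) -> str:
        return ".py"

    print(cascade_infer("script.py", fake_magic, fake_magika))
    print(cascade_infer("document.pdf", fake_magic, fake_magika))
```

Skipping the second stage for binary files is what keeps the average cost close to that of the first stage alone.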

Key Features

✅ Multiple inference strategies - Choose the right method for your use case
✅ Type-safe API - Full type hints and type-safe method selection
✅ Flexible input - Supports both Path objects and string paths
✅ Performance optimized - Cascading inferencer intelligently combines methods
✅ Well-tested - Comprehensive test suite
✅ Extensible - Base class architecture for custom implementations

For detailed usage examples, error handling, and advanced patterns, see the User Guide.

Testing

Run the test suite:

pytest tests/ -v

With log output shown (-s disables pytest's output capture, so loguru messages reach the terminal):

pytest tests/ -v -s

Run specific test files:

pytest tests/test_cascading_inferencer.py -v
pytest tests/test_magic_inferencer.py -v
pytest tests/test_magika_inferencer.py -v
pytest tests/test_lexical_inferencer.py -v

Architecture

Base Class

All inferencers inherit from BaseInferencer, which defines a common interface:

from abc import ABC, abstractmethod
from typing import Union
from pathlib import Path

class BaseInferencer(ABC):
    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        """Infer file format from path."""
        raise NotImplementedError

Custom Inferencer

You can create custom inferencers by subclassing BaseInferencer:

from filetype_detector.base_inferencer import BaseInferencer
from typing import Union
from pathlib import Path

class CustomInferencer(BaseInferencer):
    def infer(self, file_path: Union[Path, str]) -> str:
        # Your custom logic here
        return ".custom"
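
As a slightly fuller sketch of the same pattern, the example below chains several inferencers and returns the first non-empty answer. To stay runnable without the package installed, it defines a local stand-in for BaseInferencer that mirrors the interface shown above. SuffixInferencer and FallbackInferencer are hypothetical names and not part of the library.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Sequence, Union


class BaseInferencer(ABC):
    """Local stand-in mirroring the BaseInferencer interface in this README."""

    @abstractmethod
    def infer(self, file_path: Union[Path, str]) -> str:
        raise NotImplementedError


class SuffixInferencer(BaseInferencer):
    """Return the lower-cased path suffix, or '' when there is none."""

    def infer(self, file_path: Union[Path, str]) -> str:
        return Path(file_path).suffix.lower()


class FallbackInferencer(BaseInferencer):
    """Try each inferencer in order and return the first non-empty result."""

    def __init__(self, inferencers: Sequence[BaseInferencer]) -> None:
        self._inferencers = list(inferencers)

    def infer(self, file_path: Union[Path, str]) -> str:
        for inferencer in self._inferencers:
            extension = inferencer.infer(file_path)
            if extension:
                return extension
        return ""


if __name__ == "__main__":
    chain = FallbackInferencer([SuffixInferencer()])
    print(chain.infer("REPORT.PDF"))
```

With the real package, import BaseInferencer from filetype_detector.base_inferencer in place of the stand-in class.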

Documentation

📚 Full documentation available at: https://filetype-detector.readthedocs.io

Getting Started - Installation and basic usage
User Guide - Comprehensive guide with examples and performance tips
API Reference - Complete API documentation

Dependencies

python-magic>=0.4.27: For magic number-based file detection
magika>=1.0.1: Google's AI-powered file type detection
pytest>=8.4.2: Testing framework
loguru>=0.7.3: Logging (used in tests)

Requirements

Python >= 3.8

License

This project is open source. See LICENSE file for details.

Contributing

Contributions are welcome! Please ensure:

All tests pass: pytest tests/ -v
Code follows the existing style
New features include appropriate tests
Documentation is updated

Acknowledgments

python-magic for libmagic bindings
Google Magika for AI-powered file type detection
