Dans-labs/filemetrix

FileMetrix Service

FileMetrix collects dataset identifiers and file-level metadata from OAI-PMH repositories, stores them in a PostgreSQL database, and exposes REST APIs and metrics for analysis and monitoring.

This repository contains the FileMetrix application, integration clients, and deployment helpers for local development and containerized runs.

Live demo (example): https://filemetrix.labs.dansdemo.nl/docs


Table of contents

  • About
  • Key features
  • Project layout
  • Configuration
  • Quick start (development)
  • Docker / Compose
  • HTTP endpoints and health checks
  • Logging & observability
  • Troubleshooting
  • Contributing
  • License

About

FileMetrix provides harvesting, storage and query capabilities for dataset and file-level metadata. It is intended for research data platforms and services that need to collect and analyse file-level characteristics (size, MIME type, checksums, embargo/publish dates) and produce metrics across repositories.
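To make the idea of file-level metrics concrete, here is a minimal sketch of a count-and-size aggregation grouped by MIME type. The `FileRecord` class and `summarize_by_mime_type` function are hypothetical names used only for illustration; they are not taken from the FileMetrix code base, which stores this data in PostgreSQL rather than in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical in-memory stand-in for one stored file-metadata row."""
    repository: str
    mime_type: str
    size_bytes: int


def summarize_by_mime_type(records):
    """Return {mime_type: {"count": n, "total_bytes": b}} for the given records."""
    summary = defaultdict(lambda: {"count": 0, "total_bytes": 0})
    for record in records:
        bucket = summary[record.mime_type]
        bucket["count"] += 1
        bucket["total_bytes"] += record.size_bytes
    return dict(summary)


# Illustrative data only, not real harvest results.
records = [
    FileRecord("repo-a", "text/csv", 1_200),
    FileRecord("repo-a", "application/pdf", 48_000),
    FileRecord("repo-b", "text/csv", 800),
]
summary = summarize_by_mime_type(records)
print(summary["text/csv"])
```

In a real deployment the same grouping would more likely be expressed as a SQL `GROUP BY` query against the database.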

The service is implemented as a FastAPI application and uses SQLModel/SQLAlchemy for persistence to PostgreSQL.


Key features

  • OAI-PMH harvesting of dataset identifiers and resumption-token handling
  • File-level metadata fetching (via PID fetcher integration)
  • Storage of repositories, datasets and file metadata in PostgreSQL
  • REST API (FastAPI) with public and protected routes
  • Metrics and aggregation endpoints (counts grouped by MIME type, sizes, publication-month grouping, per-repository aggregation)
  • Optional email notifications for startup, harvest completion and errors
  • Health endpoint and Docker Compose integration for local testing

Project layout

Top-level source directory: src/filemetrix

  • src/filemetrix/main.py — application factory, FastAPI initialization and lifecycle
  • src/filemetrix/api/v1/ — API routes (PID fetcher, repo discovery, repo metrics, workflow controller, health)
  • src/filemetrix/infra/ — infrastructure helpers (settings via Dynaconf, database, mail utilities)
    • infra/commons.py — centralized settings proxy and send_mail implementation
    • infra/db.py — SQLModel models and DB helpers
  • src/filemetrix/services/ — service clients (OAI harvester client, PID fetcher integration, oneprovider/OneData helpers)
  • src/filemetrix/utils/ — small utilities
  • conf/ — example and production Dynaconf TOML files

Example tree (abridged):

src/filemetrix/
├─ api/
│  └─ v1/
│     ├─ health.py
│     ├─ pid_fetcher.py
│     ├─ repo_discovery.py
│     ├─ repo_metrics.py
│     └─ repo_workflow_controller.py
├─ infra/
│  ├─ commons.py
│  └─ db.py
├─ services/
│  ├─ oai_harvester_client.py
│  └─ onedata_hugger.py
└─ main.py

Configuration

Configuration is loaded via Dynaconf from conf/*.toml and environment variables. Copy and customize the example config before running in production:

cp conf/settings.example.toml conf/settings.toml
# or use conf/settings.production.toml as a template

Important environment variables (examples):

  • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME — PostgreSQL connection
  • MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO — SMTP settings for notifications (MailDev available for local testing)
  • API_PREFIX — API route prefix, e.g. /api/v1
  • EXPOSE_PORT — HTTP port (default: 1966)
  • FILEMETRIX_SERVICE_API_KEY — API key for protected endpoints

For local development the repository includes .env.example (copy to .env) and conf/settings.example.toml.
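The database variables listed above can be combined into a single connection URL. The following is a minimal sketch of that step, assuming the variables are available as plain environment variables; `build_postgres_url` is a hypothetical helper, and the `localhost`/`5432` fallbacks are assumptions rather than documented FileMetrix defaults.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL URL from the DB_* variables.

    User and password are percent-encoded so characters such as '@' or '/'
    cannot break the URL structure.
    """
    user = quote(env["DB_USER"], safe="")
    password = quote(env["DB_PASSWORD"], safe="")
    host = env.get("DB_HOST", "localhost")  # assumed fallback
    port = env.get("DB_PORT", "5432")       # assumed fallback
    name = env["DB_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# Illustrative values only.
example_env = {
    "DB_USER": "filemetrix",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "db",
    "DB_PORT": "5432",
    "DB_NAME": "filemetrix",
}
url = build_postgres_url(example_env)
print(url)
```

Missing required keys raise `KeyError` immediately, which surfaces configuration mistakes before any connection attempt is made.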


Quick start — Development

Prerequisites:

  • Python 3.12+
  • A PostgreSQL instance (local, Docker, or remote) or use Docker Compose below

Install and prepare the environment with the uv package manager (installation steps below), or use standard venv/pip:

  • Installing uv (minimal steps):

    • Install via pip (works on Linux/macOS/Windows):
    pip install --user uv
    # or, if you use a virtualenv (recommended):
    pip install uv
    • macOS (Homebrew) option: install Python via Homebrew if needed, then install uv from its Homebrew formula:
    # install Python if you don't already have it via Homebrew
    brew install python
    # optionally inspect the uv formula first
    brew search uv
    brew info uv
    # then install uv
    brew install uv
  • With uv:

uv venv .venv
uv sync --frozen --no-cache
  • Standard venv/pip alternative:
python -m venv .venv
source .venv/bin/activate
pip install -e .

Run the development server with autoreload:

# from the repository root
make run-dev
# or directly
.venv/bin/uvicorn src.filemetrix.main:app --reload --host 0.0.0.0 --port 1966

API docs will be available at: http://localhost:1966/docs


Docker / Compose (local integration)

The repository includes a Dockerfile and docker-compose.yaml to run the service alongside a Postgres and MailDev instance for local testing.

Start services with:

# builds the filemetrix image and starts containers
docker-compose up -d --build

Check logs:

docker-compose logs -f filemetrix

Open MailDev UI to inspect sent emails: http://localhost:1080

Notes:

  • The filemetrix container runs a small prestart validation (src/filemetrix/validate_env.py). The Compose setup sets SKIP_ENV_VALIDATION=1 for local developer convenience; unset it for stricter validation in staging/production.
  • The Compose file configures a healthcheck for the filemetrix container that verifies DB connectivity using psql (the Dockerfile installs the postgresql-client).

HTTP endpoints and health checks

  • / — root information (hidden from docs)
  • /health — readiness/liveness check (runs a lightweight SELECT 1 against the DB and returns 200/503)
  • /docs — OpenAPI/Swagger UI (auto-generated)

OpenAPI tags are defined in main.py and each router is included with a tag and prefix. The API_PREFIX setting can be used to add a global prefix if desired.
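The 200/503 behaviour of the health check can be sketched as a small function that runs a database probe and maps the outcome to a status code. This is an illustration only: `health_status` and the response body shape are assumptions, and an in-memory SQLite connection stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3


def health_status(run_probe):
    """Return (status_code, body): 200 if the probe succeeds, 503 otherwise."""
    try:
        run_probe()
    except Exception as exc:
        # Any failure to reach the database is reported as "service unavailable".
        return 503, {"status": "unavailable", "detail": type(exc).__name__}
    return 200, {"status": "ok"}


# SQLite stands in for PostgreSQL here; the probe is the same lightweight query.
conn = sqlite3.connect(":memory:")
ok = health_status(lambda: conn.execute("SELECT 1").fetchone())

# Closing the connection simulates an unreachable database.
conn.close()
failed = health_status(lambda: conn.execute("SELECT 1").fetchone())

print(ok[0], failed[0])
```

Keeping the probe as an injected callable means the same function could wrap a SQLAlchemy session or any other connectivity check.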


Logging & observability


Troubleshooting — common issues

  • AttributeError for settings keys (e.g., MAIL_USR): ensure required keys exist in conf/settings.toml or as environment variables. The code performs case-insensitive lookups but prefers canonical uppercase env names.
  • DB connection failures on startup: verify DB_HOST, DB_PORT, DB_USER, DB_PASSWORD and ensure Postgres is reachable. Use the validate_env.py CLI to test connectivity:
python src/filemetrix/validate_env.py --strict --db-wait-timeout 60
  • Startup email not sent: confirm SMTP settings (MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO) and check MailDev UI (http://localhost:1080). The app retries a few times on startup to allow MailDev to come up first.
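The startup retry mentioned in the last item can be pictured with a small helper like the one below. It is a generic sketch, not the FileMetrix implementation: `retry`, its defaults and the `flaky_send` stand-in are all hypothetical. It catches `OSError` because refused connections and `smtplib.SMTPException` both derive from it.

```python
import time


def retry(action, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:  # covers refused connections and SMTP errors
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


# Hypothetical sender that fails twice before the mail server is ready.
calls = {"count": 0}


def flaky_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionRefusedError("SMTP server not ready")
    return "sent"


result = retry(flaky_send, attempts=5, delay_seconds=0, sleep=lambda _: None)
print(result, calls["count"])
```

If every attempt fails, the last error is re-raised so the caller can log it or continue startup without the notification.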

Contributing

  • Open issues and PRs are welcome. Please run linters/tests and keep changes small and focused.
  • Use make targets to simplify local tasks (see Makefile): make install, make run-dev, make compose-up, make compose-down.

License

See the repository LICENSE file for license terms.


Architecture (high level)

Interaction of components and data flow:

C4Container

title FileMetrix – C4 Container Model

Person(user, "User", "Developer, operator, or automation calling the API")

System_Boundary(fm, "FileMetrix") {

    Container(api, "FastAPI Application", "Python / FastAPI", "Exposes REST API endpoints for repository discovery, PID fetching, workflow control, and metrics")

    ContainerDb(db, "PostgreSQL Database", "PostgreSQL", "Stores repository info, harvested OAI records, file metadata, and derived metrics")

    Container(service_discovery, "Repository Discovery Service", "Python", "Validates repositories, queries re3data, and detects OAI-PMH endpoints")

    Container(service_harvest, "OAI-PMH Harvester", "Python", "Harvests dataset identifiers and metadata from OAI-PMH endpoints")

    Container(service_pid, "PID Metadata Fetcher", "Python", "Fetches PID metadata and dataset-level details from external PID resolvers")

    Container(service_onedata, "OneData Metadata Client", "Python", "Fetches fine-grained file-level metadata from OneData/OneProvider")

    Container(service_metrics, "Metrics Aggregator", "Python", "Computes aggregated metrics from stored metadata")
}

System_Ext(re3, "re3data", "Registry of research data repositories")
System_Ext(oai, "OAI-PMH Repositories", "OpenArchives-compliant data providers")
System_Ext(pid_ext, "PID Resolvers", "Handle/DOI resolvers or repository PID services")
System_Ext(onedata_ext, "OneData / OneProvider", "File-level metadata provider")

Rel(user, api, "Calls API endpoints")

Rel(api, service_discovery, "Starts repository discovery")
Rel(service_discovery, re3, "Queries registry")
Rel(service_discovery, oai, "Validates OAI-PMH endpoint")

Rel(api, service_harvest, "Triggers harvest")
Rel(service_harvest, oai, "Harvests dataset metadata")

Rel(api, service_pid, "Requests PID metadata")
Rel(service_pid, pid_ext, "Fetches dataset/PID info")

Rel(service_pid, service_onedata, "Requests file-level metadata")
Rel(service_onedata, onedata_ext, "Fetches metadata")

Rel(service_pid, db, "Stores metadata")
Rel(service_harvest, db, "")
Rel(service_onedata, db, "")

Rel(service_metrics, db, "Reads metadata for aggregation")
Rel(api, service_metrics, "")
  • The FileMetrix service harvests dataset identifiers via OAI-PMH and stores datasets and file metadata in PostgreSQL. It uses external PID fetcher services and transformer services (configurable) to collect file-level metadata.

Example API calls (curl)

All example calls assume the service runs on http://localhost:1966 and API_PREFIX is /api/v1 (default).

  1. List discovered repositories (re3data cache)
curl -sS http://localhost:1966/api/v1/repositories | jq '.'
  2. Fetch repository details (List Sets) from re3data by r3id
curl -sS http://localhost:1966/api/v1/repository-collections/<r3id> | jq '.'
  3. PID fetcher: retrieve repository info for a PID
curl -sS http://localhost:1966/api/v1/repository-info/doi:10.1234/abcd | jq '.'
  4. PID fetcher: fetch metadata files for a PID
curl -sS http://localhost:1966/api/v1/doi:10.1234/abcd | jq '.'
  5. Add a repository (protected route; requests to protected endpoints must include your authorization credentials)
curl -X POST http://localhost:1966/api/v1/add-repo \
  -H "Content-Type: application/json" \
  -d '{"name": "Example repo", "url": "http://example.org/oai", "metadata_prefix": "oai_dc"}'
  6. Trigger a dataset harvest by repo id
curl -X POST http://localhost:1966/api/v1/harvest/1
  7. Repo metrics: list repositories
curl -sS http://localhost:1966/api/v1/repos | jq '.'
  8. Repo metrics: dataset count
curl -sS http://localhost:1966/api/v1/dataset/count | jq '.'
  9. Health endpoint
curl -v http://localhost:1966/health

Configuration reference (quick)

A compact list of the most important env vars / Dynaconf keys (see conf/settings.example.toml and docs/CONFIG.md for more details):

  • API_PREFIX — default /api/v1
  • EXPOSE_PORT — default 1966
  • FILEMETRIX_SERVICE_API_KEY — protect API endpoints
  • DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME — Postgres config
  • MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO, MAIL_USE_TLS, MAIL_USE_SSL, MAIL_USE_AUTH — SMTP
  • PID_FETCHER_URL — URL of PID fetcher service
  • PKL_TOKEN_FILE — path to store OAI resumption tokens
