FileMetrix collects dataset identifiers and file-level metadata from OAI-PMH repositories, stores them in a PostgreSQL database, and exposes REST APIs and metrics for analysis and monitoring.
This repository contains the FileMetrix application, integration clients, and deployment helpers for local development and containerized runs.
Live demo (example): https://filemetrix.labs.dansdemo.nl/docs
- About
- Key features
- Project layout
- Configuration
- Quick start (development)
- Docker / Compose
- HTTP endpoints and health checks
- Logging & observability
- Troubleshooting
- Contributing
- License
FileMetrix provides harvesting, storage and query capabilities for dataset and file-level metadata. It is intended for research data platforms and services that need to collect and analyse file-level characteristics (size, MIME type, checksums, embargo/publish dates) and produce metrics across repositories.
The service is implemented as a FastAPI application and uses SQLModel/SQLAlchemy for persistence to PostgreSQL.
- OAI-PMH harvesting of dataset identifiers and resumption-token handling
- File-level metadata fetching (via PID fetcher integration)
- Storage of repositories, datasets and file metadata in PostgreSQL
- REST API (FastAPI) with public and protected routes
- Metrics and aggregation endpoints (counts grouped by MIME type, sizes, publication-month grouping, per-repository aggregation)
- Optional email notifications for startup, harvest completion and errors
- Health endpoint and Docker Compose integration for local testing
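To make the metrics feature above more concrete, here is a minimal sketch of the kind of grouping it describes: counting files per MIME type and summing sizes per repository. The record layout and field names (`repo`, `mime_type`, `size`) are assumptions for illustration only; the actual service aggregates from PostgreSQL and its schema may differ.

```python
from collections import Counter, defaultdict

# Hypothetical file records; field names are assumptions, not the real schema.
files = [
    {"repo": "repo-a", "mime_type": "text/csv", "size": 1200},
    {"repo": "repo-a", "mime_type": "application/pdf", "size": 50000},
    {"repo": "repo-b", "mime_type": "text/csv", "size": 800},
]


def count_by_mime(records):
    """Count how many files exist for each MIME type."""
    return Counter(record["mime_type"] for record in records)


def total_size_by_repo(records):
    """Sum file sizes (in bytes) for each repository."""
    totals = defaultdict(int)
    for record in records:
        totals[record["repo"]] += record["size"]
    return dict(totals)


if __name__ == "__main__":
    print(count_by_mime(files))
    print(total_size_by_repo(files))
```

In a deployed setup the same grouping would more likely be pushed down to SQL (`GROUP BY`), which avoids loading every row into memory.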
Top-level source directory: src/filemetrix
- src/filemetrix/main.py: application factory, FastAPI initialization and lifecycle
- src/filemetrix/api/v1/: API routes (PID fetcher, repo discovery, repo metrics, workflow controller, health)
- src/filemetrix/infra/: infrastructure helpers (settings via Dynaconf, database, mail utilities)
  - infra/commons.py: centralized settings proxy and send_mail implementation
  - infra/db.py: SQLModel models and DB helpers
- src/filemetrix/services/: service clients (OAI harvester client, PID fetcher integration, oneprovider/OneData helpers)
- src/filemetrix/utils/: small utilities
- conf/: example and production Dynaconf TOML files
Example tree (abridged):
src/filemetrix/
├─ api/
│  └─ v1/
│     ├─ health.py
│     ├─ pid_fetcher.py
│     ├─ repo_discovery.py
│     ├─ repo_metrics.py
│     └─ repo_workflow_controller.py
├─ infra/
│  ├─ commons.py
│  └─ db.py
├─ services/
│  ├─ oai_harvester_client.py
│  └─ onedata_hugger.py
└─ main.py
Configuration is loaded via Dynaconf from conf/*.toml and environment variables. Copy and customize the example config before running in production:
cp conf/settings.example.toml conf/settings.toml
# or use conf/settings.production.toml as a template
Important environment variables (examples):
- DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME: PostgreSQL connection
- MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO: SMTP settings for notifications (MailDev available for local testing)
- API_PREFIX: API route prefix, e.g. /api/v1
- EXPOSE_PORT: HTTP port (default: 1966)
- FILEMETRIX_SERVICE_API_KEY: API key for protected endpoints
For local development the repository includes .env.example (copy to .env) and conf/settings.example.toml.
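As an illustration of how the DB_* variables listed above could combine into a single connection URL, here is a small sketch using only the standard library. The function name `build_db_url` and the fallback defaults are assumptions for this example; FileMetrix itself reads settings through Dynaconf and may assemble the URL differently.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from DB_* variables.

    The fallback values below are hypothetical defaults chosen for this
    sketch; they are not taken from the FileMetrix configuration.
    """
    user = env.get("DB_USER", "postgres")
    # Percent-encode the password so characters such as '@' or '/' stay valid.
    password = quote_plus(env.get("DB_PASSWORD", ""))
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "filemetrix")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    print(build_db_url())
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.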
Prerequisites:
- Python 3.12+
- A PostgreSQL instance (local, Docker, or remote) or use Docker Compose below
Install and prepare the environment with the uv package manager (installation steps below), or use standard venv/pip:
- Installing uv (minimal steps):
  - Install via pip (works on Linux/macOS/Windows):
pip install --user uv
# or, if you use a virtualenv (recommended):
pip install uv
  - macOS (Homebrew) option: install Python via Homebrew if needed, then install uv with Homebrew:
# install Python if you don't already have it
brew install python
# optionally inspect the uv formula first
brew search uv
brew info uv
# then install uv
brew install uv
- With uv:
uv venv .venv
uv sync --frozen --no-cache
- Standard venv/pip alternative:
python -m venv .venv
source .venv/bin/activate
pip install -e .
Run the development server with autoreload:
# from the repository root
make run-dev
# or directly
.venv/bin/uvicorn src.filemetrix.main:app --reload --host 0.0.0.0 --port 1966
API docs will be available at: http://localhost:1966/docs
The repository includes a Dockerfile and docker-compose.yaml to run the service alongside PostgreSQL and MailDev instances for local testing.
Start services with:
# builds the filemetrix image and starts containers
docker-compose up -d --build
Check logs:
docker-compose logs -f filemetrix
Open the MailDev UI to inspect sent emails: http://localhost:1080
Notes:
- The filemetrix container runs a small prestart validation (src/filemetrix/validate_env.py). The Compose setup uses SKIP_ENV_VALIDATION=1 for local developer convenience; unset it for stricter validation in staging/production.
- The Compose file configures a healthcheck for the filemetrix container that verifies DB connectivity using psql (the Dockerfile installs postgresql-client).
- /: root information (hidden from docs)
- /health: readiness/liveness check (runs a lightweight SELECT 1 against the DB and returns 200/503)
- /docs: OpenAPI/Swagger UI (auto-generated)
OpenAPI tags are defined in main.py and each router is included with a tag and prefix. The API_PREFIX setting can be used to add a global prefix if desired.
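The /health behaviour described above (run `SELECT 1`, answer 200 or 503) can be sketched independently of the web framework. The function name `health_response` and the response body shape are assumptions for illustration; the real handler lives in the FastAPI app and may return different fields.

```python
import sqlite3


def health_response(run_query):
    """Return (status_code, body) for a readiness probe.

    run_query is any callable that executes a SQL string and raises on
    failure, for example a cursor's or connection's execute method.
    """
    try:
        run_query("SELECT 1")
    except Exception as exc:  # any DB failure means "not ready"
        return 503, {"status": "unavailable", "detail": str(exc)}
    return 200, {"status": "ok"}


if __name__ == "__main__":
    # sqlite3 is only a stand-in here so the sketch runs without PostgreSQL.
    connection = sqlite3.connect(":memory:")
    print(health_response(connection.execute))
```

Keeping the probe to a single trivial query means it stays cheap enough for container healthchecks that poll every few seconds.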
- Logging is configured in main.py (rotating file handler by default). Configure LOG_LEVEL and LOG_FILE via env or conf/settings.toml.
- Optional OTLP export can be enabled with OTLP_ENABLE and related settings.
- AttributeError for settings keys (e.g., MAIL_USR): ensure required keys exist in conf/settings.toml or as environment variables. The code performs case-insensitive lookups but prefers canonical uppercase env names.
- DB connection failures on startup: verify DB_HOST, DB_PORT, DB_USER, DB_PASSWORD and ensure Postgres is reachable. Use the validate_env.py CLI to test connectivity:
python src/filemetrix/validate_env.py --strict --db-wait-timeout 60
- Startup email not sent: confirm SMTP settings (MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO) and check the MailDev UI (http://localhost:1080). The app retries a few times on startup to allow MailDev to come up first.
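The startup retry behaviour mentioned in the last item can be pictured with a generic helper like the one below. This is a sketch under assumptions: the name `retry`, the attempt count and the delay are illustrative and are not taken from the FileMetrix source.

```python
import time


def retry(action, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts are exhausted.

    Only OSError is retried: socket connection errors and
    smtplib.SMTPException both derive from it.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # give the SMTP server time to come up
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky_send():
        # Simulates a mail server that refuses the first two connections.
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionRefusedError("SMTP server not ready")
        return "sent"

    print(retry(flaky_send, delay_seconds=0.1))
```

If mail still fails after the retries, checking that MAIL_HOST resolves from inside the container is usually the next step.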
- Open issues and PRs are welcome. Please run linters/tests and keep changes small and focused.
- Use make targets to simplify local tasks (see Makefile): make install, make run-dev, make compose-up, make compose-down.
See the repository LICENSE file for license terms.
Interaction of components and data flow:
C4Container
title FileMetrix – C4 Container Model
Person(user, "User", "Developer, operator, or automation calling the API")
System_Boundary(fm, "FileMetrix") {
Container(api, "FastAPI Application", "Python / FastAPI", "Exposes REST API endpoints for repository discovery, PID fetching, workflow control, and metrics")
ContainerDb(db, "PostgreSQL Database", "PostgreSQL", "Stores repository info, harvested OAI records, file metadata, and derived metrics")
Container(service_discovery, "Repository Discovery Service", "Python", "Validates repositories, queries re3data, and detects OAI-PMH endpoints")
Container(service_harvest, "OAI-PMH Harvester", "Python", "Harvests dataset identifiers and metadata from OAI-PMH endpoints")
Container(service_pid, "PID Metadata Fetcher", "Python", "Fetches PID metadata and dataset-level details from external PID resolvers")
Container(service_onedata, "OneData Metadata Client", "Python", "Fetches fine-grained file-level metadata from OneData/OneProvider")
Container(service_metrics, "Metrics Aggregator", "Python", "Computes aggregated metrics from stored metadata")
}
System_Ext(re3, "re3data", "Registry of research data repositories")
System_Ext(oai, "OAI-PMH Repositories", "OpenArchives-compliant data providers")
System_Ext(pid_ext, "PID Resolvers", "Handle/DOI resolvers or repository PID services")
System_Ext(onedata_ext, "OneData / OneProvider", "File-level metadata provider")
Rel(user, api, "Calls API endpoints")
Rel(api, service_discovery, "Starts repository discovery")
Rel(service_discovery, re3, "Queries registry")
Rel(service_discovery, oai, "Validates OAI-PMH endpoint")
Rel(api, service_harvest, "Triggers harvest")
Rel(service_harvest, oai, "Harvests dataset metadata")
Rel(api, service_pid, "Requests PID metadata")
Rel(service_pid, pid_ext, "Fetches dataset/PID info")
Rel(service_pid, service_onedata, "Requests file-level metadata")
Rel(service_onedata, onedata_ext, "Fetches metadata")
Rel(service_pid, db, "Stores metadata")
Rel(service_harvest, db, "")
Rel(service_onedata, db, "")
Rel(service_metrics, db, "Reads metadata for aggregation")
Rel(api, service_metrics, "")
- The FileMetrix service harvests dataset identifiers via OAI-PMH and stores datasets and file metadata in PostgreSQL. It uses external PID fetcher services and transformer services (configurable) to collect file-level metadata.
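The OAI-PMH harvesting step can be sketched as a loop that follows resumption tokens until the repository stops returning one. The function names and the `fetch_page` callable are assumptions for this example; the element names and namespace come from the OAI-PMH 2.0 protocol, and the actual harvester client in FileMetrix may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Extract record identifiers and the resumption token from one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken element marks the last page.
    if token_element is not None and token_element.text:
        token = token_element.text.strip()
    else:
        token = None
    return identifiers, token


def harvest_identifiers(fetch_page):
    """Yield identifiers from every page of a ListIdentifiers response.

    fetch_page(token) is a hypothetical callable that returns the XML text
    for one page; token is None for the first request.
    """
    token = None
    while True:
        identifiers, token = parse_list_identifiers(fetch_page(token))
        yield from identifiers
        if not token:
            break
```

Injecting `fetch_page` keeps the paging logic separate from HTTP concerns, so it can be exercised with canned XML pages.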
All example calls assume the service runs on http://localhost:1966 and API_PREFIX is /api/v1 (default).
- List discovered repositories (re3data cache)
curl -sS http://localhost:1966/api/v1/repositories | jq '.'
- Fetch repository details (List Sets) from re3data by r3id
curl -sS http://localhost:1966/api/v1/repository-collections/<r3id> | jq '.'
- PID fetcher: retrieve repository info for a PID
curl -sS http://localhost:1966/api/v1/repository-info/doi:10.1234/abcd | jq '.'
- PID fetcher: fetch metadata files for a PID
curl -sS http://localhost:1966/api/v1/doi:10.1234/abcd | jq '.'
- Add a repository (protected route: include your authorization credentials in the request)
curl -X POST http://localhost:1966/api/v1/add-repo \
-H "Content-Type: application/json" \
-d '{"name": "Example repo", "url": "http://example.org/oai", "metadata_prefix": "oai_dc"}'
- Trigger a dataset harvest by repo id
curl -X POST http://localhost:1966/api/v1/harvest/1
- Repo metrics: list repositories
curl -sS http://localhost:1966/api/v1/repos | jq '.'
- Repo metrics: dataset count
curl -sS http://localhost:1966/api/v1/dataset/count | jq '.'
- Health endpoint
curl -v http://localhost:1966/health
A compact summary of the most important env vars / Dynaconf keys (see conf/settings.example.toml and docs/CONFIG.md for more details):
- API_PREFIX: default /api/v1
- EXPOSE_PORT: default 1966
- FILEMETRIX_SERVICE_API_KEY: protect API endpoints
- DB_USER, DB_PASSWORD, DB_HOST, DB_PORT, DB_NAME: Postgres config
- MAIL_HOST, MAIL_PORT, MAIL_FROM, MAIL_TO, MAIL_USE_TLS, MAIL_USE_SSL, MAIL_USE_AUTH: SMTP
- PID_FETCHER_URL: URL of PID fetcher service
- PKL_TOKEN_FILE: path to store OAI resumption tokens
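For scripted access, the API_PREFIX and EXPOSE_PORT keys above are enough to derive endpoint URLs. The sketch below uses only the standard library; `endpoint_url`, `get_json` and the `FILEMETRIX_HOST` variable are hypothetical names introduced for this example and are not part of FileMetrix.

```python
import json
import os
from urllib.request import urlopen


def endpoint_url(path, env=os.environ):
    """Build a full URL for an API path from EXPOSE_PORT and API_PREFIX."""
    port = env.get("EXPOSE_PORT", "1966")
    prefix = env.get("API_PREFIX", "/api/v1").strip("/")
    # FILEMETRIX_HOST is a hypothetical variable used only in this sketch.
    host = env.get("FILEMETRIX_HOST", "localhost")
    parts = [part for part in (prefix, path.strip("/")) if part]
    return f"http://{host}:{port}/" + "/".join(parts)


def get_json(path, env=os.environ, timeout=10):
    """GET an endpoint and decode its JSON body."""
    with urlopen(endpoint_url(path, env), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(endpoint_url("dataset/count"))
```

With the service running locally, `get_json("dataset/count")` would mirror the dataset count curl call. Note that /health is served at the root rather than under the prefix, so it should be requested directly.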