Debian Contents: Package File Statistics (CLI)

A small Python command-line tool that downloads Debian's Contents-<arch>.gz index from a mirror, parses it, and prints the 10 packages that have the most files associated with them.

Example:

./package_statistics.py amd64

Output:

<package name 1> :  <number of files>
...
<package name 10>:  <number of files>

Problem summary

Debian repositories provide a "Contents index" mapping file paths to packages.
Each line in Contents-<arch>.gz has the following (simplified) format:

<file-path><whitespace><package[,package...]>

The goal is to count, for each package, how many file paths are associated with it, and print the top 10.
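As an illustration of the line format, here is a minimal parsing sketch. The function name `parse_contents_line` is hypothetical and not necessarily what package_statistics.py uses. It assumes the package list is the last whitespace-separated field, so that file paths containing spaces stay intact. In real indices the package entries are typically section-qualified (for example `utils/foo`); this sketch keeps them unchanged.

```python
from typing import List, Optional, Tuple


def parse_contents_line(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split one Contents line into (file_path, [package, ...]).

    Returns None for blank or malformed lines (hypothetical helper).
    """
    stripped = line.strip()
    if not stripped:
        return None
    # File paths may contain spaces, so split once from the right:
    # the package list is the last whitespace-separated field.
    parts = stripped.rsplit(None, 1)
    if len(parts) != 2:
        return None
    path, package_field = parts
    # Drop empty entries that a stray comma would produce.
    packages = [name for name in package_field.split(",") if name]
    if not packages:
        return None
    return path, packages
```

Splitting from the right rather than the left is the main design choice here: a left split would cut a path such as `usr/share/my file` at its first space.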

Reference: Debian repository format documentation (Contents indices)

Approach and thought process

Design goals

Correctness: follow the documented "Contents" format and handle common edge cases.
Scalability: the Contents file can be large, so avoid loading it fully into memory.
Deterministic output: stable ordering, especially when counts are tied.
Maintainability: clear structure, small functions, type hints, and basic tests.

Key decisions

Streaming download and parsing
- Download the .gz file in chunks to avoid high memory usage.
- Decompress and parse line-by-line using a text wrapper over gzip, so the program scales to large indices.
Counting strategy
- Use collections.Counter to accumulate package -> file_count.
- If a single file path is listed for multiple packages (pkg1,pkg2,...), each package gets an increment, since the index states that file is associated with each listed package.
Deterministic ranking
- Produce a stable ranking by sorting with:
  - primary: count descending
  - secondary: package name ascending
- This makes results reproducible across runs and Python versions.
Cache for faster iteration
- Store the downloaded file under an OS-friendly cache directory (e.g., ~/.cache/... or $XDG_CACHE_HOME/...).
- Provide an option to bypass the cache and force re-download (--no-cache) to keep behavior explicit.
Quality checks
- Keep code close to Python best practices: type hints, small functions, clear error handling.
- Add small unit tests that validate parsing and counting logic independently from network access.
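The streaming, counting, and ranking decisions above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the tool's actual code: the function names (`count_files_per_package`, `top_packages`, `count_from_gzip`) are hypothetical, and the input is assumed to be an already opened binary stream (for example a cached file or an HTTP response body).

```python
import gzip
import io
from collections import Counter
from typing import IO, Iterable, List, Tuple


def count_files_per_package(lines: Iterable[str]) -> Counter:
    """Accumulate package -> file count from an iterable of Contents lines."""
    counts: Counter = Counter()
    for line in lines:
        # The package list is the last whitespace-separated field.
        parts = line.strip().rsplit(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        for package in parts[1].split(","):
            if package:
                # A path listed for several packages counts once for each.
                counts[package] += 1
    return counts


def top_packages(counts: Counter, n: int = 10) -> List[Tuple[str, int]]:
    """Rank by count descending, then package name ascending."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


def count_from_gzip(binary_stream: IO[bytes]) -> Counter:
    """Decompress lazily and count line by line without loading the whole file."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8", errors="replace") as text:
        return count_files_per_package(text)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real Contents-<arch>.gz file.
    sample = b"usr/bin/a  utils/a\nusr/bin/b  utils/a,utils/b\nusr/share/c  utils/c\n"
    stream = io.BytesIO(gzip.compress(sample))
    for name, count in top_packages(count_from_gzip(stream), n=3):
        print(f"{name}: {count}")
```

Sorting on the key `(-count, name)` is what makes ties deterministic. `Counter.most_common` alone would order tied packages by insertion order, which depends on the order of lines in the index.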

Usage

Run

chmod +x package_statistics.py
./package_statistics.py amd64

Options

--top N : show top N packages (default: 10)
--base-url URL : set a different Debian mirror base URL
--no-cache : force re-download
--timeout SECONDS : network timeout
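For orientation, the options listed above could be declared with argparse along these lines. This is a sketch only: the help texts, the `build_parser` name, and the defaults for --base-url and --timeout are assumptions, since the README documents only the default for --top.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="package_statistics.py",
        description="Print the packages with the most files for a Debian architecture.",
    )
    parser.add_argument("arch", help="Debian architecture, e.g. amd64")
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="show top N packages (default: 10)")
    # Assumption: the real default mirror URL is not stated here, so None is used.
    parser.add_argument("--base-url", default=None, metavar="URL",
                        help="set a different Debian mirror base URL")
    parser.add_argument("--no-cache", action="store_true",
                        help="force re-download")
    # Assumption: 30 seconds is a placeholder default, not the tool's value.
    parser.add_argument("--timeout", type=float, default=30.0, metavar="SECONDS",
                        help="network timeout")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["amd64", "--top", "5", "--no-cache"])
    print(args.arch, args.top, args.no_cache)
```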

Development

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt

Run tests

From the repository root:

PYTHONPATH=. pytest -q

Lint / format (optional)

ruff check .
ruff format .

Notes / assumptions

Input architecture is expected to match Debian naming (e.g., amd64, arm64, i386, ppc64el, ...).
The tool assumes the Debian mirror layout used in the assignment (base URL points to .../dists/stable/main/ and the file name pattern is Contents-<arch>.gz).
Parsing is tolerant of extra whitespace and ignores malformed/empty lines.
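The layout assumption can be made concrete with a small URL helper. The name `contents_url` and the mirror host in the usage example are hypothetical placeholders; only the Contents-<arch>.gz file name pattern comes from the notes above.

```python
def contents_url(base_url: str, arch: str) -> str:
    """Join a mirror base URL and the Contents file name for one architecture."""
    # Strip trailing slashes so the result has exactly one separator.
    return f"{base_url.rstrip('/')}/Contents-{arch}.gz"


if __name__ == "__main__":
    # Placeholder host; substitute the base URL of a real mirror.
    print(contents_url("https://mirror.example/debian/dists/stable/main/", "amd64"))
```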

Roadmap

Validate/normalize arch more strictly (or provide an allowed list) and give better CLI feedback.
Add optional progress indication for large downloads (without spamming stdout).
Add retry/backoff for transient network errors.
Fix the --out option: the argument is accepted, but the implementation always uses the cache path. I'd either remove --out or wire it up properly.
Add an integration test that uses a small real .gz fixture file.
Make output machine-readable (--json / --csv) in addition to text.
