A small Python command line tool that downloads Debian's Contents-<arch>.gz index from a mirror, parses it, and prints the top 10 packages with the most files.
Example:
./package_statistics.py amd64
Output:
<package name 1> : <number of files>
...
<package name 10>: <number of files>
Debian repositories provide a "Contents index" mapping file paths to packages.
Each line in Contents-<arch>.gz has the following format (simplified):
<file-path><whitespace><package[,package...]>
The goal is to count, for each package, how many file paths are associated with it, and print the top 10.
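The line format above can be parsed with a few lines of Python. This is a minimal sketch, not necessarily how package_statistics.py does it: the function name parse_contents_line is hypothetical, and it assumes that file paths may contain spaces (so the split happens on the last run of whitespace) and that package names are kept exactly as listed in the index.

```python
from typing import List, Optional, Tuple


def parse_contents_line(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split one Contents line into (file_path, [package, ...]).

    Returns None for empty or malformed lines (hypothetical helper).
    """
    stripped = line.strip()
    if not stripped:
        return None
    # File paths may contain spaces, so split on the LAST run of whitespace.
    parts = stripped.rsplit(maxsplit=1)
    if len(parts) != 2:
        return None
    path, package_field = parts
    # The package column is a comma-separated list; drop empty entries.
    packages = [name for name in package_field.split(",") if name]
    if not packages:
        return None
    return path, packages
```

Returning None for bad lines lets the caller skip them without raising, which matches the tolerant parsing the tool aims for.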
Reference: Debian repository format documentation (Contents indices)
- Correctness: follow the documented "Contents" format and handle common edge cases.
- Scalability: the Contents file can be large, so avoid loading it fully into memory.
- Deterministic output: stable ordering, especially when counts are tied.
- Maintainability: clear structure, small functions, type hints, and basic tests.
- Streaming download and parsing
  - Download the .gz file in chunks to avoid high memory usage.
  - Decompress and parse line-by-line using a text wrapper over gzip, so the program scales to large indices.
- Counting strategy
  - Use collections.Counter to accumulate package -> file_count.
  - If a single file path is listed for multiple packages (pkg1,pkg2,...), each package gets an increment, since the index states that the file is associated with each listed package.
- Deterministic ranking
  - Produce a stable ranking by sorting with:
    - primary: count descending
    - secondary: package name ascending
  - This makes results reproducible across runs and Python versions.
- Cache for faster iteration
  - Store the downloaded file under an OS-friendly cache directory (e.g., ~/.cache/... or $XDG_CACHE_HOME/...).
  - Provide an option to bypass the cache and force re-download (--no-cache) to keep behavior explicit.
- Quality checks
  - Keep code close to Python best practices: type hints, small functions, clear error handling.
  - Add small unit tests that validate parsing and counting logic independently from network access.
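The streaming, counting, and ranking decisions above can be sketched together as follows. This is an illustration under stated assumptions rather than the tool's actual code: the function names count_files_per_package and top_packages are hypothetical, and undecodable bytes are replaced instead of raising.

```python
import gzip
import io
from collections import Counter
from typing import BinaryIO, List, Tuple


def count_files_per_package(gz_stream: BinaryIO) -> Counter:
    """Stream a gzip-compressed Contents index and count files per package."""
    counts: Counter = Counter()
    # mode="rt" wraps the decompressed stream in a text layer, so lines are
    # decoded and yielded one at a time instead of loading the whole file.
    with gzip.open(gz_stream, mode="rt", encoding="utf-8", errors="replace") as text:
        for line in text:
            parts = line.strip().rsplit(maxsplit=1)
            if len(parts) != 2:
                continue  # empty or malformed line
            for package in parts[1].split(","):
                if package:
                    counts[package] += 1  # every listed package gets an increment
    return counts


def top_packages(counts: Counter, n: int = 10) -> List[Tuple[str, int]]:
    """Rank by count descending, then by package name ascending."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    # Tiny in-memory sample so the sketch runs without network access.
    sample = (
        b"usr/bin/a  utils/alpha\n"
        b"usr/bin/b  utils/beta,utils/alpha\n"
        b"usr/bin/c  utils/beta\n"
    )
    stream = io.BytesIO(gzip.compress(sample))
    for name, count in top_packages(count_files_per_package(stream)):
        print(f"{name}: {count}")
```

Sorting on the (-count, name) key gives an explicit tie-break, so the order does not depend on the order in which packages were first seen.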
chmod +x package_statistics.py
./package_statistics.py amd64
- --top N: show top N packages (default: 10)
- --base-url URL: set a different Debian mirror base URL
- --no-cache: force re-download
- --timeout SECONDS: network timeout
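For illustration, the options listed above could be declared with argparse roughly like this. It is a sketch only: build_parser is a hypothetical name, and the 30-second timeout default and the None default for --base-url are assumptions, since the defaults for those options are not stated here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the CLI described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="package_statistics.py",
        description="Print the packages with the most files in a Debian Contents index.",
    )
    parser.add_argument("arch", help="Debian architecture, e.g. amd64")
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="show top N packages (default: 10)")
    # Assumption: None means "use the tool's built-in mirror base URL".
    parser.add_argument("--base-url", default=None, metavar="URL",
                        help="set a different Debian mirror base URL")
    parser.add_argument("--no-cache", action="store_true",
                        help="force re-download")
    # Assumption: the 30-second default is illustrative only.
    parser.add_argument("--timeout", type=float, default=30.0, metavar="SECONDS",
                        help="network timeout")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```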
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
From the repository root:
PYTHONPATH=. pytest -q
ruff check .
ruff format .
- Input architecture is expected to match Debian naming (e.g., amd64, arm64, i386, ppc64el, ...).
- The tool assumes the Debian mirror layout used in the assignment (base URL points to .../dists/stable/main/ and the file name pattern is Contents-<arch>.gz).
- Parsing is tolerant of extra whitespace and ignores malformed/empty lines.
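The mirror layout assumption, combined with the cache directory and chunked download described earlier, might look like the sketch below. All names here (contents_url, cache_path, fetch, the package_statistics cache folder, and ASSUMED_BASE_URL) are hypothetical and chosen for illustration; the tool's real defaults may differ.

```python
import os
import shutil
import urllib.request
from pathlib import Path

# Assumption: an example base URL following the .../dists/stable/main/ layout.
ASSUMED_BASE_URL = "https://deb.debian.org/debian/dists/stable/main/"


def contents_url(base_url: str, arch: str) -> str:
    """Build the index URL from the base URL and the Contents-<arch>.gz pattern."""
    return f"{base_url.rstrip('/')}/Contents-{arch}.gz"


def cache_path(arch: str, app_name: str = "package_statistics") -> Path:
    """Prefer $XDG_CACHE_HOME, falling back to ~/.cache."""
    root = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(root) / app_name / f"Contents-{arch}.gz"


def fetch(url: str, dest: Path, timeout: float = 30.0,
          use_cache: bool = True, chunk_size: int = 64 * 1024) -> Path:
    """Download url to dest in chunks, reusing a cached copy unless told not to."""
    if use_cache and dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url, timeout=timeout) as response, open(tmp, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time.
        shutil.copyfileobj(response, out, chunk_size)
    tmp.replace(dest)  # only expose the file once the download has completed
    return dest
```

Writing to a temporary .part file first means an interrupted download never leaves a truncated file in the cache. A call such as fetch(contents_url(ASSUMED_BASE_URL, "amd64"), cache_path("amd64"), use_cache=False) would correspond to running with --no-cache.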
- Validate/normalize arch more strictly (or provide allowed list) and give better CLI feedback.
- Add optional progress indication for large downloads (without spamming stdout).
- Add retry/backoff for transient network errors.
- Improve the --out option: currently an argument exists, but the implementation always uses the cache path; I’d either remove --out or wire it properly.
- Add an integration test that uses a small real .gz fixture file.
- Make output machine-readable (--json / --csv) in addition to text.