
pooyaww/package_statistics


Debian Contents: Package File Statistics (CLI)

A small Python command-line tool that downloads Debian's Contents-<arch>.gz index from a mirror, parses it, and prints the 10 packages that own the most files.

Example:

./package_statistics.py amd64

Output:

<package name 1> :  <number of files>
...
<package name 10>:  <number of files>

Problem summary

Debian repositories provide a "Contents index" mapping file paths to packages.
Each line in Contents-<arch>.gz has the following (simplified) format:

<file-path><whitespace><package[,package...]>

The goal is to count, for each package, how many file paths are associated with it, and print the top 10.
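As an illustration of the line format above, here is a minimal parsing sketch. The function name parse_contents_line is hypothetical and not taken from the repository; it assumes the package list is the last whitespace-separated field, so that paths containing spaces still parse, and it keeps each package entry verbatim.

```python
from typing import List, Optional, Tuple


def parse_contents_line(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split one Contents line into (file_path, [package, ...]).

    Returns None for empty or malformed lines (hypothetical helper).
    """
    stripped = line.strip()
    if not stripped:
        return None
    # The package list is the last whitespace-separated field. The path
    # itself may contain spaces, so split from the right exactly once.
    parts = stripped.rsplit(None, 1)
    if len(parts) != 2:
        return None
    path, package_field = parts
    # Several packages may own the same path, separated by commas.
    packages = [name for name in package_field.split(",") if name]
    if not packages:
        return None
    return path, packages


print(parse_contents_line("usr/bin/foo   pkg-a,pkg-b\n"))
# -> ('usr/bin/foo', ['pkg-a', 'pkg-b'])
```

Splitting from the right is a design choice of this sketch: it tolerates extra whitespace between the two fields without misreading a path that contains spaces.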

Reference: Debian repository format documentation (Contents indices)


Approach and thought process

Design goals

  • Correctness: follow the documented "Contents" format and handle common edge cases.
  • Scalability: the Contents file can be large, so avoid loading it fully into memory.
  • Deterministic output: stable ordering, especially when counts are tied.
  • Maintainability: clear structure, small functions, type hints, and basic tests.

Key decisions

  1. Streaming download and parsing

    • Download the .gz file in chunks to avoid high memory usage.
    • Decompress and parse line-by-line using a text wrapper over gzip, so the program scales to large indices.
  2. Counting strategy

    • Use collections.Counter to accumulate package -> file_count.
    • If a single file path is listed for multiple packages (pkg1,pkg2,...), each package's count is incremented, since the index associates the file with every listed package.
  3. Deterministic ranking

    • Produce a stable ranking by sorting with:
      • primary: count descending
      • secondary: package name ascending
    • This makes results reproducible across runs and Python versions.
  4. Cache for faster iteration

    • Store the downloaded file under an OS-friendly cache directory (e.g., ~/.cache/... or $XDG_CACHE_HOME/...).
    • Provide an option to bypass the cache and force re-download (--no-cache) to keep behavior explicit.
  5. Quality checks

    • Keep code close to Python best practices: type hints, small functions, clear error handling.
    • Add small unit tests that validate parsing and counting logic independently from network access.
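The streaming, counting, and ranking decisions above can be sketched roughly as follows. This is not the repository's code: the function names are hypothetical, and the sample data is compressed in memory so the example runs without network access.

```python
import gzip
import io
from collections import Counter
from typing import BinaryIO, List, Tuple


def count_files_per_package(gz_stream: BinaryIO) -> Counter:
    """Stream a gzip-compressed Contents index and count paths per package."""
    counts: Counter = Counter()
    # gzip.open accepts a file object; mode "rt" yields decoded text lines,
    # so only one line is held in memory at a time.
    with gzip.open(gz_stream, mode="rt", encoding="utf-8", errors="replace") as text:
        for line in text:
            parts = line.strip().rsplit(None, 1)
            if len(parts) != 2:
                continue  # skip empty or malformed lines
            for package in parts[1].split(","):
                if package:
                    counts[package] += 1  # every listed package gets the file
    return counts


def top_packages(counts: Counter, n: int = 10) -> List[Tuple[str, int]]:
    """Rank by count descending, then package name ascending for ties."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Tiny in-memory sample standing in for a real Contents-<arch>.gz file.
sample = (
    b"usr/bin/a  pkg-b\n"
    b"usr/bin/b  pkg-a,pkg-b\n"
    b"usr/bin/c  pkg-a\n"
    b"usr/bin/d  pkg-c\n"
)
ranking = top_packages(count_files_per_package(io.BytesIO(gzip.compress(sample))), n=2)
print(ranking)
# -> [('pkg-a', 2), ('pkg-b', 2)]
```

In the sample, pkg-a and pkg-b are tied at two files each, and the secondary sort key places pkg-a first regardless of the order in which the lines were read.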

Usage

Run

chmod +x package_statistics.py
./package_statistics.py amd64

Options

  • --top N : show top N packages (default: 10)
  • --base-url URL : set a different Debian mirror base URL
  • --no-cache : force re-download
  • --timeout SECONDS : network timeout
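For orientation, the options listed above could be declared with argparse along these lines. This is a sketch only: the help texts, the default timeout of 30 seconds, and the use of None as the --base-url default are assumptions, not values taken from the tool.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented CLI surface (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Print the packages with the most files in a Debian Contents index."
    )
    parser.add_argument("arch", help="Debian architecture, e.g. amd64")
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="show top N packages (default: 10)")
    # Assumption: None means "use the tool's built-in mirror".
    parser.add_argument("--base-url", default=None, metavar="URL",
                        help="Debian mirror base URL")
    parser.add_argument("--no-cache", action="store_true",
                        help="force re-download")
    # Assumption: the real default timeout is not documented here.
    parser.add_argument("--timeout", type=float, default=30.0, metavar="SECONDS",
                        help="network timeout")
    return parser


args = build_parser().parse_args(["arm64", "--top", "5", "--no-cache"])
print(args.arch, args.top, args.no_cache)
# -> arm64 5 True
```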

Development

Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt

Run tests

From the repository root:

PYTHONPATH=. pytest -q

Lint / format (optional)

ruff check .
ruff format .

Notes / assumptions

  • Input architecture is expected to match Debian naming (e.g., amd64, arm64, i386, ppc64el, ...).
  • The tool assumes the Debian mirror layout used in the assignment (base URL points to .../dists/stable/main/ and the file name pattern is Contents-<arch>.gz).
  • Parsing is tolerant of extra whitespace and ignores malformed/empty lines.
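To make the layout assumption and the cache location concrete, the sketch below builds the index URL, resolves a cache path, and downloads in chunks. Everything named here is hypothetical: EXAMPLE_BASE_URL is only an example mirror (the tool's actual default is not stated in this README), and the cache subdirectory name is an assumption.

```python
import os
import shutil
import urllib.request
from pathlib import Path

# Assumption: an example mirror URL, not necessarily the tool's default.
EXAMPLE_BASE_URL = "https://deb.debian.org/debian/dists/stable/main/"


def contents_url(base_url: str, arch: str) -> str:
    """Join the base URL and the Contents-<arch>.gz file name."""
    return f"{base_url.rstrip('/')}/Contents-{arch}.gz"


def cache_path(arch: str, app_name: str = "package_statistics") -> Path:
    """Use $XDG_CACHE_HOME when set, otherwise fall back to ~/.cache."""
    root = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(root) / app_name / f"Contents-{arch}.gz"


def download(url: str, dest: Path, timeout: float = 30.0,
             chunk_size: int = 64 * 1024) -> None:
    """Copy the response to disk in fixed-size chunks to bound memory use."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)


print(contents_url(EXAMPLE_BASE_URL, "amd64"))
# -> https://deb.debian.org/debian/dists/stable/main/Contents-amd64.gz
```

A caller would typically check whether cache_path(arch) already exists and skip download unless --no-cache was given.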

Roadmap

  • Validate/normalize arch more strictly (or provide allowed list) and give better CLI feedback.
  • Add optional progress indication for large downloads (without spamming stdout).
  • Add retry/backoff for transient network errors.
  • Fix the --out option: the argument is accepted, but the implementation always writes to the cache path. I'd either remove --out or wire it up properly.
  • Add an integration test that uses a small real .gz fixture file.
  • Make output machine-readable (--json / --csv) in addition to text.
