Echoes of a city’s quiet catalog
Médiathèque José Cabanis, Tolosa.
From the slow, patient stacks where the pink city’s breath settles into paper and dust, this catalogue was born. Not as a tidy ledger but as a tangled testimony of lives and seasons, one that bears the stains and slips of human passage. Children with bright breath, adults with careful hands, all interfering with the patient circulation of the months and years. This is a narrative of what a town borrows and forgets, so that it might be read again.
This project processes CSV exports of library loan counts and produces a set of analytical JSON outputs per category (e.g., movies, books, music). It first cleans mojibake and CSV artifacts, then derives insights by aggregating counts, detecting trends (progressions/regressions), extracting "shooting stars" (sudden disappearances), and computing top-n lists across several groupings.
The analysis is available as:
- A standalone Python script
- A didactic Jupyter notebook
Note
A thorough exploratory dataset analysis is available on Kaggle and in this repository.
Mojibake and misaligned CSV fields quietly break analyses. Garbled characters (mojibake) turn information into unreadable strings, while misplaced delimiters or extra semicolons shift columns so that numeric counts and category flags land in the wrong fields. Together they lead to silent data corruption, resulting in wrong aggregations and incorrect classifications.
Repairing these issues requires careful encoding detection and targeted string fixes, plus defensive parsing (line cleanup and column coercion) so the analysis recovers true data without discarding valuable records.
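For instance, here is a minimal sketch of such a repair pipeline (assuming `chardet` for detection; `read_repaired` and the two-entry mapping are illustrative, not the project's actual API):

```python
import chardet

def read_repaired(filepath: str, replacements: dict[str, str]) -> list[str]:
    """Detect the encoding from a small byte sample, then decode the file
    and apply targeted substring fixes line by line."""
    with open(filepath, "rb") as f:
        sample = f.read(200)
    encoding = chardet.detect(sample)["encoding"] or "utf-8"
    repaired = []
    with open(filepath, encoding=encoding) as f:
        for line in f:
            line = line.rstrip()
            for wrong, right in replacements.items():
                line = line.replace(wrong, right)
            repaired.append(line)
    return repaired

lines = read_repaired("dataset/sample.csv", {"ãa": "â", "áe": "è"})
```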
Tip
The provided Jupyter notebook walks through these issues step by step.
- Python 3.10+
- JupyterLab (or Jupyter Notebook) if you want to run the notebook
- Packages:
  - pandas
  - chardet
- Clone or download the repository:

  ```bash
  git clone https://github.com/brooks-code/toulouse-biblio-chronicle.git
  ```

- (Recommended) Within the script's location, create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate   # macOS / Linux
  venv\Scripts\activate      # Windows
  ```

- Install the dependencies:

  ```bash
  pip install pandas chardet
  ```
Prerequisites:

- Check that the CSV files are in a folder named `dataset` (or set a different folder in the Config). If not, they are available in this repository or here (search for the top-500 datasets).
- CSV filenames should contain one of the keywords (default: `films`, `imprimes`, `cds`) so the script can find them via the glob `*{discriminator}*.csv`.

Then:

- Run the `toulouse_public_library.ipynb` notebook.
- Or run the script from the command line:

  ```bash
  python toulouse_library.py
  ```

- Outputs are written to `output/results_{discriminator}.json`.
Note
By default the script iterates over `("films", "imprimes", "cds")`.
Expected CSV: semicolon-delimited files containing rows with at least the following columns:
| Column name | Type | Description | Example |
|---|---|---|---|
| ANNEE | Integer | Year of the record | 2019 |
| Nbre de prêts | Integer | Number of loans | 93 |
| TITRE | String | Title of the work | Enfantillages |
| AUTEUR | String | Author | Aldebert |
| Editeur | String | Publisher | Skyzo Music |
| Indice | String | Index | S099.2 |
| BIB | String | Library code | CABANIS |
| COTE | String | Location label | E 780.2 ALD |
| Cat 1 | String | Category label 1 (Audience) | E |
| Cat 2 | String | Category label 2 (Media type) | CD |
- “–” represents missing data.
- The category column (default: `Cat 1`) contains audience markers such as `A`, `E`, `BB`, `P`.
- The media type depends on the dataset type.
```csv
ANNEE;Nbre de prêts;TITRE;AUTEUR;Editeur;Indice;BIB;COTE;Cat 1;Cat 2
2020;95;La promesse de l'aube;Barbier, Eric;Paris : Pathé, 2018;PROM;CABANIS;PROM;A;-
```
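As a quick sanity check, a row like this can be read directly with pandas (a sketch; the project's loader, described below, additionally repairs lines before parsing):

```python
import pandas as pd

# Semicolon-delimited read; treat the "–" placeholder as missing data.
df = pd.read_csv("dataset/loans_films.csv", sep=";", na_values=["-", "–"])
print(df[["ANNEE", "Nbre de prêts", "TITRE"]].head())
```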
Processing:
Note
- Lines may include an `;-;` artifact before a `Cat 1` marker; the loader fixes it.
- Text fields may contain mojibake substrings (see the configurable replacements), which are also fixed.
- Only rows whose `Cat 1` is in the `cat1_selector` (default `("A", "E")`) are kept.
Example output:

```json
{
  "top10_all_years_all_cat1": {
    "question": "Top ten most popular items overall",
    "result": [
      {
        "Item_ID": "Le Journal de Mickey --- -",
        "Nbre de prêts": 28857,
        "Cat 1": "E"
      }
    ]
  }
}
```

| Option | Purpose | Default value |
|---|---|---|
| dataset_folder | Directory containing input CSV files | "dataset" |
| output_folder | Directory where JSON results are written | "output" |
| discriminator | Identifier used to select a CSV and name outputs (set per run) | None (set per run, e.g., "films") |
| discriminator_list | Discriminators iterated by the main entrypoint | ("films", "imprimes", "cds") |
| top_n | Number of top items to keep in each list | 10 |
| output_json | Output filename pattern (computed from discriminator) | results_{discriminator}.json |
| year_field | CSV column name containing the year | "ANNEE" |
| popularity_field | CSV column name containing popularity / loan counts | "Nbre de prêts" |
| title_field | CSV column name containing item title | "TITRE" |
| author_field | CSV column name containing item author | "AUTEUR" |
| cat1_field | CSV column name for audience category | "Cat 1" |
| cat1_selector | Cat1 values to keep (filter) | ("A", "E") |
| composite_field | Generated composite key (title + author) | "Item_ID" |
| replacements | Mojibake repair mapping (wrong -> correct) | see below |
Default `replacements` mapping: `'ãa'→'â'`, `'ãe'→'ê'`, `'ãi'→'î'`, `'ão'→'ô'`, `'ãu'→'û'`, `'âe'→'é'`, `'áa'→'à'`, `'áe'→'è'`, `'ðc'→'ç'`.
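Taken together, a per-run configuration could be sketched like this (whether `Config` accepts exactly these keyword arguments is an assumption based on the option table above):

```python
cfg = Config(
    dataset_folder="dataset",
    output_folder="output",
    discriminator="films",  # set per run
    top_n=10,
    cat1_selector=("A", "E"),
    replacements={
        "ãa": "â", "ãe": "ê", "ãi": "î", "ão": "ô", "ãu": "û",
        "âe": "é", "áa": "à", "áe": "è", "ðc": "ç",
    },
)
```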
Script layout (high-level):
```text
LoanAnalysis Module
├─ Config (dataclass)
├─ Utilities
│  ├─ find_first_matching_file(pattern)
│  ├─ detect_encoding(filepath, sample_bytes=200)
│  ├─ fix_line(line)
│  └─ apply_replacements(val, repl)
├─ LoanAnalysis (class)
│  ├─ load_and_clean_csv(filepath, encoding=None)
│  ├─ aggregate_by(df, groupby_fields, key_field, value_field)
│  ├─ top_n_from_agg(agg_df, value_field, top_n)
│  ├─ group_top_n(df, groupby_fields, key_field, value_field, top_n)
│  ├─ _build_pivot(df, index, columns, values)
│  ├─ compute_trends(pivot_df, trend="progression", top_n=None)
│  ├─ compute_trend_by_group(df, group_field, value_field, trend)
│  ├─ compute_sudden_disappearances(pivot_df, top_n=None)
│  ├─ compute_disappearances_by_group(df, group_field, value_field)
│  ├─ attach_cat1(record)
│  ├─ update_records_with_cat(records)
│  ├─ flatten_top_lists(input_data)
│  └─ run_all(df)
└─ main(cfg=None)
```

Details:
- `find_first_matching_file(pattern: str) -> str`
  - Returns the first file path matching the glob pattern, or raises `FileNotFoundError`.

    ```python
    pattern = os.path.join("dataset", "*films*.csv")
    path = find_first_matching_file(pattern)
    ```
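  A plausible implementation matching this description (a sketch, not necessarily the project's exact code):

  ```python
  import glob

  def find_first_matching_file(pattern: str) -> str:
      # glob.glob returns all matches; keep the first or fail loudly.
      matches = glob.glob(pattern)
      if not matches:
          raise FileNotFoundError(f"No file matches pattern {pattern!r}")
      return matches[0]
  ```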
- `detect_encoding(filepath: str, sample_bytes: int = 200) -> str`
  - Uses chardet on a byte sample and returns the detected encoding (fallback `'utf-8'`).

    ```python
    enc = detect_encoding("dataset/sample.csv")
    ```
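  A sketch of what this could look like with `chardet.detect` (assumed equivalent, not the verbatim source):

  ```python
  import chardet

  def detect_encoding(filepath: str, sample_bytes: int = 200) -> str:
      with open(filepath, "rb") as f:
          sample = f.read(sample_bytes)
      # chardet may return None when undecided; fall back to UTF-8.
      return chardet.detect(sample)["encoding"] or "utf-8"
  ```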
- `fix_line(line: str) -> str`
  - Strips trailing whitespace and fixes the `;-;` artifact when it appears before a `Cat 1` marker.
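  The exact repair is best read in the source; as a sketch of the idea, assuming the stray `;-;` is simply collapsed into a single separator before the audience markers:

  ```python
  def fix_line(line: str) -> str:
      line = line.rstrip()
      # Assumption: ";-;" directly before a Cat 1 marker (e.g., ";A;" or ";E;")
      # is an artifact and collapses to a single ";".
      for marker in ("A", "E", "BB", "P"):
          line = line.replace(f";-;{marker};", f";{marker};")
      return line
  ```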
- `apply_replacements(val: object, repl: Mapping[str, str]) -> object`
  - Replaces substrings in strings according to the mapping; returns non-strings unchanged.

    ```python
    clean = apply_replacements("Author Ãa", {"Ãa": "Â"})
    ```
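  Its behavior follows directly from the description; a minimal sketch:

  ```python
  from typing import Mapping

  def apply_replacements(val: object, repl: Mapping[str, str]) -> object:
      # Non-strings (numbers, NaN, None) pass through untouched.
      if not isinstance(val, str):
          return val
      for wrong, right in repl.items():
          val = val.replace(wrong, right)
      return val
  ```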
Instantiate the class with a Config object:

```python
analysis = LoanAnalysis(cfg)
```

Key methods:
- `load_and_clean_csv(filepath: str, encoding: Optional[str] = None) -> pandas.DataFrame`
  - Reads a semicolon-delimited CSV, fixes lines, applies mojibake replacements, coerces numeric columns (`ANNEE`, `Nbre de prêts`) into nullable Int64, builds `Item_ID` as "TITLE --- AUTHOR", and constructs the internal cat1_map (composite -> cat1).
  - Returns the cleaned DataFrame filtered by `cat1_selector`.

    ```python
    df = analysis.load_and_clean_csv("dataset/loans_films.csv", encoding="utf-8")
    ```
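  The coercion and composite-key steps described above could look roughly like this (a sketch of two of the loader's steps, not the full method):

  ```python
  import pandas as pd

  # Nullable Int64 keeps missing values as <NA> instead of forcing floats.
  for col in ("ANNEE", "Nbre de prêts"):
      df[col] = pd.to_numeric(df[col], errors="coerce").astype("Int64")

  # Composite key: "TITLE --- AUTHOR".
  df["Item_ID"] = df["TITRE"] + " --- " + df["AUTEUR"]

  # Keep only the selected audiences.
  df = df[df["Cat 1"].isin(("A", "E"))]
  ```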
- `aggregate_by(df, groupby_fields, key_field, value_field) -> DataFrame`
  - Groups by groupby_fields + key_field and sums value_field.

    ```python
    agg = analysis.aggregate_by(df, [cfg.year_field], cfg.composite_field, cfg.popularity_field)
    ```
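  A sketch of the equivalent pandas operation:

  ```python
  def aggregate_by(df, groupby_fields, key_field, value_field):
      # Sum the value column per (groupby_fields..., key_field) combination.
      return (
          df.groupby(list(groupby_fields) + [key_field], as_index=False)[value_field]
          .sum()
      )
  ```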
- `top_n_from_agg(agg_df, value_field, top_n) -> DataFrame`
  - Returns the top_n rows ordered by value_field, descending.
- `group_top_n(df, groupby_fields, key_field, value_field, top_n) -> Dict[str, List[dict]]`
  - Produces a top-n list per group; useful for per-year or per-category results.

    ```python
    top_by_year_cat1 = analysis.group_top_n(df, [cfg.year_field, cfg.cat1_field], cfg.composite_field, cfg.popularity_field, 10)
    ```
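  One way to express this in pandas (a sketch; the real method's group-key formatting may differ):

  ```python
  def group_top_n(df, groupby_fields, key_field, value_field, top_n):
      agg = df.groupby(list(groupby_fields) + [key_field], as_index=False)[value_field].sum()
      out = {}
      for group, sub in agg.groupby(groupby_fields):
          # Top-n items within each group, by summed value.
          out[str(group)] = sub.nlargest(top_n, value_field).to_dict(orient="records")
      return out
  ```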
- `_build_pivot(df, index, columns, values) -> DataFrame`
  - Creates a pivot with the composite key as index, years as columns, and counts filled with zeros.
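  In pandas terms, likely something close to this sketch:

  ```python
  def _build_pivot(df, index, columns, values):
      # Items as rows, years as columns; absent combinations become 0.
      return df.pivot_table(index=index, columns=columns, values=values,
                            aggfunc="sum", fill_value=0)
  ```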
- `compute_trends(pivot_df, trend="progression", top_n=None) -> DataFrame`
  - Detects progression/regression items:
    - Keeps items with >= 2 positive years.
    - Finds the first and last positive year and sums the values at those years.
    - Returns rows with diff > 0 for progressions, diff < 0 for regressions.
  - Returned columns: Item_ID, first_year, last_year, sum_first, sum_last, diff.

    ```python
    pivot = analysis._build_pivot(df, cfg.composite_field, cfg.year_field, cfg.popularity_field)
    top_prog = analysis.compute_trends(pivot, trend="progression")
    ```
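  The described algorithm can be sketched as follows (the sort order shown is an assumption; consult the source for the exact ranking):

  ```python
  import pandas as pd

  def compute_trends(pivot_df, trend="progression", top_n=None):
      rows = []
      for item, series in pivot_df.iterrows():
          positive_years = series[series > 0].index
          if len(positive_years) < 2:
              continue  # a trend needs at least two active years
          first, last = positive_years.min(), positive_years.max()
          rows.append({"Item_ID": item,
                       "first_year": first, "last_year": last,
                       "sum_first": series[first], "sum_last": series[last],
                       "diff": series[last] - series[first]})
      out = pd.DataFrame(rows)
      if out.empty:
          return out
      mask = out["diff"] > 0 if trend == "progression" else out["diff"] < 0
      out = out[mask].sort_values("diff", ascending=(trend != "progression"))
      return out.head(top_n) if top_n else out
  ```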
- `compute_trend_by_group(df, group_field, value_field, trend) -> Dict[str, List[dict]]`
  - Runs compute_trends per group value (e.g., per Cat 1).
- `compute_sudden_disappearances(pivot_df, top_n=None) -> DataFrame`
  - Finds items whose last positive year is followed by an existing year with zero counts (a sudden disappearance).
  - Returned columns: Item_ID, last_year, loan_last.

    ```python
    disp = analysis.compute_sudden_disappearances(pivot)
    ```
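  A sketch of the described detection (ranking by last-year loans is an assumption):

  ```python
  import pandas as pd

  def compute_sudden_disappearances(pivot_df, top_n=None):
      years = sorted(pivot_df.columns)
      rows = []
      for item, series in pivot_df.iterrows():
          positive = [y for y in years if series[y] > 0]
          # "Sudden": a later year exists in the data, with zero counts.
          if positive and positive[-1] != years[-1]:
              rows.append({"Item_ID": item,
                           "last_year": positive[-1],
                           "loan_last": series[positive[-1]]})
      out = pd.DataFrame(rows)
      if not out.empty:
          out = out.sort_values("loan_last", ascending=False)
      return out.head(top_n) if top_n else out
  ```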
- `compute_disappearances_by_group(df, group_field, value_field) -> Dict[str, List[dict]]`
  - Runs compute_sudden_disappearances per group value.
- `attach_cat1(record: dict) -> dict`
  - Attaches a `Cat 1` value to a record based on its `Item_ID`, using the internal `cat1_map`.
- `update_records_with_cat(records: Iterable[dict]) -> List[dict]`
  - Applies attach_cat1 to multiple records.
- `flatten_top_lists(input_data) -> List[dict]`
  - Recursively flattens nested list/dict containers of records into a flat list.
- `run_all(df: DataFrame) -> Dict[str, dict]`
  - Runs the full suite of analyses (aggregations, trends, shooting stars) and returns a structured result dict.

    ```python
    results = analysis.run_all(df)
    ```
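End to end, a run then reduces to load, analyse, dump (a sketch; `main()` presumably does the equivalent for each discriminator):

```python
import json
import os

df = analysis.load_and_clean_csv("dataset/loans_films.csv")
results = analysis.run_all(df)

os.makedirs("output", exist_ok=True)
with open("output/results_films.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```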
Contributions are welcome! I appreciate your support: every contribution and piece of feedback helps me grow and improve.
This project is intended as practice on a real-world use case, so feel free to play with it. I'm open to any suggestion that improves the code quality and deepens my software programming skills. If you'd like to contribute, please fork the repository and submit a pull request with your changes.
This project is provided as-is, without warranty. Use at your own risk.
It is provided under a Creative Commons CC0 license. See the LICENSE file for details.
Big thanks to the Toulouse Metropolis open data portal team for making these datasets publicly available.
They are provided under a Licence Ouverte v2.0 (Etalab) license.
Done in September 2025 by brk🂽.
See you later!