
The Toulouse public library dataset

Echoes of a city’s quiet catalog

header image
Médiathèque José Cabanis, Tolosa.

Genesis

From the slow, patient stacks where the pink city’s breath settles into paper and dust, this catalogue was born. Not as a tidy ledger, but as a tangled testimony of lives and seasons that bears the stains and slips of human passage. Children with bright breath, adults with careful hands, all interrupting the patient circulation of the months and years. This is a narrative of what a town borrows and forgets, so that it might be read again.


Overview

This project processes CSV exports of library loan counts and produces a set of analytical JSON outputs per category (e.g., movies, books, music). It first cleans mojibake and CSV artifacts, then derives insights by aggregating counts, detecting trends (progressions/regressions), extracting "shooting stars" (items whose loans suddenly stop), and computing top-n lists across several groupings.

The analysis is available as:

  • A standalone Python script
  • A didactic Jupyter notebook

Note

A thorough exploratory dataset analysis is available on Kaggle and in this repository.

Encountered issues

Mojibake and misaligned CSV fields quietly break analyses. Garbled characters (mojibake) turn text into unreadable strings, while misplaced delimiters or extra semicolons shift columns so that numeric counts and category flags land in the wrong fields. Together they cause silent data corruption, leading to wrong aggregations and incorrect classifications.

Repairing these issues requires careful encoding detection and targeted string fixes, plus defensive parsing (line cleanup and column coercion) so the analysis recovers true data without discarding valuable records.
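
As a rough illustration of that approach (not the script's exact code; the real helpers are detect_encoding and apply_replacements, documented below), encoding detection with chardet plus targeted substring repair might look like this sketch:

import chardet

# Illustrative subset of the repair mapping; the full table lives in the Config
# (see "Default replacements mapping").
REPLACEMENTS = {"ãa": "â", "âe": "é", "ðc": "ç"}

def sniff_encoding(path: str, sample_bytes: int = 200) -> str:
    """Guess the file encoding from a small byte sample, falling back to UTF-8."""
    with open(path, "rb") as fh:
        guess = chardet.detect(fh.read(sample_bytes))
    return guess.get("encoding") or "utf-8"

def repair_text(value: str, replacements: dict[str, str]) -> str:
    """Apply targeted substring fixes for known mojibake patterns."""
    for wrong, right in replacements.items():
        value = value.replace(wrong, right)
    return value

# Hypothetical usage:
# enc = sniff_encoding("dataset/films.csv")
# repair_text("chãateau", REPLACEMENTS)  # -> "château"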

Tip

The provided Jupyter notebook walks through these issues step by step.

Requirements

  • Python 3.10+
  • JupyterLab or Jupyter Notebook, if you want to run the notebook.
  • Packages:
    • pandas
    • chardet

Installation

  1. Clone or download the repository:

    • git clone https://github.com/brooks-code/toulouse-biblio-chronicle.git
  2. (Recommended) Within the script's location, create and activate a virtual environment:

    • python -m venv venv
    • source venv/bin/activate (macOS / Linux) or venv\Scripts\activate (Windows)
  3. Install the dependencies:

    • pip install pandas chardet

Usage

Prerequisites:

  • Check that the CSV files are in a folder named dataset (or set a different folder in the Config). If not, they are available in this repository or here (search for the top-500 datasets).

  • CSV filenames should contain one of the keywords (default: films, imprimes, cds) so the script can find them via the glob pattern *{discriminator}*.csv.

Run the analysis:

  • Run the toulouse_public_library.ipynb notebook.

  • Or run the script from the command line:

    python toulouse_library.py

Outputs are written to output/results_{discriminator}.json.

Note

By default the script iterates over ("films", "imprimes", "cds").

Dataset

Description

Expected CSV: semicolon-delimited files whose rows contain at least the following columns:

Column name   | Type    | Description                    | Example
ANNEE         | Integer | Year of the record             | 2019
Nbre de prêts | Integer | Number of loans                | 93
TITRE         | String  | Title of the work              | Enfantillages
AUTEUR        | String  | Author                         | Aldebert
Editeur       | String  | Publisher                      | Skyzo Music
Indice        | String  | Index                          | S099.2
BIB           | String  | Library code                   | CABANIS
COTE          | String  | Location label                 | E 780.2 ALD
Cat 1         | String  | Category label 1 (Audience)    | E
Cat 2         | String  | Category label 2 (Media type)  | CD
  • “–” represents missing data.
  • The category column (default: Cat 1) contains audience markers such as A, E, BB, P.
  • The media type depends on the dataset type.

Input CSV layout

ANNEE;Nbre de prêts;TITRE;AUTEUR;Editeur;Indice;BIB;COTE;Cat 1;Cat 2
2020;95;La promesse de l'aube;Barbier, Eric;Paris : Pathé, 2018;PROM;CABANIS;PROM;A;-

Processing:

Note

  • Lines may include the artifact ;-; before a Cat 1 marker; the loader fixes it.
  • Text fields may contain mojibake substrings (see configurable replacements) which will be fixed.
  • Only rows with Cat 1 in the cat1_selector (default ("A", "E")) are kept.
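
To make the line cleanup concrete, here is a small sketch under stated assumptions (the exact repair lives in fix_line and load_and_clean_csv; the artifact rule and marker list below are inferred from this note and the dataset description):

import re

# Assumed audience markers and an assumed repair rule: collapse a stray ";-;"
# that appears right before the Cat 1 field.
MARKERS = ("A", "E", "BB", "P")
ARTIFACT = re.compile(rf";-;(?=(?:{'|'.join(MARKERS)});)")

def clean_line(line: str) -> str:
    """Strip trailing whitespace and drop the ';-;' artifact; the real fix_line() may do more."""
    return ARTIFACT.sub(";", line.rstrip())

# A hypothetical malformed version of the sample row shown above:
raw = "2020;95;La promesse de l'aube;Barbier, Eric;Paris : Pathé, 2018;PROM;CABANIS;PROM;-;A;-"
print(clean_line(raw))
# 2020;95;La promesse de l'aube;Barbier, Eric;Paris : Pathé, 2018;PROM;CABANIS;PROM;A;-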

Sample JSON output

{
    "top10_all_years_all_cat1": {
        "question": "Top ten most popular items overall",
        "result": [
            {
                "Item_ID": "Le Journal de Mickey --- -",
                "Nbre de prêts": 28857,
                "Cat 1": "E"
            }
        ]
    }
}

Configuration

Option             | Purpose                                               | Default value
dataset_folder     | Directory containing input CSV files                  | "dataset"
output_folder      | Directory where JSON results are written              | "output"
discriminator      | Identifier used to select a CSV and name outputs      | None (set per run, e.g., "films")
discriminator_list | Discriminators iterated by the main entrypoint        | ("films", "imprimes", "cds")
top_n              | Number of top items to keep in each list              | 10
output_json        | Output filename pattern (computed from discriminator) | results_{discriminator}.json
year_field         | CSV column name containing the year                   | "ANNEE"
popularity_field   | CSV column name containing popularity / loan counts   | "Nbre de prêts"
title_field        | CSV column name containing item title                 | "TITRE"
author_field       | CSV column name containing item author                | "AUTEUR"
cat1_field         | CSV column name for audience category                 | "Cat 1"
cat1_selector      | Cat 1 values to keep (filter)                         | ("A", "E")
composite_field    | Generated composite key (title + author)              | "Item_ID"
replacements       | Mojibake repair mapping (wrong -> correct)            | see below

Default replacements mapping

  • 'ãa' -> 'â'
  • 'ãe' -> 'ê'
  • 'ãi' -> 'î'
  • 'ão' -> 'ô'
  • 'ãu' -> 'û'
  • 'âe' -> 'é'
  • 'áa' -> 'à'
  • 'áe' -> 'è'
  • 'ðc' -> 'ç'
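
A minimal sketch of overriding these options, assuming Config and main are importable from toulouse_library as the module overview below suggests, and that Config accepts the fields above as keyword arguments:

from toulouse_library import Config, main

# Hypothetical override run: process only the films export and keep shorter top lists.
cfg = Config(
    dataset_folder="dataset",
    output_folder="output",
    discriminator_list=("films",),
    top_n=5,
)
main(cfg)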

Code structure and function reference

Module overview

Script layout (high-level):

 LoanAnalysis Module
 ├─ Config (dataclass)
 ├─ Utilities
 │  ├─ find_first_matching_file(pattern)
 │  ├─ detect_encoding(filepath, sample_bytes=200)
 │  ├─ fix_line(line)
 │  └─ apply_replacements(val, repl)
 ├─ LoanAnalysis (class)
 │  ├─ load_and_clean_csv(filepath, encoding=None)
 │  ├─ aggregate_by(df, groupby_fields, key_field, value_field)
 │  ├─ top_n_from_agg(agg_df, value_field, top_n)
 │  ├─ group_top_n(df, groupby_fields, key_field, value_field, top_n)
 │  ├─ _build_pivot(df, index, columns, values)
 │  ├─ compute_trends(pivot_df, trend="progression", top_n=None)
 │  ├─ compute_trend_by_group(df, group_field, value_field, trend)
 │  ├─ compute_sudden_disappearances(pivot_df, top_n=None)
 │  ├─ compute_disappearances_by_group(df, group_field, value_field)
 │  ├─ attach_cat1(record)
 │  ├─ update_records_with_cat(records)
 │  ├─ flatten_top_lists(input_data)
 │  └─ run_all(df)
 └─ main(cfg=None)

Utilities

Details:
  • find_first_matching_file(pattern: str) -> str

    • Return the first file path matching glob pattern or raise FileNotFoundError.

      pattern = os.path.join("dataset", "*films*.csv")
      path = find_first_matching_file(pattern)
  • detect_encoding(filepath: str, sample_bytes: int = 200) -> str

    • Uses chardet on a byte sample and returns encoding (fallback 'utf-8').

      enc = detect_encoding("dataset/sample.csv")
  • fix_line(line: str) -> str

    • Strip trailing whitespace and fix ;-; artifact when it appears before a Cat 1 marker.
  • apply_replacements(val: object, repl: Mapping[str, str]) -> object

    • Replace substrings in strings according to the mapping; non-string values are returned unchanged.

      clean = apply_replacements("Author Ãa", {"Ãa": "Â"})

LoanAnalysis class

Instantiate it with a Config object: analysis = LoanAnalysis(cfg)

Key methods:
  • load_and_clean_csv(filepath: str, encoding: Optional[str] = None) -> pandas.DataFrame

    • Reads a semicolon-delimited CSV, fixes lines, applies mojibake replacements, coerces numeric columns (ANNEE, Nbre de prêts) into nullable Int64, builds Item_ID as "TITLE --- AUTHOR", constructs internal cat1_map (composite -> cat1).

    • Returns cleaned DataFrame filtered by cat1_selector.

      df = analysis.load_and_clean_csv("dataset/loans_films.csv", encoding="utf-8")
  • aggregate_by(df, groupby_fields, key_field, value_field) -> DataFrame

    • Groups by groupby_fields + key_field and sums value_field.

      agg = analysis.aggregate_by(df, [cfg.year_field], cfg.composite_field, cfg.popularity_field)
  • top_n_from_agg(agg_df, value_field, top_n) -> DataFrame

    • Return top_n rows ordered by value_field descending.
  • group_top_n(df, groupby_fields, key_field, value_field, top_n) -> Dict[str, List[dict]]

    • Produce top-n per group. Useful for per-year or per-cat results.

      top_by_year_cat1 = analysis.group_top_n(df, [cfg.year_field, cfg.cat1_field], cfg.composite_field, cfg.popularity_field, 10)
  • _build_pivot(df, index, columns, values) -> DataFrame

    • Create pivot with composite key as index, years as columns, counts filled with zeros.
  • compute_trends(pivot_df, trend="progression", top_n=None) -> DataFrame

    • Detect progression/regression items:

      • Keeps items with >= 2 positive years.
      • Finds first and last positive year and sums values at those years.
      • Returns rows with diff > 0 for progressions, diff < 0 for regressions.
    • Returned columns: Item_ID, first_year, last_year, sum_first, sum_last, diff

      pivot = analysis._build_pivot(df, cfg.composite_field, cfg.year_field, cfg.popularity_field)
      top_prog = analysis.compute_trends(pivot, trend="progression")
  • compute_trend_by_group(df, group_field, value_field, trend) -> Dict[str, List[dict]]

    • Runs compute_trends per group value (e.g., per Cat1).
  • compute_sudden_disappearances(pivot_df, top_n=None) -> DataFrame

    • Finds items whose last positive year is followed by an existing year with zero counts (sudden disappearance).

    • Returns columns: Item_ID, last_year, loan_last

      disp = analysis.compute_sudden_disappearances(pivot)
  • compute_disappearances_by_group(df, group_field, value_field) -> Dict[str, List[dict]]

    • Runs compute_sudden_disappearances per group value.
  • attach_cat1(record: dict) -> dict

    • Attach a Cat 1 value to a record based on its Item_ID using the internal cat1_map.
  • update_records_with_cat(records: Iterable[dict]) -> List[dict]

    • Apply attach_cat1 to multiple records.
  • flatten_top_lists(input_data) -> List[dict]

    • Recursively flatten nested list/dict containers of records to a flat list.
  • run_all(df: DataFrame) -> Dict[str, dict]

    • Run a full suite of analyses (aggregations, trends, shooting stars) and return a structured result dict.

      results = analysis.run_all(df)
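
Putting the pieces together, a minimal end-to-end run for a single discriminator might look like the sketch below (main() is described as doing roughly this for each entry of discriminator_list; the import path from toulouse_library is an assumption):

import json
import os

from toulouse_library import Config, LoanAnalysis, detect_encoding, find_first_matching_file

cfg = Config()                                  # defaults: dataset/, output/, top_n=10, ...
analysis = LoanAnalysis(cfg)

# Locate the "films" export, guess its encoding, then clean and analyse it.
csv_path = find_first_matching_file(os.path.join(cfg.dataset_folder, "*films*.csv"))
encoding = detect_encoding(csv_path)
df = analysis.load_and_clean_csv(csv_path, encoding=encoding)
results = analysis.run_all(df)                  # aggregations, trends, shooting stars

os.makedirs(cfg.output_folder, exist_ok=True)
out_path = os.path.join(cfg.output_folder, "results_films.json")
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(results, fh, ensure_ascii=False, indent=4)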

Contributing

Contributions are welcome! I appreciate your support: each contribution and feedback helps me grow and improve.

This project is intended as practice on a real-world use case, so feel free to play with it. I'm open to any suggestion that improves the code quality and deepens my programming skills. If you'd like to contribute to this project, please fork the repository and submit a pull request with your changes.

Disclaimer & License

This project is provided as-is, without warranty. Use at your own risk.

It is provided under a Creative Commons CC0 license. See the LICENSE file for details.

Acknowledgments

footer image

Big thanks to the Toulouse Metropolis open data portal team for making these datasets publicly available.

They are provided under a Licence Ouverte v2.0 (Etalab) license.

Done in September 2025 by brk🂽.

See you later.
