RAF: Retrieval-Augmented Dataset Assembly for Fair Clustering

RAF is a retrieval-augmented dataset assembly framework for downstream fair clustering. Instead of treating fairness only as a constraint imposed on a fixed input dataset, RAF improves the data foundation before clustering by selectively acquiring real samples from external data sources.

The key idea is simple: when a minority group is under-represented in some semantic regions of the data space, fair clustering algorithms may have too few relevant samples to produce both useful and fair clusters. RAF addresses this limitation by retrieving and accepting external samples that are not only from under-represented groups, but also semantically aligned with the majority-group distribution of the target dataset.

This repository contains the public-facing RAF runtime and experiment execution code.

Why RAF?

Fair clustering aims to partition data while reducing group-level disparities with respect to a sensitive attribute. Existing methods usually operate on a fixed dataset. This setting is limiting when the dataset itself is imbalanced: if minority samples are missing in some regions of the embedding space, a clustering algorithm can only rebalance the samples that already exist. In such cases, fairness constraints may force semantically misaligned assignments and degrade clustering utility.

RAF takes a data-centric view. It asks a different question:

Given an initial dataset, a set of external data sources, and a limited acquisition budget, which external samples should be assembled to improve downstream fair clustering?

RAF is designed for this retrieval-augmented setting. It first identifies where the initial dataset lacks minority-group support, then allocates budget to promising external sources, and finally values each retrieved candidate by its marginal contribution to cross-group distributional alignment.

Method Overview

RAF follows a two-stage selection--valuation workflow (steps 1 and 2 below). Step 3 describes how the valuation stage is kept efficient.

1. Demand-aware source selection

RAF first pre-clusters the query dataset and builds a demand matrix over cluster--group combinations. Each entry reflects how many additional samples are needed for a sensitive group in a specific semantic region. This demand signal is used to guide a multi-armed-bandit source selection policy, so that RAF spends more budget on sources that are more likely to return useful candidates.
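The demand signal described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the code in demand.py or bandit.py: the function names are hypothetical, the demand rule (raise every group to the size of the largest group in the same cluster) is one plausible choice, and the bandit is a plain epsilon-greedy rule rather than the fair_eps_greedy policy itself.

```python
import numpy as np

def demand_matrix(cluster_ids, group_ids, n_clusters, n_groups):
    """Count samples per (cluster, group) cell and return each cell's shortfall."""
    counts = np.zeros((n_clusters, n_groups), dtype=int)
    np.add.at(counts, (cluster_ids, group_ids), 1)
    # Assumed rule: every group should reach the largest group in its cluster.
    return counts.max(axis=1, keepdims=True) - counts

def demand_reward(demand, cluster, group):
    """Assumed reward: share of total outstanding demand in the candidate's cell."""
    total = demand.sum()
    return float(demand[cluster, group] / total) if total > 0 else 0.0

def pick_source(mean_reward, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(len(mean_reward)))
    return int(np.argmax(mean_reward))

# Toy query set: 7 samples, 2 pre-clusters, 2 sensitive groups.
cluster_ids = np.array([0, 0, 0, 1, 1, 1, 1])
group_ids = np.array([0, 0, 1, 0, 1, 1, 1])
demand = demand_matrix(cluster_ids, group_ids, n_clusters=2, n_groups=2)

# A candidate landing in cluster 1, group 0 fills the largest gap.
reward = demand_reward(demand, cluster=1, group=0)

# With epsilon = 0 the policy always picks the best source seen so far.
best = pick_source(np.array([0.1, 0.5, 0.2]), 0.0, np.random.default_rng(0))
```

In this toy case cluster 0 needs one more sample from group 1 and cluster 1 needs two more from group 0, so a source that tends to return candidates for the second cell earns a higher reward and is selected more often.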

2. MaxSim-based candidate valuation

After a candidate is retrieved, RAF estimates whether the sample reduces the distributional gap between majority and minority groups in the embedding space. RAF uses a MaxSim-style vector-set similarity objective to measure the marginal contribution of each candidate. A sample is accepted only when it improves the current majority--minority alignment.
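To make the acceptance rule concrete, here is a small sketch of a MaxSim-style marginal gain. It assumes cosine similarity and a score defined as the mean, over majority vectors, of the best similarity to any minority vector. The function names and the exact objective are illustrative assumptions and may differ from what selection.py and vector_ops.py implement.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(majority, minority):
    """Mean over majority vectors of the best similarity to any minority vector."""
    sims = normalize(majority) @ normalize(minority).T
    return float(sims.max(axis=1).mean())

def marginal_gain(majority, minority, candidate):
    """Change in the MaxSim score if the candidate joins the minority set."""
    augmented = np.vstack([minority, candidate])
    return maxsim(majority, augmented) - maxsim(majority, minority)

majority = np.array([[1.0, 0.0], [0.0, 1.0]])
minority = np.array([[1.0, 0.0]])

# Covers a majority region that no minority sample reaches yet.
gain_useful = marginal_gain(majority, minority, np.array([0.0, 1.0]))
# Duplicates existing minority coverage, so it adds nothing.
gain_redundant = marginal_gain(majority, minority, np.array([1.0, 0.0]))

accept_useful = gain_useful > 0
accept_redundant = gain_redundant > 0
```

Under these assumptions the first candidate raises the score from 0.5 to 1.0 and is accepted, while the second leaves it unchanged and is rejected.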

3. Efficient incremental valuation

Naively recomputing the full MaxSim gain for every candidate is expensive. RAF therefore supports incremental valuation and an optional hybrid-index mode. The hybrid mode combines nearest-neighbor search with inverted lists to reduce the number of majority samples that must be rechecked for each candidate.
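The incremental idea can be sketched by caching, for each majority vector, its best similarity to the current minority set. A candidate's gain then needs only one similarity pass over the majority vectors instead of a full recomputation. This is an illustrative sketch under the same assumed objective as a mean-of-max cosine score; the class name is hypothetical, and the hybrid index (nearest-neighbor search plus inverted lists) is not reproduced here.

```python
import numpy as np

class IncrementalMaxSim:
    """Caches each majority vector's best similarity to the minority set."""

    def __init__(self, majority, minority):
        self.majority = majority / np.linalg.norm(majority, axis=1, keepdims=True)
        minority = minority / np.linalg.norm(minority, axis=1, keepdims=True)
        self.best = (self.majority @ minority.T).max(axis=1)

    def _sims(self, candidate):
        return self.majority @ (candidate / np.linalg.norm(candidate))

    def gain(self, candidate):
        """Mean improvement over the cached best similarities (never negative)."""
        return float(np.maximum(self._sims(candidate) - self.best, 0.0).mean())

    def accept(self, candidate):
        """Fold an accepted candidate into the cache."""
        self.best = np.maximum(self.best, self._sims(candidate))

majority = np.array([[1.0, 0.0], [0.0, 1.0]])
minority = np.array([[1.0, 0.0]])
candidate = np.array([0.0, 1.0])

state = IncrementalMaxSim(majority, minority)
first_gain = state.gain(candidate)   # candidate fills an uncovered region
state.accept(candidate)
repeat_gain = state.gain(candidate)  # the same region is now covered
```

A hybrid index would presumably go one step further and compute similarities only for the majority vectors near the candidate, since distant vectors are unlikely to have their cached maximum improved.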

The recommended runtime setting is:

policy         = fair_eps_greedy
valuation_mode = incremental_hybrid

Repository Status

This repository is a cleaned public-facing subset of the RAF project. It focuses on runtime execution and reproducible experiment sweeps.

Included:

RAF/: core RAF runtime package
RAF/run_pipeline.py: generic single-run CLI
RAF/run_experiments.py: generic sweep / experiment CLI
source_config.example.json: example source manifest
requirements.txt: minimal runtime dependencies
pyproject.toml: package metadata and CLI entry points

Excluded on purpose:

dataset construction scripts
query/source builders
tests
result folders
dataset-specific PowerShell wrappers
analysis, comparison, and summarization scripts
bundled third-party fair-clustering repositories

The public release assumes that RAF-ready query and source files have already been prepared.

Repository Layout

RAF/
  README.md
  pyproject.toml
  requirements.txt
  source_config.example.json
  RAF/
    __init__.py
    bandit.py
    clustering.py
    config.py
    data.py
    demand.py
    evaluation.py
    experiments.py
    fair_external_eval.py
    pipeline.py
    run_experiments.py
    run_pipeline.py
    selection.py
    vector_ops.py

Installation

Option 1: Install runtime dependencies

python -m pip install -r requirements.txt

Option 2: Install RAF as an editable package

python -m pip install -e .

Optional: install HNSW support

hnswlib is only required when using incremental_hybrid valuation.

python -m pip install -e ".[hnsw]"

The same command works in Windows PowerShell. The quotes keep the shell from interpreting the square brackets.

Data

Input format

RAF expects preprocessed query and source files in Parquet format.

Column	Default name	Description
Embedding	`embedding`	Vector representation of each sample.
Sensitive attribute	`is_english_name`	Sensitive group label used for fair dataset assembly.
Extra columns	user-defined	Optional metadata columns preserved for analysis or evaluation.

Each source file must contain at least the embedding column and the sensitive attribute column.
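A quick pre-flight check for the required columns might look like the following. The helper name is hypothetical and is not part of the RAF package; it only encodes the two default column names listed in the table above.

```python
def missing_columns(columns, embedding_col="embedding", sensitive_col="is_english_name"):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in (embedding_col, sensitive_col) if name not in present]

# Column names as they might appear in a source file.
ok = missing_columns(["embedding", "is_english_name", "paper_id"])
bad = missing_columns(["embedding", "paper_id"])
```

With pandas installed, the column names of a real file can be obtained from pd.read_parquet(path).columns and passed to the helper. An empty result means the file satisfies the minimum schema.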

Passing source files

RAF supports two source-loading modes.

Mode A: source directory

--source_dir /path/to/sources \
--source_glob "source_*.parquet"

Mode B: source manifest

--source_config source_config.example.json

Use the provided source_config.example.json as the template for declaring source files and source-level metadata.

Original datasets

The experiments in the RAF paper are designed around multi-modal datasets. Raw datasets are not bundled in this repository. Please follow the corresponding dataset licenses and access policies.

Dataset	Modality	Original source
SciSciNet-v2	scientific publication metadata / text-derived embeddings	https://www.openresearchbeacon.org/project/sciscinet/
Folktables	tabular census data	https://github.com/socialfoundations/folktables
FairFace	face images	https://github.com/joojs/fairface
Argoverse 2 Motion Forecasting	trajectory / autonomous-driving scenarios	https://argoverse.github.io/user-guide/

Processed RAF-ready datasets

Processed datasets will be released separately from this repository, at the following locations.

Dataset	RAF-ready query/source files
SciSciNet-v2	https://huggingface.co/datasets/pengyueli/RAF_SciSciNet
Folktables	https://huggingface.co/datasets/pengyueli/RAF_FolkTables
FairFace	https://huggingface.co/datasets/pengyueli/RAF_FairFace
Argoverse 2	https://huggingface.co/datasets/pengyueli/RAF_Argoverse

Quick Start

Run RAF once

python -m RAF.run_pipeline \
  --query_path /path/to/query.parquet \
  --source_dir /path/to/sources \
  --source_glob "source_*.parquet" \
  --policy fair_eps_greedy \
  --valuation_mode incremental_hybrid \
  --limit_sources 20

Windows PowerShell version:

python -m RAF.run_pipeline `
  --query_path D:\path\to\query.parquet `
  --source_dir D:\path\to\sources `
  --source_glob source_*.parquet `
  --policy fair_eps_greedy `
  --valuation_mode incremental_hybrid `
  --limit_sources 20

You can pass a source manifest instead of a source directory:

python -m RAF.run_pipeline \
  --query_path /path/to/query.parquet \
  --source_config source_config.example.json \
  --policy fair_eps_greedy \
  --valuation_mode incremental_hybrid

If RAF is installed as an editable package, the console entry point is also available:

raf-run \
  --query_path /path/to/query.parquet \
  --source_dir /path/to/sources

To inspect all available command-line options:

python -m RAF.run_pipeline --help

Running Experiments

Use run_experiments.py for repeated runs and budget sweeps.

python -m RAF.run_experiments \
  --query_path /path/to/query.parquet \
  --source_dir /path/to/sources \
  --source_glob "source_*.parquet" \
  --policy fair_eps_greedy \
  --valuation_mode incremental_hybrid \
  --max_cost_start 6000 \
  --max_cost_end 20000 \
  --max_cost_step 2000 \
  --runs 5

Windows PowerShell version:

python -m RAF.run_experiments `
  --query_path D:\path\to\query.parquet `
  --source_dir D:\path\to\sources `
  --source_glob source_*.parquet `
  --policy fair_eps_greedy `
  --valuation_mode incremental_hybrid `
  --max_cost_start 6000 `
  --max_cost_end 20000 `
  --max_cost_step 2000 `
  --runs 5

If installed with pip install -e ., you can also run:

raf-experiments \
  --query_path /path/to/query.parquet \
  --source_dir /path/to/sources

To inspect all experiment options:

python -m RAF.run_experiments --help
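As a rough guide to the size of a sweep, the snippet below enumerates the budgets implied by the example flags. It assumes that --max_cost_end is inclusive and that every budget is repeated --runs times. Neither assumption is confirmed here, so check the --help output before relying on the count.

```python
def sweep_budgets(start, end, step):
    """Budgets from start to end in increments of step (end assumed inclusive)."""
    return list(range(start, end + 1, step))

budgets = sweep_budgets(6000, 20000, 2000)
runs = 5
total_runs = len(budgets) * runs  # assumed: each budget is repeated `runs` times
```

Under these assumptions the example command covers 8 budgets and 40 pipeline executions in total.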

External Fair-Clustering Evaluation

fair_external_eval.py is kept for compatibility with external fair-clustering solvers. The external solver repository is not bundled in this public release.

If you use:

--eval_clustering_method external_fair_relax_merge

then you must also provide:

--external_fair_algo_dir /path/to/external/solver

Install the external solver's dependencies separately and follow its license.

Reproducibility Notes

RAF does not bundle raw datasets or processed query/source files.
Public code starts from RAF-ready Parquet inputs.
The default embedding column is embedding.
The default sensitive attribute column is is_english_name.
incremental_hybrid requires hnswlib.
External fair-clustering solvers are optional and must be installed separately.
Dataset-specific preprocessing should be documented together with the processed dataset release.

Citation

Please also cite the original datasets used in your experiments according to their official citation instructions.

License

Raw datasets and external solvers are governed by their own licenses and terms of use.

Contact

For questions about RAF, please open an issue in this repository or contact the authors (pengyueli@whu.edu.cn).
