Code for our experiments on PII detection using Presidio, Azure AI Language, prompted GPT-4o-mini, fine-tuned GPT-4o-mini, and Verifier models on the CRAPII and TSCC datasets.
- Kaggle dataset: https://www.kaggle.com/datasets/langdonholmes/cleaned-repository-of-annotated-pii/data
- Paper: https://educationaldatamining.org/edm2024/proceedings/2024.EDM-posters.88/index.html
- Citation:
Holmes, L., Crossley, S. A., Wang, J., & Zhang, W. (2024). Cleaned Repository of Annotated PII. Proceedings of the 17th International Conference on Educational Data Mining (EDM).
To access the TSCC dataset, see the paper below, which includes instructions for requesting the data:
https://ecp.ep.liu.se/index.php/sltc/article/view/575
Download the following files and place them in the data/ directory:
- obfuscated_data_06.json
- pii_true_entities.csv
- original_transcripts.txt
- placeholder_locations_new.txt
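An optional sanity check (not part of the repo's instructions) that the four files above landed in data/:

```shell
# Optional sanity check: confirm the four TSCC files are in data/.
mkdir -p data
for f in obfuscated_data_06.json pii_true_entities.csv \
         original_transcripts.txt placeholder_locations_new.txt; do
  if [ -f "data/$f" ]; then
    echo "found: $f"
  else
    echo "MISSING: $f"
  fi
done
```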
Create the Base Train Set, Verifier Train Set, and Test Set from CRAPII:
python mk_train.py

Run inference with Presidio (en_core_web_lg and en_core_web_trf) on the Test Set:
python presidio_inference.py lg
python presidio_inference.py trf

Run the Azure model on the CRAPII Test Set:
- Run all cells in azure_inference.ipynb
Run prompted GPT-4o-mini on the CRAPII Test Set:
- Run all cells in prompted_gpt_inference.ipynb
Fine-tune on the CRAPII Base Train Set:
- Run all cells in ft_gpt_training.ipynb
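For orientation: OpenAI fine-tuning jobs consume chat-format JSONL. The exact prompts and label format used for GPT-4o-mini are defined in ft_gpt_training.ipynb, so the record below is purely illustrative:

```python
import json

# Illustrative chat-format JSONL record for an OpenAI fine-tuning job.
# The actual system prompt and assistant label format are defined in
# ft_gpt_training.ipynb; this content is a placeholder, not the repo's.
example = {
    "messages": [
        {"role": "system", "content": "Label all PII spans in the text."},
        {"role": "user", "content": "My name is Jane Doe."},
        {"role": "assistant", "content": '{"NAME": ["Jane Doe"]}'},
    ]
}

with open("train_sample.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```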
Then evaluate on the CRAPII Test Set:
- Run all cells in ft_gpt_inference.ipynb
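The evaluation itself lives in the notebooks; for orientation, entity-level exact-match precision/recall/F1 over (start, end, label) spans can be sketched as follows (span_f1 is our illustrative helper, not part of the repo):

```python
# Minimal sketch of entity-level exact-match scoring (illustrative only;
# the repo's notebooks define the actual evaluation).
def span_f1(gold, pred):
    """gold/pred: iterables of (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans that match exactly in position and label
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = span_f1({(0, 4, "NAME"), (10, 20, "EMAIL")},
                  {(0, 4, "NAME"), (30, 35, "PHONE")})
print(p, r, f)  # 0.5 0.5 0.5
```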
Create the dataset and train the Verifier models:
- Run all cells in verifier_training.ipynb
Then perform inference on the Test Set:
- Run all cells in verifier_inference.ipynb
Create and split the TSCC dataset into train/test:
python TSCC_dataset_creation.py

Fine-tune GPT-4o-mini on the TSCC train split:
- Run all cells in gpt_ft_tscc.ipynb
Run experiments on TSCC:
- Execute all notebooks whose names start with TSCC
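One way to run those notebooks non-interactively (assuming jupyter with nbconvert is installed; this is a convenience sketch, not part of the repo's instructions):

```shell
shopt -s nullglob  # skip the loop cleanly if no TSCC notebooks are present
count=0
for nb in TSCC*.ipynb; do
  # Execute each TSCC notebook in place, saving outputs into the file.
  jupyter nbconvert --to notebook --execute --inplace "$nb"
  count=$((count + 1))
done
echo "executed $count TSCC notebooks"
```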
If you use this repository, please cite our paper:
arXiv: https://arxiv.org/abs/2501.09765
This project is licensed under the MIT License – see the LICENSE file for details.