Skip to content

Tools and methods for detecting and anonymizing Personally Identifiable Information (PII) using AI-driven approaches. This repository includes implementations of fine-tuned models and comparative evaluations for enhancing data privacy in educational content.

License

Notifications You must be signed in to change notification settings

AnonJD/PrivacyAI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Enhancing the De-identification of Personally Identifiable Information in Educational Data

arXiv

Code for our experiments on PII detection using Presidio, Azure AI Language, prompted GPT-4o-mini, fine-tuned GPT-4o-mini, and Verifier models on the CRAPII and TSCC datasets.


0. Datasets

CRAPII (Cleaned Repository of Annotated PII)

TSCC

To access the TSCC dataset, see the paper (instructions inside for how to request the data):
https://ecp.ep.liu.se/index.php/sltc/article/view/575


1. Prepare Data

Download the following files and place them in the data/ directory:

  • obfuscated_data_06.json
  • pii_true_entities.csv
  • original_transcripts.txt
  • placeholder_locations_new.txt

2. Dataset Creation

Create the Base Train Set, Verifier Train Set, and Test Set from CRAPII:

python mk_train.py

3. Inference on the CRAPII Dataset

3.1 Presidio

Run inference with Presidio (en_core_web_lg and en_core_web_trf) on the Test Set:

python presidio_inference.py lg 
python presidio_inference.py trf

3.2 Azure AI Language

Run the Azure model on the CRAPII Test Set:

  • Run all cells in azure_inference.ipynb

3.3 Prompted GPT-4o-mini

Open and run all cells in:

  • prompted_gpt_inference.ipynb

3.4 Fine-tuned GPT-4o-mini

Fine-tune on the CRAPII Base Train Set:

  • Run all cells in ft_gpt_training.ipynb

Then evaluate on the CRAPII Test Set:

  • Run all cells in ft_gpt_inference.ipynb

3.5 Verifier Setup

Create the dataset and train the Verifier models:

  • Run all cells in verifier_training.ipynb

Then perform inference on the Test Set:

  • Run all cells in verifier_inference.ipynb

4. TSCC Dataset

Create and split the TSCC dataset into train/test:

python TSCC_dataset_creation.py

Fine-tune GPT-4o-mini on the TSCC train split:

  • Run all cells in gpt_ft_tscc.ipynb

Run experiments on TSCC:

  • Execute all notebooks that start with TSCC

5. Paper

If you use this repository, please cite our paper:
arXiv: https://arxiv.org/abs/2501.09765


License

This project is licensed under the MIT License – see the LICENSE file for details.

About

Tools and methods for detecting and anonymizing Personally Identifiable Information (PII) using AI-driven approaches. This repository includes implementations of fine-tuned models and comparative evaluations for enhancing data privacy in educational content.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published