A Python tool for detecting and anonymizing privacy-sensitive information in open-ended Dutch survey responses or other open answers.
SEAA helps identify and anonymize potentially privacy-sensitive information in text responses, which is particularly useful when processing survey data. Any CSV file with open answers can be processed. It uses dictionary-based matching, refined through user interaction, to:
- Detect unknown words that might contain private information
- Flag known privacy-sensitive terms (names, medical conditions, etc.)
- Replace sensitive information with category markers (e.g., [NAME], [ILLNESS])
- Allow users to expand the whitelist/blacklist of words through interactive review
- Store user input in the dictionaries so it is reused in future analyses
NOTE: this tool is primarily designed for Dutch text, but includes translation capabilities for non-Dutch responses.
Disclaimer: SEAA is a tool for anonymization of text data, but it does not replace a manual check of the results, and neither SEAA nor its creators can be held responsible for misdetection of privacy-related data.
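To make the matching steps listed above concrete, here is a minimal sketch of dictionary-based censoring. It is not SEAA's actual implementation: the function name `censor`, the two tiny word collections, and the regex tokenization are assumptions made for illustration. Only the markers ([NAME], [ILLNESS], and [UNKOWN], spelled as in the output described in this README) come from the text.

```python
import re

# Hypothetical miniature dictionaries; the real tool loads much larger lists from dict/.
KNOWN_WORDS = {"mijn", "docent", "heeft", "mij", "enorm", "geholpen"}
SENSITIVE = {"peter": "NAME", "griep": "ILLNESS"}


def censor(text):
    """Return (censored_text, flagged_words, unknown_words) for one response."""
    flagged, unknown = [], []

    def replace(match):
        word = match.group(0)
        key = word.lower()
        if key in SENSITIVE:
            # Known privacy-sensitive term: swap in its category marker.
            flagged.append(key)
            return f"[{SENSITIVE[key]}]"
        if key not in KNOWN_WORDS:
            # Not in any dictionary: treat as potentially sensitive.
            unknown.append(key)
            return "[UNKOWN]"  # marker spelled as in this README
        return word

    censored = re.sub(r"\w+", replace, text)
    return censored, flagged, unknown


print(censor("Mijn docent Peter heeft mij enorm geholpen"))
```

Unknown words are censored as well as flagged words in this sketch, which mirrors the cautious default described above: a word is only left visible once it is known to be safe.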
%%{init: {'sequence': {'theme': 'hand'}}}%%
sequenceDiagram
participant Input as Input Files
participant SEAA as SEAA Process
participant Dict as Dictionaries
participant User as User Review
participant Out as Output Files
Input->>SEAA: Standard CSV
activate SEAA
SEAA->>SEAA: Load & Clean Text
SEAA->>SEAA: Detect Language
alt Non-Dutch Text
SEAA->>SEAA: Translate to Dutch
end
loop Word Check
SEAA->>Dict: Check against dictionaries
Dict-->>SEAA: Return matches
end
SEAA->>Out: Write SEAA_output.csv
SEAA->>Out: Write unknown_words.csv
deactivate SEAA
loop For each unknown word
Out->>User: Present word
User->>Dict: Add to whitelist/blacklist
end
Dict->>Dict: Update dictionaries
Before installing SEAA, ensure you have:
- Python 3.7 or higher installed
- Git installed
- Basic understanding of command line operations
- A modern web browser (Chrome, Firefox, Safari, or Edge)
- Clone the repository and switch to the AL_local_flask branch:
git clone https://github.com/uashogeschoolutrecht/SEAA.git
cd SEAA
- Install required dependencies:
pip install -r requirements.txt
SEAA provides a web interface for processing and anonymizing your data. There are two ways to run the application:
- Open a terminal in the project directory
- Run the Flask application:
python app.py
- Open your web browser and navigate to:
http://localhost:5000
Note: The development server is not suitable for production use.
Once the application is running, you can:
- Access the main interface at http://localhost:5000 for file processing
- View the documentation on the Documentation page
- Upload your CSV files through the web interface
- Process and download anonymized results
- Help improve the dictionaries through the interactive review process (optional)
Your input CSV file must:
- Use semicolon as the separator
- Contain these columns in order:
  - respondent_id: Unique identifier for each respondent
  - Answer: The text responses to analyze
  - question_id: Identifier for the question being answered
Example input CSV format:
respondent_id;Answer;question_id
1001;"Mijn docent Peter heeft mij enorm geholpen";Q1
1002;"Ik had moeite met concentratie tijdens de lessen";Q1
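Before uploading, it can help to check that a file matches the format described above. The following is a minimal sketch using Python's standard `csv` module; the function name `read_responses` and the constant `EXPECTED_COLUMNS` are hypothetical helpers for this example and are not part of SEAA.

```python
import csv
import io

# Column names and order as described in this README.
EXPECTED_COLUMNS = ["respondent_id", "Answer", "question_id"]


def read_responses(handle):
    """Read a semicolon-separated survey file and check the column order."""
    reader = csv.DictReader(handle, delimiter=";")
    if reader.fieldnames is None or reader.fieldnames[:3] != EXPECTED_COLUMNS:
        raise ValueError(
            f"Expected columns {EXPECTED_COLUMNS}, got {reader.fieldnames}"
        )
    return list(reader)


# In-memory sample so the sketch runs without a file on disk.
sample = io.StringIO(
    "respondent_id;Answer;question_id\n"
    '1001;"Mijn docent Peter heeft mij enorm geholpen";Q1\n'
)
rows = read_responses(sample)
print(rows[0]["Answer"])
```

For a real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is an assumption and may need to match how the survey export was saved.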
The tool generates several output files:
- SEAA_output.csv: Main analysis results containing:
  - Original text
  - Censored text
  - Privacy flags
  - Detected sensitive words
- avg_words_count.csv: List of unknown words for review
- Updated dictionary files in dict/ folder:
  - whitelist.txt: Safe words
  - blacklist.txt: Privacy-sensitive words
The SEAA_output.csv contains the following columns:
- respondent_id: Original respondent identifier
- Answer: Original text response
- question_id: Original question identifier
- answer_clean: Cleaned version of the text (lowercase, normalized)
- contains_privacy: Binary flag (1/0) indicating if privacy-sensitive content was detected
- unknown_words: List of words not found in the dictionary or whitelist
- flagged_words: List of words matched against the privacy-sensitive dictionaries
- answer_censored: Text with privacy-sensitive words replaced by category markers (e.g., [NAME], [ILLNESS]) and unknown words replaced by [UNKOWN]
- total_word_count: Total number of words in the response
- unknown_word_count: Number of words not found in dictionaries (still need to be reviewed)
- flagged_word_count: Number of privacy-sensitive words detected
- unknown_words_not_flagged: Unknown words that are not in the dictionaries
- flagged_word_type: Categories of privacy-sensitive content found (e.g., "name, illness")
- language: Detected language of the response (e.g., 'nl' for Dutch, 'en' for English)
Example row:
respondent_id;Answer;question_id;answer_clean;contains_privacy;unknown_words;flagged_words;answer_censored;total_word_count;unknown_word_count;flagged_word_count;unknown_words_not_flagged;flagged_word_type;language
1;"Mijn docent Peter heeft mij geholpen met mijn loopbaanbegleidingstraject";"Q1";"mijn docent peter heeft mij geholpen met mijn loopbaanbegleidingstraject";1;"loopbaanbegleidingstraject";peter;"Mijn docent [NAME] heeft mij geholpen met mijn [UNKOWN]";9;1;1;"loopbaanbegleidingstraject";"name";"nl"
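Once the output file exists, the columns described above can be used for a quick overview. The sketch below counts flagged categories and lists respondents whose answers contain privacy-sensitive content. The function name `summarize` is hypothetical, the sample uses only three of the output columns, and it assumes that `flagged_word_type` holds a comma-separated list, as the "name, illness" example suggests.

```python
import csv
import io
from collections import Counter


def summarize(handle):
    """Count flagged categories and collect respondents with privacy hits."""
    reader = csv.DictReader(handle, delimiter=";")
    categories = Counter()
    flagged_respondents = []
    for row in reader:
        if row["contains_privacy"] == "1":
            flagged_respondents.append(row["respondent_id"])
            # Assumption: categories are separated by commas.
            for category in row["flagged_word_type"].split(","):
                if category.strip():
                    categories[category.strip()] += 1
    return flagged_respondents, categories


# Reduced in-memory sample with only the columns this sketch needs.
sample = io.StringIO(
    "respondent_id;contains_privacy;flagged_word_type\n"
    '1;1;"name, illness"\n'
    "2;0;\n"
)
flagged_respondents, categories = summarize(sample)
print(flagged_respondents, dict(categories))
```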
The tool will present unknown words for review, allowing you to:
- Add words to the whitelist (safe words)
- Add words to the blacklist (privacy-sensitive words)
- Skip words for later review
Example interaction:
"docent" kwam 45 keer voor in de open antwoorden.
Wil je dit woord toevoegen aan de whitelist? (j/n/blacklist): j
Woord "docent" is toegevoegd aan de whitelist
"janssen" kwam 12 keer voor in de open antwoorden.
Wil je dit woord toevoegen aan de whitelist? (j/n/blacklist): blacklist
Woord "janssen" is toegevoegd aan de blacklist
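The review step shown above boils down to routing each word into one of two sets based on the answer. A minimal sketch of that logic follows; `apply_decision` is a hypothetical helper, and treating every answer other than "j" or "blacklist" as "skip" is an assumption based on the three options listed above.

```python
def apply_decision(word, answer, whitelist, blacklist):
    """Update the word sets based on one review answer (j / n / blacklist)."""
    answer = answer.strip().lower()
    if answer == "j":
        whitelist.add(word)
    elif answer == "blacklist":
        blacklist.add(word)
    # Any other answer (for example "n") skips the word for a later review.
    return whitelist, blacklist


whitelist, blacklist = set(), set()
for word, answer in [("docent", "j"), ("janssen", "blacklist"), ("xyz", "n")]:
    apply_decision(word, answer, whitelist, blacklist)

print(sorted(whitelist), sorted(blacklist))
```

Keeping the decision logic separate from the prompt makes it usable from both a command-line loop and a web form.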
The tool uses several dictionary files in the dict/ folder:
- wordlist.txt: Base dictionary of common words
- whitelist.txt: User-approved safe words
- blacklist.txt: Known privacy-sensitive words
- illness.txt: Medical conditions and health-related terms
- studiebeperking.txt: Study limitations
- names.txt: Common first names plus some last names
- familie.txt: Family relationship terms
- plaatsnamen.txt: All locations in the Netherlands from the Dutch census
- persoonlijke_omstandigheden.txt: Personal circumstances
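As an illustration of how such files could be turned into a lookup table, the sketch below reads word lists and maps each word to a category marker. Several things here are assumptions and not confirmed by this README: that the files contain one word per line, that names.txt corresponds to [NAME] and illness.txt to [ILLNESS], and the helper names `load_words` and `load_categories`.

```python
import tempfile
from pathlib import Path

# Assumed pairing of dictionary files with category markers.
CATEGORY_FILES = {"names.txt": "NAME", "illness.txt": "ILLNESS"}


def load_words(path):
    """Read one word per line, lowercased, ignoring blank lines."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def load_categories(dict_dir):
    """Map every sensitive word to its category marker."""
    lookup = {}
    for filename, marker in CATEGORY_FILES.items():
        path = Path(dict_dir) / filename
        if path.exists():  # skip files missing from this folder
            for word in load_words(path):
                lookup[word] = marker
    return lookup


# Demonstration with a temporary folder standing in for dict/.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "names.txt").write_text("Peter\n\nJanssen\n", encoding="utf-8")
    demo_lookup = load_categories(demo_dir)

print(demo_lookup)
```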
The tool automatically detects the language of responses. For non-Dutch text, it uses a translation service to convert the text to Dutch before processing. This allows SEAA to handle multilingual datasets while maintaining consistent anonymization rules.
The translation system:
- Detects the source language using language detection
- Translates non-Dutch text to Dutch using multiple translation services
- Falls back to alternative translators if one fails
- Handles large texts by breaking them into manageable chunks
Limitations to keep in mind:
- The dictionary-based approach may miss complex or context-dependent privacy information
- Translation quality may affect anonymization accuracy for non-Dutch responses
- Regular maintenance of the dictionaries is recommended for optimal performance
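The chunking and fallback behaviour of the translation system described earlier can be sketched as follows. This is an illustration under stated assumptions, not SEAA's code: the function names, the 500-character default, and the two stand-in translators are all hypothetical, and no real translation service is called.

```python
def split_into_chunks(text, max_chars=500):
    """Split text on whitespace into pieces of at most max_chars characters.

    A single word longer than max_chars is kept as its own oversized chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_with_fallback(text, translators, max_chars=500):
    """Translate chunk by chunk, trying each translator in order."""
    translated = []
    for chunk in split_into_chunks(text, max_chars):
        for translate in translators:
            try:
                translated.append(translate(chunk))
                break
            except Exception:
                continue  # this service failed; try the next one
        else:
            raise RuntimeError("All translators failed")
    return " ".join(translated)


# Stand-in translators for demonstration only.
def broken_service(chunk):
    raise ConnectionError("service unavailable")


def fake_service(chunk):
    return chunk.upper()


print(translate_with_fallback(
    "my teacher helped me", [broken_service, fake_service], max_chars=10
))
```

Passing the translators as a list of callables keeps the fallback order explicit and makes it easy to swap services in or out.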