The Data Censoror is a data engineering project designed to automate the process of redacting sensitive information from text files. This tool applies a multi-layered approach to identify and censor names, dates, phone numbers, addresses, and email addresses. With capabilities to process large datasets and output redacted copies of text files, the Data Censoror can be a valuable asset for applications requiring data privacy and confidentiality, such as redacting information in police reports, court transcripts, and medical records.
- Flexible Input/Output: Accepts multiple text files and processes them based on a user-specified glob pattern.
- Comprehensive Censorship Options: Detects and censors names, dates, phone numbers, addresses, and email addresses.
- Layered Censoring Approach: Combines regular expressions, spaCy, and Google Cloud Natural Language API to achieve high accuracy in identifying sensitive information.
- Detailed Censorship Statistics: Generates and outputs statistics on the redaction process for user analysis.
- Python 3.11
- Pipenv (for dependency management)
- Google Cloud Natural Language API credentials (for enhanced entity recognition)
- Clone this repository:

  git clone https://github.com/Vveanta/Data_Censoror.git
  cd Data_Censoror

- Install dependencies using Pipenv:

  pipenv install

- Place your Google Cloud Natural Language API credentials JSON file in the `files/` directory. Update the environment variable in the code if necessary.
To execute the program, run the following command:
pipenv run python censoror.py --input '*.txt' --names --dates --phones --address --output 'files/' --stats stderr
- --input: Glob pattern for input text files.
- --names: Censor names.
- --dates: Censor dates in various formats.
- --phones: Censor phone numbers in common formats.
- --address: Censor addresses.
- --output: Directory to store the censored files.
- --stats: Location to output the censorship statistics (stderr, stdout, or file path).
This command will process all `.txt` files in the current directory, censor the specified types of information, write the censored files to the `files/` directory, and print censorship statistics to stderr.
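The flags above map naturally onto Python's `argparse`; the sketch below shows how such a CLI might be parsed. The function name `build_parser` and the defaults are illustrative assumptions, not the project's exact code:

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Censor sensitive data in text files.")
    parser.add_argument("--input", required=True, help="Glob pattern for input text files")
    parser.add_argument("--names", action="store_true", help="Censor names")
    parser.add_argument("--dates", action="store_true", help="Censor dates")
    parser.add_argument("--phones", action="store_true", help="Censor phone numbers")
    parser.add_argument("--address", action="store_true", help="Censor addresses")
    parser.add_argument("--output", required=True, help="Directory for censored files")
    parser.add_argument("--stats", default="stderr", help="stderr, stdout, or a file path")
    return parser

args = build_parser().parse_args(["--input", "*.txt", "--names", "--output", "files/"])
print(args.names, args.output)  # True files/
```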
The Data Censoror uses a combination of natural language processing (NLP) libraries and regular expressions for effective and precise redaction.
- `preprocess_text_for_phones`: Censors phone numbers using regex patterns for various formats.
- `preprocess_text_for_dates`: Detects and censors dates in multiple date formats.
- `censor_text_with_google_nlp`: Leverages the Google Cloud Natural Language API to identify and censor names and addresses with additional accuracy.
- `create_matcher`: Initializes spaCy's Matcher to detect sensitive entities like phone numbers based on custom patterns.
- `apply_censoring`: Censors identified sensitive information by replacing it with a block character.
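To illustrate the regex layer and block-character replacement, here is a minimal, self-contained sketch. The phone pattern and the helper signatures are assumptions for illustration, not the project's exact implementation:

```python
import re

BLOCK = "\u2588"  # full block character used as the redaction glyph

# Matches common US-style phone formats, e.g. 555-123-4567 or (555) 123-4567
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def apply_censoring(text, spans):
    """Replace each (start, end) span with block characters of equal length."""
    result = list(text)
    for start, end in spans:
        result[start:end] = BLOCK * (end - start)
    return "".join(result)

def preprocess_text_for_phones(text):
    """Find phone-number spans with the regex, then black them out."""
    spans = [(m.start(), m.end()) for m in PHONE_RE.finditer(text)]
    return apply_censoring(text, spans)

print(preprocess_text_for_phones("Call (555) 123-4567 today."))
```

Replacing each match with an equal number of block characters preserves the file's length and layout, which keeps the redacted copy aligned with the original.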
This approach, beginning with regex for precision, followed by Google Cloud NLP for broader detection, and spaCy’s NER for effective date and phone number censorship, ensures a thorough and efficient redaction process.
To censor only phone numbers and dates, use:
pipenv run python censoror.py --input '*.txt' --dates --phones --output 'censored_files/' --stats stdout
This will output censored files to the `censored_files/` directory and print redaction statistics to stdout.
Email addresses are automatically censored due to the nature of their sensitivity, regardless of the specific flags provided by the user. This feature ensures that any email address within a document is protected by default.
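Unconditional email redaction can be done with a single regex pass, as sketched below. The pattern is a common heuristic for email matching, not necessarily the one the project uses:

```python
import re

# Heuristic email pattern: local part, "@", domain with at least one dot
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def censor_emails(text):
    """Replace every email address with block characters of equal length."""
    return EMAIL_RE.sub(lambda m: "\u2588" * len(m.group()), text)

print(censor_emails("Contact jane.doe@example.com now."))
```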
The address flag also includes location-based entities (e.g., city names or landmarks) due to the potential for sensitive information leakage. This approach is intended to capture any mentions that could imply a physical address.
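In NER terms, this broad handling amounts to treating location-type entity labels (such as spaCy's `GPE`, `LOC`, and `FAC`) as address material. The label set below is an assumption about scope, and the entity tuples stand in for the output of an NER pass:

```python
# Entity labels treated as address material (an assumed scope, per the note above)
LOCATION_LABELS = {"GPE", "LOC", "FAC", "ADDRESS"}

def spans_to_censor(entities):
    """Select (start, end) spans whose label counts as a location.

    `entities` is a list of (start, end, label) tuples, e.g. derived from
    spaCy's doc.ents or Google NLP entity mentions.
    """
    return [(start, end) for start, end, label in entities if label in LOCATION_LABELS]

# Hypothetical NER output for "She moved to Paris near the Eiffel Tower."
ents = [(13, 18, "GPE"), (28, 40, "FAC")]
print(spans_to_censor(ents))  # [(13, 18), (28, 40)]
```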
- Partial Addresses: Censorship of incomplete addresses may result in false negatives.
- Embedded Names in Paths: Names embedded within other strings (e.g., `data/JohnDoe`) may not be recognized and censored.
- Location Censorship: City names or landmarks might be redacted even without explicit address context, to prevent under-censorship.
- The user has access to the Google Cloud Natural Language API for accurate entity recognition.
- Sensitive information is only censored based on the provided flags (with the exception of email addresses).
- The tool censors any mention of a location as part of the address flag.
Unit tests are located in the `tests/` directory. Each test file is designed to verify the censorship functionality of a specific entity type.
To run the tests:
pipenv run python -m pytest
- `test_censor_address.py`: Verifies address censorship.
- `test_censor_name.py`: Checks that names are properly censored.
- `test_censor_phone.py`: Ensures phone numbers are accurately redacted.
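A test in this style might look like the sketch below. To keep it self-contained here, a stand-in `censor_phones` is defined inline; the real test files instead import the project's own functions from `censoror.py`:

```python
import re

BLOCK = "\u2588"

def censor_phones(text):
    # Stand-in for the project's phone censor, so this example runs standalone.
    return re.sub(r"\d{3}[-.]\d{3}[-.]\d{4}", lambda m: BLOCK * len(m.group()), text)

def test_phone_is_redacted():
    censored = censor_phones("Reach me at 352-555-0100.")
    assert "352-555-0100" not in censored
    # Equal-length replacement keeps the text layout intact
    assert len(censored) == len("Reach me at 352-555-0100.")

test_phone_is_redacted()
print("ok")
```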
This project demonstrates the integration of various NLP techniques and APIs to create an effective and reliable data censorship tool. By combining regex, spaCy, and Google Cloud NLP, the Data Censoror offers a comprehensive solution to the challenge of sensitive data redaction.
- Input: Reads multiple text files using glob patterns.
- Censorship: Applies regex-based phone and date censorship, Google Cloud NLP-based name and address censorship, and spaCy for additional patterns.
- Output: Saves redacted copies of the text files and generates a summary of redacted terms.
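The three stages above can be sketched end to end as follows. The function name `run_pipeline`, the `.censored` output suffix, and the block-count statistic are illustrative assumptions rather than the project's exact behavior:

```python
import glob
import os

def run_pipeline(pattern, output_dir, censor):
    """Read files matching a glob, write redacted copies, return per-file stats."""
    os.makedirs(output_dir, exist_ok=True)
    stats = {}
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        censored = censor(text)  # any callable applying the enabled censors
        out_path = os.path.join(output_dir, os.path.basename(path) + ".censored")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(censored)
        # Count redacted (block) characters as a simple summary statistic
        stats[path] = censored.count("\u2588")
    return stats
```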