Automate sensitive information redaction from text files using regex, spaCy, and Google Cloud NLP, ensuring data privacy across various document types.

Vveanta/Data_Censoror

Data Censoror - Redacting Sensitive Information from Text Files

Project Overview

Data Censoror is a data engineering project that automates the redaction of sensitive information from text files. It applies a multi-layered approach to identify and censor names, dates, phone numbers, addresses, and email addresses. Because it can process large sets of files and write a redacted copy of each, it suits applications that require data privacy and confidentiality, such as redacting police reports, court transcripts, and medical records.

Features

  • Flexible Input/Output: Accepts multiple text files and processes them based on a user-specified glob pattern.
  • Comprehensive Censorship Options: Detects and censors names, dates, phone numbers, addresses, and email addresses.
  • Layered Censoring Approach: Combines regular expressions, spaCy, and Google Cloud Natural Language API to achieve high accuracy in identifying sensitive information.
  • Detailed Censorship Statistics: Generates and outputs statistics on the redaction process for user analysis.

Installation and Setup

Requirements

  • Python 3.11
  • Pipenv (for dependency management)
  • Google Cloud Natural Language API credentials (for enhanced entity recognition)

Installation

  1. Clone this repository:

    git clone https://github.com/Vveanta/Data_Censoror.git
    cd Data_Censoror
    
  2. Install dependencies using Pipenv:

    pipenv install
    
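  3. Place your Google Cloud Natural Language API credentials JSON file in the files/ directory. If the file name or location differs, update the environment variable that the code sets to point at it (Google Cloud client libraries conventionally read GOOGLE_APPLICATION_CREDENTIALS).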

Running the Censoror

To execute the program, run the following command:

pipenv run python censoror.py --input '*.txt' --names --dates --phones --address --output 'files/' --stats stderr

Parameters:

  • --input: Glob pattern for input text files.
  • --names: Censor names.
  • --dates: Censor dates in various formats.
  • --phones: Censor phone numbers in common formats.
  • --address: Censor addresses.
  • --output: Directory to store the censored files.
  • --stats: Location to output the censorship statistics (stderr, stdout, or file path).

This command will process all .txt files in the current directory, censor the specified types of information, output censored files in the files/ directory, and print censorship statistics to stderr.
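
For readers who want to see how these flags fit together, here is a minimal sketch of a parser that accepts them, using the standard library's argparse. It is not the project's actual code: the function name build_parser, the help strings, the required settings, and the stderr default for --stats are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Redact sensitive information from text files."
    )
    parser.add_argument("--input", required=True,
                        help="Glob pattern for input text files.")
    # Each censoring flag is a simple on/off switch.
    parser.add_argument("--names", action="store_true", help="Censor names.")
    parser.add_argument("--dates", action="store_true", help="Censor dates.")
    parser.add_argument("--phones", action="store_true",
                        help="Censor phone numbers.")
    parser.add_argument("--address", action="store_true",
                        help="Censor addresses.")
    parser.add_argument("--output", required=True,
                        help="Directory to store the censored files.")
    # Defaulting to stderr is an assumption, not documented behavior.
    parser.add_argument("--stats", default="stderr",
                        help="stderr, stdout, or a file path.")
    return parser


# Parse the same arguments as the command shown above.
args = build_parser().parse_args(
    ["--input", "*.txt", "--names", "--dates", "--phones", "--address",
     "--output", "files/", "--stats", "stderr"]
)
```

Using store_true for the censoring flags means a category is redacted only when its flag is present, which matches the behavior described under Parameters.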


Implementation Details

The Data Censoror uses a combination of natural language processing (NLP) libraries and regular expressions for effective and precise redaction.

Core Functions

  • preprocess_text_for_phones: Censors phone numbers using regex patterns for various formats.
  • preprocess_text_for_dates: Detects and censors dates in multiple date formats.
  • censor_text_with_google_nlp: Leverages the Google Cloud Natural Language API to identify and censor names and addresses with additional accuracy.
  • create_matcher: Initializes spaCy’s Matcher to detect sensitive entities like phone numbers based on custom patterns.
  • apply_censoring: Censors identified sensitive information by replacing it with a block character.

The layers are applied in sequence: regex comes first for precise matching of phone numbers and dates, Google Cloud NLP follows for broader detection of names and addresses, and spaCy then catches remaining date and phone number patterns.
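
The regex layer and the block-character replacement can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the two patterns are deliberately simple stand-ins, and find_spans and censor_spans are hypothetical helper names rather than the functions listed above.

```python
import re

BLOCK = "\u2588"  # full block character used as the replacement

# Illustrative patterns only; the project's actual expressions may differ.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b|\b\d{4}-\d{2}-\d{2}\b")


def find_spans(pattern, text):
    """Return (start, end) offsets for every match of pattern in text."""
    return [match.span() for match in pattern.finditer(text)]


def censor_spans(text, spans):
    """Replace every character inside the given spans with BLOCK."""
    chars = list(text)
    for start, end in spans:
        for index in range(start, end):
            chars[index] = BLOCK
    return "".join(chars)


text = "Call (555) 123-4567 on 03/14/2024."
# Each detector contributes spans; overlapping spans are harmless because
# censoring a character twice gives the same result.
spans = find_spans(PHONE_RE, text) + find_spans(DATE_RE, text)
redacted = censor_spans(text, spans)
```

Collecting character spans first and censoring once at the end keeps the text length unchanged, so spans from other detectors (for example, entities returned by an NLP service) could be added to the same list without offset drift.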


Usage Examples

Basic Censorship Example

To censor only phone numbers and dates, use:

pipenv run python censoror.py --input '*.txt' --dates --phones --output 'censored_files/' --stats stdout

This will output censored files to the censored_files/ directory and print redaction statistics to stdout.
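
The --stats destination accepts stderr, stdout, or a file path. A minimal sketch of that routing is shown below. The function names, the one-line report format, and the choice to overwrite rather than append to a file are assumptions for illustration and may not match what censoror.py does.

```python
import sys


def format_stats(filename, counts):
    """Render one summary line from a {category: count} mapping."""
    parts = ", ".join(f"{kind}={counts[kind]}" for kind in sorted(counts))
    return f"{filename}: {parts}"


def write_stats(report, destination):
    """Send the report to stderr, stdout, or a file path."""
    if destination == "stderr":
        print(report, file=sys.stderr)
    elif destination == "stdout":
        print(report, file=sys.stdout)
    else:
        # Any other value is treated as a file path; overwriting the file
        # is an assumption made for this sketch.
        with open(destination, "w", encoding="utf-8") as handle:
            handle.write(report + "\n")


write_stats(format_stats("report.txt", {"phones": 2, "dates": 1}), "stdout")
```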

Full Censorship with Email Detection

Email addresses are inherently sensitive, so they are always censored, regardless of which flags the user provides. Any email address in a document is therefore protected by default.
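
As a rough sketch of this always-on behavior, the snippet below censors emails before applying any flag-selected steps. The pattern is a common simplified email expression, and censor_emails and censor are hypothetical names; the project's own pattern and function layout may differ.

```python
import re

BLOCK = "\u2588"  # full block character used as the replacement

# Simplified email pattern for illustration; real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def censor_emails(text):
    """Replace each email address with a run of BLOCK of equal length."""
    return EMAIL_RE.sub(lambda match: BLOCK * len(match.group()), text)


def censor(text, extra_censors=()):
    """Censor emails unconditionally, then apply any flag-selected censors."""
    text = censor_emails(text)
    for step in extra_censors:
        text = step(text)
    return text


# No flags selected, yet the email address is still redacted.
redacted = censor("Contact jane.doe@example.com today.")
```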

Note on Address Censorship

The address flag also includes location-based entities (e.g., city names or landmarks) due to the potential for sensitive information leakage. This approach is intended to capture any mentions that could imply a physical address.


Limitations and Assumptions

Known Limitations

  • Partial Addresses: Incomplete addresses may go undetected (false negatives).
  • Embedded Names in Paths: Names embedded within other strings (e.g., data/JohnDoe) may not be recognized and censored.
  • Location Censorship: City names or landmarks might be redacted, even without explicit address context, to prevent under-censorship.

Assumptions

  • The Google Cloud NLP API is accessible to the user for accurate entity recognition.
  • Sensitive information is only censored based on the provided flags (with the exception of email addresses).
  • The tool censors any mention of a location as part of the address flag.

Testing

Unit tests are located in the tests/ directory. Each test file is designed to verify the censorship functionality of specific entity types.

To run the tests:

pipenv run python -m pytest

Test Files

  • test_censor_address.py: Verifies address censorship.
  • test_censor_name.py: Checks that names are properly censored.
  • test_censor_phone.py: Ensures phone numbers are accurately redacted.

Technical Design

This project integrates several NLP techniques and APIs into a single censorship tool. Regex, spaCy, and Google Cloud NLP each handle a different part of the detection work, and the Data Censoror combines them to redact sensitive data.

Data Flow

  1. Input: Reads multiple text files using glob patterns.
  2. Censorship: Applies regex-based phone and date censorship, Google Cloud NLP-based name and address censorship, and spaCy for additional patterns.
  3. Output: Saves redacted copies of the text files and generates a summary of redacted terms.
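
The three steps above can be sketched as a single loop. This is an assumption-laden outline rather than the project's code: process_files is a hypothetical name, the censor argument stands in for whatever redaction function is selected, and the ".censored" suffix on output files is an illustrative naming choice.

```python
import glob
import os


def process_files(pattern, output_dir, censor):
    """Read each file matching pattern, censor it, and write a copy."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for path in sorted(glob.glob(pattern)):
        # Step 1: input.
        with open(path, encoding="utf-8") as source:
            original = source.read()
        # Step 2: censorship, delegated to the supplied callable.
        redacted = censor(original)
        # Step 3: output. The ".censored" suffix is a hypothetical choice.
        target = os.path.join(output_dir, os.path.basename(path) + ".censored")
        with open(target, "w", encoding="utf-8") as sink:
            sink.write(redacted)
        written.append(target)
    return written
```

Passing the censoring step in as a callable keeps file handling separate from detection logic, so each layer can be tested without touching the filesystem.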
