The Data Censoror is a data engineering project designed to automate the process of redacting sensitive information from text files. This tool applies a multi-layered approach to identify and censor names, dates, phone numbers, addresses, and email addresses. With capabilities to process large datasets and output redacted copies of text files, the Data Censoror can be a valuable asset for applications requiring data privacy and confidentiality, such as redacting information in police reports, court transcripts, and medical records.
- Flexible Input/Output: Accepts multiple text files and processes them based on a user-specified glob pattern.
- Comprehensive Censorship Options: Detects and censors names, dates, phone numbers, addresses, and email addresses.
- Layered Censoring Approach: Combines regular expressions, spaCy, and Google Cloud Natural Language API to achieve high accuracy in identifying sensitive information.
- Detailed Censorship Statistics: Generates and outputs statistics on the redaction process for user analysis.
- Python 3.11
- Pipenv (for dependency management)
- Google Cloud Natural Language API credentials (for enhanced entity recognition)
- Clone this repository:

  git clone https://github.com/Vveanta/Data_Censoror.git
  cd Data_Censoror

- Install dependencies using Pipenv:

  pipenv install

- Place your Google Cloud Natural Language API credentials JSON file in the `files/` directory. Update the environment variable in the code if necessary.
To execute the program, run the following command:
pipenv run python censoror.py --input '*.txt' --names --dates --phones --address --output 'files/' --stats stderr
- --input: Glob pattern for input text files.
- --names: Censor names.
- --dates: Censor dates in various formats.
- --phones: Censor phone numbers in common formats.
- --address: Censor addresses.
- --output: Directory to store the censored files.
- --stats: Location to output the censorship statistics (stderr, stdout, or file path).
This command will process all `.txt` files in the current directory, censor the specified types of information, write the censored files to the `files/` directory, and print censorship statistics to stderr.
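The flags above map naturally onto Python's `argparse`; the sketch below shows how such a CLI might be parsed. The function name `build_parser` and the defaults are illustrative assumptions, not the project's exact code:

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags described above."""
    parser = argparse.ArgumentParser(description="Censor sensitive data in text files.")
    parser.add_argument("--input", required=True, help="Glob pattern for input text files")
    parser.add_argument("--names", action="store_true", help="Censor names")
    parser.add_argument("--dates", action="store_true", help="Censor dates")
    parser.add_argument("--phones", action="store_true", help="Censor phone numbers")
    parser.add_argument("--address", action="store_true", help="Censor addresses")
    parser.add_argument("--output", required=True, help="Directory for censored files")
    parser.add_argument("--stats", default="stderr", help="stderr, stdout, or a file path")
    return parser

args = build_parser().parse_args(["--input", "*.txt", "--names", "--output", "files/"])
print(args.names, args.output)  # True files/
```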
The Data Censoror uses a combination of natural language processing (NLP) libraries and regular expressions for effective and precise redaction.
- `preprocess_text_for_phones`: Censors phone numbers using regex patterns for various formats.
- `preprocess_text_for_dates`: Detects and censors dates in multiple date formats.
- `censor_text_with_google_nlp`: Leverages the Google Cloud Natural Language API to identify and censor names and addresses with additional accuracy.
- `create_matcher`: Initializes spaCy's Matcher to detect sensitive entities like phone numbers based on custom patterns.
- `apply_censoring`: Censors identified sensitive information by replacing it with a block character.
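To illustrate the regex layer and block-character replacement, here is a minimal, self-contained sketch. The phone pattern and the helper signatures are assumptions for illustration, not the project's exact implementation:

```python
import re

BLOCK = "\u2588"  # full block character used as the redaction glyph

# Matches common US-style phone formats, e.g. 555-123-4567 or (555) 123-4567
PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

def apply_censoring(text, spans):
    """Replace each (start, end) span with block characters of equal length."""
    result = list(text)
    for start, end in spans:
        result[start:end] = BLOCK * (end - start)
    return "".join(result)

def preprocess_text_for_phones(text):
    """Find phone-number spans with the regex, then black them out."""
    spans = [(m.start(), m.end()) for m in PHONE_RE.finditer(text)]
    return apply_censoring(text, spans)

print(preprocess_text_for_phones("Call (555) 123-4567 today."))
```

Replacing each match with an equal number of block characters preserves the file's length and layout, which keeps the redacted copy aligned with the original.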
This approach, beginning with regex for precision, followed by Google Cloud NLP for broader detection, and spaCy’s NER for effective date and phone number censorship, ensures a thorough and efficient redaction process.
To censor only phone numbers and dates, use:
pipenv run python censoror.py --input '*.txt' --dates --phones --output 'censored_files/' --stats stdout
This will output censored files to the `censored_files/` directory and print redaction statistics to stdout.
Email addresses are automatically censored due to the nature of their sensitivity, regardless of the specific flags provided by the user. This feature ensures that any email address within a document is protected by default.
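Unconditional email redaction can be done with a single regex pass, as sketched below. The pattern is a common heuristic for email matching, not necessarily the one the project uses:

```python
import re

# Heuristic email pattern: local part, "@", domain with at least one dot
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def censor_emails(text):
    """Replace every email address with block characters of equal length."""
    return EMAIL_RE.sub(lambda m: "\u2588" * len(m.group()), text)

print(censor_emails("Contact jane.doe@example.com now."))
```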
The address flag also includes location-based entities (e.g., city names or landmarks) due to the potential for sensitive information leakage. This approach is intended to capture any mentions that could imply a physical address.
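In NER terms, this broad handling amounts to treating location-type entity labels (such as spaCy's `GPE`, `LOC`, and `FAC`) as address material. The label set below is an assumption about scope, and the entity tuples stand in for the output of an NER pass:

```python
# Entity labels treated as address material (an assumed scope, per the note above)
LOCATION_LABELS = {"GPE", "LOC", "FAC", "ADDRESS"}

def spans_to_censor(entities):
    """Select (start, end) spans whose label counts as a location.

    `entities` is a list of (start, end, label) tuples, e.g. derived from
    spaCy's doc.ents or Google NLP entity mentions.
    """
    return [(start, end) for start, end, label in entities if label in LOCATION_LABELS]

# Hypothetical NER output for "She moved to Paris near the Eiffel Tower."
ents = [(13, 18, "GPE"), (28, 40, "FAC")]
print(spans_to_censor(ents))  # [(13, 18), (28, 40)]
```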
- Partial Addresses: Censorship of incomplete addresses may result in false negatives.
- Embedded Names in Paths: Names embedded within other strings (e.g., `data/JohnDoe`) may not be recognized and censored.
- Location Censorship: City names or landmarks might be redacted even without explicit address context, to prevent under-censorship.
- The user has access to the Google Cloud Natural Language API for accurate entity recognition.
- Sensitive information is only censored based on the provided flags (with the exception of email addresses).
- The tool censors any mention of a location as part of the address flag.
Unit tests are located in the `tests/` directory. Each test file is designed to verify the censorship functionality of a specific entity type.
To run the tests:
pipenv run python -m pytest
- `test_censor_address.py`: Verifies address censorship.
- `test_censor_name.py`: Checks that names are properly censored.
- `test_censor_phone.py`: Ensures phone numbers are accurately redacted.
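A test in this style might look like the sketch below. To keep it self-contained here, a stand-in `censor_phones` is defined inline; the real test files instead import the project's own functions from `censoror.py`:

```python
import re

BLOCK = "\u2588"

def censor_phones(text):
    # Stand-in for the project's phone censor, so this example runs standalone.
    return re.sub(r"\d{3}[-.]\d{3}[-.]\d{4}", lambda m: BLOCK * len(m.group()), text)

def test_phone_is_redacted():
    censored = censor_phones("Reach me at 352-555-0100.")
    assert "352-555-0100" not in censored
    # Equal-length replacement keeps the text layout intact
    assert len(censored) == len("Reach me at 352-555-0100.")

test_phone_is_redacted()
print("ok")
```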
This project demonstrates the integration of various NLP techniques and APIs to create an effective and reliable data censorship tool. By combining regex, spaCy, and Google Cloud NLP, the Data Censoror offers a comprehensive solution to the challenge of sensitive data redaction.
- Input: Reads multiple text files using glob patterns.
- Censorship: Applies regex-based phone and date censorship, Google Cloud NLP-based name and address censorship, and spaCy for additional patterns.
- Output: Saves redacted copies of the text files and generates a summary of redacted terms.
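The three stages above can be sketched end to end as follows. The function name `run_pipeline`, the `.censored` output suffix, and the block-count statistic are illustrative assumptions rather than the project's exact behavior:

```python
import glob
import os

def run_pipeline(pattern, output_dir, censor):
    """Read files matching a glob, write redacted copies, return per-file stats."""
    os.makedirs(output_dir, exist_ok=True)
    stats = {}
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        censored = censor(text)  # any callable applying the enabled censors
        out_path = os.path.join(output_dir, os.path.basename(path) + ".censored")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(censored)
        # Count redacted (block) characters as a simple summary statistic
        stats[path] = censored.count("\u2588")
    return stats
```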