This repository contains the Speech Redaction Framework, a collection of components and code from Google Cloud that you can use to redact sensitive information from audio files.
The framework can:
- Process audio files uploaded to Cloud Storage.
- Write the DLP findings to Cloud Storage.
- Redact sensitive information from the audio files using the Cloud Data Loss Prevention (DLP) API.
Speech Redaction Framework Limitations:
- The framework can only process .wav or .flac files. This is a limitation of the framework code, not of the Cloud Speech-to-Text API.
The process is as follows:
- An audio file is uploaded to Cloud Storage.
- The Cloud Function is triggered on object.create.
- The Audio Process Cloud Function sends a long-running job request to Cloud Speech-to-Text.
- Speech-to-Text processes the audio file.
- The Cloud Function then sends the job ID from Cloud Speech-to-Text, along with additional metadata, to Cloud Pub/Sub (a sketch of this handoff appears after this list).
- The Cloud Dataflow job, which reads from the Pub/Sub topic, identifies sensitive information in the transcript using the DLP API and writes the findings to a JSON file on Cloud Storage.
- A second Cloud Function is triggered on object.create; it reads the findings JSON file, redacts the sensitive information from the audio file, and writes the redacted audio file to Cloud Storage.
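To make the Speech-to-Text/Pub/Sub handoff above concrete, here is a minimal sketch of what the audio-process step conceptually does. The framework's actual function is written in Node.js; the message fields, topic name, and recognition settings below are illustrative assumptions, not the framework's exact schema.

```
# Illustrative sketch only: start a long-running Speech-to-Text job for an
# uploaded file and publish its operation name plus file metadata to Pub/Sub.
# The message fields and topic name are hypothetical, not the framework's schema.
import json

from google.cloud import pubsub_v1, speech


def process_audio(project_id: str, bucket: str, file_name: str, topic: str) -> None:
    speech_client = speech.SpeechClient()
    operation = speech_client.long_running_recognize(
        config=speech.RecognitionConfig(
            language_code="en-US",
            # Word-level timestamps are what make redacting the audio possible later.
            enable_word_time_offsets=True,
        ),
        audio=speech.RecognitionAudio(uri=f"gs://{bucket}/{file_name}"),
    )

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic)
    message = {
        "job_name": operation.operation.name,  # ID of the long-running job
        "bucket": bucket,
        "file": file_name,
    }
    publisher.publish(topic_path, data=json.dumps(message).encode("utf-8")).result()
```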
Grant the necessary IAM roles either through the Google Cloud Console or by using the gcloud commands provided below.
1. Add the following roles to the Compute Engine default service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com):
- Eventarc Event Receiver role
- Cloud Speech-to-Text Service Agent
- DLP User
- DLP Administrator
- Pub/Sub Admin
- Pub/Sub Subscriber
- Pub/Sub Viewer
- Dataflow Worker
- Cloud Run Invoker
gcloud commands:
```
export PROJECT_ID="<PROJECT_ID>"
export GCE_SERVICE_ACCOUNT="$(gcloud projects describe ${PROJECT_ID} --format='value(projectNumber)')-compute@developer.gserviceaccount.com"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/eventarc.eventReceiver"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/speech.serviceAgent"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/dlp.user"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/dlp.admin"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/pubsub.admin"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/pubsub.subscriber"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/pubsub.viewer"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/dataflow.worker"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCE_SERVICE_ACCOUNT}" \
  --role="roles/run.invoker"
```

2. Add the Pub/Sub Publisher role to the Cloud Storage service account, so that Cloud Storage can publish the object notifications that trigger the Cloud Functions.

gcloud commands:
```
GCS_SERVICE_ACCOUNT="$(gcloud storage service-agent --project=${PROJECT_ID})"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${GCS_SERVICE_ACCOUNT}" \
  --role='roles/pubsub.publisher'
```

3. Create a Google Cloud Storage bucket for the Dataflow staging files, then through the Google Cloud Console create a folder named tmp in the newly created bucket.

```
gsutil mb -l <REGION> gs://<BUCKET_NAME>
```

4. Create a Google Cloud Storage bucket for the uploaded audio files.

⚠ NOTE: The bucket must be regional.

```
gsutil mb -l <REGION> gs://<BUCKET_NAME>
```

5. Create a Google Cloud Storage bucket for the DLP findings.

⚠ NOTE: The bucket must be regional.

```
gsutil mb -l <REGION> gs://<BUCKET_NAME>
```

6. Create a Google Cloud Storage bucket for the redacted audio files.

```
gsutil mb -l <REGION> gs://<BUCKET_NAME>
```

7. Create a Cloud Pub/Sub topic:

```
gcloud pubsub topics create [YOUR_TOPIC_NAME]
```

8. Enable the Cloud Dataflow API:

```
gcloud services enable dataflow
```

9. Enable the Cloud Speech-to-Text API:

```
gcloud services enable speech
```

10. Enable the Cloud Data Loss Prevention API:

```
gcloud services enable dlp.googleapis.com
```

In the cloned repo, go to the srf-audio-process-func directory and deploy the following Cloud Function.
⚠ NOTE: On line 29 of the index.js file, add the TOPIC_NAME you created in step 7.

⚠ NOTE: The region must be the same as the Uploaded Audio Files bucket.

```
gcloud functions deploy srfAudioProcessFunc --region=<REGION> --stage-bucket=[YOUR_UPLOADED_AUDIO_FILES_BUCKET_NAME] --runtime=nodejs20 --trigger-bucket=[YOUR_UPLOADED_AUDIO_FILES_BUCKET_NAME]
```

⚠ NOTE: If you run into any timeout issues with Cloud Functions, it is recommended to increase the timeout and, optionally, the Cloud Function resources.
In the cloned repo, go to the srf-redaction-func directory and deploy the following Cloud Function.
⚠ NOTE: Before deploying the redaction function, on line 19 of the index.js file, add your Redacted Audio Files bucket name.

```
gcloud functions deploy srfRedactionFunc --region=<REGION> --stage-bucket=[YOUR_UPLOADED_AUDIO_FILES_BUCKET_NAME] --runtime=nodejs20 --trigger-bucket=[YOUR_DLP_FINDINGS_BUCKET_NAME]
```

⚠ NOTE: For large audio files, it is recommended to increase the Cloud Function memory allocation.
In the cloned repo, go to the srf-longrun-job-dataflow directory and deploy the Cloud Dataflow pipeline. Run the commands below to deploy the Dataflow job:
```
# MacOS/Linux
# python3 --version Python 3.7.8
python3 -m venv env
source env/bin/activate
pip3 install apache-beam[gcp]
```

Please wait as it might take a few minutes to complete.
You can provide an existing [INSPECT_TEMPLATE_ID] if you already have a DLP inspection template created, or refer to the section [Optional: DLP inspection template creation] to create a new one.
```
python3 srflongrunjobdataflow.py --project=[YOUR_PROJECT_ID] --input_topic=projects/[YOUR_PROJECT_ID]/topics/[YOUR_TOPIC_NAME] --runner=DataflowRunner --temp_location=gs://[YOUR_DATAFLOW_STAGING_BUCKET]/tmp --output=[YOUR_DLP_FINDINGS_BUCKET] --region=[GOOGLE_CLOUD_REGION] --requirements_file="requirements.txt"
```

Once the steps above are completed, upload your audio files to the Uploaded Audio Files storage bucket. Once a file is processed you will find the DLP findings in the DLP Findings storage bucket and the redacted audio files in the Redacted Audio Files storage bucket.
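To sanity-check the pipeline end to end, you can upload a sample file and then inspect the output buckets, either in the Cloud Console or with a short script such as the sketch below. It uses the google-cloud-storage client; the bucket and file names are placeholders for the buckets you created above.

```
# Quick end-to-end check (sketch): upload a test audio file, then list the
# findings and redacted-audio buckets. Bucket and file names are placeholders.
from google.cloud import storage

client = storage.Client(project="[YOUR_PROJECT_ID]")

# Upload a test file to the Uploaded Audio Files bucket to trigger the pipeline.
upload_bucket = client.bucket("[YOUR_UPLOADED_AUDIO_FILES_BUCKET_NAME]")
upload_bucket.blob("sample-call.wav").upload_from_filename("sample-call.wav")

# After the pipeline has run, inspect the outputs.
for name in ["[YOUR_DLP_FINDINGS_BUCKET]", "[YOUR_REDACTED_AUDIO_FILES_BUCKET_NAME]"]:
    print(f"Objects in {name}:")
    for blob in client.list_blobs(name):
        print(" ", blob.name)
```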
To create a DLP inspection template, you can use the create_template.py Python script. The script authenticates with a service account key, so you need to either create a new service account or provide an existing service account's key in JSON format.
Before running the script, modify the inspect_template_config.json file to specify built-in infoTypes and custom infoTypes according to your business needs (an illustrative sketch follows).
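As an illustration of the kind of content such a config can hold, the snippet below builds a configuration that mirrors the DLP API's InspectConfig fields (built-in infoTypes plus a custom regex detector) and writes it to inspect_template_config.json. The exact schema expected by create_template.py may differ, so treat this as a sketch only; the infoType choices are examples, not recommendations.

```
# Sketch only: the exact schema create_template.py expects may differ.
# This mirrors the DLP API's InspectConfig fields and writes them to the
# config file referenced above. The infoTypes shown are examples.
import json

inspect_config = {
    "info_types": [                 # built-in detectors
        {"name": "PERSON_NAME"},
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
    ],
    "custom_info_types": [          # custom detector based on a regex
        {
            "info_type": {"name": "ACCOUNT_NUMBER"},  # hypothetical infoType
            "regex": {"pattern": r"ACC-\d{8}"},
        }
    ],
}

with open("inspect_template_config.json", "w") as f:
    json.dump(inspect_config, f, indent=2)
```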
The service account used by the script needs the following roles:
- DLP Administrator
- DLP Inspect Templates Editor
- DLP User
```
export PROJECT_ID="<PROJECT_ID>"
export SA_ID="<SA_ID>" # Choose a unique ID for your service account
export SA_EMAIL="${SA_ID}@${PROJECT_ID}.iam.gserviceaccount.com"
export KEY_FILE_PATH="${HOME}/${SA_ID}-key.json" # Path where the key file will be saved

# 1. Set Project
gcloud config set project ${PROJECT_ID}

# 2. Create Service Account
gcloud iam service-accounts create ${SA_ID} \
  --display-name="DLP operation SA" \
  --description="Service account for DLP operations"

# 3. Grant Roles
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dlp.admin"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dlp.inspectTemplatesEditor"

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dlp.user"
```

⚠ NOTE: The organization policy iam.disableServiceAccountKeyCreation must not be enforced at the project level for the key creation below to succeed.
Create and Download Key:
```
gcloud iam service-accounts keys create ${KEY_FILE_PATH} \
  --iam-account=${SA_EMAIL}
```

A JSON file containing the service account's private key will be saved to the path you specified in KEY_FILE_PATH.
⚠ Important: Store this file securely. Anyone with this file can access your Google Cloud resources as the service account.
Place the service account key JSON file and the inspect template JSON file in the same directory from which you will run the template creation script.
```
pip install google-cloud-dlp
python3 create_template.py [PROJECT_ID] [JSON_SA] [JSON_INSPECT_TEMPLATE]
```

This command outputs the template ID, which you will need to pass as a parameter when configuring the Dataflow job.
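For reference, here is a minimal sketch of what a template-creation script of this kind typically does with the google-cloud-dlp client: authenticate with the service account key, load the inspect config JSON, and create the template. The argument handling, display name, and config schema are assumptions; the actual create_template.py may differ.

```
# Minimal sketch of creating a DLP inspect template from a JSON config file.
# Argument order matches the command above; the config schema and display
# name are assumptions, not necessarily what create_template.py uses.
import json
import sys

from google.cloud import dlp_v2


def main(project_id: str, sa_key_path: str, config_path: str) -> None:
    client = dlp_v2.DlpServiceClient.from_service_account_json(sa_key_path)

    with open(config_path) as f:
        inspect_config = json.load(f)  # e.g. built-in and custom infoTypes

    template = client.create_inspect_template(
        request={
            "parent": f"projects/{project_id}/locations/global",
            "inspect_template": {
                "display_name": "srf-inspect-template",
                "inspect_config": inspect_config,
            },
        }
    )
    # The template ID is the last segment of the resource name.
    print("Template name:", template.name)
    print("Template ID:", template.name.split("/")[-1])


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```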
This is not an officially supported Google product