2022 Thesis
A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de
Barcelona
Universitat Politècnica de Catalunya
by
Blanca Ruiz Díaz
In partial fulfilment
of the requirements for the degree in
TELECOMMUNICATIONS TECHNOLOGIES AND SERVICES
ENGINEERING
Abstract
This thesis focuses on the study of Cloud Data Loss Prevention, a cloud-native DLP tool offered by Google Cloud Platform to help secure data. The objective of the project is to evaluate its behavior in different scenarios configured in a lab environment, where I can study how the service inspects information to locate sensitive data.
The capabilities of the tool and its key features must be analyzed to determine its scope, allowing me to design use cases that exemplify them and to propose improvements to the tool.
The behavior in the first use case, with documents, was positive, although the generated custom infotype is not capable of locating all the information because it has a generic pattern. In the second use case, the errors in the inspection job for images were significant, reflecting the lack of maturity of the tool for these files. The new de-identification method that pixelates the information works well, but it will need to be improved in future work.
Resum
Aquesta tesi està enfocada en l’estudi de Cloud Data Loss Prevention, una eina DLP
nativa de cloud que ofereix Google Cloud Platform per ajudar en la seguretat de les
dades. L’objectiu del projecte és avaluar-ne el comportament en diferents escenaris
configurats en un entorn de laboratori, on puc estudiar com el servei inspecciona la
informació per localitzar dades de caràcter sensible.
Cal analitzar les capacitats de l'eina i les seves característiques principals per conèixer-ne l'abast, de manera que em permeti dissenyar casos d'ús per exemplificar-les i proposar millores en el servei.
El comportament del primer cas d'ús amb documents ha estat positiu, malgrat que el tipus d'informació personalitzat que s'ha generat no és capaç de localitzar tota la informació en estar definit com a patró genèric. En el segon cas d'ús, els errors en la inspecció per a imatges han estat significatius, reflectint una manca de maduresa per part de l'eina amb aquest tipus d'arxius. El nou mètode de desidentificació per pixelar la informació funciona bé, tot i que caldrà millorar-lo en un treball futur.
Resumen
Esta tesis se centra en el estudio de Cloud Data Loss Prevention, una herramienta DLP
nativa en la nube que ofrece Google Cloud Platform para ayudar en la seguridad de los
datos. El objetivo del proyecto es evaluar su comportamiento en diferentes escenarios
configurados en un entorno de laboratorio, donde puedo estudiar la manera en la que
el servicio inspecciona la información para localizar datos sensibles.
El comportamiento del primer caso de uso con documentos fue positivo, aunque el
infotipo personalizado generado no es capaz de localizar toda la información al tener
un patrón genérico. En el segundo caso de uso, los errores en la inspección para
imágenes fueron significativos, lo que reflejaría la falta de madurez de la herramienta
para este tipo de archivos. El nuevo método de desidentificación para pixelar la
información funciona bien, pero será necesario mejorarlo en un trabajo futuro.
Acknowledgements
During these months of research and project development, I have had the opportunity to work with people who helped me understand the architecture of the cloud and its security, since I started this thesis without any prior knowledge about the cloud. Thanks to the support of my advisor, José Luis, and my coworker Álvaro, I have been able to carry out this project. I have learnt a lot from them during this time.
Also, I want to thank my friends and my family for their support. They have suffered and enjoyed this whole journey with me. I dedicate this thesis to all of them.
Finally, I want to thank my project supervisor, Israel Martín, for all the recommendations and the supervision given to successfully deliver this document, as well as his advice during the thesis. Thank you so much.
Revision history and approval record
Table of contents
Abstract ............................................................. 1
Resum ................................................................ 2
Resumen .............................................................. 3
Acknowledgements ..................................................... 4
Revision history and approval record ................................. 5
Table of contents .................................................... 6
List of Figures ...................................................... 8
List of Tables ....................................................... 9
1. Introduction ...................................................... 10
1.1. Project requirements and specifications ......................... 11
1.2. Work Plan and milestones ........................................ 12
1.3. Gantt Diagram ................................................... 14
2. Cloud Computing ................................................... 15
2.1. Introduction to Cloud Computing ................................. 15
2.2. Main service models in the cloud ................................ 16
2.3. Main deployment models in the cloud ............................. 17
2.4. Data Security ................................................... 19
3. State of the art of the technology used in this thesis ............ 20
3.1. Introduction to Data Loss Prevention ............................ 20
3.2. DLP products .................................................... 22
3.3. Approach to Cloud Native DLP solutions .......................... 24
3.4. Capabilities that cloud native DLP should/could provide ......... 25
3.5. Main Cloud Native DLP tools ..................................... 31
3.6. Comparison between the main Cloud Native DLP tools .............. 32
4. Methodology / project development ................................. 34
4.1. Key features of Cloud DLP ....................................... 34
4.2. Design of use cases ............................................. 39
4.3. Use case 1: Identification of sensitive data in documents ....... 41
4.4. Use case 2: Detection and de-identification of sensitive data in images ... 44
5. Results ........................................................... 46
5.1. Results of use case 1 ........................................... 46
5.2. Results of use case 2 ........................................... 51
6. Budget ............................................................ 54
7. Conclusions and future development ................................ 56
Bibliography ......................................................... 57
Appendices ........................................................... 59
Glossary ............................................................. 62
List of Figures
List of Tables
Table 4: Milestones .................................................. 14
1. Introduction
This thesis was carried out in the Cloud Security Department at Accenture S.L., a global consulting and professional services firm offering advisory, strategy, technology, and operations services.
This project aims to understand how information is protected in the cloud and how Data Loss Prevention (DLP) services work to prevent sensitive data leakage from organizations. It focuses on studying Google Cloud Data Loss Prevention, a cloud-native DLP service of Google Cloud Platform (GCP), as well as the capabilities it supports. The project exemplifies these capabilities by deploying a lab environment with different use cases designed to observe the behavior of the tool and its scope.
The main objectives of the project are the following:
1. Provide a comparison report on the main cloud native tools that address the
identification of sensitive data in the cloud.
2. List the capabilities of Google Cloud Data Loss Prevention to know its scope.
3. Design use cases to exemplify the capabilities of the tool and propose new
improvements.
4. Deploy a lab environment to conduct some use cases to assess the performance
of the service.
1.1. Project requirements and specifications
Project requirements:
• Create 2-3 use cases suitable for studying the capabilities of the service and its weak points.
• Provide new types of sensitive data that are not covered by the service.
Project specifications:
• The proposed use cases should be clear and reflect possible scenarios where
the company’s sensitive information must be protected.
• Build a demo with a logical data set to represent the operation of the service, as well as to clearly exemplify its capabilities.
1.2. Work Plan and Milestones
This section includes the Work Breakdown Structure of the project and the updated Work Packages, tasks, and milestones. I needed to modify the weeks dedicated to the work packages because the research part took me more time than I initially expected. Apart from that, the thesis has progressed correctly, with no incidents.
Short description: Research information about the cloud and its operation, how to protect the data in it, and native tools.
Planned start date: 01/03/2022
Planned end date: 24/03/2022
Start event: 01/03/2022
End event: 29/04/2022
Internal task T3: Comparison of the main native tools with DLP
Table 1: Work Package 1
Project: Designing use cases (WP ref: WP2)
Short description: Design 2-3 use cases in order to exemplify the capabilities of the Cloud DLP service.
Planned start date: 27/03/2022
Planned end date: 29/04/2022
Internal task T1: Analyzing the weak points of Cloud DLP
Internal task T2: Drafting the possible use cases and masking solutions

Short description: Create the demo with the new improvements.
Planned start date: 02/05/2022
Planned end date: 20/06/2022
Internal task T1: Creating the lab environment and configuring the use cases
Milestones
Table 4: Milestones
2. Cloud Computing
2.1. Introduction to Cloud Computing
The term Cloud refers to a space where we can store resources and consume services without the need to keep them on our electronic devices. This allows us to have our information in the cloud without consuming local resources, freeing up storage space and improving the performance of our machines.
The cloud is made up of a global network of remote servers located in linked data centers that process, share, or store data, so it can be consumed quickly anywhere, from any device with Internet access. This technology is known as Cloud Computing. [1]
Cloud Computing is responsible for providing computing resources on request
through the Internet. This concept has allowed the creation of new service models,
where users and companies can access applications or repositories, as well as build
their own infrastructures without the need to have them in a specific place. [2]
This technology has brought many benefits to the Information and Communications Technology (ICT) environment: [3]
▪ Cost reduction: The development of the Internet and new communication technologies has allowed companies to avoid physical Information Technology (IT) infrastructures, as well as the need to acquire software licenses, turning these purchases into subscriptions with cloud computing service providers such as Amazon Web Services (AWS), Microsoft Azure, or GCP.
▪ Data control and security: The cloud offers many advanced security features that help guarantee the security of stored data, such as role-based access management or monitoring of user activity.
▪ Scalability and flexibility: Companies can scale their IT resources up or down efficiently according to demand, without having to invest in powerful infrastructure.
▪ Unlimited capacity: Thanks to its quick scalability, the cloud has unlimited storage capacity to store any type of data in different storage services, avoiding the need to use local resources to host our information.
▪ Pay for what you use: Under this payment model, users only pay for what they consume. This avoids unnecessary payments for wasted resources, which represent significant losses for companies.
2.2. Main service models in the cloud
Cloud Computing is mainly offered in the following three service models:
▪ Software as a Service (SaaS)
There are several types of SaaS, such as email systems like Gmail, team collaboration tools such as Microsoft Teams, social networks like Twitter, and streaming platforms such as Twitch or Netflix. [4]
▪ Platform as a Service (PaaS)
Applications can be hosted and tested in a controlled environment, without the need to manage infrastructure maintenance, which facilitates development and reduces costs. Some examples of PaaS are Google App Engine and Heroku. [5]
▪ Infrastructure as a Service (IaaS)
Amazon Web Services, Microsoft Azure, and Google Cloud Platform are examples of Infrastructure as a Service. All of them allow users to work with virtual machines in the cloud and customize storage resources, CPU, and operating systems. [6]
Figure 3: Main service models in the cloud
2.3. Main deployment models in the cloud
▪ Shared cloud
It offers resources from one provider over infrastructure shared among multiple clients. Being a shared environment, it is less expensive and facilitates scalability, although it can cause performance problems. The shared cloud is useful where there are peaks in demand, because it allows elastic platforms that are resized depending on the needs of each moment, paying only for the resources used. [7]
▪ Private cloud
This deployment model is aimed at organizations that want an exclusive cloud environment where resources are not shared. It is made up of the organization's own infrastructure and machines, provisioned under the requested demand. Although its scalability and management are more limited, this model offers more security and control than the shared cloud, avoiding performance issues caused by third parties. Deploying a private cloud is usually quite expensive, so it is used for data and applications where performance and security must be ensured. [7]
▪ Hybrid cloud
It combines private and public cloud resources. This model is useful for organizations that have their own infrastructure but want to take advantage of the services provided by an external provider. Both clouds interact to provide agility, sharing workloads depending on the need and the cost it may involve. Hybrid cloud environments are effective not only in improving computing and data security, but also in saving costs and reducing dependencies on on-premises infrastructure. [7]
▪ Multi cloud
Also known as community cloud. It appears when an organization decides to
combine multiple clouds, either public or private. This strategy provides more
flexibility because it is possible to determine which cloud services are used,
reducing reliance on a single cloud provider. If it includes on-premises or
private cloud infrastructure, it is considered a hybrid multi cloud model. [8]
2.4. Data Security
When information is stored or shared in the cloud, the need to protect the data against possible attacks or vulnerabilities affecting its privacy becomes evident. Data is considered sensitive when it contains personally identifiable information (PII), which is strongly protected by the General Data Protection Regulation (GDPR). Compliance with this regulation is mandatory if organizations do not want to face economic sanctions, as well as serious consequences such as loss of reputation or even the bankruptcy of the company.
Protecting sensitive data in the cloud requires different approaches, with different security needs depending on the state of the data. Data attached to an email will not be protected in the same way as data stored in a repository.
The three possible states are the following: [11]
▪ Data at Rest: Information that is not being used or processed; it is simply stored in the cloud in devices and systems such as repositories, databases, servers, or computers.
▪ Data in Motion: Information that is traveling through the network, such as an email attachment or a file being uploaded to a repository.
▪ Data in Use: Information that is being actively accessed, processed, or modified by users or applications.
Sensitive information will be protected from being stolen or leaked if the company applies prevention actions appropriate to each of these states.
3. State of the art of the technology used in this thesis
3.1. Introduction to Data Loss Prevention
DLP is a solution focused on preventing data leaks within an organization, protecting
the information in its three possible states. It is defined as a set of tools which apply
content inspection techniques and contextual analysis to analyze user actions related
to data usage. This will prevent unauthorized information disclosure outside the
corporate network. [9]
DLP generally consists of three main elements: [10]
1. Identification: The sensitive information handled by the organization is located and classified, so that the tool knows which data must be protected.
2. Detection: Data activity is tracked to assess whether the actions performed by users are accepted by the organization and its security policies, protecting the data from anomalies.
3. Prevention: Actions are applied to the data according to the results obtained in the inspection. DLP policies determine the prevention method in each case. These include blocking actions such as editing or sharing files, acting against suspicious users by removing their access rights, or sending notifications and alerts to warn users of the violations they were about to commit. This ensures that information is not leaked or extracted outside the organization without authorization.
DLP POLICIES
In order to avoid data loss in the above situations, the state of the data must be known, as well as how to protect it. DLP techniques monitor data activity and evaluate whether the actions attempted in a particular scenario are accepted by predefined DLP policies within a specific context.
These policies are accompanied by a set of rules and conditions that determine whether the expected data behavior is occurring. They also contain actions to be applied in case of any anomaly, as well as notifications or alerts.
This prevention system makes it possible to obtain information for specific periods of time to assess the impact: you can discover processes where policies are being violated, review them, and locate the unprotected information within the organization to apply actions in time. [12]
▪ Context analysis: This method of analysis focuses on everything that does not include the content of the information itself. It examines the properties of the metadata, such as format, location, creation and modification dates, or size.
3.2. DLP products
DLP products are categorized in the market as dedicated (Enterprise DLP, or E-DLP) or integrated (I-DLP) solutions.
Implementing a DLP architecture can be complex because different DLP vendors offer different configurations and capabilities in their tools. Before deciding to have multiple DLP providers in an organization's environment, it is necessary to understand how to manage the possible issues, such as management overhead and lack of consistency in the obtained results, which affect accuracy and incident response. This problem happens because there is no way to integrate the different DLP products: they work independently and must be configured separately, as they have different approaches and needs. These errors increase unnecessary storage, degrade the quality of the analysis obtained, and lead to poor decisions. To avoid this, a single E-DLP provider could be used; if the organization decides to use different I-DLP providers instead, it will have various levels of data risk at different points in its security architecture.
Each DLP product may be focused on one or several types of DLP solutions. To know
which ones will be needed for data security, the scope of each DLP solution must be
known to understand the control they will provide and apply them depending on the
origin and destination of the data to protect. [14]
Endpoint DLP is the most powerful option for any DLP architecture. It provides data discovery and control for local storage, as well as control over removable media, data-via-browser, and data-via-email. The control capabilities include blocking or masking, and some vendors include classification. However, most endpoint DLP solutions come from E-DLP vendors and are fully integrated.
A Cloud Access Security Broker (CASB) is a multifunctional tool that provides security around cloud environments. It is an API-based solution able to discover and manage data security within the cloud environment, following a data-at-rest approach.
A Secure Web Gateway (SWG) redirects browser traffic to an inspection system before forwarding it to the intended destination, and a Secure Email Gateway (SEG) does the same with email traffic. Firewall DLP can be useful to address data leakage from unmanaged devices on the network, restricting data movement.
3.3. Approach to Cloud Native DLP solutions
The objective of this section is to define cloud-native DLP and list the capabilities that this type of solution should provide to control sensitive data in an organization's corporate network. I will also study the native DLP solutions offered by different cloud providers, such as GCP, Microsoft Azure, and AWS, to compare the capabilities they support. [14]
CLOUD-NATIVE DLP
Cloud-native DLP operates directly in the environment used by users and is designed to process unstructured data. This DLP control only protects data within the cloud space offered by the provider. Cloud providers are increasing the capabilities of cloud-native DLP, expanding them towards the endpoint. [14]
One of the disadvantages of cloud-native DLP solutions is the lack of consistency. Unlike CASB solutions, each cloud provider offers different features in its products, which makes it difficult to define the capabilities that this type of solution should have. [14]
3.4. Capabilities that Cloud Native DLP should/could provide
In this section, the capabilities that Cloud native DLP should provide are listed.
1. Data classification
The information in an organization's network must be identified so it can be categorized according to its level of confidentiality. With data classification, the DLP tool can take the appropriate actions based on the DLP policies configured for a specific category. Through contextual analysis and content inspection, the tool evaluates the information and assigns it a category that reflects the value of the data. In this way, the organization has control over the information and can protect it from security risks. Different inspection and analysis methods may be used to carry out this categorization: [15]
▪ Database fingerprinting
It is responsible for identifying sensitive data coming from databases. Fingerprinting analysis helps achieve a more accurate classification and reduces the number of false positives. [13]
▪ Keywords
This pattern allows the detection of sensitive information from keywords or phrases. They must be specified as a rule so that matches can be found. It is also possible to tag documents containing these keywords so they are protected. Some examples of keyword rules could be "confidential" or "internal use". [16]
▪ Regular expressions
They allow DLP to detect matches of complex sensitive information. This type of pattern is made up of a string of simple and special characters, where each of them has a meaning. The pattern language must be understood to correctly define the pattern and find the desired match. One type of information detected with a regular expression could be an email address, as in the sketch below. [13]
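As an illustration (a generic pattern of my own, not the exact expression used by any specific DLP product), a regular expression that matches most email addresses looks like this:

[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

The character classes describe the characters allowed in the local part and the domain, and {2,} requires a top-level domain of at least two letters.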
▪ Data labeling
It uses labels to classify data based on its type or its sensitivity level, in order to simplify the analysis process. It allows us to know the value of the information and take actions to avoid risks. By using classification, the organization gains more security over the data, as it is categorized to increase control and protect it if an anomaly is detected. [18]
In general, the sensitivity level of the data can be labeled using the following labels:
▪ HIGH SENSITIVITY
This labeling includes two types of data:
o Critical data: Information that is essential to the company's operation and whose loss or theft would have critical consequences.
o Confidential data: Information that can be a risk to the company's operation if the data is lost or stolen, but which is not considered critical. Some examples would be customer information or worker wages.
▪ MEDIUM SENSITIVITY
Information that is not confidential but is not publicly available either. It resides in the corporate network for internal use only. This data would not pose a danger to the integrity of the organization if it were stolen. Marketing strategies, as well as emails or documents without confidential or critical information, could be categorized as internal data.
▪ LOW SENSITIVITY
Unrestricted information is categorized under this label: data for public use that is accessible to everyone. This is the case of content found on a public web page, such as descriptions of an organization or a product, as well as the addresses of a company, among others.
Some cloud-native DLP tools allow labeling the risk level of the data as high, moderate, or low. In addition, they can include labels on users to determine whether they have access to that information, as well as what actions they can take on it: read only, permission to modify, print, or forward, among others.
▪ File type
This context analysis method can identify a document by its type. Some cloud providers allow adding rules during the inspection to specify which types of files should be inspected and which will be ignored. [19]
2. Data Discovery
It is crucial for companies to have control over their data in order to comply with today's strict security regulations, and to intervene quickly to protect themselves from possible leaks and handle them before they occur. This prevention technique is responsible for performing deep inspection of large amounts of data to simplify the analysis and detection of sensitive information hosted on the corporate network.
It provides a broader and more accurate view by exploring multiple data sources and identifying previously unknown relationships between them, with an optimizing effect. In this way, it is possible to discover and locate important data that had not been identified before, link data sources that were dispersed, and identify hidden information patterns that must be integrated and evaluated. All this improves the decisions that the organization makes about the data and facilitates its integrity and confidentiality thanks to the prompt response it offers.
Data discovery can analyze files in different formats: text files, Microsoft documents (PowerPoint, Word, Excel), images, or compressed files. [20]
3. Data Egress
In the activity of a network, data egresses to external locations. This process must be managed to prevent information leakage if the data reaches a destination not authorized by the organization. Some of the data output channels in the cloud are email and external repositories. Thanks to the prevention techniques of data discovery and data classification, protection measures can be applied in the cloud environment to ensure that sensitive data is not exposed. [21]
The supported formats for inspecting images are PDF, JPEG/JPG, PNG, and TIFF, which can be stored in repositories or attached to emails. Depending on the capabilities of the cloud provider, image formats such as GIF or BMP can also be analyzed. [22]
5. Detailed reporting
When the inspection job is finished, a report appears as a summary with all the relevant information. If sensitive data is detected during the scan, a finding is created. This report includes the number of data identifiers found, as well as the total bytes analyzed and the total number of findings.
Detailed reports are possible if the sensitive data is sent to other native cloud services, such as databases, monitoring tools, or security control services, which can show the actual sensitive data found in plain text, along with other features such as the type of data identifier, its source, and its location. [23]
6. Monitoring
Monitoring the network activity allows tracking the data flow and controlling its status. If any anomaly is detected, it can be notified. This capability is not usually integrated in the DLP tool itself, but it is possible to send the results of the inspection to a native monitoring tool. [24]
7. User activity
This capability monitors and tracks end-user behavior. It helps detect and stop insider threats, whether unintentional or intentional. Thanks to this, organizations can protect sensitive data while ensuring compliance with data privacy and security regulations. [24]
8. Integration
Cloud-native DLP tools can integrate their service with other native tools from their providers, such as repositories, databases, or email services. This facilitates inspection and the communication with other native tools to implement new services related to process optimization. [25]
9. Leakage handling
Through inspection techniques, DLP systems can recognize which data is sensitive and track it to check whether the security policies established by the organization are being complied with. If an incident occurs with that information, DLP will try to handle it to prevent data leakage or theft.
DLP systems can anticipate leaks of sensitive data with an adequate prevention approach, stopping them before they occur through specific techniques: [26]
▪ Data encryption: Encryption can be used to protect sensitive information by making it incomprehensible unless the secret encryption key is known.
They may also have leak management capabilities once an incident has been detected. Different actions can be taken to remedy the loss of this data: [26]
▪ Notification actions do not interrupt the data flow; they send a warning to the users about the policy violation they have committed. In this way, employees are educated by being given the opportunity to reverse the action.
▪ Audit actions are the least invasive technique because they only leave a log of the incident that occurred.
3.5. Main Cloud Native DLP tools
The cloud native DLP tools of the main cloud providers are presented below:
Google Cloud Data Loss Prevention
The service can de-identify data and optionally re-identify it, applying redaction and masking solutions to hide information thanks to OCR techniques and data identifiers. It has many capabilities which are accessible through APIs, requiring the user to build a DLP solution from the low-level coding blocks available. [25]
Azure Information Protection
It is useful because it allows persistent protection: the label travels with the data, regardless of where it is stored, sent, or shared. Azure can prevent data leaks thanks to its data tracking, which allows us to monitor protected documents and see who is accessing them and when. If security issues are found, AIP has prevention capabilities to limit actions, such as file expiration dates or access revocation. [27]
Amazon Macie
Amazon Web Services offers Amazon Macie as its data discovery tool for data loss prevention. It can detect personal data within native buckets thanks to pattern matching, and protect it in the AWS environment.
It automates evaluations, monitors data security, and preserves data privacy. Macie creates detailed reports with findings to review and remediate the potential issues detected. It also integrates with other native tools to submit incidents, but does not directly provide any blocking or prevention capability. [28]
3.6. Comparison between the main Cloud Native DLP tools

Capability                              Cloud DLP (GCP)   AIP (Azure)   Macie (AWS)
DATA DISCOVERY – Compression archives   No                No            Yes
DATA DISCOVERY – Images                 Yes               Yes           No
DATA DISCOVERY – Video                  No                No            No
Access control                          Custom            Yes           Yes
Redacting                               Yes               No            No
4. Methodology / project development
Once the DLP services provided by the main cloud providers have been defined, as well as the basic and advanced capabilities their tools should support, Cloud DLP is verified to be the most complete option, since it directly provides leakage-handling methods such as masking, together with OCR capabilities to inspect and redact images.
This section explores all the features of the native Google Cloud tool to give a general technical description that allows understanding its operation and configuration. It will also identify weak points in the capabilities that the service supports. The goal is to study whether Cloud DLP actually performs as its documentation claims, or whether its capabilities are more limited.
4.1. Key features of Cloud DLP
▪ Google Datastore: A NoSQL database used in web and mobile applications, on which the service can run data queries. It is necessary to introduce the project ID and the type of data you want to analyze in the inspection. [32]
DATA CLASSIFICATION
In order to protect the sensitive data stored on the corporate network, Cloud DLP performs content inspections to classify the data and find out its location. This allows us to have the information identified, as well as its type and how it is being used. The user can enable automatic data classification in their GCP repositories. The service can scan structured and unstructured information in text files, images, Microsoft documents, PDFs, and binary data. [33]
USING INFOTYPES
Cloud DLP can inspect our information if we provide a list of the types of sensitive data we want to locate. It uses infotypes to define a type of data to be located in the corporate network. The service recognizes them thanks to infotype detectors, which are configured to flag the desired information when it matches their detection criteria.
By default, DLP has 150 different infotype detectors built in and ready to use, such as social security numbers, emails, phone numbers, or person names. If you want to identify an email address in a table, the EMAIL_ADDRESS infotype detector is defined to recognize this kind of sensitive data.
If none of the default infotype detectors fits our use case, Cloud DLP allows you to add custom infotype detectors. You can implement new types of sensitive data and configure the detection behavior, such as a regular expression pattern or a list of keywords or phrases that must match the information analyzed. [34]
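As a minimal sketch of what this looks like through the Node.js client (the custom infotype name and its pattern below are illustrative placeholders, not the detectors used later in the lab):

// Inspection configuration combining built-in and custom infoType detectors.
const inspectConfig = {
  infoTypes: [{name: 'EMAIL_ADDRESS'}, {name: 'PERSON_NAME'}], // built-in detectors
  customInfoTypes: [{
    infoType: {name: 'EMPLOYEE_ID'},   // hypothetical custom infotype
    regex: {pattern: 'EMP-\\d{6}'},    // matching criterion: "EMP-" plus 6 digits
    likelihood: 'LIKELY',              // confidence assigned to its matches
  }],
};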
MATCH LIKELIHOOD
Scan results are categorized based on how likely they are to represent a match. Cloud DLP uses match likelihood to indicate how likely it is that a piece of data matches a given infotype. There are five possible values for likelihood: VERY_LIKELY, LIKELY, POSSIBLE, UNLIKELY, and VERY_UNLIKELY, arranged from highest to lowest probability of a match. [35]
When you want to start an inspection, it is recommended to set in the request the minimum likelihood of the findings you want to retrieve. To prove this, I used the Google Data Loss Prevention Demo to check this characteristic and see how likelihood works when I introduce the same information in different contexts. The infotype PHONE_NUMBER appears as a finding, but with a different likelihood in each case. [36]
The likelihood affects the number of matching findings returned in the response. If your inspection is configured with likelihood LIKELY, the response will only contain findings with likelihood LIKELY and VERY_LIKELY. If you set the minimum likelihood to VERY_UNLIKELY, all findings will appear in the response, although in some cases this is not a reliable option, as it introduces false positives.
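As a brief sketch of how this threshold is expressed in a request (assuming an initialized Node.js DlpServiceClient; the sample text is invented):

const request = {
  parent: `projects/${projectId}/locations/global`,
  item: {value: 'Call me at (415) 555-0100'},   // sample input text
  inspectConfig: {
    infoTypes: [{name: 'PHONE_NUMBER'}],
    minLikelihood: 'LIKELY',  // findings below this likelihood are dropped
  },
};
const [response] = await dlp.inspectContent(request);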
DATA DE-IDENTIFICATION
Once Cloud DLP performs the information scan and the results of the inspection are obtained, DLP can take preventive actions by de-identifying sensitive data in text content. This process removes identifying information from the data by applying a de-identification transformation to redact, mask, tokenize, or transform text and images, guaranteeing data privacy in storage and tables. Some of these transformations are one-way and cannot be reversed once the action is done, such as masking or redaction. Others, like encryption transformations, allow re-identification. [37]
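An illustrative sketch of such a transformation through the Node.js client, applying a character mask (the input string is invented):

const [response] = await dlp.deidentifyContent({
  parent: `projects/${projectId}/locations/global`,
  item: {value: 'Contact: jane.doe@example.com'},
  inspectConfig: {infoTypes: [{name: 'EMAIL_ADDRESS'}]},
  deidentifyConfig: {
    infoTypeTransformations: {
      transformations: [{
        primitiveTransformation: {
          // One-way masking: the original value cannot be recovered.
          characterMaskConfig: {maskingCharacter: '#'},
        },
      }],
    },
  },
});
console.log(response.item.value); // "Contact: ####################"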
TEMPLATES CONFIGURATION
Templates can be used to create and maintain configuration information for reuse in Cloud DLP. This feature is useful to speed up the configuration of the inspection process and to define the de-identification transformations to apply to sensitive data.
The service supports two types of templates: [38]
▪ Inspection templates: They contain configuration information related to data analysis and data classification. It is mandatory to include the infotypes to scan for, and it is recommended to specify the likelihood in the confidence threshold section.
▪ De-identification templates: They contain the configuration of the de-identification transformations to apply to the findings.
Once a template is created, it appears in the configuration section ready for use.
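A minimal sketch of creating an inspection template programmatically (Node.js client; the template ID and display name are placeholders):

const [template] = await dlp.createInspectTemplate({
  parent: `projects/${projectId}/locations/global`,
  templateId: 'reusable-inspect-template',   // placeholder ID
  inspectTemplate: {
    displayName: 'Reusable inspection settings',
    inspectConfig: {
      infoTypes: [{name: 'EMAIL_ADDRESS'}, {name: 'PHONE_NUMBER'}],
      minLikelihood: 'LIKELY',  // recommended confidence threshold
    },
  },
});
console.log(`Created ${template.name}`);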
INSPECTION JOBS
To trigger content inspection in storage systems and databases, an inspection job must be configured in Cloud DLP.
To create an inspection job, it is necessary to fill in the following fields: [39]
▪ Input data configuration: The name of the inspection job, as well as its ID, which must be unique, are specified first. The resource location where the job will be hosted should be defined, together with the path or table ID of the input data to inspect. In addition, Cloud DLP allows sampling, so we can optionally scan only part of the information, depending on the needs of the organization and the volume of data stored.
▪ Add actions: Cloud DLP offers different actions that will be performed once the inspection job is finished. You can decide where you want to save the findings produced during the analysis. The most common option is BigQuery notifications, where the sensitive data and its characteristics are sent to a specific table. You can also enable the email option, which notifies you when the inspection is finished, or publish the results to other native tools such as Cloud Monitoring.
▪ Scheduling: We can schedule the inspection job at a certain time interval, or turn it into a job trigger, which runs periodically. Another option is to run it immediately, once, without applying any scheduling.
All this configuration is summarized in a JSON representation, which is sent to the Cloud DLP API through the content.inspect method. We could also do the configuration programmatically instead of in the tool, which allows interacting with the Cloud API more deeply and accessing other types of capabilities and methods that cannot be configured from the tool, such as de-identification techniques. [39]
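As a hedged sketch (field names follow the public DLP API; the bucket, project, and dataset identifiers are placeholders), such a JSON job configuration could look roughly as follows:

{
  "inspectJob": {
    "storageConfig": {
      "cloudStorageOptions": {"fileSet": {"url": "gs://my-bucket/**"}}
    },
    "inspectConfig": {
      "infoTypes": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}],
      "minLikelihood": "LIKELY"
    },
    "actions": [{
      "saveFindings": {
        "outputConfig": {
          "table": {"projectId": "my-project", "datasetId": "dlp_results", "tableId": "findings"}
        }
      }
    }]
  }
}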
4.2. Design of use cases
3. Design the scenario of the use case and propose custom configurations or new improvements, and how to implement them.
4. Generate the inspection job, which will contain the rules, conditions, and actions adjusted to each use case to identify and control sensitive information. Save the results to see the detailed report in another native tool accepted by Cloud DLP.
5. Apply a de-identification job in order to protect and hide the findings, as well as take actions to handle the leakage.
The infrastructure used to represent the use cases will be Google Cloud Platform, where all its native tools are available in the cloud, offering IaaS, PaaS, and SaaS services. In this environment, users can test and develop their applications, as well as store their data in storage solutions and configure security and network management. [40]
Having described the main features of the Google Cloud DLP solution and listed the capabilities it can support, I am able to understand the needs that the service must cover for organizations and users. The following sections show the design of two use cases that exemplify the capabilities of the tool and how it responds in different scenarios.
4.3. Use case 1: Identification of sensitive data in documents
Scenario
This use case studies how Cloud DLP responds when it inspects a repository containing unstructured information. For this, a bucket is created in Cloud Storage, storing PDF documents with sensitive content. Data analysis is done by launching an inspection job configured to classify and identify the infotypes defined in it, as well as to indicate where the findings must be sent. A new custom infotype will be created to observe the behavior of this capability. The job will generate a report indicating the status of the inspection and the number of findings obtained for each infotype.
The detailed report will be available in a BigQuery table, which will show all the findings with their inspection results and their characteristics, such as their location and their likelihood. A de-identification template will be created to mask the sensitive data found, sent to the API through the content.deidentify method.
For this, I will use the Google Data Fusion tool, which allows me to create a pipeline to transform the data and save it in another native storage tool. In this case, I will store the masked results in CSV format in a bucket in Cloud Storage called redact_bucket_results.
Implementation
A Cloud Storage bucket called testing_bucket_tfg is configured in the laboratory, storing 15 different bills in PDF format with sensitive information:
Figure 14: testing_bucket_tfg bucket
In Cloud DLP, the inspection job must be configured. The job trigger-inspect-bills must be capable of analyzing the input data shown in Figure 14 with no sampling. The infotype detectors PERSON_NAME, SPAIN_NIF_NUMBER, PHONE_NUMBER, STREET_ADDRESS, and EMAIL_ADDRESS will be integrated in the job, but there is no infotype in the tool capable of detecting amounts. So, it is necessary to create a new custom infotype to identify this data in the bills.
This custom infotype will be called GENERIC_AMOUNT and is defined with a regular expression. To define it, I have considered that negative values could appear in the bills, as well as decimals and high values. Any value with the '€' symbol will be treated as an amount. That is why I have defined the minimum probability of the pattern as LIKELY.
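The exact expression used in the lab is not reproduced here; a pattern along these lines (my assumption, consistent with the description: optional minus sign, thousands separators, decimals, and a mandatory '€' symbol) would behave as described:

const customInfoTypes = [{
  infoType: {name: 'GENERIC_AMOUNT'},
  // Illustrative pattern: optional sign, digits with optional thousands
  // separators and decimals, followed by the euro symbol.
  regex: {pattern: '-?\\d{1,3}(?:[.,]\\d{3})*(?:[.,]\\d{1,2})?\\s?€'},
  likelihood: 'LIKELY',
}];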
Once the custom infotype is declared and the infotype detectors are included in the inspection, I set the probability threshold to LIKELY to find the maximum number of results with the lowest possible rate of false positives. In BigQuery, I have created a table called results_1 where all the results found in the inspection will be saved. I have had to specify it in the inspection to get details about the location and likelihood of each particular result; otherwise, I would only have general statistics about the number of infotypes found, without the concrete values.
This table will be the input data of a pipeline created in Cloud Data Fusion, called Masking_Pipeline_1. Cloud DLP will take this data and transform the infotypes, masking them with the '#' symbol as defined in the de-identification job and the transformation template. Then, the process will save the masked results in the redact_bucket_results bucket, where they can be consulted in CSV format. [41]
4.4. Use case 2: Detection and de-identification of sensitive data in images
Scenario
This use case observes the behavior of Cloud DLP when it inspects and de-identifies an image containing sensitive information.
For it, I have decided to analyze a personal DNI and trigger an inspection job to identify the critical data it contains. Once located, I will protect the user by de-identifying the information found, as well as any other information that could put their privacy at risk.
Implementation
The personal DNI will be stored in a local path in JPG format as DNI.jpg. To inspect it, I configure the inspection request in a Node.js script called Image.js. In the script, I include all the parameters needed to run the API call and import the required Google Cloud and Node.js libraries. As the input data is an image, Cloud DLP can inspect it only if it is sent in base64 encoding, so it must be converted before calling the content.inspect method.
I also define the infotypes to detect: FIRST_NAME, LAST_NAME, DATE_OF_BIRTH, SPAIN_DNI_NUMBER, GENDER, DATE, STREET_ADDRESS, and GENERIC_ID. The minimum likelihood will be VERY_UNLIKELY to get the maximum number of results.
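A condensed sketch of such a script (assuming the @google-cloud/dlp Node.js client; the project ID is a placeholder, and the file name follows the description above):

const fs = require('fs');
const DLP = require('@google-cloud/dlp');
const dlp = new DLP.DlpServiceClient();

const projectId = 'my-project'; // placeholder
const infoTypes = [
  {name: 'FIRST_NAME'}, {name: 'LAST_NAME'}, {name: 'DATE_OF_BIRTH'},
  {name: 'SPAIN_DNI_NUMBER'}, {name: 'GENDER'}, {name: 'DATE'},
  {name: 'STREET_ADDRESS'}, {name: 'GENERIC_ID'},
];

async function inspectImage(filepath) {
  // Encode the DNI image in base64, as required by the API for images.
  const fileBytes = fs.readFileSync(filepath).toString('base64');
  const [response] = await dlp.inspectContent({
    parent: `projects/${projectId}/locations/global`,
    inspectConfig: {
      infoTypes,
      minLikelihood: 'VERY_UNLIKELY', // return every possible match
      includeQuote: true,
    },
    item: {byteItem: {type: 'IMAGE_JPEG', data: fileBytes}},
  });
  // Print each finding with its infotype and match likelihood.
  for (const f of response.result.findings) {
    console.log(`${f.infoType.name} (${f.likelihood}): ${f.quote}`);
  }
}

inspectImage('DNI.jpg');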
Once I get the inspection results in the console, I extend the code generated for the inspection to perform a de-identification job that transforms the information found. Cloud DLP provides the image.redact method, which can cover the findings with a black box to mask them. [42]
This transformation process is not reversible. For this reason, the masked image is saved to an output path declared in the code. In this use case, the result is stored in the same path as the input data, as result.jpg.
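A sketch of that extension using the redactImage call of the Node.js client (it reuses the client, projectId, infoTypes, and base64 fileBytes from the inspection sketch above; every configured infotype is redacted):

const imageRedactionConfigs = infoTypes.map(t => ({infoType: t}));
const [response] = await dlp.redactImage({
  parent: `projects/${projectId}/locations/global`,
  byteItem: {type: 'IMAGE_JPEG', data: fileBytes},
  inspectConfig: {infoTypes, minLikelihood: 'VERY_UNLIKELY'},
  imageRedactionConfigs, // draw a black box over each finding
});
// The response carries the redacted image bytes, written to the output path.
fs.writeFileSync('result.jpg', response.redactedImage);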
Redacting the image is useful to hide the sensitive information, but the result is visually crude. A new masking solution could be proposed to make the result cleaner and more attractive than the existing one. The intention is to implement a transformation process that pixelates the sensitive information found, achieving a less aggressive effect on the image. For this, Node.js has the Jimp library for image processing, which allows us to manipulate an image through its built-in methods.
The pixelate() function is a built-in Jimp function that applies a pixelation effect over an image or a region of it. Its syntax takes the pixel size, defined as a constant, and the bounding box values containing the coordinates where the sensitive data resides. In this way, it is possible to apply the pixelation to the region passed as a parameter and achieve this new way of transforming the data.
When Cloud DLP inspects an image, the coordinates where sensitive data was located are saved in declared variables.
Once the bounding boxes are saved, the pixelation function is applied to the region with pixels of size 7, which I have considered adequate to de-identify the information and achieve the desired clean transformation, as in the sketch below.
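A sketch of this step (the bounding box values here are hypothetical; in the real flow they come from finding.location.contentLocations[0].imageLocation.boundingBoxes in the inspection response):

const Jimp = require('jimp');

const PIXEL_SIZE = 7; // pixel size chosen for the de-identification

async function pixelateRegion(inputPath, outputPath, box) {
  const image = await Jimp.read(inputPath);
  // Apply the pixelation effect only inside the finding's bounding box.
  image.pixelate(PIXEL_SIZE, box.left, box.top, box.width, box.height);
  await image.writeAsync(outputPath);
}

// Hypothetical coordinates for one finding on the DNI image:
pixelateRegion('DNI.jpg', 'result_pixelated.jpg', {left: 40, top: 60, width: 180, height: 30});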
The complete code for this implementation, as well as the inspection job configuration, is attached in the appendices section.
5. Results
In this section, I include the results obtained in the use cases implemented above.
5.1. Results of use case 1
With the information provided by the report, the percentage of each infotype can be calculated to analyze its impact on the inspection. According to Cloud DLP, the custom GENERIC_AMOUNT infotype configured in the tool is the one found most often in the bucket, representing slightly more than half of the findings of the configured job.
To draw conclusions about the behavior of the service when inspecting the bucket, it is necessary to analyze the results saved in the BigQuery table, which contains the detailed analysis report.
As the table contains relevant information on each of the located infotypes, such as how many findings of each infotype appear in each bill and their match likelihood, I can determine how many false positives were detected in the inspection.
In the following table, I include all the false positives found in the detailed report, separated by infotype detector, together with the total findings of each type:

Infotype detector     Total findings    False positives
EMAIL_ADDRESS         19                0
GENERIC_AMOUNT        345               0
PERSON_NAME           105               51
PHONE_NUMBER          36                0
STREET_ADDRESS        98                9
SPAIN_NIF_NUMBER      20                0
TOTAL                 623               60
I have found 60 false positives in the inspection results. This represents 9.63% of the total findings, which is high for an inspection that only includes results with LIKELY and VERY_LIKELY likelihood. Only the infotypes PERSON_NAME and STREET_ADDRESS have false positives in their detections. Looking at these errors, the service accepts odd character strings as matches for the pattern defined in the PERSON_NAME infotype detector.
This problem is caused by inspecting unstructured data. Data identifiers can generate false positives and be less accurate during detection when used on this type of data; they are more effective on structured data, which is easier to categorize.
In the case of STREET_ADDRESS, there are errors caused by the inaccurately declared pattern.
The following graph represents the number of false positives found in both infotypes, broken down by likelihood.
In addition to studying the false positives detected as findings during the inspection, it is necessary to analyze the data that was not detected by the configured job in order to understand the behavior of the service.
Knowing the sensitive information contained in the bills, as well as the detailed report shown in the results_1 table, I have been able to locate the information that was not detected by Cloud DLP. I summarize this analysis in the following table, which specifies the percentage of unidentified data relative to the total of each infotype:
Infotype detector     Number of unidentified data    Percentage of the total
EMAIL_ADDRESS         2                              9.52%
GENERIC_AMOUNT        90                             20.69%
PERSON_NAME           0                              0%
PHONE_NUMBER          13                             26.53%
STREET_ADDRESS        7                              7.29%
SPAIN_NIF_NUMBER      0                              0%
I detected 112 unidentified data items in the inspection job. Having set the matching probability to LIKELY, it is logical that unidentified data appear when they do not meet this threshold rule.
In the case of PHONE_NUMBER, the 13 results are public and customer service numbers. They do not represent a threat, since they do not reveal private information about the client or the company. The same applies to the undetected EMAIL_ADDRESS and STREET_ADDRESS data, so they will not be considered errors in the conclusions about the service behavior. They are not sensitive data, so their match with the detector is weaker than for identified personal data.
The created custom infotype GENERIC_AMOUNT has worked correctly since it has
been able to identify a large part of the amounts defined in the bills. However, it was
unable to discover 90 of them, leaving them unclassified.
This is due to the regular expression pattern defined. As entered in the configuration,
Cloud DLP can identify an amount if it contains the ‘€’ symbol. If not, it will not
consider that this data corresponds to the infotype created, so it will not be added as
a GENERIC_AMOUNT finding.
Here is an example from a fictitious bill that I created for the analysis. In it, you can see that there are two ways to represent an amount: one uses the € symbol to define it, and the other appears with no symbol, only as a value.
That is why the inspection job was not able to locate them. It is very difficult for DLP tools to interpret this type of data, which requires very complex patterns. Depending on how the information is presented, data identifiers may be more or less accurate during inspection.
The custom infotype GENERIC_AMOUNT was created without considering the type of document to inspect in this use case. The intention was to find a generic way to define an amount, to be used by default in the service as a new built-in identifier. From that perspective, the most logical way to find an amount is next to its currency identifier.
If the infotype had been implemented for this particular use case, the regular expression would not have been defined in this way. If it is known that the bucket only contains bills, the custom infotype would be defined without the symbol restriction, making sure that it captures all the amounts. Although the number of false positives would grow, since all numeric values would enter the results, those corresponding to the amounts would be covered and identified.
In conclusion, the behavior of the service in this use case is positive. Although it introduces false positives in the results and the amount of unidentified data is high, that data is not considered critical. In the case of the custom infotype, the unidentified data is significant: although it is confidential information that should be protected, it does not pose a danger to the company if it is leaked. It should be targeted with a change in the defined regular expression and with the help of keywords or exclusion rules that allow Cloud DLP to identify these amounts, giving it a custom approach.
5.2. Results of use case 2
The results are printed in the console, showing the value of each finding, its infotype, and the match likelihood. In this way, the input data and the findings of the inspection can be analyzed to conclude whether the behavior of the service has been adequate.
Checking the obtained results, it is possible to find out whether any false positives were detected for the defined infotypes. This is summarized in the following table:
Infotype detector     Total findings    False positives
FIRST_NAME            1                 1
LAST_NAME             1                 0
LOCATION              3                 0
SPAIN_DNI_NUMBER      1                 0
GENDER                0                 0
DATE_OF_BIRTH         1                 0
DATE                  2                 0
STREET_ADDRESS        1                 1
GENERIC_ID            8                 2
TOTAL                 18                4
There are 4 false positives detected during the analysis, representing 22.22% of the total findings. This value is very significant, but it is expected for an analysis that accepts findings down to VERY_UNLIKELY likelihood.
Figure 25: First name false positive
Figure 26: Generic ID false positive
Infotype detector     Number of unidentified data    Percentage of the total
FIRST_NAME            1                              100%
LAST_NAME             1                              50%
LOCATION              0                              0%
SPAIN_DNI_NUMBER      0                              0%
GENDER                1                              100%
DATE_OF_BIRTH         0                              0%
DATE                  0                              0%
GENERIC_ID            0                              0%
TOTAL                 3                              17.64%
There are three sensitive data items that the service could not detect. The FIRST_NAME defined in the image is not located by the inspection, which confused it with the surname, as discussed in the previous analysis. In the case of GENDER, the pattern definition does not allow the service to recognize that the letter 'F' on the DNI means woman.
As it stands, the inspection has too many errors to be usable on images. Although Cloud DLP's additional OCR capability has great potential, it is still quite immature for real cases. The results are not reliable and consistent enough to ensure correct behavior on images. The image inspection needs to improve its detection capabilities; otherwise, its correct behavior cannot be guaranteed. If a company decides to use the Cloud DLP API to detect sensitive data in images, it should be aware of this limitation of the tool. Using it without review could cause serious problems for the organization, where the loss of this data would involve critical consequences such as large economic sanctions for revealing users' sensitive information.
The result obtained once the pixelate function has finished is shown in Figure 27. With the generated code, only the last finding returned by the inspection appears masked.
I have not been able to apply the pixelation to the rest of the findings, since the implementation of that part of the code requires more research than it seemed, and I did not have enough time to include it in the thesis.
Despite this, the result shows that the proposed pixelation masking works correctly and could be implemented in the service once the method is sufficiently mature.
6. Budget
This section analyzes all the costs of the project during its 16 weeks of work.
As the project is essentially software, no hardware components were needed.
The following table shows an estimate of the overall salary of the team. I have included the hours worked per week, as well as the cost per hour, obtaining the gross salary. On top of this, I have considered the 30% Social Security (SS) contribution paid by the company to obtain the total cost of all the workers.
Once the calculation is made, I obtained the total depreciation of the material used.
Regarding the software resources, the total cost of the lab environment, which includes all the tools used in the use cases and the functionalities enabled during the implementation, is the following:
To sum up, the final budget for the project is the following:

Costs               Total
Worker salaries     25.064 €
Equipment           4.800 €
Amortization        4.080 €
Software            5.200 €
TOTAL COST          39.144 €
7. Conclusions and future development
In conclusion, the thesis has shown the importance of detecting sensitive
information in the cloud, as well as de-identifying it to prevent possible attacks or
unintentional leaks.
The comparison of the main cloud-native tools shows that Cloud DLP is the most
complete tool and the one that includes the greatest number of capabilities in its
service, although with limitations. Through the designed use cases, I assessed its
behavior in different situations with unstructured input data, where the behavior in
the inspection of documents is adequate. The custom infotype generates more errors in
the analysis because it is generic, so this capability should be refined to improve the
results for specific use cases.
On the other hand, the behavior of the tool when inspecting images falls short. Data
protection is critical for any business, and in this use case the tool has not been able
to detect all the sensitive information contained in the DNI. Cloud DLP cannot ensure
high efficiency for this type of file due to its lack of maturity. Regarding the proposed
de-identification method, I was able to save the coordinates where the infotype was
located and apply the function to that region. The data transformation works correctly,
but only for the last finding returned. It will be necessary to extend the code to make
it useful and efficient enough to be introduced as a new data transformation method.
Bibliography
[1] https://www.powerdata.es/cloud
[2] https://www.salesforce.com/mx/cloud-computing/
[3] https://intelequia.com/blog/post/2055/ventajas-del-cloud-computing
[4] https://www.redhat.com/es/topics/cloud-computing/what-is-saas
[5] https://www.redhat.com/es/topics/cloud-computing/what-is-paas
[6] https://www.redhat.com/es/topics/cloud-computing/what-is-iaas
[7] https://nexica.com/es/blog/modelos-de-despliegue-cloud-cloud-privado-cloud-p%C3%BAblico-y-cloud-h%C3%ADbrido
[8] https://www.redeszone.net/tutoriales/redes-cable/que-es-multicloud-ventajas/
[9] https://www.cybrary.it/blog/introduction-to-data-loss-prevention/
[10] https://dspace.library.uvic.ca/bitstream/handle/1828/11339/Alhindi_Hanan_PhD_2019.pdf?sequence=1&isAllowed=y. Page 22: Data Loss Prevention
[11] https://www.sealpath.com/es/blog/tres_estados_info/
[12] https://www.itdigitalsecurity.es/reportajes/2019/01/dlp-o-como-prevenir-la-fuga-de-datos
[13] https://www.mcafee.com/blogs/enterprise/cloud-security/do-you-dlp-understanding-the-difference-between-content-awareness-and-contextual-analysis/#:~:text=Data%20loss%20prevention%20(DLP)%2C,file%20servers%20or%20in%20cloud
[14] Gartner: Data Loss Prevention: Comparing Architecture Options. Published 3rd December 2020 – ID G00731429
[15] https://digitalguardian.com/blog/what-data-classification-data-classification-definition
[16] https://docs.trendmicro.com/all/ent/imsec/v1.6/en-us/imsec_1.6_olh/dac_keywords.html
[17] Gartner: Guide to Cloud Data Security Concepts. Published 21st September 2021 – ID G00756156
[18] https://www.microsoft.com/en-us/insidetrack/using-azure-information-protection-to-classify-and-label-corporate-data
[19] https://techdocs.broadcom.com/us/en/symantec-security-software/information-security/data-loss-prevention/15-8/about-data-loss-prevention-policies-v27576413-d327e9/supported-formats-for-file-type-identification-v41600705-d327e133471.html
[20] https://digitalguardian.com/dskb/data-discovery
[21] https://digitalguardian.com/dskb/data-egress
[22] https://knowledge.broadcom.com/external/article/160504/detect-sensitive-data-in-an-image-file-w.html
[23] https://cloud.google.com/dlp/docs/analyzing-and-reporting
[24] https://securityintelligence.com/data-activity-monitoring-and-data-loss-prevention-a-balanced-approach-to-securing-your-critical-assets/
[25] https://cloud.google.com/dlp?hl=es
[26] https://is.muni.cz/th/asqds/thesis.pdf. Page 10: Leakage handling
[27] https://docs.microsoft.com/es-es/azure/information-protection/what-is-information-protection
[28] https://docs.aws.amazon.com/macie/latest/user/macie-user-guide.pdf#what-is-macie
[29] https://cloudacademy.com/course/introduction-to-google-cloud-data-loss-prevention/introduction-to-google-cloud-data-loss-prevention/
https://github.com/MicrosoftDocs/Azure-RMSDocs/blob/master/Azure-RMSDocs/rms-client/client-admin-guide-file-types.md
https://cloud.google.com/dlp/docs/infotypes-reference?hl=es-419
https://docs.aws.amazon.com/macie/latest/user/discovery-supported-formats.html
https://cloud.google.com/dlp/docs/inspecting-storage?hl=es_419
https://stealthbits.com/blog/using-the-azure-information-protection-aip-scanner-to-discover-sensitive-data/
https://cloud.google.com/dlp/docs/sensitivity-risk-calculation
https://docs.microsoft.com/es-es/azure/information-protection/aip-classification-and-protection
https://cloud.google.com/dlp/docs/concepts-image-redaction
https://techcommunity.microsoft.com/t5/security-compliance-and-identity/azure-information-protection-documentation-update-for-november/ba-p/287364
https://aws.amazon.com/blogs/security/how-to-create-custom-alerts-with-amazon-macie/
[30] https://cloud.google.com/storage?hl=es
[31] https://cloud.google.com/bigquery?hl=es
[32] https://cloud.google.com/datastore?hl=es
[33] https://cloud.google.com/dlp/docs/classification-redaction#storage_classification
[34] https://cloud.google.com/dlp/docs/concepts-infotypes?hl=es_419
[35] https://cloud.google.com/dlp/docs/likelihood
[36] https://cloud.google.com/dlp/demo/#!/#!%2F
[37] https://cloud.google.com/dlp/docs/deidentify-sensitive-data
[38] https://cloud.google.com/dlp/docs/concepts-templates
[39] https://cloud.google.com/dlp/docs/creating-job-triggers?hl=es_419
[40] https://cloud.google.com/gcp/?hl=es
[41] https://cloud.google.com/data-fusion/docs/create-data-pipeline
[42] https://cloud.google.com/dlp/docs/redacting-sensitive-data-images
Appendices
Image.js
// Imports the Google Cloud and image processing libraries
const DLP = require('@google-cloud/dlp');
const fs = require('fs'); // needed below to read the image from disk
const mime = require('mime'); // needed below to resolve the file's MIME type
const jimp = require('jimp');

// Instantiates a client
const dlp = new DLP.DlpServiceClient();

// The path to a local file to inspect. Can be a JPG or PNG image file.
const filepath = 'blanca2.jpg';

// The following values are missing from the original listing and are
// assumed: project id, output file, quote flag, and the infotypes of
// the image use case
const projectId = 'my-project'; // assumed GCP project id
const outputPath = 'blanca2-redacted.jpg'; // assumed output file name
const includeQuote = true;
const infoTypes = [
  {name: 'FIRST_NAME'},
  {name: 'LAST_NAME'},
  {name: 'LOCATION'},
  {name: 'SPAIN_DNI_NUMBER'},
  {name: 'GENDER'},
  {name: 'DATE_OF_BIRTH'},
  {name: 'DATE'},
];
async function inspectAndPixelateImage() {
  // Build redaction configs from the infotypes (kept from the original
  // listing; used by redactImage requests rather than by inspectContent)
  const imageRedactionConfigs = infoTypes.map(infoType => {
    return {infoType: infoType};
  });

  // Resolve the DLP byte-type constant from the file's MIME type
  const fileTypeConstant =
    ['image/jpeg', 'image/bmp', 'image/png', 'image/svg'].indexOf(
      mime.getType(filepath)
    ) + 1;

  // Read the image and encode it in base64, as the API expects
  const fileBytes = Buffer.from(fs.readFileSync(filepath)).toString('base64');
  const item = {
    byteItem: {
      type: fileTypeConstant,
      data: fileBytes,
    },
  };

  // The request construction is missing from the original listing; this
  // is an assumed reconstruction following the standard inspectContent
  // request format
  const inspectRequest = {
    parent: `projects/${projectId}/locations/global`,
    inspectConfig: {
      infoTypes: infoTypes,
      minLikelihood: 'VERY_UNLIKELY', // threshold accepted in the analysis
      includeQuote: includeQuote,
    },
    item: item,
  };
  // Run request
  const [responseInspect] = await dlp.inspectContent(inspectRequest);
  const findings = responseInspect.result.findings;
  let results = 0;

  if (findings.length > 0) {
    console.log('Findings:');
    findings.forEach(finding => {
      if (includeQuote) {
        results = results + 1;
        console.log(`\tQuote: ${finding.quote}`);
        console.log(`\tInfo type: ${finding.infoType.name}`);
        console.log(`\tLikelihood: ${finding.likelihood}`);
        console.log('\n');

        // Print the bounding box of each finding and pixelate that region
        finding.location.contentLocations.forEach(location => {
          location.imageLocation.boundingBoxes.forEach(box => {
            console.log(`\t\tTop: ${box.top}`);
            console.log(`\t\tLeft: ${box.left}`);
            console.log(`\t\tHeight: ${box.height}`);
            console.log(`\t\tWidth: ${box.width}`);
            console.log('\n');

            const size = 7; // pixelation block size
            // Note: the image is re-read from disk for every bounding box,
            // so each write overwrites the previous one and only the last
            // finding ends up masked, as discussed in the results
            jimp.read(filepath).then(image => {
              return image
                .pixelate(size, box.left, box.top, box.width, box.height)
                .write(outputPath);
            });
          });
        });
      }
    });
    console.log(`Total findings: ${results}`);
    console.log(`Saved image redaction results to path: ${outputPath}`);
  }
}

inspectAndPixelateImage();
Glossary
AIP: Azure Information Protection