Google Cloud Certified - Professional Data Engineer Practice Exam 1 - Results
Exam 1 - Results
Attempt 2
All knowledge areas
All questions
Question 1: Correct
You create an important report for your large team in Google Data Studio 360. The
report uses Google BigQuery as its data source. You notice that visualizations are not
showing data that is less than 1 hour old. What should you do?
(Correct)
D. Clear your browser history for the past hour then reload the tab showing the visualizations.
Explanation
Correct answer is A as Data Studio caches data for performance; since the latest data is not
shown, caching can be disabled to fetch the latest data.
Refer GCP documentation - Data Studio Caching
Option B is wrong as BigQuery does not cache the data.
Options C & D are wrong as these would not allow fetching of the latest data.
Question 2: Correct
Your company’s on-premises Hadoop and Spark jobs have been migrated to Cloud
Dataproc. When using Cloud Dataproc clusters, you can access the YARN web interface
by configuring a browser to connect through which proxy?
A. HTTPS
B. VPN
C. SOCKS
(Correct)
D. HTTP
Explanation
Correct answer is C as the internal services can be accessed using the SOCKS proxy server.
Refer GCP documentation - Dataproc - Connecting to web interfaces
You can connect to web interfaces running on a Cloud Dataproc cluster using your project's
Cloud Shell or the Cloud SDK gcloud command-line tool:
Cloud Shell: The Cloud Shell in the Google Cloud Platform Console has the Cloud
SDK commands and utilities pre-installed, and it provides a Web Preview feature that allows
you to quickly connect through an SSH tunnel to a web interface port on a cluster. However,
a connection to the cluster from Cloud Shell uses local port forwarding, which opens a
connection to only one port on a cluster web interface—multiple commands are needed to
connect to multiple ports. Also, Cloud Shell sessions automatically terminate after a period
of inactivity (30 minutes).
gcloud command-line tool: The gcloud compute ssh command with dynamic port
forwarding allows you to establish an SSH tunnel and run a SOCKS proxy server on top of
the tunnel. After issuing this command, you must configure your local browser to use the
SOCKS proxy. This connection method allows you to connect to multiple ports on a cluster
web interface.
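For illustration, a minimal sketch of the dynamic port forwarding approach described above, invoking gcloud from Python's subprocess module. The cluster master name, zone, and SOCKS port are placeholder assumptions, and the Cloud SDK is assumed to be installed and authenticated.

```python
# Sketch: open an SSH tunnel with dynamic port forwarding to a Dataproc master
# node and run a SOCKS proxy on local port 1080. Names are placeholders.
import subprocess

CLUSTER_MASTER = "my-cluster-m"   # hypothetical master node name
ZONE = "us-central1-a"            # hypothetical zone
SOCKS_PORT = 1080

# Equivalent to: gcloud compute ssh my-cluster-m --zone=us-central1-a -- -D 1080 -N
subprocess.run(
    [
        "gcloud", "compute", "ssh", CLUSTER_MASTER,
        f"--zone={ZONE}",
        "--",                      # everything after -- is passed to ssh
        "-D", str(SOCKS_PORT),     # dynamic port forwarding (SOCKS proxy)
        "-N",                      # do not execute a remote command
    ],
    check=True,
)
# The local browser is then configured to use localhost:1080 as a SOCKS proxy
# to reach the YARN web UI (port 8088) on the cluster master.
```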
Question 3: Correct
Your company is planning to migrate their on-premises Hadoop and Spark jobs to
Dataproc. Which role must be assigned to a service account used by the virtual
machines in a Dataproc cluster, so they can execute jobs?
A. Dataproc Worker
(Correct)
B. Dataproc Viewer
C. Dataproc Runner
D. Dataproc Editor
Explanation
Correct answer is A as the compute engine should have Dataproc Worker role assigned.
Refer GCP documentation - Dataproc Service Accounts
Service accounts have IAM roles granted to them. Specifying a user-managed service
account when creating a Cloud Dataproc cluster allows you to create and utilize clusters
with fine-grained access and control to Cloud resources. Using multiple user-managed
service accounts with different Cloud Dataproc clusters allows for clusters with different
access to Cloud resources.
Service accounts used with Cloud Dataproc must have Dataproc/Dataproc Worker role (or
have all the permissions granted by Dataproc Worker role).
Question 4: Correct
You currently have a Bigtable instance you've been using for development running a
development instance type, using HDDs for storage. You are ready to upgrade your
development instance to a production instance for increased performance. You also
want to upgrade your storage to SSDs as you need maximum performance for your
instance. What should you do?
A. Upgrade your development instance to a production instance, and switch your storage type
from HDD to SSD.
B. Export your Bigtable data into a new instance, and configure the new instance type as
production with SSDs.
(Correct)
C. Run parallel instances where one instance is using HDD and the other is using SSD.
D. Use the Bigtable instance sync tool in order to automatically synchronize two different
instances, with one having the new storage configuration.
Explanation
Correct answer is B as the storage for the cluster cannot be updated. You need to define the
new cluster and copy or import the data to it.
Refer GCP documentation - Bigtable Choosing HDD vs SSD
Switching between SSD and HDD storage
When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage
for the cluster is permanent. You cannot use the Google Cloud Platform Console to change
the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice-versa, you can export the data
from the existing instance and import the data into a new instance. Alternatively, you can use
a Cloud Dataflow or Hadoop MapReduce job to copy the data from one instance to another.
Keep in mind that migrating an entire instance takes time, and you might need to add nodes
to your Cloud Bigtable clusters before you migrate your instance.
Option A is wrong as storage type cannot be changed.
Options C & D are wrong as they would keep two instances running at the same time with the
same data, thereby increasing cost.
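For illustration, a minimal sketch (assuming the google-cloud-bigtable Python client; project, instance, cluster, and zone names are placeholders) of creating the new production instance with SSD storage into which the exported data would be imported:

```python
# Minimal sketch: create a new PRODUCTION Bigtable instance with SSD storage.
# Assumes google-cloud-bigtable is installed and Bigtable Admin permissions.
from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)

instance = client.instance(
    "prod-instance",                                  # hypothetical instance ID
    display_name="Production instance",
    instance_type=enums.Instance.Type.PRODUCTION,
)
cluster = instance.cluster(
    "prod-cluster-c1",                                # hypothetical cluster ID
    location_id="us-central1-b",
    serve_nodes=3,
    default_storage_type=enums.StorageType.SSD,       # storage type is fixed at creation
)
operation = instance.create(clusters=[cluster])
operation.result(timeout=300)   # wait for the instance to be ready
# Data exported from the old HDD instance (e.g. via Dataflow) is then imported here.
```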
Question 5: Correct
You have spent a few days loading data from comma-separated values (CSV) files into
the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of
click events. For convenience, you chose a simple schema where every field is treated as
the STRING type. Now, you want to compute web session durations of users who visit
your site, and you want to change its data type to the TIMESTAMP. You want to
minimize the migration effort without making future queries computationally
expensive. What should you do?
A. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the
TIMESTAMP type. Reload the data.
B. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the
numeric values from the column DT for each row. Reference the column TS instead of the
column DT from now on.
C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into
TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table
CLICK_STREAM from now on.
D. Construct a query to return every row of the table CLICK_STREAM, while using the built-in
function to cast strings from the column DT into TIMESTAMP values. Run the query into a
destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type.
Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now
on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
(Correct)
Explanation
Correct answer is D as the column type cannot be changed in place; the column needs to be
cast and loaded into a new table using either a SQL query or export/import.
Refer GCP documentation - BigQuery Changing Schema
Changing a column's data type is not supported by the GCP Console, the classic BigQuery
web UI, the command-line tool, or the API. If you attempt to update a table by applying a
schema that specifies a new data type for a column, the following error is
returned: BigQuery error in update operation: Provided Schema does not match
Table [PROJECT_ID]:[DATASET].[TABLE].
Using a SQL query — Choose this option if you are more concerned about simplicity
and ease of use, and you are less concerned about costs.
Recreating the table — Choose this option if you are more concerned about costs,
and you are less concerned about simplicity and ease of use.
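For illustration, a minimal sketch of the SQL-query approach in option D using the google-cloud-bigquery client. The dataset and table names follow the question, the project ID is a placeholder, and the DT column is assumed to hold epoch seconds stored as STRING.

```python
# Sketch: cast the STRING epoch column DT to a TIMESTAMP column TS and write
# the result to a destination table NEW_CLICK_STREAM.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.NEW_CLICK_STREAM",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT
  * EXCEPT (DT),
  TIMESTAMP_SECONDS(CAST(DT AS INT64)) AS TS  -- epoch seconds stored as STRING
FROM `my-project.my_dataset.CLICK_STREAM`
"""

client.query(sql, job_config=job_config).result()  # wait for completion
```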
A. Change the dataset from a regional location to multi-region location, specifying the regions to
be included.
B. Export the data from BigQuery into a bucket in the new location, and import it into a new
dataset at the new location.
C. Copy the data from the dataset in the source region to the dataset in the target region using
BigQuery commands.
D. Export the data from BigQuery into nearby bucket in Cloud Storage. Copy to a new regional
bucket in Cloud Storage in the new location and Import into the new dataset.
(Correct)
Explanation
Correct answer is D as the dataset location cannot be changed once created. The dataset needs
to be copied using Cloud Storage.
Refer GCP documentation - BigQuery Exporting Data
You cannot change the location of a dataset after it is created. Also, you cannot move a
dataset from one location to another. If you need to move a dataset from one location to
another, follow this process:
1. Export the data from your BigQuery tables to a regional or multi-region Cloud
Storage bucket in the same location as your dataset. For example, if your dataset is in
the EU multi-region location, export your data into a regional or multi-region bucket
in the EU. There are no charges for exporting data from BigQuery, but you do incur
charges for storing the exported data in Cloud Storage. BigQuery exports are subject
to the limits on export jobs.
2. Copy or move the data from your Cloud Storage bucket to a regional or multi-region
bucket in the new location. For example, if you are moving your data from the US
multi-region location to the Tokyo regional location, you would transfer the data to a
regional bucket in Tokyo. Note that transferring data between regions incurs network
egress charges in Cloud Storage.
3. After you transfer the data to a Cloud Storage bucket in the new location, create a
new BigQuery dataset (in the new location). Then, load your data from the Cloud
Storage bucket into BigQuery. You are not charged for loading the data into
BigQuery, but you will incur charges for storing the data in Cloud Storage until you
delete the data or the bucket. You are also charged for storing the data in BigQuery
after it is loaded. Loading data into BigQuery is subject to the limits on load jobs.
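For illustration, a hedged sketch of steps 1 and 3 above (export to Cloud Storage, then load into a dataset created in the new location). Bucket, dataset, and table names are placeholders, and the cross-region copy of the bucket contents (step 2) is omitted.

```python
# Sketch of the export/load steps for moving a table to a dataset in a new location.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Export the table to a bucket co-located with the source dataset.
extract_job = client.extract_table(
    "my-project.eu_dataset.sales",
    "gs://my-eu-export-bucket/sales-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()

# (2. Copy the exported files to a bucket in the new location, e.g. with gsutil.)

# 3. Load the copied files into a dataset created in the new location.
load_job = client.load_table_from_uri(
    "gs://my-tokyo-bucket/sales-*.avro",
    "my-project.tokyo_dataset.sales",
    job_config=bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO),
)
load_job.result()
```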
Question 7: Correct
A company has loaded its complete financial data for the last year into BigQuery for
analytics. A Data Analyst is concerned that a BigQuery query could be too expensive.
Which methods can be used to reduce the number of rows processed by BigQuery?
A. Use the LIMIT clause to limit the number of values in the results.
B. Use the SELECT clause to limit the amount of data in the query. Partition data by date so the
query can be more focused.
(Correct)
C. Set the Maximum Bytes Billed, which will limit the number of bytes processed but still run the
query if the number of bytes requested goes over the limit.
D. Use GROUP BY so the results will be grouped into fewer output values.
Explanation
Correct answer is B as SELECT with partition would limit the data for querying.
Refer GCP documentation - BigQuery Cost Best Practices
Best practice: Partition your tables by date.
If possible, partition your BigQuery tables by date. Partitioning your tables allows you to
query relevant subsets of data which improves performance and reduces costs.
For example, when you query partitioned tables, use the _PARTITIONTIME pseudo column to
filter for a date or a range of dates. The query processes data only in the partitions that are
specified by the date or range.
Option A is wrong as LIMIT does not reduce cost as the amount of data queried is still the
same.
Best practice: Do not use a LIMIT clause as a method of cost control.
Applying a LIMIT clause to a query does not affect the amount of data that is read. It merely
limits the results set output. You are billed for reading all bytes in the entire table as
indicated by the query.
The amount of data read by the query counts against your free tier quota despite the
presence of a LIMIT clause.
Option C is wrong as the query would fail and would not execute if the Maximum bytes limit
is exceeded by the query.
Best practice: Use the maximum bytes billed setting to limit query costs.
You can limit the number of bytes billed for a query using the maximum bytes billed setting.
When you set maximum bytes billed, if the query will read bytes beyond the limit, the query
fails without incurring a charge.
Option D is wrong as GROUP BY would return less output, but would still query the entire
data.
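For illustration, a minimal sketch combining the two cost controls discussed above: filtering on the _PARTITIONTIME pseudo column and setting maximum bytes billed. The table name and byte limit are placeholder assumptions.

```python
# Sketch: query only one day's partition of an ingestion-time partitioned table,
# with a maximum-bytes-billed safety limit.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT field1, field2
FROM `my-project.finance.transactions`
WHERE _PARTITIONTIME = TIMESTAMP("2019-01-15")  -- prunes the scan to one partition
"""

job_config = bigquery.QueryJobConfig(
    maximum_bytes_billed=10 * 1024 ** 3,  # fail the query (without charge) beyond 10 GB
)

for row in client.query(sql, job_config=job_config):
    print(row)
```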
Question 8: Correct
Your company receives streaming data from IoT sensors capturing various parameters.
You need to calculate a running average for each of the parameters on the streaming data,
taking into account the data that can arrive late and out of order. How would you
design the system?
A. Use Cloud Pub/Sub and Cloud Dataflow with Sliding Time Windows.
(Correct)
A. The model is working extremely well, indicating the hyperparameters are set correctly.
(Correct)
Cloud IoT Core is a fully managed service that allows you to easily and securely connect,
manage, and ingest data from millions of globally dispersed devices. Cloud IoT Core, in
combination with other services on Cloud IoT platform, provides a complete solution for
collecting, processing, analyzing, and visualizing IoT data in real time to support improved
operational efficiency.
Cloud IoT Core, using Cloud Pub/Sub underneath, can aggregate dispersed device data into
a single global system that integrates seamlessly with Google Cloud data analytics services.
Use your IoT data stream for advanced analytics, visualizations, machine learning, and more
to help improve operational efficiency, anticipate problems, and build rich models that better
describe and optimize your business.
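As a small illustration of the ingestion side described above, a hedged sketch of a device or gateway publishing a JSON reading to a Pub/Sub topic, attaching the event timestamp for downstream processing. The project, topic, and attribute names are assumptions.

```python
# Sketch: publish a JSON sensor reading to a Pub/Sub topic, attaching the event
# timestamp so a downstream pipeline can window on event time.
import json
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-events")

reading = {"device_id": "sensor-42", "temperature": 21.7, "ts": time.time()}

future = publisher.publish(
    topic_path,
    json.dumps(reading).encode("utf-8"),
    event_timestamp=str(reading["ts"]),   # attribute used as the event-time hint
)
print("Published message ID:", future.result())
```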
Question 11: Correct
You are building storage for files for a data pipeline on Google Cloud. You want to
support JSON files. The schema of these files will occasionally change. Your analyst
teams will use running aggregate ANSI SQL queries on this data. What should you do?
A. Use BigQuery for storage. Provide format files for data load. Update the format files as
needed.
B. Use BigQuery for storage. Select "Automatically detect" in the Schema section.
(Correct)
C. Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the
"Automatically detect" option in the Schema section of BigQuery.
D. Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the
"Automatically detect" option in the Schema section of BigQuery.
Explanation
Correct answer is B as the requirement is to support occasionally (schema) changing JSON
files and aggregate ANSI SQL queries: you need to use BigQuery, and it is quickest to use
'Automatically detect' for schema changes.
Refer GCP documentation - BigQuery Auto-Detection
Schema auto-detection is available when you load data into BigQuery, and when you query
an external data source.
When auto-detection is enabled, BigQuery starts the inference process by selecting a random
file in the data source and scanning up to 100 rows of data to use as a representative sample.
BigQuery then examines each field and attempts to assign a data type to that field based on
the values in the sample.
To see the detected schema for a table:
When enabled, BigQuery makes a best-effort attempt to automatically infer the schema for
CSV and JSON files.
A is not correct because you should not provide format files: you can simply turn on the
'Automatically detect' schema changes flag.
C and D are not correct as Cloud Storage is not ideal for this scenario; it is cumbersome, adds
latency and doesn't add value.
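For illustration, a minimal sketch of loading newline-delimited JSON from Cloud Storage into BigQuery with schema auto-detection enabled. Bucket, dataset, and table names are placeholders.

```python
# Sketch: load newline-delimited JSON files with "Automatically detect" schema.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                                       # auto-detect the schema
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,  # tolerate occasional new fields
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()
print("Loaded rows:", client.get_table("my-project.analytics.events").num_rows)
```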
Question 12: Correct
You have 250,000 devices which produce a JSON device status event every 10 seconds.
You want to capture this event data for outlier time series analysis. What should you
do?
A. Ship the data into BigQuery. Develop a custom application that uses the BigQuery API to
query the dataset and displays device outlier data based on your business requirements.
B. Ship the data into BigQuery. Use the BigQuery console to query the dataset and display device
outlier data based on your business requirements.
C. Ship the data into Cloud Bigtable. Use the Cloud Bigtable cbt tool to display device outlier
data based on your business requirements.
(Correct)
D. Ship the data into Cloud Bigtable. Install and use the HBase shell for Cloud Bigtable to query
the table for device outlier data based on your business requirements.
Explanation
Correct answer is C as the time series data with its data type, volume, and query pattern best
fits BigTable capabilities.
Refer GCP documentation - Bigtable Time Series data and CBT
Options A & B are wrong as BigQuery is not suitable for the query pattern in this scenario.
Option D is wrong as you can use the simpler method of 'cbt tool' to support this scenario.
Question 13: Correct
You are building a data pipeline on Google Cloud. You need to select services that will
host a deep neural network machine-learning model also hosted on Google Cloud. You
also need to monitor and run jobs that could occasionally fail. What should you do?
A. Use Cloud Machine Learning to host your model. Monitor the status of the Operation object
for 'error' results.
B. Use Cloud Machine Learning to host your model. Monitor the status of the Jobs object for
'failed' job states.
(Correct)
C. Use a Kubernetes Engine cluster to host your model. Monitor the status of the Jobs object for
'failed' job states.
D. Use a Kubernetes Engine cluster to host your model. Monitor the status of Operation object
for 'error' results.
Explanation
Correct answer is B as the requirement is to host a machine learning deep neural network
model, so it is ideal to use the Cloud Machine Learning service. Monitoring works on the Jobs object.
Refer GCP documentation - ML Engine Managing Jobs
You can use projects.jobs.get to get the status of a job. This method is also provided
as gcloud ml-engine jobs describe and in the Jobs page in the Google Cloud Platform Console.
Regardless of how you get the status, the information is based on the members of the Job
resource. You'll know the job is complete when Job.state in the response is equal to one of
these values:
SUCCEEDED
FAILED
CANCELLED
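For illustration, a hedged sketch of polling the Jobs object through the projects.jobs.get method referenced above, using the Google API client library. The project and job names are placeholders, and application default credentials are assumed.

```python
# Sketch: poll a Cloud ML Engine training job and report a FAILED state.
import time

from googleapiclient import discovery

ml = discovery.build("ml", "v1")
job_name = "projects/my-project/jobs/my_training_job"   # hypothetical job

while True:
    job = ml.projects().jobs().get(name=job_name).execute()
    state = job.get("state")
    print("Job state:", state)
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        if state == "FAILED":
            print("Job failed:", job.get("errorMessage"))
        break
    time.sleep(60)   # poll once a minute
```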
A. Build an application that calls the Cloud Vision API. Inspect the generated MID values to
supply the image labels.
B. Build an application that calls the Cloud Vision API. Pass landmark locations as base64-
encoded strings.
(Correct)
C. Build and train a classification model with TensorFlow. Deploy the model using Cloud
Machine Learning Engine. Pass landmark locations as base64-encoded strings.
D. Build and train a classification model with TensorFlow. Deploy the model using Cloud
Machine Learning Engine. Inspect the generated MID values to supply the image labels.
Explanation
Correct answer is B as the requirement is to quickly develop a model that generates landmark
labels from photos; this can be easily supported by the Cloud Vision API.
Refer GCP documentation - Cloud Vision
Cloud Vision offers both pretrained models via an API and the ability to build custom models
using AutoML Vision to provide flexibility depending on your use case.
Cloud Vision API enables developers to understand the content of an image by
encapsulating powerful machine learning models in an easy-to-use REST API. It quickly
classifies images into thousands of categories (such as, “sailboat”), detects individual
objects and faces within images, and reads printed words contained within images. You can
build metadata on your image catalog, moderate offensive content, or enable new marketing
scenarios through image sentiment analysis.
Option A is wrong as you should not inspect the generated MID values; instead, you should
simply pass the image locations to the API and use the labels, which are output.
Options C & D are wrong as you should not build a custom classification TensorFlow model
for this scenario, as it would take too much time.
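For illustration, a minimal sketch of calling the Cloud Vision API for landmark detection on an image stored in Cloud Storage. The GCS URI is a placeholder, and the exact client surface can differ slightly between google-cloud-vision versions.

```python
# Sketch: detect landmarks in a photo stored in Cloud Storage with the Vision API.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

image = vision.Image(
    source=vision.ImageSource(image_uri="gs://my-bucket/photos/eiffel.jpg")
)

response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)   # label plus confidence
```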
Question 15: Correct
You regularly use prefetch caching with a Data Studio report to visualize the results of
BigQuery queries. You want to minimize service costs. What should you do?
A. Set up the report to use the Owner's credentials to access the underlying data in BigQuery, and
direct the users to view the report only once per business day (24-hour period).
B. Set up the report to use the Owner's credentials to access the underlying data in BigQuery, and
verify that the 'Enable cache' checkbox is selected for the report.
(Correct)
C. Set up the report to use the Viewer's credentials to access the underlying data in BigQuery, and
also set it up to be a 'view-only' report.
D. Set up the report to use the Viewer's credentials to access the underlying data in BigQuery,
and verify that the 'Enable cache' checkbox is not selected for the report.
Explanation
Correct option is B as you must set the Owner's credentials to use the 'Enable cache' option
for a report backed by BigQuery. It is also a Google best practice to use the 'Enable cache'
option when the business scenario calls for using prefetch caching.
Refer GCP documentation - Datastudio data caching
The prefetch cache is only active for data sources that use owner's credentials to access the
underlying data.
Options A, C & D are wrong as the cache auto-expires every 12 hours, and the prefetch cache
is only used for data sources that use the Owner's credentials, not the Viewer's credentials.
Question 16: Correct
Your customer is moving their corporate applications to Google Cloud Platform. The
security team wants detailed visibility of all projects in the organization. You provision
the Google Cloud Resource Manager and set up yourself as the org admin. What
Google Cloud Identity and Access Management (Cloud IAM) roles should you give to
the security team?
(Correct)
A. Google BigQuery
(Correct)
(Correct)
Option A is wrong as the stack is correct, however the order is not correct.
Option B is wrong as Dataproc is not an ideal tool for analysis. Cloud Dataproc is a fast,
easy-to-use, fully-managed cloud service for running Apache Spark and Apache Hadoop
clusters in a simpler, more cost-efficient way.
Option D is wrong as App Engine is not an ideal ingestion tool to handle IoT data.
Question 19: Correct
Your company is planning the infrastructure for a new large-scale application that will
need to store over 100 TB, potentially up to a petabyte, of data in NoSQL format for low-latency
read/write access and high-throughput analytics. Which storage option should you use?
A. Cloud Bigtable
(Correct)
B. Cloud Spanner
C. Cloud SQL
D. Cloud Datastore
Explanation
Correct answer is A as Bigtable is an ideal solution providing a low-latency, high-throughput
storage option with analytics support.
Refer GCP documentation - Storage Options
Cloud Bigtable: A scalable, fully managed NoSQL wide-column database that is suitable for
both low-latency single-point lookups and precalculated analytics.
Workloads: low-latency read/write access, high-throughput data processing, time series
support.
Use cases: IoT, finance, adtech, personalization, recommendations, monitoring, geospatial
datasets, graphs.
B. Enable your IoT devices to generate a timestamp when sending messages. Use Cloud
Dataflow to process messages, and use windows, watermarks (timestamp), and triggers to process
late data.
(Correct)
D. Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Pub/Sub
to process messages by timestamp and fix out of order issues.
Explanation
Correct answer is B as Cloud Pub/Sub can help handle the streaming data. However, Cloud
Pub/Sub does not handle the ordering, which can be done using Dataflow and adding
watermarks to the messages from the source.
Refer GCP documentation - Cloud Pub/Sub ordering & Subscriber
How do you assign an order to messages published from different publishers? Either the
publishers themselves have to coordinate, or the message delivery service itself has to attach
a notion of order to every incoming message. Each message would need to include the
ordering information. The order information could be a timestamp (though it has to be a
timestamp that all servers get from the same source in order to avoid issues of clock drift), or
a sequence number (acquired from a single source with ACID guarantees). Other messaging
systems that guarantee ordering of messages require settings that effectively limit the system
to multiple publishers sending messages through a single server to a single subscriber.
Typically, Cloud Pub/Sub delivers each message once and in the order in which it was
published. However, messages may sometimes be delivered out of order or more than once.
In general, accommodating more-than-once delivery requires your subscriber to
be idempotent when processing messages. You can achieve exactly once processing of Cloud
Pub/Sub message streams using Cloud Dataflow PubsubIO. PubsubIO de-duplicates
messages on custom message identifiers or those assigned by Cloud Pub/Sub. You can also
achieve ordered processing with Cloud Dataflow by using the standard sorting APIs of the
service. Alternatively, to achieve ordering, the publisher of the topic to which you subscribe
can include a sequence token in the message.
Options A & C are wrong as SQL and BigQuery do not support ingestion and ordering of IoT
data and would need other services like Pub/Sub.
Option D is wrong as Cloud Pub/Sub does not perform ordering of messages.
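For illustration, a hedged Apache Beam (Python SDK) sketch of the approach in option B: read from Pub/Sub using the device-supplied timestamp, then window on event time with a watermark-driven trigger and allowed lateness so late, out-of-order data is still processed. The topic name, attribute name, and window sizes are assumptions.

```python
# Sketch (Apache Beam Python SDK): event-time windows with watermarks, a trigger
# for late data, and allowed lateness.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-events",
            timestamp_attribute="event_timestamp",        # use the device timestamp
        )
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                      # 1-minute event-time windows
            trigger=trigger.AfterWatermark(
                late=trigger.AfterProcessingTime(30)),    # re-fire when late data arrives
            allowed_lateness=600,                         # accept data up to 10 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "CountPerWindow" >> beam.CombineGlobally(
            beam.combiners.CountCombineFn()).without_defaults()
        | "Print" >> beam.Map(print)
    )
```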
Question 21: Correct
Your company has data stored in BigQuery in Avro format. You need to export this
Avro formatted data from BigQuery into Cloud Storage. What is the best method of
doing so from the web console?
A. Convert the data to CSV format using the BigQuery export options, then make the transfer.
B. Use the BigQuery Transfer Service to transfer Avro data to Cloud Storage.
C. Click on Export Table in BigQuery, and provide the Cloud Storage location to export to
(Correct)
D. Create a Dataflow job to manage the conversion of Avro data to CSV format, then export to
Cloud Storage.
Explanation
Correct answer is C as BigQuery can export Avro data natively to Cloud Storage.
Refer GCP documentation - BigQuery Exporting Data
After you've loaded your data into BigQuery, you can export the data in several formats.
BigQuery can export up to 1 GB of data to a single file. If you are exporting more than 1 GB
of data, you must export your data to multiple files. When you export your data to multiple
files, the size of the files will vary.
You cannot export data to a local file or to Google Drive, but you can save query results to a
local file. The only supported export location is Google Cloud Storage.
For Export format, choose the format for your exported data: CSV, JSON (Newline
Delimited), or Avro.
Option A is wrong as BigQuery can export Avro data natively to Cloud Storage and does not
need to be converted to CSV format.
Option B is wrong as BigQuery Transfer Service is for moving BigQuery data to Google
SaaS applications (AdWords, DoubleClick, etc.). You will want to do a normal export of
data, which works with Avro formatted data.
Option D is wrong as Google Cloud Dataflow can be used to read data from BigQuery
instead of manually exporting it, but it does not work through the web console.
Question 22: Correct
Your company has its input data hosted in BigQuery. They have existing Spark scripts
for performing analysis which they want to reuse. The output needs to be stored in
BigQuery for future analysis. How can you set up your Dataproc environment to use
BigQuery as an input and output source?
B. Manually use a Cloud Storage bucket to import and export to and from both BigQuery and
Dataproc
(Correct)
D. You can only use Cloud Storage or HDFS for your Dataproc input and output.
Explanation
Correct answer is C as Dataproc has a BigQuery connector library which allows it to directly
interface with BigQuery.
Refer GCP documentation - Dataproc BigQuery Connector
You can use a BigQuery connector to enable programmatic read/write access to BigQuery.
This is an ideal way to process data that is stored in BigQuery. No command-line access is
exposed. The BigQuery connector is a Java library that enables Hadoop to process data
from BigQuery using abstracted versions of the Apache Hadoop InputFormat and
OutputFormat classes.
Option A is wrong as a Bigtable syncing service does not exist.
Options B & D are wrong as Dataproc can directly interface with BigQuery.
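For illustration, a hedged PySpark sketch of reusing Spark logic on Dataproc with a BigQuery connector as both input and output. The table names and temporary bucket are placeholders, and the exact option names depend on the connector version installed on the cluster.

```python
# Sketch (PySpark on Dataproc): read input from BigQuery, transform with Spark,
# and write results back to BigQuery via the connector.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-analysis").getOrCreate()

# Read the input table from BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.analytics.transactions")
    .load()
)

# Reuse existing Spark logic.
summary = df.groupBy("customer_id").sum("amount")

# Write the results back to BigQuery (staged through a temporary GCS bucket).
(
    summary.write.format("bigquery")
    .option("table", "my-project.analytics.customer_totals")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save()
)
```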
Question 23: Correct
You are building a new real-time data warehouse for your company and will use Google
BigQuery streaming inserts. There is no guarantee that data will be sent in only once, but
you do have a unique ID for each row of data and an event timestamp. You want to
ensure that duplicates are not included while interactively querying data. Which query
type should you use?
B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS
NOT NULL.
D. Use the ROW_NUMBER window function with PARTITION by unique ID along with
WHERE row equals 1.
(Correct)
Explanation
Correct answer is D as the best approach is to use ROW_NUMBER with PARTITION BY the
unique ID and filter by row_number = 1.
Refer GCP documentation - BigQuery Streaming Data - Removing Duplicates
To remove duplicates, perform the following query. You should specify a destination table,
allow large results, and disable result flattening.
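For illustration, a minimal sketch of the deduplication query from option D, run interactively through the Python client. The table and column names (unique_id, event_ts) are assumptions.

```python
# Sketch: interactive deduplication of streamed rows using ROW_NUMBER().
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * EXCEPT (row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id
      ORDER BY event_ts DESC        -- keep the most recent copy of each row
    ) AS row_num
  FROM `my-project.warehouse.events`
)
WHERE row_num = 1
"""

for row in client.query(sql).result():
    print(dict(row))
```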
Question 24: Correct
Your company handles data processing for a number of different clients. Each client
prefers to use their own suite of analytics tools, with some allowing direct query access
via Google BigQuery. You need to secure the data so that clients cannot see each other’s
data. You want to ensure appropriate access to the data. Which three steps should you
take? (Choose three)
(Correct)
(Correct)
F. Use the appropriate identity and access management (IAM) roles for each client’s users.
(Correct)
Explanation
Correct answers are B, D & F as access control can be applied using IAM roles on the
dataset, granted only to the specific approved users.
Refer GCP documentation - BigQuery Access Control
BigQuery uses Identity and Access Management (IAM) to manage access to resources. The
three types of resources available in BigQuery are organizations, projects, and datasets. In
the IAM policy hierarchy, datasets are child resources of projects. Tables and views are
child resources of datasets — they inherit permissions from their parent dataset.
To grant access to a resource, assign one or more roles to a user, group, or service account.
Organization and project roles affect the ability to run jobs or manage the project's
resources, whereas dataset roles affect the ability to access or modify the data inside of a
particular dataset.
Options A & C are wrong as the access control can only be applied on datasets and views, not
on partitions and tables.
Option E is wrong as a service account is mainly for machines and would be a single shared account.
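For illustration, a minimal sketch of dataset-level access control with the google-cloud-bigquery client: granting one client's Google group read access to that client's dataset only. The dataset and group names are assumptions.

```python
# Sketch: grant one client's group READER access on that client's dataset only,
# so clients cannot see each other's data.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-project.client_a_dataset")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="client-a-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])   # other datasets remain invisible
```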
Question 25: Correct
Your company has hired a new data scientist who wants to perform complicated
analyses across very large datasets stored in Google Cloud Storage and in a Cassandra
cluster on Google Compute Engine. The scientist primarily wants to create labelled data
sets for machine learning projects, along with some visualization tasks. She reports that
her laptop is not powerful enough to perform her tasks and it is slowing her down. You
want to help her perform her tasks. What should you do?
C. Host a visualization tool on a VM on Google Compute Engine.
D. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
(Correct)
Explanation
Correct answer is D as Cloud Datalab provides a powerful, interactive, scalable tool on
Google Cloud with the ability to analyze and visualize data.
Refer GCP documentation - Datalab
Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and
visualize data and build machine learning models on Google Cloud Platform. It runs on
Google Compute Engine and connects to multiple cloud services easily so you can focus on
your data science tasks.
Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of
modules and a robust knowledge base. Cloud Datalab enables analysis of your data on
Google BigQuery, Cloud Machine Learning Engine, Google Compute Engine, and Google
Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions).
Whether you're analyzing megabytes or terabytes, Cloud Datalab has you covered. Query
terabytes of data in BigQuery, run local analysis on sampled data and run training jobs on
terabytes of data in Cloud Machine Learning Engine seamlessly.
Use Cloud Datalab to gain insight from your data. Interactively explore, transform, analyze,
and visualize your data using BigQuery, Cloud Storage and Python.
Go from data to deployed machine-learning (ML) models ready for prediction. Explore data,
build, evaluate and optimize Machine Learning models using TensorFlow or Cloud Machine
Learning Engine.
Options A, B & C are wrong as they do not provide all these capabilities.
Question 26: Correct
You are working on a sensitive project involving private user data. You have set up a
project on Google Cloud Platform to house your work internally. An external
consultant is going to assist with coding a complex transformation in a Google Cloud
Dataflow pipeline for your project. How should you maintain users’ privacy?
B. Grant the consultant the Cloud Dataflow Developer role on the project.
(Correct)
C. Create a service account and allow the consultant to log on with it.
D. Create an anonymized sample of the data for the consultant to work with in a different project.
Explanation
Correct answer is B as the Dataflow developer role would help provide the third-party
consultant access to create and work on the Dataflow pipeline. However, it does not provide
access to view the data, thus maintaining users' privacy.
Refer GCP documentation - Dataflow roles
roles/dataflow.viewer: dataflow.<resource-type>.list and dataflow.<resource-type>.get on
jobs, messages, and metrics.
roles/dataflow.developer: all of the above, as well as dataflow.jobs.create,
dataflow.jobs.drain, and dataflow.jobs.cancel.
roles/dataflow.admin: all of the above, as well as compute.machineTypes.get,
storage.buckets.get, storage.objects.create, storage.objects.get, and storage.objects.list.
Option A is wrong as it would not allow the consultant to work on the pipeline.
Option C is wrong as the consultant cannot use the service account to login.
Option D is wrong as it does not enable collaboration.
Question 27: Correct
Your software uses a simple JSON format for all messages. These messages are
published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to
create a real-time dashboard for the CFO. During testing, you notice that some
messages are missing in the dashboard. You check the logs, and all messages are being
published to Cloud Pub/Sub successfully. What should you do next?
B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
(Correct)
C. Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
D. Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub
pushing messages to Cloud Dataflow.
Explanation
Correct answer is B as the issue can be debugged by running a fixed dataset and checking the
output.
Refer GCP documentation - Dataflow logging
Option A is wrong as the dashboard uses data provided by Dataflow; the Dataflow pipeline
feeding the dashboard appears to be the issue.
Option C is wrong as Monitoring would not help find missing messages in Cloud Pub/Sub.
Option D is wrong as Dataflow cannot be configured as Push endpoint with Cloud Pub/Sub.
Question 28: Incorrect
Your company is in a highly regulated industry. One of your requirements is to ensure
individual users have access only to the minimum amount of information required to do
their jobs. You want to enforce this requirement with Google BigQuery. Which three
approaches can you take? (Choose three)
(Incorrect)
(Correct)
(Correct)
(Correct)
Explanation
Correct answers are D, E & F
Option D would help limit access to approved users only.
Option E as it would help segregate the data with the ability to provide access to users as per
their needs.
Option F as it would help in auditing.
Refer GCP documentation - BigQuery Dataset Access Control & Access Control
You share access to BigQuery tables and views using project- level IAM roles and dataset-
level access controls. Currently, you cannot apply access controls directly to tables or views.
Project-level access controls determine the users, groups, and service accounts allowed to
access all datasets, tables, views, and table data within a project. Dataset-level access
controls determine the users, groups, and service accounts allowed to access the tables,
views, and table data in a specific dataset.
Option A is wrong as disabling writes does not prevent users from reading and does not
align with the least-privilege principle.
Option B is wrong as access cannot be controlled on tables.
Option C is wrong as data is encrypted by default; however, it does not align with the least-
privilege principle.
Question 29: Correct
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud
Pub/Sub subscription as the source. You need to make an update to the code that will
make the new Cloud Dataflow pipeline incompatible with the current version. You do
not want to lose any data when making this update. What should you do?
(Correct)
B. Update the current pipeline and provide the transform mapping JSON object.
C. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old
pipeline.
D. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
Explanation
Correct answer is A as the key requirement is not to lose any data; the Dataflow pipeline can
be stopped using the Drain option. Drain causes Dataflow to stop ingesting new data, but
allows the existing processing to complete.
Refer GCP documentation - Dataflow Stopping a Pipeline
Using the Drain option to stop your job tells the Cloud Dataflow service to finish your job in
its current state. Your job will immediately stop ingesting new data from input sources.
However, the Cloud Dataflow service will preserve any existing resources, such as worker
instances, to finish processing and writing any buffered data in your pipeline. When all
pending processing and write operations are complete, the Cloud Dataflow service will clean
up the GCP resources associated with your job.
Note: Your pipeline will continue to incur the cost of maintaining any associated GCP
resources until all processing and writing has completed.
Use the Drain option to stop your job if you want to prevent data loss as you bring down
your pipeline.
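For illustration, a minimal sketch of draining the running job from a script before launching the updated, incompatible pipeline on the same subscription. The job ID and region are placeholders; this is the scripted equivalent of pressing Drain in the console.

```python
# Sketch: drain the running Dataflow job, then start the new pipeline once the
# drain has completed and no buffered data remains.
import subprocess

JOB_ID = "2019-01-15_00_00_00-1234567890123456789"   # hypothetical job ID
REGION = "us-central1"

subprocess.run(
    ["gcloud", "dataflow", "jobs", "drain", JOB_ID, f"--region={REGION}"],
    check=True,
)
```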
(Correct)
A. Cloud Datastore
B. Cloud Dataflow
C. Cloud Pub/Sub
(Correct)
D. Cloud Bigtable
Explanation
Correct answer is C as Cloud Storage upload events can publish notifications to Cloud
Pub/Sub, which can trigger a Cloud Function to ingest and process the image.
Refer GCP documentation - Cloud Storage Pub/Sub Notifications
Cloud Pub/Sub Notifications sends information about changes to objects in your buckets
to Cloud Pub/Sub, where the information is added to a Cloud Pub/Sub topic of your choice
in the form of messages. For example, you can track objects that are created and deleted in
your bucket. Each notification contains information describing both the event that triggered
it and the object that changed.
Cloud Pub/Sub Notifications are the recommended way to track changes to objects in your
Cloud Storage buckets because they're faster, more flexible, easier to set up, and more cost-
effective.
Options A, B & D are wrong as they cannot be configured for notifications from Cloud
Storage.
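For illustration, a hedged sketch of configuring a bucket to publish object-upload notifications to a Pub/Sub topic using the google-cloud-storage client. The bucket and topic names are assumptions; the topic must already exist and the Cloud Storage service agent needs publish rights on it.

```python
# Sketch: publish OBJECT_FINALIZE (new upload) events from a bucket to Pub/Sub.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-upload-bucket")

notification = bucket.notification(
    topic_name="image-uploads",
    event_types=["OBJECT_FINALIZE"],   # fire only when a new object is created
    payload_format="JSON_API_V1",
)
notification.create()
```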
Question 32: Correct
Your company is in a highly regulated industry. One of your requirements is to ensure
external users have access only to the non-PII fields required to do their
jobs. You want to enforce this requirement with Google BigQuery. Which access control
method would you use?
C. Use Authorized view with the same dataset with proper permissions
D. Use Authorized view with the different dataset with proper permissions
(Correct)
Explanation
Correct answer is D as the controlled access can be granted using an authorized view. The
authorized view needs to be in a different dataset than the source.
Refer GCP documentation - BigQuery Authorized Views
Giving a view access to a dataset is also known as creating an authorized view in BigQuery.
An authorized view allows you to share query results with particular users and groups
without giving them access to the underlying tables. You can also use the view's SQL query to
restrict the columns (fields) the users are able to query.
When you create the view, it must be created in a dataset separate from the source data
queried by the view. Because you can assign access controls only at the dataset level, if the
view is created in the same dataset as the source data, your users would have access to both
the view and the data.
Options A, B & C are wrong as they would provide access to the complete datasets with the
source included.
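For illustration, a minimal sketch (using the google-cloud-bigquery client; project, dataset, and column names are assumptions) of creating a view in a separate dataset and authorizing it against the source dataset:

```python
# Sketch: create a view exposing only non-PII columns in a separate dataset,
# then authorize that view against the source dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1. Create the view in a dataset separate from the source data.
view = bigquery.Table("my-project.shared_views.orders_public")
view.view_query = """
SELECT order_id, order_date, amount      -- PII columns deliberately excluded
FROM `my-project.private_data.orders`
"""
view = client.create_table(view)

# 2. Authorize the view on the source dataset.
source_dataset = client.get_dataset("my-project.private_data")
entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])

# 3. Grant the external users READER access on the shared_views dataset only.
```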
Question 33: Correct
Your company is developing a next generation pet collar that collects biometric
information to assist potentially millions of families with promoting healthy lifestyles for
their pets. Each collar will push 30 KB of biometric data in JSON format every 2 seconds
to a collection platform that will process and analyze the data, providing health trending
information back to the pet owners and veterinarians via a web portal. Management
has tasked you with architecting the collection platform, ensuring the following requirements
are met:
1. Provide the ability for real-time analytics of the inbound biometric data
2. Ensure processing of the biometric data is highly durable, elastic and parallel
3. The results of the analytic processing should be persisted for data mining
Which architecture outlined below will meet the initial requirements for the platform?
A. Utilize Cloud Storage to collect the inbound sensor data, analyze data with Dataproc and save
the results to BigQuery.
B. Utilize Cloud Pub/Sub to collect the inbound sensor data, analyze the data with Dataflow and
save the results to BigQuery.
(Correct)
C. Utilize Cloud Pub/Sub to collect the inbound sensor data, analyze the data with Dataflow and
save the results to Cloud SQL.
D. Utilize Cloud Pub/Sub to collect the inbound sensor data, analyze the data with Dataflow and
save the results to Bigtable.
Explanation
Correct answer is B as Cloud Pub/Sub provides elastic and scalable ingestion, Dataflow
provides processing, and BigQuery provides analytics.
Refer GCP documentation - IoT
Google Cloud Pub/Sub provides a globally durable message ingestion service. By creating
topics for streams or channels, you can enable different components of your application to
subscribe to specific streams of data without needing to construct subscriber-specific
channels on each device. Cloud Pub/Sub also natively connects to other Cloud Platform
services, helping you to connect ingestion, data pipelines, and storage systems.
Google Cloud Dataflow provides the open Apache Beam programming model as a managed
service for processing data in multiple ways, including batch operations, extract-transform-
load (ETL) patterns, and continuous, streaming computation. Cloud Dataflow can be
particularly useful for managing the high-volume data processing pipelines required for IoT
scenarios. Cloud Dataflow is also designed to integrate seamlessly with the other Cloud
Platform services you choose for your pipeline.
Google BigQuery provides a fully managed data warehouse with a familiar SQL interface, so
you can store your IoT data alongside any of your other enterprise analytics and logs. The
performance and cost of BigQuery means you might keep your valuable data longer, instead
of deleting it just to save disk space.
Option A is wrong as Cloud Storage is not an ideal ingestion service for real-time, high-
frequency data. Also, Dataproc is a fast, easy-to-use, fully-managed cloud service for running
Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Option C is wrong as Cloud SQL is a relational database and not suited for analytics data
storage.
Option D is wrong as Bigtable is not ideal for long term analytics data storage.
Question 34: Correct
Which of the following statements about the Wide & Deep Learning model are true?
(Choose two)
A. Wide model is used for memorization, while the deep model is used for generalization.
(Correct)
B. Wide model is used for generalization, while the deep model is used for memorization.
C. A good use for the wide and deep model is a recommender system.
(Correct)
D. A good use for the wide and deep model is a small-scale linear regression problem.
Explanation
Correct answers are A & C as Wide learning model is good for memorization and a Deep
learning model is generalization. Both Wide and Deep learning model can help build good
recommendation engine.
Refer Google blog - Wide Deep learning together
The human brain is a sophisticated learning machine, forming rules by memorizing everyday
events (“sparrows can fly” and “pigeons can fly”) and generalizing those learnings to apply
to things we haven't seen before (“animals with wings can fly”). Perhaps more powerfully,
memorization also allows us to further refine our generalized rules with exceptions
(“penguins can't fly”). As we were exploring how to advance machine intelligence, we asked
ourselves the question—can we teach computers to learn like humans do, by combining the
power of memorization and generalization?
It's not an easy question to answer, but by jointly training a wide linear model (for
memorization) alongside a deep neural network (for generalization), one can combine the
strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning.
It's useful for generic large-scale regression and classification problems with sparse inputs
(categorical features with a large number of possible feature values), such as recommender
systems, search, and ranking problems.
Question 35: Correct
A financial organization wishes to develop a global application to store transactions
happening from different parts of the world. The storage system must provide low
latency transaction support and horizontal scaling. Which GCP service is appropriate
for this use case?
A. Bigtable
B. Datastore
C. Cloud Storage
D. Cloud Spanner
(Correct)
Explanation
Correct answer is D as Spanner provides Global scale, low latency and the ability to scale
horizontally.
Refer GCP documentation - Storage Options
Cloud Spanner: Mission-critical, relational database service with transactional consistency,
global scale, and high availability.
Workloads: mission-critical applications, high transactions, scale + consistency requirements.
Use cases: adtech, financial services, global supply chain, retail.
Question 36: Correct
A retailer has a 1 PB historical purchase dataset, which is largely unlabeled. They want
to categorize the customers into different groups as per their spend. Which type of
Machine Learning algorithm is suited to achieve this?
A. Classification
B. Regression
C. Association
D. Clustering
(Correct)
Explanation
Correct answer is D as the data is unlabeled, so the unsupervised learning technique of
clustering can be applied to categorize the data.
Refer GCP documentation - Machine Learning
In unsupervised learning, the goal is to identify meaningful patterns in the data. To
accomplish this, the machine must learn from an unlabeled data set. In other words, the
model has no hints how to categorize each piece of data and must infer its own rules for
doing so.
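For illustration, a minimal clustering sketch (scikit-learn is used here purely to show the idea): grouping customers by spend features with k-means. The feature values and number of clusters are made-up assumptions.

```python
# Minimal illustration of unsupervised clustering: group customers by spend.
import numpy as np
from sklearn.cluster import KMeans

# Rows: [annual_spend, purchase_frequency] per customer (synthetic values).
X = np.array([
    [120.0, 4], [150.0, 5], [3000.0, 40], [2800.0, 35],
    [900.0, 12], [1000.0, 15], [50.0, 1], [70.0, 2],
])

kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
print("Cluster assignment per customer:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_)
```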
A. Nearline
(Correct)
B. Standard
C. Multi-Regional
(Correct)
D. Dual-Regional
E. Regional
Explanation
Correct answers are A & C as Multi-Regional and Nearline storage classes provide multi-
region geo-redundant deployment, which can sustain regional failure.
Refer GCP documentation - Cloud Storage Classes
Multi-Regional Storage is geo-redundant.
The geo-redundancy of Nearline Storage data is determined by the type of location in which
it is stored: Nearline Storage data stored in multi-regional locations is redundant across
multiple regions, providing higher availability than Nearline Storage data stored in regional
locations.
Data that is geo-redundant is stored redundantly in at least two separate geographic places
separated by at least 100 miles. Objects stored in multi-regional locations are geo-
redundant, regardless of their storage class.
Geo-redundancy occurs asynchronously, but all Cloud Storage data is redundant within at
least one geographic place as soon as you upload it.
Geo-redundancy ensures maximum availability of your data, even in the event of large-scale
disruptions, such as natural disasters. For a dual-regional location, geo-redundancy is
achieved using two specific regional locations. For other multi-regional locations, geo-
redundancy is achieved using any combination of data centers within the specified multi-
region, which may include data centers that are not explicitly available as regional
locations.
Options B & D are wrong as those storage classes do not exist.
Option E is wrong as the Regional storage class is not geo-redundant; data is stored in a
narrow geographic region and redundancy is across availability zones.
Question 38: Correct
Your company wants to develop a REST-based application for image analysis. This
application would help detect individual objects and faces within images, and read
printed words contained within images. You need to do a quick Proof of Concept (PoC)
to implement and demo the same. How would you design your application?
A. Create and train a model using TensorFlow and develop a REST-based wrapper over it
B. Use Cloud Image Intelligence API and develop a REST-based wrapper over it
C. Use Cloud Natural Language API and develop a REST-based wrapper over it
D. Use Cloud Vision API and develop a REST-based wrapper over it
(Correct)
Explanation
Correct answer is D as the Cloud Vision API provides pre-built models to identify and detect
objects and faces within images.
Refer GCP documentation - AI Products
Cloud Vision API enables you to derive insight from your images with our powerful
pretrained API models or easily train custom vision models with AutoML Vision Beta. The
API quickly classifies images into thousands of categories (such as “sailboat” or “Eiffel
Tower”), detects individual objects and faces within images, and finds and reads printed
words contained within images. AutoML Vision lets you build and train custom ML models
with minimal ML expertise to meet domain-specific business needs.
Question 39: Correct
Your company is developing an online video hosting platform. Users can upload their
videos, which would be available for all the other users to view and share. As a
compliance requirement, the videos need to undergo content moderation before they are
available to all the users. How would you design your application?
A. Use Cloud Vision API to identify video with inappropriate content and mark it for manual
checks.
B. Use Cloud Natural Language API to identify video with inappropriate content and mark it for
manual checks.
C. Use Cloud Speech-to-Text API to identify video with inappropriate content and mark it for
manual checks.
D. Use Cloud Video Intelligence API to identify video with inappropriate content and mark it for
manual checks.
(Correct)
Explanation
Correct answer is D as Cloud Video Intelligence can be used to perform content moderation.
Refer GCP documentation - Cloud Video Intelligence
Google Cloud Video Intelligence makes videos searchable, and discoverable, by extracting
metadata with an easy to use REST API. You can now search every moment of every video
file in your catalog. It quickly annotates videos stored in Google Cloud Storage, and helps
you identify key entities (nouns) within your video; and when they occur within the video.
Separate signals from noise by retrieving relevant information within the entire video,
shot-by-shot, or per frame.
Identify when inappropriate content is being shown in a given video. You can instantly
conduct content moderation across petabytes of data and more quickly and efficiently filter
your content or user-generated content.
Option A is wrong as Vision is for image analysis.
Option B is wrong as Natural Language is for text analysis
Option C is wrong as Speech-to-Text is for audio to text conversion.
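For illustration, a hedged sketch of requesting explicit-content annotation for a video stored in Cloud Storage with the Video Intelligence client. The GCS URI is a placeholder, and the client surface may vary slightly between library versions.

```python
# Sketch: run explicit content detection on an uploaded video and flag likely
# inappropriate frames for manual review.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()

operation = client.annotate_video(
    request={
        "input_uri": "gs://my-bucket/uploads/user_video.mp4",
        "features": [videointelligence.Feature.EXPLICIT_CONTENT_DETECTION],
    }
)
result = operation.result(timeout=600)

annotation = result.annotation_results[0].explicit_annotation
flagged = [
    frame for frame in annotation.frames
    if frame.pornography_likelihood >= videointelligence.Likelihood.LIKELY
]
print(f"{len(flagged)} frame(s) flagged for manual review")
```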
Question 40: Correct
Your company has a variety of data processing jobs: Dataflow jobs processing real-time
streaming data from Pub/Sub, data pipelines working with on-premises data, and Dataproc
Spark batch jobs running weekly analytics against Cloud Storage. They want a single
interface to manage and monitor the jobs. Which service would help implement a
common monitoring and execution platform?
A. Cloud Scheduler
B. Cloud Composer
(Correct)
C. Cloud Spanner
D. Cloud Pipeline
Explanation
Correct answer is B as Cloud Composer's managed nature allows you to focus on authoring,
scheduling, and monitoring your workflows as opposed to provisioning resources.
Refer GCP documentation - Cloud Composer
Cloud Composer is a fully managed workflow orchestration service that empowers you to
author, schedule, and monitor pipelines that span across clouds and on-premises data
centers. Built on the popular Apache Airflow open source project and operated using the
Python programming language, Cloud Composer is free from lock-in and easy to use.
Cloud Composer's managed nature allows you to focus on authoring, scheduling, and
monitoring your workflows as opposed to provisioning resources.
Option A is wrong as Cloud Scheduler is a fully managed enterprise-grade cron job
scheduler. It is not a multi-cloud orchestration tool.
Option C is wrong as Google Cloud Spanner is a relational database.
Option D is wrong as a Google Cloud Pipeline service does not exist.
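For illustration, a minimal sketch of an Airflow DAG that Cloud Composer could schedule and monitor. Real pipelines would typically use the Dataflow/Dataproc operators; BashOperator placeholders are used here just to keep the example self-contained, and all task IDs and commands are assumptions.

```python
# Sketch: a minimal Airflow DAG for Cloud Composer to schedule and monitor.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="weekly_analytics",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="trigger_dataflow_job",
        bash_command="echo 'launch Dataflow job here'",
    )
    analyze = BashOperator(
        task_id="run_dataproc_spark_job",
        bash_command="echo 'submit Dataproc Spark job here'",
    )
    ingest >> analyze   # Composer tracks both tasks in a single UI
```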
Question 41: Incorrect
Your company hosts its analytical data in a BigQuery dataset for analytics. They need
to provide controlled access to certain tables and columns within the tables to a third
party. How do you design the access with least privilege?
B. Grant fine grained DATA VIEWER access to the tables and columns within the dataset
C. Create Authorized views for tables in a same project and grant access to the teams
(Incorrect)
D. Create Authorized views for tables in a separate project and grant access to the teams
(Correct)
Explanation
Correct answer is D as the controlled access can be provided using Authorized views created
in a separate project.
Refer GCP documentation - BigQuery Authorized View
BigQuery is a petabyte-scale analytics data warehouse that you can use to run SQL queries
over vast amounts of data in near realtime.
Giving a view access to a dataset is also known as creating an authorized view in BigQuery.
An authorized view allows you to share query results with particular users and groups
without giving them access to the underlying tables. You can also use the view's SQL query to
restrict the columns (fields) the users are able to query.
When you create the view, it must be created in a dataset separate from the source data
queried by the view. Because you can assign access controls only at the dataset level, if the
view is created in the same dataset as the source data, your data analysts would have access
to both the view and the data.
Options A & B are wrong as access cannot be controlled at the table level, only at the project
and dataset level.
Option C is wrong as Authorized views should be created in a separate project. If they are
created in the same project, the users would have access to the underlying tables as well.
Question 42: Incorrect
Your company is hosting its analytics data in BigQuery. All the Data analysts have been
provided with the IAM owner role to their respective projects. As a compliance
requirement, all the data access logs needs to be captured for audits. Also, the access to
the logs needs to be limited to the Auditor team only. How can the access be controlled?
A. Export the data access logs using aggregated sink to Cloud Storage in an existing project and
grant VIEWER access to the project to the Auditor team
B. Export the data access logs using project sink to BigQuery in an existing project and grant
VIEWER access to the project to the Auditor team
(Incorrect)
C. Export the data access logs using project sink to Cloud Storage in a separate project and grant
VIEWER access to the project to the Auditor team
D. Export the data access logs using aggregated sink to Cloud Storage in a separate project and
grant VIEWER access to the project to the Auditor team
(Correct)
Explanation
Correct answer is D as the Data Analysts have OWNER roles on the projects, so the logs need
to be exported to a separate project which only the Auditor team has access to. Also, as there
are multiple projects, an aggregated export sink can be used to export data access logs from
all projects.
Refer GCP documentation - BigQuery Auditing and Aggregated Exports
You can create an aggregated export sink that can export log entries from all the projects,
folders, and billing accounts of an organization. As an example, you might use this feature to
export audit log entries from an organization's projects to a central location.
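A minimal sketch of creating such an aggregated sink with the google-cloud-logging Python client is shown below; the organization ID, destination bucket, and log filter are placeholder assumptions, and the exact filter depends on which audit logs you need to export.

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogSink

config_client = ConfigServiceV2Client()

# Placeholder organization ID and destination bucket.
parent = "organizations/123456789012"
sink = LogSink(
    name="bigquery-data-access-audit",
    destination="storage.googleapis.com/audit-logs-archive-bucket",
    # Assumed filter: only BigQuery data-access audit log entries.
    filter='logName:"cloudaudit.googleapis.com%2Fdata_access" '
           'AND protoPayload.serviceName="bigquery.googleapis.com"',
    include_children=True,  # aggregate logs from all child projects and folders
)

created = config_client.create_sink(parent=parent, sink=sink)
print(f"Created sink {created.name}; grant {created.writer_identity} "
      "write access on the destination bucket.")
```

The destination bucket would live in the separate project that only the auditor team can view.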
Question 43
A. Dataflow
B. Dataproc
D. Dataprep
(Correct)
Explanation
Correct answer is D as Cloud Dataprep provides the ability to detect, clean, and transform
data through a graphical interface without any programming knowledge.
Refer GCP documentation - Dataprep
Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning,
and preparing structured and unstructured data for analysis. Cloud Dataprep is serverless
and works at any scale. There is no infrastructure to deploy or manage. Easy data
preparation with clicks and no code.
Cloud Dataprep automatically detects schemas, datatypes, possible joins, and anomalies
such as missing values, outliers, and duplicates so you get to skip the time-consuming work
of profiling your data and go right to the data analysis.
Cloud Dataprep automatically identifies data anomalies and helps you to take corrective
action fast. Get data transformation suggestions based on your usage pattern. Standardize,
structure, and join datasets easily with a guided approach.
Options A, B & C are wrong as they all need programming knowledge.
Question 44: Correct
Your company is migrating to Google Cloud and is looking for an HBase alternative. The
current solution uses a lot of custom code based on the observer coprocessor. Which
alternative is the best fit for the migration while still using managed services where
possible?
A. Dataflow
B. HBase on Dataproc
(Correct)
C. Bigtable
D. BigQuery
Explanation
Correct answer is B as Bigtable is the managed HBase alternative on Google Cloud; however, it
does not support coprocessors. So the best solution is to run HBase on Dataproc, where it can
be installed using initialization actions.
Refer GCP documentation - Bigtable HBase differences
Coprocessors are not supported. You cannot create classes that implement the
interface org.apache.hadoop.hbase.coprocessor.
Options A & D are wrong as Dataflow and BigQuery are not HBase alternatives.
Option C is wrong as Bigtable does not support Coprocessors.
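For reference, installing HBase on a Dataproc cluster is typically done through an initialization action supplied at cluster creation; the sketch below uses the google-cloud-dataproc Python client, and the project, region, cluster sizing, and the GCS path of the HBase script are placeholder assumptions (check the public dataproc initialization-actions repository for the actual script location).

```python
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"   # placeholder
REGION = "us-central1"      # placeholder

cluster_client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "hbase-migration-cluster",
    "config": {
        "master_config": {"num_instances": 1},
        "worker_config": {"num_instances": 2},
        # Initialization action that installs HBase on cluster startup;
        # the GCS path is an assumption - verify it before use.
        "initialization_actions": [
            {"executable_file": f"gs://goog-dataproc-initialization-actions-{REGION}/hbase/hbase.sh"}
        ],
    },
}

operation = cluster_client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
result = operation.result()
print(f"Cluster created: {result.cluster_name}")
```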
Question 45: Correct
You have multiple data analysts who work with a dataset hosted in BigQuery within the same
project. As a BigQuery administrator, you are required to grant the data analysts only the
privilege to create jobs/queries and the ability to cancel their self-submitted jobs. Which
role should you assign to the users?
A. User
B. Jobuser
(Correct)
C. Owner
D. Viewer
Explanation
Correct answer is B as the JobUser role grants users permission to run jobs, and to cancel
their own jobs, within the project.
Refer GCP documentation - BigQuery Access Control
roles/bigquery.jobUser
Permissions to run jobs, including queries, within the project. The jobUser role can get
information about their own jobs and cancel their own jobs.
Rationale: This role allows the separation of data access from the ability to run work in the
project, which is useful when team members query data from multiple projects. This role does
not allow access to any BigQuery data. If data access is required, grant dataset-level access
controls.
Resource Types: Organization, Project
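Purely as an illustration, granting this role at the project level could look roughly like the sketch below using the Resource Manager Python client; the project ID and group email are placeholders, and in practice the same binding is usually added through the console or gcloud.

```python
from google.cloud import resourcemanager_v3
from google.iam.v1 import policy_pb2

projects_client = resourcemanager_v3.ProjectsClient()
resource = "projects/my-analytics-project"  # placeholder project ID

# Read-modify-write of the project IAM policy.
policy = projects_client.get_iam_policy(request={"resource": resource})
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/bigquery.jobUser",
        members=["group:data-analysts@example.com"],  # placeholder group
    )
)
projects_client.set_iam_policy(request={"resource": resource, "policy": policy})
```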
A. Train Model using Tensorflow to identify PII and filter the information
B. Store the data in BigQuery and create an Authorized view for the users
C. Use Data Loss Prevention APIs to identify the PII information and filter the information
(Correct)
D. Use Cloud Natural Language API to identify PII and filter the information
Explanation
Correct answer is C as the Data Loss Prevention (DLP) API can be used to quickly identify and
redact the sensitive information.
Refer GCP documentation - Cloud DLP
Cloud DLP helps you better understand and manage sensitive data. It provides fast, scalable
classification and redaction for sensitive data elements like credit card numbers, names,
social security numbers, US and selected international identifier numbers, phone numbers
and GCP credentials. Cloud DLP classifies this data using more than 90 predefined
detectors to identify patterns, formats, and checksums, and even understands contextual
clues. You can optionally redact data as well using techniques like masking, secure hashing,
bucketing, and format-preserving encryption.
Option A is wrong as building and training a model is not a quick and easy solution.
Option B is wrong as the data would still be stored in the base tables and accessible.
Option D is wrong as the Cloud Natural Language API is for text analysis and does not handle
sensitive information redaction.
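As a small illustration of the DLP approach, the sketch below uses the google-cloud-dlp Python client to inspect a string and replace any detected PII with its info-type name; the project ID, info types, and sample text are made up for the example.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # placeholder project ID

# Which kinds of PII to look for.
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
}

# Replace every finding with its info-type name, e.g. [EMAIL_ADDRESS].
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

item = {"value": "Contact jane.doe@example.com or 555-0100 with any questions."}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)
# e.g. "Contact [EMAIL_ADDRESS] or [PHONE_NUMBER] with any questions."
```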
Question 48: Correct
You are designing a relational data repository on Google Cloud to grow as needed. The
data will be transactionally consistent and added from any location in the world. You
want to monitor and adjust node count for input traffic, which can spike unpredictably.
What should you do?
A. Use Cloud Spanner for storage. Monitor storage usage and increase node count if more than
70% utilized.
B. Use Cloud Spanner for storage. Monitor CPU utilization and increase node count if more than
70% utilized for your time span.
(Correct)
C. Use Cloud Bigtable for storage. Monitor data stored and increase node count if more than 70%
utilized.
D. Use Cloud Bigtable for storage. Monitor CPU utilization and increase node count if more than
70% utilized for your time span.
Explanation
Correct answer is B as the requirement is a relational data service with transactional
consistency and globally scalable transactions, which makes Cloud Spanner an ideal choice.
CPU utilization is the recommended metric for scaling, per Google best practices, linked
below.
Refer GCP documentation -
Storage Options @ https://cloud.google.com/storage-options/ & Spanner Monitoring @
https://cloud.google.com/spanner/docs/monitoring
Option A is wrong as storage utilization is not a correct scaling metric for load.
Options C & D are wrong as Bigtable is regional and is not a relational data service.
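A rough sketch of the scaling step with the google-cloud-spanner Python client is shown below; the project and instance IDs, the threshold, and the monitoring check itself are placeholders (in practice you would read the spanner.googleapis.com/instance/cpu/utilization metric from Cloud Monitoring, or use an autoscaler).

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")          # placeholder project
instance = client.instance("global-orders-instance")   # placeholder instance ID
instance.reload()

# Placeholder: in practice, read the CPU utilization metric
# (spanner.googleapis.com/instance/cpu/utilization) from Cloud Monitoring.
cpu_utilization = 0.82
if cpu_utilization > 0.70:
    instance.node_count += 1
    operation = instance.update()
    operation.result(timeout=300)  # block until the node is added
    print(f"Scaled instance to {instance.node_count} nodes")
```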
Question 49: Correct
You are working on a project with two compliance requirements. The first requirement
states that your developers should be able to see the Google Cloud Platform billing
charges for only their own projects. The second requirement states that your finance
team members can set budgets and view the current charges for all projects in the
organization. The finance team should not be able to view the project contents. You
want to set permissions. What should you do?
A. Add the finance team members to the default IAM Owner role. Add the developers to a
custom role that allows them to see their own spend only.
B. Add the finance team members to the Billing Administrator role for each of the billing
accounts that they need to manage. Add the developers to the Viewer role for the Project.
(Correct)
C. Add the developers and finance managers to the Viewer role for the Project.
D. Add the finance team to the Viewer role for the Project. Add the developers to the Security
Reviewer role for each of the billing accounts.
Explanation
Correct answer is B as there are two requirements: the finance team must be able to set
budgets on projects without viewing project contents, and the developers must only be able to
view the billing charges of their own projects. The finance team with the Billing
Administrator role can set budgets, and the developers with the Viewer role can view billing
charges, aligning with the principle of least privilege.
Refer GCP documentation - IAM Billing @ https://cloud.google.com/iam/docs/job-
functions/billing
Option A is wrong as GCP recommends using pre-defined roles instead of using primitive
roles and custom roles.
Option C is wrong as the Viewer role would not give the finance team the ability to set
budgets.
Option D is wrong as the Viewer role would not give the finance team the ability to set
budgets. Also, the Security Reviewer role only enables viewing custom roles, not administering
them, and it is not what the developers need.
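Purely for illustration, the finance team's binding on the billing account could be added roughly as sketched below with the google-cloud-billing Python client (the billing account ID and group email are placeholders); the developers' Viewer role is an ordinary project-level IAM binding.

```python
from google.cloud import billing_v1
from google.iam.v1 import policy_pb2

billing_client = billing_v1.CloudBillingClient()
billing_account = "billingAccounts/012345-6789AB-CDEF01"  # placeholder ID

# Read-modify-write of the billing account IAM policy.
policy = billing_client.get_iam_policy(request={"resource": billing_account})
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/billing.admin",
        members=["group:finance-team@example.com"],  # placeholder group
    )
)
billing_client.set_iam_policy(
    request={"resource": billing_account, "policy": policy}
)
```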
Question 50: Correct
Your customer wants to capture multiple GBs of aggregate real-time key performance
indicators (KPIs) from their game servers running on Google Cloud Platform and
monitor the KPIs with low latency. How should they capture the KPIs?
A. Output custom metrics to Stackdriver from the game servers, and create a Dashboard in
Stackdriver Monitoring Console to view them.
B. Schedule BigQuery load jobs to ingest analytics files uploaded to Cloud Storage every ten
minutes, and visualize the results in Google Data Studio.
C. Store time-series data from the game servers in Google Bigtable, and view it using Google
Data Studio.
(Correct)
D. Insert the KPIs into Cloud Datastore entities, and run ad hoc analysis and visualizations of
them in Cloud Datalab.
Explanation
Correct answer is C as Bigtable is an ideal solution for storing time-series data, with the
ability to serve real-time reads and analytics at very low latency. The data can then be
visualized using Google Data Studio.
Refer GCP documentation - Data lifecycle @ https://cloud.google.com/solutions/data-
lifecycle-cloud-platform
Cloud Bigtable is a managed, high-performance NoSQL database service designed for
terabyte- to petabyte-scale workloads. Cloud Bigtable is built on Google’s internal Cloud
Bigtable database infrastructure that powers Google Search, Google Analytics, Google
Maps, and Gmail. The service provides consistent, low-latency, and high-throughput storage
for large-scale NoSQL data. Cloud Bigtable is built for real-time app serving workloads, as
well as large-scale analytical workloads.
Cloud Bigtable schemas use a single-indexed row key associated with a series of columns;
schemas are usually structured either as tall or wide and queries are based on row key. The
style of schema is dependent on the downstream use cases and it’s important to consider data
locality and distribution of reads and writes to maximize performance. Tall schemas are
often used for storing time-series events, data that is keyed in some portion by a timestamp,
with relatively fewer columns per row. Wide schemas follow the opposite approach: a
simplistic identifier as the row key along with a large number of columns.
Option A is wrong as Stackdriver is not an ideal solution for time series data and it does not
provide analytics capability.
Option B is wrong as BigQuery does not provide low latency access and with jobs scheduled
at every 10 minutes does not meet the real time criteria.
Option D is wrong as Datastore does not provide analytics capability.
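To make the tall-schema idea concrete, here is a small write sketch using the google-cloud-bigtable Python client; the project, instance, table, and column-family names and the row-key scheme (metric, server ID, reversed timestamp) are assumptions for the example, not a prescribed design.

```python
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-game-project")   # placeholder project
instance = client.instance("kpi-instance")            # placeholder instance
table = instance.table("server_kpis")                 # placeholder table

# Tall schema: one row per KPI sample, keyed so recent samples for a server
# sort together (reversed timestamp keeps newest-first ordering).
server_id = "game-server-042"
reversed_ts = 2**63 - int(time.time() * 1000)
row_key = f"kpi#{server_id}#{reversed_ts}".encode()

row = table.direct_row(row_key)
row.set_cell("stats", "concurrent_players", b"1875")
row.set_cell("stats", "avg_latency_ms", b"43")
row.commit()
```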