GCP Certification
5. You need to create a near real-time inventory dashboard that reads the
   main inventory tables in your BigQuery data warehouse. Historical
   inventory data is stored as inventory balances by item and location. You
   have several thousand updates to inventory every hour. You want to
   maximize performance of the dashboard and ensure that the data is
   accurate. What should you do?
   Streams inventory changes in near real-time: BigQuery streaming ingests
   data immediately, keeping the inventory movement table constantly
   updated.
   Daily balance calculation: Joining the movement table with the historical
   balance table provides an accurate view of current inventory levels
   without affecting the actual balance table.
   Nightly update for historical data: Updating the main inventory balance
   table nightly ensures long-term data consistency while maintaining near
   real-time insights through the view.
   This approach balances near real-time updates with efficiency and data
   accuracy, making it the optimal solution for the given scenario.
6. You have data stored in BigQuery. The data in the BigQuery dataset must
   be highly available. You need to define a storage, backup, and recovery
   strategy for this data that minimizes cost. How should you configure the
   BigQuery table to have a recovery point objective (RPO) of 30 days?
   We have an external dependency, "after the load job with variable
   execution time completes", which requires a DAG -> Airflow (Cloud
   Composer).
   The reasons:
   A scheduler like Cloud Scheduler won't handle the dependency on the
   BigQuery load completion time
   Using Composer allows creating a DAG workflow that can:
   Trigger the BigQuery load
   Wait for BigQuery load to complete
   Trigger the Dataprep Dataflow job
   Dataflow template allows easy reuse of the Dataprep transformation logic
   Composer coordinates everything based on the dependencies in an
   automated workflow
9. You are managing a Cloud Dataproc cluster. You need to make a job run
   faster while minimizing costs, without losing work in progress on your
   clusters. What should you do?
10.You work for a shipping company that uses handheld scanners to read
   shipping labels. Your company has strict data privacy standards that
   require scanners to only transmit tracking numbers when events are sent
   to Kafka topics. A recent software update caused the scanners to
   accidentally transmit recipients' personally identifiable information (PII) to
   analytics systems, which violates user privacy rules. You want to quickly
   build a scalable solution using cloud-native managed services to prevent
   exposure of PII to the analytics systems. What should you do?
   Quick to implement: Using managed services reduces development time
   and effort compared to building solutions from scratch.
   Scalability: Cloud Functions and the Cloud DLP API are designed to handle
   large volumes of data and can scale easily.
   Accuracy: The Cloud DLP API has advanced PII detection capabilities.
   Flexibility: You can customize the processing logic in the Cloud Function to
   meet your specific needs.
   Security: Sensitive data is handled securely within a controlled cloud
   environment.
11.You have developed three data processing jobs. One executes a Cloud
   Dataflow pipeline that transforms data uploaded to Cloud Storage and
   writes results to BigQuery. The second ingests data from on-premises
   servers and uploads it to Cloud Storage. The third is a Cloud Dataflow
   pipeline that gets information from third-party data providers and uploads
   the information to Cloud Storage. You need to be able to schedule and
   monitor the execution of these three workflows and manually execute
   them when needed. What should you do?
12.You have Cloud Functions written in Node.js that pull messages from Cloud
   Pub/Sub and send the data to BigQuery. You observe that the message
   processing rate on the Pub/Sub topic is orders of magnitude higher than
   anticipated, but there is no error logged in Cloud Logging. What are the
   two most likely causes of this problem? (Choose two.)
   By not acknowledging the pulled messages, the subscriber causes them to
   be put back into Cloud Pub/Sub, meaning the messages accumulate
   instead of being consumed and removed from Pub/Sub. The same thing
   can happen if the subscriber maintains the lease on the messages it
   receives in case of an error. This reduces the overall rate of processing
   because messages get stuck on the first subscriber. Also, errors in the
   Cloud Function do not show up in Cloud Logging if the function swallows
   exceptions instead of logging or re-throwing them.
14.You are creating a new pipeline in Google Cloud to stream IoT data from
   Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the
   data, you notice that roughly 2% of the data appears to be corrupt. You
   need to modify the Cloud Dataflow pipeline to filter out this corrupt data.
   What should you do?
15.You have historical data covering the last three years in BigQuery and a
   data pipeline that delivers new data to BigQuery daily. You have noticed
   that when the Data Science team runs a query filtered on a date column
   and limited to 30-90 days of data, the query scans the entire table. You
   also noticed that your bill is increasing more quickly than you expected.
   You want to resolve the issue as cost-effectively as possible while
   maintaining the ability to conduct SQL queries. What should you do?
   A partitioned table is a special table that is divided into segments, called
   partitions, that make it easier to manage and query your data. By dividing
   a large table into smaller partitions, you can improve query performance,
   and you can control costs by reducing the number of bytes read by a
   query.
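   As an illustration, here is a minimal sketch (using the google-cloud-bigquery
   Python client; the project, dataset, and column names are assumptions) of
   creating a table partitioned on a date column so that date-filtered queries
   scan only the matching partitions:

      from google.cloud import bigquery

      client = bigquery.Client(project="my-project")  # hypothetical project ID

      schema = [
          bigquery.SchemaField("item_id", "STRING"),
          bigquery.SchemaField("location_id", "STRING"),
          bigquery.SchemaField("quantity", "INT64"),
          bigquery.SchemaField("event_date", "DATE"),
      ]

      table = bigquery.Table("my-project.inventory.balances", schema=schema)
      # Partition on the date column that the Data Science team filters on.
      table.time_partitioning = bigquery.TimePartitioning(
          type_=bigquery.TimePartitioningType.DAY,
          field="event_date",
      )
      client.create_table(table)

   A query with a WHERE filter on event_date then reads only the partitions in
   that date range rather than the whole table.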
16.You operate a logistics company, and you want to improve event delivery
   reliability for vehicle-based sensors. You operate small data centers around
   the world to capture these events, but leased lines that provide
   connectivity from your event collection infrastructure to your event
   processing infrastructure are unreliable, with unpredictable latency. You
   want to address this issue in the most cost-effective way. What should you
   do?
   Have the data acquisition devices publish data to Cloud Pub/Sub. This
   provides a reliable messaging service for your event data, allowing you to
   ingest and process your data in a timely manner, regardless of the
   reliability of the leased lines. Cloud Pub/Sub also offers automatic retries
   and fault tolerance, which further improve the reliability of your event
   delivery. Additionally, using Cloud Pub/Sub allows you to easily scale your
   event processing infrastructure up or down as needed, which helps to
   minimize costs.
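   For illustration, a minimal publisher sketch (Python Pub/Sub client; the
   project and topic names are placeholders) showing how a collection site
   could hand events to Cloud Pub/Sub rather than pushing them over the
   leased lines directly:

      from google.cloud import pubsub_v1

      publisher = pubsub_v1.PublisherClient()
      topic_path = publisher.topic_path("my-project", "vehicle-events")

      def publish_event(payload: bytes, vehicle_id: str) -> str:
          # publish() retries transient failures and returns a future that
          # resolves to the server-assigned message ID once the event is durable.
          future = publisher.publish(topic_path, payload, vehicle_id=vehicle_id)
          return future.result()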
17.You are a retailer that wants to integrate your online sales capabilities with
   different in-home assistants, such as Google Home. You need to interpret
   customer voice commands and issue an order to the backend systems.
   Which solutions should you choose?
18.Your company has a hybrid cloud initiative. You have a complex data
   pipeline that moves data between cloud provider services and leverages
   services from each of the cloud providers. Which cloud-native service
   should you use to orchestrate the entire pipeline?
   Cloud Composer is considered suitable across multiple cloud providers, as
   it is built on Apache Airflow, which allows for workflow orchestration
   across different cloud environments and even on-premises data centers,
   making it a good choice for multi-cloud strategies; however, its tightest
   integration is with Google Cloud Platform services.
19.You use a dataset in BigQuery for analysis. You want to provide third-party
   companies with access to the same dataset. You need to keep the costs of
   data sharing low and ensure that the data is current. Which solution
   should you choose?
   Shared datasets are collections of tables and views in BigQuery defined by
   a data publisher and make up the unit of cross-project / cross-
   organizational sharing. Data subscribers get an opaque, read-only, linked
   dataset inside their project and VPC perimeter that they can combine with
   their own datasets and connect to solutions from Google Cloud or our
   partners. For example, a retailer might create a single exchange to share
   demand forecasts with the thousands of vendors in their supply chain,
   having joined historical sales data with weather, web clickstream, and
   Google Trends data in their own BigQuery project, then sharing real-time
   outputs via Analytics Hub. The publisher can add metadata, track
   subscribers, and see aggregated usage metrics.
   Delta tables contain all change events for a particular table since the
   initial load. Having all change events available can be valuable for
   identifying trends, the state of the entities that a table represents at a
   particular moment, or change frequency.
   The best way to merge data frequently and consistently is to use a MERGE
   statement, which lets you combine multiple INSERT, UPDATE, and DELETE
   statements into a single atomic operation.
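   As a sketch, such a MERGE could be run through the BigQuery Python client
   (the table and column names here are illustrative, not from the question):

      from google.cloud import bigquery

      client = bigquery.Client()

      merge_sql = """
      MERGE `my-project.inventory.balance` AS target
      USING `my-project.inventory.movement_delta` AS source
      ON target.item_id = source.item_id
         AND target.location_id = source.location_id
      WHEN MATCHED THEN
        UPDATE SET target.quantity = target.quantity + source.quantity_change
      WHEN NOT MATCHED THEN
        INSERT (item_id, location_id, quantity)
        VALUES (source.item_id, source.location_id, source.quantity_change)
      """

      # The whole statement is applied as a single atomic operation.
      client.query(merge_sql).result()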
21.You are designing a data processing pipeline. The pipeline must be able to
   scale automatically as load increases. Messages must be processed at
   least once and must be ordered within windows of 1 hour. How should you
   design the solution?
   The combination of Cloud Pub/Sub for scalable ingestion and Cloud
   Dataflow for scalable stream processing with windowing capabilities makes
   option D the most appropriate solution for the given requirements. It
   minimizes management overhead, ensures scalability, and provides the
   necessary features for at-least-once processing and ordered processing
   within time windows.
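   A minimal Beam sketch of that shape (Python SDK; the subscription name and
   the device_id/ts fields are assumptions), reading from Pub/Sub, applying
   1-hour windows, and ordering elements within each window:

      import json
      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(streaming=True)

      with beam.Pipeline(options=options) as pipeline:
          (pipeline
           | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                 subscription="projects/my-project/subscriptions/events-sub")
           | "Parse" >> beam.Map(json.loads)
           | "WindowInto1Hour" >> beam.WindowInto(beam.window.FixedWindows(60 * 60))
           | "KeyByDevice" >> beam.Map(lambda event: (event["device_id"], event))
           | "GroupPerWindow" >> beam.GroupByKey()
           # Each (device_id, events) group contains all events for one 1-hour
           # window; sort by event timestamp to order within the window.
           | "OrderWithinWindow" >> beam.MapTuple(
                 lambda key, events: (key, sorted(events, key=lambda e: e["ts"]))))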
22.You need to set access to BigQuery for different departments within your
   company. Your solution should comply with the following requirements:
   ✑ Each department should have access only to their data.
   ✑ Each department will have one or more leads who need to be able to
   create and update tables and provide them to their team.
   ✑ Each department has data analysts who need to be able to query but
   not modify data.
   How should you set access to the data in BigQuery?
   Option B provides the most secure and appropriate solution by leveraging
   dataset-level access control. It adheres to the principle of least privilege,
   granting leads the specific permissions they need to manage their
   department's data (via WRITER) while allowing analysts to perform their
   tasks without the risk of accidental or malicious modifications (via
   READER). The dataset acts as a natural container for data isolation,
   fulfilling all the requirements outlined in the scenario.
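   A sketch of granting those dataset-level roles with the BigQuery Python
   client (the dataset and group addresses are hypothetical):

      from google.cloud import bigquery

      client = bigquery.Client()
      dataset = client.get_dataset("my-project.sales_dept")

      entries = list(dataset.access_entries)
      # Leads can create and update tables in their department's dataset.
      entries.append(
          bigquery.AccessEntry("WRITER", "groupByEmail", "sales-leads@example.com"))
      # Analysts can query but not modify data.
      entries.append(
          bigquery.AccessEntry("READER", "groupByEmail", "sales-analysts@example.com"))

      dataset.access_entries = entries
      client.update_dataset(dataset, ["access_entries"])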
23.You operate a database that stores stock trades and an application that
   retrieves average stock price for a given company over an adjustable
   window of time. The data is stored in Cloud Bigtable where the datetime of
   the stock trade is the beginning of the row key. Your application has
   thousands of concurrent users, and you notice that performance is starting
   to degrade as more stocks are added. What should you do to improve the
   performance of your application?
27.You decided to use Cloud Datastore to ingest vehicle telemetry data in real
   time. You want to build a storage system that will account for the long-
   term data growth, while keeping the costs low. You also want to create
   snapshots of the data periodically, so that you can make a point-in-time
   (PIT) recovery, or clone a copy of the data for Cloud Datastore in a
   different environment. You want to archive these snapshots for a long
   time. Which two methods can accomplish this? (Choose two.)
28.You need to create a data pipeline that copies time-series transaction data
   so that it can be queried from within BigQuery by your data science team
   for analysis. Every hour, thousands of transactions are updated with a new
   status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per
   day. The data is heavily structured, and your data science team will build
   machine learning models based on this data. You want to maximize
   performance and usability for your data science team. Which two
   strategies should you adopt? (Choose two.)
    Use nested and repeated fields to denormalize data storage and increase
   query performance.
   Denormalization is a common strategy for increasing read performance for
   relational datasets that were previously normalized. The recommended
   way to denormalize data in BigQuery is to use nested and repeated fields.
   It's best to use this strategy when the relationships are hierarchical and
   frequently queried together, such as in parent-child relationships.
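   For instance, a hedged sketch of a denormalized transactions table that
   nests line items as a REPEATED RECORD (field names are illustrative):

      from google.cloud import bigquery

      client = bigquery.Client()

      schema = [
          bigquery.SchemaField("transaction_id", "STRING"),
          bigquery.SchemaField("status", "STRING"),
          bigquery.SchemaField("updated_at", "TIMESTAMP"),
          # Child rows are nested inside the parent row instead of being
          # joined from a separate table at query time.
          bigquery.SchemaField(
              "line_items", "RECORD", mode="REPEATED",
              fields=[
                  bigquery.SchemaField("sku", "STRING"),
                  bigquery.SchemaField("quantity", "INT64"),
                  bigquery.SchemaField("price", "NUMERIC"),
              ],
          ),
      ]

      client.create_table(bigquery.Table("my-project.sales.transactions", schema=schema))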
30.You have a petabyte of analytics data and need to design a storage and
   processing platform for it. You must be able to perform data warehouse-
   style analytics on the data in Google Cloud and expose the dataset as files
   for batch analysis tools in other cloud providers. What should you do?
   The question emphasizes the need for a quick solution with low cost. While
   GPUs and TPUs offer greater potential performance, they require
   significant development effort (writing kernels) before they can be utilized.
   Sticking with CPUs and scaling the cluster is the fastest and most cost-
   effective way to improve training time immediately, given the reliance on
   custom C++ ops without existing GPU/TPU kernel support.
   The large discrepancy in RMSE, with the training error being higher, points
   directly to underfitting. Increasing the model's complexity by adding layers
   or expanding the input representation is the most appropriate strategy to
   address this issue and improve the model's performance.
37.As your organization expands its usage of GCP, many teams have started
   to create their own projects. Projects are further multiplied to
   accommodate different stages of deployments and target audiences. Each
   project requires unique access control configurations. The central IT team
   needs to have access to all projects. Furthermore, data from Cloud Storage
   buckets and BigQuery datasets must be shared for use in other projects in
   an ad hoc way. You want to simplify access control management by
   minimizing the number of policies. Which two steps should you take?
   (Choose two.)
39.A data scientist has created a BigQuery ML model and asks you to create
   an ML pipeline to serve predictions. You have a REST API application with
   the requirement to serve predictions for an individual user ID with latency
   under 100 milliseconds. You use the following query to generate
   predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL
   'dataset.model', table user_features). How should you create the ML
   pipeline?
   The key requirements are serving predictions for individual user IDs with
   low (sub-100ms) latency.
   Option D meets this by batch predicting for all users in BigQuery ML,
   writing predictions to Bigtable for fast reads, and allowing the application
   access to query Bigtable directly for low latency reads.
   Since the application needs to serve low-latency predictions for individual
   user IDs, using Dataflow to batch predict for all users and write to Bigtable
   allows low-latency reads. Granting the Bigtable Reader role allows the
   application to retrieve predictions for a specific user ID from Bigtable.
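   A sketch of the low-latency lookup path (Python Bigtable client; the
   instance, table, column family, and qualifier names are assumptions):

      from google.cloud import bigtable

      client = bigtable.Client(project="my-project")
      table = client.instance("predictions-instance").table("user_predictions")

      def get_prediction(user_id: str):
          # The row key is the user ID, so a single-row read returns in
          # milliseconds, well within the 100 ms latency budget.
          row = table.read_row(user_id.encode("utf-8"))
          if row is None:
              return None
          cell = row.cells["predictions"][b"predicted_label"][0]
          return cell.value.decode("utf-8")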
41.You are building a new application that you need to collect data from in a
   scalable way. Data arrives continuously from the application throughout
   the day, and you expect to generate approximately 150 GB of JSON data
   per day by the end of the year. Your requirements are:
   ✑ Decoupling producer from consumer
   ✑ Space and cost-efficient storage of the raw ingested data, which is to be
   stored indefinitely
   ✑ Near real-time SQL query
   ✑ Maintain at least 2 years of historical data, which will be queried with
   SQL
   Which pipeline should you use to meet these requirements?
   The most effective way to address the performance issue in this Dataflow
   pipeline is to increase the processing capacity by either adding more
   workers (horizontal scaling) or using more powerful workers (vertical
   scaling). Both options A and B directly address the identified CPU
   bottleneck and are the most appropriate solutions.
43.You have a data pipeline with a Dataflow job that aggregates and writes
   time series metrics to Bigtable. You notice that data is slow to update in
   Bigtable. This data feeds a dashboard used by thousands of users across
   the organization. You need to support additional concurrent users and
   reduce the amount of time required to write the data. Which two actions
   should you take? (Choose two.)
   To improve the data pipeline's performance and address the slow updates
   in Bigtable, the most effective solutions are to increase the processing
   power of the Dataflow job (by adding workers) and increase the capacity
   of the Bigtable cluster (by adding nodes). Both options B and C directly
   target the potential bottlenecks and are the most appropriate actions to
   take.
44.You have several Spark jobs that run on a Cloud Dataproc cluster on a
   schedule. Some of the jobs run in sequence, and some of the jobs run
   concurrently. You need to automate this process. What should you do?
   For orchestrating Spark jobs on Dataproc with specific sequencing and
   concurrency requirements, Cloud Composer with Airflow DAGs provides
   the most flexible, scalable, and manageable solution. It allows you to
   define dependencies, schedule execution, and monitor the entire workflow
   in a centralized and reliable manner.
45.You are building a new data pipeline to share data between two different
   types of applications: job generators and job runners. Your solution must
   scale to accommodate increases in usage and must accommodate the
   addition of new applications without negatively affecting the performance
   of existing ones. What should you do?
   Cloud Pub/Sub is the best solution for this data pipeline because it
   provides the necessary decoupling, scalability, and extensibility to meet
   the requirements. It enables independent scaling of job generators and
   runners, simplifies the addition of new applications, and ensures reliable
   message delivery.
47.You need to create a new transaction table in Cloud Spanner that stores
   product sales data. You are deciding what to use as a primary key. From a
   performance perspective, which strategy should you choose?
48.Data Analysts in your company have the Cloud IAM Owner role assigned to
   them in their projects to allow them to work with multiple GCP products in
   their projects. Your organization requires that all BigQuery data access
   logs be retained for 6 months. You need to ensure that only audit
   personnel in your company can access the data access logs for all
   projects. What should you do?
52.You receive data files in CSV format monthly from a third party. You need
   to cleanse this data, but every third month the schema of the files
   changes. Your requirements for implementing these transformations
   include:
   ✑ Executing the transformations on a schedule
   ✑ Enabling non-developer analysts to modify transformations
   ✑ Providing a graphical tool for designing transformations
   What should you do?
   Dataprep by Trifacta is an intelligent data service for visually exploring,
   cleaning, and preparing structured and unstructured data for analysis,
   reporting, and machine learning. Because Dataprep is serverless and
   works at any scale, there is no infrastructure to deploy or manage. Your
   next ideal data transformation is suggested and predicted with each UI
   input, so you don’t have to write code.
   The most efficient ways to start using Hive in Cloud Dataproc with ORC
   files already in Cloud Storage are:
1. Copy to HDFS using gsutil and Hadoop tools for maximum
   performance.
2. Use the Cloud Storage connector for initial access, then replicate
   key data to HDFS for optimized performance.
   Both options allow you to leverage the benefits of having data in the local
   HDFS for improved Hive query performance. Option D provides more
   flexibility by allowing you to choose what data to replicate based on your
   needs.
55.You work for a shipping company that has distribution centers where
   packages move on delivery lines to route them properly. The company
   wants to add cameras to the delivery lines to detect and track any visual
   damage to the packages in transit. You need to create a way to automate
   the detection of damaged packages and flag them for human review in
   real time while the packages are in transit. Which solution should you
   choose?
   For this scenario, where you need to automate the detection of damaged
   packages in real time while they are in transit, the most suitable solution
   among the provided options would be B.
   Here's why this option is the most appropriate:
   Real-Time Analysis: AutoML provides the capability to train a custom
   model specifically tailored to recognize patterns of damage in packages.
   This model can process images in real-time, which is essential in your
   scenario.
   Integration with Existing Systems: By building an API around the AutoML
   model, you can seamlessly integrate this solution with your existing
   package tracking applications. This ensures that the system can flag
   damaged packages for human review efficiently.
   Customization and Accuracy: Since the model is trained on your specific
   corpus of images, it can be more accurate in detecting damages relevant
   to your use case compared to pre-trained models.
56.You are migrating your data warehouse to BigQuery. You have migrated all
   of your data into tables in a dataset. Multiple users from your organization
   will be using the data. They should only see certain tables based on their
   team membership. How should you set user permissions?
   The simplest and most effective way to control user access to specific
   tables in BigQuery is to assign the bigquery.dataViewer role (or a custom
   role) at the table level. This provides the necessary granular control, is
   easy to manage, and scales well.
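   A sketch of granting bigquery.dataViewer at the table level with the Python
   client (the table and group are placeholders):

      from google.cloud import bigquery

      client = bigquery.Client()
      table_ref = bigquery.TableReference.from_string(
          "my-project.warehouse.finance_orders")

      policy = client.get_iam_policy(table_ref)
      # Members of this team can read only this table, not the whole dataset.
      policy.bindings.append({
          "role": "roles/bigquery.dataViewer",
          "members": {"group:finance-team@example.com"},
      })
      client.set_iam_policy(table_ref, policy)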
57.You need to store and analyze social media postings in Google BigQuery at
   a rate of 10,000 messages per minute in near real-time. Initially, design
   the application to use streaming inserts for individual postings. Your
   application also performs data aggregations right after the streaming
   inserts. You discover that the queries after streaming inserts do not exhibit
   strong consistency, and reports from the queries might miss in-flight data.
   How can you adjust your application design?
   Option D provides the most practical and efficient way to address the
   consistency issues with BigQuery streaming inserts while maintaining near
   real-time data availability. It leverages the benefits of streaming inserts for
   high-volume data ingestion and balances data freshness with accuracy by
   waiting for a period based on estimated latency.
58.You want to build a managed Hadoop system as your data lake. The data
   transformation process is composed of a series of Hadoop jobs executed in
   sequence. To accomplish the design of separating storage from compute,
   you decided to use the Cloud Storage connector to store all input data,
   output data, and intermediary data. However, you noticed that one
   Hadoop job runs very slowly with Cloud Dataproc, when compared with
   the on-premises bare-metal Hadoop environment (8-core nodes with 100-
   GB RAM). Analysis shows that this particular Hadoop job is disk I/O
   intensive. You want to resolve the issue. What should you do?
   The most effective way to resolve the performance issue for a disk I/O
   intensive Hadoop job in Cloud Dataproc is to allocate sufficient persistent
   disks and store the intermediate data on the local HDFS. This reduces
   network overhead and allows the job to access data much faster,
   improving overall performance.
   BigQuery is the most appropriate storage solution for this use case due to
   its scalability, geospatial processing capabilities, high-speed ingestion,
   machine learning integration, and suitability for dashboard creation. It
   directly addresses all the key requirements for storing, processing, and
   analyzing the ship telemetry data to predict delivery delays.
61.You operate an IoT pipeline built around Apache Kafka that normally
   receives around 5000 messages per second. You want to use Google Cloud
   Platform to create an alert as soon as the moving average over 1 hour
   drops below 4000 messages per second. What should you do?
   Dataflow with Sliding Time Windows: Dataflow allows you to work with
   event-time windows, making it suitable for time-series data like incoming
   IoT messages. Using sliding windows every 5 minutes allows you to
   compute moving averages efficiently.
   Sliding Time Window: The sliding time window of 1 hour every 5 minutes
   enables you to calculate the moving average over the specified time
   frame.
   Computing Averages: You can efficiently compute the average when each
   sliding window closes. This approach ensures that you have real-time
   visibility into the message rate and can detect deviations from the
   expected rate.
   Alerting: When the calculated average drops below 4000 messages per
   second, you can trigger an alert from within the Dataflow pipeline, sending
   it to your desired alerting mechanism, such as Cloud Monitoring, Pub/Sub,
   or another notification service.
   Scalability: Dataflow can scale automatically based on the incoming data
   volume, ensuring that you can handle the expected rate of 5000
   messages per second.
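   A hedged sketch of that pipeline shape in the Beam Python SDK (the topic
   name and the alerting hook are assumptions; a real pipeline would emit to
   Cloud Monitoring or Pub/Sub rather than print):

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      THRESHOLD = 4000  # messages per second

      def alert_if_low(avg_rate):
          # Placeholder: publish to Pub/Sub or write a custom Cloud Monitoring
          # metric that an alerting policy watches.
          if avg_rate < THRESHOLD:
              print(f"ALERT: moving average dropped to {avg_rate:.0f} msg/s")

      with beam.Pipeline(options=PipelineOptions(streaming=True)) as pipeline:
          (pipeline
           | "ReadMessages" >> beam.io.ReadFromPubSub(
                 topic="projects/my-project/topics/iot-messages")
           | "OnePerMessage" >> beam.Map(lambda _: 1)
           # 1-hour windows that slide every 5 minutes.
           | "SlidingWindow" >> beam.WindowInto(
                 beam.window.SlidingWindows(size=60 * 60, period=5 * 60))
           | "CountPerWindow" >> beam.CombineGlobally(sum).without_defaults()
           | "ToRate" >> beam.Map(lambda count: count / 3600.0)
           | "Alert" >> beam.Map(alert_if_low))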
62.You plan to deploy Cloud SQL using MySQL. You need to ensure high
   availability in the event of a zone failure. What should you do?
69.You work for a mid-sized enterprise that needs to move its operational
   system transaction data from an on-premises database to GCP. The
   database is about 20 TB in size. Which database should you choose?
   Cloud SQL is a fully managed service that scales up automatically and
   supports SQL queries; however, it does not inherently guarantee
   transactional consistency or the ability to scale up to 6 TB for all of its
   database engines.
70.You need to choose a database to store time series CPU and memory
   usage for millions of computers. You need to store this data in one-second
   interval samples. Analysts will be performing real-time, ad hoc analytics
   against the database. You want to avoid being charged for every query
   executed and ensure that the schema design will allow for future growth of
   the dataset. Which database and data model should you choose?
   Bigtable with a narrow table design is the most suitable solution for this
   scenario. It provides the scalability, low-latency reads, cost-effectiveness,
   and schema flexibility needed to store and analyze time series data from
   millions of computers. The narrow table model ensures efficient storage
   and retrieval of data, while Bigtable's pricing model avoids per-query
   charges.
71.You want to archive data in Cloud Storage. Because some data is very
   sensitive, you want to use the `Trust No One` (TNO) approach to encrypt
   your data to prevent the cloud provider staff from decrypting your data.
   What should you do?
   Additional authenticated data (AAD) is any string that you pass to Cloud
   Key Management Service as part of an encrypt or decrypt request. AAD is
   used as an integrity check and can help protect your data from a confused
   deputy attack. The AAD string must be no larger than 64 KiB.
   Cloud KMS will not decrypt ciphertext unless the same AAD value is used
   for both encryption and decryption.
   AAD is bound to the encrypted data, because you cannot decrypt the
   ciphertext unless you know the AAD, but it is not stored as part of the
   ciphertext. AAD also does not increase the cryptographic strength of the
   ciphertext. Instead it is an additional check by Cloud KMS to authenticate
   a decryption request.
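   A sketch of how AAD is supplied on encrypt and must match on decrypt
   (Python Cloud KMS client; the key path and values are placeholders):

      from google.cloud import kms

      client = kms.KeyManagementServiceClient()
      key_name = client.crypto_key_path("my-project", "us", "my-keyring", "my-key")

      plaintext = b"account=12345;balance=100"
      aad = b"customer-record-v1"  # must be identical on encrypt and decrypt

      encrypt_response = client.encrypt(
          request={
              "name": key_name,
              "plaintext": plaintext,
              "additional_authenticated_data": aad,
          }
      )

      # Decryption fails unless the same AAD value is provided.
      decrypt_response = client.decrypt(
          request={
              "name": key_name,
              "ciphertext": encrypt_response.ciphertext,
              "additional_authenticated_data": aad,
          }
      )
      assert decrypt_response.plaintext == plaintext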
72.You have data pipelines running on BigQuery, Dataflow, and Dataproc. You
   need to perform health checks and monitor their behavior, and then notify
   the team managing the pipelines if they fail. You also need to be able to
   work across multiple projects. Your preference is to use managed products
   or features of the platform. What should you do?
74.You work for a large bank that operates in locations throughout North
   America. You are setting up a data storage system that will handle bank
   account transactions. You require ACID compliance and the ability to
   access data with SQL. Which solution is appropriate?
   Since the banking transaction system requires ACID compliance and SQL
   access to the data, Cloud Spanner is the most appropriate solution. Unlike
   Cloud SQL, Cloud Spanner natively provides ACID transactions and
   horizontal scalability.
   Enabling stale reads in Spanner (option A) would reduce data consistency,
   violating the ACID compliance requirement of banking transactions.
   BigQuery (option C) does not natively support ACID transactions or SQL
   writes which are necessary for a banking transactions system.
   Cloud SQL (option D) provides ACID compliance but does not scale
   horizontally like Cloud Spanner can to handle large transaction volumes.
   By using Cloud Spanner and specifically locking read-write transactions,
   ACID compliance is ensured while providing fast, horizontally scalable SQL
   processing of banking transactions.
   When you want to move your Apache Spark workloads from an on-
   premises environment to Google Cloud, we recommend using Dataproc to
   run Apache Spark/Apache Hadoop clusters. Dataproc is a fully managed,
   fully supported service offered by Google Cloud. It allows you to separate
   storage and compute, which helps you to manage your costs and be more
   flexible in scaling your workloads.
   https://cloud.google.com/bigquery/docs/migration/hive#data_migration
   Migrating Hive data from your on-premises or other cloud-based source
   cluster to BigQuery has two steps:
   1. Copying data from a source cluster to Cloud Storage
   2. Loading data from Cloud Storage into BigQuery
77.You work for a financial institution that lets customers register online. As
   new customers register, their user data is sent to Pub/Sub before being
   ingested into BigQuery. For security reasons, you decide to redact your
   customers' Government issued Identification Number while allowing
   customer service representatives to view the original values when
   necessary. What should you do?
   Before loading the data into BigQuery, use Cloud Data Loss Prevention
   (DLP) to replace input values with a cryptographic format-preserving
   encryption token.
   The key reasons are:
   DLP allows redacting sensitive PII like SSNs before loading into BigQuery.
   This provides security by default for the raw SSN values.
   Using format-preserving encryption keeps the column format intact while
   still encrypting, allowing business logic relying on SSN format to continue
   functioning.
   The encrypted tokens can be reversed to view original SSNs when
   required, meeting the access requirement for customer service reps.
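   A hedged sketch of such a de-identification call using format-preserving
   encryption with a KMS-wrapped key (Python DLP client; the project, key
   names, wrapped-key bytes, and column name are placeholders):

      import google.cloud.dlp_v2 as dlp

      client = dlp.DlpServiceClient()
      parent = "projects/my-project/locations/global"

      # Structured record as it would look before loading into BigQuery.
      item = {
          "table": {
              "headers": [{"name": "gov_id"}],
              "rows": [{"values": [{"string_value": "123456789"}]}],
          }
      }

      deidentify_config = {
          "record_transformations": {
              "field_transformations": [{
                  "fields": [{"name": "gov_id"}],
                  "primitive_transformation": {
                      "crypto_replace_ffx_fpe_config": {
                          "crypto_key": {
                              "kms_wrapped": {
                                  "wrapped_key": b"<wrapped-data-key-bytes>",
                                  "crypto_key_name": (
                                      "projects/my-project/locations/global/"
                                      "keyRings/dlp/cryptoKeys/fpe-key"
                                  ),
                              }
                          },
                          # Digits in, the same number of digits out, so
                          # downstream schemas and format checks keep working.
                          "common_alphabet": "NUMERIC",
                      }
                  },
              }]
          }
      }

      response = client.deidentify_content(
          request={"parent": parent,
                   "deidentify_config": deidentify_config,
                   "item": item})
      print(response.item.table.rows[0].values[0].string_value)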
78.You are migrating a table to BigQuery and are deciding on the data model.
   Your table stores information related to purchases made across several
   store locations and includes information like the time of the transaction,
   items purchased, the store ID, and the city and state in which the store is
   located. You frequently query this table to see how many of each item
   were sold over the past 30 days and to look at purchasing trends by state,
   city, and individual store. How would you model this table for the best
   query performance?
   Cloud Dataproc allows you to run Apache Hadoop jobs with minimal
   management. It is a managed Hadoop service.
   Using the Google Cloud Storage (GCS) connector, Dataproc can access
   data stored in GCS, which allows data persistence beyond the life of the
   cluster. This means that even if the cluster is deleted, the data in GCS
   remains intact. Moreover, using GCS is often cheaper and more durable
   than using HDFS on persistent disks.
80.You are updating the code for a subscriber to a Pub/Sub feed. You are
   concerned that upon deployment the subscriber may erroneously
   acknowledge messages, leading to message loss. Your subscriber is not
   set up to retain acknowledged messages. What should you do to ensure
   that you can recover from errors after deployment?
81.You work for a large real estate firm and are preparing 6 TB of home sales
   data to be used for machine learning. You will use SQL to transform the
   data and use BigQuery ML to create a machine learning model. You plan to
   use the model for predictions against a raw dataset that has not been
   transformed. How should you set up your workflow in order to prevent
   skew at prediction time?
82.You are analyzing the price of a company's stock. Every 5 seconds, you
   need to compute a moving average of the past 30 seconds' worth of data.
   You are reading data from Pub/Sub and using DataFlow to conduct the
   analysis. How should you set up your windowed pipeline?
   Since you need to compute a moving average of the past 30 seconds'
   worth of data every 5 seconds, a sliding window is appropriate. A sliding
   window allows overlapping intervals and is well-suited for computing
   rolling aggregates.
   Window Duration: The window duration should be set to 30 seconds to
   cover the required 30 seconds' worth of data for the moving average
   calculation.
   Window Period: The window period or sliding interval should be set to 5
   seconds to move the window every 5 seconds and recalculate the moving
   average with the latest data.
   Trigger: The trigger should be set to AfterWatermark.pastEndOfWindow()
   to emit the computed moving average results when the watermark
   advances past the end of the window. This ensures that all data within the
   window is considered before emitting the result.
84.You work for a large financial institution that is planning to use Dialogflow
   to create a chatbot for the company's mobile app. You have reviewed old
   chat logs and tagged each conversation for intent based on each
   customer's stated intention for contacting customer service. About 70% of
   customer requests are simple requests that are solved within 10 intents.
   The remaining 30% of inquiries require much longer, more complicated
   requests. Which intents should you automate first?
   This is the best approach because it follows the Pareto principle (80/20
   rule). By automating the most common 10 intents that address 70% of
   customer requests, you free up the live agents to focus their time and
   effort on the more complex 30% of requests that likely require human
   insight/judgement. Automating the simpler high-volume requests first
   allows the chatbot to handle those easily, efficiently routing only the
   trickier cases to agents. This makes the best use of automation for high-
   volume simple cases and human expertise for lower-volume complex
   issues.
87.You want to rebuild your batch pipeline for structured data on Google
   Cloud. You are using PySpark to conduct data transformations at scale, but
   your pipelines are taking over twelve hours to run. To expedite
   development and pipeline run time, you want to use a serverless tool and
   SQL syntax. You have already moved your raw data into Cloud Storage.
   How should you build the pipeline on Google Cloud while meeting speed
   and processing requirements?
   The core issue is the use of SideInputs for joining data, leading to
   materialization and replication overhead. CoGroupByKey provides a more
   efficient, parallel approach to join operations in Dataflow by avoiding
   materialization and reducing replication. Therefore, switching to
   CoGroupByKey is the most effective way to expedite the Dataflow job in
   this scenario.
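   A minimal CoGroupByKey sketch (Beam Python SDK, with small in-memory
   inputs standing in for the real sources) showing the join pattern described
   above:

      import apache_beam as beam

      with beam.Pipeline() as pipeline:
          orders = pipeline | "Orders" >> beam.Create([
              ("user_1", {"order_id": "o-100", "total": 25.0}),
              ("user_2", {"order_id": "o-101", "total": 40.0}),
          ])
          profiles = pipeline | "Profiles" >> beam.Create([
              ("user_1", {"state": "CA"}),
              ("user_2", {"state": "NY"}),
          ])

          joined = (
              {"orders": orders, "profiles": profiles}
              # Both inputs are shuffled by key and joined in parallel, instead
              # of materializing one side as a side input on every worker.
              | "Join" >> beam.CoGroupByKey()
              | "Flatten" >> beam.MapTuple(
                  lambda key, grouped: (key,
                                        list(grouped["orders"]),
                                        list(grouped["profiles"])))
              | "Print" >> beam.Map(print))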
89.You are building a real-time prediction engine that streams files, which
   may contain PII (personal identifiable information) data, into Cloud Storage
   and eventually into BigQuery. You want to ensure that the sensitive data is
   masked but still maintains referential integrity, because names and emails
   are often used as join keys. How should you use the Cloud Data Loss
   Prevention API (DLP API) to ensure that the PII data is not accessible by
   unauthorized individuals?
91.You are migrating an application that tracks library books and information
   about each book, such as author or year published, from an on-premises
   data warehouse to BigQuery. In your current relational database, the
   author information is kept in a separate table and joined to the book
   information on a common key. Based on Google's recommended practice
   for schema design, how would you structure the data to ensure optimal
   speed of queries about the author of each book that has been borrowed?
92.You need to give new website users a globally unique identifier (GUID)
   using a service that takes in data points and returns a GUID. This data is
   sourced from both internal and external systems via HTTP calls that you
   will make via microservices within your pipeline. There will be tens of
   thousands of messages per second, the calls can be multi-threaded, and
   you worry about the backpressure on the system. How should you design
   your pipeline to minimize that backpressure?
   Option D is the best approach to minimize backpressure in this scenario.
   By batching the jobs into 10-second increments, you can throttle the rate
   at which requests are made to the external GUID service. This prevents
   too many simultaneous requests from overloading the service.
   Considering the requirement for handling large files and the need for real-
   time data integration, Option C (gsutil for the migration; Pub/Sub and
   Dataflow for the real-time updates) seems to be the most appropriate.
   gsutil will effectively handle the large file transfers, while Pub/Sub and
   Dataflow provide a robust solution for real-time data capture and
   processing, ensuring continuous updates to your warehouse on Google
   Cloud.
94.You are using Bigtable to persist and serve stock market data for each of
   the major indices. To serve the trading application, you need to access
   only the most recent stock prices that are streaming in. How should you
   design your row key and tables to ensure that you can access the data
   with the simplest query?
   A single table for all indices keeps the structure simple.
   Using a reverse timestamp as part of the row key ensures that the most
   recent data comes first in the sorted order. This design is beneficial for
   quickly accessing the latest data.
   For example, you can compute a reverse timestamp (such as a large
   constant minus the epoch seconds of the event, zero-padded to a fixed
   width), ensuring newer dates and times are sorted lexicographically before
   older ones.
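   A sketch of building such a reverse-timestamp row key in Python (the index
   name, separator, and padding width are illustrative):

      import datetime

      # Any constant larger than every epoch-seconds value the table will store.
      MAX_SECONDS = 9_999_999_999

      def reverse_timestamp_row_key(index: str,
                                    event_time: datetime.datetime) -> bytes:
          seconds = int(event_time.timestamp())
          # Subtracting from a fixed maximum makes newer events sort first
          # lexicographically, so the most recent price is the first row scanned.
          reversed_seconds = str(MAX_SECONDS - seconds).zfill(10)
          return f"{index}#{reversed_seconds}".encode("utf-8")

      key = reverse_timestamp_row_key("NASDAQ", datetime.datetime.utcnow())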
95.You are building a report-only data warehouse where the data is streamed
   into BigQuery via the streaming API. Following Google's best practices, you
   have both a staging and a production table for the data. How should you
   design your data loading to ensure that there is only one master dataset
   without affecting performance on either the ingestion or reporting pieces?
96.You issue a new batch job to Dataflow. The job starts successfully,
   processes a few elements, and then suddenly fails and shuts down. You
   navigate to the Dataflow monitoring interface where you find errors
   related to a particular DoFn in your pipeline. What is the most likely cause
   of the errors?
   While your job is running, you might encounter errors or exceptions in your
   worker code. These errors generally mean that the DoFns in your pipeline
   code have generated unhandled exceptions, which result in failed tasks in
   your Dataflow job.
   Exceptions in user code (for example, your DoFn instances) are reported in
   the Dataflow monitoring interface.
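   For context, a minimal sketch of a DoFn whose unhandled exception would
   surface exactly this way, together with the common mitigation of catching
   the error and routing the bad element to a dead-letter output (all names
   are illustrative):

      import json
      import apache_beam as beam

      class ParseRecordFn(beam.DoFn):
          DEAD_LETTER = "dead_letter"

          def process(self, element):
              try:
                  # If this exception were left uncaught, it would fail the bundle
                  # and appear as a DoFn error in the Dataflow monitoring interface.
                  yield json.loads(element)
              except Exception as exc:
                  yield beam.pvalue.TaggedOutput(self.DEAD_LETTER, (element, str(exc)))

      with beam.Pipeline() as pipeline:
          results = (
              pipeline
              | beam.Create(['{"id": 1}', "not-json"])
              | beam.ParDo(ParseRecordFn()).with_outputs(
                  ParseRecordFn.DEAD_LETTER, main="parsed"))
          results.parsed | "Good" >> beam.Map(print)
          results.dead_letter | "Bad" >> beam.Map(print)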
97.Your new customer has requested daily reports that show their net
   consumption of Google Cloud compute resources and who used the
   resources. You need to quickly and efficiently generate these daily reports.
   What should you do?
98.The Development and External teams have the project viewer Identity and
   Access Management (IAM) role in a folder named Visualization. You want
   the Development Team to be able to read data from both Cloud Storage
   and BigQuery, but the External Team should only be able to read data from
   BigQuery. What should you do?
   Development team: needs to access both Cloud Storage and BQ ->
   therefore we put the Development team inside a perimeter so it can
   access both the Cloud Storage and the BQ
   External team: allowed to access only BQ -> therefore we put Cloud
   Storage behind the restricted API and leave the external team outside of
   the perimeter, so it can access BQ, but is prohibited from accessing the
   Cloud Storage
99.Your startup has a web application that currently serves customers out of a
   single region in Asia. You are targeting funding that will allow your startup
   to serve customers globally. Your current goal is to optimize for cost, and
   your post-funding goal is to optimize for global presence and performance.
   You must use a native JDBC driver. What should you do?
   This option allows for optimization for cost initially with a single region
   Cloud Spanner instance, and then optimization for global presence and
   performance after funding with multi-region instances.
   Cloud Spanner supports native JDBC drivers and is horizontally scalable,
   providing very high performance. A single region instance minimizes costs
   initially. After funding, multi-region instances can provide lower latency
   and high availability globally.
   Cloud SQL does not scale as well and has higher costs for multiple high
   availability regions. Bigtable does not support JDBC drivers natively.
   Therefore, Spanner is the best choice here for optimizing both for cost
   initially and then performance and availability globally post-funding.
102.      You are loading CSV files from Cloud Storage to BigQuery. The files
   have known data quality issues, including mismatched data types, such as
   STRINGs and INT64s in the same column, and inconsistent formatting of
   values such as phone numbers or addresses. You need to create the data
   pipeline to maintain data quality and perform the required cleansing and
   transformation. What should you do?
   Data Fusion is the best choice for this scenario because it provides a
   comprehensive platform for building and managing data pipelines,
   including data quality features and pre-built transformations for handling
   the specific data issues in your CSV files. It simplifies the process and
   reduces the amount of manual coding required compared to using SQL-
   based approaches.
103.      You are developing a new deep learning model that predicts a
   customer's likelihood to buy on your ecommerce site. After running an
   evaluation of the model against both the original training data and new
   test data, you find that your model is overfitting the data. You want to
   improve the accuracy of the model when predicting new data. What
   should you do?
    To improve the accuracy of a model that's overfitting, the most effective
    strategies are to:
   Increase the amount of training data: This helps the model learn
    more generalizable patterns.
   Decrease the number of input features: This helps the model focus on
    the most relevant information and avoid learning noise.
    Therefore, option B is the most suitable approach to address overfitting
    and improve the model's accuracy on new data.
    Option D provides the most efficient and streamlined approach for this
    scenario. By using an Apache Beam custom connector with Dataflow and
    Avro format, you can directly read, transform, and stream the proprietary
    data into BigQuery while minimizing resource consumption and
    maximizing performance.
106.      An online brokerage company requires a high volume trade
   processing architecture. You need to create a secure queuing system that
   triggers jobs. The jobs will run in Google Cloud and call the company's
   Python API to execute trades. You need to efficiently implement a solution.
   What should you do?
108.      You have 15 TB of data in your on-premises data center that you
   want to transfer to Google Cloud. Your data changes weekly and is stored
   in a POSIX-compliant source. The network operations team has granted
   you 500 Mbps bandwidth to the public internet. You want to follow Google-
   recommended practices to reliably transfer your data to Google Cloud on a
   weekly basis. What should you do?
   Like gsutil, Storage Transfer Service for on-premises data enables transfers
   from network file system (NFS) storage to Cloud Storage. Although gsutil
   can support small transfer sizes (up to 1 TB), Storage Transfer Service for
   on-premises data is designed for large-scale transfers (up to petabytes of
   data, billions of files).
   Cloud SQL for PostgreSQL provides full ACID compliance, unlike Bigtable
   which provides only atomicity and consistency guarantees.
   Enabling high availability removes the need for manual failover as Cloud
   SQL will automatically failover to a standby replica if the leader instance
   goes down.
   Point-in-time recovery in MySQL requires manual intervention to restore
   data if needed.
   BigQuery does not provide transactional guarantees required for an ACID
   database.
   Therefore, a Cloud SQL for PostgreSQL instance with high availability
   meets the ACID and minimal intervention requirements best. The
   automatic failover will ensure availability and uptime without
   administrative effort.
111.     You are using BigQuery and Data Studio to design a customer-facing
   dashboard that displays large quantities of aggregated data. You expect a
   high volume of concurrent users. You need to optimize the dashboard to
   provide quick visualizations with minimal latency. What should you do?
   This approach allows the model to benefit from both the historical data
   (existing data) and the new data, ensuring that it adapts to changing
   preferences while retaining knowledge from the past. By combining both
   types of data, the model can learn to make recommendations that are up-
   to-date and relevant to users' evolving preferences.
113.     You work for a car manufacturer and have set up a data pipeline
   using Google Cloud Pub/Sub to capture anomalous sensor events. You are
   using a push subscription in Cloud Pub/Sub that calls a custom HTTPS
   endpoint that you have created to take action of these anomalous events
   as they occur. Your custom HTTPS endpoint keeps getting an inordinate
   amount of duplicate messages. What is the most likely cause of these
   duplicate messages?
   The import and export feature uses the native RDB snapshot feature of
   Redis to import data into or export data out of a Memorystore for Redis
   instance. The use of the native RDB format prevents lock-in and makes it
   very easy to move data within Google Cloud or outside of Google Cloud.
   Import and export uses Cloud Storage buckets to store RDB files.
120.      You need ads data to serve AI models and historical data for
   analytics. Longtail and outlier data points need to be identified. You want
   to cleanse the data in near-real time before running it through AI models.
   What should you do?
121.      You are collecting IoT sensor data from millions of devices across
   the world and storing the data in BigQuery. Your access pattern is based
   on recent data, filtered by location_id and device_version with the
   following query:
   You want to optimize your queries for cost and performance. How should
   you structure your data?
   Partitioning by create_date:
   Aligns with query pattern: Filters for recent data based on create_date, so
   partitioning by this column allows BigQuery to quickly narrow down the
   data to scan, reducing query costs and improving performance.
   Manages data growth: Partitioning effectively segments data by date,
   making it easier to manage large datasets and optimize storage costs.
   Clustering by location_id and device_version:
   Enhances filtering: Frequently filtering by location_id and device_version,
   clustering physically co-locates related data within partitions, further
   reducing scan time and improving performance.
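   A hedged DDL sketch of that layout, run through the BigQuery client (the
   project, dataset, and staging source are placeholders; create_date is
   assumed to be a TIMESTAMP):

      from google.cloud import bigquery

      client = bigquery.Client()

      ddl = """
      CREATE TABLE `my-project.iot.sensor_data`
      PARTITION BY DATE(create_date)
      CLUSTER BY location_id, device_version AS
      SELECT * FROM `my-project.iot.sensor_data_staging`
      """
      # If create_date is already a DATE column, use PARTITION BY create_date.
      client.query(ddl).result()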
122.      A live TV show asks viewers to cast votes using their mobile phones.
   The event generates a large volume of data during a 3-minute period. You
   are in charge of the "Voting infrastructure" and must ensure that the
   platform can handle the load and that all votes are processed. You must
   display partial results while voting is open. After voting closes, you need to
   count the votes exactly once while optimizing cost. What should you do?
126.      You are using BigQuery with a multi-region dataset that includes a
   table with the daily sales volumes. This table is updated multiple times per
   day. You need to protect your sales table in case of regional failures with a
   recovery point objective (RPO) of less than 24 hours, while keeping costs
   to a minimum. What should you do?
127.     You are troubleshooting your Dataflow pipeline that processes data
   from Cloud Storage to BigQuery. You have discovered that the Dataflow
   worker nodes cannot communicate with one another. Your networking
   team relies on Google Cloud network tags to define firewall rules. You need
   to identify the issue while following Google-recommended networking
   security practices. What should you do?
How should you redesign the BigQuery table to support faster access?
   - Create a copy of the necessary tables into a new dataset that doesn't use
   CMEK, ensuring the data is accessible without requiring the partner to
   have access to the encryption key.
   - Analytics Hub can then be used to share this data securely and efficiently
   with the partner organization, maintaining control and governance over
   the shared data.
131.     You are developing an Apache Beam pipeline to extract data from a
   Cloud SQL instance by using JdbcIO. You have two projects running in
   Google Cloud. The pipeline will be deployed and executed on Dataflow in
   Project A. The Cloud SQL instance is running in Project B and does not
   have a public IP address. After deploying the pipeline, you noticed that the
   pipeline failed to extract data from the Cloud SQL instance due to
   connection failure. You verified that VPC Service Controls and shared VPC
   are not in use in these projects. You want to resolve this error while
   ensuring that the data does not go through the public internet. What
   should you do?
132.      You have a BigQuery table that contains customer data, including
   sensitive information such as names and addresses. You need to share the
   customer data with your data analytics and consumer support teams
   securely. The data analytics team needs to access the data of all the
   customers, but must not be able to access the sensitive data. The
   consumer support team needs access to all data columns, but must not be
   able to access customers that no longer have active contracts. You
   enforced these requirements by using an authorized dataset and policy
   tags. After implementing these steps, the data analytics team reports that
   they still have access to the sensitive columns. You need to ensure that
   the data analytics team does not have access to restricted data. What
   should you do? (Choose two.)
   The two best answers are D and E. You need to both enforce the policy
   tags (E) and remove the broad data viewing permission (D) to effectively
   restrict the data analytics team's access to sensitive information. This
   combination ensures that the policy tags are actually enforced and that
   the team lacks the underlying permissions to bypass those restrictions.
133.      You have a Cloud SQL for PostgreSQL instance in Region1 with one
   read replica in Region2 and another read replica in Region3. An
   unexpected event in Region1 requires that you perform disaster recovery
   by promoting a read replica in Region2. You need to ensure that your
   application has the same database capacity available before you switch
   over the connections. What should you do?
134.      You orchestrate ETL pipelines by using Cloud Composer. One of the
   tasks in the Apache Airflow directed acyclic graph (DAG) relies on a third-
   party service. You want to be notified when the task does not succeed.
   What should you do?
   Direct Trigger:
   The on_failure_callback parameter is specifically designed to invoke a
   function when a task fails, ensuring immediate notification.
   Customizable Logic:
   You can tailor the notification function to send emails, create alerts, or
   integrate with other notification systems, providing flexibility.
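   As a rough illustration, the sketch below (the DAG ID, task, and notification logic are
   placeholders, not from the source) shows how an Airflow task can attach a Python function
   through on_failure_callback so it runs whenever the task fails:

      from datetime import datetime

      from airflow import DAG
      from airflow.operators.bash import BashOperator


      def notify_failure(context):
          # Airflow passes the task context; swap the print for an email or chat alert.
          task = context["task_instance"].task_id
          print(f"Task {task} failed in DAG {context['dag'].dag_id}")


      with DAG(
          dag_id="third_party_sync",
          start_date=datetime(2024, 1, 1),
          schedule_interval="@daily",
          catchup=False,
      ) as dag:
          call_service = BashOperator(
              task_id="call_third_party_service",
              bash_command="exit 1",  # stand-in for the real call to the third-party service
              on_failure_callback=notify_failure,
          )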
135.      Your company has hired a new data scientist who wants to perform
   complicated analyses across very large datasets stored in Google Cloud
   Storage and in a Cassandra cluster on Google Compute Engine. The
   scientist primarily wants to create labelled data sets for machine learning
   projects, along with some visualization tasks. She reports that her laptop is
   not powerful enough to perform her tasks and it is slowing her down. You
   want to help her perform her tasks. What should you do?
137.       You store and analyze your relational data in BigQuery on Google
   Cloud with all data that resides in US regions. You also have a variety of
   object stores across Microsoft Azure and Amazon Web Services (AWS), also
   in US regions. You want to query all your data in BigQuery daily with as
   little movement of data as possible. What should you do?
138.      You have a variety of files in Cloud Storage that your data science
   team wants to use in their models. Currently, users do not have a method
   to explore, cleanse, and validate the data in Cloud Storage. You are
   looking for a low code solution that can be used by your data science team
   to quickly cleanse and explore data within Cloud Storage. What should you
   do?
   Dataprep is the most suitable option because it's a low-code tool
   specifically designed for data exploration, cleansing, and validation
   directly within Cloud Storage. It aligns perfectly with the requirements
   outlined in the problem statement.
139.      You are building an ELT solution in BigQuery by using Dataform. You
   need to perform uniqueness and null value checks on your final tables.
   What should you do to efficiently integrate these checks into your
   pipeline?
142.      Your organization has two Google Cloud projects, project A and
   project B. In project A, you have a Pub/Sub topic that receives data from
   confidential sources. Only the resources in project A should be able to
   access the data in that topic. You want to ensure that project B and any
   future project cannot access data in the project A topic. What should you
   do?
   - It allows us to create a secure boundary around all resources in Project A,
   including the Pub/Sub topic.
   - It prevents data exfiltration to other projects and ensures that only
   resources within the perimeter (Project A) can access the sensitive data.
   - VPC Service Controls are specifically designed for scenarios where you
   need to secure sensitive data within a specific context or boundary in
   Google Cloud.
143.      You stream order data by using a Dataflow pipeline, and write the
   aggregated result to Memorystore. You provisioned a Memorystore for
   Redis instance with Basic Tier, 4 GB capacity, which is used by 40 clients
   for read-only access. You are expecting the number of read-only clients to
   increase significantly to a few hundred and you need to be able to support
   the demand. You want to ensure that read and write access availability is
   not impacted, and any changes you make can be deployed quickly. What
   should you do?
144.     You have a streaming pipeline that ingests data from Pub/Sub in
   production. You need to update this streaming pipeline with improved
   business logic. You need to ensure that the updated pipeline reprocesses
   the previous two days of delivered Pub/Sub messages. What should you
   do? (Choose two.)
   D&E
   Both retain-acked-messages and Seek are required to achieve the
   desired reprocessing. retain-acked-messages keeps the messages
   available, and Seek allows the updated pipeline to rewind and read those
   messages again. They are complementary functionalities that solve
   different parts of the problem.
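   A minimal sketch of the replay step, assuming retain-acked-messages is already enabled on
   the subscription (project and subscription names are placeholders):

      import datetime

      from google.cloud import pubsub_v1

      subscriber = pubsub_v1.SubscriberClient()
      subscription_path = subscriber.subscription_path("my-project", "orders-sub")

      # Rewind the subscription two days so the updated pipeline re-reads those messages.
      two_days_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=2)
      subscriber.seek(request={"subscription": subscription_path, "time": two_days_ago})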
145.      You currently use a SQL-based tool to visualize your data stored in
   BigQuery. The data visualizations require the use of outer joins and
   analytic functions. Visualizations must be based on data that is no less
   than 4 hours old. Business users are complaining that the visualizations
   are too slow to generate. You want to improve the performance of the
   visualization queries while minimizing the maintenance overhead of the
   data preparation pipeline. What should you do?
146.     You are deploying 10,000 new Internet of Things devices to collect
   temperature data in your warehouses globally. You need to process, store
   and analyze these very large datasets in real time. What should you do?
   Google Cloud Pub/Sub allows for efficient ingestion and real-time data
   streaming.
   Google Cloud Dataflow can process and transform the streaming data in
   real-time.
   Google BigQuery is a fully managed, highly scalable data warehouse that
   is well-suited for real-time analysis and querying of large datasets.
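   A minimal sketch of that architecture in the Beam Python SDK (topic, table, and schema
   names are assumptions):

      import json

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(streaming=True)

      with beam.Pipeline(options=options) as p:
          (
              p
              | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                  topic="projects/my-project/topics/temperature-readings")
              | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
              | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                  table="my-project:warehouse.temperature_readings",
                  schema="device_id:STRING,temperature:FLOAT,event_time:TIMESTAMP",
                  write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
              )
          )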
147.      You need to modernize your existing on-premises data strategy. Your
   organization currently uses:
   • Apache Hadoop clusters for processing multiple large data sets,
   including on-premises Hadoop Distributed File System (HDFS) for data
   replication.
   • Apache Airflow to orchestrate hundreds of ETL pipelines with thousands
   of job steps.
   You need to set up a new architecture in Google Cloud that can handle
   your Hadoop workloads and requires minimal changes to your existing
   orchestration processes. What should you do?
148.      You recently deployed several data processing jobs into your Cloud
   Composer 2 environment. You notice that some tasks are failing in Apache
   Airflow. On the monitoring dashboard, you see an increase in the total
   workers memory usage, and there were worker pod evictions. You need to
   resolve these errors. What should you do? (Choose two.)
   Both increasing worker memory (D) and increasing the Cloud Composer
   environment size (B) are crucial for solving the problem. The environment
   size provides the necessary resources, while increasing worker memory
   allows the workers to utilize those resources effectively. They work
   together to address the root cause of worker memory issues and pod
   evictions.
149.      You are on the data governance team and are implementing
   security requirements to deploy resources. You need to ensure that
   resources are limited to only the europe-west3 region. You want to follow
   Google-recommended practices.
   What should you do?
153.     You are deploying a MySQL database workload onto Cloud SQL. The
   database must be able to scale up to support several readers from various
   geographic regions. The database must be highly available and meet low
   RTO and RPO requirements, even in the event of a regional outage. You
   need to ensure that interruptions to the readers are minimal during a
   database failover. What should you do?
   Option C provides the most robust and highly available solution by
   combining a highly available primary instance with a highly available read
   replica in another region. This approach ensures that the database can
   withstand both zonal and regional failures, while cascading read replicas
   provide scalability and low latency for read workloads.
154.      You are planning to load some of your existing on-premises data
   into BigQuery on Google Cloud. You want to either stream or batch-load
   data, depending on your use case. Additionally, you want to mask some
   sensitive data before loading into BigQuery. You need to do this in a
   programmatic way while keeping costs to a minimum. What should you
   do?
156.      The data analyst team at your company uses BigQuery for ad-hoc
   queries and scheduled SQL pipelines in a Google Cloud project with a slot
   reservation of 2000 slots. However, with the recent introduction of
   hundreds of new non time-sensitive SQL pipelines, the team is
   encountering frequent quota errors. You examine the logs and notice that
   approximately 1500 queries are being triggered concurrently during peak
   time. You need to resolve the concurrency issue. What should you do?
158.     You are designing a data mesh on Google Cloud by using Dataplex
   to manage data in BigQuery and Cloud Storage. You want to simplify data
   asset permissions. You are creating a customer virtual lake with two user
   groups:
   • Data engineers, which require full data lake access
   • Analytic users, which require access to curated data
   You need to assign access rights to these two groups. What should you do?
   Option A provides the most straightforward and efficient way to manage
   permissions in Dataplex by using its built-in roles (dataplex.dataOwner and
   dataplex.dataReader). This simplifies permission management and
   ensures that each user group has the appropriate level of access to the
   data lake.
159.      You are designing the architecture of your application to store data
   in Cloud Storage. Your application consists of pipelines that read data from
   a Cloud Storage bucket that contains raw data, and write the data to a
   second bucket after processing. You want to design an architecture with
   Cloud Storage resources that are capable of being resilient if a Google
   Cloud regional failure occurs. You want to minimize the recovery point
   objective (RPO) if a failure occurs, with no impact on applications that use
   the stored data. What should you do?
   Option C provides the best balance of high availability, low RPO, and
   minimal impact on applications. Dual-region buckets with turbo replication
   offer a robust and efficient solution for storing data in Cloud Storage with
   regional failure resilience.
160.      You have designed an Apache Beam processing pipeline that reads
   from a Pub/Sub topic. The topic has a message retention duration of one
   day, and writes to a Cloud Storage bucket. You need to select a bucket
   location and processing strategy to prevent data loss in case of a regional
   outage with an RPO of 15 minutes. What should you do?
   Option D provides the most robust and efficient solution for preventing
   data loss and ensuring business continuity during a regional outage. It
   combines the high availability of dual-region buckets with turbo
   replication, proactive monitoring, and a well-defined failover process.
161.      You are preparing data that your machine learning team will use to
   train a model using BigQueryML. They want to predict the price per square
   foot of real estate. The training data has a column for the price and a
   column for the number of square feet. Another feature column called
   ‘feature1’ contains null values due to missing data. You want to replace
   the nulls with zeros to keep more data points. Which query should you
   use?
   Option A is the correct choice because it retains all the original columns
   and specifically addresses the issue of null values in ‘feature1’ by
   replacing them with zeros, without altering any other columns or
   performing unnecessary calculations. This makes the data ready for use in
   BigQueryML without losing any important information.
   Option C is not the best choice because it includes the EXCEPT clause for
   the price and square_feet columns, which would exclude these columns
   from the results. This is not desirable since you need these columns for
   the machine learning model to predict the price per square foot.
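   The exact option text is not reproduced here, but the idea can be sketched as a query
   that keeps every column while substituting zero for NULL values in feature1 (project,
   dataset, and table names are assumptions):

      from google.cloud import bigquery

      client = bigquery.Client()

      query = """
      SELECT
        * EXCEPT (feature1),
        IFNULL(feature1, 0) AS feature1
      FROM `my-project.real_estate.training_data`
      """

      for row in client.query(query).result():
          print(row)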
163.     You are developing a model to identify the factors that lead to sales
   conversions for your customers. You have completed processing your data.
   You want to continue through the model development lifecycle. What
   should you do next?
   You've just concluded processing the data, ending up with a clean, prepared dataset for
   the model. Now you need to decide how to split the data into training and test sets.
   Only afterwards can you train the model, evaluate it, fine-tune it, and, eventually,
   use it for prediction.
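   As a small sketch of that splitting step (the file and column names are assumptions, and
   scikit-learn is just one convenient way to do it):

      import pandas as pd
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("processed_conversions.csv")  # the cleaned, prepared data
      X = df.drop(columns=["converted"])
      y = df["converted"]

      # Hold out 20% of the rows to evaluate the model after training.
      X_train, X_test, y_train, y_test = train_test_split(
          X, y, test_size=0.2, random_state=42
      )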
164.     You have one BigQuery dataset which includes customers’ street
   addresses. You want to retrieve all occurrences of street addresses from
   the dataset. What should you do?
165.      Your company operates in three domains: airlines, hotels, and ride-
   hailing services. Each domain has two teams: analytics and data science,
   which create data assets in BigQuery with the help of a central data
   platform team. However, as each domain is evolving rapidly, the central
   data platform team is becoming a bottleneck. This is causing delays in
   deriving insights from data, and resulting in stale data when pipelines are
   not kept up to date. You need to design a data mesh architecture by using
   Dataplex to eliminate the bottleneck. What should you do?
   You have an inventory of VM data stored in the BigQuery table. You want
   to prepare the data for regular reporting in the most cost-effective way.
   You need to exclude VM rows with fewer than 8 vCPU in your report. What
   should you do?
   This approach allows you to set up a custom log sink with an advanced
   filter that targets the specific table and then export the log entries to
   Google Cloud Pub/Sub. Your monitoring tool can subscribe to the Pub/Sub
   topic, providing you with instant notifications when relevant events occur
   without being inundated with notifications from other tables.
   Options A and B do not offer the same level of customization and
   specificity in targeting notifications for a particular table.
   Option C is almost correct but doesn't mention the use of an advanced log
   filter in the sink configuration, which is typically needed to filter the logs to
   a specific table effectively. Using the Stackdriver API for more advanced
   configuration is often necessary for fine-grained control over log filtering.
169.      Your company's data platform ingests CSV file dumps of booking
   and user profile data from upstream sources into Cloud Storage. The data
   analyst team wants to join these datasets on the email field available in
   both the datasets to perform analysis. However, personally identifiable
   information (PII) should not be accessible to the analysts. You need to de-
   identify the email field in both the datasets before loading them into
   BigQuery for analysts. What should you do?
   Format-preserving encryption (FPE) with FFX in Cloud DLP is a strong
   choice for de-identifying PII like email addresses. FPE maintains the format
   of the data and ensures that the same input results in the same encrypted
   output consistently. This means the email fields in both datasets can be
   encrypted to the same value, allowing for accurate joins in BigQuery while
   keeping the actual email addresses hidden.
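   A hedged sketch of de-identifying a single email value with FPE-FFX (the project, key
   ring, and wrapped key bytes are placeholders; in practice you would run this over the
   CSV files with a Cloud DLP job or Dataflow template):

      from google.cloud import dlp_v2

      dlp = dlp_v2.DlpServiceClient()
      parent = "projects/my-project/locations/global"

      item = {"value": "alice@example.com"}
      inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}]}

      deidentify_config = {
          "info_type_transformations": {
              "transformations": [{
                  "primitive_transformation": {
                      "crypto_replace_ffx_fpe_config": {
                          "crypto_key": {
                              "kms_wrapped": {
                                  "wrapped_key": b"...",  # placeholder: AES key wrapped by Cloud KMS
                                  "crypto_key_name": (
                                      "projects/my-project/locations/global/"
                                      "keyRings/dlp/cryptoKeys/fpe-key"
                                  ),
                              }
                          },
                          "common_alphabet": "ALPHA_NUMERIC",
                      }
                  }
              }]
          }
      }

      response = dlp.deidentify_content(request={
          "parent": parent,
          "deidentify_config": deidentify_config,
          "inspect_config": inspect_config,
          "item": item,
      })
      print(response.item.value)  # the same input always yields the same token, so joins still work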
170.     You have important legal hold documents in a Cloud Storage bucket.
   You need to ensure that these documents are not deleted or modified.
   What should you do?
172.      You are deploying a batch pipeline in Dataflow. This pipeline reads
   data from Cloud Storage, transforms the data, and then writes the data
   into BigQuery. The security team has enabled an organizational constraint
   in Google Cloud, requiring all Compute Engine instances to use only
   internal IP addresses and no external IP addresses. What should you do?
   - Private Google Access for services allows VM instances with only internal
   IP addresses in a VPC network or on-premises networks (via Cloud VPN or
   Cloud Interconnect) to reach Google APIs and services.
   - When you launch a Dataflow job, you can specify that it should use
   worker instances without external IP addresses if Private Google Access is
   enabled on the subnetwork where these instances are launched.
   - This way, your Dataflow workers will be able to access Cloud Storage and
   BigQuery without violating the organizational constraint of no external IPs.
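   A hedged sketch of the corresponding pipeline options (project, region, bucket, and
   subnetwork values are assumptions); with use_public_ips disabled, the workers get only
   internal addresses and rely on Private Google Access to reach Cloud Storage and BigQuery:

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(
          runner="DataflowRunner",
          project="my-project",
          region="us-central1",
          temp_location="gs://my-bucket/temp",
          subnetwork="regions/us-central1/subnetworks/private-subnet",
          use_public_ips=False,  # workers are created without external IP addresses
      )

      with beam.Pipeline(options=options) as p:
          _ = p | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")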
175.      You are deploying an Apache Airflow directed acyclic graph (DAG) in
   a Cloud Composer 2 instance. You have incoming files in a Cloud Storage
   bucket that the DAG processes, one file at a time. The Cloud Composer
   instance is deployed in a subnetwork with no Internet access. Instead of
   running the DAG based on a schedule, you want to run the DAG in a
   reactive way every time a new file is received. What should you do?
   - Enable Airflow REST API: In Cloud Composer, enable the "Airflow web
   server" option.
   - Set Up Cloud Storage Notifications: Create a notification for new files,
   routing to a Cloud Function.
   - Create PSC Endpoint: Establish a PSC endpoint for Cloud Composer.
   - Write Cloud Function: Code the function to use the Airflow REST API (via
   PSC endpoint) to trigger the DAG.
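   A hedged sketch of such a Cloud Function (the web server URL, reached through the PSC
   endpoint, and the DAG ID are placeholders); it reacts to a new Cloud Storage object and
   calls the Airflow 2 stable REST API to start a DAG run:

      import google.auth
      from google.auth.transport.requests import AuthorizedSession

      WEB_SERVER_URL = "https://example-dot-us-central1.composer.googleusercontent.com"
      DAG_ID = "process_new_file"


      def trigger_dag(event, context):
          # event carries the metadata of the newly created Cloud Storage object.
          credentials, _ = google.auth.default(
              scopes=["https://www.googleapis.com/auth/cloud-platform"]
          )
          session = AuthorizedSession(credentials)
          endpoint = f"{WEB_SERVER_URL}/api/v1/dags/{DAG_ID}/dagRuns"
          response = session.post(endpoint, json={"conf": {"file": event["name"]}})
          response.raise_for_status()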
176.      You are planning to use Cloud Storage as part of your data lake
   solution. The Cloud Storage bucket will contain objects ingested from
   external systems. Each object will be ingested once, and the access
   patterns of individual objects will be random. You want to minimize the
   cost of storing and retrieving these objects. You want to ensure that any
   cost optimization efforts are transparent to the users and applications.
   What should you do?
   - Autoclass automatically analyzes access patterns of objects and
   automatically transitions them to the most cost-effective storage class
   within Standard, Nearline, Coldline, or Archive.
   - This eliminates the need for manual intervention or setting specific age
   thresholds.
   - No user or application interaction is required, ensuring transparency.
177.      You have several different file type data sources, such as Apache
   Parquet and CSV. You want to store the data in Cloud Storage. You need to
   set up an object sink for your data that allows you to use your own
   encryption keys. You want to use a GUI-based solution. What should you
   do?
178.      Your business users need a way to clean and prepare data before
   using the data for analysis. Your business users are less technically savvy
   and prefer to work with graphical user interfaces to define their
   transformations. After the data has been transformed, the business users
   want to perform their analysis directly in a spreadsheet. You need to
   recommend a solution that they can use. What should you do?
   - It uses Dataprep to address the need for a graphical interface for data cleaning.
   - It leverages BigQuery for scalable data storage.
   - It employs Connected Sheets to enable analysis directly within a spreadsheet,
   fulfilling all the stated requirements.
179.     You are working on a sensitive project involving private user data.
   You have set up a project on Google Cloud Platform to house your work
   internally. An external consultant is going to assist with coding a complex
   transformation in a Google Cloud Dataflow pipeline for your project. How
   should you maintain users' privacy?
180.      You have two projects where you run BigQuery jobs:
   • One project runs production jobs that have strict completion time SLAs.
   These are high priority jobs that must have the required compute
   resources available when needed. These jobs generally never go below a
   300 slot utilization, but occasionally spike up an additional 500 slots.
   • The other project is for users to run ad-hoc analytical queries. This
   project generally never uses more than 200 slots at a time. You want these
   ad-hoc queries to be billed based on how much data users scan rather
   than by slot capacity.
   You need to ensure that both projects have the appropriate compute
   resources available. What should you do?
182.      You are on the data governance team and are implementing
   security requirements. You need to encrypt all your data in BigQuery by
   using an encryption key managed by your team. You must implement a
   mechanism to generate and store encryption material only on your on-
   premises hardware security module (HSM). You want to rely on Google
   managed solutions. What should you do?
   - Cloud EKM allows you to use encryption keys managed in external key
   management systems, including on-premises HSMs, while using Google
   Cloud services.
   - This means that the key material remains in your control and
   environment, and Google Cloud services use it via the Cloud EKM
   integration.
   - This approach aligns with the need to generate and store encryption
   material only on your on-premises HSM and is the correct way to integrate
   such keys with BigQuery.
183.      You maintain ETL pipelines. You notice that a streaming pipeline
   running on Dataflow is taking a long time to process incoming data, which
   causes output delays. You also noticed that the pipeline graph was
   automatically optimized by Dataflow and merged into one step. You want
   to identify where the potential bottleneck is occurring. What should you
   do?
   From the Dataflow documentation: "There are a few cases in your pipeline
   where you may want to prevent the Dataflow service from performing
   fusion optimizations. These are cases in which the Dataflow service might
   incorrectly guess the optimal way to fuse operations in the pipeline, which
   could limit the Dataflow service's ability to make use of all available
   workers.
   You can insert a Reshuffle step. Reshuffle prevents fusion, checkpoints the
   data, and performs deduplication of records. Reshuffle is supported by
   Dataflow even though it is marked deprecated in the Apache Beam
   documentation."
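   A small sketch of where such a step could sit in the Beam Python SDK (the transforms and
   the parse/enrich functions are placeholders); the Reshuffle breaks fusion so the stages
   before and after it appear separately and can be monitored independently:

      import apache_beam as beam


      def parse(record):
          return record  # placeholder parsing logic


      def enrich(record):
          return record  # placeholder enrichment logic


      with beam.Pipeline() as p:
          (
              p
              | "Read" >> beam.Create(["a", "b", "c"])
              | "Parse" >> beam.Map(parse)
              | "PreventFusion" >> beam.Reshuffle()  # checkpoints data and prevents fusion
              | "Enrich" >> beam.Map(enrich)
          )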
184.      You are running your BigQuery project in the on-demand billing
   model and are executing a change data capture (CDC) process that
   ingests data. The CDC process loads 1 GB of data every 10 minutes into a
   temporary table, and then performs a merge into a 10 TB target table.
   This process is very scan intensive and you want to explore options to
   enable a predictable cost model. You need to create a BigQuery
   reservation based on utilization information gathered from BigQuery
   Monitoring and apply the reservation to the CDC process. What should you
   do?
   The most effective and recommended way to ensure a BigQuery
   reservation applies to your CDC process, which involves multiple jobs and
   potential different datasets/service accounts, is to create the reservation
   at the project level. This guarantees that all BigQuery workloads within
   the project, including your CDC process, will utilize the reserved capacity.
   - Lowest RPO: Time travel offers point-in-time recovery for the past seven
   days by default, providing the shortest possible recovery point objective
   (RPO) among the given options. You can recover data to any state within
   that window.
   - No Additional Costs: Time travel is a built-in feature of BigQuery,
   incurring no extra storage or operational costs.
   - Managed Service: BigQuery handles time travel automatically,
   eliminating manual backup and restore processes.
186.       You are building a streaming Dataflow pipeline that ingests noise
   level data from hundreds of sensors placed near construction sites across
   a city. The sensors measure noise level every ten seconds, and send that
   data to the pipeline when levels reach above 70 dBA. You need to detect
   the average noise level from a sensor when data is received for a duration
   of more than 30 minutes, but the window ends when no data has been
   received for 15 minutes. What should you do?
   To detect average noise levels from sensors, the best approach is to use
   session windows with a 15-minute gap duration (Option A). Session
   windows are ideal for cases like this where the events (sensor data) are
   sporadic. They group events that occur within a certain time interval (15
   minutes in this case), and a new window is started if no data is received
   for the duration of the gap. This matches the requirement to end the
   window when no data is received for 15 minutes, ensuring that the
   average noise level is calculated over periods of continuous data.
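   A minimal sketch of that windowing in the Beam Python SDK (the topic name, parsing, and
   field names are assumptions):

      import json

      import apache_beam as beam
      from apache_beam import window
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(streaming=True)

      with beam.Pipeline(options=options) as p:
          (
              p
              | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/noise-levels")
              | "Parse" >> beam.Map(lambda m: json.loads(m.decode("utf-8")))
              | "KeyBySensor" >> beam.Map(lambda r: (r["sensor_id"], r["dba"]))
              # The window for a sensor closes after 15 minutes without new data.
              | "SessionWindow" >> beam.WindowInto(window.Sessions(gap_size=15 * 60))
              | "AveragePerSensor" >> beam.combiners.Mean.PerKey()
          )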
187.      You are creating a data model in BigQuery that will hold retail
   transaction data. Your two largest tables, sales_transaction_header and
   sales_transaction_line, have a tightly coupled immutable relationship.
   These tables are rarely modified after load and are frequently joined when
   queried. You need to model the sales_transaction_header and
   sales_transaction_line tables to improve the performance of data analytics
   queries. What should you do?
   - Draining the old pipeline ensures that it finishes processing all in-flight
   data before stopping, which prevents data loss and inconsistencies.
   - After draining, you can start the new pipeline, which will begin processing
   new data from where the old pipeline left off.
    - This approach maintains a smooth transition between the old and new
    versions, minimizing latency increases and avoiding data gaps or overlaps.
189.     Your organization's data assets are stored in BigQuery, Pub/Sub, and
   a PostgreSQL instance running on Compute Engine. Because there are
   multiple domains and diverse teams using the data, teams in your
   organization are unable to discover existing data assets. You need to
   design a solution to improve data discoverability while keeping
   development and configuration efforts to a minimum. What should you
   do?
190.     You are building a model to predict whether or not it will rain on a
   given day. You have thousands of input features and want to see if you can
   improve training speed by removing some features while having a
   minimum effect on model accuracy. What can you do?
191.      You need to create a SQL pipeline. The pipeline runs an aggregate
   SQL transformation on a BigQuery table every two hours and appends the
   result to another existing BigQuery table. You need to configure the
   pipeline to retry if errors occur. You want the pipeline to send an email
   notification after three consecutive failures. What should you do?
    Option B leverages the power of Cloud Composer's workflow orchestration
    and the BigQueryInsertJobOperator's capabilities to create a
    straightforward, reliable, and maintainable SQL pipeline that meets all the
    specified requirements, including retries and email notifications after three
    consecutive failures.
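   A hedged sketch of such a DAG (the SQL, table names, schedule, and email address are
   placeholders); BigQueryInsertJobOperator runs the aggregate query every two hours,
   retries on errors, and email_on_failure sends a notification once the retries are
   exhausted:

      from datetime import datetime, timedelta

      from airflow import DAG
      from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

      AGGREGATE_SQL = """
      INSERT INTO `my-project.reporting.two_hourly_totals`
      SELECT TIMESTAMP_TRUNC(order_ts, HOUR) AS bucket, SUM(amount) AS total
      FROM `my-project.sales.orders`
      GROUP BY bucket
      """

      with DAG(
          dag_id="aggregate_sales",
          start_date=datetime(2024, 1, 1),
          schedule_interval=timedelta(hours=2),
          catchup=False,
          default_args={
              "retries": 3,                      # retry the task when errors occur
              "retry_delay": timedelta(minutes=5),
              "email": ["data-team@example.com"],
              "email_on_failure": True,          # notify after the retries are used up
          },
      ) as dag:
          run_aggregation = BigQueryInsertJobOperator(
              task_id="run_aggregate_query",
              configuration={"query": {"query": AGGREGATE_SQL, "useLegacySql": False}},
          )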
   - It makes the tag template public, enabling all employees to search for
   tables based on the tags without needing extra permissions.
   - It directly grants BigQuery data access to the HR group only on the
   necessary tables, minimizing configuration overhead and ensuring
   compliance with the restricted data access requirement.
   By combining public tag visibility with targeted BigQuery permissions,
   Option C provides the most straightforward and least complex way to
   achieve the desired access control and searchability for your BigQuery
   data and Data Catalog tags.
194.      You are creating the CI/CD cycle for the code of the directed acyclic
   graphs (DAGs) running in Cloud Composer. Your team has two Cloud
   Composer instances: one instance for development and another instance
   for production. Your team is using a Git repository to maintain and develop
   the code of the DAGs. You want to deploy the DAGs automatically to Cloud
   Composer when a certain tag is pushed to the Git repository. What should
   you do?
   - It uses Cloud Build to automate the deployment process based on Git tags.
   - It directly deploys DAG code to the Cloud Storage buckets used by Cloud
   Composer, eliminating the need for additional infrastructure.
   - It aligns with the recommended approach for managing DAGs in Cloud
   Composer.
   By leveraging Cloud Build and Cloud Storage, Option A minimizes the
   configuration overhead and complexity while providing a robust and
   automated CI/CD pipeline for your Cloud Composer DAGs.
195.     You have a BigQuery table that ingests data directly from a Pub/Sub
   subscription. The ingested data is encrypted with a Google-managed
   encryption key. You need to meet a new organization policy that requires
   you to use keys from a centralized Cloud Key Management Service (Cloud
   KMS) project to encrypt data at rest. What should you do?
197.     You are designing a Dataflow pipeline for a batch processing job.
   You want to mitigate multiple zonal failures at job submission time. What
   should you do?
198.      You are designing a real-time system for a ride hailing app that
   identifies areas with high demand for rides to effectively reroute available
   drivers to meet the demand. The system ingests data from multiple
   sources to Pub/Sub, processes the data, and stores the results for
   visualization and analysis in real-time dashboards. The data sources
   include driver location updates every 5 seconds and app-based booking
   events from riders. The data processing involves real-time aggregation of
   supply and demand data for the last 30 seconds, every 2 seconds, and
   storing the results in a low-latency system for visualization. What should
   you do?
   Because the requirement is to aggregate the last 30 seconds of supply and demand data
   every 2 seconds, the windows must overlap: a hopping (sliding) window with a 30-second
   window size and a 2-second period emits an updated 30-second aggregate every 2 seconds,
   which is what the real-time dashboards need.
   Tumbling windows do not overlap, so they cannot satisfy this requirement: a 2-second
   tumbling window covers only 2 seconds of data, and a 30-second tumbling window produces
   results only every 30 seconds. The aggregated results can then be written to a
   low-latency store for visualization.
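   A minimal sketch of that hopping window in the Beam Python SDK (topic and field names are
   assumptions): a 30-second window that advances every 2 seconds, keyed by pickup zone:

      import json

      import apache_beam as beam
      from apache_beam import window
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(streaming=True)

      with beam.Pipeline(options=options) as p:
          (
              p
              | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/booking-events")
              | "Parse" >> beam.Map(lambda m: json.loads(m.decode("utf-8")))
              | "KeyByZone" >> beam.Map(lambda e: (e["zone_id"], 1))
              | "HoppingWindow" >> beam.WindowInto(window.SlidingWindows(size=30, period=2))
              | "CountDemandPerZone" >> beam.CombinePerKey(sum)
          )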
   Side Output for Failed Messages: Dataflow allows you to use side outputs
   to handle messages that fail processing. In your DoFn, you can catch
   exceptions and write the failed messages to a separate PCollection. This
   PCollection can then be written to a new Pub/Sub topic.
   New Pub/Sub Topic for Monitoring: Creating a dedicated Pub/Sub topic for
   failed messages allows you to monitor it specifically for alerting purposes.
   This provides a clear view of any issues with your business logic.
   topic/num_unacked_messages_by_region Metric: This Cloud Monitoring
   metric tracks the number of unacknowledged messages in a Pub/Sub
   topic. By monitoring this metric on your new topic, you can identify when
   messages are failing to be processed correctly.
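   A hedged sketch of that pattern in the Beam Python SDK (topic names and the business
   logic are placeholders): records that raise an exception are tagged and routed to a
   dead-letter PCollection, which is published to a dedicated topic for monitoring:

      import json

      import apache_beam as beam
      from apache_beam import pvalue
      from apache_beam.options.pipeline_options import PipelineOptions


      class ApplyBusinessLogic(beam.DoFn):
          FAILED = "failed"

          def process(self, message):
              try:
                  record = json.loads(message.decode("utf-8"))
                  yield record  # placeholder for the real business logic
              except Exception:
                  # Route the raw message to the dead-letter output instead of failing the bundle.
                  yield pvalue.TaggedOutput(self.FAILED, message)


      options = PipelineOptions(streaming=True)

      with beam.Pipeline(options=options) as p:
          results = (
              p
              | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/input")
              | "Process" >> beam.ParDo(ApplyBusinessLogic()).with_outputs(
                  ApplyBusinessLogic.FAILED, main="ok"
              )
          )

          # Failed messages land on a topic that is monitored for alerting.
          _ = results[ApplyBusinessLogic.FAILED] | "WriteDeadLetter" >> beam.io.WriteToPubSub(
              topic="projects/my-project/topics/failed-messages"
          )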
200.      You want to store your team’s shared tables in a single dataset to
   make data easily accessible to various analysts. You want to make this
   data readable but unmodifiable by analysts. At the same time, you want to
   provide the analysts with individual workspaces in the same project, where
   they can create and store tables for their own use, without the tables
   being accessible by other analysts. What should you do?
   You want to improve the performance of this data read. What should you
   do?
   This function exports the whole table to temporary files in Google Cloud
   Storage, where it will later be read from.
   This requires almost no computation, as it only performs an export job,
   and later Dataflow reads from GCS (not from BigQuery).
   BigQueryIO.read.fromQuery() executes a query and then reads the results
   received after the query execution. Therefore, this function is more time-
   consuming, given that it requires that a query is first executed (which will
   incur the corresponding economic and computational costs).
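   The explanation above refers to the Java BigQueryIO connector; a hedged sketch of the
   Python equivalent (table, query, and bucket names are assumptions) shows the same
   distinction between an export-based table read and a query-based read:

      import apache_beam as beam

      with beam.Pipeline() as p:
          # Export-based read: the table is exported to temporary files and read from there.
          rows = p | "ReadTable" >> beam.io.ReadFromBigQuery(
              table="my-project:sales.order_events",
              gcs_location="gs://my-bucket/bq-export-temp",
          )

          # Query-based read: a query job runs first, and its results are then read.
          filtered = p | "ReadQuery" >> beam.io.ReadFromBigQuery(
              query="SELECT * FROM `my-project.sales.order_events` WHERE amount > 100",
              use_standard_sql=True,
              gcs_location="gs://my-bucket/bq-export-temp",
          )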
202.      You are running a streaming pipeline with Dataflow and are using
   hopping windows to group the data as the data arrives. You noticed that
   some data is arriving late but is not being marked as late data, which is
   resulting in inaccurate aggregations downstream. You need to find a
   solution that allows you to capture the late data in the appropriate
   window. What should you do?
203.      You work for a large ecommerce company. You store your
   customer's order data in Bigtable. You have a garbage collection policy set
   to delete the data after 30 days and the number of versions is set to 1.
   When the data analysts run a query to report total customer spending, the
   analysts sometimes see customer data that is older than 30 days. You
   need to ensure that the analysts do not see customer data older than 30
   days while minimizing cost and overhead. What should you do?
204.     You are using a Dataflow streaming job to read messages from a
   message bus that does not support exactly-once delivery. Your job then
   applies some transformations, and loads the result into BigQuery. You want
   to ensure that your data is being streamed into BigQuery with exactly-
   once delivery semantics. You expect your ingestion throughput into
   BigQuery to be about 1.5 GB per second. What should you do?
   This approach directly addresses the issue by filtering out data older than
   30 days at query time, ensuring that only the relevant data is retrieved. It
   avoids the overhead and potential delays associated with garbage
   collection and manual deletion processes
205.       You have created an external table for Apache Hive partitioned data
   that resides in a Cloud Storage bucket, which contains a large number of
   files. You notice that queries against this table are slow. You want to
   improve the performance of these queries. What should you do?
   - BigLake Table: BigLake allows for more efficient querying of data lakes
   stored in Cloud Storage. It can handle large datasets more effectively than
   standard external tables.
   - Metadata Caching: Enabling metadata caching can significantly improve
   query performance by reducing the time taken to read and process
   metadata from a large number of files.
206.      You have a network of 1000 sensors. The sensors generate time
   series data: one metric per sensor per second, along with a timestamp.
   You already have 1 TB of data, and expect the data to grow by 1 GB every
   day. You need to access this data in two ways. The first access pattern
   requires retrieving the metric from one specific sensor stored at a specific
   timestamp, with a median single-digit millisecond latency. The second
   access pattern requires running complex analytic queries on the data,
   including joins, once a day. How should you store this data?
    - Bigtable excels at incredibly fast lookups by row key, often reaching
    single-digit millisecond latencies.
    - Constructing the row key with sensor ID and timestamp enables efficient
    retrieval of specific sensor readings at exact timestamps.
    - Bigtable's wide-column design effectively stores time series data,
    allowing for flexible addition of new metrics without schema changes.
    - Bigtable scales horizontally to accommodate massive datasets
    (petabytes or more), easily handling the expected data growth.
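   A hedged sketch of writing and reading with such a row key (instance, table, and column
   family names are assumptions):

      import datetime

      from google.cloud import bigtable

      client = bigtable.Client(project="my-project")
      table = client.instance("sensors-instance").table("sensor_metrics")

      sensor_id = "sensor-0042"
      event_time = datetime.datetime(2024, 5, 1, 12, 0, 30, tzinfo=datetime.timezone.utc)

      # Combining sensor ID and timestamp in the row key allows a single-row point lookup.
      row_key = f"{sensor_id}#{event_time.isoformat()}".encode("utf-8")

      row = table.direct_row(row_key)
      row.set_cell("metrics", b"temperature", b"21.7", timestamp=event_time)
      row.commit()

      # Retrieve the metric for that exact sensor and timestamp.
      fetched = table.read_row(row_key)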
207.     You have 100 GB of data stored in a BigQuery table. This data is
   outdated and will only be accessed one or two times a year for analytics
   with SQL. For backup purposes, you want to store this data to be
   immutable for 3 years. You want to minimize storage costs. What should
   you do?
208.     You have thousands of Apache Spark jobs running in your on-
   premises Apache Hadoop cluster. You want to migrate the jobs to Google
   Cloud. You want to use managed services to run your jobs instead of
   maintaining a long-lived Hadoop cluster yourself. You have a tight timeline
   and want to keep code changes to a minimum. What should you do?
   Dataproc is the most suitable choice for migrating your existing Apache
   Spark jobs to Google Cloud because it is a fully managed service that
   supports Apache Spark and Hadoop workloads with minimal changes to
   your existing code. Moving your data to Cloud Storage and running jobs on
   Dataproc offers a fast, efficient, and scalable solution for your needs.
209.      You are administering shared BigQuery datasets that contain views
   used by multiple teams in your organization. The marketing team is
   concerned about the variability of their monthly BigQuery analytics spend
   using the on-demand billing model. You need to help the marketing team
   establish a consistent BigQuery analytics spend each month. What should
   you do?
   This option provides the marketing team with a predictable monthly cost
   by reserving a fixed number of slots, ensuring that they have dedicated
   resources without the variability introduced by autoscaling or on-demand
   pricing. This setup also simplifies budgeting and financial planning for the
   marketing team, as they will have a consistent expense each month.
211.      You have data located in BigQuery that is used to generate reports
   for your company. You have noticed some weekly executive report fields
   do not correspond to format according to company standards. For
   example, report errors include different telephone formats and different
   country code identifiers. This is a frequent issue, so you need to create a
   recurring job to normalize the data. You want a quick solution that requires
   no coding. What should you do?
212.       Your company is streaming real-time sensor data from their factory
   floor into Bigtable and they have noticed extremely poor performance.
   How should the row key be redesigned to improve Bigtable performance
   on queries that populate real-time dashboards?
   - It enables efficient range scans for retrieving data for specific sensors
   across time.
   - It distributes writes to prevent hotspots and maintain write performance.
   - It ensures data locality for recent queries, improving read performance for
   real-time dashboards.
   By using <sensorid>#<timestamp> as the row key structure, you
   optimize Bigtable for the specific access patterns of your real-time
   dashboards, resulting in improved query performance and a better user
   experience.
216.      Your organization is modernizing their IT services and migrating to Google
   Cloud. You need to organize the data that will be stored in Cloud Storage
   and BigQuery. You need to enable a data mesh approach to share the data
   between sales, product design, and marketing departments. What should
   you do?
217.      You work for a large ecommerce company. You are using Pub/Sub to
   ingest the clickstream data to Google Cloud for analytics. You observe that
   when a new subscriber connects to an existing topic to analyze data, they
   are unable to subscribe to older data. For an upcoming yearly sale event in
   two months, you need a solution that, once implemented, will enable any
   new subscriber to read the last 30 days of data. What should you do?
   - Topic Retention Policy: This policy determines how long messages are
   retained by Pub/Sub after they are published, even if they have not been
   acknowledged (consumed) by any subscriber.
   - 30 Days Retention: By setting the retention policy of the topic to 30 days,
   all messages published to this topic will be available for consumption for
   30 days. This means any new subscriber connecting to the topic can
   access and analyze data from the past 30 days.
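   A hedged sketch of applying that retention policy with the Pub/Sub client library
   (project and topic names are assumptions):

      from google.cloud import pubsub_v1
      from google.protobuf import duration_pb2, field_mask_pb2

      publisher = pubsub_v1.PublisherClient()
      topic_path = publisher.topic_path("my-project", "clickstream")

      thirty_days = duration_pb2.Duration(seconds=30 * 24 * 60 * 60)

      # After this update, new subscriptions can seek back over the retained 30 days of data.
      publisher.update_topic(
          request={
              "topic": {"name": topic_path, "message_retention_duration": thirty_days},
              "update_mask": field_mask_pb2.FieldMask(paths=["message_retention_duration"]),
          }
      )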
218.      You are designing the architecture to process your data from Cloud
   Storage to BigQuery by using Dataflow. The network team provided you
   with the Shared VPC network and subnetwork to be used by your
   pipelines. You need to enable the deployment of the pipeline on the
   Shared VPC network. What should you do?
   Shared VPC and Network Access: When using a Shared VPC, you need to
   grant specific permissions to service accounts in the service project
   (where your Dataflow pipeline runs) to access resources in the host
   project's network.
   compute.networkUser Role: This role grants the necessary permissions for
   a service account to use the network resources in the Shared VPC. This
   includes accessing subnets, creating instances, and communicating with
   other services within the network.
   Service Account for Pipeline Execution: The service account that executes
   your Dataflow pipeline is the one that needs these network permissions.
   This is because the Dataflow service uses this account to create and
   manage worker instances within the Shared VPC network.
222.      You have an upstream process that writes data to Cloud Storage.
   This data is then read by an Apache Spark job that runs on Dataproc.
   These jobs are run in the us-central1 region, but the data could be stored
   anywhere in the United States. You need to have a recovery process in
   place in case of a catastrophic single region failure. You need an approach
   with a maximum of 15 minutes of data loss (RPO=15 mins). You want to
   ensure that there is minimal latency when reading the data. What should
   you do?
   Normalizing the database into separate Patients and Visits tables, along
   with creating other necessary tables, is the best solution for handling the
   increased data size while ensuring efficient query performance and
   maintainability. This approach addresses the root problem instead of
   applying temporary fixes.
224.      Your company's customer and order databases are often under
   heavy load. This makes performing analytics against them difficult without
   harming operations. The databases are in a MySQL cluster, with nightly
   backups taken using mysqldump. You want to perform analytics with
   minimal impact on operations. What should you do?
   - Aligns with ELT Approach: Dataform is designed for ELT (Extract, Load,
   Transform) pipelines, directly executing SQL transformations within
   BigQuery, matching the developers' preference.
   - SQL as Code: It enables developers to write and manage SQL
   transformations as code, promoting version control, collaboration, and
   testing.
   - Intuitive Coding Environment: Dataform provides a user-friendly interface
   and familiar SQL syntax, making it easy for SQL-proficient developers to
   adopt.
   - Scheduling and Orchestration: It includes built-in scheduling capabilities
   to automate pipeline execution, simplifying pipeline management.
227.      You work for a farming company. You have one BigQuery table
   named sensors, which is about 500 MB and contains the list of your 5000
   sensors, with columns for id, name, and location. This table is updated
   every hour. Each sensor generates one metric every 30 seconds along
   with a timestamp, which you want to store in BigQuery. You want to run an
   analytical query on the data once a week for monitoring purposes. You
   also want to minimize costs. What data model should you use?
   This approach offers several advantages:
   - Cost Efficiency: Partitioning the metrics table by timestamp helps reduce
   query costs by allowing BigQuery to scan only the relevant partitions.
   - Data Organization: Keeping metrics in a separate table maintains a clear
   separation between sensor metadata and sensor metrics, making it easier
   to manage and query the data.
   - Performance: Using INSERT statements to append new metrics ensures
   efficient data ingestion without the overhead of frequent updates.
228.     You are managing a Dataplex environment with raw and curated
   zones. A data engineering team is uploading JSON and CSV files to a
   bucket asset in the curated zone but the files are not being automatically
   discovered by Dataplex. What should you do to ensure that the files are
   discovered by Dataplex?
   Raw zones store structured data, semi-structured data such as CSV files
   and JSON files, and unstructured data in any format from external sources.
   Curated zones store structured data. Data can be stored in Cloud Storage
   buckets or BigQuery datasets. Supported formats for Cloud Storage
   buckets include Parquet, Avro, and ORC.
229.      You have a table that contains millions of rows of sales data,
   partitioned by date. Various applications and users query this data many
   times a minute. The query requires aggregating values by using AVG,
   MAX, and SUM, and does not require joining to other tables. The required
   aggregations are only computed over the past year of data, though you
   need to retain full historical data in the base tables. You want to ensure
   that the query results always include the latest data from the tables, while
   also reducing computation cost, maintenance overhead, and duration.
   What should you do?
   - Using the Cloud SQL Auth proxy is a recommended method for secure
   connections, especially when dealing with dynamic IP addresses.
   - The Auth proxy provides secure access to your Cloud SQL instance
   without the need for Authorized Networks or managing IP addresses.
   - It works by encapsulating database traffic and forwarding it through a
   secure tunnel, using Google's IAM for authentication.
   - Leaving the Authorized Networks empty means you're not allowing any
   direct connections based on IP addresses, relying entirely on the Auth
   proxy for secure connectivity. This is a secure and flexible solution,
   especially for applications with dynamic IPs.
233.       You are migrating a large number of files from a public HTTPS
   endpoint to Cloud Storage. The files are protected from unauthorized
   access using signed URLs. You created a TSV file that contains the list of
   object URLs and started a transfer job by using Storage Transfer Service.
   You notice that the job has run for a long time and eventually failed.
   Checking the logs of the transfer job reveals that the job was running fine
   until one point, and then it failed due to HTTP 403 errors on the remaining
   files. You verified that there were no changes to the source system. You
   need to fix the problem to resume the migration process. What should you
   do?
   HTTP 403 errors: These errors indicate unauthorized access, but since you
   verified the source system and signed URLs, the issue likely lies with
   expired signed URLs. Renewing the URLs with a longer validity period
   prevents this issue for the remaining files.
   Separate jobs: Splitting the file into smaller chunks and submitting them
   as separate jobs improves parallelism and potentially speeds up the
   transfer process.
   Avoid manual intervention: Options A and D require manual intervention
   and complex setups, which are less efficient and might introduce risks.
   Longer validity: While option B addresses expired URLs, splitting the file
   offers additional benefits for faster migration.
234.     You work for an airline and you need to store weather data in a
   BigQuery table. Weather data will be used as input to a machine learning
   model. The model only uses the last 30 days of weather data. You want to
   avoid storing unnecessary data and minimize costs. What should you do?
   - It uses partitioning to improve query performance when selecting data
   within a date range.
   - It automates data deletion through partition expiration, ensuring that only
   the necessary data is stored.
   By using a partitioned table with partition expiration, you can effectively
   manage your weather data in BigQuery, optimize query performance, and
   minimize storage costs.
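   A hedged sketch of creating such a table with the BigQuery client library (dataset,
   table, and column names are assumptions):

      from google.cloud import bigquery

      client = bigquery.Client()

      table = bigquery.Table(
          "my-project.weather.observations",
          schema=[
              bigquery.SchemaField("observation_date", "DATE"),
              bigquery.SchemaField("station_id", "STRING"),
              bigquery.SchemaField("temperature", "FLOAT"),
          ],
      )
      table.time_partitioning = bigquery.TimePartitioning(
          type_=bigquery.TimePartitioningType.DAY,
          field="observation_date",
          expiration_ms=30 * 24 * 60 * 60 * 1000,  # partitions older than 30 days are dropped
      )

      client.create_table(table)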
235.     You have Google Cloud Dataflow streaming pipeline running with a
   Google Cloud Pub/Sub subscription as the source. You need to make an
   update to the code that will make the new Cloud Dataflow pipeline
   incompatible with the current version. You do not want to lose any data
   when making this update. What should you do?
   - It leverages the drain flag to ensure that all data is processed before the
   pipeline is shut down for the update.
   - It allows for a seamless transition to the updated pipeline without any data
   loss.
   By using the drain flag, you can safely update your Dataflow pipeline with
   incompatible changes while preserving data integrity.
236.      You need to look at BigQuery data from a specific table multiple
   times a day. The underlying table you are querying is several petabytes in
   size, but you want to filter your data and provide simple aggregations to
   downstream users. You want to run queries faster and get up-to-date
   insights quicker. What should you do?
   - It provides the best query performance by storing pre-computed results.
   - It offers up-to-date insights through automatic refresh capabilities.
   - It can be more cost-effective than repeatedly querying the large table.
   By creating a materialized view, you can significantly improve query
   performance and get up-to-date insights faster, while reducing the load on
   your BigQuery table.
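   A hedged sketch of such a materialized view created through the BigQuery client library
   (project, dataset, and column names are assumptions):

      from google.cloud import bigquery

      client = bigquery.Client()

      ddl = """
      CREATE MATERIALIZED VIEW `my-project.analytics.daily_event_counts` AS
      SELECT
        event_date,
        event_type,
        COUNT(*) AS event_count
      FROM `my-project.analytics.events`
      WHERE event_type IN ('purchase', 'refund')
      GROUP BY event_date, event_type
      """

      client.query(ddl).result()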
240.      You are configuring networking for a Dataflow job. The data pipeline
   uses custom container images with the libraries that are required for the
   transformation logic preinstalled. The data pipeline reads the data from
   Cloud Storage and writes the data to BigQuery. You need to ensure cost-
   effective and secure communication between the pipeline and Google APIs
   and services. What should you do?
   This approach ensures that your worker VMs can access Google APIs and
   services securely without using external IP addresses, which reduces costs
   and enhances security by keeping the traffic within Google's network
241.      You are using Workflows to call an API that returns a 1KB JSON
   response, apply some complex business logic on this response, wait for
   the logic to complete, and then perform a load from a Cloud Storage file to
   BigQuery. The Workflows standard library does not have sufficient
   capabilities to perform your complex logic, and you want to use Python's
   standard library instead. You want to optimize your workflow for simplicity
   and speed of execution. What should you do?
   Using a Cloud Function allows you to run your Python code in a serverless
   environment, which simplifies deployment and management. It also
   ensures quick execution and scalability, as Cloud Functions can handle the
   processing of your JSON response efficiently
242.      You are administering a BigQuery on-demand environment. Your
   business intelligence tool is submitting hundreds of queries each day that
   aggregate a large (50 TB) sales history fact table at the day and month
   levels. These queries have a slow response time and are exceeding cost
   expectations. You need to decrease response time, lower query costs, and
   minimize maintenance. What should you do?
243.      You have several different unstructured data sources, within your
   on-premises data center as well as in the cloud. The data is in various
   formats, such as Apache Parquet and CSV. You want to centralize this data
   in Cloud Storage. You need to set up an object sink for your data that
   allows you to use your own encryption keys. You want to use a GUI-based
   solution. What should you do?
244.      You are using BigQuery with a regional dataset that includes a table
   with the daily sales volumes. This table is updated multiple times per day.
   You need to protect your sales table in case of regional failures with a
   recovery point objective (RPO) of less than 24 hours, while keeping costs
    to a minimum. What should you do?
    This approach ensures that sensitive data elements are protected through
    masking, which meets data privacy requirements. At the same time, it
    retains the data in a usable form for future analyses
247.      Your software uses a simple JSON format for all messages. These
   messages are published to Google Cloud Pub/Sub, then processed with
   Google Cloud Dataflow to create a real-time dashboard for the CFO. During
   testing, you notice that some messages are missing in the dashboard. You
   check the logs, and all messages are being published to Cloud Pub/Sub
   successfully. What should you do next?
   This will allow you to determine if the issue is with the pipeline or with the
   dashboard application. By analyzing the output, you can see if the
   messages are being processed correctly and determine if there are any
   discrepancies or missing messages. If the issue is with the pipeline, you
   can then debug and make any necessary updates to ensure that all
   messages are processed correctly. If the issue is with the dashboard
   application, you can then focus on resolving that issue. This approach
   allows you to isolate and identify the root cause of the missing messages
   in a controlled and efficient manner.
Company Background –
The company started as a regional trucking company, and then expanded into other
logistics markets. Because they have not updated their infrastructure, managing and
tracking orders and shipments has become a bottleneck. To improve operations,
Flowlogistic developed proprietary technology for tracking shipments in real time at
the parcel level. However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic
wants to further analyze their orders and shipments to determine how best to deploy
their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system that
indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain both
structured and unstructured data, to determine how best to deploy resources and which
markets to expand into. They also want to use predictive analytics to learn earlier
when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) for SQL Server storage
- Network-attached storage (NAS) for image storage, logs, and backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts
Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of production
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid provisioning of new
resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing demands of the
company
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud environment
CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is really
hampering further growth and efficiency. We are efficient at moving shipments around
the world, but we are inefficient at moving data around. We need to organize our
information so we can more easily understand where our customers are and what they are
shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not invested
enough in our technology. I have a good staff to manage IT, but they are so busy
managing our infrastructure that I cannot get them to do the things that really matter,
such as organizing our data, building the analytics, and figuring out how to implement
the CFO's tracking technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late shipments and
deliveries. Knowing where our shipments are at all times has a direct correlation to
our bottom line and profitability. Additionally, I don't want to commit capital to
building out a server environment.
   Flowlogistic wants to use Google BigQuery as their primary analysis
   system, but they still have Apache Hadoop and Spark workloads that they
   cannot move to BigQuery. Flowlogistic does not know how to store the
   data that is common to both workloads. What should they do?
   Company Background –
The company started as a regional trucking company, and then expanded
into other logistics markets. Because they have not updated their
infrastructure, managing and tracking orders and shipments has become a
bottleneck. To improve operations, Flowlogistic developed proprietary
technology for tracking shipments in real time at the parcel level.
However, they are unable to deploy it because their technology stack,
based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking system
that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources, which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) – SQL Server storage
- Network-attached storage (NAS) image storage, logs, backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑ Connect a VPN between the production data center and cloud
environment
CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO's tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic's CEO wants to gain rapid insight into their customer base so
his sales team can be better informed in the field. This team is not very
technical, so they've purchased a visualization tool to simplify the creation
of BigQuery reports. However, they've been overwhelmed by all the data
in the table, and are spending a lot of money on queries trying to find the
data they need. You want to solve their problem in the most cost-effective
way. What should you do?
   Creating a view in BigQuery allows you to define a virtual table that is a
   subset of the original data, containing only the necessary columns or
   filtered data that the sales team requires for their reports. This approach is
   cost-effective because it doesn't involve exporting data to external tools or
   creating additional tables, and it ensures that the sales team is working
   with the specific data they need without running expensive queries on the
   full dataset. It simplifies the data for non-technical users while keeping the
   data in BigQuery, which is a powerful and cost-efficient data warehousing
   solution.
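For illustration, here is a minimal Python sketch of this approach using the
google-cloud-bigquery client. The project, dataset, table, and column names are
hypothetical placeholders, not Flowlogistic's actual schema.

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# The view exposes only the columns the sales team actually needs;
# no data is copied, so there is no extra storage cost.
view = bigquery.Table("my-project.sales_reporting.customer_summary")  # hypothetical names
view.view_query = """
    SELECT customer_id, region, total_shipments, last_order_date
    FROM `my-project.warehouse.customers`
    WHERE is_active = TRUE
"""
view = client.create_table(view)
print("Created view", view.full_table_id)

The sales team's visualization tool then points at the view, so their queries
scan only the columns and rows the view exposes.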
   Company Background –
   Founded by experienced telecom executives, MJTelco uses technologies
   originally developed to overcome communications challenges in space.
   Fundamental to their operation, they need to create a distributed data
   infrastructure that drives real-time analysis and incorporates machine
   learning to continuously optimize their topologies. Because their hardware
   is inexpensive, they plan to overdeploy the network allowing them to
   account for the impact of dynamic regional politics on location availability
   and cost. Their management and operations teams are situated all around
   the globe, creating a many-to-many relationship between data consumers
   and providers in their system. After careful consideration, they decided
   public cloud is the perfect environment to support their needs.
   Solution Concept –
   MJTelco is running a successful proof-of-concept (PoC) project in its labs.
   They have two primary needs:
   ✑ Scale and harden their PoC to support significantly more data flows
   generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
   You need to compose visualizations for operations teams with the
   following requirements:
   ✑ The report must include telemetry data from all 50,000 installations for
   the most recent 6 weeks (sampling once every minute).
   ✑ The report must not be more than 3 hours delayed from live data.
   ✑ The actionable report should only show suboptimal links.
   ✑ Most suboptimal links should be sorted to the top.
   ✑ Suboptimal links can be grouped and filtered by regional geography.
   ✑ User response time to load the report must be <5 seconds.
   Which approach meets the requirements?
   Loading the data into BigQuery and using Data Studio 360 provides the
   best balance of scalability, performance, ease of use, and functionality to
   meet MJTelco's visualization requirements.
254.      You create an important report for your large team in Google Data
   Studio 360. The report uses Google BigQuery as its data source. You notice
   that visualizations are not showing data that is less than 1 hour old. What
   should you do?
   The most direct and effective way to ensure your Data Studio report shows
   the latest data (less than 1 hour old) is to disable caching in the report
   settings. This will force Data Studio to query BigQuery for fresh data each
   time the report is accessed.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
   Our business model relies on our patents, analytics and dynamic machine
   learning. Our inexpensive hardware is organized to be highly reliable,
   which gives us cost advantages. We need to quickly stabilize our large
   distributed data pipelines to meet our reliability and capacity
   commitments.
   CTO Statement –
   Our public cloud services must operate as advertised. We need resources
   that scale and keep our data secure. We also need environments in which
   our data scientists can carefully study and quickly adapt our models.
   Because we rely on automation to process our data, we also need our
   development and test environments to work as we iterate.
   CFO Statement –
   The project is too large for us to maintain the hardware and software
   required for the data and analysis. Also, we cannot afford to staff an
   operations team to monitor so many data feeds, so we will rely on
   automation and infrastructure. Google Cloud's machine learning will allow
   our quantitative researchers to work on our high-value problems instead of
   problems with our data pipelines.
   You create a new report for your large team in Google Data Studio 360.
   The report uses Google BigQuery as its data source. It is company policy
   to ensure employees can view only the data associated with their region,
   so you create and populate a table for each region. You need to enforce
   the regional access policy to the data. Which two actions should you take?
   (Choose two.)
   Organize your tables into regional datasets and then grant view access on
   those datasets to the appropriate regional security groups. This ensures
   that users only have access to the data relevant to their region.
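A rough Python sketch of granting a regional group read access to a regional
dataset with the google-cloud-bigquery client; the dataset name and group
address are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical regional dataset and Google Group used as the security group.
dataset = client.get_dataset("my-project.sales_emea")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                       # view-only access to the dataset
        entity_type="groupByEmail",
        entity_id="emea-sales@example.com",  # hypothetical regional group
    )
)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ["access_entries"])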
   Company Background –
Founded by experienced telecom executives, MJTelco uses technologies
originally developed to overcome communications challenges in space.
Fundamental to their operation, they need to create a distributed data
infrastructure that drives real-time analysis and incorporates machine
learning to continuously optimize their topologies. Because their hardware
is inexpensive, they plan to overdeploy the network allowing them to
account for the impact of dynamic regional politics on location availability
and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
   which gives us cost advantages. We need to quickly stabilize our large
   distributed data pipelines to meet our reliability and capacity
   commitments.
   CTO Statement –
   Our public cloud services must operate as advertised. We need resources
   that scale and keep our data secure. We also need environments in which
   our data scientists can carefully study and quickly adapt our models.
   Because we rely on automation to process our data, we also need our
   development and test environments to work as we iterate.
   CFO Statement –
   The project is too large for us to maintain the hardware and software
   required for the data and analysis. Also, we cannot afford to staff an
   operations team to monitor so many data feeds, so we will rely on
   automation and infrastructure. Google Cloud's machine learning will allow
   our quantitative researchers to work on our high-value problems instead of
   problems with our data pipelines.
   MJTelco needs you to create a schema in Google Bigtable that will allow for
   the historical analysis of the last 2 years of records. Each record that
   comes in is sent every 15 minutes, and contains a unique identifier of the
   device and a data record. The most common query is for all the data for a
   given device for a given day.
   Which schema should you use?
   Optimized for Most Common Query: The most common query is for all data
   for a given device on a given day. This schema directly matches the query
   pattern by including both date and device_id in the row key. This enables
   efficient retrieval of the required data using a single row key prefix scan.
   Scalability: As the number of devices and data points increases, this
   schema distributes the data evenly across nodes in the Bigtable cluster,
   avoiding hotspots and ensuring scalability.
   Data Organization: By storing data points as column values within each
   row, you can easily add new data points or timestamps without modifying
   the table structure.
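As a sketch of how such a schema is queried, the following Python snippet uses
the google-cloud-bigtable client to do a single prefix scan for one device on
one day. The instance, table, column family, and the exact row-key layout
(date#device_id#time) are assumptions for illustration only.

from google.cloud import bigtable  # pip install google-cloud-bigtable
from google.cloud.bigtable.row_set import RowSet

client = bigtable.Client(project="my-project")              # hypothetical project
table = client.instance("telemetry").table("device_data")   # hypothetical instance/table

# Assumed row key layout: "<YYYYMMDD>#<device_id>#<HHMM>", so all records
# for one device on one day share a common prefix.
row_set = RowSet()
row_set.add_row_range_with_prefix("20240115#device-4711#")  # hypothetical day/device

for row in table.read_rows(row_set=row_set):
    cell = row.cells["measurements"][b"reading"][0]         # hypothetical family/qualifier
    print(row.row_key, cell.value)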
257.      Your company has recently grown rapidly and is now ingesting data at
   a significantly higher rate than it was previously. You manage the daily
   batch MapReduce analytics jobs in Apache Hadoop. However, the recent
   increase in data has meant the batch jobs are falling behind. You were
   asked to recommend ways the development team could increase the
   responsiveness of the analytics without increasing costs. What should you
   recommend they do?
   Both Pig and Spark require rewriting the code, so there is additional
   overhead either way, but as an architect you should think about a
   long-lasting solution. Resizing the Hadoop cluster can resolve the problem
   for the workloads at that point in time, but not in the longer run. Spark is
   therefore the right choice: although there is an upfront cost, it will be a
   long-lasting solution.
258.       You work for a large fast food restaurant chain with over 400,000
   employees. You store employee information in Google BigQuery in a Users
   table consisting of a FirstName field and a LastName field. A member of IT
   is building an application and asks you to modify the schema and data in
   BigQuery so the application can query a FullName field consisting of the
   value of the FirstName field concatenated with a space, followed by the
   value of the LastName field for each employee. How can you make that
   data available while minimizing cost?
259.      You are deploying a new storage system for your mobile application,
   which is a media streaming service. You decide the best fit is Google Cloud
   Datastore. You have entities with multiple properties, some of which can
   take on multiple values. For example, in the entity 'Movie' the property
   'actors' and the property 'tags' have multiple values but the property 'date
   released' does not. A typical query would ask for all movies with
   actor=<actorname> ordered by date_released or all movies with
   tag=Comedy ordered by date_released. How should you avoid a
   combinatorial explosion in the number of indexes?
260.      You work for a manufacturing plant that batches application log files
   together into a single log file once a day at 2:00 AM. You have written a
   Google Cloud Dataflow job to process that log file. You need to make sure
   the log file is processed once per day as inexpensively as possible. What
   should you do?
   Using the Google App Engine Cron Service to run the Cloud Dataflow job
   allows you to automate the execution of the job. By creating a cron job,
   you can ensure that the Dataflow job is triggered exactly once per day at a
   specified time. This approach is automated, reliable, and fits the
   requirement of processing the log file once per day.
261.      You work for an economic consulting firm that helps companies
   identify economic trends as they happen. As part of your analysis, you use
   Google BigQuery to correlate customer data with the average prices of the
   100 most common goods sold, including bread, gasoline, milk, and others.
   The average prices of these goods are updated every 30 minutes. You
   want to make sure this data stays up to date so you can combine it with
   other data in BigQuery as cheaply as possible. What should you do?
   In summary, option B provides the most efficient and cost-
   effective way to keep your economic data up-to-date in BigQuery
   while minimizing overhead. You store the frequently changing data in a
   cheaper storage service (Cloud Storage) and then use BigQuery's ability to
   query data directly from that storage (federated tables) to combine it with
   your other data. This avoids the need for constant, expensive data loading
   into BigQuery.
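A minimal Python sketch of defining such a federated (external) table over the
CSV files in Cloud Storage; the bucket, table name, and URI pattern are
hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical bucket and table names for the 30-minute price files.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/prices/goods_prices_*.csv"]
external_config.autodetect = True  # infer the schema from the CSV files

table = bigquery.Table("my-project.economics.goods_prices_external")
table.external_data_configuration = external_config
table = client.create_table(table)

# Queries against this table read the current objects in Cloud Storage at
# query time, so refreshed files are picked up without repeated load jobs.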
262.      You are designing the database schema for a machine learning-
   based food ordering service that will predict what users want to eat. Here
   is some of the information you need to store:
   ✑ The user profile: What the user likes and doesn't like to eat
   ✑ The user account information: Name, address, preferred meal times
   ✑ The order information: When orders are made, from where, to whom
   The database will be used to store all the transactional data of the
   product. You want to optimize the data schema. Which Google Cloud
   Platform product should you use?
264.     Your company produces 20,000 files every hour. Each data file is
   formatted as a comma separated values (CSV) file that is less than 4 KB.
   All files must be ingested on Google Cloud Platform before they can be
   processed. Your company site has a 200 ms latency to Google Cloud, and
   your Internet connection bandwidth is limited to 50 Mbps. You currently
   deploy a secure FTP (SFTP) server on a virtual machine in Google Compute
   Engine as the data ingestion point. A local SFTP client runs on a dedicated
   machine to transmit the CSV files as is. The goal is to make reports with
   data from the previous day available to the executives by 10:00 a.m. each
   day. This design is barely able to keep up with the current volume, even
   though the bandwidth utilization is rather low. You are told that due to
   seasonality, your company expects the number of files to double for the
   next three months. Which two actions should you take? (Choose two.)
265.      An external customer provides you with a daily dump of data from
   their database. The data flows into Google Cloud Storage (GCS) as comma-
   separated values (CSV) files. You want to analyze this data in Google
   BigQuery, but the data could have rows that are formatted incorrectly or
   corrupted. How should you build this pipeline?
267.      You are training a spam classifier. You notice that you are overfitting
   the training data. Which three actions can you take to resolve this
   problem? (Choose three.)
   To address the problem of overfitting in training a spam classifier, you
   should consider the following three actions:
   A. Get more training examples:
   Why: More training examples can help the model generalize better to
   unseen data. A larger dataset typically reduces the chance of overfitting,
   as the model has more varied examples to learn from.
   C. Use a smaller set of features:
   Why: Reducing the number of features can help prevent the model from
   learning noise in the data. Overfitting often occurs when the model is too
   complex for the amount of data available, and having too many features
   can contribute to this complexity.
   E. Increase the regularization parameters:
   Why: Regularization techniques (like L1 or L2 regularization) add a penalty
   to the model for complexity. Increasing the regularization parameter will
   strengthen this penalty, encouraging the model to be simpler and thus
   reducing overfitting.
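To make actions C and E concrete, here is an illustrative scikit-learn sketch
of a spam classifier; the pipeline and parameter values are assumptions, not
part of the question. Note that in scikit-learn a smaller C corresponds to a
stronger L2 penalty.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical spam classifier. A *smaller* C means a *stronger* L2 penalty,
# which is one way to increase regularization (E); max_features caps the
# feature set (C).
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(C=0.1, penalty="l2"),
)
# model.fit(train_texts, train_labels)  # adding more labeled examples helps too (A)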
268.      You are implementing security best practices on your data pipeline.
   Currently, you are manually executing jobs as the Project Owner. You want
   to automate these jobs by taking nightly batch files containing non-public
   information from Google Cloud Storage, processing them with a Spark
   Scala job on a Google Cloud Dataproc cluster, and depositing the results
   into Google BigQuery. How should you securely run this workload?
269.      You are using Google BigQuery as your data warehouse. Your users
   report that the following simple query is running very slowly, no matter
   when they run the query:
   SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP
   BY country
   You check the query plan for the query and see the following output in the
   Read section of Stage:1:
What is the most likely cause of the delay for this query?
   The most likely cause of the delay for this query is option D. Most rows in
   the [myproject:mydataset.mytable] table have the same value in the
   country column, causing data skew.
   Group by queries in BigQuery can run slowly when there is significant data
   skew on the grouped columns. Since the query is grouping by country, if
   most rows have the same country value, all that data will need to be
   shuffled to a single reducer to perform the aggregation. This can cause a
   data skew slowdown.
   Options A and B might cause general slowness but are unlikely to affect
   this specific grouping query. Option C could also cause some slowness but
   not to the degree that heavy data skew on the grouped column could. So
   D is the most likely root cause. Optimizing the data distribution to reduce
   skew on the grouped column would likely speed up this query.
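A quick way to confirm the skew hypothesis is to count rows per country on the
table referenced in the question (written here in standard SQL); the Python
wrapper below is just for illustration.

from google.cloud import bigquery

client = bigquery.Client()

# If one country accounts for most rows, the GROUP BY suffers from data skew.
query = """
    SELECT country, COUNT(*) AS row_count
    FROM `myproject.mydataset.mytable`
    GROUP BY country
    ORDER BY row_count DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.country, row.row_count)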
271.      Your organization has been collecting and analyzing data in Google
   BigQuery for 6 months. The majority of the data analyzed is placed in a
   time-partitioned table named events_partitioned. To reduce the cost of
   queries, your organization created a view called events, which queries
   only the last 14 days of data. The view is described in legacy SQL. Next
   month, existing applications will be connecting to BigQuery to read the
   events data via an ODBC connection. You need to ensure the applications
   can connect. Which two actions should you take? (Choose two.)
275.      An online retailer has built their current application on Google App
   Engine. A new initiative at the company mandates that they extend their
   application to allow their customers to transact directly via the application.
   They need to manage their shopping transactions and analyze combined
   data from multiple datasets using a business intelligence (BI) tool. They
   want to use only a single database for this purpose. Which Google Cloud
   database should they choose?
   Cloud SQL would be the most appropriate choice for the online retailer in
   this scenario. Cloud SQL is a fully-managed relational database service
   that allows for easy management and analysis of data using SQL. It is well-
   suited for applications built on Google App Engine and can handle the
   transactional workload of an e-commerce application, as well as the
   analytical workload of a BI tool.
276.     Your weather app queries a database every 15 minutes to get the
   current temperature. The frontend is powered by Google App Engine and
   serves millions of users. How should you design the frontend to respond to
   a database failure?
277.     You launched a new gaming app almost three years ago. You have
   been uploading log files from the previous day to a separate Google
   BigQuery table with the table name format LOGS_yyyymmdd. You have
   been using table wildcard functions to generate daily and monthly reports
   for all time ranges. Recently, you discovered that some queries that cover
   long date ranges are exceeding the limit of 1,000 tables and failing. How
   can you resolve this issue?
   Sharded tables, like LOGS_yyyymmdd, are useful for managing data, but
   querying across a long date range with table wildcards can lead to
   inefficiencies and exceed the 1,000 table limit in BigQuery. Instead of
   using multiple sharded tables, you should consider converting these into a
   partitioned table.
   A partitioned table allows you to store all the log data in a single table, but
   logically divides the data into partitions (e.g., by date). This way, you can
   efficiently query data across long date ranges without hitting the 1,000
   table limit.
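As a rough sketch of this conversion, the following Python snippet creates a
date-partitioned table with the google-cloud-bigquery client; the table name
and schema fields are hypothetical stand-ins for the game log schema.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical replacement for the sharded LOGS_yyyymmdd tables: one table
# partitioned by log date, queried with a WHERE filter instead of wildcards.
table = bigquery.Table(
    "my-project.gaming.logs_partitioned",
    schema=[
        bigquery.SchemaField("log_date", "DATE"),
        bigquery.SchemaField("player_id", "STRING"),
        bigquery.SchemaField("event", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="log_date",
)
table = client.create_table(table)

A long-range report then filters on the partitioning column (for example,
WHERE log_date BETWEEN '2023-01-01' AND '2023-03-31') instead of expanding a
table wildcard across thousands of sharded tables.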
   Preemptible workers are the default secondary worker type. They are
   reclaimed and removed from the cluster if they are required by Google
   Cloud for other tasks. Although the potential removal of preemptible
   workers can affect job stability, you may decide to use preemptible
   instances to lower per-hour compute costs for non-critical data processing
   or to create very large clusters at a lower total cost.
279.      Your company receives both batch- and stream-based event data.
   You want to process the data using Google Cloud Dataflow over a
   predictable time period. However, you realize that in some instances data
   can arrive late or out of order. How should you design your Cloud Dataflow
   pipeline to handle data that is late or out of order?
   Watermarks are a way to indicate that some data may still be in transit
   and not yet processed. By setting a watermark, you can define a time
   period during which Dataflow will continue to accept late or out-of-order
   data and incorporate it into your processing. This allows you to maintain a
   predictable time period for processing while still allowing for some
   flexibility in the arrival of data.
   Timestamps, on the other hand, are used to order events correctly, even if
   they arrive out of order. By assigning timestamps to each event, you can
   ensure that they are processed in the correct order, even if they don't
   arrive in that order.
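A minimal Apache Beam (Python SDK) sketch of these two ideas follows: events
are stamped with their own event timestamps and windowed with an allowed
lateness. The in-memory source, field names, window size, and lateness values
are assumptions for illustration only.

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.utils.timestamp import Duration

def add_event_timestamp(event):
    # Assumes each event dict carries its own epoch-seconds timestamp.
    return window.TimestampedValue(event, event["event_time"])

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create([{"event_time": 1700000000, "value": 1}])  # stand-in source
        | "Stamp" >> beam.Map(add_event_timestamp)            # order by event time, not arrival
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                          # 1-minute windows
            allowed_lateness=Duration(seconds=300),           # accept data up to 5 minutes late
        )
        | "Ones" >> beam.Map(lambda _: 1)
        | "CountPerWindow" >> beam.CombineGlobally(sum).without_defaults()
        | "Print" >> beam.Map(print)
    )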
280.        You have some data, which is shown in the graphic below. The two
   dimensions are X and Y, and the shade of each dot represents what class
   it is. You want to classify this data accurately using a linear algorithm. To
   do this you need to add a synthetic feature. What should the value of that
   feature be?
   The synthetic feature that should be added in this case is the squared
   value of the distance from the origin (0,0). This is equivalent to X² + Y². By
   adding this feature, the classifier will be able to make more accurate
   predictions by taking into account the distance of each data point from the
   origin.
   X² and Y² alone will not give enough information to classify the data
   because they do not take into account the relationship between X and Y.
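A tiny NumPy illustration of constructing this synthetic feature (the point
values are made up):

import numpy as np

# A few made-up points; X and Y are the two original dimensions.
X = np.array([0.1, 1.5, -2.0, 0.3])
Y = np.array([0.2, -1.2, 1.8, -0.4])

# Synthetic feature: squared distance from the origin, X² + Y².
# A linear model trained on [X, Y, X² + Y²] can then separate classes that
# lie inside versus outside a circle centered on the origin.
features = np.column_stack([X, Y, X**2 + Y**2])
print(features)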
281.     You are integrating one of your internal IT applications and Google
   BigQuery, so users can query BigQuery from the application's interface.
   You do not want individual users to authenticate to BigQuery and you do
   not want to give them access to the dataset. You need to securely access
   BigQuery from your IT application. What should you do?
282.      You are building a data pipeline on Google Cloud. You need to
   prepare data using a casual method for a machine-learning process. You
   want to support a logistic regression model. You also need to monitor and
   adjust for null values, which must remain real-valued and cannot be
   removed. What should you do?
283.      You set up a streaming data insert into a Redis cluster via a Kafka
   cluster. Both clusters are running on Compute Engine instances. You need
   to encrypt data at rest with encryption keys that you can create, rotate,
   and destroy as needed. What should you do?
   Cloud Key Management Service (KMS) is a fully managed service that
   allows you to create, rotate, and destroy encryption keys as needed. By
   creating encryption keys in Cloud KMS, you can use them to encrypt your
   data at rest on the Compute Engine instances that are running your Redis
   and Kafka clusters. This ensures that your data is protected
   even when it is stored on disk.
285.     You are selecting services to write and transform JSON messages
   from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You
   want to minimize service costs. You also want to monitor and
   accommodate input data volume that will vary in size with minimal
   manual intervention. What should you do?
   Using Cloud Dataflow for transformations with monitoring via Stackdriver
   and leveraging its default autoscaling settings, is the best choice. Cloud
   Dataflow is purpose-built for this type of workload, providing seamless
   scalability and efficient processing capabilities for streaming data. Its
   autoscaling feature minimizes manual intervention and helps manage
   costs by dynamically adjusting resources based on the actual processing
   needs, which is crucial for handling fluctuating data volumes efficiently
   and cost-effectively.
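For reference, a hedged sketch of the relevant Dataflow pipeline options in the
Beam Python SDK; the project, region, and bucket names are hypothetical, and
the worker cap is just an example value.

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical Dataflow job options: the default THROUGHPUT_BASED autoscaling
# lets the service add or remove workers as the Pub/Sub backlog changes,
# capped by max_num_workers to keep costs bounded.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)
# pipeline = beam.Pipeline(options=options)
#     ... ReadFromPubSub -> parse/transform JSON -> WriteToBigQuery ...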
288.     You are designing storage for very large text files for a data pipeline
   on Google Cloud. You want to support ANSI SQL queries. You also want to
   support compression and parallel load from the input locations using
   Google recommended practices. What should you do?
   The advantages of creating external tables are that they are fast to create
   (you skip importing the data) and that no additional monthly storage costs
   accrue to your account, since you are charged only for the data stored in
   the data lake, which is comparatively cheaper than storing it in BigQuery.
290.     You are designing storage for 20 TB of text files as part of deploying
   a data pipeline on Google Cloud. Your input data is in CSV format. You
   want to minimize the cost of querying aggregate values for multiple users
   who will query the data in Cloud Storage with multiple engines. Which
   storage service and schema design should you use?
291.     You are designing storage for two relational tables that are part of a
   10-TB database on Google Cloud. You want to support transactions that
   scale horizontally. You also want to optimize data for range queries on non-
   key columns. What should you do?
   Bigtable is Google's NoSQL Big Data database service. It's the same
   database that powers many core Google services, including Search,
   Analytics, Maps, and Gmail. Bigtable is designed to handle massive
   workloads at consistent low latency and high throughput, so it's a great
   choice for both operational and analytical applications, including IoT, user
   analytics, and financial data analysis.
   Bigtable is an excellent option for any Apache Spark or Hadoop workloads that
   require Apache HBase. Bigtable supports the Apache HBase 1.0+ APIs and
   offers a Bigtable HBase client in Maven, so it is easy to use Bigtable with
   Dataproc.
   BigQuery provides built-in logging of all data access, including the user's
   identity, the specific query run and the time of the query. This log can be
   used to provide an auditable record of access to the data. Additionally,
   BigQuery allows you to control access to the dataset using Identity and
   Access Management (IAM) roles, so you can ensure that only authorized
   personnel can view the dataset.
295.     Your neural network model is taking days to train. You want to
   increase the training speed. What can you do?
   Subsampling your training dataset can help increase the training speed of
   your neural network model. By reducing the size of your training dataset,
   you can speed up the process of updating the weights in your neural
   network. This can help you quickly test and iterate your model to improve
   its accuracy.
   Subsampling your test dataset, on the other hand, can lead to inaccurate
   evaluation of your model's performance and may result in overfitting. It is
   important to evaluate your model's performance on a representative test
   dataset to ensure that it can generalize to new data.
   Increasing the number of input features or layers in your neural network
   can also improve its performance, but this may not necessarily increase
   the training speed. In fact, adding more layers or features can increase the
   complexity of your model and make it take longer to train. It is important
   to balance the model's complexity with its performance and training time.
296.      You are responsible for writing your company's ETL pipelines to run
   on an Apache Hadoop cluster. The pipeline will require some checkpointing
   and splitting pipelines. Which method should you use to write the
   pipelines?
   This will likely have the most impact on transfer speeds as it addresses the
   bottleneck in the transfer between your data center and GCP. Increasing
   the CPU size or the size of the Google Persistent Disk on the server may
   help with processing the data once it has been transferred, but will not
   address the bottleneck in the transfer itself. Increasing the network
   bandwidth from Compute Engine to Cloud Storage would also help with
   processing the data once it has been transferred but will not address the
   bottleneck in the transfer itself as well.
298.     You are building a new real-time data warehouse for your company
   and will use Google BigQuery streaming inserts. There is no guarantee
   that data will only be sent in once but you do have a unique ID for each
   row of data and an event timestamp. You want to ensure that duplicates
   are not included while interactively querying data. Which query type
   should you use?
   This approach will assign a row number to each row within a unique ID
   partition, and by selecting only rows with a row number of 1, you will
   ensure that duplicates are excluded in your query results. It allows you to
   filter out redundant rows while retaining the latest or earliest records
   based on your timestamp column.
   Options A, B, and C do not address the issue of duplicates effectively or
   interactively as they do not explicitly remove duplicates based on the
   unique ID and event timestamp.
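A minimal sketch of that interactive deduplication query, run through the
Python client; the table and column names (unique_id, event_timestamp) are
hypothetical placeholders.

from google.cloud import bigquery

client = bigquery.Client()

# The window function keeps exactly one row per unique_id, preferring the
# most recent event_timestamp.
query = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (
          PARTITION BY unique_id ORDER BY event_timestamp DESC
        ) AS rn
      FROM `my-project.warehouse.events`
    )
    WHERE rn = 1
"""
rows = client.query(query).result()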
   Company Background –
   Founded by experienced telecom executives, MJTelco uses technologies
   originally developed to overcome communications challenges in space.
   Fundamental to their operation, they need to create a distributed data
   infrastructure that drives real-time analysis and incorporates machine
   learning to continuously optimize their topologies. Because their hardware
   is inexpensive, they plan to overdeploy the network allowing them to
   account for the impact of dynamic regional politics on location availability
   and cost.
   Their management and operations teams are situated all around the globe
   creating a many-to-many relationship between data consumers and
   providers in their system. After careful consideration, they decided public
   cloud is the perfect environment to support their needs.
   Solution Concept –
   MJTelco is running a successful proof-of-concept (PoC) project in its labs.
   They have two primary needs:
   ✑ Scale and harden their PoC to support significantly more data flows
   generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
   MJTelco is building a custom interface to share data. They have these
   requirements:
      1. They need to do aggregations over their petabyte-scale datasets.
      2. They need to scan specific time range rows with a very fast
          response time (milliseconds).
   Which combination of Google Cloud Platform products should you
   recommend?
   Company Background –
   Founded by experienced telecom executives, MJTelco uses technologies
   originally developed to overcome communications challenges in space.
   Fundamental to their operation, they need to create a distributed data
   infrastructure that drives real-time analysis and incorporates machine
   learning to continuously optimize their topologies. Because their hardware
   is inexpensive, they plan to overdeploy the network allowing them to
   account for the impact of dynamic regional politics on location availability
   and cost.
   Their management and operations teams are situated all around the globe
   creating a many-to-many relationship between data consumers and
   providers in their system. After careful consideration, they decided public
   cloud is the perfect environment to support their needs.
   Solution Concept –
   MJTelco is running a successful proof-of-concept (PoC) project in its labs.
   They have two primary needs:
   ✑ Scale and harden their PoC to support significantly more data flows
   generated when they ramp to more than 50,000 installations.
   ✑ Refine their machine-learning cycles to verify and improve the dynamic
   models they use to control topology definition.
MJTelco will also use three separate operating environments –
development/test, staging, and production – to meet the needs of running
experiments, deploying new features, and serving production customers.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
CTO Statement –
Our public cloud services must operate as advertised. We need resources
that scale and keep our data secure. We also need environments in which
our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our
development and test environments to work as we iterate.
CFO Statement –
The project is too large for us to maintain the hardware and software
required for the data and analysis. Also, we cannot afford to staff an
operations team to monitor so many data feeds, so we will rely on
automation and infrastructure. Google Cloud's machine learning will allow
our quantitative researchers to work on our high-value problems instead of
problems with our data pipelines.
You need to compose visualizations for operations teams with the following
requirements:
   ✑ Telemetry must include data from all 50,000 installations for the most
   recent 6 weeks (sampling once every minute)
   ✑ The report must not be more than 3 hours delayed from live data.
   ✑ The actionable report should only show suboptimal links.
   ✑ Most suboptimal links should be sorted to the top.
   You create a data source to store the last 6 weeks of data, and create
   visualizations that allow viewers to see multiple date ranges, distinct
   geographic regions, and unique installation types. You always show the
   latest data without any changes to your visualizations. You want to avoid
   creating and updating new visualizations each month. What should you
   do?
   Company Background –
   Founded by experienced telecom executives, MJTelco uses technologies
   originally developed to overcome communications challenges in space.
   Fundamental to their operation, they need to create a distributed data
   infrastructure that drives real-time analysis and incorporates machine
   learning to continuously optimize their topologies. Because their hardware
   is inexpensive, they plan to overdeploy the network allowing them to
   account for the impact of dynamic regional politics on location availability
   and cost.
Their management and operations teams are situated all around the globe
creating a many-to-many relationship between data consumers and
providers in their system. After careful consideration, they decided public
cloud is the perfect environment to support their needs.
Solution Concept –
MJTelco is running a successful proof-of-concept (PoC) project in its labs.
They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows
generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic
models they use to control topology definition.
Business Requirements –
✑ Scale up their production environment with minimal cost, instantiating
resources when and where needed in an unpredictable, distributed
telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge
machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed
research workers
✑ Maintain isolated environments that support rapid iteration of their
machine-learning models without affecting their customers.
Technical Requirements –
✑ Ensure secure and efficient transport and storage of telemetry data
✑ Rapidly scale instances to support between 10,000 and 100,000 data
providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2
years of data storing approximately 100m records/day
✑ Support rapid iteration of monitoring infrastructure focused on
awareness of data pipeline problems both in telemetry flows and in
production learning cycles.
CEO Statement –
Our business model relies on our patents, analytics and dynamic machine
learning. Our inexpensive hardware is organized to be highly reliable,
which gives us cost advantages. We need to quickly stabilize our large
distributed data pipelines to meet our reliability and capacity
commitments.
   CTO Statement –
   Our public cloud services must operate as advertised. We need resources
   that scale and keep our data secure. We also need environments in which
   our data scientists can carefully study and quickly adapt our models.
   Because we rely on automation to process our data, we also need our
   development and test environments to work as we iterate.
   CFO Statement –
   The project is too large for us to maintain the hardware and software
   required for the data and analysis. Also, we cannot afford to staff an
   operations team to monitor so many data feeds, so we will rely on
   automation and infrastructure. Google Cloud's machine learning will allow
   our quantitative researchers to work on our high-value problems instead of
   problems with our data pipelines.
   Given the record streams MJTelco is interested in ingesting per day, they
   are concerned about the cost of Google BigQuery increasing. MJTelco asks
   you to provide a design solution. They require a single large data table
   called tracking_table. Additionally, they want to minimize the cost of daily
   queries while performing fine-grained analysis of each day's events. They
   also want to use streaming ingestion. What should you do?
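   One commonly cited design for this scenario is a single ingestion-time-partitioned
   tracking_table: streamed rows land in the current day's partition, and each day's
   fine-grained analysis scans only that partition, which keeps daily query cost low.
   A minimal sketch using the Python BigQuery client; project, dataset, and field
   names are hypothetical:

      from google.cloud import bigquery

      client = bigquery.Client()

      # Single large table, partitioned by ingestion time (one partition per day).
      table = bigquery.Table(
          "my-project.telemetry.tracking_table",  # hypothetical project/dataset
          schema=[
              bigquery.SchemaField("installation_id", "STRING"),
              bigquery.SchemaField("metric", "FLOAT"),
              bigquery.SchemaField("event_time", "TIMESTAMP"),
          ],
      )
      table.time_partitioning = bigquery.TimePartitioning(
          type_=bigquery.TimePartitioningType.DAY
      )
      client.create_table(table)

      # A day's fine-grained analysis scans only that day's partition.
      sql = """
          SELECT installation_id, AVG(metric) AS avg_metric
          FROM `my-project.telemetry.tracking_table`
          WHERE _PARTITIONTIME = TIMESTAMP("2024-01-15")
          GROUP BY installation_id
      """
      rows = client.query(sql).result()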
   Company Background –
   The company started as a regional trucking company, and then expanded
into other logistics markets. Because they have not updated their
   infrastructure, managing and tracking orders and shipments has become a
   bottleneck. To improve operations, Flowlogistic developed proprietary
   technology for tracking shipments in real time at the parcel level.
   However, they are unable to deploy it because their technology stack,
   based on Apache Kafka, cannot support the processing volume. In
addition, Flowlogistic wants to further analyze their orders and shipments
to determine how best to deploy their resources.
Solution Concept –
Flowlogistic wants to implement two concepts using the cloud:
✑ Use their proprietary technology in a real-time inventory-tracking
system that indicates the location of their loads
✑ Perform analytics on all their orders and shipment logs, which contain
both structured and unstructured data, to determine how best to deploy
resources and which markets to expand into. They also want to use predictive
analytics to learn earlier when a shipment will be delayed.
✑ Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) for SQL Server storage
- Network-attached storage (NAS) for image storage, logs, and backups
✑ 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements –
✑ Build a reliable and reproducible environment with scaled parity of
production.
✑ Aggregate data in a centralized Data Lake for analysis
✑ Use historical data to perform predictive analytics on future shipments
✑ Accurately track every shipment worldwide using proprietary technology
✑ Improve business agility and speed of innovation through rapid
provisioning of new resources
✑ Analyze and optimize architecture for performance in the cloud
✑ Migrate fully to the cloud if all other requirements are met
Technical Requirements –
✑ Handle both streaming and batch data
✑ Migrate existing Hadoop workloads
✑ Ensure architecture is scalable and elastic to meet the changing
demands of the company.
✑ Use managed services whenever possible
✑ Encrypt data in flight and at rest
✑Connect a VPN between the production data center and cloud
environment
CEO Statement –
We have grown so quickly that our inability to upgrade our infrastructure is
really hampering further growth and efficiency. We are efficient at moving
shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand
where our customers are and what they are shipping.
CTO Statement –
IT has never been a priority for us, so as our data has grown, we have not
invested enough in our technology. I have a good staff to manage IT, but
they are so busy managing our infrastructure that I cannot get them to do
the things that really matter, such as organizing our data, building the
analytics, and figuring out how to implement the CFO's tracking
technology.
CFO Statement –
Part of our competitive advantage is that we penalize ourselves for late
shipments and deliveries. Knowing where our shipments are at all times
has a direct correlation to our bottom line and profitability. Additionally, I
don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka
servers cannot handle the data volume for their real-time inventory
tracking system.
You need to build a new system on Google Cloud Platform (GCP) that will
feed the proprietary tracking software. The system must be able to ingest
data from a variety of global sources, process and query in real-time, and
store the data reliably. Which combination of GCP products should you
choose?
303.      After migrating ETL jobs to run on BigQuery, you need to verify that
   the output of the migrated jobs is the same as the output of the original.
   You've loaded a table containing the output of the original job and want to
   compare the contents with output from the migrated job to show that they
   are identical. The tables do not contain a primary key column that would
   enable you to join them together for comparison. What should you do?
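   One way to check this without a join key is a row-level symmetric difference:
   if both EXCEPT DISTINCT queries below return zero rows and the overall row
   counts of the two tables match, the outputs are identical. A sketch with
   hypothetical table names:

      from google.cloud import bigquery

      client = bigquery.Client()

      # Count rows that appear in one output but not the other (both directions).
      sql = """
          SELECT 'only_in_original' AS side, COUNT(*) AS n FROM (
              SELECT * FROM `my-project.etl.original_output`
              EXCEPT DISTINCT
              SELECT * FROM `my-project.etl.migrated_output`
          )
          UNION ALL
          SELECT 'only_in_migrated' AS side, COUNT(*) AS n FROM (
              SELECT * FROM `my-project.etl.migrated_output`
              EXCEPT DISTINCT
              SELECT * FROM `my-project.etl.original_output`
          )
      """
      for row in client.query(sql).result():
          print(row.side, row.n)  # both counts should be 0 if the outputs match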
305.      You have an Apache Kafka cluster on-prem with topics containing
   web application logs. You need to replicate the data to Google Cloud for
   analysis in BigQuery and Cloud Storage. The preferred replication method
   is mirroring to avoid deployment of Kafka Connect plugins. What should
   you do?
   By default, preemptible node disk sizes are limited to 100 GB or the size of
   the non-preemptible node disks, whichever is smaller; however, you can
   override the default preemptible disk size to any requested size. Since the
   majority of the cluster is using preemptible nodes, the disks used for caching
   operations will see a noticeable performance improvement with a larger size.
   SSDs will also perform better than HDDs. This increases costs slightly, but it
   is the best available option for making the job run faster while keeping costs
   under control.
307.      Your team is responsible for developing and maintaining ETLs in
   your company. One of your Dataflow jobs is failing because of some errors
   in the input data, and you need to improve the reliability of the pipeline (including
   being able to reprocess all failing data). What should you do?
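   The pattern this scenario usually calls for is a dead-letter output: elements
   that fail processing are tagged and written to a separate sink so the pipeline
   keeps running and the failing records can be inspected and reprocessed later.
   A minimal Beam (Python) sketch with hypothetical paths:

      import json
      import apache_beam as beam
      from apache_beam import pvalue

      class ParseRecord(beam.DoFn):
          # Good records go to the main output; failures go to 'dead_letter'.
          def process(self, element):
              try:
                  yield json.loads(element)
              except Exception:
                  yield pvalue.TaggedOutput("dead_letter", element)

      with beam.Pipeline() as p:
          results = (
              p
              | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
              | "Parse" >> beam.ParDo(ParseRecord()).with_outputs(
                  "dead_letter", main="parsed")
          )
          results.parsed | "WriteGood" >> beam.io.WriteToText("gs://my-bucket/output/good")
          results.dead_letter | "WriteFailed" >> beam.io.WriteToText("gs://my-bucket/output/failed")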
Which table name will make the SQL statement work correctly?
    Option B is the only one that correctly uses the wildcard syntax (wrapped in
    backticks, not quotes) to specify a pattern matching multiple tables, and it
    allows the _TABLE_SUFFIX pseudo-column to filter on the matched portion of
    the table name. This makes it the correct answer for querying across multiple
    tables using wildcard tables in BigQuery.
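    For reference, the working form looks like the sketch below: the wildcard
    table is wrapped in backticks, and _TABLE_SUFFIX filters on the portion
    matched by the wildcard. Table names are hypothetical:

      from google.cloud import bigquery

      client = bigquery.Client()

      # Query every sharded table matching events_*, restricted to one week of shards.
      sql = """
          SELECT event_id, event_time
          FROM `my-project.logs.events_*`
          WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107'
      """
      rows = client.query(sql).result()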
310.      You are deploying MariaDB SQL databases on GCE VM Instances and
   need to configure monitoring and alerting. You want to collect metrics
   including network connections, disk IO and replication status from MariaDB
   with minimal development effort and use StackDriver for dashboards and
   alerts. What should you do?
   StackDriver Agent: The StackDriver Agent is designed to collect system
   and application metrics from virtual machine instances and send them to
   StackDriver Monitoring. It simplifies the process of collecting and
   forwarding metrics.
   MySQL Plugin: The StackDriver Agent has a MySQL plugin that allows you
   to collect MySQL-specific metrics without the need for additional custom
   development. This includes metrics related to network connections, disk
   IO, and replication status – the exact metrics required in this scenario.
   Option D is the most straightforward and least development-intensive
   approach to achieve the monitoring and alerting requirements for MariaDB
   on GCE VM Instances using StackDriver.
311.     You work for a bank. You have a labelled dataset that contains
   information on already granted loan applications and whether these
   applications have defaulted. You have been asked to train a model to
   predict default rates for credit applicants. What should you do?
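   Because each historical application carries a known default/no-default label,
   this is a supervised binary-classification problem rather than regression or
   clustering. Purely as an illustration (the question does not prescribe the
   tooling), a logistic-regression classifier could be trained in BigQuery ML on
   the labelled data; dataset, table, and column names are hypothetical:

      from google.cloud import bigquery

      client = bigquery.Client()

      # Train a binary classifier on historical, labelled loan applications.
      sql = """
          CREATE OR REPLACE MODEL `my-project.lending.default_model`
          OPTIONS (
              model_type = 'logistic_reg',
              input_label_cols = ['defaulted']   -- hypothetical label column
          ) AS
          SELECT income, loan_amount, credit_score, defaulted
          FROM `my-project.lending.granted_applications`
      """
      client.query(sql).result()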
313.      You're using Bigtable for a real-time application, and you have a
   heavy load that is a mix of reads and writes. You've recently identified an
   additional use case and need to run an hourly analytical job that
   calculates certain statistics across the whole database. You need to ensure
   both the reliability of your production application and of the analytical
   workload. What should you do?
   When you use a single cluster to run a batch analytics job that performs
   numerous large reads alongside an application that performs a mix of
   reads and writes, the large batch job can slow things down for the
   application's users. With replication, you can use app profiles with single-
   cluster routing to route batch analytics jobs and application traffic to
   different clusters, so that batch jobs don't affect your applications' users.
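   A sketch of that setup, assuming the google-cloud-bigtable admin client's
   app_profile helper and hypothetical instance/cluster names: one app profile
   pins live traffic to cluster-a, another pins the hourly analytics job to
   cluster-b.

      from google.cloud import bigtable
      from google.cloud.bigtable import enums

      client = bigtable.Client(project="my-project", admin=True)
      instance = client.instance("tracking-instance")  # replicated: cluster-a, cluster-b

      # Route live application traffic to cluster-a only.
      serving = instance.app_profile(
          "serving",
          routing_policy_type=enums.RoutingPolicyType.SINGLE,
          cluster_id="cluster-a",
          allow_transactional_writes=True,
      )
      serving.create(ignore_warnings=True)

      # Route the hourly analytics job to cluster-b so its large reads
      # do not affect the production application's users.
      batch = instance.app_profile(
          "batch-analytics",
          routing_policy_type=enums.RoutingPolicyType.SINGLE,
          cluster_id="cluster-b",
          allow_transactional_writes=False,
      )
      batch.create(ignore_warnings=True)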
314.      You are designing an Apache Beam pipeline to enrich data from
   Cloud Pub/Sub with static reference data from BigQuery. The reference
   data is small enough to fit in memory on a single worker. The pipeline
   should write enriched results to BigQuery for analysis. Which job type and
   transforms should this pipeline use?
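   The design this question describes is typically a streaming job in which the
   small BigQuery reference table is broadcast to workers as a side input (for
   example with AsDict) inside the enrichment step, and results are written back
   to BigQuery. A minimal Beam (Python) sketch with hypothetical topic and table
   names:

      import apache_beam as beam
      from apache_beam.options.pipeline_options import PipelineOptions

      options = PipelineOptions(streaming=True)  # a --temp_location is also needed
                                                 # for the BigQuery export/read

      with beam.Pipeline(options=options) as p:
          # Small, static reference data read once and passed as a side input.
          reference = (
              p
              | "ReadReference" >> beam.io.ReadFromBigQuery(table="my-project:refs.lookup")
              | "ToKV" >> beam.Map(lambda row: (row["key"], row["value"]))
          )

          def enrich(message, ref):
              key = message.decode("utf-8")
              return {"raw": key, "extra": ref.get(key)}

          (
              p
              | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
              | "Enrich" >> beam.Map(enrich, ref=beam.pvalue.AsDict(reference))
              | "Write" >> beam.io.WriteToBigQuery(
                  "my-project:analytics.enriched",
                  schema="raw:STRING,extra:STRING")
          )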
315.     You have a data pipeline that writes data to Cloud Bigtable using
   well-designed row keys. You want to monitor your pipeline to determine
   when to increase the size of your Cloud Bigtable cluster. Which two actions
   can you take to accomplish this? (Choose two.)
   D: In general, do not use more than 70% of the hard limit on total storage,
   so you have room to add more data. If you do not plan to add significant
   amounts of data to your instance, you can use up to 100% of the hard
   limit
   C: If this value is frequently at 100%, you might experience increased
   latency. Add nodes to the cluster to reduce the disk load percentage.
   The Key Visualizer metric options point to issues other than the need to
   increase the cluster size.
318.     Your company needs to upload their historic data to Cloud Storage.
   The security rules don't allow access from external IPs to their on-premises
   resources. After an initial upload, they will add new data from existing on-
   premises applications every day. What should they do?
   gsutil rsync is the most straightforward, secure, and efficient solution for
   transferring data from on-premises servers to Cloud Storage, especially
   when security rules restrict inbound connections to the on-premises
   environment. It's well-suited for both the initial bulk upload and the
   ongoing daily updates.
319.      You have a query that filters a BigQuery table using a WHERE clause
   on timestamp and ID columns. By using bq query --dry_run, you learn that
   the query triggers a full scan of the table, even though the filter on
   timestamp and ID selects a tiny fraction of the overall data. You want to
   reduce the amount of data scanned by BigQuery with minimal changes to
   existing SQL queries. What should you do?
Partitioning and clustering are the most effective way to optimize
BigQuery queries that filter on specific columns like timestamp and ID. By
reorganizing the table structure, BigQuery can significantly reduce the
amount of data scanned, leading to faster and cheaper queries.
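For example, the table could be recreated as a table partitioned on the
timestamp column and clustered on the ID column; the existing WHERE clauses
stay the same and automatically benefit from partition pruning. A sketch with
hypothetical names:

      from google.cloud import bigquery

      client = bigquery.Client()

      # Recreate the table partitioned on the timestamp column and clustered on id.
      client.query("""
          CREATE TABLE `my-project.warehouse.events_partitioned`
          PARTITION BY DATE(event_timestamp)
          CLUSTER BY id
          AS SELECT * FROM `my-project.warehouse.events`
      """).result()

      # The same style of filter now scans only the matching partitions.
      rows = client.query("""
          SELECT *
          FROM `my-project.warehouse.events_partitioned`
          WHERE event_timestamp BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-01-02')
            AND id = 'abc123'
      """).result()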