New 2024 Realistic Free Google Professional-Data-Engineer Exam Dump Questions & Answer [Q12-Q33]

Professional-Data-Engineer Practice Test Engine: Try These 333 Exam Questions

This course shows you how to manage big data, including loading, extracting, cleaning, and validating it. By the end of the training, you should be able to create machine learning and statistical models and visualize query results. The program is lengthy, but thorough practice is needed to build the knowledge required for the actual exam. The course covers the following modules:

Production ML Pipelines and use of Kubeflow
Handling Data Pipelines with Cloud Composer and Cloud Data Fusion
Introduction to Data Engineering
Advanced BigQuery Performance and Functionality
Bigtable Streaming Features and High-Throughput BigQuery
Creating a Data Lake
Introduction to Processing Streaming Data
Performing Spark on Cloud Dataproc
Big Data Analytics with Cloud AI Platform Notebooks
Serverless Data Processing with Cloud Dataflow
Building a Data Warehouse
Cloud Dataflow Streaming Features
Serverless Messaging Using Cloud Pub/Sub
Prebuilt ML Model APIs for Unstructured Data
Custom Model Building Using SQL in BigQuery ML
Custom Model Building Using Cloud AutoML

These modules cover everything a candidate needs to pass the Professional Data Engineer certification exam. If you follow the program closely and apply what you learn, you should be well prepared to earn a good score and achieve the Google Professional Data Engineer certification.

NEW QUESTION # 12
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

A. Train on the new data while using the existing data as your test set.
B. Train on the existing data while using the new data as your test set.
C. Continuously retrain the model on a combination of existing data and the new data.
D. Continuously retrain the model on just the new data.

Answer: C

Explanation:
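Retrain on a combination of existing and new data so the model follows changing preferences without discarding what it learned from older data. Both the training set and the test set should draw on old and new data.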
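
As a rough illustration of option C, the sketch below merges existing and newly streamed records and re-splits them before each retraining run. The function name `build_training_split`, the record shape, and the 80/20 split are assumptions made for illustration; they are not part of the exam question.

```python
import random

def build_training_split(existing, new, test_fraction=0.2, seed=0):
    """Combine existing and new records, then split them into train and test sets."""
    combined = list(existing) + list(new)
    # Shuffle so that both sets contain a mix of old and new records.
    random.Random(seed).shuffle(combined)
    n_test = int(len(combined) * test_fraction)
    return combined[n_test:], combined[:n_test]

# Hypothetical records: (user id, purchased item).
existing_data = [("user_%d" % i, "jeans") for i in range(8)]
new_data = [("user_8", "linen shirt"), ("user_9", "trench coat")]

train_set, test_set = build_training_split(existing_data, new_data)
print(len(train_set), len(test_set))  # 8 2
```

Calling the function again each time new data arrives gives a fresh split over everything collected so far, which is the "continuous retraining" idea in its simplest form.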

NEW QUESTION # 13
Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?

A. Use pre-emptible virtual machines (VMs) for the cluster
B. Use SSDs on the worker nodes so that the job can run faster
C. Migrate the workload to Google Cloud Dataflow
D. Use a higher-memory node so that the job runs faster

Answer: A

NEW QUESTION # 14
Case Study: 2 - MJTelco
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
Provide reliable and timely access to data for analysis from distributed research workers.
Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data.
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data, storing approximately 100m records/day.
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute).
The report must not be more than 3 hours delayed from live data.
The actionable report should only show suboptimal links.
Most suboptimal links should be sorted to the top.
Suboptimal links can be grouped and filtered by regional geography.
User response time to load the report must be <5 seconds.
You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?

A. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
C. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
D. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.

Answer: B

NEW QUESTION # 15
Your company has recently grown rapidly and is now ingesting data at a significantly higher rate than it was previously. You manage the daily batch MapReduce analytics jobs in Apache Hadoop. However, the recent increase in data has meant the batch jobs are falling behind. You were asked to recommend ways the development team could increase the responsiveness of the analytics without increasing costs. What should you recommend they do?

A. Decrease the size of the Hadoop cluster but also rewrite the job in Hive.
B. Rewrite the job in Pig.
C. Rewrite the job in Apache Spark.
D. Increase the size of the Hadoop cluster.

Answer: C

Explanation:
Spark processes data in memory, which is faster than MapReduce's disk-based processing between stages and reduces the job's processing time.

NEW QUESTION # 16
You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?

A. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
B. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
C. Increase the cluster size with more non-preemptible workers.
D. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.

Answer: B

Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/flex

NEW QUESTION # 17
You work for an advertising company, and you've developed a Spark ML model to predict click-through rates at advertisement blocks. You've been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data center will be closing soon, so a rapid lift-and-shift migration is necessary. However, the data you've been using will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud. What should you do?

A. Use Cloud ML Engine for training existing Spark ML models
B. Rewrite your models on TensorFlow, and start using Cloud ML Engine
C. Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
D. Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery

Answer: C

NEW QUESTION # 18
What is the recommended action to do in order to switch between SSD and HDD storage for your Google Cloud Bigtable instance?

A. create a third instance and sync the data from the two storage types via batch jobs
B. the selection is final and you must resume using the same storage type
C. export the data from the existing instance and import the data into a new instance
D. run parallel instances where one is HDD and the other is SSD

Answer: C

Explanation:
When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice-versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.

NEW QUESTION # 19
Which methods can be used to reduce the number of rows processed by BigQuery?

A. Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
B. Splitting tables into multiple tables; using the LIMIT clause
C. Putting data in partitions; using the LIMIT clause
D. Splitting tables into multiple tables; putting data in partitions

Answer: D

Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
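
The difference between LIMIT and partitioning described above can be pictured with a toy model. The sketch below is plain Python, not BigQuery code: the two functions and the in-memory "table" are assumptions made for illustration, and they only mimic the behavior the explanation describes (LIMIT trims the result after the scan, while a partition filter narrows the scan itself).

```python
from collections import defaultdict

def scan_with_limit(table, limit):
    """Model of LIMIT: every row is read, then the result is trimmed."""
    rows_scanned = len(table)
    return rows_scanned, table[:limit]

def scan_partition(partitions, day):
    """Model of a partition filter: only the matching partition is read."""
    rows = partitions.get(day, [])
    return len(rows), rows

# Hypothetical table: 3 days of data, 100 rows per day.
table = [{"day": f"2024-01-{d:02d}", "value": v}
         for d in range(1, 4) for v in range(100)]

# Group the same rows by day to stand in for a day-partitioned table.
partitions = defaultdict(list)
for row in table:
    partitions[row["day"]].append(row)

limit_scanned, _ = scan_with_limit(table, 10)
partition_scanned, _ = scan_partition(partitions, "2024-01-02")
print(limit_scanned, partition_scanned)  # 300 100
```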

NEW QUESTION # 20
Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

A. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
B. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
C. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
D. Introduce data compression for each file to increase the rate of file transfer.
E. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.

Answer: C,E
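
The scenario states that bandwidth utilization is low while the design barely keeps up. A back-of-the-envelope calculation with the figures from the question suggests why: per-file latency, not throughput, appears to be the constraint. The sketch assumes at least one 200 ms round trip per file (real SFTP sessions usually need more), treats 1 KB as 1,000 bytes, and uses 8 parallel streams purely as an example value.

```python
FILES_PER_HOUR = 20_000
MAX_FILE_KB = 4            # upper bound given in the question
LATENCY_S = 0.2            # 200 ms to Google Cloud
LINK_MBPS = 50
ROUND_TRIPS_PER_FILE = 1   # assumption: a lower bound, not a measured value

def average_mbps(files_per_hour, file_kb):
    """Average throughput needed to move one hour of files within one hour."""
    bits_per_hour = files_per_hour * file_kb * 1000 * 8
    return bits_per_hour / 3600 / 1_000_000

def latency_seconds_per_hour(files_per_hour, latency_s, round_trips, parallel_streams=1):
    """Time spent waiting on round trips for one hour of files."""
    return files_per_hour * latency_s * round_trips / parallel_streams

current_mbps = average_mbps(FILES_PER_HOUR, MAX_FILE_KB)
current_wait = latency_seconds_per_hour(FILES_PER_HOUR, LATENCY_S, ROUND_TRIPS_PER_FILE)
doubled_wait = latency_seconds_per_hour(2 * FILES_PER_HOUR, LATENCY_S, ROUND_TRIPS_PER_FILE)
doubled_parallel = latency_seconds_per_hour(
    2 * FILES_PER_HOUR, LATENCY_S, ROUND_TRIPS_PER_FILE, parallel_streams=8)

print(f"{current_mbps:.2f} Mbps needed of {LINK_MBPS} Mbps available")  # 0.18 Mbps needed of 50 Mbps available
print(round(current_wait), round(doubled_wait), round(doubled_parallel))  # 4000 8000 1000
```

Under these assumptions, a sequential transfer already spends about 4,000 seconds per hour of data waiting on round trips, more than the 3,600 seconds available, while the link itself is almost idle. Sending files in parallel or reducing the number of transfers lowers the latency cost; adding bandwidth does not.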

NEW QUESTION # 21
You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?

A. Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.
B. Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
C. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
D. Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.

Answer: B

NEW QUESTION # 22
Flowlogistic Case Study
Company Overview
Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping.
Company Background
The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources.
Solution Concept
Flowlogistic wants to implement two concepts using the cloud:
* Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads
* Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed.
Existing Technical Environment
Flowlogistic architecture resides in a single data center:
* Databases
- 8 physical servers in 2 clusters
- SQL Server - user data, inventory, static data
- 3 physical servers
- Cassandra - metadata, tracking messages
- 10 Kafka servers - tracking message aggregation and batch insert
* Application servers - customer front end, middleware for order/customs
- 60 virtual machines across 20 physical servers
- Tomcat - Java services
- Nginx - static content
- Batch servers
* Storage appliances
- iSCSI for virtual machine (VM) hosts
- Fibre Channel storage area network (FC SAN) - SQL server storage
- Network-attached storage (NAS) - image storage, logs, backups
* 10 Apache Hadoop /Spark servers
- Core Data Lake
- Data analysis workloads
* 20 miscellaneous servers
- Jenkins, monitoring, bastion hosts,
Business Requirements
* Build a reliable and reproducible environment with scaled parity of production.
* Aggregate data in a centralized Data Lake for analysis
* Use historical data to perform predictive analytics on future shipments
* Accurately track every shipment worldwide using proprietary technology
* Improve business agility and speed of innovation through rapid provisioning of new resources
* Analyze and optimize architecture for performance in the cloud
* Migrate fully to the cloud if all other requirements are met
Technical Requirements
* Handle both streaming and batch data
* Migrate existing Hadoop workloads
* Ensure architecture is scalable and elastic to meet the changing demands of the company.
* Use managed services whenever possible
* Encrypt data in flight and at rest
* Connect a VPN between the production data center and cloud environment
CEO Statement
We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around.
We need to organize our information so we can more easily understand where our customers are and what they are shipping.
CTO Statement
IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology.
CFO Statement
Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment.
Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose?

A. Cloud Dataflow, Cloud SQL, and Cloud Storage
B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD
C. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage
D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage
E. Cloud Pub/Sub, Cloud SQL, and Cloud Storage

Answer: C

NEW QUESTION # 23
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks. She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?

A. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
B. Grant the user access to Google Cloud Shell.
C. Run a local version of Jupyter on the laptop.
D. Host a visualization tool on a VM on Google Compute Engine.

Answer: A

NEW QUESTION # 24
Which Google Cloud Platform service is an alternative to Hadoop with Hive?

A. Cloud Datastore
B. Cloud Dataflow
C. BigQuery
D. Cloud Bigtable

Answer: C

Explanation:
Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarization, query, and analysis.
Google BigQuery is an enterprise data warehouse.

NEW QUESTION # 25
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? Choose 2 answers.

A. Preserve the structure of the data as much as possible.
B. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
C. Denormalize the data as much as possible.
D. Use BigQuery UPDATE to further reduce the size of the dataset.
E. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.

Answer: C,E

NEW QUESTION # 26
You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs?

A. Cloud Scheduler
B. Cloud Dataflow
C. Cloud Composer
D. Cloud Functions

Answer: C

NEW QUESTION # 27
Your company is in a highly regulated industry. One of your requirements is to ensure individual users have access only to the minimum amount of information required to do their jobs. You want to enforce this requirement with Google BigQuery.
Which three approaches can you take? (Choose three.)

A. Ensure that the data is encrypted at all times.
B. Restrict BigQuery API access to approved users.
C. Segregate data across multiple tables or databases.
D. Disable writes to certain tables.
E. Use Google Stackdriver Audit Logging to determine policy violations.
F. Restrict access to tables by role.

Answer: B,E,F

Explanation:
bigquery.tables.create: Create new tables.
bigquery.tables.delete: Delete tables.
bigquery.tables.export: Export table data out of BigQuery.
bigquery.tables.get: Get table metadata. To get table data, you need bigquery.tables.getData.
bigquery.tables.getData: Get table data. This permission is required for querying table data. To get table metadata, you need bigquery.tables.get.
bigquery.tables.list: List tables and metadata on tables.
bigquery.tables.setCategory: Set policy tags in table schema.
bigquery.tables.update: Update table metadata. To update table data, you need bigquery.tables.updateData.
bigquery.tables.updateData: Update table data. To update table metadata, you need bigquery.tables.update.

NEW QUESTION # 28
You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? (Choose two.)

A. The subscriber code does not acknowledge the messages that it pulls.
B. Error handling in the subscriber code is not handling run-time errors properly.
C. Publisher throughput quota is too small.
D. Total outstanding messages exceed the 10-MB maximum.
E. The subscriber code cannot keep up with the messages.

Answer: A,B
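
Pub/Sub redelivers any message that a subscriber does not acknowledge, so a subscriber that silently fails or never acks can inflate the observed processing rate without logging an error. The toy simulation below models that effect in plain Python. It does not use the Pub/Sub client library; `simulate_deliveries`, the round-based redelivery, and the cap of 5 rounds are assumptions made for illustration.

```python
from collections import deque

def simulate_deliveries(message_ids, acknowledges, max_rounds=5):
    """Count deliveries when unacknowledged messages are redelivered each round."""
    pending = deque(message_ids)
    deliveries = 0
    for _ in range(max_rounds):
        if not pending:
            break
        redeliver = deque()
        while pending:
            message = pending.popleft()
            deliveries += 1
            # A message that is not acknowledged comes back in the next round.
            if not acknowledges(message):
                redeliver.append(message)
        pending = redeliver
    return deliveries

messages = list(range(100))
acked_total = simulate_deliveries(messages, lambda m: True)
unacked_total = simulate_deliveries(messages, lambda m: False)
print(acked_total, unacked_total)  # 100 500
```

With acknowledgements, each of the 100 messages is handled once. Without them, the same 100 messages are processed in every round, so the count grows with time rather than with the number of published messages.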

NEW QUESTION # 29
You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?

A. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.
B. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
C. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
D. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.

Answer: D
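
The requirement that nulls "remain real-valued" is why 0 is preferred over the string 'none': a logistic regression model needs numeric inputs. The sketch below shows the transformation itself in plain Python as a stand-in for the Dataprep step. It is not Dataprep's API, and the helper names and sample rows are assumptions made for illustration.

```python
def count_nulls(rows):
    """Count missing values so they can be monitored before conversion."""
    return sum(value is None for row in rows for value in row)

def replace_nulls_with_zero(rows):
    """Replace each missing value with 0.0 so every feature stays real-valued."""
    return [[0.0 if value is None else float(value) for value in row]
            for row in rows]

# Hypothetical feature rows with missing entries.
sample = [[1.5, None, 3.0], [None, 2.0, 4.5]]

null_count = count_nulls(sample)
cleaned = replace_nulls_with_zero(sample)
print(null_count)  # 2
print(cleaned)     # [[1.5, 0.0, 3.0], [0.0, 2.0, 4.5]]
```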

NEW QUESTION # 30
Your company produces 20,000 files every hour. Each data file is formatted as a comma separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low. You are told that due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)

A. Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
B. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
C. Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
D. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
E. Introduce data compression for each file to increase the rate of file transfer.

Answer: A,C

NEW QUESTION # 31
You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?

A. Eliminate features that are highly correlated to the output labels.
B. Combine highly co-dependent features into one representative feature.
C. Instead of feeding in each feature individually, average their values in batches of 3.
D. Remove the features that have null values for more than 50% of the training records.

Answer: B
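
One way to picture option B is to group features that are strongly correlated with each other and replace each group with a single averaged feature. The sketch below is a minimal, assumption-laden version: the 0.95 threshold, the greedy grouping, the averaging step, and the feature names are all illustrative choices, and only positive correlation is merged so that averaging does not cancel values out.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def combine_correlated(features, threshold=0.95):
    """Greedily merge features that correlate strongly with a group's first member."""
    groups = []
    for name in features:
        for group in groups:
            if pearson(features[group[0]], features[name]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    combined = {}
    for group in groups:
        columns = [features[name] for name in group]
        # The representative feature is the row-wise mean of the group.
        combined["+".join(group)] = [sum(vals) / len(vals) for vals in zip(*columns)]
    return combined

# Hypothetical weather features; the two humidity readings move together.
features = {
    "humidity_am": [0.2, 0.4, 0.6, 0.8],
    "humidity_pm": [0.25, 0.45, 0.65, 0.85],
    "wind_speed": [5.0, 1.0, 4.0, 2.0],
}
reduced = combine_correlated(features)
print(sorted(reduced))  # ['humidity_am+humidity_pm', 'wind_speed']
```

In practice, techniques such as principal component analysis serve the same purpose with fewer hand-tuned choices; the averaging here is only meant to make the idea concrete.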

NEW QUESTION # 32
When you store data in Cloud Bigtable, what is the recommended minimum amount of stored data?

A. 1 TB
B. 500 GB
C. 1 GB
D. 500 TB

Answer: A

Explanation:
Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions. It is not a good solution for less than 1 TB of data.
Reference:
https://cloud.google.com/bigtable/docs/overview#title_short_and_other_storage_options

NEW QUESTION # 33
......

Google Professional-Data-Engineer certification is ideal for data engineers who want to demonstrate their expertise in using Google Cloud technologies to develop and manage data pipelines. Google Certified Professional Data Engineer Exam certification is also suitable for individuals who want to enhance their career prospects in the field of data engineering. By passing the exam, candidates can prove their proficiency in designing, building, and maintaining data processing systems using Google Cloud services.

Guaranteed Success in Google Cloud Certified Professional-Data-Engineer Exam Dumps: https://surepass.free4dump.com/Professional-Data-Engineer-real-dump.html