Weekend Special - 65% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: c4sdisc65

Databricks-Certified-Professional-Data-Engineer PDF

$38.5

$109.99

3 Months Free Update

  • Printable Format
  • Value of Money
  • 100% Pass Assurance
  • Verified Answers
  • Researched by Industry Experts
  • Based on Real Exams Scenarios
  • 100% Real Questions

Databricks-Certified-Professional-Data-Engineer PDF + Testing Engine

$61.6

$175.99

3 Months Free Update

  • Exam Name: Databricks Certified Data Engineer Professional Exam
  • Last Update: Mar 2, 2024
  • Questions and Answers: 96
  • Free Real Questions Demo
  • Recommended by Industry Experts
  • Best Economical Package
  • Immediate Access

Databricks-Certified-Professional-Data-Engineer Engine

$46.2

$131.99

3 Months Free Update

  • Best Testing Engine
  • One Click installation
  • Recommended by Teachers
  • Easy to use
  • 3 Modes of Learning
  • State of Art Technology
  • 100% Real Questions included

Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Question # 6

Which statement describes the default execution mode for Databricks Auto Loader?

A.

New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

B.

Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and impotently into the target Delta Lake table.

C.

Webhook trigger Databricks job to run anytime new data arrives in a source directory; new data automatically merged into target tables using rules inferred from the data.

D.

New files are identified by listing the input directory; the target table is materialized by directory querying all valid files in the source directory.

Full Access
Question # 7

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFramedf. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFramedfhas the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Databricks-Certified-Professional-Data-Engineer question answer

Choose the response that correctly fills in the blank within the code block to complete this task.

A.

to_interval("event_time", "5 minutes").alias("time")

B.

window("event_time", "5 minutes").alias("time")

C.

"event_time"

D.

window("event_time", "10 minutes").alias("time")

E.

lag("event_time", "10 minutes").alias("time")

Full Access
Question # 8

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Databricks-Certified-Professional-Data-Engineer question answer

Which kind of the test does the above line exemplify?

A.

Integration

B.

Unit

C.

Manual

D.

functional

Full Access
Question # 9

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?

A.

Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B.

Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C.

Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D.

Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.

E.

Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Full Access
Question # 10

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A.

configure

B.

fs

C.

jobs

D.

libraries

E.

workspace

Full Access
Question # 11

A Data engineer wants to run unit’s tests using common Python testing frameworks on python functions defined across several Databricks notebooks currently used in production.

How can the data engineer run unit tests against function that work with data in production?

A.

Run unit tests against non-production data that closely mirrors production

B.

Define and unit test functions using Files in Repos

C.

Define units test and functions within the same notebook

D.

Define and import unit test functions from a separate Databricks notebook

Full Access
Question # 12

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, usingdisplay()calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

A.

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs. all PySpark and Spark SQL logic should be refactored.

B.

The only way to meaningfully troubleshoot code execution times in development notebooks Is to use production-sized data and production-sized clusters with Run All execution.

C.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

D.

Calling display () forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

E.

The Jobs Ul should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Full Access
Question # 13

Which Python variable contains a list of directories to be searched when trying to locate required modules?

A.

importlib.resource path

B.

,sys.path

C.

os-path

D.

pypi.path

E.

pylib.source

Full Access
Question # 14

Which statement describes the correct use of pyspark.sql.functions.broadcast?

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.

Full Access
Question # 15

A Delta table of weather records is partitioned by date and has the below schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the below filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

A.

All records are cached to an operational database and then the filter is applied

B.

The Parquet file footers are scanned for min and max statistics for the latitude column

C.

All records are cached to attached storage and then the filter is applied

D.

The Delta log is scanned for min and max statistics for the latitude column

E.

The Hive metastore is scanned for min and max statistics for the latitude column

Full Access
Question # 16

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

A.

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.

Z-order indices calculated on the table are preventing file compaction

C Bloom filler indices calculated on the table are preventing file compaction

C.

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

D.

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Full Access
Question # 17

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM's resources?

A.

The five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Full Access
Question # 18

Which statement regarding stream-static joins and static Delta tables is correct?

A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job's initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Full Access
Question # 19

A table is registered with the following code:

Databricks-Certified-Professional-Data-Engineer question answer

Bothusersandordersare Delta Lake tables. Which statement describes the results of queryingrecent_orders?

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Full Access
Question # 20

Review the following error traceback:

Databricks-Certified-Professional-Data-Engineer question answer

Which statement describes the error being raised?

A.

The code executed was PvSoark but was executed in a Scala notebook.

B.

There is no column in the table named heartrateheartrateheartrate

C.

There is a type error because a column object cannot be multiplied.

D.

There is a type error because a DataFrame object cannot be multiplied.

E.

There is a syntax error because the heartrate column is not correctly identified as a column.

Full Access
Question # 21

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A.

If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?

A.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

B.

All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.

C.

All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.

D.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until ail tasks have successfully been completed.

E.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.

Full Access
Question # 22

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?

A.

Workspace audit logs

B.

Driver's log file

C.

Ganglia

D.

Cluster Event Log

E.

Executor's log file

Full Access
Question # 23

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files

Full Access
Question # 24

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

A.

In the Executor's log file, by gripping for "predicate push-down"

B.

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

C.

In the Storage Detail screen, by noting which RDDs are not stored on disk

D.

In the Delta Lake transaction log. by noting the column statistics

E.

In the Query Detail screen, by interpreting the Physical Plan

Full Access
Question # 25

A table nameduser_ltvis being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

Theuser_ltvtable has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:

Databricks-Certified-Professional-Data-Engineer question answer

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.

B.

Only the email and itv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user itv.

D.

The email, age. and ltv columns will be returned with the values in user ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Full Access
Question # 26

The data engineering team maintains the following code:

Databricks-Certified-Professional-Data-Engineer question answer

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.

The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

B.

A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.

C.

The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.

D.

An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.

E.

An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Full Access
Question # 27

A Databricks SQL dashboard has been configured to monitor the total number of records present in a collection of Delta Lake tables using the following query pattern:

SELECT COUNT (*) FROM table -

Which of the following describes how results are generated each time the dashboard is updated?

A.

The total count of rows is calculated by scanning all data files

B.

The total count of rows will be returned from cached results unless REFRESH is run

C.

The total count of records is calculated from the Delta transaction logs

D.

The total count of records is calculated from the parquet file metadata

E.

The total count of records is calculated from the Hive metastore

Full Access
Question # 28

A junior data engineer on your team has implemented the following code block.

Databricks-Certified-Professional-Data-Engineer question answer

The viewnew_eventscontains a batch of records with the same schema as theeventsDelta table. Theevent_idfield serves as a unique key for this table.

When this query is executed, what will happen with new records that have the sameevent_idas an existing record?

A.

They are merged.

B.

They are ignored.

C.

They are updated.

D.

They are inserted.

E.

They are deleted.

Full Access