  • Exam Name: Databricks Certified Data Engineer Professional Exam
  • Last Update: Jan 13, 2025
  • Questions and Answers: 120

Databricks-Certified-Professional-Data-Engineer Practice Exam Questions with Answers Databricks Certified Data Engineer Professional Exam Certification

Question # 6

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and the executors.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?


The five Minute Load Average remains consistent/flat


Bytes Received never exceeds 80 million bytes per second


Total Disk Space remains constant


Network I/O never spikes


Overall cluster CPU utilization is around 25%
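
The relationship between node count and aggregate CPU utilization in this scenario can be checked with a little arithmetic. The sketch below is only an illustration: it assumes every node contributes equally to the cluster-wide CPU figure (the question states the driver and executors share a VM type), and the function name is hypothetical.

```python
def cluster_cpu_fraction(num_executors: int, busy_nodes: int = 1) -> float:
    """Fraction of total cluster CPU in use when only `busy_nodes` nodes are saturated.

    Assumption: the driver and all executors are identical VMs, so each node
    contributes an equal share to the aggregate CPU metric.
    """
    total_nodes = num_executors + 1  # executors plus the single driver
    return busy_nodes / total_nodes


# Three executors plus one driver, with only one node doing work.
print(cluster_cpu_fraction(3))  # 0.25
```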

Question # 7

An hourly batch job is configured to ingest data files from a cloud object storage container where each batch represents all records produced by the source system in a given hour. The batch job to process these records into the Lakehouse is sufficiently delayed to ensure no late-arriving data is missed. The user_id field represents a unique key for the data, which has the following schema:

user_id BIGINT, username STRING, user_utc STRING, user_region STRING, last_login BIGINT, auto_pay BOOLEAN, last_updated BIGINT

New records are all ingested into a table named account_history which maintains a full record of all data in the same schema as the source. The next table in the system is named account_current and is implemented as a Type 1 table representing the most recent value for each unique user_id.

Assuming there are millions of user accounts and tens of thousands of records processed hourly, which implementation can be used to efficiently update the described account_current table as part of each hourly batch job?


Use Auto Loader to subscribe to new files in the account_history directory; configure a Structured Streaming trigger once job to batch update newly detected files into the account_current table.


Overwrite the account_current table with each batch using the results of a query against the account_history table grouping by user_id and filtering for the max value of last_updated.


Filter records in account_history using the last_updated field and the most recent hour processed, as well as the max last_login by user_id; write a merge statement to update or insert the most recent value for each user_id.


Use Delta Lake version history to get the difference between the latest version of account_history and one version prior, then write these records to account_current.


Filter records in account_history using the last_updated field and the most recent hour processed, making sure to deduplicate on username; write a merge statement to update or insert the most recent value for each username.
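
Several of the options above describe reducing an hourly batch to one record per key and then upserting into a Type 1 table. The following is a minimal pure-Python sketch of that pattern, not Delta Lake code: the dictionary stands in for account_current, the function names are hypothetical, and the choice of ordering column is left as a parameter.

```python
from typing import Any, Dict, Iterable, Mapping

Record = Mapping[str, Any]


def latest_per_key(records: Iterable[Record], key: str, order_col: str) -> Dict[Any, Record]:
    """Keep only the record with the largest `order_col` value for each `key`."""
    latest: Dict[Any, Record] = {}
    for rec in records:
        seen = latest.get(rec[key])
        if seen is None or rec[order_col] > seen[order_col]:
            latest[rec[key]] = rec
    return latest


def upsert_type1(current: Dict[Any, dict], batch: Iterable[Record],
                 key: str = "user_id", order_col: str = "last_updated") -> Dict[Any, dict]:
    """Mimic MERGE semantics: update matched keys, insert unmatched keys.

    Assumption: `batch` has already been filtered to the most recent hour,
    so every batch record is at least as new as what `current` holds.
    """
    for k, rec in latest_per_key(batch, key, order_col).items():
        current[k] = dict(rec)  # Type 1: overwrite in place, keep no history
    return current


current = {1: {"user_id": 1, "username": "old", "last_updated": 100}}
batch = [
    {"user_id": 1, "username": "newer", "last_updated": 150},
    {"user_id": 1, "username": "stale", "last_updated": 120},
    {"user_id": 2, "username": "fresh", "last_updated": 130},
]
upsert_type1(current, batch)
print(current[1]["username"], len(current))  # newer 2
```

Only the keys present in the batch are touched, which is what keeps the update proportional to the hourly volume rather than to the full table.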

Question # 8

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?


All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.


All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.


Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.


All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.


The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Question # 9

Which statement regarding Spark configuration on the Databricks platform is true?


Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.


When the same Spark configuration property is set for an interactive to the same interactive cluster.


Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.


The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.

Question # 10

The data engineering team maintains the following code:

[Image: code block not reproduced]

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?


The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.


A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.


The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.


An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.


An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Question # 11

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?


The job_id is returned in this field.


The job_id and number of times the job has been run are concatenated and returned.


The number of times the job definition has been run in the workspace.


The globally unique ID of the newly triggered run.
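
To make the shape of such a response concrete, here is a small sketch that reads the run_id field out of a JSON body. The sample payload and its numeric value are hypothetical; only the field name run_id comes from the question.

```python
import json


def extract_run_id(response_body: str) -> int:
    """Parse a JSON response body and return its `run_id` field as an int."""
    payload = json.loads(response_body)
    return int(payload["run_id"])


# Hypothetical response body; the number is made up for illustration.
sample_response = '{"run_id": 455644833}'
print(extract_run_id(sample_response))  # 455644833
```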

Question # 12

A nightly job ingests data into a Delta Lake table using the following code:

[Image: code block not reproduced]

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():


return spark.readStream.table("bronze")


return spark.readStream.load("bronze")


return spark.read.option("readChangeFeed", "true").table("bronze")


[Image: code option not reproduced]

Question # 13

A data engineer wants to join a stream of advertisement impressions (when an ad was shown) with another stream of user clicks on advertisements to correlate when impressions led to monetizable clicks.

[Image: code block not reproduced]

Which solution would improve the performance?


[Images: candidate code blocks for Options A through D not reproduced]


Option A


Option B


Option C


Option D

Question # 14

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?


All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.


Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.


Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.


Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.

Question # 15

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?


Size on Disk is > 0


The number of Cached Partitions > the number of Spark Partitions


The RDD Block Name includes the '' annotation signaling failure to cache


On Heap Memory Usage is within 75% of Off Heap Memory Usage

Question # 16

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?


userLookup.join(streamingDF, ["userid"], how="inner")


streamingDF.join(userLookup, ["user_id"], how="outer")


streamingDF.join(userLookup, ["user_id"], how="left")


streamingDF.join(userLookup, ["userid"], how="inner")


userLookup.join(streamingDF, ["user_id"], how="right")
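
The rule behind this question can be written down as a small lookup. The sketch below encodes the stream-static support matrix from the Spark Structured Streaming documentation as I understand it: inner joins work in either order, an outer join is supported only when it preserves the streaming side, and full outer joins are not supported. The function name and flags are hypothetical, and only inner, left, right, and full outer joins are covered.

```python
_JOIN_ALIASES = {
    "inner": "inner",
    "left": "left", "left_outer": "left", "leftouter": "left",
    "right": "right", "right_outer": "right", "rightouter": "right",
    "outer": "full", "full": "full", "full_outer": "full", "fullouter": "full",
}


def stream_static_join_supported(left_is_streaming: bool, right_is_streaming: bool, how: str) -> bool:
    """Return True if a join with exactly one streaming side is a supported type."""
    if left_is_streaming == right_is_streaming:
        raise ValueError("expected exactly one streaming side")
    kind = _JOIN_ALIASES.get(how.lower())
    if kind is None:
        raise ValueError(f"join type not covered by this sketch: {how}")
    if kind == "inner":
        return True
    if kind == "full":
        return False
    # Left/right outer joins must keep every row of the streaming side.
    return (kind == "left") == left_is_streaming


# streamingDF.join(userLookup, ..., how="left")
print(stream_static_join_supported(True, False, "left"))   # True
# streamingDF.join(userLookup, ..., how="outer")
print(stream_static_join_supported(True, False, "outer"))  # False
```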

Question # 17

A junior data engineer on your team has implemented the following code block.

[Image: code block not reproduced]

The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?


They are merged.


They are ignored.


They are updated.


They are inserted.


They are deleted.

Question # 18

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?


Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.


Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.


Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.


Maintain data quality rules in a separate Databricks notebook that each DLT notebook of file.

Question # 19

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.

What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?


Can manage


Can edit


Can run


Can read

Question # 20

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.

In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?


Set the configuration delta.deduplicate = true.


VACUUM the Delta table after each batch completes.


Perform an insert-only merge with a matching condition on a unique key.


Perform a full outer join on a unique key and overwrite existing data.


Rely on Delta Lake schema enforcement to prevent duplicate records.
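
For reference, an insert-only merge in Delta Lake SQL is a MERGE statement with a WHEN NOT MATCHED clause and no WHEN MATCHED clause. The helper below is a hedged sketch that only builds the statement text; the table, view, and key names passed in are hypothetical, and identifiers are interpolated without validation.

```python
def insert_only_merge_sql(target: str, source: str, key: str) -> str:
    """Build a MERGE statement that inserts unmatched rows and leaves matched rows alone."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names, for illustration only.
print(insert_only_merge_sql("events", "new_events", "event_id"))
```

Because there is no WHEN MATCHED branch, source rows whose key already exists in the target are skipped rather than updated.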

Question # 21

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?


A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.


An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.


A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.


A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.


A managed table will be created in the DBFS root storage container.

Question # 22

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?


• Total VMs: 1

• 400 GB per Executor

• 160 Cores / Executor


• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor


• Total VMs: 4

• 100 GB per Executor

• 40 Cores / Executor


• Total VMs: 2

• 200 GB per Executor

• 80 Cores / Executor
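
The premise that all four configurations offer the same aggregate resources can be verified quickly. This sketch multiplies the per-executor figures listed above by the VM count, relying on the stated one-executor-per-VM assumption; the tuple layout is just a convenience for the example.

```python
# (total VMs, GB per executor, cores per executor), one executor per VM
CONFIGS = [
    (1, 400, 160),
    (8, 50, 20),
    (4, 100, 40),
    (2, 200, 80),
]


def cluster_totals(vms: int, gb_per_executor: int, cores_per_executor: int) -> tuple:
    """Return (total GB of RAM, total cores) for a cluster with one executor per VM."""
    return vms * gb_per_executor, vms * cores_per_executor


for cfg in CONFIGS:
    print(cfg, "->", cluster_totals(*cfg))
```

Every row yields the same totals, so the configurations differ only in how the resources are divided across machines.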

Question # 23

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?


Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.


Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.


Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.


Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.


Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Question # 24

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?


The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.


Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.


Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.


Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.


Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

Question # 25

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used.

Which strategy will yield the best performance without shuffling data?


Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.


Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.


Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.


Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.


Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
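
The figure of 2,048 partitions quoted in several options follows from the expression 1TB*1024*1024/512. A short sketch of that arithmetic is below; the function name is hypothetical, and it treats input size and output part-file size as directly comparable, which is a simplification because JSON and Parquet encode the same data at different sizes.

```python
import math

MB_PER_TB = 1024 * 1024  # 1 TB = 1024 GB = 1024 * 1024 MB


def target_partition_count(dataset_tb: float, part_file_mb: int) -> int:
    """Number of partitions needed so each holds roughly `part_file_mb` of data."""
    return math.ceil(dataset_tb * MB_PER_TB / part_file_mb)


print(target_partition_count(1, 512))  # 2048
```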

Question # 26

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?


Stage’s detail screen and Executor’s files


Stage’s detail screen and Query’s detail screen


Driver’s and Executor’s log files


Executor’s detail screen and Executor’s log files

Question # 27

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration given the highly-nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?


The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.


Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.


Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.


Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.


Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Question # 28

A Delta Lake table was created with the below query:

Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?


The table reference in the metastore is updated and no data is changed.


The table name change is recorded in the Delta transaction log.


All related files and metadata are dropped and recreated in a single ACID transaction.


The table reference in the metastore is updated and all data files are moved.


A new Delta transaction log is created for the renamed table.

Question # 29

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

Which statement describes this implementation?


The customers table is implemented as a Type 3 table; old values are maintained as a new column alongside the current value.


The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.


The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.


The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.


The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

Question # 30

A Delta Lake table representing metadata about content from users has the following schema:

Based on the above schema, which column is a good candidate for partitioning the Delta Table?









Question # 31

A distributed team of data analysts share computing resources on an interactive cluster with autoscaling configured. In order to better manage costs and query throughput, the workspace administrator is hoping to evaluate whether cluster upscaling is caused by many concurrent users or resource-intensive queries.

In which location can one review the timeline for cluster resizing events?


Workspace audit logs


Driver's log file




Cluster Event Log


Executor's log file

Question # 32

The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

For demand forecasting, the Lakehouse contains a validated table of all itemized sales updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly and reduce total compute associated with each materialization.

Which solution meets the expectations of the end users while controlling and limiting possible costs?


Use the Delta Cache to persist the products_per_order table in memory to quickly update the dashboard with each query.


Populate the dashboard by configuring a nightly batch job to save the required to quickly update the dashboard with each query.


Use Structured Streaming to configure a live dashboard against the products_per_order table within a Databricks notebook.


Define a view against the products_per_order table and define the dashboard against this view.

Question # 33

A data engineer, User A, has promoted a new pipeline to production by using the REST API to programmatically create several jobs. A DevOps engineer, User B, has configured an external orchestration tool to trigger job runs through the REST API. Both users authorized the REST API calls using their personal access tokens.

Which statement describes the contents of the workspace audit logs concerning these events?


Because the REST API was used for job creation and triggering runs, a Service Principal will be automatically used to identify these events.


Because User B last configured the jobs, their identity will be associated with both the job creation events and the job run events.


Because these events are managed separately, User A will have their identity associated with the job creation events and User B will have their identity associated with the job run events.


Because the REST API was used for job creation and triggering runs, user identity will not be captured in the audit logs.


Because User A created the jobs, their identity will be associated with both the job creation events and the job run events.

Question # 34

A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.

Given the current implementation, which method can be used?


Parse the Delta Lake transaction log to identify all newly written data files.


Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.


Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.


Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
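
As an illustration of comparing two table versions with time travel, the sketch below builds a query that returns rows present in one version but absent from another, using the VERSION AS OF clause and EXCEPT. It only constructs the SQL text; the version numbers in the example are hypothetical, and the function name is my own.

```python
def version_diff_sql(table: str, new_version: int, old_version: int) -> str:
    """Build a query returning rows in `new_version` that are not in `old_version`."""
    if new_version < 0 or old_version < 0:
        raise ValueError("Delta table versions are non-negative integers")
    return (
        f"SELECT * FROM {table} VERSION AS OF {int(new_version)}\n"
        "EXCEPT\n"
        f"SELECT * FROM {table} VERSION AS OF {int(old_version)}"
    )


# Hypothetical version numbers for the table named in the question.
print(version_diff_sql("customer_churn_params", 12, 11))
```

Swapping the two version arguments gives the rows that were removed or changed going the other way.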

Question # 35

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?




DESCRIBE DETAIL dev.pii_test







Question # 36

Which statement describes the default execution mode for Databricks Auto Loader?


New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.


Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.


Webhooks trigger a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.


New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.
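
The two file-discovery approaches mentioned in the options correspond to an Auto Loader reader option. The sketch below assembles an options dictionary using cloudFiles.format and cloudFiles.useNotifications, which to my knowledge are real Auto Loader option names with notifications off unless enabled; the helper function itself is hypothetical and nothing here starts a stream.

```python
def auto_loader_options(file_format: str, use_notifications: bool = False) -> dict:
    """Assemble reader options for an Auto Loader (cloudFiles) source.

    With use_notifications left False, file discovery relies on directory listing.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


opts = auto_loader_options("json")
print(opts)
# On Databricks these would typically be passed along the lines of:
#   spark.readStream.format("cloudFiles").options(**opts).load(source_path)
# where source_path is a placeholder for the input directory.
```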
