Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 PDF

$33

$109.99

3 Months Free Update

  • Printable Format
  • Value for Money
  • 100% Pass Assurance
  • Verified Answers
  • Researched by Industry Experts
  • Based on Real Exam Scenarios
  • 100% Real Questions

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 PDF + Testing Engine

$52.8

$175.99

3 Months Free Update

  • Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5 – Python
  • Last Update: Oct 19, 2025
  • Questions and Answers: 136
  • Free Real Questions Demo
  • Recommended by Industry Experts
  • Best Economical Package
  • Immediate Access

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Engine

$39.6

$131.99

3 Months Free Update

  • Best Testing Engine
  • One-Click Installation
  • Recommended by Teachers
  • Easy to Use
  • 3 Modes of Learning
  • State-of-the-Art Technology
  • 100% Real Questions Included

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Exam Questions with Answers for the Databricks Certified Associate Developer for Apache Spark 3.5 – Python Certification

Question # 6

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?

A.

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

B.

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

C.

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

D.

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy

Full Access
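
For reference, a minimal PySpark sketch of the accuracy trade-off this question is about, using illustrative data and column names: the accuracy parameter of percentile_approx/approx_percentile defaults to 10000, and a larger value reduces the relative error (roughly 1/accuracy) at the cost of more memory.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000).withColumnRenamed("id", "amount")  # illustrative data

# Raising accuracy above the default 10000 brings the result closer to the exact percentile.
quartiles = df.select(
    F.percentile_approx("amount", [0.25, 0.5, 0.75], 1_000_000).alias("quartiles")
)
quartiles.show(truncate=False)
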
Question # 7

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

A.

By configuring the option checkpointLocation during readStream

B.

By configuring the option recoveryLocation during the SparkSession initialization

C.

By configuring the option recoveryLocation during writeStream

D.

By configuring the option checkpointLocation during writeStream

Full Access
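
As context, a minimal sketch of a streaming sink with a checkpoint directory, assuming an existing SparkSession named spark and hypothetical paths: the checkpointLocation option set during writeStream is what allows a restarted query to resume from its last committed progress.

query = (
    spark.readStream.format("rate").load()                 # illustrative source
         .writeStream
         .format("parquet")
         .option("path", "/data/stream_out")               # hypothetical output path
         .option("checkpointLocation", "/chk/stream_out")  # hypothetical checkpoint path
         .outputMode("append")
         .start()
)
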
Question # 8

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records.

The engineer has written the following code:

inputStream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "15 minutes"))

What happens to data that arrives after the watermark threshold?

A.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

B.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.

Full Access
Question # 9

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

A.

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

B.

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

C.

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

D.

query = df.writeStream \
    .outputMode("append") \
    .start()

Full Access
Question # 10

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate these transaction records using PySpark?

A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])

Full Access
Question # 11

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

A.

It increases the partition size for df1 and df2.

B.

It ensures that the join happens only when the id values are identical.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It filters the id values before performing the join.

Full Access
Question # 12

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

A.

users.write(path="/some/path").saveAsTable("default_table")

B.

users.write.saveAsTable("default_table").option("path", "/some/path")

C.

users.write.option("path", "/some/path").saveAsTable("default_table")

D.

users.write.saveAsTable("default_table", path="/some/path")

Full Access
Question # 13

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

A.

Convert the Pandas UDF to a PySpark UDF

B.

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

C.

Run the in_spanish_inner() function in a mapInPandas() function call

D.

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF

Full Access
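
To illustrate the Iterator[pd.Series] → Iterator[pd.Series] pattern that the options refer to, here is a minimal sketch, assuming the question's get_translation_model() helper exists: the model is loaded once and then reused for every batch in a partition instead of once per UDF call.

import pandas as pd
from typing import Iterator
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType

@sf.pandas_udf(StringType())
def in_spanish(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    model = get_translation_model(target_lang='es')  # assumed helper from the question; loaded once
    for batch in batches:
        yield batch.apply(model)

# Illustrative usage: df.withColumn("text_es", in_spanish("text"))
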
Question # 14

A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.

Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.

Which feature of Apache Spark effectively addresses this challenge?

A.

Ability to process small datasets efficiently

B.

In-memory computation and parallel processing capabilities

C.

Support for SQL queries on structured data

D.

Built-in machine learning libraries

Full Access
Question # 15

Given:

spark.sparkContext.setLogLevel("")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

A.

ALL, DEBUG, FAIL, INFO

B.

ERROR, WARN, TRACE, OFF

C.

WARN, NONE, ERROR, FATAL

D.

FATAL, NONE, INFO, DEBUG

Full Access
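
For context, a one-line sketch assuming an existing SparkSession named spark; setLogLevel() accepts the standard log4j level strings (ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN).

spark.sparkContext.setLogLevel("WARN")  # e.g. silence INFO-level driver logging
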
Question # 16

What is the behavior of the date_sub(start, days) function if a negative value is passed into the days parameter?

A.

The same start date will be returned

B.

An error message of an invalid parameter will be returned

C.

The number of days specified will be added to the start date

D.

The number of days specified will be removed from the start date

Full Access
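
A quick illustration of the behavior being asked about, using an illustrative one-row DataFrame and assuming an existing SparkSession named spark: passing a negative days value to date_sub() moves the date forward.

from pyspark.sql import functions as F

df = spark.createDataFrame([("2024-01-10",)], ["start"])
df.select(F.date_sub(F.col("start"), -5).alias("result")).show()  # result: 2024-01-15
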
Question # 17

Which components of Apache Spark’s Architecture are responsible for carrying out tasks when assigned to them?

A.

Driver Nodes

B.

Executors

C.

CPU Cores

D.

Worker Nodes

Full Access
Question # 18

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

A.

Use an RDD action like reduce() to compute the maximum time

B.

Use an accumulator to record the maximum time on the driver

C.

Broadcast a variable to share the maximum time among workers

D.

Configure the Spark UI to automatically collect maximum times

Full Access
Question # 19

What is a feature of Spark Connect?

A.

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs

B.

Supports DataFrame, Functions, Column, SparkContext PySpark APIs

C.

It supports only PySpark applications

D.

It has built-in authentication

Full Access
Question # 20

What is the benefit of Adaptive Query Execution (AQE)?

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

Full Access
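
As a reference point, a short sketch of the configuration switches behind AQE (it is enabled by default in recent Spark releases, so treat the explicit set calls as illustrative): with these on, Spark can coalesce shuffle partitions, switch join strategies, and split skewed partitions using statistics gathered at runtime.

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
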
Question # 21

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

A.

Configure the application to run in cluster mode instead of local mode.

B.

Increase the number of local threads based on the number of CPU cores.

C.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

D.

Set the spark.executor.memory property to a large value.

Full Access
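
A minimal sketch of sizing local mode by threads, with an illustrative application name: local[*] starts one worker thread per logical CPU core, while local[N] pins the thread count explicitly.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # or e.g. "local[4]" to pin four threads
    .appName("local-testing")    # illustrative name
    .getOrCreate()
)
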
Question # 22

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

B.

Increase the executor memory allocation in the Spark configuration.

C.

Reduce the size of the data partitions to improve task scheduling.

D.

Increase the number of executor instances to handle more concurrent tasks.

Full Access
Question # 23

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

A.

Transformations are executed immediately to build the lineage graph.

B.

The Spark engine optimizes the execution plan during the transformations, causing delays.

C.

Transformations are evaluated lazily.

D.

The Spark engine requires manual intervention to start executing transformations.

E.

Only actions trigger the execution of the transformation pipeline.

Full Access
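
A tiny sketch of the lazy evaluation behavior described above, assuming an existing SparkSession named spark: the filter and withColumn calls only extend the logical plan, and nothing executes until the count() action at the end.

from pyspark.sql import functions as F

df = spark.range(1_000_000)                      # no job runs yet
transformed = (
    df.filter(F.col("id") % 2 == 0)              # transformation: recorded in the plan
      .withColumn("half", F.col("id") / 2)       # transformation: still no execution
)
print(transformed.count())                       # action: triggers the actual job
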
Question # 24

A data scientist is working with a Spark DataFrame called customerDF that contains customer information.

The DataFrame has a column named email with customer email addresses.

The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

A.

customerDF = customerDF \
    .withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

B.

customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))

C.

customerDF = customerDF.select("email").alias("username", "domain")

D.

customerDF = customerDF.withColumn("domain", col("email").split("@")[1])

Full Access
Question # 25

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

A.

final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

B.

final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

C.

final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

D.

final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Full Access
Question # 26

What is the benefit of Adaptive Query Execution (AQE)?

A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Full Access
Question # 27

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields

Full Access
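
A one-line sketch of deduplicating on exactly the three fields named above, assuming the DataFrame is called df: rows that differ only in ingest_ts or source_file_path collapse to a single record.

deduped = df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])
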
Question # 28

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

B.

Use spark.read.json() with the inferSchema option set to true

C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.

Full Access
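
For reference, a minimal sketch of reading JSON with an explicit schema, using illustrative field names and a hypothetical path: supplying the schema up front avoids the extra pass over the files that schema inference would require.

from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

order_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_ts", TimestampType(), True),
])

orders = spark.read.schema(order_schema).json("/data/orders/")  # hypothetical path
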
Question # 29

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.executor.memory

D.

spark.sql.shuffle.partitions

Full Access
Question # 30

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

A.

A Cartesian join

B.

A shuffled hash join

C.

A broadcast nested loop join

D.

A sort-merge join

Full Access
Question # 31

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

A.

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

B.

num_df.select(cube_func("num")).show()

C.

spark.createDataFrame(cube_func("num")).show()

D.

num_df.register("cube_func").select("num").show()

Full Access
Question # 32

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

A.

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

B.

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

C.

Use a regular Spark UDF:

from pyspark.sql.functions import mean

df.groupBy("user_id").agg(mean("value")).show()

D.

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Full Access
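
For reference, a self-contained sketch of the applyInPandas pattern shown in option A, including an illustrative definition of mean_func (which the options reference but do not define): each user_id group is handed to the function as a pandas DataFrame, and the groups are processed in parallel across the workers.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 30.0)], ["user_id", "value"])

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives one pandas DataFrame per user_id group.
    return pd.DataFrame({"user_id": [pdf["user_id"].iloc[0]],
                         "value": [pdf["value"].mean()]})

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
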
Question # 33

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

Full Access
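
A short illustrative sketch, assuming an existing SparkSession named spark: spark.sql.shuffle.partitions controls how many tasks each shuffle stage produces, so a larger value can create enough tasks to keep otherwise idle executors busy (the right number depends on data volume and total cores).

spark.conf.set("spark.sql.shuffle.partitions", "800")  # illustrative value; the default is 200
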
Question # 34

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

Full Access
Question # 35

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only

Full Access
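
A minimal sketch of a partition-pruned layout for this table, reusing the names from the question but treating the exact write options as illustrative: with partitionBy, filters on event_year and event_month can skip whole directories.

from pyspark.sql import functions as F

final = (
    df.withColumn("event_year", F.year("event_ts"))
      .withColumn("event_month", F.month("event_ts"))
)

(final.write
      .format("parquet")
      .mode("overwrite")
      .partitionBy("event_year", "event_month")
      .saveAsTable("events.liveLatest"))
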
Question # 36

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

A.

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

B.

df.write.mode("append").partitionBy("country").parquet("/data/output")

C.

df.write.mode("overwrite").parquet("/data/output")

D.

df.write.partitionBy("country").parquet("/data/output")

Full Access
Question # 37

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.

Full Access
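
A brief sketch of the usual pattern, with illustrative names: build (or reuse) one SparkSession via getOrCreate() at startup and pass it through the application rather than constructing a new session per step.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-pipeline").getOrCreate()  # reuses an active session if present

def run_pipeline(session: SparkSession) -> None:
    session.range(10).filter("id > 5").show()

run_pipeline(spark)
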
Question # 38

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.

To remove the duplicates, the engineer adds the code:

df = df.withWatermark("event_timestamp", "30 minutes")

What is the result?

A.

It removes all duplicates regardless of when they arrive.

B.

It accepts watermarks in seconds and the code results in an error.

C.

It removes duplicates that arrive within the 30-minute window specified by the watermark.

D.

It is not able to handle deduplication in this scenario.

Full Access
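
A minimal sketch of the streaming deduplication pattern this scenario points at, with event_id as an assumed key column: combining withWatermark with dropDuplicates bounds the state Spark keeps for deduplication to the watermark interval.

deduped = (
    df.withWatermark("event_timestamp", "30 minutes")
      .dropDuplicates(["event_id", "event_timestamp"])  # event_id is an assumed key column
)
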
Question # 39

Given the code fragment:

import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

A.

psdf.to_spark()

B.

psdf.to_pyspark()

C.

psdf.to_pandas()

D.

psdf.to_dataframe()

Full Access
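
A two-line sketch for context, assuming the psdf defined in the question and a recent Spark version: to_spark() converts a pandas-on-Spark DataFrame to a standard PySpark DataFrame, and pandas_api() converts back.

sdf = psdf.to_spark()         # pyspark.pandas.DataFrame -> pyspark.sql.DataFrame
psdf_back = sdf.pandas_api()  # back to a pandas-on-Spark DataFrame
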
Question # 40

The following code fragment results in an error:

@F.udf(T.IntegerType())
def simple_udf(t: str) -> str:
    return answer * 3.14159

Which code fragment should be used instead?

A.

@F.udf(T.IntegerType())
def simple_udf(t: int) -> int:
    return t * 3.14159

B.

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

C.

@F.udf(T.DoubleType())
def simple_udf(t: int) -> int:
    return t * 3.14159

D.

@F.udf(T.IntegerType())
def simple_udf(t: float) -> float:
    return t * 3.14159

Full Access