Practice Free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Databricks Certified Associate Developer for Apache Spark 3.5 – Python Exam Questions Answers With Explanation

We at Crack4sure are committed to giving students who are preparing for the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 exam the most current and reliable questions. To help people study, we've made some of our Databricks Certified Associate Developer for Apache Spark 3.5 – Python exam materials available for free to everyone. You can take the free Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 practice test as many times as you want. The answers to the practice questions are given, and each answer is explained.

Question # 6

A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.

Which change should be made to solve the issue?

Decrease the first value of the percentage parameter to increase the accuracy of the percentile ranges

Decrease the value of the accuracy parameter in order to decrease the memory usage but also improve the accuracy

Increase the last value of the percentage parameter to increase the accuracy of the percentile ranges

Increase the value of the accuracy parameter in order to increase the memory usage but also improve the accuracy
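As a rough sketch of why a larger accuracy value helps: Spark's documentation for approx_percentile describes 1.0/accuracy as the relative error of the approximation (default accuracy 10000), so raising accuracy tightens the error bound at the cost of memory. The helper name below is hypothetical and only does that arithmetic in plain Python; it does not call Spark.

```python
def relative_error(accuracy: int) -> float:
    """Relative error bound of the approximation, taken as 1.0 / accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# A larger accuracy value gives a smaller relative error bound.
for accuracy in (100, 10_000, 1_000_000):
    print(accuracy, relative_error(accuracy))
```

In PySpark the parameter would be passed as the third argument, for example approx_percentile(col, percentage, accuracy).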

Question # 7

A data engineer is building a Structured Streaming pipeline and wants the pipeline to recover from failures or intentional shutdowns by continuing where the pipeline left off.

How can this be achieved?

By configuring the option checkpointLocation during readStream

By configuring the option recoveryLocation during the SparkSession initialization

By configuring the option recoveryLocation during writeStream

By configuring the option checkpointLocation during writeStream

Question # 8

A data engineer is implementing a streaming pipeline with watermarking to handle late-arriving records.

The engineer has written the following code:

inputStream \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(window("event_time", "15 minutes"))

What happens to data that arrives after the watermark threshold?

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

The watermark ensures that late data arriving within 10 minutes of the latest event time will be processed and included in the windowed aggregation.
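The watermark rule can be modelled in plain Python. This is a simplified sketch, not Spark's implementation: it assumes the watermark is the latest event time seen minus the 10-minute delay, and treats any record older than that as too late. Spark itself updates the watermark per micro-batch and applies it at window granularity. The function and variable names are hypothetical.

```python
from datetime import datetime, timedelta


def is_accepted(event_time: datetime,
                max_event_time_seen: datetime,
                delay: timedelta) -> bool:
    """Return True if the record is not older than the current watermark."""
    watermark = max_event_time_seen - delay
    return event_time >= watermark


latest = datetime(2024, 1, 1, 12, 30)   # latest event time seen so far
delay = timedelta(minutes=10)           # matches withWatermark(..., "10 minutes")

# Five minutes behind the latest event: within the 10-minute threshold.
print(is_accepted(datetime(2024, 1, 1, 12, 25), latest, delay))
# Twenty-five minutes behind the latest event: older than the watermark.
print(is_accepted(datetime(2024, 1, 1, 12, 5), latest, delay))
```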

Question # 9

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

query = df.writeStream \
    .outputMode("append") \
    .trigger(processingTime="5 seconds") \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .trigger(continuous="5 seconds") \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .trigger(once=True) \
    .start()

query = df.writeStream \
    .outputMode("append") \
    .start()

Question # 10

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate the transactions using PySpark?

df = df.dropDuplicates()

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

df = df.filter(F.col("transaction_id").isNotNull())

df = df.dropDuplicates(["transaction_amount"])
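The difference between deduplicating on all columns and on a subset can be sketched without Spark. The helper below is a hypothetical plain-Python stand-in that keeps the first row it sees for each key; Spark's dropDuplicates makes no such ordering promise when a subset is given. The sample rows are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


def drop_duplicates(rows: List[Row], subset: Optional[Sequence[str]] = None) -> List[Row]:
    """Keep one row per distinct key; the key is every column unless subset is given."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"transaction_id": 1, "account_number": "A", "transaction_amount": 50.0, "timestamp": "2024-01-01T10:00:00"},
    {"transaction_id": 1, "account_number": "A", "transaction_amount": 50.0, "timestamp": "2024-01-01T10:00:00"},
    {"transaction_id": 2, "account_number": "B", "transaction_amount": 50.0, "timestamp": "2024-01-01T11:00:00"},
]

# All columns: only the exact repeat is removed.
print(len(drop_duplicates(rows)))
# Amount only: a distinct transaction with the same amount is lost as well.
print(len(drop_duplicates(rows, ["transaction_amount"])))
```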

Question # 11

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast
df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

It increases the partition size for df1 and df2.

It ensures that the join happens only when the id values are identical.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

It filters the id values before performing the join.

Question # 12

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:

users.write(path="/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table").option("path", "/some/path")

users.write.option("path", "/some/path").saveAsTable("default_table")

users.write.saveAsTable("default_table", path="/some/path")

Question # 13

An MLOps engineer is building a Pandas UDF that applies a language model that translates English strings into Spanish. The initial code is loading the model on every call to the UDF, which is hurting the performance of the data pipeline.

The initial code is:

def in_spanish_inner(df: pd.Series) -> pd.Series:
    model = get_translation_model(target_lang='es')
    return df.apply(model)

in_spanish = sf.pandas_udf(in_spanish_inner, StringType())

How can the MLOps engineer change this code to reduce how many times the language model is loaded?

Convert the Pandas UDF to a PySpark UDF

Convert the Pandas UDF from a Series → Series UDF to a Series → Scalar UDF

Run the in_spanish_inner() function in a mapInPandas() function call

Convert the Pandas UDF from a Series → Series UDF to an Iterator[Series] → Iterator[Series] UDF
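The effect of the iterator form on model loading can be illustrated in plain Python. This is a sketch under assumptions: lists of strings stand in for pandas Series batches, and get_translation_model here is a dummy replacement for the function named in the question that only counts how often it is called. In the per-batch form the model is loaded on every call; in the iterator form it is loaded once and reused for every batch the iterator yields.

```python
from typing import Callable, Iterator, List

load_calls: List[str] = []  # records every simulated model load


def get_translation_model(target_lang: str) -> Callable[[str], str]:
    """Dummy loader: notes the load and returns a fake translator."""
    load_calls.append(target_lang)
    return lambda text: f"[{target_lang}] {text}"


def translate_per_batch(batch: List[str]) -> List[str]:
    """Series -> Series shape: the model is loaded on every call."""
    model = get_translation_model("es")
    return [model(text) for text in batch]


def translate_iterator(batches: Iterator[List[str]]) -> Iterator[List[str]]:
    """Iterator -> Iterator shape: the model is loaded once, then reused."""
    model = get_translation_model("es")
    for batch in batches:
        yield [model(text) for text in batch]


batches = [["hello", "world"], ["good", "morning"], ["thanks"]]

load_calls.clear()
per_batch_result = [translate_per_batch(b) for b in batches]
per_batch_loads = len(load_calls)

load_calls.clear()
iterator_result = list(translate_iterator(iter(batches)))
iterator_loads = len(load_calls)

print(per_batch_loads, iterator_loads)
```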

Question # 14

A data scientist at a large e-commerce company needs to process and analyze 2 TB of daily customer transaction data. The company wants to implement real-time fraud detection and personalized product recommendations.

Currently, the company uses a traditional relational database system, which struggles with the increasing data volume and velocity.

Which feature of Apache Spark effectively addresses this challenge?

Ability to process small datasets efficiently

In-memory computation and parallel processing capabilities

Support for SQL queries on structured data

Built-in machine learning libraries

Question # 15

Given:

spark.sparkContext.setLogLevel("")

Which set contains the suitable configuration settings for Spark driver LOG_LEVELs?

ALL, DEBUG, FAIL, INFO

ERROR, WARN, TRACE, OFF

WARN, NONE, ERROR, FATAL

FATAL, NONE, INFO, DEBUG

Question # 16

What is the behavior for function date_sub(start, days) if a negative value is passed into the days parameter?

The same start date will be returned

An error message of an invalid parameter will be returned

The number of days specified will be added to the start date

The number of days specified will be removed from the start date
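The arithmetic behind date_sub can be mirrored with Python's datetime module. The function below is a plain-Python stand-in with the same name, not the Spark function; it assumes date_sub behaves as start minus days, which is how Spark documents it, so a negative days value moves the date forward.

```python
from datetime import date, timedelta


def date_sub(start: date, days: int) -> date:
    """Stand-in for Spark's date_sub: subtract `days` from `start`."""
    return start - timedelta(days=days)


start = date(2024, 3, 10)
print(date_sub(start, 3))    # three days earlier
print(date_sub(start, -3))   # negative value: three days later
```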

Question # 17

Which components of Apache Spark’s Architecture are responsible for carrying out tasks when assigned to them?

Driver Nodes

Executors

CPU Cores

Worker Nodes

Question # 18

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Use an RDD action like reduce() to compute the maximum time

Use an accumulator to record the maximum time on the driver

Broadcast a variable to share the maximum time among workers

Configure the Spark UI to automatically collect maximum times

Question # 19

What is a feature of Spark Connect?

It supports DataStreamReader, DataStreamWriter, StreamingQuery, and Streaming APIs

It supports the DataFrame, Functions, Column, and SparkContext PySpark APIs

It supports only PySpark applications

It has built-in authentication

Question # 20

What is the benefit of Adaptive Query Execution (AQE)?

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

Question # 21

How can a Spark developer ensure optimal resource utilization when running Spark jobs in Local Mode for testing?

Options:

Configure the application to run in cluster mode instead of local mode.

Increase the number of local threads based on the number of CPU cores.

Use the spark.dynamicAllocation.enabled property to scale resources dynamically.

Set the spark.executor.memory property to a large value.

Question # 22

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

Increase the executor memory allocation in the Spark configuration.

Reduce the size of the data partitions to improve task scheduling.

Increase the number of executor instances to handle more concurrent tasks.

Question # 23

A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior? (Choose 2 answers)

Transformations are executed immediately to build the lineage graph.

The Spark engine optimizes the execution plan during the transformations, causing delays.

Transformations are evaluated lazily.

The Spark engine requires manual intervention to start executing transformations.

Only actions trigger the execution of the transformation pipeline.
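Lazy evaluation can be shown by analogy with Python's built-in map, which also defers work until its result is consumed. This is only an analogy under that assumption, not Spark code: building the pipeline plays the role of a transformation, and list() plays the role of an action.

```python
calls = []  # records each element the function actually processed


def double(x: int) -> int:
    calls.append(x)
    return x * 2


# "Transformation": the pipeline is defined, but nothing has run yet.
pipeline = map(double, [1, 2, 3])
calls_before_action = list(calls)

# "Action": consuming the pipeline triggers the deferred work.
result = list(pipeline)
calls_after_action = list(calls)

print(calls_before_action, result, calls_after_action)
```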

Question # 24

A data scientist is working with a Spark DataFrame called customerDF that contains customer information.

The DataFrame has a column named email with customer email addresses.

The data scientist needs to split this column into username and domain parts.

Which code snippet splits the email column into username and domain columns?

customerDF = customerDF \
    .withColumn("username", split(col("email"), "@").getItem(0)) \
    .withColumn("domain", split(col("email"), "@").getItem(1))

customerDF = customerDF.withColumn("username", regexp_replace(col("email"), "@", ""))

customerDF = customerDF.select("email").alias("username", "domain")

customerDF = customerDF.withColumn("domain", col("email").split("@")[1])

Question # 25

A data engineer has been asked to produce a Parquet table which is overwritten every day with the latest data. The downstream consumer of this Parquet table has a hard requirement that the data in this table is produced with all records sorted by the market_time field.

Which line of Spark code will produce a Parquet table that meets these requirements?

final_df \
    .sort("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

final_df \
    .orderBy("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

final_df \
    .sort("market_time") \
    .coalesce(1) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

final_df \
    .sortWithinPartitions("market_time") \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .saveAsTable("output.market_events")

Question # 26

What is the benefit of Adaptive Query Execution (AQE)?

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

Question # 27

Given the schema:

event_ts TIMESTAMP,
sensor_id STRING,
metric_value LONG,
ingest_ts TIMESTAMP,
source_file_path STRING

The goal is to deduplicate based on: event_ts, sensor_id, and metric_value.

Options:

dropDuplicates on all columns (wrong criteria)

dropDuplicates with no arguments (removes based on all columns)

groupBy without aggregation (invalid use)

dropDuplicates on the exact matching fields

Question # 28

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

Use spark.read.json() with the inferSchema option set to true

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.

Question # 29

Which Spark configuration controls the number of tasks that can run in parallel on an executor?

spark.executor.cores

spark.task.maxFailures

spark.executor.memory

spark.sql.shuffle.partitions
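The relationship between executor cores and concurrent tasks is simple arithmetic. The sketch below assumes each task reserves spark.task.cpus cores (1 by default), so an executor can run spark.executor.cores divided by spark.task.cpus tasks at once. The function name is hypothetical and nothing here reads a real Spark configuration.

```python
def max_concurrent_tasks(executor_cores: int, task_cpus: int = 1) -> int:
    """Task slots on one executor: executor cores divided by cores per task."""
    if executor_cores <= 0 or task_cpus <= 0:
        raise ValueError("core counts must be positive")
    return executor_cores // task_cpus


print(max_concurrent_tasks(4))      # spark.executor.cores=4, default spark.task.cpus
print(max_concurrent_tasks(8, 2))   # spark.executor.cores=8, spark.task.cpus=2
```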

Question # 30

A developer notices that all the post-shuffle partitions in a dataset are smaller than the value set for spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold.

Which type of join will Adaptive Query Execution (AQE) choose in this case?

A Cartesian join

A shuffled hash join

A broadcast nested loop join

A sort-merge join

Question # 31

A data engineer is working on a num_df DataFrame and has a Python UDF defined as:

def cube_func(val):
    return val * val * val

Which code fragment registers and uses this UDF as a Spark SQL function to work with the DataFrame num_df?

spark.udf.register("cube_func", cube_func)
num_df.selectExpr("cube_func(num)").show()

num_df.select(cube_func("num")).show()

spark.createDataFrame(cube_func("num")).show()

num_df.register("cube_func").select("num").show()

Question # 32

A developer is working with a pandas DataFrame containing user behavior data from a web application.

Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

Use the applyInPandas API:

df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()

Use the mapInPandas API:

df.mapInPandas(mean_func, schema="user_id long, value double").show()

Use a regular Spark UDF:

from pyspark.sql.functions import mean
df.groupBy("user_id").agg(mean("value")).show()

Use a Pandas UDF:

@pandas_udf("double")
def mean_func(value: pd.Series) -> float:
    return value.mean()

df.groupby("user_id").agg(mean_func(df["value"])).show()

Question # 33

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Increase the value of spark.sql.shuffle.partitions

Reduce the value of spark.sql.shuffle.partitions

Increase the size of the dataset to create more partitions

Enable dynamic resource allocation to scale resources as needed

Question # 34

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

Spark DataFrames, Structured Streaming, and GraphX

Spark SQL, Pandas API on Spark, and Structured Streaming

Spark Streaming, GraphX, and Pandas API on Spark

Spark DataFrames, Spark SQL, and MLlib

Question # 35

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Replace .bucketBy() with .partitionBy("event_year", "event_month")

Change the bucket count (42) to a lower number

Add .sortBy() after .bucketBy()

Replace .bucketBy() with .partitionBy("event_year") only
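Partitioning on both columns helps because each distinct year and month pair is written to its own Hive-style directory, so a filter on both columns can skip every other directory. The sketch below only builds such a path in plain Python; the base path and function name are hypothetical, and the column=value naming is the layout partitionBy is assumed to produce.

```python
def partition_path(base: str, event_year: int, event_month: int) -> str:
    """Build the Hive-style directory for one (year, month) partition."""
    return f"{base}/event_year={event_year}/event_month={event_month}"


# Hypothetical table location, used only for illustration.
base = "/warehouse/events.db/livelatest"

# A query filtered on year 2024 and month 5 only needs this directory.
print(partition_path(base, 2024, 5))
```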

Question # 36

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

df.write.mode("append").partitionBy("country").parquet("/data/output")

df.write.mode("overwrite").parquet("/data/output")

df.write.partitionBy("country").parquet("/data/output")

Question # 37

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Use a single SparkSession instance for the entire application.

Avoid using a SparkSession and rely on SparkContext only.

Create a new SparkSession instance before each transformation.

Stop and restart the SparkSession after each action.

Question # 38

A data engineer observes that the upstream streaming source feeds the event table frequently and sends duplicate records. Upon analyzing the current production table, the data engineer found that the time difference in the event_timestamp column of the duplicate records is, at most, 30 minutes.

To remove the duplicates, the engineer adds the code:

df = df.withWatermark("event_timestamp", "30 minutes")

What is the result?

It removes all duplicates regardless of when they arrive.

It accepts watermarks in seconds and the code results in an error.

It removes duplicates that arrive within the 30-minute window specified by the watermark.

It is not able to handle deduplication in this scenario.

Question # 39

Given the code fragment:

import pyspark.pandas as ps
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

psdf.to_spark()

psdf.to_pyspark()

psdf.to_pandas()

psdf.to_dataframe()

Question # 40

The following code fragment results in an error:

@F.udf(T.IntegerType())
def simple_udf(t: str) -> str:
    return answer * 3.14159

Which code fragment should be used instead?

@F.udf(T.IntegerType())
def simple_udf(t: int) -> int:
    return t * 3.14159

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

@F.udf(T.DoubleType())
def simple_udf(t: int) -> int:
    return t * 3.14159

@F.udf(T.IntegerType())
def simple_udf(t: float) -> float:
    return t * 3.14159
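The return type can be checked with the plain function alone. In this sketch the Spark decorator is deliberately left off so the code runs without a cluster. It shows that multiplying by 3.14159 yields a Python float even for an integer input, which is why the declared Spark return type and the type hints should both describe a floating-point value.

```python
def simple_udf(t: float) -> float:
    """Plain-Python body of the UDF, with the Spark decorator omitted."""
    return t * 3.14159


result = simple_udf(2)
# The product is a float even though the argument was an int.
print(type(result).__name__, result)
```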
