
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 PDF

$38.5

$109.99

3 Months Free Update

  • Printable Format
  • Value for Money
  • 100% Pass Assurance
  • Verified Answers
  • Researched by Industry Experts
  • Based on Real Exam Scenarios
  • 100% Real Questions

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 PDF + Testing Engine

$61.6

$175.99

3 Months Free Update

  • Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0 Exam
  • Last Update: Oct 16, 2025
  • Questions and Answers: 180
  • Free Real Questions Demo
  • Recommended by Industry Experts
  • Best Economical Package
  • Immediate Access

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Engine

$46.2

$131.99

3 Months Free Update

  • Best Testing Engine
  • One-Click Installation
  • Recommended by Teachers
  • Easy to Use
  • 3 Modes of Learning
  • State-of-the-Art Technology
  • 100% Real Questions Included

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Practice Exam Questions with Answers: Databricks Certified Associate Developer for Apache Spark 3.0 Exam Certification

Question # 6

Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

A.

transactionsDf.groupBy(col(storeId).avg())

B.

transactionsDf.groupBy("storeId").avg(col("value"))

C.

transactionsDf.groupBy("storeId").agg(avg("value"))

D.

transactionsDf.groupBy("storeId").agg(average("value"))

E.

transactionsDf.groupBy("value").average()
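
For reference, a minimal runnable PySpark sketch of the grouped-average pattern this question probes; the SparkSession setup and the tiny sample rows are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 25, 4.0), (2, 25, 2.0), (3, 3, 7.0)],
    ["transactionId", "storeId", "value"],
)

# Group by storeId, then aggregate the mean of "value"; groupBy("storeId").avg("value")
# (passing the column name as a string, without col()) would produce the same result.
transactionsDf.groupBy("storeId").agg(avg("value")).show()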

Question # 7

Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller, or equal to 3?

A.

transactionsDf.filter(productId==3 or productId<1)

B.

transactionsDf.filter((col("productId")==3) or (col("productId")<1))

C.

transactionsDf.filter(col("productId")==3 | col("productId")<1)

D.

transactionsDf.where("productId"=3).or("productId"<1))

E.

transactionsDf.filter((col("productId")==3) | (col("productId")<1))
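
As a point of reference, a small self-contained sketch of filtering on two conditions with PySpark column expressions; the sample data is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 0), (3, 5)], ["transactionId", "productId"]
)

# Column conditions are combined with the bitwise | operator, not Python's "or",
# and each comparison needs its own parentheses because | binds more tightly than == or <.
transactionsDf.filter((col("productId") == 3) | (col("productId") < 1)).show()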

Question # 8

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.

1.from pyspark import StorageLevel

2.transactionsDf.__1__(StorageLevel.__2__).__3__

A.

1. cache

2. MEMORY_ONLY_2

3. count()

B.

1. persist

2. DISK_ONLY_2

3. count()

C.

1. persist

2. MEMORY_ONLY_2

3. select()

D.

1. cache

2. DISK_ONLY_2

3. count()

E.

1. persist

2. MEMORY_ONLY_2

3. count()
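
For context, a minimal sketch of explicit caching with a replicated, memory-only storage level; the toy DataFrame is an assumption used only for illustration:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.range(100).toDF("transactionId")

# persist() accepts an explicit StorageLevel (cache() does not); MEMORY_ONLY_2 keeps
# two in-memory replicas and never spills to disk.
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2)

# persist() is lazy, so an action such as count() is needed to actually materialize the cache.
transactionsDf.count()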

Question # 9

Which of the following statements about garbage collection in Spark is incorrect?

A.

Garbage collection information can be accessed in the Spark UI's stage detail view.

B.

Optimizing garbage collection performance in Spark may limit caching ability.

C.

Manually persisting RDDs in Spark prevents them from being garbage collected.

D.

In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.

E.

Serialized caching is a strategy to increase the performance of garbage collection.

Question # 10

Which of the following describes characteristics of the Spark UI?

A.

Via the Spark UI, workloads can be manually distributed across executors.

B.

Via the Spark UI, stage execution speed can be modified.

C.

The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.

D.

There is a place in the Spark UI that shows the property spark.executor.memory.

E.

Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.

Question # 11

The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.

Code block:

transactionsDf.withColumn("storeNumber", "storeId")

A.

Instead of withColumn, the withColumnRenamed method should be used.

B.

Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.

C.

Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.

D.

The withColumn operator should be replaced with the copyDataFrame operator.

E.

Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.

Question # 12

Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?

A.

transactionsDf.repartition(24, boost=True)

B.

transactionsDf.repartition()

C.

transactionsDf.repartition("itemId", 24)

D.

transactionsDf.coalesce(24)

E.

transactionsDf.repartition(24)

Question # 13

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
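
For reference, a small runnable sketch of capping a filtered result at five rows and pulling it to the driver as a Python list; the sample rows are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 25), (2, 25), (3, 3), (4, 25)], ["transactionId", "storeId"]
)

# take(5) returns at most 5 Row objects as a Python list, whereas collect()
# would return every matching row regardless of how many there are.
rows = transactionsDf.filter(col("storeId") == 25).take(5)
print(rows)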

Question # 14

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

A.

tranactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')

B.

transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})

C.

transactionsDf.select('value', 'productId').distinct()

D.

transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()

E.

transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
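
For context, a self-contained sketch of stacking two single-column projections and de-duplicating them; the sample values are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(10, 3), (10, 4), (7, 3)], ["value", "productId"]
)

# union() appends the rows of one single-column DataFrame to the other,
# and distinct() then removes duplicates, leaving one column of unique values.
unique_values = (
    transactionsDf.select("value")
    .union(transactionsDf.select("productId"))
    .distinct()
)
unique_values.show()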

Question # 15

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

A.

transactionsDf.drop(["predError", "value"])

B.

transactionsDf.drop("predError", "value")

C.

transactionsDf.drop(col("predError"), col("value"))

D.

transactionsDf.drop(predError, value)

E.

transactionsDf.drop("predError & value")
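
For reference, a minimal sketch of dropping two columns by name; the sample schema is assumed for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3, 4.0, 25)], ["transactionId", "predError", "value", "storeId"]
)

# drop() takes the column names to remove as separate string arguments
# and returns a new DataFrame containing the remaining columns.
slimmed = transactionsDf.drop("predError", "value")
print(slimmed.columns)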

Question # 16

The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.

Code block:

transactionsDf.where("col(predError) >= 5")

A.

The argument to the where method should be "predError >= 5".

B.

Instead of where(), filter() should be used.

C.

The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").

D.

The argument to the where method cannot be a string.

E.

Instead of >=, the SQL operator GEQ should be used.

Question # 17

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

A.

1. join

2. transactionsDf.productId==itemsDf.itemId, how="inner"

3. select

4. "transactionId", "supplier"

B.

1. select

2. "transactionId", "supplier"

3. join

4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]

C.

1. join

2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]

3. select

4. "transactionId", "supplier"

D.

1. filter

2. "transactionId", "supplier"

3. join

4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"

E.

1. join

2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId

3. filter

4. "transactionId", "supplier"
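
For context, a small runnable sketch of joining on a list of column expressions (which Spark combines with a logical AND) and then selecting two columns; the toy DataFrames are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3, 25), (2, 2, 2)], ["transactionId", "productId", "storeId"]
)
itemsDf = spark.createDataFrame(
    [(3, "Sports Company Inc."), (2, "YetiX")], ["itemId", "supplier"]
)

# Passing a list of Column expressions as the join condition AND-s them together;
# the default join type is inner. select() then trims the result to two columns.
joined = transactionsDf.join(
    itemsDf,
    [transactionsDf.productId == itemsDf.itemId,
     transactionsDf.storeId != itemsDf.itemId],
)
joined.select("transactionId", "supplier").show()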

Question # 18

The code block shown below should write DataFrame transactionsDf to disk at path csvPath as a single CSV file, using tabs (\t characters) as separators between columns, expressing missing values as string n/a, and omitting a header row with column names. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.write.__2__(__3__, "\t").__4__.__5__(csvPath)

A.

1. coalesce(1)

2. option

3. "sep"

4. option("header", True)

5. path

B.

1. coalesce(1)

2. option

3. "colsep"

4. option("nullValue", "n/a")

5. path

C.

1. repartition(1)

2. option

3. "sep"

4. option("nullValue", "n/a")

5. csv

(Correct)

D.

1. csv

2. option

3. "sep"

4. option("emptyValue", "n/a")

5. path

E.

1. repartition(1)

2. mode

3. "sep"

4. mode("nullValue", "n/a")

5. csv
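
For reference, a minimal sketch of the single-file, tab-separated CSV write described above; csvPath and the sample rows are placeholders used only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, "a"), (2, None)], "transactionId INT, value STRING"
)
csvPath = "/tmp/transactions_csv"  # hypothetical output directory

# repartition(1) collapses the data into one partition so a single CSV file is written;
# "sep" sets the column separator, "nullValue" the string used for missing values,
# and header defaults to false, so no header row is written.
(transactionsDf
    .repartition(1)
    .write
    .option("sep", "\t")
    .option("nullValue", "n/a")
    .csv(csvPath))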

Question # 19

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 5| null| null| null| 2|null|

5.| 6| 3| 2| 25| 2|null|

6.+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

1.+-------+-------------+---------+-----+

2.|storeId|transactionId|productId|value|

3.+-------+-------------+---------+-----+

4.| 25| 1| 1| 4|

5.| 2| 2| 2| 7|

6.| 3| 4| 2| null|

7.| null| 5| 2| null|

8.+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

A.

The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.

B.

Instead of union, the concat method should be used, making sure to not use its default arguments.

C.

Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.

D.

Instead of the Spark context, transactionsDfMonday should be called with the union method.

E.

Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.

Question # 20

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

A.

Instead of avg("value"), avg(col("value")) should be used.

B.

The avg("value") should be specified as a second argument to agg() instead of being appended to it.

C.

All column names should be wrapped in col() operators.

D.

agg should be replaced by groupBy.

E.

"storeId" and "value" should be swapped.

Question # 21

Which of the following code blocks generally causes a great amount of network traffic?

A.

DataFrame.select()

B.

DataFrame.coalesce()

C.

DataFrame.collect()

D.

DataFrame.rdd.map()

E.

DataFrame.count()

Question # 22

The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

2.|itemId|itemName |attributes |supplier |

3.+------+----------------------------------+-----------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|[blue, winter, cozy] |Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |[red, summer, fresh, cooling]|YetiX |

6.|3 |Outdoors Backpack |[green, summer, travel] |Sports Company Inc.|

7.+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))

A.

1. length

2. regexp_extract

3. upper

4. col("itemName")

5. as

B.

1. size

2. regexp_replace

3. lower

4. "itemName"

5. alias

C.

1. lower

2. regexp_replace

3. length

4. "itemName"

5. alias

D.

1. length

2. regexp_replace

3. lower

4. col("itemName")

5. alias

E.

1. size

2. regexp_extract

3. lower

4. col("itemName")

5. alias
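
For context, a self-contained sketch of the lower-case/strip-vowels/count approach to consonant counting; the single sample row is an assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length, lower, regexp_replace

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, "Thick Coat for Walking in the Snow")], ["itemId", "itemName"]
)

# Lower-case the name, remove vowels and whitespace with a regular expression,
# and count the remaining characters; alias() names the resulting column.
itemsDf.select(
    length(regexp_replace(lower(col("itemName")), r"a|e|i|o|u|\s", "")).alias("consonant_ct")
).show()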

Question # 23

The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-------------------+

2.|itemId|itemName |supplier |

3.+------+----------------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |YetiX |

6.|3 |Outdoors Backpack |Sports Company Inc.|

7.+------+----------------------------------+-------------------+

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

A.

All column names need to be wrapped in the col() operator.

B.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.

C.

Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.

D.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.

E.

The expressions "itemNameElements" and split("itemName") need to be swapped.

Question # 24

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

A.

1.counter = 0

2.

3.for index, row in itemsDf.iterrows():

4. if 'Inc.' in row['supplier']:

5. counter = counter + 1

6.

7.print(counter)

B.

1.counter = 0

2.

3.def count(x):

4. if 'Inc.' in x['supplier']:

5. counter = counter + 1

6.

7.itemsDf.foreach(count)

8.print(counter)

C.

print(itemsDf.foreach(lambda x: 'Inc.' in x))

D.

print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

E.

1.accum=sc.accumulator(0)

2.

3.def check_if_inc_in_supplier(row):

4. if 'Inc.' in row['supplier']:

5. accum.add(1)

6.

7.itemsDf.foreach(check_if_inc_in_supplier)

8.print(accum.value)
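
For reference, a runnable sketch of the accumulator-plus-foreach counting pattern shown in the last option; the two sample suppliers are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
itemsDf = spark.createDataFrame(
    [("Sports Company Inc.",), ("YetiX",)], ["supplier"]
)

# An accumulator can be written to from executor-side code and read back on the
# driver, which is why it works inside foreach() where a plain Python counter would not.
accum = sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if "Inc." in row["supplier"]:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)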

Question # 25

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question # 26

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

A.

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"

B.

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"

C.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"

D.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"

E.

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"
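
For context, a minimal sketch of formatting a unix-timestamp column into a readable string with from_unixtime; the sample timestamp value is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 1587852000)], ["transactionId", "transactionDate"]
)

# from_unixtime takes the unix-timestamp column first and the format string second;
# withColumn attaches the formatted string under the new column name.
transactionsDf.withColumn(
    "transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)")
).show(truncate=False)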

Question # 27

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

A.

Since itemId is the index, it does not need to be an argument to the select() method.

B.

The alias() method needs to be called after the select() method.

C.

The explode() method expects a Column object rather than a string.

D.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

E.

The split() method should be used inside the select() method instead of the explode() method.

Question # 28

Which of the following code blocks returns a single-column DataFrame of all entries in Python list throughputRates, which contains only float-type values?

A.

spark.createDataFrame((throughputRates), FloatType)

B.

spark.createDataFrame(throughputRates, FloatType)

C.

spark.DataFrame(throughputRates, FloatType)

D.

spark.createDataFrame(throughputRates)

E.

spark.createDataFrame(throughputRates, FloatType())

Question # 29

Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first in the alphabet?

1.+------+-----------------------------+-------------------+

2.|itemId|attributes |supplier |

3.+------+-----------------------------+-------------------+

4.|1 |[blue, winter, cozy] |Sports Company Inc.|

5.|2 |[red, summer, fresh, cooling]|YetiX |

6.|3 |[green, summer, travel] |Sports Company Inc.|

7.+------+-----------------------------+-------------------+

A.

itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))

B.

itemsDf.withColumn('attributes', sort_array(desc('attributes')))

C.

itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))

D.

itemsDf.withColumn("attributes", sort_array("attributes", asc=False))

E.

itemsDf.select(sort_array("attributes"))

Question # 30

The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

2.|itemId|itemName |attributes |supplier |

3.+------+----------------------------------+-----------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|[blue, winter, cozy] |Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |[red, summer, fresh, cooling]|YetiX |

6.|3 |Outdoors Backpack |[green, summer, travel] |Sports Company Inc.|

7.+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.__1__(__2__).select(__3__, __4__)

A.

1. filter

2. col("supplier").isin("Sports")

3. "itemName"

4. explode(col("attributes"))

B.

1. where

2. col("supplier").contains("Sports")

3. "itemName"

4. "attributes"

C.

1. where

2. col(supplier).contains("Sports")

3. explode(attributes)

4. itemName

D.

1. where

2. "Sports".isin(col("Supplier"))

3. "itemName"

4. array_explode("attributes")

E.

1. filter

2. col("supplier").contains("Sports")

3. "itemName"

4. explode("attributes")
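
For reference, a self-contained sketch of filtering on a substring of supplier and exploding the attributes array next to itemName; the two sample rows mirror the table above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, "Thick Coat for Walking in the Snow", ["blue", "winter", "cozy"], "Sports Company Inc."),
     (2, "Elegant Outdoors Summer Dress", ["red", "summer", "fresh", "cooling"], "YetiX")],
    ["itemId", "itemName", "attributes", "supplier"],
)

# contains() keeps suppliers whose name includes "Sports"; explode() turns every
# element of the attributes array into its own row next to the associated itemName.
itemsDf.where(col("supplier").contains("Sports")).select(
    "itemName", explode("attributes")
).show(truncate=False)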

Question # 31

Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.drop(col("value"), col("predError"))

B.

transactionsDf.drop("predError", "value")

C.

transactionsDf.drop(value, predError)

D.

transactionsDf.drop(["predError", "value"])

E.

transactionsDf.drop([col("predError"), col("value")])

Question # 32

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

A.

1. select

2. "storeId"

3. print_schema()

B.

1. limit

2. 1

3. columns

C.

1. select

2. "storeId"

3. printSchema()

D.

1. limit

2. "storeId"

3. printSchema()

E.

1. select

2. storeId

3. dtypes

Question # 33

Which of the following describes Spark actions?

A.

Writing data to disk is the primary purpose of actions.

B.

Actions are Spark's way of exchanging data between executors.

C.

The driver receives data upon request by actions.

D.

Stage boundaries are commonly established by actions.

E.

Actions are Spark's way of modifying RDDs.

Question # 34

Which of the following statements about reducing out-of-memory errors is incorrect?

A.

Concatenating multiple string columns into a single column may guard against out-of-memory errors.

B.

Reducing partition size can help against out-of-memory errors.

C.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.

D.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.

E.

Decreasing the number of cores available to each executor can help against out-of-memory errors.

Question # 35

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.

Question # 36

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

A.

DataFrame.repartition(12)

B.

DataFrame.coalesce(6).shuffle()

C.

DataFrame.coalesce(6)

D.

DataFrame.coalesce(6, shuffle=True)

E.

DataFrame.repartition(6)

Question # 37

Which of the following describes a narrow transformation?

A.

A narrow transformation is an operation in which data is exchanged across partitions.

B.

A narrow transformation is a process in which data from multiple RDDs is used.

C.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.

D.

A narrow transformation is an operation in which data is exchanged across the cluster.

E.

A narrow transformation is an operation in which no data is exchanged across the cluster.

Question # 38

Which of the following describes a difference between Spark's cluster and client execution modes?

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Question # 39

In which order should the code blocks shown below be run in order to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrame transactionsDf and itemsDf on columns productId and itemId, respectively?

1. .filter(~isnull(col('value')))

2. .count()

3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))

4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')

5. .filter(col('value').isnotnull())

6. .sum(col('value'))

A.

4, 1, 2

B.

3, 1, 6

C.

3, 1, 2

D.

3, 5, 2

E.

4, 6

Question # 40

Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.

Sample of DataFrame itemsDf:

1.+------+--------------------+--------------------+-------------------+

2.|itemId| itemName| attributes| supplier|

3.+------+--------------------+--------------------+-------------------+

4.| 1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|

5.| 2|Elegant Outdoors ...|[red, summer, fre...| YetiX|

6.| 3| Outdoors Backpack|[green, summer, t...|Sports Company Inc.|

7.+------+--------------------+--------------------+-------------------+

A.

itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()

B.

itemsDf.select(~col('supplier').contains('X')).distinct()

C.

itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()

D.

itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()

E.

itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()

Question # 41

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:

1.def add_2_if_geq_3(x):

2. if x is None:

3. return x

4. elif x >= 3:

5. return x+2

6. return x

7.

8.add_2_if_geq_3_udf = udf(add_2_if_geq_3)

9.

10.transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

A.

The operator used to add the column does not add column predErrorAdded to the DataFrame.

B.

Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

C.

The udf() method does not declare a return type.

D.

UDFs are only available through the SQL API, but not in the Python API as shown in the code block.

E.

The Python function is unable to handle null values, resulting in the code block crashing on execution.

Question # 42

Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?

A.

transactionsDf.distinct("productId")

B.

transactionsDf.dropDuplicates(subset=["productId"])

C.

transactionsDf.drop_duplicates(subset="productId")

D.

transactionsDf.unique("productId")

E.

transactionsDf.dropDuplicates(subset="productId")
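
For context, a small sketch of de-duplicating on a subset of columns; the sample rows are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 3), (3, 5)], ["transactionId", "productId"]
)

# dropDuplicates expects the subset as a list of column names and keeps one row
# for each distinct productId value.
transactionsDf.dropDuplicates(subset=["productId"]).show()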

Question # 43

Which of the following describes Spark's standalone deployment mode?

A.

Standalone mode uses a single JVM to run Spark driver and executor processes.

B.

Standalone mode means that the cluster does not contain the driver.

C.

Standalone mode is how Spark runs on YARN and Mesos clusters.

D.

Standalone mode uses only a single executor per worker per application.

E.

Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.

Question # 44

The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.

Code block:

transactionsDf.format("parquet").option("mode", "append").save(path)

A.

The code block is missing a reference to the DataFrameWriter.

B.

save() is evaluated lazily and needs to be followed by an action.

C.

The mode option should be omitted so that the command uses the default mode.

D.

The code block is missing a bucketBy command that takes care of partitions.

E.

Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.

Question # 45

Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?

A.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

B.

1.itemsDf.withColumnRenamed("attributes", "feature0")

2.itemsDf.withColumnRenamed("supplier", "feature1")

C.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D.

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Question # 46

Which of the following statements about broadcast variables is correct?

A.

Broadcast variables are serialized with every single task.

B.

Broadcast variables are commonly used for tables that do not fit into memory.

C.

Broadcast variables are immutable.

D.

Broadcast variables are occasionally dynamically updated on a per-task basis.

E.

Broadcast variables are local to the worker node and not shared across the cluster.

Question # 47

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

A.

The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

B.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

(Correct)

C.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

D.

Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

E.

Operator coalesce needs to be replaced by repartition.

Question # 48

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

A.

itemsDf.persist(StorageLevel.MEMORY_ONLY)

B.

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

C.

itemsDf.store()

D.

itemsDf.cache()

E.

itemsDf.write.option('destination', 'memory').save()

Question # 49

Which of the following is a problem with using accumulators?

A.

Only unnamed accumulators can be inspected in the Spark UI.

B.

Only numeric values can be used in accumulators.

C.

Accumulator values can only be read by the driver, but not by executors.

D.

Accumulators do not obey lazy evaluation.

E.

Accumulators are difficult to use for debugging because they will only be updated once, regardless of whether a task has to be re-run due to hardware failure.

Question # 50

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.| 4| null| null| 3| 2|null|

8.| 5| null| null| null| 2|null|

9.| 6| 3| 2| 25| 2|null|

10.+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.max('value').min('value')

B.

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

C.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

D.

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

E.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
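
For reference, a runnable sketch of the grouped max/min aggregation with aliased result columns; the sample rows follow the table above, and pyspark.sql.functions is imported under the alias F so that max and min do not shadow Python's built-ins:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 4, 1), (2, 7, 2), (6, 2, 2)], ["transactionId", "value", "productId"]
)

# One aggregation per productId; alias() names the aggregate columns highest and lowest.
transactionsDf.groupBy("productId").agg(
    F.max("value").alias("highest"), F.min("value").alias("lowest")
).show()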

Question # 51

The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Example of DataFrame itemsDf:

1.+------+----------------------------------+-------------------+------------------------------------------+

2.|itemId|itemName |supplier |itemNameElements |

3.+------+----------------------------------+-------------------+------------------------------------------+

4.|1 |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|

5.|2 |Elegant Outdoors Summer Dress |YetiX |[Elegant, Outdoors, Summer, Dress] |

6.|3 |Outdoors Backpack |Sports Company Inc.|[Outdoors, Backpack] |

7.+------+----------------------------------+-------------------+------------------------------------------+

Code block:

itemsDf.__1__(__2__(__3__)__4__)

A.

1. select

2. count

3. col("itemNameElements")

4. >3

B.

1. filter

2. count

3. itemNameElements

4. >=3

C.

1. select

2. count

3. "itemNameElements"

4. >3

D.

1. filter

2. size

3. "itemNameElements"

4. >=3

(Correct)

E.

1. select

2. size

3. "itemNameElements"

4. >3
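
For context, a minimal sketch of filtering on the length of an array column with size(); the two sample rows mirror the table above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, ["Thick", "Coat", "for", "Walking", "in", "the", "Snow"]),
     (3, ["Outdoors", "Backpack"])],
    ["itemId", "itemNameElements"],
)

# size() returns the number of elements in the array column, so >= 3 keeps rows
# with at least three items; count() is an aggregate and would not work element-wise here.
itemsDf.filter(size("itemNameElements") >= 3).show(truncate=False)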

Question # 52

Which of the following describes a valid concern about partitioning?

A.

A shuffle operation returns 200 partitions if not explicitly set.

B.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

C.

No data is exchanged between executors when coalesce() is run.

D.

Short partition processing times are indicative of low skew.

E.

The coalesce() method should be used to increase the number of partitions.

Question # 53

Which of the following describes Spark's Adaptive Query Execution?

A.

Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.

B.

Adaptive Query Execution is enabled in Spark by default.

C.

Adaptive Query Execution reoptimizes queries at execution points.

D.

Adaptive Query Execution features are dynamically switching join strategies and dynamically optimizing skew joins.

E.

Adaptive Query Execution applies to all kinds of queries.

Question # 54

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

A.

Slot is another name for executor.

B.

There must be fewer executors than tasks.

C.

An executor runs on a single core.

D.

There must be more slots than tasks.

E.

Tasks run in parallel via slots.
