
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 PDF

$38.5

$109.99

3 Months Free Update

  • Printable Format
  • Value for Money
  • 100% Pass Assurance
  • Verified Answers
  • Researched by Industry Experts
  • Based on Real Exam Scenarios
  • 100% Real Questions

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 PDF + Testing Engine

$61.6

$175.99

3 Months Free Update

  • Exam Name: Databricks Certified Associate Developer for Apache Spark 3.0 Exam
  • Last Update: Oct 16, 2025
  • Questions and Answers: 180
  • Free Real Questions Demo
  • Recommended by Industry Experts
  • Best Economical Package
  • Immediate Access

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Engine

$46.2

$131.99

3 Months Free Update

  • Best Testing Engine
  • One-Click Installation
  • Recommended by Teachers
  • Easy to Use
  • 3 Modes of Learning
  • State-of-the-Art Technology
  • 100% Real Questions Included

Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 Practice Exam Questions with Answers: Databricks Certified Associate Developer for Apache Spark 3.0 Exam Certification

Question # 6

Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

A.

transactionsDf.groupBy(col(storeId).avg())

B.

transactionsDf.groupBy("storeId").avg(col("value"))

C.

transactionsDf.groupBy("storeId").agg(avg("value"))

D.

transactionsDf.groupBy("storeId").agg(average("value"))

E.

transactionsDf.groupBy("value").average()
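
For reference, a minimal runnable PySpark sketch of the grouped-average pattern this question probes; the SparkSession setup and the tiny sample rows are illustrative assumptions, not part of the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 25, 4.0), (2, 25, 2.0), (3, 3, 7.0)],
    ["transactionId", "storeId", "value"],
)

# Group by storeId, then aggregate the mean of "value"; groupBy("storeId").avg("value")
# (passing the column name as a string, without col()) would produce the same result.
transactionsDf.groupBy("storeId").agg(avg("value")).show()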

Question # 7

Which of the following code blocks selects all rows from DataFrame transactionsDf in which column productId is zero or smaller, or equal to 3?

A.

transactionsDf.filter(productId==3 or productId<1)

B.

transactionsDf.filter((col("productId")==3) or (col("productId")<1))

C.

transactionsDf.filter(col("productId")==3 | col("productId")<1)

D.

transactionsDf.where("productId"=3).or("productId"<1))

E.

transactionsDf.filter((col("productId")==3) | (col("productId")<1))
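
As a point of reference, a small self-contained sketch of filtering on two conditions with PySpark column expressions; the sample data is made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 0), (3, 5)], ["transactionId", "productId"]
)

# Column conditions are combined with the bitwise | operator, not Python's "or",
# and each comparison needs its own parentheses because | binds more tightly than == or <.
transactionsDf.filter((col("productId") == 3) | (col("productId") < 1)).show()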

Question # 8

The code block shown below should store DataFrame transactionsDf on two different executors, utilizing the executors' memory as much as possible, but not writing anything to disk. Choose the answer that correctly fills the blanks in the code block to accomplish this.

1.from pyspark import StorageLevel

2.transactionsDf.__1__(StorageLevel.__2__).__3__

A.

1. cache

2. MEMORY_ONLY_2

3. count()

B.

1. persist

2. DISK_ONLY_2

3. count()

C.

1. persist

2. MEMORY_ONLY_2

3. select()

D.

1. cache

2. DISK_ONLY_2

3. count()

E.

1. persist

2. MEMORY_ONLY_2

3. count()
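
For context, a minimal sketch of explicit caching with a replicated, memory-only storage level; the toy DataFrame is an assumption used only for illustration:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.range(100).toDF("transactionId")

# persist() accepts an explicit StorageLevel (cache() does not); MEMORY_ONLY_2 keeps
# two in-memory replicas and never spills to disk.
transactionsDf.persist(StorageLevel.MEMORY_ONLY_2)

# persist() is lazy, so an action such as count() is needed to actually materialize the cache.
transactionsDf.count()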

Question # 9

Which of the following statements about garbage collection in Spark is incorrect?

A.

Garbage collection information can be accessed in the Spark UI's stage detail view.

B.

Optimizing garbage collection performance in Spark may limit caching ability.

C.

Manually persisting RDDs in Spark prevents them from being garbage collected.

D.

In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.

E.

Serialized caching is a strategy to increase the performance of garbage collection.

Question # 10

Which of the following describes characteristics of the Spark UI?

A.

Via the Spark UI, workloads can be manually distributed across executors.

B.

Via the Spark UI, stage execution speed can be modified.

C.

The Scheduler tab shows how jobs that are run in parallel by multiple users are distributed across the cluster.

D.

There is a place in the Spark UI that shows the property spark.executor.memory.

E.

Some of the tabs in the Spark UI are named Jobs, Stages, Storage, DAGs, Executors, and SQL.

Question # 11

The code block displayed below contains an error. The code block should return DataFrame transactionsDf, but with the column storeId renamed to storeNumber. Find the error.

Code block:

transactionsDf.withColumn("storeNumber", "storeId")

A.

Instead of withColumn, the withColumnRenamed method should be used.

B.

Arguments "storeNumber" and "storeId" each need to be wrapped in a col() operator.

C.

Argument "storeId" should be the first and argument "storeNumber" should be the second argument to the withColumn method.

D.

The withColumn operator should be replaced with the copyDataFrame operator.

E.

Instead of withColumn, the withColumnRenamed method should be used and argument "storeId" should be the first and argument "storeNumber" should be the second argument to that method.

Question # 12

Which of the following code blocks efficiently converts DataFrame transactionsDf from 12 into 24 partitions?

A.

transactionsDf.repartition(24, boost=True)

B.

transactionsDf.repartition()

C.

transactionsDf.repartition("itemId", 24)

D.

transactionsDf.coalesce(24)

E.

transactionsDf.repartition(24)

Question # 13

The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__(__4__)

A.

1. filter

2. "storeId"==25

3. collect

4. 5

B.

1. filter

2. col("storeId")==25

3. toLocalIterator

4. 5

C.

1. select

2. storeId==25

3. head

4. 5

D.

1. filter

2. col("storeId")==25

3. take

4. 5

E.

1. filter

2. col("storeId")==25

3. collect

4. 5
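
For reference, a small runnable sketch of capping a filtered result at five rows and pulling it to the driver as a Python list; the sample rows are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 25), (2, 25), (3, 3), (4, 25)], ["transactionId", "storeId"]
)

# take(5) returns at most 5 Row objects as a Python list, whereas collect()
# would return every matching row regardless of how many there are.
rows = transactionsDf.filter(col("storeId") == 25).take(5)
print(rows)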

Question # 14

Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?

A.

tranactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')

B.

transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})

C.

transactionsDf.select('value', 'productId').distinct()

D.

transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()

E.

transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
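
For context, a self-contained sketch of stacking two single-column projections and de-duplicating them; the sample values are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(10, 3), (10, 4), (7, 3)], ["value", "productId"]
)

# union() appends the rows of one single-column DataFrame to the other,
# and distinct() then removes duplicates, leaving one column of unique values.
unique_values = (
    transactionsDf.select("value")
    .union(transactionsDf.select("productId"))
    .distinct()
)
unique_values.show()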

Question # 15

Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?

A.

transactionsDf.drop(["predError", "value"])

B.

transactionsDf.drop("predError", "value")

C.

transactionsDf.drop(col("predError"), col("value"))

D.

transactionsDf.drop(predError, value)

E.

transactionsDf.drop("predError & value")
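
For reference, a minimal sketch of dropping two columns by name; the sample schema is assumed for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3, 4.0, 25)], ["transactionId", "predError", "value", "storeId"]
)

# drop() takes the column names to remove as separate string arguments
# and returns a new DataFrame containing the remaining columns.
slimmed = transactionsDf.drop("predError", "value")
print(slimmed.columns)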

Question # 16

The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error.

Code block:

transactionsDf.where("col(predError) >= 5")

A.

The argument to the where method should be "predError >= 5".

B.

Instead of where(), filter() should be used.

C.

The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").

D.

The argument to the where method cannot be a string.

E.

Instead of >=, the SQL operator GEQ should be used.

Question # 17

The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(itemsDf, __2__).__3__(__4__)

A.

1. join

2. transactionsDf.productId==itemsDf.itemId, how="inner"

3. select

4. "transactionId", "supplier"

B.

1. select

2. "transactionId", "supplier"

3. join

4. [transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId]

C.

1. join

2. [transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId]

3. select

4. "transactionId", "supplier"

D.

1. filter

2. "transactionId", "supplier"

3. join

4. "transactionsDf.storeId!=itemsDf.itemId, transactionsDf.productId==itemsDf.itemId"

E.

1. join

2. transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId

3. filter

4. "transactionId", "supplier"
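
For context, a small runnable sketch of joining on a list of column expressions (which Spark combines with a logical AND) and then selecting two columns; the toy DataFrames are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3, 25), (2, 2, 2)], ["transactionId", "productId", "storeId"]
)
itemsDf = spark.createDataFrame(
    [(3, "Sports Company Inc."), (2, "YetiX")], ["itemId", "supplier"]
)

# Passing a list of Column expressions as the join condition AND-s them together;
# the default join type is inner. select() then trims the result to two columns.
joined = transactionsDf.join(
    itemsDf,
    [transactionsDf.productId == itemsDf.itemId,
     transactionsDf.storeId != itemsDf.itemId],
)
joined.select("transactionId", "supplier").show()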

Question # 18

The code block shown below should write DataFrame transactionsDf to disk at path csvPath as a single CSV file, using tabs (\t characters) as separators between columns, expressing missing values as string n/a, and omitting a header row with column names. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.write.__2__(__3__, "\t").__4__.__5__(csvPath)

A.

1. coalesce(1)

2. option

3. "sep"

4. option("header", True)

5. path

B.

1. coalesce(1)

2. option

3. "colsep"

4. option("nullValue", "n/a")

5. path

C.

1. repartition(1)

2. option

3. "sep"

4. option("nullValue", "n/a")

5. csv

(Correct)

D.

1. csv

2. option

3. "sep"

4. option("emptyValue", "n/a")

5. path

E.

1. repartition(1)

2. mode

3. "sep"

4. mode("nullValue", "n/a")

5. csv
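
For reference, a minimal sketch of the single-file, tab-separated CSV write described above; csvPath and the sample rows are placeholders used only for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, "a"), (2, None)], "transactionId INT, value STRING"
)
csvPath = "/tmp/transactions_csv"  # hypothetical output directory

# repartition(1) collapses the data into one partition so a single CSV file is written;
# "sep" sets the column separator, "nullValue" the string used for missing values,
# and header defaults to false, so no header row is written.
(transactionsDf
    .repartition(1)
    .write
    .option("sep", "\t")
    .option("nullValue", "n/a")
    .csv(csvPath))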

Question # 19

The code block displayed below contains an error. The code block should merge the rows of DataFrames transactionsDfMonday and transactionsDfTuesday into a new DataFrame, matching column names and inserting null values where column names do not appear in both DataFrames. Find the error.

Sample of DataFrame transactionsDfMonday:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 5| null| null| null| 2|null|

5.| 6| 3| 2| 25| 2|null|

6.+-------------+---------+-----+-------+---------+----+

Sample of DataFrame transactionsDfTuesday:

1.+-------+-------------+---------+-----+

2.|storeId|transactionId|productId|value|

3.+-------+-------------+---------+-----+

4.| 25| 1| 1| 4|

5.| 2| 2| 2| 7|

6.| 3| 4| 2| null|

7.| null| 5| 2| null|

8.+-------+-------------+---------+-----+

Code block:

sc.union([transactionsDfMonday, transactionsDfTuesday])

A.

The DataFrames' RDDs need to be passed into the sc.union method instead of the DataFrame variable names.

B.

Instead of union, the concat method should be used, making sure to not use its default arguments.

C.

Instead of the Spark context, transactionsDfMonday should be called with the join method instead of the union method, making sure to use its default arguments.

D.

Instead of the Spark context, transactionsDfMonday should be called with the union method.

E.

Instead of the Spark context, transactionsDfMonday should be called with the unionByName method instead of the union method, making sure to not use its default arguments.

Question # 20

The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.

Code block:

transactionsDf.agg("storeId").avg("value")

A.

Instead of avg("value"), avg(col("value")) should be used.

B.

The avg("value") should be specified as a second argument to agg() instead of being appended to it.

C.

All column names should be wrapped in col() operators.

D.

agg should be replaced by groupBy.

E.

"storeId" and "value" should be swapped.

Question # 21

Which of the following code blocks generally causes a great amount of network traffic?

A.

DataFrame.select()

B.

DataFrame.coalesce()

C.

DataFrame.collect()

D.

DataFrame.rdd.map()

E.

DataFrame.count()

Question # 22

The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.

DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

2.|itemId|itemName |attributes |supplier |

3.+------+----------------------------------+-----------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|[blue, winter, cozy] |Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |[red, summer, fresh, cooling]|YetiX |

6.|3 |Outdoors Backpack |[green, summer, travel] |Sports Company Inc.|

7.+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))

A.

1. length

2. regexp_extract

3. upper

4. col("itemName")

5. as

B.

1. size

2. regexp_replace

3. lower

4. "itemName"

5. alias

C.

1. lower

2. regexp_replace

3. length

4. "itemName"

5. alias

D.

1. length

2. regexp_replace

3. lower

4. col("itemName")

5. alias

E.

1. size

2. regexp_extract

3. lower

4. col("itemName")

5. alias
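
For context, a self-contained sketch of the lower-case/strip-vowels/count approach to consonant counting; the single sample row is an assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length, lower, regexp_replace

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, "Thick Coat for Walking in the Snow")], ["itemId", "itemName"]
)

# Lower-case the name, remove vowels and whitespace with a regular expression,
# and count the remaining characters; alias() names the resulting column.
itemsDf.select(
    length(regexp_replace(lower(col("itemName")), r"a|e|i|o|u|\s", "")).alias("consonant_ct")
).show()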

Question # 23

The code block displayed below contains an error. The code block below is intended to add a column itemNameElements to DataFrame itemsDf that includes an array of all words in column itemName. Find the error.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-------------------+

2.|itemId|itemName |supplier |

3.+------+----------------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |YetiX |

6.|3 |Outdoors Backpack |Sports Company Inc.|

7.+------+----------------------------------+-------------------+

Code block:

itemsDf.withColumnRenamed("itemNameElements", split("itemName"))

A.

All column names need to be wrapped in the col() operator.

B.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument "," needs to be passed to the split method.

C.

Operator withColumnRenamed needs to be replaced with operator withColumn and the split method needs to be replaced by the splitString method.

D.

Operator withColumnRenamed needs to be replaced with operator withColumn and a second argument " " needs to be passed to the split method.

E.

The expressions "itemNameElements" and split("itemName") need to be swapped.

Question # 24

Which of the following code blocks prints out in how many rows the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?

A.

1.counter = 0

2.

3.for index, row in itemsDf.iterrows():

4. if 'Inc.' in row['supplier']:

5. counter = counter + 1

6.

7.print(counter)

B.

1.counter = 0

2.

3.def count(x):

4. if 'Inc.' in x['supplier']:

5. counter = counter + 1

6.

7.itemsDf.foreach(count)

8.print(counter)

C.

print(itemsDf.foreach(lambda x: 'Inc.' in x))

D.

print(itemsDf.foreach(lambda x: 'Inc.' in x).sum())

E.

1.accum=sc.accumulator(0)

2.

3.def check_if_inc_in_supplier(row):

4. if 'Inc.' in row['supplier']:

5. accum.add(1)

6.

7.itemsDf.foreach(check_if_inc_in_supplier)

8.print(accum.value)
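
For reference, a runnable sketch of the accumulator-plus-foreach counting pattern shown in the last option; the two sample suppliers are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
itemsDf = spark.createDataFrame(
    [("Sports Company Inc.",), ("YetiX",)], ["supplier"]
)

# An accumulator can be written to from executor-side code and read back on the
# driver, which is why it works inside foreach() where a plain Python counter would not.
accum = sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if "Inc." in row["supplier"]:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)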

Question # 25

The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater or equal to 20 and smaller or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__((__2__.__3__) __4__ (__5__))

A.

1. select

2. col("storeId")

3. between(20, 30)

4. and

5. col("productId")==2

B.

1. where

2. col("storeId")

3. geq(20).leq(30)

4. &

5. col("productId")==2

C.

1. select

2. "storeId"

3. between(20, 30)

4. &&

5. col("productId")==2

D.

1. select

2. col("storeId")

3. between(20, 30)

4. &&

5. col("productId")=2

E.

1. select

2. col("storeId")

3. between(20, 30)

4. &

5. col("productId")==2

Question # 26

The code block shown below should add column transactionDateForm to DataFrame transactionsDf. The column should express the unix-format timestamps in column transactionDate as string type like Apr 26 (Sunday). Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__(__2__, from_unixtime(__3__, __4__))

A.

1. withColumn

2. "transactionDateForm"

3. "MMM d (EEEE)"

4. "transactionDate"

B.

1. select

2. "transactionDate"

3. "transactionDateForm"

4. "MMM d (EEEE)"

C.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MMM d (EEEE)"

D.

1. withColumn

2. "transactionDateForm"

3. "transactionDate"

4. "MM d (EEE)"

E.

1. withColumnRenamed

2. "transactionDate"

3. "transactionDateForm"

4. "MM d (EEE)"
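
For context, a minimal sketch of formatting a unix-timestamp column into a readable string with from_unixtime; the sample timestamp value is arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_unixtime

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 1587852000)], ["transactionId", "transactionDate"]
)

# from_unixtime takes the unix-timestamp column first and the format string second;
# withColumn attaches the formatted string under the new column name.
transactionsDf.withColumn(
    "transactionDateForm", from_unixtime("transactionDate", "MMM d (EEEE)")
).show(truncate=False)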

Question # 27

The code block displayed below contains an error. The code block should create DataFrame itemsAttributesDf which has columns itemId and attribute and lists every attribute from the attributes column in DataFrame itemsDf next to the itemId of the respective row in itemsDf. Find the error.

A sample of DataFrame itemsDf is below.

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

Code block:

itemsAttributesDf = itemsDf.explode("attributes").alias("attribute").select("attribute", "itemId")

A.

Since itemId is the index, it does not need to be an argument to the select() method.

B.

The alias() method needs to be called after the select() method.

C.

The explode() method expects a Column object rather than a string.

D.

explode() is not a method of DataFrame. explode() should be used inside the select() method instead.

E.

The split() method should be used inside the select() method instead of the explode() method.

Question # 28

Which of the following code blocks returns a single-column DataFrame of all entries in Python list throughputRates, which contains only float-type values?

A.

spark.createDataFrame((throughputRates), FloatType)

B.

spark.createDataFrame(throughputRates, FloatType)

C.

spark.DataFrame(throughputRates, FloatType)

D.

spark.createDataFrame(throughputRates)

E.

spark.createDataFrame(throughputRates, FloatType())

Question # 29

Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first in the alphabet?

1.+------+-----------------------------+-------------------+

2.|itemId|attributes |supplier |

3.+------+-----------------------------+-------------------+

4.|1 |[blue, winter, cozy] |Sports Company Inc.|

5.|2 |[red, summer, fresh, cooling]|YetiX |

6.|3 |[green, summer, travel] |Sports Company Inc.|

7.+------+-----------------------------+-------------------+

A.

itemsDf.withColumn('attributes', sort_array(col('attributes').desc()))

B.

itemsDf.withColumn('attributes', sort_array(desc('attributes')))

C.

itemsDf.withColumn('attributes', sort(col('attributes'), asc=False))

D.

itemsDf.withColumn("attributes", sort_array("attributes", asc=False))

E.

itemsDf.select(sort_array("attributes"))

Question # 30

The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

1.+------+----------------------------------+-----------------------------+-------------------+

2.|itemId|itemName |attributes |supplier |

3.+------+----------------------------------+-----------------------------+-------------------+

4.|1 |Thick Coat for Walking in the Snow|[blue, winter, cozy] |Sports Company Inc.|

5.|2 |Elegant Outdoors Summer Dress |[red, summer, fresh, cooling]|YetiX |

6.|3 |Outdoors Backpack |[green, summer, travel] |Sports Company Inc.|

7.+------+----------------------------------+-----------------------------+-------------------+

Code block:

itemsDf.__1__(__2__).select(__3__, __4__)

A.

1. filter

2. col("supplier").isin("Sports")

3. "itemName"

4. explode(col("attributes"))

B.

1. where

2. col("supplier").contains("Sports")

3. "itemName"

4. "attributes"

C.

1. where

2. col(supplier).contains("Sports")

3. explode(attributes)

4. itemName

D.

1. where

2. "Sports".isin(col("Supplier"))

3. "itemName"

4. array_explode("attributes")

E.

1. filter

2. col("supplier").contains("Sports")

3. "itemName"

4. explode("attributes")
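
For reference, a self-contained sketch of filtering on a substring of supplier and exploding the attributes array next to itemName; the two sample rows mirror the table above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, "Thick Coat for Walking in the Snow", ["blue", "winter", "cozy"], "Sports Company Inc."),
     (2, "Elegant Outdoors Summer Dress", ["red", "summer", "fresh", "cooling"], "YetiX")],
    ["itemId", "itemName", "attributes", "supplier"],
)

# contains() keeps suppliers whose name includes "Sports"; explode() turns every
# element of the attributes array into its own row next to the associated itemName.
itemsDf.where(col("supplier").contains("Sports")).select(
    "itemName", explode("attributes")
).show(truncate=False)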

Question # 31

Which of the following code blocks returns a copy of DataFrame transactionsDf that only includes columns transactionId, storeId, productId and f?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.drop(col("value"), col("predError"))

B.

transactionsDf.drop("predError", "value")

C.

transactionsDf.drop(value, predError)

D.

transactionsDf.drop(["predError", "value"])

E.

transactionsDf.drop([col("predError"), col("value")])

Question # 32

The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Code block:

transactionsDf.__1__(__2__).__3__

A.

1. select

2. "storeId"

3. print_schema()

B.

1. limit

2. 1

3. columns

C.

1. select

2. "storeId"

3. printSchema()

D.

1. limit

2. "storeId"

3. printSchema()

E.

1. select

2. storeId

3. dtypes

Question # 33

Which of the following describes Spark actions?

A.

Writing data to disk is the primary purpose of actions.

B.

Actions are Spark's way of exchanging data between executors.

C.

The driver receives data upon request by actions.

D.

Stage boundaries are commonly established by actions.

E.

Actions are Spark's way of modifying RDDs.

Question # 34

Which of the following statements about reducing out-of-memory errors is incorrect?

A.

Concatenating multiple string columns into a single column may guard against out-of-memory errors.

B.

Reducing partition size can help against out-of-memory errors.

C.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.

D.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.

E.

Decreasing the number of cores available to each executor can help against out-of-memory errors.

Question # 35

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.

Code block:

transactionsDf.filter(col('predError').in([3, 6])).count()

A.

The number of rows cannot be determined with the count() operator.

B.

Instead of filter, the select method should be used.

C.

The method used on column predError is incorrect.

D.

Instead of a list, the values need to be passed as single arguments to the in operator.

E.

Numbers 3 and 6 need to be passed as string variables.

Question # 36

Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?

A.

DataFrame.repartition(12)

B.

DataFrame.coalesce(6).shuffle()

C.

DataFrame.coalesce(6)

D.

DataFrame.coalesce(6, shuffle=True)

E.

DataFrame.repartition(6)

Question # 37

Which of the following describes a narrow transformation?

A.

A narrow transformation is an operation in which data is exchanged across partitions.

B.

A narrow transformation is a process in which data from multiple RDDs is used.

C.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.

D.

A narrow transformation is an operation in which data is exchanged across the cluster.

E.

A narrow transformation is an operation in which no data is exchanged across the cluster.

Question # 38

Which of the following describes a difference between Spark's cluster and client execution modes?

A.

In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client mode.

B.

In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.

C.

In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.

D.

In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.

E.

In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.

Question # 39

In which order should the code blocks shown below be run in order to return the number of records that are not empty in column value in the DataFrame resulting from an inner join of DataFrame transactionsDf and itemsDf on columns productId and itemId, respectively?

1. .filter(~isnull(col('value')))

2. .count()

3. transactionsDf.join(itemsDf, col("transactionsDf.productId")==col("itemsDf.itemId"))

4. transactionsDf.join(itemsDf, transactionsDf.productId==itemsDf.itemId, how='inner')

5. .filter(col('value').isnotnull())

6. .sum(col('value'))

A.

4, 1, 2

B.

3, 1, 6

C.

3, 1, 2

D.

3, 5, 2

E.

4, 6

Question # 40

Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.

Sample of DataFrame itemsDf:

1.+------+--------------------+--------------------+-------------------+

2.|itemId| itemName| attributes| supplier|

3.+------+--------------------+--------------------+-------------------+

4.| 1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|

5.| 2|Elegant Outdoors ...|[red, summer, fre...| YetiX|

6.| 3| Outdoors Backpack|[green, summer, t...|Sports Company Inc.|

7.+------+--------------------+--------------------+-------------------+

A.

itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()

B.

itemsDf.select(~col('supplier').contains('X')).distinct()

C.

itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()

D.

itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()

E.

itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()

Question # 41

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:

1.def add_2_if_geq_3(x):

2. if x is None:

3. return x

4. elif x >= 3:

5. return x+2

6. return x

7.

8.add_2_if_geq_3_udf = udf(add_2_if_geq_3)

9.

10.transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

A.

The operator used to add the column does not add column predErrorAdded to the DataFrame.

B.

Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.

C.

The udf() method does not declare a return type.

D.

UDFs are only available through the SQL API, but not in the Python API as shown in the code block.

E.

The Python function is unable to handle null values, resulting in the code block crashing on execution.

Question # 42

Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?

A.

transactionsDf.distinct("productId")

B.

transactionsDf.dropDuplicates(subset=["productId"])

C.

transactionsDf.drop_duplicates(subset="productId")

D.

transactionsDf.unique("productId")

E.

transactionsDf.dropDuplicates(subset="productId")
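
For context, a small sketch of de-duplicating on a subset of columns; the sample rows are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 3), (3, 5)], ["transactionId", "productId"]
)

# dropDuplicates expects the subset as a list of column names and keeps one row
# for each distinct productId value.
transactionsDf.dropDuplicates(subset=["productId"]).show()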

Question # 43

Which of the following describes Spark's standalone deployment mode?

A.

Standalone mode uses a single JVM to run Spark driver and executor processes.

B.

Standalone mode means that the cluster does not contain the driver.

C.

Standalone mode is how Spark runs on YARN and Mesos clusters.

D.

Standalone mode uses only a single executor per worker per application.

E.

Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.

Question # 44

The code block displayed below contains an error. The code block should save DataFrame transactionsDf at path path as a parquet file, appending to any existing parquet file. Find the error.

Code block:

transactionsDf.format("parquet").option("mode", "append").save(path)

A.

The code block is missing a reference to the DataFrameWriter.

B.

save() is evaluated lazily and needs to be followed by an action.

C.

The mode option should be omitted so that the command uses the default mode.

D.

The code block is missing a bucketBy command that takes care of partitions.

E.

Given that the DataFrame should be saved as parquet file, path is being passed to the wrong method.

Question # 45

Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?

A.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)

B.

1.itemsDf.withColumnRenamed("attributes", "feature0")

2.itemsDf.withColumnRenamed("supplier", "feature1")

C.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))

D.

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")

E.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")

Question # 46

Which of the following statements about broadcast variables is correct?

A.

Broadcast variables are serialized with every single task.

B.

Broadcast variables are commonly used for tables that do not fit into memory.

C.

Broadcast variables are immutable.

D.

Broadcast variables are occasionally dynamically updated on a per-task basis.

E.

Broadcast variables are local to the worker node and not shared across the cluster.

Question # 47

The code block displayed below contains an error. When the code block below has executed, it should have divided DataFrame transactionsDf into 14 parts, based on columns storeId and transactionDate (in this order). Find the error.

Code block:

transactionsDf.coalesce(14, ("storeId", "transactionDate"))

A.

The parentheses around the column names need to be removed and .select() needs to be appended to the code block.

B.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .count() needs to be appended to the code block.

(Correct)

C.

Operator coalesce needs to be replaced by repartition, the parentheses around the column names need to be removed, and .select() needs to be appended to the code block.

D.

Operator coalesce needs to be replaced by repartition and the parentheses around the column names need to be replaced by square brackets.

E.

Operator coalesce needs to be replaced by repartition.

Question # 48

Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

A.

itemsDf.persist(StorageLevel.MEMORY_ONLY)

B.

itemsDf.cache(StorageLevel.MEMORY_AND_DISK)

C.

itemsDf.store()

D.

itemsDf.cache()

E.

itemsDf.write.option('destination', 'memory').save()

Question # 49

Which of the following is a problem with using accumulators?

A.

Only unnamed accumulators can be inspected in the Spark UI.

B.

Only numeric values can be used in accumulators.

C.

Accumulator values can only be read by the driver, but not by executors.

D.

Accumulators do not obey lazy evaluation.

E.

Accumulators are difficult to use for debugging because they will only be updated once, regardless of whether a task has to be re-run due to hardware failure.

Question # 50

Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

1.+-------------+---------+-----+-------+---------+----+

2.|transactionId|predError|value|storeId|productId| f|

3.+-------------+---------+-----+-------+---------+----+

4.| 1| 3| 4| 25| 1|null|

5.| 2| 6| 7| 2| 2|null|

6.| 3| 3| null| 25| 3|null|

7.| 4| null| null| 3| 2|null|

8.| 5| null| null| null| 2|null|

9.| 6| 3| 2| 25| 2|null|

10.+-------------+---------+-----+-------+---------+----+

A.

transactionsDf.max('value').min('value')

B.

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))

C.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))

D.

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))

E.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
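
For reference, a runnable sketch of the grouped max/min aggregation with aliased result columns; the sample rows follow the table above, and pyspark.sql.functions is imported under the alias F so that max and min do not shadow Python's built-ins:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame(
    [(1, 4, 1), (2, 7, 2), (6, 2, 2)], ["transactionId", "value", "productId"]
)

# One aggregation per productId; alias() names the aggregate columns highest and lowest.
transactionsDf.groupBy("productId").agg(
    F.max("value").alias("highest"), F.min("value").alias("lowest")
).show()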

Question # 51

The code block shown below should return all rows of DataFrame itemsDf that have at least 3 items in column itemNameElements. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Example of DataFrame itemsDf:

1.+------+----------------------------------+-------------------+------------------------------------------+

2.|itemId|itemName |supplier |itemNameElements |

3.+------+----------------------------------+-------------------+------------------------------------------+

4.|1 |Thick Coat for Walking in the Snow|Sports Company Inc.|[Thick, Coat, for, Walking, in, the, Snow]|

5.|2 |Elegant Outdoors Summer Dress |YetiX |[Elegant, Outdoors, Summer, Dress] |

6.|3 |Outdoors Backpack |Sports Company Inc.|[Outdoors, Backpack] |

7.+------+----------------------------------+-------------------+------------------------------------------+

Code block:

itemsDf.__1__(__2__(__3__)__4__)

A.

1. select

2. count

3. col("itemNameElements")

4. >3

B.

1. filter

2. count

3. itemNameElements

4. >=3

C.

1. select

2. count

3. "itemNameElements"

4. >3

D.

1. filter

2. size

3. "itemNameElements"

4. >=3

(Correct)

E.

1. select

2. size

3. "itemNameElements"

4. >3
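
For context, a minimal sketch of filtering on the length of an array column with size(); the two sample rows mirror the table above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.createDataFrame(
    [(1, ["Thick", "Coat", "for", "Walking", "in", "the", "Snow"]),
     (3, ["Outdoors", "Backpack"])],
    ["itemId", "itemNameElements"],
)

# size() returns the number of elements in the array column, so >= 3 keeps rows
# with at least three items; count() is an aggregate and would not work element-wise here.
itemsDf.filter(size("itemNameElements") >= 3).show(truncate=False)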

Question # 52

Which of the following describes a valid concern about partitioning?

A.

A shuffle operation returns 200 partitions if not explicitly set.

B.

Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.

C.

No data is exchanged between executors when coalesce() is run.

D.

Short partition processing times are indicative of low skew.

E.

The coalesce() method should be used to increase the number of partitions.

Question # 53

Which of the following describes Spark's Adaptive Query Execution?

A.

Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.

B.

Adaptive Query Execution is enabled in Spark by default.

C.

Adaptive Query Execution reoptimizes queries at execution points.

D.

Adaptive Query Execution features are dynamically switching join strategies and dynamically optimizing skew joins.

E.

Adaptive Query Execution applies to all kinds of queries.

Question # 54

Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?

A.

Slot is another name for executor.

B.

There must be fewer executors than tasks.

C.

An executor runs on a single core.

D.

There must be more slots than tasks.

E.

Tasks run in parallel via slots.
