You are working in an e-commerce organization, where you are designing and evaluating a recommender system. Which of the following metrics will always have the largest value?
Select the correct sequence of steps for developing a machine learning application (a workflow sketch follows the options below):
A) Analyze the input data
B) Prepare the input data
C) Collect data
D) Train the algorithm
E) Test the algorithm
F) Use It
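A minimal sketch of one common ordering of these steps, mapped onto a scikit-learn workflow (the iris dataset and logistic regression are purely illustrative stand-ins):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)                            # C) Collect data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)                          # B) Prepare the input data
    X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
    print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))               # A) Analyze the input data
    model = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)  # D) Train the algorithm
    print(model.score(X_te_s, y_te))                             # E) Test the algorithm
    print(model.predict(X_te_s[:1]))                             # F) Use it on new data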
A website is opened 3 times by a user. The probability that he clicks the advertisement 2 times is best calculated by
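Assuming each of the 3 page views is an independent Bernoulli trial with some click probability p (the value p = 0.5 below is purely illustrative), the binomial formula applies; a minimal sketch:

    from math import comb

    n, k, p = 3, 2, 0.5                              # p is an assumed click probability
    prob = comb(n, k) * p**k * (1 - p)**(n - k)      # C(3,2) * p^2 * (1-p)^1
    print(prob)                                      # 0.375 when p = 0.5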
RMSE is a good measure of accuracy, but only for comparing forecasting errors of different models for a ______, as it is scale-dependent.
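A minimal sketch of why RMSE is scale-dependent (the forecast and actual series below are made up):

    import numpy as np

    actual   = np.array([10.0, 12.0, 15.0, 11.0])
    forecast = np.array([11.0, 11.0, 14.0, 13.0])
    rmse = np.sqrt(np.mean((forecast - actual) ** 2))
    print(rmse)
    # Rescaling both series (e.g., changing units) rescales RMSE by the same factor,
    # so RMSE values are only comparable across models fitted to the same variable.
    print(np.sqrt(np.mean((forecast * 1000 - actual * 1000) ** 2)))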
You are working on a Data Science project, and during the project you have been given the responsibility to interview all the stakeholders in the project. In which phase of the project are you?
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
Above is an example of
Google AdWords studies the number of men and women clicking the advertisement on the search engine during a one-hour window at midnight each day.
Google finds that the number of men that click can be modeled as a random variable with distribution Poisson(X), and likewise the number of women that click as Poisson(Y).
What is likely to be the best model of the total number of advertisement clicks during that one-hour window at midnight?
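A quick simulation sketch of the standard result that the sum of independent Poisson counts is again Poisson with the summed rate (the rates 3 and 5 are illustrative, not from the question):

    import numpy as np

    rng = np.random.default_rng(0)
    men   = rng.poisson(3.0, 100_000)      # stand-in for Poisson(X)
    women = rng.poisson(5.0, 100_000)      # stand-in for Poisson(Y)
    total = men + women
    print(total.mean(), total.var())       # both close to 3 + 5 = 8, as a Poisson(X + Y) would give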
A fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
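A minimal sketch of that independence assumption with made-up priors and per-feature likelihoods (none of these numbers come from the question):

    prior = {"apple": 0.3, "not_apple": 0.7}
    likelihood = {
        "apple":     {"red": 0.8, "round": 0.9, "3in": 0.7},
        "not_apple": {"red": 0.2, "round": 0.5, "3in": 0.3},
    }
    # Naive Bayes multiplies the per-feature likelihoods as if they were independent:
    scores = {c: prior[c] * likelihood[c]["red"] * likelihood[c]["round"] * likelihood[c]["3in"]
              for c in prior}
    print(max(scores, key=scores.get), scores)   # the class with the largest product wins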
Regularization is a very important technique in machine learning to prevent overfitting. Mathematically speaking, it adds a regularization term in order to prevent the coefficients from fitting the training data so perfectly that they overfit. The difference between L1 and L2 regularization is...
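A minimal sketch of the two penalty terms (the coefficients and lambda below are illustrative); L1 penalizes absolute values, L2 penalizes squares, which is why L1 tends to push some coefficients to exactly zero:

    import numpy as np

    w = np.array([0.5, -1.2, 0.0, 3.0])    # illustrative model coefficients
    lam = 0.1                               # illustrative regularization strength
    l1_penalty = lam * np.sum(np.abs(w))    # L1 (lasso): lambda * sum |w_i|
    l2_penalty = lam * np.sum(w ** 2)       # L2 (ridge): lambda * sum w_i^2
    print(l1_penalty, l2_penalty)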
Refer to the exhibit.
You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-cluster sum of squares (WSS) data as shown in the exhibit. How many customer groups should you specify?
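A hedged sketch of how such a WSS-vs-k plot is typically produced (synthetic blob data stands in for the retailer's customers; the exhibit itself is not reproduced here):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # stand-in for customer data
    wss = []
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wss.append(km.inertia_)             # within-cluster sum of squares for k clusters
    print(list(zip(range(1, 10), wss)))     # choose k at the elbow, where the WSS drop levels off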
What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?
As a data science consultant at ABC Corp, you are working on a recommendation engine for learning resources for end users. Which recommender system technique benefits most from additional user preference data?
Suppose that the probability that a pedestrian will be hit by a car while crossing the road at a pedestrian crossing without paying attention to the traffic light is to be computed. Let H be a discrete random variable taking one value from {Hit, Not Hit}. Let L be a discrete random variable taking one value from {Red, Yellow, Green}.
Realistically, H will be dependent on L. That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is Red, Yellow, or Green. A person is, for example, far more likely to be hit by a car when trying to cross while the lights for cross traffic are green than when they are red. In other words, for any given possible pair of values for H and L, one must consider the joint probability distribution of H and L to find the probability of that pair of events occurring together if the pedestrian ignores the state of the light.
Here is a table showing the conditional probabilities of being hit, depending on the state of the lights. (Note that the columns in this table must add up to 1 because the probability of being hit or not hit is 1 regardless of the state of the light.)
Which of the following statements is true with regard to the Linear Regression Model?
Refer to the Exhibit.
In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?
Which of the following is a correct example of the target variable in regression (supervised learning)?
You have collected hundreds of parameters about thousands of websites, e.g. daily hits, average time on the website, number of unique visitors, number of returning visitors, etc. Now you have to find the most important parameters that best describe a website. Which of the following techniques will you use?
The feature hashing approach is described as follows: "SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size." What happens with large vectors or with multiple locations per feature in feature hashing?
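A hedged sketch of feature hashing with a fixed vector size (the function name, vector size, and feature names are illustrative): each feature is hashed into one or more positions of a fixed-length vector, so the vector size never has to be predetermined from the vocabulary.

    import hashlib
    import numpy as np

    def hash_features(features, size=16, probes=2):
        # Hash each feature into `probes` positions of a fixed-size vector.
        vec = np.zeros(size)
        for f in features:
            for i in range(probes):          # multiple locations per feature
                h = int(hashlib.md5(f"{f}:{i}".encode()).hexdigest(), 16)
                vec[h % size] += 1.0
        return vec

    print(hash_features(["daily_hits", "unique_visitors", "returning_visitors"]))
    # Larger vectors leave fewer hash collisions; multiple probes per feature make the
    # classifier more robust to the collisions that remain.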
Which of the following is true with regard to the K-Means clustering algorithm?
What is the best way to ensure that the k-means algorithm will find a good clustering of a collection of vectors?
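One common answer is to restart k-means several times from different random centroids and keep the run with the lowest within-cluster sum of squares; a minimal scikit-learn sketch on synthetic data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
    # n_init=20 runs k-means 20 times with different random starting centroids and
    # keeps the solution with the smallest within-cluster sum of squares.
    km = KMeans(n_clusters=3, n_init=20, random_state=1).fit(X)
    print(km.inertia_)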
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?
Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.
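One confusion matrix consistent with these rounded figures (the exhibit is not reproduced, so the cell counts below are a reconstruction, not the original data) is TP = 500, FN = 100, FP = 500, TN = 10,000:

    tp, fn, fp, tn = 500, 100, 500, 10_000            # 600 positives, 10,500 negatives
    precision   = tp / (tp + fp)                       # 0.50
    recall      = tp / (tp + fn)                       # ~0.83
    specificity = tn / (tn + fp)                       # ~0.95
    accuracy    = (tp + tn) / (tp + fn + fp + tn)      # ~0.95
    print(precision, recall, specificity, accuracy)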
Select the correct statement
Reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions is done in ______.
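A minimal sketch of projecting many features down to two dimensions with PCA for visualization (the iris data is an illustrative stand-in):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)              # 4 features per sample
    X_2d = PCA(n_components=2).fit_transform(X)    # project onto 2 components for plotting
    print(X_2d.shape)                              # (150, 2)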
Which analytical method is considered unsupervised?
A time series may have a trend component that is quadratic in nature. Which pattern of data will indicate that the trend in the time series data is quadratic in nature?
Suppose you have built a model for a rating system that rates between 1 and 5 stars, and you have calculated that the RMSE value is 1.0. Which of the following is correct?
You are working on a clustering solution for a customer dataset. About 40 variables are available for each customer, and data for about 1,000,000 customers is available. You want to reduce the number of variables for clustering. What would you do?