You are working in an e-commerce organization, where you are designing and evaluating a recommender system. Which of the following metrics will always have the largest value?
Select the correct sequence of steps for developing a machine learning application (a workflow sketch follows the options below):
A) Analyze the input data
B) Prepare the input data
C) Collect data
D) Train the algorithm
E) Test the algorithm
F) Use It
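A minimal sketch of one common ordering of these steps, mapped onto a scikit-learn workflow (the iris dataset and logistic regression are purely illustrative stand-ins):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)                            # C) Collect data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    scaler = StandardScaler().fit(X_tr)                          # B) Prepare the input data
    X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
    print(X_tr_s.mean(axis=0), X_tr_s.std(axis=0))               # A) Analyze the input data
    model = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)  # D) Train the algorithm
    print(model.score(X_te_s, y_te))                             # E) Test the algorithm
    print(model.predict(X_te_s[:1]))                             # F) Use it on new data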
A website is opened 3 times by a user. The probability that he clicks the advertisement 2 times is best calculated by
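Assuming each of the 3 page views is an independent Bernoulli trial with some click probability p (the value p = 0.5 below is purely illustrative), the binomial formula applies; a minimal sketch:

    from math import comb

    n, k, p = 3, 2, 0.5                              # p is an assumed click probability
    prob = comb(n, k) * p**k * (1 - p)**(n - k)      # C(3,2) * p^2 * (1-p)^1
    print(prob)                                      # 0.375 when p = 0.5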
RMSE is a good measure of accuracy, but only for comparing forecasting errors of different models for a ______, as it is scale-dependent.
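A minimal sketch of why RMSE is scale-dependent (the forecast and actual series below are made up):

    import numpy as np

    actual   = np.array([10.0, 12.0, 15.0, 11.0])
    forecast = np.array([11.0, 11.0, 14.0, 13.0])
    rmse = np.sqrt(np.mean((forecast - actual) ** 2))
    print(rmse)
    # Rescaling both series (e.g., changing units) rescales RMSE by the same factor,
    # so RMSE values are only comparable across models fitted to the same variable.
    print(np.sqrt(np.mean((forecast * 1000 - actual * 1000) ** 2)))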
You are working on a Data Science project, and during the project you have been given the responsibility to interview all the stakeholders in the project. In which phase of the project are you?
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
Above is an example of
Google AdWords studies the number of men and women clicking the advertisement on the search engine during a one-hour window at midnight each day.
Google finds that the number of men that click can be modeled as a random variable with distribution Poisson(X), and likewise the number of women that click as Poisson(Y).
What is likely to be the best model of the total number of advertisement clicks during that one-hour window at midnight?
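A quick simulation sketch of the standard result that the sum of independent Poisson counts is again Poisson with the summed rate (the rates 3 and 5 are illustrative, not from the question):

    import numpy as np

    rng = np.random.default_rng(0)
    men   = rng.poisson(3.0, 100_000)      # stand-in for Poisson(X)
    women = rng.poisson(5.0, 100_000)      # stand-in for Poisson(Y)
    total = men + women
    print(total.mean(), total.var())       # both close to 3 + 5 = 8, as a Poisson(X + Y) would give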
A fruit may be considered to be an apple if it is red, round, and about 3" in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.
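A minimal sketch of that independence assumption with made-up priors and per-feature likelihoods (none of these numbers come from the question):

    prior = {"apple": 0.3, "not_apple": 0.7}
    likelihood = {
        "apple":     {"red": 0.8, "round": 0.9, "3in": 0.7},
        "not_apple": {"red": 0.2, "round": 0.5, "3in": 0.3},
    }
    # Naive Bayes multiplies the per-feature likelihoods as if they were independent:
    scores = {c: prior[c] * likelihood[c]["red"] * likelihood[c]["round"] * likelihood[c]["3in"]
              for c in prior}
    print(max(scores, key=scores.get), scores)   # the class with the largest product wins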
Regularization is a very important technique in machine learning to prevent overfitting. Mathematically speaking, it adds a regularization term in order to prevent the coefficients from fitting the training data so perfectly that they overfit. The difference between L1 and L2 regularization is...
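A minimal sketch of the two penalty terms (the coefficients and lambda below are illustrative); L1 penalizes absolute values, L2 penalizes squares, which is why L1 tends to push some coefficients to exactly zero:

    import numpy as np

    w = np.array([0.5, -1.2, 0.0, 3.0])    # illustrative model coefficients
    lam = 0.1                               # illustrative regularization strength
    l1_penalty = lam * np.sum(np.abs(w))    # L1 (lasso): lambda * sum |w_i|
    l2_penalty = lam * np.sum(w ** 2)       # L2 (ridge): lambda * sum w_i^2
    print(l1_penalty, l2_penalty)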
Refer to the exhibit.
You are using K-means clustering to classify customer behavior for a large retailer. You need to determine the optimum number of customer groups. You plot the within-cluster sum of squares (WSS) data as shown in the exhibit. How many customer groups should you specify?
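A hedged sketch of how such a WSS-vs-k plot is typically produced (synthetic blob data stands in for the retailer's customers; the exhibit itself is not reproduced here):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)   # stand-in for customer data
    wss = []
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wss.append(km.inertia_)             # within-cluster sum of squares for k clusters
    print(list(zip(range(1, 10), wss)))     # choose k at the elbow, where the WSS drop levels off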
What is one modeling or descriptive statistical function in MADlib that is typically not provided in a standard relational database?
As a data science consultant at ABC Corp, you are working on a recommendation engine for learning resources for end users. Which recommender system technique benefits most from additional user preference data?
Suppose that the probability that a pedestrian will be hit by a car while crossing the road at a pedestrian crossing without paying attention to the traffic light is to be computed. Let H be a discrete random variable taking one value from {Hit, Not Hit}. Let L be a discrete random variable taking one value from {Red, Yellow, Green}.
Realistically, H will be dependent on L. That is, P(H = Hit) and P(H = Not Hit) will take different values depending on whether L is Red, Yellow, or Green. A person is, for example, far more likely to be hit by a car when trying to cross while the lights for cross traffic are green than when they are red. In other words, for any given possible pair of values for H and L, one must consider the joint probability distribution of H and L to find the probability of that pair of events occurring together if the pedestrian ignores the state of the light.
Here is a table showing the conditional probabilities of being hit, depending on the state of the lights. (Note that the columns in this table must add up to 1 because the probability of being hit or not hit is 1 regardless of the state of the light.)
Which of the following statements is true with regard to the Linear Regression Model?
Refer to the Exhibit.
In the Exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class". Which decision tree is valid for the data?
Which of the following is a correct example of the target variable in regression (supervised learning)?
You have collected hundreds of parameters about thousands of websites, e.g. daily hits, average time on the website, number of unique visitors, number of returning visitors, etc. Now you have to find the most important parameters that best describe a website. Which of the following techniques will you use?
The feature hashing approach is described as follows: "SGD-based classifiers avoid the need to predetermine vector size by simply picking a reasonable size and shoehorning the training data into vectors of that size." What happens with large vectors or with multiple locations per feature in feature hashing?
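A hedged sketch of feature hashing with a fixed vector size (the function name, vector size, and feature names are illustrative): each feature is hashed into one or more positions of a fixed-length vector, so the vector size never has to be predetermined from the vocabulary.

    import hashlib
    import numpy as np

    def hash_features(features, size=16, probes=2):
        # Hash each feature into `probes` positions of a fixed-size vector.
        vec = np.zeros(size)
        for f in features:
            for i in range(probes):          # multiple locations per feature
                h = int(hashlib.md5(f"{f}:{i}".encode()).hexdigest(), 16)
                vec[h % size] += 1.0
        return vec

    print(hash_features(["daily_hits", "unique_visitors", "returning_visitors"]))
    # Larger vectors leave fewer hash collisions; multiple probes per feature make the
    # classifier more robust to the collisions that remain.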
Which of the following is true with regard to the K-Means clustering algorithm?
What is the best way to ensure that the k-means algorithm will find a good clustering of a collection of vectors?
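One common answer is to restart k-means several times from different random centroids and keep the run with the lowest within-cluster sum of squares; a minimal scikit-learn sketch on synthetic data:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
    # n_init=20 runs k-means 20 times with different random starting centroids and
    # keeps the solution with the smallest within-cluster sum of squares.
    km = KMeans(n_clusters=3, n_init=20, random_state=1).fit(X)
    print(km.inertia_)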
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters. When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters. What should you do?
Consider the following confusion matrix for a data set with 600 out of 11,100 instances positive:
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%.
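One confusion matrix consistent with these rounded figures (the exhibit is not reproduced, so the cell counts below are a reconstruction, not the original data) is TP = 500, FN = 100, FP = 500, TN = 10,000:

    tp, fn, fp, tn = 500, 100, 500, 10_000            # 600 positives, 10,500 negatives
    precision   = tp / (tp + fp)                       # 0.50
    recall      = tp / (tp + fn)                       # ~0.83
    specificity = tn / (tn + fp)                       # ~0.95
    accuracy    = (tp + tn) / (tp + fn + fp + tn)      # ~0.95
    print(precision, recall, specificity, accuracy)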
Select the correct statement
Reducing the data from many features to a small number so that we can properly visualize it in two or three dimensions is done in ______.
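A minimal sketch of projecting many features down to two dimensions with PCA for visualization (the iris data is an illustrative stand-in):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)              # 4 features per sample
    X_2d = PCA(n_components=2).fit_transform(X)    # project onto 2 components for plotting
    print(X_2d.shape)                              # (150, 2)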
Which analytical method is considered unsupervised?
A time series may have a trend component that is quadratic in nature. Which pattern of data will indicate that the trend in the time series data is quadratic in nature?
Suppose you have built a model for a rating system that rates between 1 and 5 stars, and you have calculated that the RMSE value is 1.0. Which of the following is correct?
You are working on a clustering solution for a customer dataset. About 40 variables are available for each customer, and data for about 1,000,000 customers is available. You want to reduce the number of variables for clustering. What would you do?