# DS-200 - Data Science Essentials Beta

## Example Questions

When optimizing a function using stochastic gradient descent, how frequently should you update your estimate of the gradient?
Once after every pass through the data set
For each observation with a probability that you choose ahead of time
You have a large file of N records (one per line) and want to randomly sample 10% of them. You have two functions that are perfect random number generators (though they are a bit slow): random_uniform() generates a uniformly distributed number in the interval [0, 1], and random_permutation(M) generates a random permutation of the numbers 0 through M - 1. Below are three different methods that implement the sampling.

```
# Method A
for line in file:
    if random_uniform() < 0.1:
        print line

# Method B
i = 0
for line in file:
    if i % 10 == 0:
        print line
    i += 1

# Method C
idxs = random_permutation(N)[:(N/10)]
i = 0
for line in file:
    if i in idxs:
        print line
    i += 1
```

Which method will have the best runtime performance?
Method B (it makes no calls to the slow random number generators at all)
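A runnable sketch of the three methods, with Python's random module standing in for the question's hypothetical random_uniform() and random_permutation() generators (random.sample plays the role of taking the first N/10 entries of a permutation):

```python
import random

def method_a(lines, p=0.1):
    # One uniform draw per record; sample size is binomially distributed.
    return [ln for ln in lines if random.random() < p]

def method_b(lines):
    # Every 10th record; constant memory and no random draws at all.
    return [ln for i, ln in enumerate(lines) if i % 10 == 0]

def method_c(lines):
    # Exactly N/10 pre-chosen indices; they must all be held in memory.
    idxs = set(random.sample(range(len(lines)), len(lines) // 10))
    return [ln for i, ln in enumerate(lines) if i in idxs]

lines = ["record %d" % i for i in range(1000)]
print(len(method_a(lines)), len(method_b(lines)), len(method_c(lines)))
```

Methods B and C always return exactly 100 of the 1,000 records, while Method A's count merely varies around 100 — the trade-offs behind the several variants of this question in this set.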
A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. The average number of bugs fixed per engineer is five, but none of the engineers fixed exactly five bugs last week. You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the distribution of bug fixes per engineer?
A bar chart of engineers vs. number of bugs fixed
What is the best way to determine the learning rate parameters for stochastic gradient descent when the distribution of the input data shifts over time?
Use a constant learning rate, so that the estimate continues to adapt as the input distribution shifts
You have a directory containing a number of comma-separated files. Each file has three columns and each filename has a .csv extension. You want to produce a single tab-separated file (all.tsv) that contains all the rows from all the files. Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?
find . -name '*.csv' | xargs cat | awk 'BEGIN {FS=","; OFS="\t"} {print $1, $2, $3}' > all.tsv
What are two defining features of RMSE (root-mean square error or root-mean-square deviation)?
It is the square root of the mean of the squared differences between predicted and observed values
It is appropriate for numeric data
In what format are web server log files usually generated and how must you transform them in order to make them usable for analysis in Hadoop?
Text files that require parsing into useful fields
You have a large m x n data matrix M. You want to perform dimension reduction/clustering on your data and have decided to use singular value decomposition (SVD; also called principal components analysis, PCA). You performed the SVD on your data matrix, but you did not center your data first. What does your first singular component describe?
The mean of the data set
In what way can Hadoop be used to improve the performance of Lloyd's algorithm for k-means clustering on large data sets?
Distributing the updates of the cluster centroids
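One Lloyd's iteration splits naturally into a map step (assign each point to its nearest centroid, parallelizable per point) and a reduce step (recompute each centroid as the mean of its assigned points). A minimal in-memory sketch of that structure, with made-up sample points:

```python
from collections import defaultdict

def lloyd_iteration(points, centroids):
    # Map step: assign each point to its nearest centroid (independent
    # per point, so it can run across many mappers in parallel).
    assignments = defaultdict(list)
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        assignments[nearest].append(p)
    # Reduce step: recompute each centroid as the mean of its points.
    return [
        tuple(sum(coords) / len(pts) for coords in zip(*pts))
        for pts in assignments.values()
    ]

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(sorted(lloyd_iteration(points, [(0.0, 0.5), (10.0, 10.5)])))
# [(0.0, 0.5), (10.0, 10.5)]
```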
You are building a k-nearest neighbor classifier (k-NN) on a labeled set of points in a high-dimensional space. You determine that the classifier has a large error on the training data. What is the most likely problem?
k-NN computation does not converge in high dimensions
Refer to the sampling scenario and Methods A, B, and C described above. Which method might introduce unexpected correlations?
Method B (any periodic structure in the file lines up with the every-10th-record rule)
Why is the naive Bayes classifier "naive"?
It assumes independence between all features
What is one limitation encountered by all systems that employ collaborative filtering and use preferences as input in order to output product recommendations to consumers?
Consumers do not have stable ratings for the same product over time
A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. The average number of bugs fixed per engineer is five, but none of the engineers fixed exactly five bugs last week. One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate how hard a particular bug is to fix?
The number of bugs that had been found in each sub-component of the project
Many machine learning algorithms involve finding the global minimum of a convex loss function, primarily because:
The derivative of a convex function is always defined
You have a data file that contains two trillion records, one record per line (comma separated). Each record lists two friends and a unique message sent between them. The names will not have commas.

```
Michael, John, Pabst Blue Ribbon
Tiffany, James, BMX Racing
John, Michael, Natural Lemon Flavor
```

Analyze the pseudocode below and determine which set of mappers and reducers will solve for the mean number of messages each user sends to all of their friends. For example, Michael may have three friends to whom he sends 6, 10, and 200 messages, respectively, so Michael's mean would be (6 + 10 + 200)/3. The solution may require a pipeline of two MapReduce jobs.
```
def mapper1(line):
    key1, key2, message = line.split(',')
    emit((key1, key2), 1)

def reducer1(key, values):
    emit(key, sum(values))

def mapper2(key, value):
    key1, key2 = key  # unpack both friends' names into separate keys
    emit(key1, value)

def reducer2(key, values):
    emit(key, mean(values))
```
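The pipeline above can be checked with a small in-memory simulation: the dictionaries below play the role of the shuffle between mappers and reducers, and the sample records are made up.

```python
from collections import defaultdict

def mean_messages(lines):
    # Job 1 (mapper1 + reducer1): count messages per (sender, receiver) pair.
    pair_counts = defaultdict(int)
    for line in lines:
        # Names contain no commas, so the first two splits isolate them.
        sender, receiver, _message = [f.strip() for f in line.split(',', 2)]
        pair_counts[(sender, receiver)] += 1
    # Job 2 (mapper2 + reducer2): average the per-friend counts per sender.
    per_sender = defaultdict(list)
    for (sender, _receiver), count in pair_counts.items():
        per_sender[sender].append(count)
    return {s: sum(c) / len(c) for s, c in per_sender.items()}

records = [
    "Michael, John, Pabst Blue Ribbon",
    "Michael, John, Natural Lemon Flavor",
    "Michael, Tiffany, BMX Racing",
    "John, Michael, Hello",
]
print(mean_messages(records))  # Michael averages 1.5 messages per friend
```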
From historical data, you know that 50% of students who take Cloudera's Introduction to Data Science: Building Recommender Systems training course pass this exam, while only 25% of students who did not take the training course pass this exam. You also know that 50% of this exam's candidates take the training course. If we know that a person has passed this exam, what is the probability that they took Cloudera's Introduction to Data Science: Building Recommender Systems training course?
2/3
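The arithmetic behind both probability questions in this set, using the figures stated in the question:

```python
p_course = 0.5           # P(took the training course)
p_pass_course = 0.5      # P(pass | took the course)
p_pass_no_course = 0.25  # P(pass | did not take the course)

# Law of total probability: P(pass).
p_pass = p_course * p_pass_course + (1 - p_course) * p_pass_no_course

# Bayes' rule: P(took the course | pass).
p_course_given_pass = p_course * p_pass_course / p_pass

print(p_pass)               # 0.375 = 3/8
print(p_course_given_pass)  # 0.666... = 2/3
```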
You want to build a classification model to identify spam comments on a blog. You decide to use the words in the comment text as inputs to your model. Which criteria should you use when deciding which words to use as features in order to contribute to making the correct classification decision?
Choose words for your sample that are most correlated with the Spam label
You've built a model that has ten different variables with complicated independence relationships between them, and both continuous and discrete variables that have complicated, multi-parameter distributions. Computing the joint probability distribution is complex, but computing the conditional probabilities for the variables is easy. What is the most computationally efficient method for computing the expected value?
Markov Chain Monte Carlo
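A minimal sketch of why MCMC fits this setting: Gibbs sampling draws from each easy conditional in turn and never evaluates the joint density. The bivariate normal model and all parameters below are made-up stand-ins for the question's ten-variable model.

```python
import math
import random

def gibbs_mean_x(rho=0.6, n_samples=20000, burn_in=1000, seed=0):
    # Standard bivariate normal with correlation rho: each conditional
    # X | Y=y ~ N(rho*y, 1 - rho^2) is trivial to sample, even when the
    # joint distribution would be awkward to integrate directly.
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    x = y = 0.0
    total = 0.0
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)  # draw X from its conditional
        y = rng.gauss(rho * x, sd)  # draw Y from its conditional
        if i >= burn_in:
            total += x
    return total / n_samples  # Monte Carlo estimate of E[X] (true value: 0)

print(gibbs_mean_x())
```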
Which two machine learning algorithms should you consider as likely to benefit from discretizing continuous features?
Support vector machine
Naïve Bayes
How can the naiveté of the naive Bayes classifier be advantageous?
It does not require you to make strong assumptions about the data because it is non-parametric
Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. What are the five numbers that summarize this distribution (the five-number summary of sample percentiles)?
1, 3, 8, 34, 89
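A sketch of the computation. Note that quartile conventions vary: the median-of-halves rule used here gives Q1 = 2 for this sample, while interpolating conventions can give values between 2 and 3.

```python
def five_number_summary(xs):
    # Minimum, lower quartile, median, upper quartile, maximum.
    s = sorted(xs)

    def median(v):
        mid = len(v) // 2
        return v[mid] if len(v) % 2 else (v[mid - 1] + v[mid]) / 2

    lower = s[:len(s) // 2]        # values below the median
    upper = s[(len(s) + 1) // 2:]  # values above the median
    return (s[0], median(lower), median(s), median(upper), s[-1])

print(five_number_summary([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]))
# (1, 2, 8, 34, 89)
```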
From historical data, you know that 50% of students who take Cloudera's Introduction to Data Science: Building Recommender Systems training course pass this exam, while only 25% of students who did not take the training course pass this exam. You also know that 50% of this exam's candidates take the training course. What is the probability that any individual exam candidate will pass the data science exam?
3/8
Refer to the sampling scenario and Methods A, B, and C described above. Which method requires the most RAM?
Method C (it must hold the N/10 sampled indices, and the full N-element permutation while generating them, in memory)
Which two techniques should you use to avoid overfitting a classification model to a data set?
Include a small number of "noise" features that are not thought to be correlated with the dependent variable.
Preprocess the data to exclude atypical observations from the model input
You have just run a MapReduce job to filter user messages to only those of a selected geographical region. The output for this job is in a directory named westUsers, located just below your home directory in HDFS. Which command gathers these records into a single file on your local file system?
hadoop fs -getmerge westUsers westUsers.txt
Which best describes the primary function of Flume?
Flume collects, aggregates, and moves large amounts of streaming data, such as log files, into HDFS
You have user profile records in an OLTP database that you want to join with web server logs which you have already ingested into HDFS. What is the best way to acquire the user profile for use in HDFS?
Ingest using Sqoop
Which recommender system technique is domain specific?
Content-based filtering
You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers' access logs into your Hadoop cluster for analysis?
Configure Flume agents on the web servers to collect the access logs and deliver them to your Hadoop cluster
Refer to the sampling scenario and Methods A, B, and C described above. Which method is least likely to give you exactly 10% of your data?
Method A (its sample size is binomially distributed around N/10, while Methods B and C select a fixed count)
You have acquired a new data source of millions of customer records, and you've ingested this data into HDFS. Prior to analysis, you want to change all customer registration dates to the same format, make all addresses uppercase, and remove all customer names (for anonymization). Which process will accomplish all three objectives?
Write a script that receives records on stdin, corrects them, and then writes them to stdout. Then, invoke this script in a map-only Hadoop Streaming job
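A minimal sketch of such a streaming mapper script. The three-field record layout (name, address, MM/DD/YYYY registration date) is an assumption for illustration, not part of the question.

```python
import sys
from datetime import datetime

def clean_record(line):
    # Assumed layout: name, address, registration date (comma separated).
    name, address, reg_date = [f.strip() for f in line.split(',')]
    # 1) Normalize the registration date to a single (ISO 8601) format.
    iso = datetime.strptime(reg_date, "%m/%d/%Y").strftime("%Y-%m-%d")
    # 2) Uppercase the address; 3) drop the name for anonymization.
    return "%s,%s" % (address.upper(), iso)

if __name__ == "__main__":
    # Hadoop Streaming feeds records on stdin and reads results from stdout.
    for raw in sys.stdin:
        if raw.strip():
            print(clean_record(raw))
```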
Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 What are two benefits of using the five-number summary of sample percentiles to summarize a data set?
You can calculate it quickly using a relational database like MySQL, even when you have a large sample
Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. How do high-level languages like Apache Hive and Apache Pig efficiently calculate approximate percentiles for a distribution?
They use pivots to assign each observation to the reducer that calculates each percentile
You have a large m x n data matrix M. You want to perform dimension reduction/clustering on your data and have decided to use singular value decomposition (SVD; also called principal components analysis, PCA). Which expression represents the SVD of the matrix M, given the following definitions? U is m x m unitary; V is n x n unitary; S is m x n diagonal; Q is n x n invertible; D is n x n diagonal; L is m x m lower triangular; U is m x m upper triangular.
M = U S V^T
## Study Guides