DS-200 - Data Science Essentials Beta


Example Questions

When optimizing a function using stochastic gradient descent, how frequently should you update your estimate of the gradient?

You have a large file of N records (one per line), and want to randomly sample 10% of them. You have two functions that are perfect random number generators (though they are a bit slow): random_uniform() generates a uniformly distributed number in the interval [0, 1], and random_permutation(M) generates a random permutation of the numbers 0 through M - 1. Below are three different functions that implement the sampling (a runnable sketch of these three methods appears after this block of questions).

Method A:
    for line in file:
        if random_uniform() < 0.1:
            print line

Method B:
    i = 0
    for line in file:
        if i % 10 == 0:
            print line
        i += 1

Method C:
    idxs = random_permutation(N)[:(N/10)]
    i = 0
    for line in file:
        if i in idxs:
            print line
        i += 1

Which method will have the best runtime performance?

A company has 20 software engineers working on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week. You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the distribution of bug fixes per engineer?

What is the best way to determine the learning rate parameters for stochastic gradient descent when the distribution of the input data shifts over time?

You have a directory containing a number of comma-separated files. Each file has three columns, and each filename has a .csv extension. You want to produce a single tab-separated file (with a .tsv extension) that contains all the rows from all the files. Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?

What are two defining features of RMSE (root-mean-square error or root-mean-square deviation)?

In what format are web server log files usually generated, and how must you transform them in order to make them usable for analysis in Hadoop?

You have a large m x n data matrix M. You decide you want to perform dimension reduction/clustering on your data and have decided to use the singular value decomposition (SVD; also called principal components analysis, PCA). You performed the SVD on your data matrix, but you did not center your data first. What does your first singular component describe?

In what way can Hadoop be used to improve the performance of Lloyd's algorithm for k-means clustering on large data sets?

You are building a k-nearest neighbor (k-NN) classifier on a labeled set of points in a high-dimensional space. You determine that the classifier has a large error on the training data. What is the most likely problem?

Refer to the sampling passage above. Which method might introduce unexpected correlations?

Why is the naive Bayes classifier "naive"?
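The three sampling methods above are given only as pseudocode. Below is a minimal runnable Python sketch of them; random.random and random.sample are standard-library stand-ins for the hypothetical random_uniform and random_permutation helpers, and the input is assumed to be a list of lines rather than a file handle.

    import random

    def method_a(lines):
        # Independent coin flip per line: roughly 10% of the lines on average,
        # but the exact count varies from run to run.
        return [line for line in lines if random.random() < 0.1]

    def method_b(lines):
        # Systematic sampling: every 10th line, with no randomness at all.
        return [line for i, line in enumerate(lines) if i % 10 == 0]

    def method_c(lines, n):
        # Choose exactly n // 10 distinct line indices up front, then keep those
        # lines. The membership test on the index list mirrors the pseudocode
        # and is linear in the size of that list.
        idxs = random.sample(range(n), n // 10)
        return [line for i, line in enumerate(lines) if i in idxs]

    lines = ["record %d" % i for i in range(1000)]
    print(len(method_a(lines)), len(method_b(lines)), len(method_c(lines, len(lines))))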
What is one limitation encountered by all systems that employ collaborative filtering and use preferences as input in order to output product recommendations to consumers?

Refer to the bug-fixing passage above. One engineer points out that some bugs are more difficult to fix than others. What metric should you use to estimate how hard a particular bug is to fix?

Many machine learning algorithms involve finding the global minimum of a convex loss function, primarily because:

You have a data file that contains two trillion records, one record per line (comma separated). Each record lists two friends and a unique message sent between them. Their names will not have commas.

Michael, John, Pabst, Blue Ribbon
Tiffany, James, BMX Racing
John, Michael, Natural Lemon Flavor

Analyze the pseudo code examples below and determine which set of mappers and reducers will solve for the mean number of messages each user sends to all of his or her friends. For example, Michael may have three friends to whom he sends 6, 10, and 200 messages, respectively, so Michael's mean would be (6 + 10 + 200) / 3. The solution may require a pipeline of two MapReduce jobs (an in-memory illustration of such a pipeline appears after this block of questions).

From historical data, you know that 50% of students who take Cloudera's Introduction to Data Science: Building Recommender Systems training course pass this exam, while only 25% of students who did not take the training course pass this exam. You also know that 50% of this exam's candidates also take Cloudera's Introduction to Data Science: Building Recommender Systems training course. If we know that a person has passed this exam, what is the probability that they took Cloudera's Introduction to Data Science: Building Recommender Systems training course?

You want to build a classification model to identify spam comments on a blog. You decide to use the words in the comment text as inputs to your model. Which criteria should you use when deciding which words to use as features, so that they contribute to making the correct classification decision?

You've built a model that has ten different variables with complicated independence relationships between them, and both continuous and discrete variables with complicated, multi-parameter distributions. Computing the joint probability distribution is complex, but it turns out that computing the conditional probabilities for the variables is easy. What is the most computationally efficient method for computing the expected value?

Which two machine learning algorithms should you consider as likely to benefit from discretizing continuous features?

How can the naiveté of the naive Bayes classifier be advantageous?

Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. What are the five numbers that summarize this distribution (the five-number summary of sample percentiles)?
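The pseudo code mapper and reducer options referenced in the messaging question are not reproduced here. As a rough, in-memory illustration only (not one of the exam's answer choices), the following Python sketch walks through the two-job pipeline the question describes: the first pass counts messages per (sender, recipient) pair, and the second pass averages those counts per sender.

    from collections import defaultdict

    records = [
        ("Michael", "John", "Pabst, Blue Ribbon"),
        ("Tiffany", "James", "BMX Racing"),
        ("John", "Michael", "Natural Lemon Flavor"),
    ]

    # Job 1: count messages per (sender, recipient) pair.
    pair_counts = defaultdict(int)
    for sender, recipient, _message in records:     # map: emit ((sender, recipient), 1)
        pair_counts[(sender, recipient)] += 1       # reduce: sum the 1s per key

    # Job 2: average the per-recipient counts for each sender.
    sender_counts = defaultdict(list)
    for (sender, _recipient), count in pair_counts.items():   # map: emit (sender, count)
        sender_counts[sender].append(count)
    means = {sender: sum(counts) / len(counts)                 # reduce: mean per sender
             for sender, counts in sender_counts.items()}

    print(means)  # e.g. {'Michael': 1.0, 'Tiffany': 1.0, 'John': 1.0}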
Refer to the training course pass-rate passage above. What is the probability that any individual exam candidate will pass the data science exam?

Refer to the sampling passage above. Which method requires the most RAM?

Which two techniques should you use to avoid overfitting a classification model to a data set?

You have just run a MapReduce job to filter user messages to only those of a selected geographical region. The output for this job is in a directory named westUsers, located just below your home directory in HDFS. Which command gathers these records into a single file on your local file system?

Which best describes the primary function of Flume?

You have user profile records in an OLTP database that you want to join with web server logs which you have already ingested into HDFS. What is the best way to acquire the user profile data for use in HDFS?

Which recommender system technique is domain specific?

You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers' access logs into your Hadoop cluster for analysis?

Refer to the sampling passage above. Which method is least likely to give you exactly 10% of your data?

You have acquired a new data source of millions of customer records, and you've ingested this data into HDFS. Prior to analysis, you want to change all customer registration dates to the same format, make all addresses uppercase, and remove all customer names (for anonymization). Which process will accomplish all three objectives?

Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89. What are two benefits of using the five-number summary of sample percentiles to summarize a data set?

Refer to the sample above. How do high-level languages like Apache Hive and Apache Pig efficiently calculate approximate percentiles for a distribution?
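For the five-number-summary questions above, here is a small sketch of how the summary can be computed with NumPy's percentile function. Note that this uses NumPy's default linear interpolation between order statistics; other quartile conventions (such as Tukey's hinges) give slightly different values.

    import numpy as np

    sample = np.array([1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89])

    # Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
    summary = np.percentile(sample, [0, 25, 50, 75, 100])
    print(summary)  # [ 1.   2.5  8.  27.5 89. ]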
Refer to the SVD passage above. What represents the SVD of the matrix M, given the following information:

U is m x m unitary
V is n x n unitary
S is m x n diagonal
Q is n x n invertible
D is n x n diagonal
L is m x m lower triangular
U is m x m upper triangular
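As a quick sanity check on the factor shapes listed above, the sketch below computes a full SVD of a small random matrix with NumPy (the data here is a random placeholder, not anything from the exam) and verifies that the matrix is recovered from the product of the unitary and diagonal factors.

    import numpy as np

    m, n = 6, 4
    M = np.random.rand(m, n)           # placeholder data matrix

    # Full SVD: U is m x m unitary, Vt is the transpose of an n x n unitary V,
    # and s holds the singular values that populate the m x n diagonal factor S.
    U, s, Vt = np.linalg.svd(M, full_matrices=True)

    S = np.zeros((m, n))
    S[:n, :n] = np.diag(s)             # embed the singular values on the diagonal

    print(np.allclose(M, U @ S @ Vt))  # True: M is reconstructed as U S V^T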

Study Guides