CCD-410 - Cloudera Certified Developer for Apache Hadoop (CCDH)
What is a SequenceFile?
A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type, and each value must be the same type.
In the reducer, the MapReduce API provides you with an iterator over Writable values. What does calling the next() method return?
It returns a reference to the same Writable object each time, but populated with different data.
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order
A combiner reduces:
The amount of intermediate data that must be transferred between the mapper and reducer.
In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?
The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
For each input key-value pair, mappers can emit:
As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?
They would see the current contents of the file up through the last completed block.
You write a MapReduce job to process 100 files in HDFS. Your MapReduce algorithm uses TextInputFormat: the mapper applies a regular expression over input values and emits key-value pairs with the key consisting of the matching text, and the value containing the filename and byte offset. Determine the difference between setting the number of reducers to one and setting the number of reducers to zero.
With zero reducers, instances of matching patterns are stored in multiple files on HDFS. With one reducer, all instances of matching patterns are gathered together in one file on HDFS.
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
When is the earliest point at which the reduce method of a given Reducer can be called?
Not until all mappers have finished processing all records.
In a large MapReduce job with m mappers and n reducers, how many distinct copy operations will there be in the sort/shuffle phase?
m × n (i.e., m multiplied by n)
Your client application submits a MapReduce job to your Hadoop cluster. Identify the Hadoop daemon on which the Hadoop framework will look for an available slot to schedule a MapReduce operation.
The TaskTracker.
Table metadata in Hive is:
Stored in the Metastore.
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
hadoop MyDriver -D mapred.job.name=Example input output
Your cluster's HDFS block size is 64 MB. You have a directory containing 100 plain text files, each of which is 100 MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run.
200: each 100 MB file spans two 64 MB blocks, so TextInputFormat creates two input splits per file, and each split is processed by one Mapper.
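For a block-based InputFormat the mapper count follows from simple ceiling arithmetic; a minimal plain-Java sketch, with the file and block sizes taken from the question (it ignores the split-slop tolerance the real FileInputFormat applies to the final split, which does not change the result here):

```java
// Sketch of FileInputFormat-style split arithmetic:
// each file is divided into splits of at most blockSize bytes.
public class SplitCount {
    // Number of input splits for a single file of the given size (ceiling division).
    static long splitsForFile(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Total mappers for a directory of identically sized files.
    static long totalMappers(int numFiles, long fileSize, long blockSize) {
        return numFiles * splitsForFile(fileSize, blockSize);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 100 files of 100 MB each, 64 MB block size -> 2 splits per file.
        System.out.println(totalMappers(100, 100 * mb, 64 * mb)); // prints 200
    }
}
```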
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
Keys are presented to the reducer in sorted order; the values for a given key are not sorted.
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
Write a custom FileInputFormat and override the method isSplitable to always return false.
Analyze each scenario below and identify which best describes the behavior of the default partitioner?
The default partitioner computes the hash of the key and takes that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
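HashPartitioner's rule can be sketched in plain Java; here String stands in for the Text key, and the mask-then-modulo form mirrors Hadoop's, which clears the sign bit so the result is never negative:

```java
public class DefaultPartitionSketch {
    // Mirrors HashPartitioner.getPartition: hash the key, mask off the
    // sign bit, then take the value modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key is always assigned to the same reducer.
        System.out.println(getPartition("Apple", 4) == getPartition("Apple", 4)); // prints true
    }
}
```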
You need to perform statistical analysis in your MapReduce job and would like to call methods in the Apache Commons Math library, which is distributed as a 1.3 megabyte Java archive (JAR) file. Which is the best way to make this library available to your MapReduce job at runtime?
When submitting the job on the command line, specify the -libjars option followed by the JAR file path.
When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?
When the types of the reduce operation's input key and input value match the types of the reducer's output key and output value, and when the reduce operation is both commutative and associative.
What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?
By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each one of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
Disk I/O and network I/O
Which best describes how TextInputFormat processes input files and line breaks?
Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
Hadoop Streaming.
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
The configure method, which is called once per Mapper before any calls to the map method.
A client application creates an HDFS file named foo.txt with a replication factor of 3. Identify which best describes the file access rules in HDFS if the file has a single block that is stored on data nodes A, B and C?
The file can be accessed if at least one of the data nodes storing the file is available.
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.
The InputFormat used by the job determines the mapper's input key and value types.
Determine which best describes when the reduce method is first called in a MapReduce job?
Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
Which process describes the lifecycle of a Mapper?
The TaskTracker spawns a new Mapper to process all records in a single input split.
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
Ingest the server web logs into HDFS using Flume.
On a cluster running MapReduce v1 (MRv1), a TaskTracker heartbeats into the JobTracker on your cluster, and alerts the JobTracker it has an open map task slot. What determines how the JobTracker assigns each map task to a TaskTracker?
The location of the InputSplit to be processed in relation to the location of the node.
Identify which best defines a SequenceFile?
A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
Yes: emit the shared key as the intermediate key from both tables' records and join the values in the reducer (a reduce-side join).
To process input key-value pairs, your mapper needs to load a 512 MB data file into memory. What is the best way to accomplish this?
Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.
What types of algorithms are difficult to express in MapReduce v1 (MRv1)?
Algorithms that require global, shared state.
For each intermediate key, each reducer task can emit:
As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
You have a directory named jobdata in HDFS that contains four files: _first.txt, second.txt, .third.txt and #data.txt. How many files will be processed by the FileInputFormat.setInputPaths() command when it's given a path object representing this directory?
Two: file names with a leading period or underscore are ignored.
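FileInputFormat's default path filter skips files whose names begin with an underscore or a dot (those prefixes mark job metadata and hidden files); a plain-Java sketch of that rule applied to the four file names from the question:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HiddenFileFilterSketch {
    // Mirrors FileInputFormat's default hidden-file filter: accept a file
    // unless its name begins with '_' or '.'.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("_first.txt", "second.txt", ".third.txt", "#data.txt");
        List<String> kept = files.stream()
                .filter(HiddenFileFilterSketch::accept)
                .collect(Collectors.toList());
        System.out.println(kept); // prints [second.txt, #data.txt]
    }
}
```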
What data does a Reducer reduce method process?
All data for a given key, regardless of which mapper(s) produced it.
Workflows expressed in Oozie can contain:
Sequences of MapReduce and Pig jobs. These sequences can be combined with other actions, including forks, decision points, and path joins.
All keys used for intermediate output from mappers must:
Implement the WritableComparable interface.
You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer's reduce method be invoked?
Three times: once for each unique intermediate key (Apple, Banana, and Cherry).
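The shuffle groups intermediate values by key before reduce is ever called, so the number of reduce invocations equals the number of distinct keys; a plain-Java sketch of that grouping, with String standing in for Text:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceCallCount {
    // Group (key, value) pairs by key, as the shuffle does before reduce.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] kv : pairs) {
            groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> emitted = Arrays.asList(
                new String[]{"Apple", "Red"},
                new String[]{"Banana", "Yellow"},
                new String[]{"Apple", "Yellow"},
                new String[]{"Cherry", "Red"},
                new String[]{"Apple", "Green"});
        // One reduce call per distinct key: Apple, Banana, Cherry.
        System.out.println(groupByKey(emitted).size()); // prints 3
    }
}
```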
You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
Reducer <Text, IntWritable, Text, IntWritable>
Which best describes what the map method accepts and emits?
It accepts a single key-value pair as input and can emit any number of key-value pairs as output, including zero.
In a MapReduce job with 500 map tasks, how many map task attempts will there be?
At least 500.
You want to count the number of occurrences for each unique word in the supplied input data. You've decided to implement this by having your mapper tokenize each word and emit a literal value 1, and then have your reducer increment a counter for each literal 1 it receives. After successfully implementing this, it occurs to you that you could optimize this by specifying a combiner. Will you be able to reuse your existing Reducer as your combiner in this case and why or why not?
Yes, because the sum operation is both associative and commutative and the input and output types to the reduce method match.
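That summing is safe to combine can be checked directly: summing partial sums gives the same total as summing all values at once. A minimal sketch, with plain ints standing in for IntWritable and the sublists standing in for per-mapper output:

```java
import java.util.Arrays;
import java.util.List;

public class CombinerEquivalence {
    // The "reduce" operation: sum a list of counts.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 1, 1, 1);
        // Without a combiner: one reduce over all values for the key.
        int direct = sum(counts);
        // With a combiner: partial sums per "mapper", then a final reduce.
        int combined = sum(Arrays.asList(sum(counts.subList(0, 2)),
                                         sum(counts.subList(2, 5))));
        System.out.println(direct == combined); // prints true
    }
}
```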
Given a directory of files with the following structure: line number, tab character, string. Example:
1	abialkjfjkaoasdfjksdlkjhqweroij
2	kadfjhuwqounahagtnbvaswslmnbfgy
3	kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
KeyValueTextInputFormat
Which describes how a client reads a file from HDFS?
The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly from the DataNode(s).
You want to run Hadoop jobs on your development workstation for testing before you submit them to your production cluster. Which mode of operation in Hadoop allows you to most closely simulate a production cluster while using a single machine?
Run the DataNode, TaskTracker, NameNode and JobTracker daemons on a single machine.