CCA-505 - Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam
You are planning a Hadoop cluster and considering implementing 10 Gigabit Ethernet as the network fabric. Which workloads benefit the most from a faster network fabric?
When your workload generates a large amount of intermediate data, on the order of the input data itself
You decide to create a cluster which runs HDFS in High Availability mode with automatic failover, using Quorum-based Storage. What is the purpose of ZooKeeper in such a configuration?
It only keeps track of which NameNode is Active at any given time
You have installed a cluster running HDFS and MapReduce version 2 (MRv2) on YARN. You have no dfs.hosts entry(ies) in your hdfs-site.xml configuration file. You configure a new worker node by setting fs.default.name in its configuration files to point to the NameNode on your cluster, and you start the DataNode daemon on that worker node. What do you have to do on the cluster to allow the worker node to join, and start storing HDFS blocks?
Without creating a dfs.hosts file or making any entries, run the command hadoop dfsadmin -refreshNodes on the NameNode
You are migrating a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) on YARN. You want to maintain your MRv1 TaskTracker slot capacities when you migrate. What should you do?
Configure yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores to match the capacity you require under YARN for each NodeManager
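As a rough illustration of the sizing arithmetic behind such a migration, the sketch below converts MRv1 slot counts into per-NodeManager memory and vcore figures. The slot counts and the per-slot memory are hypothetical values, and the one-vcore-per-slot mapping is an assumption rather than a rule.

```python
def yarn_capacity_from_slots(map_slots, reduce_slots, mb_per_slot, vcores_per_slot=1):
    """Translate MRv1 slot counts into NodeManager capacity settings.

    Assumes every former slot becomes one container of mb_per_slot MB
    and vcores_per_slot virtual cores (an illustrative simplification).
    """
    slots = map_slots + reduce_slots
    return {
        "yarn.nodemanager.resource.memory-mb": slots * mb_per_slot,
        "yarn.nodemanager.resource.cpu-vcores": slots * vcores_per_slot,
    }


# Hypothetical node: 8 map slots, 4 reduce slots, 2048 MB per slot.
capacity = yarn_capacity_from_slots(map_slots=8, reduce_slots=4, mb_per_slot=2048)
for name, value in capacity.items():
    print(f"{name} = {value}")
```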
You are configuring a cluster running HDFS and MapReduce version 2 (MRv2) on YARN, on Linux. How must you format the underlying filesystem of each DataNode?
They must be formatted as either ext3 or ext4
You are running a Hadoop cluster with MapReduce version 2 (MRv2) on YARN. You consistently see that MapReduce map tasks on your cluster are running slowly because of excessive JVM garbage collection. How do you increase the JVM heap size property to 3GB to optimize performance?
Your cluster is running MapReduce version 2 (MRv2) on YARN. Your ResourceManager is configured to use the FairScheduler. Now you want to configure your scheduler such that a new user on the cluster can submit jobs into their own queue upon application submission. Which configuration should you set?
You can specify a new queue name when a user submits a job, and the new queue can be created dynamically if yarn.scheduler.fair.user-as-default-queue = false
During the execution of a MapReduce v2 (MRv2) job on YARN, where does the Mapper place the intermediate data of each Map task?
The Mapper stores the intermediate data on the underlying filesystem of the local disk in the directories yarn.nodemanager.local-dirs
You observe that the number of spilled records from Map tasks far exceeds the number of map output records. Your child heap size is 1GB and your io.sort.mb value is set to 100 MB. How would you tune your io.sort.mb value to achieve maximum memory to disk I/O ratio?
Tune the io.sort.mb value until you observe that the number of spilled records equals (or is as close as possible to) the number of map output records
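The comparison between the two counters can be expressed as a simple ratio. The sketch below is only an illustration: the counter values are invented, and the function name is hypothetical. A ratio near 1.0 means each map output record was written to disk about once; a much larger ratio suggests records were spilled and re-merged repeatedly.

```python
def spill_ratio(spilled_records, map_output_records):
    """Return spilled records per map output record."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


# Invented counter values, before and after a hypothetical tuning pass.
before = spill_ratio(spilled_records=3_000_000, map_output_records=1_000_000)
after = spill_ratio(spilled_records=1_000_000, map_output_records=1_000_000)
print(f"before tuning: {before:.2f}, after tuning: {after:.2f}")
```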
You are working on a project where you need to chain together MapReduce and Pig jobs. You also need the ability to use forks, decision points, and path joins. Which ecosystem project should you use to perform these actions?
Your cluster's mapred-site.xml includes the following parameters: <name>mapreduce.map.memory.mb</name> <value>4096</value> <name>mapreduce.reduce.memory.mb</name> <value>8192</value> And your cluster's yarn-site.xml includes the following parameter: <name>yarn.nodemanager.vmem-pmem-ratio</name> <value>2.1</value> What is the maximum amount of virtual memory allocated for each map task before YARN will kill its Container?
You have a Hadoop cluster running HDFS, and a gateway machine external to the cluster from which clients submit jobs. What do you need to do in order to run Impala on the cluster and submit jobs from the command line of the gateway machine?
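The arithmetic behind this question is the container's physical memory multiplied by the virtual-to-physical ratio. A minimal sketch, using the values quoted in the question (the function name is hypothetical):

```python
def vmem_limit_mb(container_mb, vmem_pmem_ratio):
    """Virtual memory ceiling for a container, in MB."""
    return container_mb * vmem_pmem_ratio


map_limit = vmem_limit_mb(4096, 2.1)
reduce_limit = vmem_limit_mb(8192, 2.1)
print(f"map container virtual memory limit: {map_limit:.1f} MB")
print(f"reduce container virtual memory limit: {reduce_limit:.1f} MB")
```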
Install the impalad daemon on each machine in the cluster, the statestored daemon and catalogd daemon on one machine in the cluster, and the impala shell on your gateway machine
A slave node in your cluster has four 2TB hard drives installed (4 x 2TB). The DataNode is configured to store HDFS blocks on the disks. You set the value of the dfs.datanode.du.reserved parameter to 100GB. How does this alter HDFS block storage?
All hard drives may be used to store HDFS blocks as long as at least 100 GB in total is available on the node
Assuming a cluster running HDFS, MapReduce version 2 (MRv2) on YARN with all settings at their default, what do you need to do when adding a new slave node to a cluster?
Restart the NameNode and ResourceManager daemons and resubmit any running jobs
You are upgrading a Hadoop cluster from HDFS and MapReduce version 1 (MRv1) to one running HDFS and MapReduce version 2 (MRv2) on YARN. You want to set and enforce a block size of 128MB for all new files written to the cluster after the upgrade. What should you do?
Set dfs.block.size to 134217728 on all the worker nodes and client machines, and set the parameter to final. You do need to set this value on the NameNode.
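The byte value in this option is simply 128 MB expressed in bytes, which a short check confirms (the helper name is hypothetical):

```python
def mb_to_bytes(megabytes):
    """Convert binary megabytes (1 MB = 1024 * 1024 bytes) to bytes."""
    return megabytes * 1024 * 1024


block_size = mb_to_bytes(128)
print(block_size)
```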
You want a node to only swap Hadoop daemon data from RAM to disk when absolutely necessary. What should you do?
Set vm.swappiness to 0 in /etc/sysctl.conf
Which YARN daemon or service monitors a Container's per-application resource usage (e.g., memory, CPU)?
You have converted your Hadoop cluster from a MapReduce 1 (MRv1) architecture to a MapReduce 2 (MRv2) on YARN architecture. Your developers are accustomed to specifying the number of map and reduce tasks (resource allocation) when they run jobs. A developer wants to know how to specify the number of reduce tasks when a specific job runs. Which method should you tell that developer to implement?
In YARN, resource allocation is a function of virtual cores specified by the ApplicationMaster making requests to the NodeManager where a reduce task is handled by a single container (and thus a single virtual core). Thus, the developer needs to specify the number of virtual cores to the NodeManager by executing -p yarn.nodemanager.cpu-vcores=2
Which is the default scheduler in YARN?
Your Hadoop cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. Can you configure a worker node to run a NodeManager daemon but not a DataNode daemon and still have a functioning cluster?
Yes. The daemon will receive data from the NameNode to run Map tasks
Each node in your Hadoop cluster, running YARN, has 64 GB memory and 24 cores. Your yarn-site.xml has the following configuration: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>32768</value> </property> <property> <name>yarn.nodemanager.resource.cpu-vcores</name> <value>23</value> </property> You want YARN to launch no more than 16 containers per node. What should you do?
Modify yarn-site.xml with the following property: <name>yarn.nodemanager.resource.cpu-vcores</name><value>16</value>
Which two are features of Hadoop's rack topology?
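The number of containers a node can host is bounded by whichever resource runs out first. The sketch below works through that bound for the node described above. The per-container sizes are hypothetical, and it assumes the scheduler accounts for both memory and vcores, which depends on how the scheduler's resource calculator is configured.

```python
def max_containers(node_mb, node_vcores, container_mb, container_vcores=1):
    """Upper bound on concurrent containers per node.

    Assumes both memory and vcores are enforced; with a memory-only
    resource calculator, only the memory term would apply.
    """
    by_memory = node_mb // container_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Node from the question (32768 MB, 23 vcores) with hypothetical 2048 MB containers.
print(max_containers(32768, 23, container_mb=2048))
# Same node limited to 16 vcores, with hypothetical 1024 MB containers.
print(max_containers(32768, 16, container_mb=1024))
```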
Even for small clusters on a single rack, configuring rack awareness will improve performance.
Rack location is considered in the HDFS block placement policy
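Rack locations are typically supplied to Hadoop by a topology script, referenced from the net.topology.script.file.name property, which receives host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch of such a script follows; the address-to-rack table is entirely hypothetical.

```python
import sys

# Hypothetical mapping of DataNode addresses to rack paths.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.11": "/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_rack(host):
    """Return the rack path for a host, falling back to the default rack."""
    return RACKS.get(host, DEFAULT_RACK)


if __name__ == "__main__":
    # Print one rack path per host argument, in the order given.
    for host in sys.argv[1:]:
        print(resolve_rack(host))
```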
What must you do if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?
You must restart the NameNode daemon to apply the changes to the cluster
You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes.
You use the hadoop fs -put command to add a file "sales.txt" to HDFS. This file is small enough that it fits into a single block, which is replicated to three nodes in your cluster (with a replication factor of 3). One of the nodes holding this file (a single block) fails. How will the cluster handle the replication of this file in this situation?
This file will be immediately re-replicated and all other HDFS operations on the cluster will halt until the cluster's replication values are restored
A user comes to you, complaining that when she attempts to submit a Hadoop job, it fails. There is a directory in HDFS named /data/input. The Jar is named j.jar, and the driver class is named DriverClass. She runs the command: hadoop jar j.jar DriverClass /data/input /data/output The error message returned includes the line: PrivilegedActionException as:training (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/data/input What is the cause of the error?
The Hadoop configuration files on the client do not point to the cluster
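The clue in the error message is the scheme of the path it reports: a path beginning with file: was resolved against the local filesystem rather than HDFS. A small sketch of that check follows; the hdfs:// URI and its host name are hypothetical.

```python
from urllib.parse import urlparse


def filesystem_scheme(path_in_error):
    """Return the URI scheme of a path as reported in an error message."""
    return urlparse(path_in_error).scheme


local = filesystem_scheme("file:/data/input")
# Hypothetical NameNode address, for comparison only.
remote = filesystem_scheme("hdfs://namenode.example.com:8020/data/input")
print(local, remote)
```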
For each YARN job, the Hadoop framework generates task log files. Where are Hadoop's task log files stored?
On the local disk of the slave node running the task
On a cluster running CDH 5.0 or above, you use the hadoop fs -put command to write a 300MB file into a previously empty directory using an HDFS block size of 64MB. Just after this command has finished writing 200MB of this file, what would another user see when they look in the directory?
They will see the file with a ._COPYING_ extension on its name. If they view the file, they will see contents of the file up to the last completed block (as each 64MB block is written, that block becomes available)
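Whatever the visibility semantics, the block arithmetic for this scenario can be worked out directly. The sketch below only counts blocks for the sizes given in the question; the function name is hypothetical.

```python
import math


def block_progress(file_mb, written_mb, block_mb):
    """Return (total blocks, completed blocks, MB held in completed blocks)."""
    total_blocks = math.ceil(file_mb / block_mb)
    completed_blocks = written_mb // block_mb
    return total_blocks, completed_blocks, completed_blocks * block_mb


total, completed, completed_mb = block_progress(file_mb=300, written_mb=200, block_mb=64)
print(f"{completed} of {total} blocks complete ({completed_mb} MB)")
```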
Which YARN daemon or service negotiates map and reduce Containers from the Scheduler, tracking their status and monitoring for progress?
Your cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. What is the result when you execute: hadoop jar samplejar.jar MyClass on a client machine?
SampleJar.jar is sent to the ApplicationMaster, which allocates a container for SampleJar.jar
You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web server logs into your Hadoop cluster for analysis?
Sample the web server logs from the web servers and copy them into HDFS using curl
Ingest the server web logs into HDFS using Flume
Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command: hdfs haadmin -failover nn01 nn02
nn01 is fenced, and nn02 becomes the active NameNode
You have a 20 node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?
Run the ResourceManager on a different master from the NameNode in order to load-share HDFS metadata processing
Your Hadoop cluster contains nodes in three racks. You have NOT configured the dfs.hosts property in the NameNode's configuration file. What results?
Any machine running the DataNode daemon can immediately join the cluster
Your cluster has the following characteristics: Which describes the file read process when a client application connects into the cluster and requests a 50MB file?
The client queries the NameNode which retrieves the block from the nearest DataNode to the client and then passes that block back to the client.
In CDH4 and later, which file contains a serialized form of all the directory and files inodes in the filesystem, giving the NameNode a persistent checkpoint of the filesystem metadata?
fsimage_N (where N reflects all transactions up to transaction ID N)
You have a cluster running with the Fair Scheduler enabled. There are currently no jobs running on the cluster, and you submit job A, so that only job A is running on the cluster. A while later, you submit job B. Now job A and job B are running on the cluster at the same time. How will the Fair Scheduler handle these two jobs?
When job B gets submitted, job A has to finish first, before job B can be scheduled