CCA-505 - Cloudera Certified Administrator for Apache Hadoop (CCAH) CDH5 Upgrade Exam


Example Questions

You are planning a Hadoop cluster and considering implementing 10 Gigabit Ethernet as the network fabric. Which workloads benefit the most from a faster network fabric?

You decide to create a cluster which runs HDFS in High Availability mode with automatic failover, using Quorum-based Storage. What is the purpose of ZooKeeper in such a configuration?

You have installed a cluster running HDFS and MapReduce version 2 (MRv2) on YARN. You have no dfs.hosts entries in your hdfs-site.xml configuration file. You configure a new worker node by setting fs.default.name in its configuration files to point to the NameNode on your cluster, and you start the DataNode daemon on that worker node. What do you have to do on the cluster to allow the worker node to join, and start storing HDFS blocks?

You are migrating a cluster from MapReduce version 1 (MRv1) to MapReduce version 2 (MRv2) on YARN. You want to maintain your MRv1 TaskTracker slot capacities when you migrate. What should you do?

You are configuring a cluster running HDFS and MapReduce version 2 (MRv2) on YARN, running Linux. How must you format the underlying filesystem of each DataNode?

You are running a Hadoop cluster with MapReduce version 2 (MRv2) on YARN. You consistently see that MapReduce map tasks on your cluster are running slowly because of excessive JVM garbage collection. How do you increase the JVM heap size to 3 GB to optimize performance?

Your cluster is running MapReduce version 2 (MRv2) on YARN. Your ResourceManager is configured to use the FairScheduler. Now you want to configure your scheduler such that a new user on the cluster can submit jobs into their own queue at application submission. Which configuration should you set?

During the execution of a MapReduce v2 (MRv2) job on YARN, where does each Map task place its intermediate data?

You observe that the number of spilled records from Map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 100 MB. How would you tune your io.sort.mb value to achieve maximum memory-to-disk I/O ratio?

You are working on a project where you need to chain together MapReduce and Pig jobs. You also need the ability to use forks, decisions, and path joins. Which ecosystem project should you use to perform these actions?

Your cluster's mapred-site.xml includes the following parameters:

    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>

And your cluster's yarn-site.xml includes the following parameters:

    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>2.1</value>

What is the maximum amount of virtual memory allocated for each map task before YARN will kill its Container?

You have a Hadoop cluster running HDFS, and a gateway machine external to the cluster from which clients submit jobs. What do you need to do in order to run jobs on the cluster and submit them from the command line of the gateway machine?

A slave node in your cluster has four 2 TB hard drives installed (4 x 2 TB). The DataNode is configured to store HDFS blocks on the disks. You set the value of the dfs.datanode.du.reserved parameter to 100 GB. How does this alter HDFS block storage?

Assuming a cluster running HDFS and MapReduce version 2 (MRv2) on YARN with all settings at their default, what do you need to do when adding a new slave node to the cluster?

You are upgrading a Hadoop cluster from HDFS and MapReduce version 1 (MRv1) to one running HDFS and MapReduce version 2 (MRv2) on YARN. You want to set and enforce a block size of 128 MB for all new files written to the cluster after the upgrade. What should you do?

You want a node to only swap Hadoop daemon data from RAM to disk when absolutely necessary. What should you do?

Which YARN daemon or service monitors a Container's per-application resource usage (e.g., memory, CPU)?
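For the virtual-memory question above, the limit YARN enforces is the container's requested physical memory multiplied by yarn.nodemanager.vmem-pmem-ratio. A minimal sketch of that arithmetic, using the values from the question (the function name is my own, not a Hadoop API):

```python
def max_virtual_memory_mb(physical_mb, vmem_pmem_ratio):
    """Virtual-memory ceiling YARN enforces for a container, in MB:
    requested physical memory (e.g. mapreduce.map.memory.mb) times
    yarn.nodemanager.vmem-pmem-ratio."""
    return physical_mb * vmem_pmem_ratio

# Values from the question: 4096 MB map containers, 8192 MB reduce
# containers, ratio 2.1.
map_limit = max_virtual_memory_mb(4096, 2.1)      # 8601.6 MB (~8.4 GB)
reduce_limit = max_virtual_memory_mb(8192, 2.1)   # 17203.2 MB (~16.8 GB)
print(map_limit, reduce_limit)
```

A map task's container may therefore use roughly 8.4 GB of virtual memory before the NodeManager kills it.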
You have converted your Hadoop cluster from a MapReduce 1 (MRv1) architecture to a MapReduce 2 (MRv2) on YARN architecture. Your developers are accustomed to specifying map and reduce tasks (resource allocation) when they run jobs. A developer wants to know how to specify the number of reduce tasks when a specific job runs. Which method should you tell that developer to implement?

Which is the default scheduler in YARN?

Your Hadoop cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. Can you configure a worker node to run a NodeManager daemon but not a DataNode daemon and still have a functioning cluster?

Each node in your Hadoop cluster, running YARN, has 64 GB of memory and 24 cores. Your yarn-site.xml has the following configuration:

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>23</value>
    </property>

You want YARN to launch no more than 16 containers per node. What should you do?

Which two are features of Hadoop's rack topology?

What must you do if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?

You use the hadoop fs -put command to add a file sales.txt to HDFS. This file is small enough that it fits into a single block, which is replicated to three nodes in your cluster (with a replication factor of 3). One of the nodes holding this file (a single block) fails. How will the cluster handle the replication of this file in this situation?

A user comes to you, complaining that when she attempts to submit a Hadoop job, it fails. There is a directory in HDFS named /data/input. The JAR is named j.jar, and the driver class is named DriverClass. She runs the command:

    hadoop jar j.jar DriverClass /data/input /data/output

The error message returned includes the line:

    PrivilegedActionException as:training (auth:SIMPLE) cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/data/input

What is the cause of the error?

For each YARN job, the Hadoop framework generates task log files. Where are Hadoop's task log files stored?

On a cluster running CDH 5.0 or above, you use the hadoop fs -put command to write a 300 MB file into a previously empty directory using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when they look in the directory?

Which YARN daemon or service negotiates map and reduce Containers from the Scheduler, tracking their status and monitoring for progress?

Your cluster is configured with HDFS and MapReduce version 2 (MRv2) on YARN. What is the result when you execute: hadoop jar samplejar.jar MyClass on a client machine?

You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web server logs into your Hadoop cluster for analysis?

Your cluster implements HDFS High Availability (HA). Your two NameNodes are named nn01 and nn02. What occurs when you execute the command: hdfs haadmin -failover nn01 nn02?

You have a 20-node Hadoop cluster, with 18 slave nodes and 2 master nodes running HDFS High Availability (HA). You want to minimize the chance of data loss in your cluster. What should you do?

Your Hadoop cluster contains nodes in three racks. You have NOT configured the dfs.hosts property in the NameNode's configuration file. What results?

Your cluster has the following characteristics: Which describes the file read process when a client application connects into the cluster and requests a 50 MB file?
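Several of the questions above turn on how HDFS splits a file into fixed-size blocks: a file occupies as many full blocks as fit, plus one final partial block. A minimal sketch of that arithmetic, using the 300 MB file / 64 MB block size from the question (the function name is my own):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb):
    """Number of HDFS blocks a file occupies: full blocks plus one
    partial block for any remainder."""
    return math.ceil(file_size_mb / block_size_mb)

# A 300 MB file with a 64 MB block size needs four full 64 MB blocks
# plus one final 44 MB block.
blocks = hdfs_block_count(300, 64)
print(blocks)  # 5
last_block_mb = 300 - (blocks - 1) * 64
print(last_block_mb)  # 44
```

Note that the final block only consumes 44 MB of DataNode disk; HDFS does not pad partial blocks to the full block size.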
In CDH4 and later, which file contains a serialized form of all the directory and file inodes in the filesystem, giving the NameNode a persistent checkpoint of the filesystem metadata?

You have a cluster running with the Fair Scheduler enabled. There are currently no jobs running on the cluster, and you submit Job A, so that only Job A is running on the cluster. A while later, you submit Job B. Now Job A and Job B are running on the cluster at the same time. How will the Fair Scheduler handle these two jobs?
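For the Fair Scheduler question, the core idea is that in the simple single-queue, equal-weight case each running job's fair share is the cluster's capacity divided by the number of running jobs: Job A's share shrinks from the whole cluster to half of it once Job B starts, and without preemption Job B only receives resources as Job A's tasks finish. A toy sketch of the share calculation, with a made-up capacity number:

```python
def fair_share(total_capacity, num_jobs):
    """Per-job fair share in the default single-queue, equal-weight case:
    cluster capacity split evenly across running jobs."""
    return total_capacity / num_jobs

# Hypothetical cluster with 100 container slots:
print(fair_share(100, 1))  # 100.0 -> Job A alone gets the whole cluster
print(fair_share(100, 2))  # 50.0  -> with Job B running, each job's fair share is half
```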
