﻿<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>BlogJava - 少年阿宾 - Post category - hadoop</title><link>http://www.blogjava.net/stevenjohn/category/52924.html</link><description>Those years of youth</description><language>zh-cn</language><lastBuildDate>Sat, 10 Nov 2012 12:22:47 GMT</lastBuildDate><pubDate>Sat, 10 Nov 2012 12:22:47 GMT</pubDate><ttl>60</ttl><item><title>Hadoop interview questions you may run into: how many can you answer?</title><link>http://www.blogjava.net/stevenjohn/archive/2012/11/09/391116.html</link><dc:creator>abin</dc:creator><author>abin</author><pubDate>Fri, 09 Nov 2012 11:48:00 GMT</pubDate><guid>http://www.blogjava.net/stevenjohn/archive/2012/11/09/391116.html</guid><wfw:comment>http://www.blogjava.net/stevenjohn/comments/391116.html</wfw:comment><comments>http://www.blogjava.net/stevenjohn/archive/2012/11/09/391116.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/stevenjohn/comments/commentRss/391116.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/stevenjohn/services/trackbacks/391116.html</trackback:ping><description><![CDATA[<p>Questions you may be asked in a Hadoop interview. How many can you answer?</p>
<p>1. How does Hadoop work?</p>
<p>2. How does MapReduce work?</p>
<p>3. How does HDFS store data?</p>
<p>4. Give a simple example that shows how a MapReduce job runs.</p>
<p>5. The interviewer gives you a problem and asks you to solve it with MapReduce.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; For example: there are 10 folders, each containing 1,000,000 URLs. Find the top 1,000,000 URLs.</p>
<p>6. What is the role of the Combiner in Hadoop?</p>
<p>Src: <a href="http://p-x1984.javaeye.com/blog/859843" target="_blank">http://p-x1984.javaeye.com/blog/859843</a></p>
<p><br /></p>
<p>&nbsp;</p>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q1. Name the most common InputFormats defined in&nbsp;<strong style="color: black">Hadoop</strong>. Which one is the default?</em></strong></div>
<div>The following three are the most common InputFormats defined in&nbsp;<strong style="color: black">Hadoop</strong>:</div>
<div>- TextInputFormat (the default)</div>
<div>- KeyValueTextInputFormat</div>
<div>- SequenceFileInputFormat</div></div>
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q2. What is the difference between the TextInputFormat and KeyValueTextInputFormat classes?</em></strong></div>
<div>TextInputFormat: reads text files line by line and passes the byte offset of each line to the Mapper as the key and the line itself as the value.</div>
<div>KeyValueTextInputFormat: reads text files and parses each line into a (key, value) pair. Everything up to the first tab character is sent to the Mapper as the key and the remainder of the line is sent as the value.</div></div>
<div style="background-color: rgb(255,255,255)"><br /></div>
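<div style="background-color: rgb(255,255,255)">The difference between the two formats in Q2 can be sketched in a few lines of Python. This is only an illustration of the record shapes, not Hadoop code: the helper names <code>text_records</code> and <code>key_value_records</code> and the sample data are made up for this example.</div>

```python
def text_records(data):
    """Yield (byte offset, line) pairs, the way TextInputFormat keys lines."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data, separator="\t"):
    """Yield (key, value) pairs split at the first separator on each line."""
    for _, line in text_records(data):
        key, _sep, value = line.partition(separator)
        yield key, value


# Hypothetical two-line input file.
sample = b"alice\t3 apples\nbob\t5 pears\tripe\n"
print(list(text_records(sample)))
print(list(key_value_records(sample)))
```

<div style="background-color: rgb(255,255,255)">Only the first tab splits the line, so the second tab in the last record stays inside the value.</div>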
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q3. What is an InputSplit in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong></div>
<div>When a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job runs, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.</div></div>
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q4. How is the splitting of files invoked in the&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;framework?</em></strong></div>
<div>The&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;framework invokes it by calling the getSplits() method of the InputFormat class (such as FileInputFormat) that the user configured for the job.</div></div>
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q5. Consider the following scenario in a Map/Reduce system:</em></strong></div>
<div><strong><em>&nbsp;&nbsp; &nbsp;- HDFS block size is 64 MB</em></strong></div>
<div><strong><em>&nbsp;&nbsp; &nbsp;- Input format is FileInputFormat</em></strong></div>
<div><strong><em>&nbsp;&nbsp; &nbsp;- We have 3 files of size 64 KB, 65 MB and 127 MB</em></strong></div>
<div><strong><em>How many input splits will the&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;framework make?</em></strong></div>
<div><strong style="color: black">Hadoop</strong>&nbsp;will make 5 splits as follows:</div>
<div>- 1 split for the 64 KB file</div>
<div>- 2 splits for the 65 MB file</div>
<div>- 2 splits for the 127 MB file</div>
<div>Note: this count assumes one split per 64 MB block. FileInputFormat implementations that let the last split exceed the split size by up to 10% keep the 65 MB file in a single split, which gives 4 splits in total.</div></div>
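<div style="background-color: rgb(255,255,255)">The arithmetic behind Q5 can be checked with a small sketch. The function <code>count_splits</code> and its <code>slop</code> parameter are hypothetical names for this example. A slop of 1.0 reproduces the block-by-block count given in the answer; a slop of 1.1 models the assumption that the final split may run up to 10% over the split size, which is how the FileInputFormat versions I am aware of behave.</div>

```python
MB = 1024 * 1024


def count_splits(file_size, split_size, slop=1.0):
    """Count splits for one file; the last split may grow to slop * split_size."""
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    return splits + 1  # whatever remains forms the final split


sizes = [64 * 1024, 65 * MB, 127 * MB]          # 64 KB, 65 MB, 127 MB
strict = sum(count_splits(size, 64 * MB) for size in sizes)
with_slop = sum(count_splits(size, 64 * MB, slop=1.1) for size in sizes)
print(strict, with_slop)
```

<div style="background-color: rgb(255,255,255)">The two totals differ only for the 65 MB file, whose 1 MB tail is small enough to be absorbed into the first split when the slop is applied.</div>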
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q6. What is the purpose of the RecordReader in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong></div>
<div>The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.</div></div>
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q7. After the Map phase finishes, the&nbsp;<strong style="color: black">hadoop</strong>&nbsp;framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?</em></strong></div>
<div>- Partitioning</div>
<div>Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same</div>
<div><strong><br /></strong></div>
<div>- Shuffle</div>
<div>After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.</div>
<div><br /></div>
<div>- Sort</div>
<div>Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;before it is presented to the Reducer.</div></div>
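<div style="background-color: rgb(255,255,255)">The three steps in Q7 can be simulated in memory. The sketch below is a toy model, not the framework's implementation: <code>partition</code> uses a CRC32 checksum purely because it is deterministic, and both function names are invented for this example.</div>

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer from the key alone, so every mapper agrees."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Route each mapper's (key, value) pairs to reducers, then sort by key."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:            # one list per map task
        for key, value in pairs:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, with all values grouped.
    return [sorted(groups.items()) for groups in reducer_inputs]


# Hypothetical output of two map tasks.
map_outputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1), ("fig", 1)],
]
for reducer_id, groups in enumerate(shuffle_and_sort(map_outputs, 2)):
    print(reducer_id, groups)
```

<div style="background-color: rgb(255,255,255)">Because the partition depends only on the key, both "apple" pairs end up at the same reducer even though they came from different map tasks.</div>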
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q9. If no custom partitioner is defined in a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job, how is data partitioned before it is sent to the reducers?</em></strong></div>
<div>The default partitioner computes a hash value for the key and assigns the partition based on this result (the hash modulo the number of reduce tasks).</div></div>
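<div style="background-color: rgb(255,255,255)">A sketch of hash partitioning as described in Q9, written in Python for illustration. It assumes the key is a string hashed the way Java's String.hashCode() does it, and that the partition is that hash with the sign bit cleared, modulo the number of reducers. The function names are made up for this example.</div>

```python
def java_string_hash(text):
    """Reproduce Java's String.hashCode() as a signed 32-bit integer.

    Assumes characters in the Basic Multilingual Plane, where one Python
    character corresponds to one Java char.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reducers):
    """Clear the sign bit, then take the remainder by the reducer count."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


for key in ["hello", "hadoop", "combiner"]:
    print(key, hash_partition(key, 4))
```

<div style="background-color: rgb(255,255,255)">Clearing the sign bit matters because a negative hash would otherwise produce a negative remainder, which is not a valid partition number.</div>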
<div style="background-color: rgb(255,255,255)"><br /></div>
<div style="background-color: rgb(255,255,255)">
<div><strong><em>Q10. What is a Combiner</em></strong>&nbsp;</div>
<div>The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.</div></div>
<div style="background-color: rgb(255,255,255)">
<div style="line-height: normal; font-family: Simsun; color: rgb(0,0,0); font-size: 16px">
<div style="margin: 0px"><strong><em>Q11. Give an example scenario where a combiner can be used and one where it cannot be used</em></strong></div></div>
<div style="line-height: normal; font-family: Simsun; color: rgb(0,0,0); font-size: 16px">
<div style="margin: 0px">There are several examples; the following are the most common ones:</div></div>
<div style="line-height: normal; font-family: Simsun; color: rgb(0,0,0); font-size: 16px">
<div>
<div>
<div style="margin: 0px">- Scenario where you can use combiner</div></div>
<div>
<div style="margin: 0px">&nbsp;&nbsp;Getting the list of distinct words in a file</div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px">- Scenario where you cannot use a combiner</div></div>
<div>
<div style="margin: 0px">&nbsp;&nbsp;Calculating the mean of a list of numbers (the mean of partial means is generally not the overall mean)</div></div></div></div></div>
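<div style="background-color: rgb(255,255,255)">The two scenarios in Q11 can be checked with plain arithmetic. The sketch below uses made-up values for one key from two map tasks. It shows that summing survives a combiner step while naive averaging does not, and adds one common workaround (emitting partial sum and count pairs) that is not part of the original answer.</div>

```python
def mean(values):
    return sum(values) / len(values)


# Values for one key, as emitted by two map tasks on different nodes.
mapper_a = [1, 2, 3, 4]
mapper_b = [10]

# Sum: combining per mapper first gives the same answer as reducing everything.
assert sum([sum(mapper_a), sum(mapper_b)]) == sum(mapper_a + mapper_b)

# Mean: the mean of per-mapper means is not the overall mean.
naive = mean([mean(mapper_a), mean(mapper_b)])   # (2.5 + 10) / 2
exact = mean(mapper_a + mapper_b)                # 20 / 5
print(naive, exact)


def combine(values):
    """Combiner that emits a partial (sum, count) pair instead of a mean."""
    return (sum(values), len(values))


def reduce_mean(partials):
    """Reducer that merges (sum, count) pairs into the exact mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(reduce_mean([combine(mapper_a), combine(mapper_b)]))
```

<div style="background-color: rgb(255,255,255)">So the reducer itself cannot be reused as the combiner for a mean, but a combiner with a different intermediate format still works.</div>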
<div style="background-color: rgb(255,255,255)"><span style="line-height: 18px"></span>
<div>
<div style="margin: 0px"><strong><em>Q12.&nbsp;What is job tracker</em></strong></div></div>
<div>
<div style="margin: 0px">Job Tracker is the service within&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;that runs Map Reduce jobs on the cluster</div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px"><strong><em>Q13. What are some typical functions of Job Tracker</em></strong></div></div>
<div>
<div>
<div style="margin: 0px">The following are some typical tasks of Job Tracker</div></div>
<div>
<div style="margin: 0px">- Accepts jobs from clients</div></div>
<div>
<div style="margin: 0px">-&nbsp;It talks to the NameNode to determine the location of the data</div></div>
<div>
<div style="margin: 0px">-&nbsp;It locates TaskTracker nodes with available slots at or near the data</div></div>
<div>
<div style="margin: 0px">-&nbsp;It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker&nbsp;</div></div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px"><strong><em>Q14.&nbsp;What is task tracker</em></strong></div></div>
<div>
<div style="margin: 0px">A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.</div></div>
<div>
<div style="margin: 0px"><strong><em><br /></em></strong></div></div>
<div>
<div style="margin: 0px"><strong><em>Q15. What is the relationship between Jobs and Tasks in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong></div></div>
<div>
<div style="margin: 0px">One job is broken down into one or many tasks in&nbsp;<strong style="color: black">Hadoop</strong>.&nbsp;</div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px"><strong><em>Q16. Suppose&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;spawned 100 tasks for a job and one of the tasks failed. What will&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;do?</em></strong></div></div>
<div>
<div style="margin: 0px">It will restart the task on some other TaskTracker. Only if the same task fails 4 times (the default limit, which can be changed) will it kill the job.</div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px"><strong><em>Q17.&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;achieves parallelism by dividing the tasks across many nodes, so a few slow nodes can rate-limit the rest of the program and slow it down. What mechanism does&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;provide to combat this?</em></strong></div></div>
<div>
<div style="margin: 0px">Speculative Execution&nbsp;</div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div>
<div style="margin: 0px"><strong><em>Q18. How does speculative execution work in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong></div></div>
<div>
<div style="margin: 0px">The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively,&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.</div></div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div>
<div style="margin: 0px"><strong><em>Q19. Using the command line in Linux, how will you</em></strong></div></div>
<div>
<div style="margin: 0px"><strong><em>- see all jobs running in the&nbsp;<strong style="color: black">hadoop</strong>&nbsp;cluster</em></strong></div></div>
<div>
<div style="margin: 0px"><strong><em>- kill a job</em></strong></div></div>
<div>
<div style="margin: 0px">-&nbsp;<strong style="color: black">hadoop</strong>&nbsp;job -list</div></div>
<div>
<div style="margin: 0px">-&nbsp;<strong style="color: black">hadoop</strong>&nbsp;job -kill jobid&nbsp;</div></div></div>
<div>
<div style="margin: 0px"><br /></div></div>
<div>
<div style="margin: 0px"><strong><em>Q20. What is&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;Streaming?</em></strong></div>
<div style="margin: 0px">Streaming is a generic API that allows programs written in virtually any language to be used as&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;Mapper and Reducer implementations.</div>
<div><br /></div></div>
<div>
<div style="margin: 0px"><br />
<div style="font-style: normal; margin: 0px; font-weight: normal"><strong><em>Q21. What characteristic of the streaming API makes it flexible enough to run Map/Reduce jobs in languages like Perl, Ruby, awk etc.?</em></strong></div>
<div style="margin: 0px"><strong style="color: black">Hadoop</strong>&nbsp;Streaming lets you use arbitrary programs for the Mapper and Reducer phases of a Map/Reduce job, because both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.</div>
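<div style="margin: 0px">A minimal word-count sketch in the streaming style, written in Python. It assumes the usual convention of one tab-separated key and value per line, and that the reducer receives its lines already sorted by key. The function names and the local dry run are illustrative only; in a real streaming job the mapper and reducer would be separate scripts reading sys.stdin and writing sys.stdout.</div>

```python
import io
from itertools import groupby


def mapper(lines, out):
    """Emit one tab-separated (word, 1) line per word."""
    for line in lines:
        for word in line.split():
            out.write(word + "\t1\n")


def reducer(lines, out):
    """Sum counts per word; input lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        out.write(word + "\t" + str(sum(int(n) for _, n in group)) + "\n")


# Local dry run: sorted() stands in for the framework's shuffle and sort.
mapped = io.StringIO()
mapper(["to be or not", "to be"], mapped)
reduced = io.StringIO()
reducer(sorted(mapped.getvalue().splitlines(keepends=True)), reduced)
print(reduced.getvalue(), end="")
```

<div style="margin: 0px">Because the only contract is lines in and lines out, the same two roles could be filled by a Perl, Ruby or awk script.</div>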
<div style="margin: 0px"><span style="line-height: normal; font-family: Simsun; color: #000000; font-size: 16px"><strong><em>Q22. What is the Distributed Cache in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong></span><span style="line-height: normal; font-family: Simsun; color: #000000; font-size: 16px"><br /></span><span style="line-height: normal; font-family: Simsun; color: #000000; font-size: 16px">The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework copies the necessary files to the slave node before any tasks for the job are executed on that node.</span></div>
<div style="margin: 0px"><span style="line-height: 18px"><strong><em>Q23. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?</em></strong><br />Because the Distributed Cache is much faster. It copies the file to all TaskTrackers at the start of the job. If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy. If, on the other hand, the MR job contains code that reads the file from HDFS, every mapper accesses HDFS itself, so a TaskTracker that runs 100 map tasks reads the file from HDFS 100 times. HDFS is also not very efficient when used like this.<br /><br /><strong><em>Q24. What mechanism does the&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;framework provide to synchronize changes made in the Distributed Cache during the runtime of the application?</em></strong><br />This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution.<br /><br /><strong><em>Q25. Have you ever used Counters in&nbsp;<strong style="color: black">Hadoop</strong>? Give us an example scenario.</em></strong><br />Anybody who claims to have worked on a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;project is expected to have used counters.<br /><br /><strong><em>Q26. Is it possible to provide multiple inputs to&nbsp;<strong style="color: black">Hadoop</strong>? If yes, how can you give multiple directories as input to a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job?</em></strong><br />Yes. The input format class provides methods to add multiple directories as input to a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job (for example, FileInputFormat.addInputPath).<br /><br /><strong><em>Q27. Is it possible to have&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job output in multiple directories? If yes, how?</em></strong><br />Yes, by using the MultipleOutputs class.<br /><br /><strong><em>Q28. What will a&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job do if you try to run it with an output directory that is already present? Will it<br />- overwrite it<br />- warn you and continue<br />- throw an exception and exit</em></strong><br />The&nbsp;<strong style="color: black">Hadoop</strong>&nbsp;job will throw an exception and exit.<br /><br /><strong><em>Q29. How can you set an arbitrary number of mappers to be created for a job in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong><br />This is a trick question. You cannot set it directly: the number of map tasks follows from the number of input splits.<br /><br /><strong><em>Q30. How can you set an arbitrary number of reducers to be created for a job in&nbsp;<strong style="color: black">Hadoop</strong>?</em></strong><br />You can either do it programmatically with the setNumReduceTasks method of the JobConf class or set it as a configuration setting.</span></div></div></div></div>
<p>&nbsp;</p>
<p>Src: <a href="http://xsh8637.blog.163.com/blog/#m=0&amp;t=1&amp;c=fks_084065087084081065083083087095086082081074093080080069" target="_blank">http://xsh8637.blog.163.com/blog/#m=0&amp;t=1&amp;c=fks_084065087084081065083083087095086082081074093080080069</a></p><img src ="http://www.blogjava.net/stevenjohn/aggbug/391116.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/stevenjohn/" target="_blank">abin</a> 2012-11-09 19:48 <a href="http://www.blogjava.net/stevenjohn/archive/2012/11/09/391116.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>Official Hadoop documentation</title><link>http://www.blogjava.net/stevenjohn/archive/2012/10/29/390425.html</link><dc:creator>abin</dc:creator><author>abin</author><pubDate>Mon, 29 Oct 2012 14:42:00 GMT</pubDate><guid>http://www.blogjava.net/stevenjohn/archive/2012/10/29/390425.html</guid><wfw:comment>http://www.blogjava.net/stevenjohn/comments/390425.html</wfw:comment><comments>http://www.blogjava.net/stevenjohn/archive/2012/10/29/390425.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.blogjava.net/stevenjohn/comments/commentRss/390425.html</wfw:commentRss><trackback:ping>http://www.blogjava.net/stevenjohn/services/trackbacks/390425.html</trackback:ping><description><![CDATA[<div>Chinese documentation:<br /><a href="http://hadoop.apache.org/docs/r0.20.2/cn/quickstart.html">http://hadoop.apache.org/docs/r0.20.2/cn/quickstart.html</a><br />English documentation:<br /><a href="http://hadoop.apache.org/docs/r0.20.2/quickstart.html">http://hadoop.apache.org/docs/r0.20.2/quickstart.html</a></div><img src ="http://www.blogjava.net/stevenjohn/aggbug/390425.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.blogjava.net/stevenjohn/" target="_blank">abin</a> 2012-10-29 22:42 <a href="http://www.blogjava.net/stevenjohn/archive/2012/10/29/390425.html#Feedback" 
target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item></channel></rss>