Friday, 9 September 2016

How to create a WordCount MapReduce with Maven and Eclipse


After configuring the Hadoop plugin for Eclipse, which we covered in the previous post (Hadoop-eclipse-plugin installation), we will now write our first WordCount MapReduce program using Eclipse and Maven.

Before we jump into the program, let's understand how the job flow works under YARN when a client submits a MapReduce program.

In Hadoop 1.x, there are two major components, which work in a master-slave fashion.

  • Job Tracker: allocates the resources required to run a MapReduce job and schedules its tasks.
  • Task Tracker: runs the individual tasks assigned to it by the Job Tracker.


Since the Job Tracker was responsible for both resource management (assigning resources to each job) and job scheduling (assigning tasks to Task Trackers and monitoring task progress) on a single node, it became a scalability bottleneck in large clusters with more than 4,000 nodes. YARN was introduced to overcome this issue.

YARN (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed to execute applications.
  • The fundamental idea of YARN is to split the two major responsibilities of the Job Tracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global Resource Manager and a per-application Application Master.
  • The Task Tracker is replaced in YARN by the Node Manager, a per-machine framework agent that is responsible for containers, monitors their resource usage (CPU, memory, disk, network) and reports it to the Resource Manager.

Job Flow : 

  • The client submits a MapReduce job by interacting with Job objects (the client runs in its own JVM).
  • The client's job code interacts with the Resource Manager to acquire application metadata such as the application ID, moves all job-related resources to HDFS to make them available for the rest of the job, and then submits the application to the Resource Manager.
  • The Resource Manager chooses a Node Manager with available resources and requests a container for the Application Master.
  • The Node Manager allocates a container for the Application Master, and the Application Master (MRAppMaster) executes and coordinates the MapReduce job.
Role of an Application Master:
  • Both map tasks and reduce tasks are created by the Application Master.
  • If the submitted job is small, the Application Master runs the job in the same JVM in which the Application Master itself is running. This avoids the overhead of allocating new containers and running the tasks in parallel. Such small jobs are called Uber tasks.
  • Whether a job qualifies as an Uber task is decided by three configuration parameters: by default the job must have fewer than 10 mappers, at most one reducer, and an input size no larger than one HDFS block. These limits can be configured via the mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes properties in mapred-site.xml. Uber mode itself is switched on with mapreduce.job.ubertask.enable.
  • If the job doesn't qualify as an Uber task, the Application Master requests containers for all map tasks and reduce tasks.
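
The Uber task decision described above can be sketched in plain Java. This is only an illustration: the class and method names are made up for this post, and the default limits (9 maps, 1 reduce) are what I understand mapred-default.xml to ship with in Hadoop 2.x, so check them against your own version.

```java
public class UberCheck {
    // Assumed defaults from mapred-default.xml (Hadoop 2.x); verify for your version.
    static final int MAX_MAPS = 9;     // mapreduce.job.ubertask.maxmaps
    static final int MAX_REDUCES = 1;  // mapreduce.job.ubertask.maxreduces

    /**
     * Returns true when a job is small enough to run inside the
     * Application Master's own JVM. maxBytes stands in for
     * mapreduce.job.ubertask.maxbytes (one HDFS block by default).
     */
    static boolean isUberCandidate(boolean enabled, int maps, int reduces,
                                   long inputBytes, long maxBytes) {
        return enabled
                && maps <= MAX_MAPS
                && reduces <= MAX_REDUCES
                && inputBytes <= maxBytes;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed HDFS block size

        System.out.println(isUberCandidate(true, 1, 1, 433L, blockSize));   // true
        System.out.println(isUberCandidate(true, 12, 1, 433L, blockSize));  // false: too many maps
        System.out.println(isUberCandidate(false, 1, 1, 433L, blockSize));  // false: uber mode disabled
    }
}
```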

Job Start up Phase: 
  • Execution starts with the call to job.waitForCompletion() in the main driver class; this call opens communication with the Resource Manager.
  • The client retrieves a new Job ID (Application ID) from the Resource Manager.
  • The client copies the job resources specified via the -files, -archives, and -libjars command-line arguments, as well as the job JAR file, to HDFS.
  • Finally, the job is submitted by calling the submitApplication() method on the Resource Manager.
  • The Resource Manager triggers its Scheduler sub-component, which allocates containers for the MapReduce job execution. The Resource Manager then starts the Application Master in the container provided by the Scheduler. From here onwards, this container is managed by the Node Manager.
Input Split Phase:
  • In this phase, the input files are divided into logical splits. By default a split corresponds to one HDFS block; the split size can be tuned with the minimum split size property (mapreduce.input.fileinputformat.split.minsize).
  • Each split is passed to its own map task if the file is splittable. If the file is not splittable, the entire file is provided as input to a single map task.
  • These map tasks are created by the MapReduce Application Master (MRAppMaster Java class); the reduce tasks are also created by the Application Master, based on the mapreduce.job.reduces property.
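
To make the split arithmetic concrete, here is a small plain-Java sketch. It mirrors, as far as I understand it, the rule used by the new-API FileInputFormat (split size = max(minSize, min(maxSize, blockSize)), with the last split allowed to be up to 10% larger). The class name, method names and the 128 MB block size are assumptions for illustration; the old mapred API used in this post computes the size slightly differently.

```java
public class SplitEstimate {
    // The final split may be up to 10% larger than splitSize before another split is cut.
    static final double SPLIT_SLOP = 1.1;

    /** Split size bounded below by minSize and above by maxSize. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one splittable, non-empty file of the given length. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);

        System.out.println(countSplits(433L, splitSize));     // 1
        System.out.println(countSplits(300 * mb, splitSize)); // 3
        System.out.println(countSplits(140 * mb, splitSize)); // 1 (within the 10% slop)
    }
}
```

The 433-byte case corresponds to the sample file used later in this post, which is processed as a single split.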

Task Execution Phase:
  • Once containers are assigned to the tasks, the Application Master starts them by notifying the corresponding Node Managers.
  • The Node Manager copies the job resources (such as the job JAR file) from HDFS and the distributed cache, and then runs the map or reduce tasks.
  • The Application Master collects progress and status information from all tasks, and the aggregated values are propagated to the client or user.
Job Completion:
  • The client checks with the Application Master for the job completion status at regular intervals, by default every 5 seconds, when the job is submitted by calling the runJob() method. This interval can be configured via the mapreduce.client.completion.pollinterval property.
  • Once the job is completed, the Application Master and the task containers clean up their working state. The job's OutputCommitter calls its cleanup method to handle any remaining cleanup activities.
  • The job is archived by the Job History Server for future reference.


That's all for the theory part. Let us now write a sample MapReduce program to count the number of words in a given file.

Tools used 

  • Maven 3.3.9
  • Eclipse Luna
  • JDK 1.8
  • Hadoop 2.7.1

Configure Maven

Download Maven from the Apache Maven site and extract it to C:\maven\apache-maven-3.3.9.
Add MAVEN_HOME as a user variable and add %MAVEN_HOME%\bin to the Path variable.

Open a command prompt as administrator and run the command below to verify the Maven installation.

C:\Windows\System32>mvn -version
You will see output like the following if Maven is installed successfully:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: C:\maven\apache-maven-3.3.9\bin\..
Java version: 1.8.0_101, vendor: Oracle Corporation
Java home: C:\java\jdk1.8.0_101\jre
Default locale: en_SG, platform encoding: Cp1252
OS name: "windows 10", version: "10.0", arch: "amd64", family: "dos"
Start Eclipse and go to "Window -> Open Perspective -> Other". In the perspectives window, select “Java” and click "OK".
Right click in the Package Explorer and select New -> Other -> Maven Project.

Click Next, check "Use default Workspace location", and click Next again.

Select maven-archetype-quickstart 1.1 and click Next

Enter the Group Id, Artifact Id and Package name as shown in the screen below and click Finish.

Switch to the “Map/Reduce” perspective by clicking the icon at the top right-hand corner of the main Eclipse panel, as highlighted below.

After switching perspective, you will see the project below in the Project Explorer, along with DFS Location.

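Add the content below to pom.xml and save.
pom.xml
<?xml version="1.0"?>
<project
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"
 xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <modelVersion>4.0.0</modelVersion>
 <!-- groupId and version are not shown in the original listing; the values below are assumptions, so use whatever you entered in the New Maven Project wizard -->
 <groupId>com.hdp</groupId>
 <artifactId>WordCountMR</artifactId>
 <version>0.0.1-SNAPSHOT</version>
 <name>WordCountMR</name>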
 <url>http://maven.apache.org</url>
 <properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
 </properties>
 <dependencies>
  <dependency>
   <groupId>junit</groupId>
   <artifactId>junit</artifactId>
   <version>3.8.1</version>
   <scope>test</scope>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-annotations</artifactId>
   <version>2.7.1</version>
  </dependency>
  <!-- Hadoop Mapreduce Client Core -->
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-core</artifactId>
   <version>2.7.1</version>
  </dependency>

  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-common</artifactId>
   <version>2.7.1</version>
  </dependency>

  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs-nfs</artifactId>
   <version>2.7.1</version>
  </dependency>

  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>
   <version>2.7.1</version>
  </dependency>

  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-common</artifactId>
   <version>2.7.1</version>
  </dependency>


  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-common</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-app</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-hs</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce-client-hs-plugins</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-api</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-web-proxy</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-sharedcachemanager</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-nodemanager</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-common</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-server-applicationhistoryservice</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-registry</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-common</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-client</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-applications-unmanaged-am-launcher</artifactId>
   <version>2.7.1</version>
  </dependency>
  <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-yarn-applications-distributedshell</artifactId>
   <version>2.7.1</version>
  </dependency>
 </dependencies>
</project>

Java Classes
Create the WordCountMapper.java class and add the content below.
WordCountMapper.java
package com.hdp.madreduce.example;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

 private final static IntWritable one = new IntWritable(1);
 private Text word = new Text();

 public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> collector, Reporter reporter)
   throws IOException {
  String line = value.toString();
  StringTokenizer st = new StringTokenizer(line, " ");
  while (st.hasMoreTokens()) {
   word.set(st.nextToken());
   collector.collect(word, one);
  }

 }
}
Create the WordCountReducer.java class and add the content below.

WordCountReducer.java
package com.hdp.madreduce.example;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

 public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> outputCollector,
   Reporter reporter) throws IOException {
  int sum = 0;

  while (values.hasNext()) {
   sum = sum + values.next().get();
  }
  outputCollector.collect(key, new IntWritable(sum));

 }
}
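
To check what the mapper and reducer above should produce without a cluster, the same logic can be sketched in plain Java. This is only a local approximation: the class name and the sample sentences are made up, and a TreeMap stands in for the sorted grouping that the shuffle phase performs between map and reduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    /**
     * Mimics the map step (tokenize on spaces, emit (word, 1)) followed by
     * the reduce step (sum the ones per word).
     */
    static Map<String, Integer> count(Iterable<String> lines) {
        // TreeMap keeps keys sorted, similar to the order in which reducers see keys.
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer st = new StringTokenizer(line, " ");
            while (st.hasMoreTokens()) {
                totals.merge(st.nextToken(), 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, not the actual WordCountSample.txt.
        List<String> sample = Arrays.asList("Hadoop is open source", "Hadoop is distributed");
        System.out.println(count(sample));
        // prints {Hadoop=2, distributed=1, is=2, open=1, source=1}
    }
}
```

Note that capitalized words sort before lowercase ones, which is the same ordering you will see in the result file later in this post.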

Create the WordCount.java class and add the content below.
WordCount.java
package com.hdp.madreduce.example;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();

  Path inputPath = new Path("hdfs://127.0.0.1:9000/input/WordCountSample.txt");
  Path outputPath = new Path("hdfs://127.0.0.1:9000/output/result");

  JobConf job = new JobConf(conf, WordCount.class);
  job.setJarByClass(WordCount.class);
  job.setJobName("WordCounterJob");

  FileInputFormat.setInputPaths(job, inputPath);
  FileOutputFormat.setOutputPath(job, outputPath);

  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  job.setOutputFormat(TextOutputFormat.class);
  job.setMapperClass(WordCountMapper.class);
  job.setReducerClass(WordCountReducer.class);

  FileSystem hdfs = FileSystem.get(URI.create("hdfs://127.0.0.1:9000"), conf);
  if (hdfs.exists(outputPath))
   hdfs.delete(outputPath, true);

  RunningJob runningJob = JobClient.runJob(job);
  // isComplete() only says the job has finished; isSuccessful() reports whether it succeeded.
  System.out.println("job.isSuccessful " + runningJob.isSuccessful());
 }

}
After creating all the classes, your Project Explorer should look like the one below.



Right click on WordCount.java ->Run As -> Run on Hadoop



If the program runs successfully, you should see the content below in the Eclipse console.

To get a detailed log, add the hadoop-common-2.7.1-test.sources.jar file from C:\hadoop-2.7.1\share\hadoop\common\sources to the project.

Rerun the WordCount.java class to see the log below.

Eclipse console
2016-09-10 09:31:28,117 INFO  Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1173)) - session.id is deprecated. Instead, use dfs.metrics.session-id
2016-09-10 09:31:28,124 INFO  jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=
2016-09-10 09:31:28,442 WARN  mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2016-09-10 09:31:28,494 WARN  mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(171)) - No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
2016-09-10 09:31:28,513 INFO  input.FileInputFormat (FileInputFormat.java:listStatus(283)) - Total input paths to process : 1
2016-09-10 09:31:28,660 INFO  mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(198)) - number of splits:1
2016-09-10 09:31:28,907 INFO  mapreduce.JobSubmitter (JobSubmitter.java:printTokens(287)) - Submitting tokens for job: job_local375951257_0001
2016-09-10 09:31:29,296 INFO  mapreduce.Job (Job.java:submit(1294)) - The url to track the job: http://localhost:8080/
2016-09-10 09:31:29,297 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1339)) - Running job: job_local375951257_0001
2016-09-10 09:31:29,302 INFO  mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null
2016-09-10 09:31:29,311 INFO  output.FileOutputCommitter (FileOutputCommitter.java:<init>(100)) - File Output Committer Algorithm version is 1
2016-09-10 09:31:29,315 INFO  mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
2016-09-10 09:31:29,415 INFO  mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for map tasks
2016-09-10 09:31:29,416 INFO  mapred.LocalJobRunner (LocalJobRunner.java:run(224)) - Starting task: attempt_local375951257_0001_m_000000_0
2016-09-10 09:31:29,466 INFO  output.FileOutputCommitter (FileOutputCommitter.java:<init>(100)) - File Output Committer Algorithm version is 1
2016-09-10 09:31:29,478 INFO  util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(192)) - ProcfsBasedProcessTree currently is supported only on Linux.
2016-09-10 09:31:29,594 INFO  mapred.Task (Task.java:initialize(612)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@46ae4d2d
2016-09-10 09:31:29,605 INFO  mapred.MapTask (MapTask.java:runNewMapper(756)) - Processing split: hdfs://127.0.0.1:9000/input/WordCountSample.txt:0+433
2016-09-10 09:31:29,672 INFO  mapred.MapTask (MapTask.java:setEquator(1205)) - (EQUATOR) 0 kvi 26214396(104857584)
2016-09-10 09:31:29,672 INFO  mapred.MapTask (MapTask.java:init(998)) - mapreduce.task.io.sort.mb: 100
2016-09-10 09:31:29,672 INFO  mapred.MapTask (MapTask.java:init(999)) - soft limit at 83886080
2016-09-10 09:31:29,672 INFO  mapred.MapTask (MapTask.java:init(1000)) - bufstart = 0; bufvoid = 104857600
2016-09-10 09:31:29,672 INFO  mapred.MapTask (MapTask.java:init(1001)) - kvstart = 26214396; length = 6553600
2016-09-10 09:31:29,684 INFO  mapred.MapTask (MapTask.java:createSortingCollector(403)) - Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
2016-09-10 09:31:29,816 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 
2016-09-10 09:31:29,823 INFO  mapred.MapTask (MapTask.java:flush(1460)) - Starting flush of map output
2016-09-10 09:31:29,823 INFO  mapred.MapTask (MapTask.java:flush(1482)) - Spilling map output
2016-09-10 09:31:29,823 INFO  mapred.MapTask (MapTask.java:flush(1483)) - bufstart = 0; bufend = 676; bufvoid = 104857600
2016-09-10 09:31:29,823 INFO  mapred.MapTask (MapTask.java:flush(1485)) - kvstart = 26214396(104857584); kvend = 26214152(104856608); length = 245/6553600
2016-09-10 09:31:29,860 INFO  mapred.MapTask (MapTask.java:sortAndSpill(1667)) - Finished spill 0
2016-09-10 09:31:29,875 INFO  mapred.Task (Task.java:done(1038)) - Task:attempt_local375951257_0001_m_000000_0 is done. And is in the process of committing
2016-09-10 09:31:29,896 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - map
2016-09-10 09:31:29,896 INFO  mapred.Task (Task.java:sendDone(1158)) - Task 'attempt_local375951257_0001_m_000000_0' done.
2016-09-10 09:31:29,896 INFO  mapred.LocalJobRunner (LocalJobRunner.java:run(249)) - Finishing task: attempt_local375951257_0001_m_000000_0
2016-09-10 09:31:29,897 INFO  mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - map task executor complete.
2016-09-10 09:31:29,902 INFO  mapred.LocalJobRunner (LocalJobRunner.java:runTasks(448)) - Waiting for reduce tasks
2016-09-10 09:31:29,903 INFO  mapred.LocalJobRunner (LocalJobRunner.java:run(302)) - Starting task: attempt_local375951257_0001_r_000000_0
2016-09-10 09:31:29,912 INFO  output.FileOutputCommitter (FileOutputCommitter.java:<init>(100)) - File Output Committer Algorithm version is 1
2016-09-10 09:31:29,914 INFO  util.ProcfsBasedProcessTree (ProcfsBasedProcessTree.java:isAvailable(192)) - ProcfsBasedProcessTree currently is supported only on Linux.
2016-09-10 09:31:30,006 INFO  mapred.Task (Task.java:initialize(612)) -  Using ResourceCalculatorProcessTree : org.apache.hadoop.yarn.util.WindowsBasedProcessTree@7ddb8f1c
2016-09-10 09:31:30,010 INFO  mapred.ReduceTask (ReduceTask.java:run(362)) - Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@46f15a4f
2016-09-10 09:31:30,028 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:<init>(196)) - MergerManager: memoryLimit=1314232704, maxSingleShuffleLimit=328558176, mergeThreshold=867393600, ioSortFactor=10, memToMemMergeOutputsThreshold=10
2016-09-10 09:31:30,031 INFO  reduce.EventFetcher (EventFetcher.java:run(61)) - attempt_local375951257_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
2016-09-10 09:31:30,126 INFO  reduce.LocalFetcher (LocalFetcher.java:copyMapOutput(144)) - localfetcher#1 about to shuffle output of map attempt_local375951257_0001_m_000000_0 decomp: 802 len: 806 to MEMORY
2016-09-10 09:31:30,136 INFO  reduce.InMemoryMapOutput (InMemoryMapOutput.java:shuffle(100)) - Read 802 bytes from map-output for attempt_local375951257_0001_m_000000_0
2016-09-10 09:31:30,139 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:closeInMemoryFile(314)) - closeInMemoryFile -> map-output of size: 802, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->802
2016-09-10 09:31:30,141 INFO  reduce.EventFetcher (EventFetcher.java:run(76)) - EventFetcher is interrupted.. Returning
2016-09-10 09:31:30,142 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2016-09-10 09:31:30,143 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(674)) - finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
2016-09-10 09:31:30,163 INFO  mapred.Merger (Merger.java:merge(606)) - Merging 1 sorted segments
2016-09-10 09:31:30,163 INFO  mapred.Merger (Merger.java:merge(705)) - Down to the last merge-pass, with 1 segments left of total size: 798 bytes
2016-09-10 09:31:30,168 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(751)) - Merged 1 segments, 802 bytes to disk to satisfy reduce memory limit
2016-09-10 09:31:30,170 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(781)) - Merging 1 files, 806 bytes from disk
2016-09-10 09:31:30,171 INFO  reduce.MergeManagerImpl (MergeManagerImpl.java:finalMerge(796)) - Merging 0 segments, 0 bytes from memory into reduce
2016-09-10 09:31:30,171 INFO  mapred.Merger (Merger.java:merge(606)) - Merging 1 sorted segments
2016-09-10 09:31:30,173 INFO  mapred.Merger (Merger.java:merge(705)) - Down to the last merge-pass, with 1 segments left of total size: 798 bytes
2016-09-10 09:31:30,174 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2016-09-10 09:31:30,212 INFO  Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1173)) - mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2016-09-10 09:31:30,303 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1360)) - Job job_local375951257_0001 running in uber mode : false
2016-09-10 09:31:30,305 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1367)) -  map 100% reduce 0%
2016-09-10 09:31:30,415 INFO  mapred.Task (Task.java:done(1038)) - Task:attempt_local375951257_0001_r_000000_0 is done. And is in the process of committing
2016-09-10 09:31:30,419 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - 1 / 1 copied.
2016-09-10 09:31:30,420 INFO  mapred.Task (Task.java:commit(1199)) - Task attempt_local375951257_0001_r_000000_0 is allowed to commit now
2016-09-10 09:31:30,434 INFO  output.FileOutputCommitter (FileOutputCommitter.java:commitTask(482)) - Saved output of task 'attempt_local375951257_0001_r_000000_0' to hdfs://127.0.0.1:9000/output/result/_temporary/0/task_local375951257_0001_r_000000
2016-09-10 09:31:30,436 INFO  mapred.LocalJobRunner (LocalJobRunner.java:statusUpdate(591)) - reduce > reduce
2016-09-10 09:31:30,436 INFO  mapred.Task (Task.java:sendDone(1158)) - Task 'attempt_local375951257_0001_r_000000_0' done.
2016-09-10 09:31:30,436 INFO  mapred.LocalJobRunner (LocalJobRunner.java:run(325)) - Finishing task: attempt_local375951257_0001_r_000000_0
2016-09-10 09:31:30,436 INFO  mapred.LocalJobRunner (LocalJobRunner.java:runTasks(456)) - reduce task executor complete.
2016-09-10 09:31:31,306 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1367)) -  map 100% reduce 100%
2016-09-10 09:31:31,307 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1378)) - Job job_local375951257_0001 completed successfully
2016-09-10 09:31:31,331 INFO  mapreduce.Job (Job.java:monitorAndPrintJob(1385)) - Counters: 35
 File System Counters
  FILE: Number of bytes read=1974
  FILE: Number of bytes written=632424
  FILE: Number of read operations=0
  FILE: Number of large read operations=0
  FILE: Number of write operations=0
  HDFS: Number of bytes read=866
  HDFS: Number of bytes written=416
  HDFS: Number of read operations=15
  HDFS: Number of large read operations=0
  HDFS: Number of write operations=6
 Map-Reduce Framework
  Map input records=6
  Map output records=62
  Map output bytes=676
  Map output materialized bytes=806
  Input split bytes=112
  Combine input records=0
  Combine output records=0
  Reduce input groups=45
  Reduce shuffle bytes=806
  Reduce input records=62
  Reduce output records=45
  Spilled Records=124
  Shuffled Maps =1
  Failed Shuffles=0
  Merged Map outputs=1
  GC time elapsed (ms)=0
  Total committed heap usage (bytes)=481296384
 Shuffle Errors
  BAD_ID=0
  CONNECTION=0
  IO_ERROR=0
  WRONG_LENGTH=0
  WRONG_MAP=0
  WRONG_REDUCE=0
 File Input Format Counters 
  Bytes Read=433
 File Output Format Counters 
  Bytes Written=416
job.isSuccessful true
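
The counters in the log above can be cross-checked with a few lines of plain Java. This is a rough sketch: the class and method names are invented for this post, and the two checks only hold for a job like this one, with no combiner and a reducer that writes exactly one record per key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

public class CounterCheck {

    /** Parses "name=value" counter lines; lines without '=' (group headings) are skipped. */
    static Map<String, Long> parse(String... lines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : lines) {
            int eq = line.lastIndexOf('=');
            if (eq < 0) {
                continue;
            }
            counters.put(line.substring(0, eq).trim(),
                    Long.parseLong(line.substring(eq + 1).trim()));
        }
        return counters;
    }

    /** True when both counters exist and hold the same value. */
    static boolean same(Map<String, Long> c, String a, String b) {
        return c.get(a) != null && Objects.equals(c.get(a), c.get(b));
    }

    /** Every map output record reaches a reducer, and each key yields one output record. */
    static boolean consistent(Map<String, Long> c) {
        return same(c, "Map output records", "Reduce input records")
                && same(c, "Reduce input groups", "Reduce output records");
    }

    public static void main(String[] args) {
        // Values copied from the console log above.
        Map<String, Long> counters = parse(
                " Map-Reduce Framework",
                "  Map output records=62",
                "  Reduce input groups=45",
                "  Reduce input records=62",
                "  Reduce output records=45");
        System.out.println(consistent(counters)); // true
    }
}
```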
Right click on DFS Location, select Disconnect and then Refresh; you should see the generated output file.

Double click on the part-r-00000 file under output -> result; you should see the output below.

Result file content:
A 1
Apache 1
Hadoop 3
across 2
allows 1
an 2
and 2
application 1
clusters 2
computation 2
computers 2
datasets 1
designed 1
distributed 2
each 1
environment 1
frame-worked 1
framework 1
from 1
in 2
is 2
java 1
large 1
local 1
machines 1
models 1
of 4
offering 1
open 1
processing 1
programming 1
provides 1
scale 1
server 1
simple 1
single 1
source 1
storage 2
that 2
thousands 1
to 2
up 1
using 1
works 1
written 1
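
As a quick sanity check, the result file can be summarized with a small plain-Java helper. This is an illustrative sketch (the class and method names are hypothetical); it assumes each line holds a word and a count separated by whitespace, and the demo below feeds it only the first five lines of the listing above.

```java
import java.util.Arrays;
import java.util.List;

public class ResultSummary {

    /** Returns {distinctWords, totalOccurrences} for lines of the form "word count". */
    static long[] summarize(List<String> lines) {
        long distinct = 0;
        long total = 0;
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip blank or malformed lines
            }
            distinct++;
            total += Long.parseLong(parts[1]);
        }
        return new long[] {distinct, total};
    }

    public static void main(String[] args) {
        List<String> firstLines = Arrays.asList(
                "A 1", "Apache 1", "Hadoop 3", "across 2", "allows 1");
        long[] summary = summarize(firstLines);
        System.out.println(summary[0] + " distinct words, " + summary[1] + " occurrences");
        // prints "5 distinct words, 8 occurrences"
    }
}
```

Run over the complete listing, the same helper should report 45 distinct words and 62 occurrences, which lines up with the "Reduce output records=45" and "Map output records=62" counters in the console log.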

WordCount Program in Debug mode:

Place a breakpoint in the WordCountMapper class, then right click on WordCount and choose Debug As -> Java Application; you should be able to step through your program line by line at run time.

