Saturday, 17 September 2016

Installing Apache HBase on Windows using Cygwin64


Having installed Hadoop on Windows using Cygwin in our previous blog (Installing Apache Hadoop on Windows 10 using Cygwin64), we now install HBase on Windows using Cygwin.



Tools Used :


  • Apache HBase 1.2.3
  • Cygwin64
  • Java 1.8
  • Hadoop 2.7.1



Download the hbase-1.2.3-bin.tar.gz binary from here and place it under c:/cygwin/root/usr/local.

Start a Cygwin terminal as administrator and issue the commands below to extract the hbase-1.2.3-bin.tar.gz content.

$ cd /usr/local
$ tar xvf hbase-1.2.3-bin.tar.gz
Create a logs folder, i.e. C:\cygwin\root\usr\local\hbase-1.2.3\logs.


HBase reads its default configuration from ./conf/hbase-default.xml; overrides go into ./conf/hbase-site.xml. Some properties do not resolve to existing directories because the JVM runs on Windows. This is the major issue to keep in mind when working with Cygwin: within the shell all paths are *nix-alike, hence relative to the root /, but every parameter that is consumed by the Windows processes themselves needs to be a Windows setting, hence C:\-alike. Change the following properties in the configuration file, adjusting paths where necessary to conform with your own installation:
  • hbase.rootdir must read e.g. file:///C:/cygwin/root/tmp/hbase/data, or hdfs://127.0.0.1:9000/hbase when using the Hadoop file system (HDFS).
  • hbase.tmp.dir must read C:/cygwin/root/tmp/hbase/tmp
  • hbase.zookeeper.quorum must read 127.0.0.1 because for some reason localhost doesn't seem to resolve properly on Cygwin.
Make sure the configured hbase.rootdir and hbase.tmp.dir directories exist and have the proper rights set up, e.g. by issuing a chmod 777 on them.
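For example, from a Cygwin terminal (a quick sketch assuming Cygwin's root is C:\cygwin\root as in this guide, so /tmp maps to C:/cygwin/root/tmp):

$ mkdir -p /tmp/hbase/data /tmp/hbase/tmp
$ chmod 777 /tmp/hbase/data /tmp/hbase/tmp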

Go to c:/cygwin/root/usr/local/hbase-1.2.3/conf and add the following to the hbase-site.xml file.

<configuration>
<property>
 <name>hbase.rootdir</name> 
 <!--<value>file:///C:/cygwin/root/tmp/hbase/data</value> -->
 <value>hdfs://127.0.0.1:9000/hbase</value>
</property>
<property>
 <name>hbase.zookeeper.quorum</name> 
 <value>127.0.0.1</value> 
</property>
<property>
 <name>hbase.tmp.dir</name> 
 <value>C:/cygwin/root/tmp/hbase/tmp</value>
</property>
</configuration>

Add the following to the hbase-env.sh file.

export JAVA_HOME=/usr/local/java/
export HBASE_CLASSPATH=/cygwin/root/usr/local/hbase-1.2.3/lib/
export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
export HBASE_IDENT_STRING=$HOSTNAME

Start a Cygwin terminal, if you haven't already.
Please make sure Hadoop is started before issuing the HBase start command. Type jps to
check whether the Hadoop daemon processes are running.
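If Hadoop is up, jps should list the daemons started in the previous blog, roughly like this (the process ids are only illustrative):

$ jps
4632 NameNode
5120 DataNode
6004 ResourceManager
6388 NodeManager
7212 Jps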










Create the hbase directory in HDFS.
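If you prefer the command line over Eclipse, the directory can be created with the HDFS shell, assuming the NameNode from the previous article is listening on hdfs://127.0.0.1:9000 (a sketch):

$ cd /usr/local/hadoop-2.7.1/bin
$ hdfs dfs -mkdir /hbase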





Refer to the Hadoop-eclipse-plugin installation blog to create the folder in HDFS using Eclipse, if you haven't already.









Change directory to the HBase installation using cd /usr/local/hbase-1.2.3.
Start HBase using the command sh start-hbase.sh.
When prompted to accept the SSH fingerprint, answer yes.
When prompted, provide your password, possibly multiple times.
When the command completes, the HBase server should have started.
However, to be absolutely certain, check the logs in the ./logs directory for any exceptions.
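Put together, a minimal sketch of the start sequence looks as follows (in a standard HBase 1.2.3 layout the start script lives under bin/):

$ cd /usr/local/hbase-1.2.3
$ sh bin/start-hbase.sh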













ZooKeeper startup logs
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:java.library.path=C:\java\jdk1.8.0_101\bin;C:\WINDOWS\Sun\Java\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\cygwin\root\bin;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0;%JAVA_HOME%\bin;%CYGWIN_HOME%\bin;%HADOOP_BIN_PATH%;%HADOOP_HOME%\bin;%MAVEN_HOME%\bin;.
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:java.io.tmpdir=C:\Users\Naveen\
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:java.compiler=<NA>
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:os.name=Windows 10
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:os.arch=amd64
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:os.version=10.0
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:user.name=Naveen
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:user.home=C:\Users\Naveen
2016-09-18 12:59:10,944 INFO  [main] server.ZooKeeperServer: Server environment:user.dir=C:\cygwin\root\usr\local\hbase-1.2.3
2016-09-18 12:59:10,957 INFO  [main] server.ZooKeeperServer: tickTime set to 3000
2016-09-18 12:59:10,957 INFO  [main] server.ZooKeeperServer: minSessionTimeout set to -1
2016-09-18 12:59:10,957 INFO  [main] server.ZooKeeperServer: maxSessionTimeout set to 90000
2016-09-18 12:59:11,316 INFO  [main] server.NIOServerCnxnFactory: binding to port 0.0.0.0/0.0.0.0:2181

HBase startup logs
Sun, Sep 18, 2016 12:59:06 PM Starting master on Naveen
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 8
stack size              (kbytes, -s) 2032
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256
virtual memory          (kbytes, -v) unlimited
2016-09-18 12:59:08,128 INFO  [main] util.VersionInfo: HBase 1.2.3
2016-09-18 12:59:08,129 INFO  [main] util.VersionInfo: Source code repository git://kalashnikov.att.net/Users/stack/checkouts/hbase.git.commit revision=bd63744624a26dc3350137b564fe746df7a721a4
.
.
.
.
.
2016-09-18 12:59:25,144 INFO  [regionserver/Naveen/192.168.56.1:0.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/naveen,59600,1474174753236/naveen%2C59600%2C1474174753236.default.1474174764214 with entries=2, filesize=303 B; new WAL /hbase/WALs/naveen,59600,1474174753236/naveen%2C59600%2C1474174753236.default.1474174764743
2016-09-18 12:59:25,242 INFO  [Naveen:59566.activeMasterManager] master.HMaster: Master has completed initialization
2016-09-18 12:59:25,244 INFO  [Naveen:59566.activeMasterManager] quotas.MasterQuotaManager: Quota support disabled
2016-09-18 12:59:25,245 INFO  [Naveen:59566.activeMasterManager] zookeeper.ZooKeeperWatcher: not a secure deployment, proceeding
2016-09-18 12:59:27,174 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Stopping HBase metrics system...
2016-09-18 12:59:27,183 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system stopped.
2016-09-18 12:59:27,688 INFO  [HBase-Metrics2-1] impl.MetricsConfig: loaded properties from hadoop-metrics2-hbase.properties
2016-09-18 12:59:27,693 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2016-09-18 12:59:27,693 INFO  [HBase-Metrics2-1] impl.MetricsSystemImpl: HBase metrics system started
2016-09-18 12:59:46,246 INFO  [WALProcedureStoreSyncThread] wal.WALProcedureStore: Remove log: hdfs://127.0.0.1:9000/hbase/MasterProcWALs/state-00000000000000000001.log
2016-09-18 12:59:46,246 INFO  [WALProcedureStoreSyncThread] wal.WALProcedureStore: Removed logs: [hdfs://127.0.0.1:9000/hbase/MasterProcWALs/state-00000000000000000002.log]

Type jps to check if the HMaster daemon process is running.
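In this local setup the region server and the embedded ZooKeeper run inside the master's JVM, so alongside the Hadoop daemons you should mainly see an HMaster entry, roughly like this (process id illustrative):

$ jps
6812 HMaster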











Next, we start the HBase shell using the command sh hbase shell.
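As with start-hbase.sh, the hbase script sits under bin/ in a standard layout, so a sketch of the invocation is:

$ cd /usr/local/hbase-1.2.3
$ sh bin/hbase shell
hbase(main):001:0>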












After starting HBase, the HDFS file system should show the directory structure below.
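The same structure can also be listed from the Hadoop bin directory; for HBase 1.2 it typically contains entries such as data, WALs and MasterProcWALs (the latter is visible in the startup logs above):

$ hdfs dfs -ls /hbase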


















Now, let's play with some HBase commands.




Using a long column family name, such as columnfamily1, is a horrible idea in production. Every cell (i.e. every value) in HBase is stored fully qualified. This basically means that long column family names will balloon the amount of disk space required to store your data. In summary, keep your column family names as small as possible.

To start, I’m going to create a new table named cars. My column family is vi, which is an abbreviation of vehicle information.

The schema below is for illustration purposes only and should not be used to create a production schema. In production, you should create a Row ID that helps to uniquely identify the row, and that is likely to be used in your queries. Therefore, one possibility would be to shift the Make, Model and Year left and use these items in the Row ID.

create 'cars', 'vi'
Let’s insert 3 column qualifiers (make, model, year) and the associated values into the first row (row1).


put 'cars', 'row1', 'vi:make', 'bmw'
put 'cars', 'row1', 'vi:model', '5 series'
put 'cars', 'row1', 'vi:year', '2012'

Now let’s add a second row.

put 'cars', 'row2', 'vi:make', 'mercedes'
put 'cars', 'row2', 'vi:model', 'e class'
put 'cars', 'row2', 'vi:year', '2012'
List the tables using the command below.

list


Scan a Table (i.e. Query a Table)

We’ll start with a basic scan that returns all columns in the cars table.

scan 'cars'
You should see output similar to:













Reading the output above you’ll notice that the Row ID is listed under ROW. The COLUMN+CELL field shows the column family after column=, then the column qualifier, a timestamp that is automatically created by HBase, and the value.

Importantly, each row in our results shows an individual row id + column family + column qualifier combination. Therefore, you’ll notice that multiple columns in a row are displayed in multiple rows in our results.
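For reference, each result line in the scan output has roughly this shape (timestamps are illustrative):

ROW                    COLUMN+CELL
 row1                  column=vi:make, timestamp=1474174800000, value=bmw
 row1                  column=vi:model, timestamp=1474174800001, value=5 series
 row1                  column=vi:year, timestamp=1474174800002, value=2012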

The next scan we’ll run will limit our results to the make column qualifier.

scan 'cars', {COLUMNS => ['vi:make']}
You should see output similar to:
















If you have a particularly large result set, you can limit the number of rows returned with the LIMIT option. In this example I arbitrarily limit the results to 1 row to demonstrate how LIMIT works.




scan 'cars', {COLUMNS => ['vi:make'], LIMIT => 1}
You should see output similar to:










Get One Row
The get command allows you to get one row of data at a time. You can optionally limit the number of columns returned. We’ll start by getting all columns in row1.


get 'cars', 'row1'
You should see output similar to:










When looking at the output above, you should notice how the results under COLUMN show the fully qualified column family:column qualifier, such as vi:make.

To get one specific column, include the COLUMN option.

get 'cars', 'row1', {COLUMN => 'vi:model'}
You should see output similar to:










You can also get two or more columns by passing an array of columns.

get 'cars', 'row1', {COLUMN => ['vi:model', 'vi:year']}
You should see output similar to:










Delete a Cell (Value)

delete 'cars', 'row2', 'vi:year'
Let’s check that our delete worked.

get 'cars', 'row2'
You should see output that shows 2 columns.












Disable and Delete a Table

disable 'cars'
drop 'cars'
You should see empty table list.











View HBase Command Help

help












Exit the HBase Shell

exit








To stop the HBase server, issue the sh stop-hbase.sh command and wait for it to complete. Killing the process might corrupt your data on disk.

$ sh stop-hbase.sh


Installing Apache Hadoop on Windows 10 using Cygwin64

This article describes how to set up and configure a single-node Hadoop installation on Windows 10 using Cygwin.

Tools and technologies used

  • Java 1.8
  • Hadoop 2.7.1
  • Cygwin64

Configure Java

Download Java from here and extract to c:\java

Add JAVA_HOME to the user variables and %JAVA_HOME%\bin to the Path variable.
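One way to set JAVA_HOME is from a Windows Command Prompt with setx (a sketch assuming the jdk1.8.0_101 build used later in this guide; the Path entry is easiest to add through the Environment Variables dialog):

setx JAVA_HOME "C:\java\jdk1.8.0_101"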



Cygwin installation:

Download and install Cygwin from the official site.

Create folders as below:
C:\cygwin\root // Root folder
C:\cygwin\setup // Local Package folder
Place the Cygwin64 setup file inside the Local Package folder (C:\cygwin\setup).
Run the setup with the "Install from Internet" option without selecting any additional packages.
Configure environment variables for CYGWIN_HOME (C:\cygwin\root) and PATH (%CYGWIN_HOME%\bin).


Re-run the setup with default options and select the packages below during installation:

 a. zlib















b. OpenSSH
















 c. tcp_wrappers
















d. diffutils

















LINKS & FILE PERMISSIONS:


After the Cygwin installation, open a Cygwin terminal as administrator and run the following command to create a symbolic link for Java.

$ ln -s /cygdrive/c/java/jdk1.8.0_101 /usr/local/java
Create passwd and group files inside /etc folder:

C:\cygwin\root\etc\passwd
C:\cygwin\root\etc\group
Set file permissions:

 $ chmod +r /etc/passwd
 $ chmod u+w /etc/passwd
 $ chmod +r /etc/group
 $ chmod u+w /etc/group
 $ chown :Users /var
 $ chmod 757 /var
 $ chmod ug-s /var
 $ chmod +t /var
Edit the /etc/hosts.allow file using your favorite editor and make sure the following two lines are in there before the PARANOID line:

ALL : localhost 127.0.0.1/32 : allow
ALL : [::1]/128 : allow

SSH CONFIGURATION:

Configure SSH using Cygwin64 terminal [Run As Administrator]:

Run the script
$ ssh-host-config
  • If this script asks to overwrite an existing /etc/ssh_config, answer yes.
  • If this script asks to overwrite an existing /etc/sshd_config, answer yes.
  • If this script asks to use privilege separation, answer yes.
  • If this script asks to install sshd as a service, answer yes. Make sure you started your shell as Administrator!
  • If this script asks for the CYGWIN value, just <enter> as the default is ntsec.
  • If this script asks to create the sshd account, answer yes.
  • If this script asks to use a different user name as service account, answer no as the default will suffice.
  • If this script asks to create the cyg_server account, answer yes. Enter a password for the account.
  • Start the SSH service using net start sshd or cygrunsrv --start sshd, as shown below. Notice that cygrunsrv is the utility that makes the process run as a Windows service. Confirm that you see a message stating that the CYGWIN sshd service was started successfully.
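For example, from the same elevated terminal:

$ net start sshd
The CYGWIN sshd service was started successfully.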

Harmonize Windows and Cygwin64
$ mkpasswd -cl > /etc/passwd
$ mkgroup --local > /etc/group

Test SSH using another Cygwin64 terminal [Local User]:
$ ssh cyg_server@localhost
cyg_server@localhost's password:
cyg_server@Naveen ~
$ logout

Configure Hadoop
Download version 2.7.1 from hadoop-2.7.1.tar.gz.

Place hadoop-2.7.1.tar.gz inside the C:\cygwin\root\usr\local folder and extract it using the following commands.

$ cd /usr/local
$ tar -xzf hadoop-2.7.1.tar.gz













The official Hadoop release from Apache does not include Windows binaries, and compiling from source can be tedious, so I have made this compiled distribution available. Download it from here and replace the hadoop-2.7.1/bin content.

Add HADOOP_HOME and HADOOP_BIN_PATH to the user variables.
Add %HADOOP_HOME%\bin and %HADOOP_BIN_PATH% to the Path variable.



It is necessary to modify some configuration files inside /hadoop-2.7.1/etc/hadoop. All such files follow an XML format, and the updates should concern the top-level configuration node. Specifically:

  •  yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
  •  core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  •  mapred-site.xml (create mapred-site.xml from mapred-site.xml.template if it does not exist):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  •  hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/dfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/dfs/datanode</value>
  </property>
</configuration>
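Note that these file: URIs are resolved by the Windows JVM, so file:/data/dfs/namenode ends up as \data\dfs\namenode on the current drive rather than under Cygwin's / root. The format step below creates the NameNode directory, but both directories can also be created up front, e.g. (a sketch assuming everything lives on C: as in this guide):

$ mkdir -p /cygdrive/c/data/dfs/namenode /cygdrive/c/data/dfs/datanode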

Formatting the distributed file system

Open a Cygwin terminal as administrator, navigate to the bin path, and
type ‘hdfs namenode -format’ to format the Hadoop file system.

$ cd /usr/local/hadoop-2.7.1/bin
$ hdfs namenode -format


After execution you will see the logs below:
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/09 21:39:41 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-c7dc4583-c77b-4497-a0b6-a4506e538fbb
16/09/09 21:39:45 INFO namenode.FSNamesystem: No KeyProvider found.
16/09/09 21:39:45 INFO namenode.FSNamesystem: fsLock is fair:true
16/09/09 21:39:45 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit=1000
16/09/09 21:39:45 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
16/09/09 21:39:45 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
16/09/09 21:39:45 INFO blockmanagement.BlockManager: The block deletion will start around 2016 Sep 09 21:39:45
16/09/09 21:39:45 INFO util.GSet: Computing capacity for map BlocksMap
16/09/09 21:39:45 INFO util.GSet: VM type       = 64-bit
16/09/09 21:39:45 INFO util.GSet: 2.0% max memory 889 MB = 17.8 MB
16/09/09 21:39:45 INFO util.GSet: capacity      = 2^21 = 2097152 entries
16/09/09 21:39:45 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
16/09/09 21:39:45 INFO blockmanagement.BlockManager: defaultReplication         = 1
16/09/09 21:39:45 INFO blockmanagement.BlockManager: maxReplication             = 512
16/09/09 21:39:45 INFO blockmanagement.BlockManager: minReplication             = 1
16/09/09 21:39:45 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
16/09/09 21:39:45 INFO blockmanagement.BlockManager: shouldCheckForEnoughRacks  = false
16/09/09 21:39:45 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
16/09/09 21:39:45 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
16/09/09 21:39:45 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
16/09/09 21:39:45 INFO namenode.FSNamesystem: fsOwner             = Naveen (auth:SIMPLE)
16/09/09 21:39:45 INFO namenode.FSNamesystem: supergroup          = supergroup
16/09/09 21:39:45 INFO namenode.FSNamesystem: isPermissionEnabled = true
16/09/09 21:39:45 INFO namenode.FSNamesystem: HA Enabled: false
16/09/09 21:39:45 INFO namenode.FSNamesystem: Append Enabled: true
16/09/09 21:39:46 INFO util.GSet: Computing capacity for map INodeMap
16/09/09 21:39:46 INFO util.GSet: VM type       = 64-bit
16/09/09 21:39:46 INFO util.GSet: 1.0% max memory 889 MB = 8.9 MB
16/09/09 21:39:46 INFO util.GSet: capacity      = 2^20 = 1048576 entries
16/09/09 21:39:46 INFO namenode.FSDirectory: ACLs enabled? false
16/09/09 21:39:46 INFO namenode.FSDirectory: XAttrs enabled? true
16/09/09 21:39:46 INFO namenode.FSDirectory: Maximum size of an xattr: 16384
16/09/09 21:39:46 INFO namenode.NameNode: Caching file names occuring more than 10 times
16/09/09 21:39:46 INFO util.GSet: Computing capacity for map cachedBlocks
16/09/09 21:39:46 INFO util.GSet: VM type       = 64-bit
16/09/09 21:39:46 INFO util.GSet: 0.25% max memory 889 MB = 2.2 MB
16/09/09 21:39:46 INFO util.GSet: capacity      = 2^18 = 262144 entries
16/09/09 21:39:46 INFO namenode.FSNamesystem: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
16/09/09 21:39:46 INFO namenode.FSNamesystem: dfs.namenode.safemode.min.datanodes = 0
16/09/09 21:39:46 INFO namenode.FSNamesystem: dfs.namenode.safemode.extension     = 30000
16/09/09 21:39:46 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
16/09/09 21:39:46 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
16/09/09 21:39:46 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
16/09/09 21:39:46 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
16/09/09 21:39:46 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
16/09/09 21:39:46 INFO util.GSet: Computing capacity for map NameNodeRetryCache
16/09/09 21:39:46 INFO util.GSet: VM type       = 64-bit
16/09/09 21:39:46 INFO util.GSet: 0.029999999329447746% max memory 889 MB = 273.1 KB
16/09/09 21:39:46 INFO util.GSet: capacity      = 2^15 = 32768 entries
16/09/09 21:39:51 INFO namenode.FSImage: Allocated new BlockPoolId: BP-818803997-192.168.56.1-1473428391573
16/09/09 21:39:51 INFO common.Storage: Storage directory \data\dfs\namenode has been successfully formatted.
16/09/09 21:39:52 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/09/09 21:39:52 INFO util.ExitUtil: Exiting with status 0
16/09/09 21:39:52 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Naveen/192.168.56.1



Navigate to the sbin directory.
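For example:

$ cd /usr/local/hadoop-2.7.1/sbin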











Type ‘sh hadoop-daemon.sh start namenode’ to start the name node.
$ sh hadoop-daemon.sh start namenode












NameNode startup logs
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/09 22:22:05 INFO namenode.NameNode: createNameNode []
16/09/09 22:22:08 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
16/09/09 22:22:08 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
16/09/09 22:22:08 INFO impl.MetricsSystemImpl: NameNode metrics system started
16/09/09 22:22:08 INFO namenode.NameNode: fs.defaultFS is hdfs://127.0.0.1:9000
.
.
.
dbacd449;nsid=722029399;c=0), blocks: 3, hasStaleStorage: false, processing time: 43 msecs
16/09/09 22:22:34 INFO blockmanagement.BlockManager: Total number of blocks            = 3
16/09/09 22:22:34 INFO blockmanagement.BlockManager: Number of invalid blocks          = 0
16/09/09 22:22:34 INFO blockmanagement.BlockManager: Number of under-replicated blocks = 2
16/09/09 22:22:34 INFO blockmanagement.BlockManager: Number of  over-replicated blocks = 0
16/09/09 22:22:34 INFO blockmanagement.BlockManager: Number of blocks being written    = 0
16/09/09 22:22:34 INFO hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 63 msec
16/09/09 22:22:54 INFO hdfs.StateChange: STATE* Safe mode ON, in safe mode extension.
The reported blocks 3 has reached the threshold 0.9990 of total blocks 3. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically in 9 seconds.
16/09/09 22:23:04 INFO hdfs.StateChange: STATE* Leaving safe mode after 45 secs
16/09/09 22:23:04 INFO hdfs.StateChange: STATE* Safe mode is OFF
16/09/09 22:23:04 INFO hdfs.StateChange: STATE* Network topology has 1 racks and 1 datanodes
16/09/09 22:23:04 INFO hdfs.StateChange: STATE* UnderReplicatedBlocks has 3 blocks

Type ‘sh hadoop-daemon.sh start datanode’ to start the data node.
$ sh hadoop-daemon.sh start datanode












DataNode startup logs
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/09 22:22:13 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
16/09/09 22:22:13 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
16/09/09 22:22:13 INFO impl.MetricsSystemImpl: DataNode metrics system started
16/09/09 22:22:13 INFO datanode.BlockScanner: Initialized block scanner with targetBytesPerSec 1048576
.
.
.
16/09/09 22:22:34 INFO datanode.DataNode: Namenode Block pool BP-531935505-192.168.56.1-1472983681288 (Datanode Uuid 964b90aa-8843-4eea-9f50-182e5bab213a) service to /127.0.0.1:9000 trying to claim ACTIVE state with txid=556
16/09/09 22:22:34 INFO datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-531935505-192.168.56.1-1472983681288 (Datanode Uuid 964b90aa-8843-4eea-9f50-182e5bab213a) service to /127.0.0.1:9000
16/09/09 22:22:34 INFO datanode.DataNode: Successfully sent block report 0x18e88ecf05bac,  containing 1 storage report(s), of which we sent 1. The reports had 3 total blocks and used 1 RPC(s). This took 15 msec to generate and 286 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
16/09/09 22:22:34 INFO datanode.DataNode: Got finalize command for block pool BP-531935505-192.168.56.1-1472983681288

Type ‘sh yarn-daemon.sh start resourcemanager’ to start the resource manager.


$ sh yarn-daemon.sh start resourcemanager

Resource Manager startup logs
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/09 22:22:12 INFO event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
16/09/09 22:22:12 INFO event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
.
.
.
16/09/09 22:22:40 INFO security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -46717855
16/09/09 22:22:40 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as Naveen:56407 with total resource of <memory:8192, vCores:8>
16/09/09 22:22:40 INFO nodemanager.NodeStatusUpdaterImpl: 
Notifying ContainerManager to unblock new container-requests

Type ‘sh yarn-daemon.sh start nodemanager’ to start the node manager.
$ sh yarn-daemon.sh start nodemanager







Node Manager startup logs
STARTUP_MSG:   java = 1.8.0_101
************************************************************/
16/09/09 22:22:12 INFO event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
16/09/09 22:22:12 INFO event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
16/09/09 22:22:12 INFO event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
.
.
.
16/09/09 22:22:40 INFO security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -46717855
16/09/09 22:22:40 INFO nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as Naveen:56407 with total resource of <memory:8192, vCores:8>
16/09/09 22:22:40 INFO nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests

Type ‘sh mr-jobhistory-daemon.sh start historyserver’ to start the job history server.
$ sh mr-jobhistory-daemon.sh start historyserver










Type jps to verify the daemon processes.
$ jps
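With all five daemons running, the list should include roughly the following entries (process ids will differ):

5216 NameNode
5840 DataNode
6324 ResourceManager
6712 NodeManager
7048 JobHistoryServer
7480 Jps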






Installation Verification 

  • namenode GUI
  • resourcemanager GUI



Resourcemanager GUI address - http://localhost:8088

Namenode GUI address - http://localhost:50070

In the next article (Installing Apache HBase on Windows using Cygwin64) we'll see how to install HBase on Windows using Cygwin and practice some basic HBase commands.