1. Word frequency counting: task requirements
Prepare two txt files, wordfile1.txt and wordfile2.txt, with the following contents:
2. Creating the project in Eclipse
My Eclipse is installed under /usr/local/eclipse; start it with the following commands:
cd /usr/local/eclipse
./eclipse
Create a Java project named WordCount, then click Next to load the JAR packages.
Select Libraries and click Add External JARs to load the JAR packages.
To write a MapReduce program, you generally need to add the following JAR packages to the Java project:
(1) hadoop-common-3.0.3.jar and hadoop-nfs-3.0.3.jar in the /usr/local/hadoop/share/hadoop/common directory;
(2) all JAR packages in the /usr/local/hadoop/share/hadoop/common/lib directory;
(3) all JAR packages in the /usr/local/hadoop/share/hadoop/mapreduce directory, excluding the jdiff, lib, lib-examples, and sources subdirectories;
(4) all JAR packages in the /usr/local/hadoop/share/hadoop/mapreduce/lib directory.
3. Writing the Java application
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Emits (token, 1) for every whitespace-separated token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts for each word; also used as the combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
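Since the job itself only runs on Hadoop, it can help to see what the mapper/reducer pair actually computes before deploying. The following standalone sketch has no Hadoop dependencies and mimics the map and reduce steps with a plain HashMap; the sample lines are placeholders, since the contents of wordfile1.txt and wordfile2.txt are not shown above:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class LocalWordCount {
    // Mimics TokenizerMapper + IntSumReducer: tokenize each line on
    // whitespace, then sum a count of 1 emitted for every token.
    static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical file contents, used only for illustration.
        String[] lines = {"hello world", "hello hadoop"};
        // Print in the same word<TAB>count form the real job writes.
        count(lines).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```

The real job writes its results in this same tab-separated word/count form to the part-r-00000 file in the output directory.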
4. Compiling and packaging the program
To run the program inside Eclipse, hover over the run shortcut button at the top of the Eclipse workbench, choose "Run As" from the pop-up menu, and then choose "Java Application". Once the program works, you can package it into a JAR and deploy it on the Hadoop platform. Here the word-count program will be placed in the /usr/local/hadoop/WordCount directory, which can be created with the following commands:
cd /usr/local/hadoop
mkdir WordCount
Right-click the project name "WordCount", choose "Export" from the pop-up menu, select "Runnable JAR file", and save the JAR into the WordCount directory, clicking OK on any prompts that appear along the way.
Then use the following commands to check that the JAR was saved in the WordCount directory:
cd /usr/local/hadoop/WordCount
ls
5. Running the program
Before running the program, start the Hadoop platform with the following commands:
cd /usr/local/hadoop
./sbin/start-dfs.sh
Once Hadoop is up, create an input directory in HDFS with the following commands. Do not create the output directory in advance: MapReduce creates it itself and refuses to run if it already exists.
./bin/hdfs dfs -mkdir /input
./bin/hdfs dfs -ls /
Next, upload the previously created wordfile1.txt and wordfile2.txt to the input directory on the Hadoop platform:
./bin/hdfs dfs -put /home/hadoop/wordfile1.txt /input
./bin/hdfs dfs -put /home/hadoop/wordfile2.txt /input
./bin/hdfs dfs -ls /input
(If input is written without the leading "/", it refers to the /user/hadoop/input directory in HDFS; with the leading "/", it refers to the /input directory at the HDFS root.)
Then start the program with the following commands:
cd /usr/local/hadoop
./bin/hadoop jar ./WordCount/WordCount.jar /input /output
(./WordCount/WordCount.jar refers to the WordCount.jar file in the /usr/local/hadoop/WordCount directory; if WordCount.jar is in the home directory instead, use the following commands:)
cd /usr/local/hadoop
./bin/hadoop jar /home/hadoop/WordCount/WordCount.jar /input /output
When the run succeeds, output like the following appears.
After the run finishes, you can see that the output directory contains two new files.
The word-count results are saved in the /output directory in HDFS; view them with the following commands:
cd /usr/local/hadoop
./bin/hdfs dfs -ls /output
./bin/hdfs dfs -cat /output/part-r-00000
If the run fails, delete the output directory with the following command and then rerun the commands above:
./bin/hdfs dfs -rm -r /output
The results are shown in the figure below.
Copyright belongs to the original author, 旺仔ᥬQ᭄ᥬQ᭄糖; in case of infringement, please contact us for removal.