This article is meant to simplify the outdated Hadoop workflow taught in university courses.
Preface
I recently needed Hadoop for a class, but the instructor's teaching setup still runs Eclipse inside a virtual machine for development, which is fairly painful for anyone used to IDEA. All commands below are executed from the Hadoop installation directory.
1. Environment Setup
1.1 Virtual machine environment
My VM image is ubuntukylin-16.04-desktop-amd64, the Hadoop version is 3.4.1, and the VM's Java version is Java 8.
1.2 Modify core-site.xml
core-site.xml is located under etc/hadoop/ in your Hadoop installation directory. Change the IP in the fs.defaultFS address to your VM's IP address.
To find the VM's IP address:
ip addr
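The original screenshot of the file isn't reproduced here, so as a minimal sketch of the relevant entry (the address matches the one used in the code later in this article; keep any other properties your file already has, such as hadoop.tmp.dir):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.216.164:9000</value>
    </property>
</configuration>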
1.3 Dependencies in IDEA's pom.xml (new Maven project)
Be sure to change the version to match the Hadoop version on your VM.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>HadoopExp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <hadoop.version>3.4.1</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>
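All three dependencies reference the hadoop.version property, so a future version bump only touches one line. After saving the pom, trigger a Maven reload in IDEA so the dependencies are downloaded.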
1.4 Running an example (merge the files in a given directory)
Code:
import java.io.IOException;
import java.io.PrintStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

class MyPathFilter implements PathFilter {
    String reg = null;

    MyPathFilter(String reg) {
        this.reg = reg;
    }

    // Accept every path that does NOT match the regex (used to exclude files)
    public boolean accept(Path path) {
        return !(path.toString().matches(reg));
    }
}

/**
 * Merge files in HDFS using FSDataOutputStream and FSDataInputStream
 */
public class MergeFile {
    Path inputPath = null;  // directory containing the files to merge
    Path outputPath = null; // path of the output file

    public MergeFile(String input, String output) {
        this.inputPath = new Path(input);
        this.outputPath = new Path(output);
    }

    public void doMerge() throws IOException {
        Configuration conf = new Configuration();
        // HDFS configuration; change the address to your own
        conf.set("fs.defaultFS", "hdfs://192.168.216.164:9000");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
        FileSystem fsSource = FileSystem.get(URI.create(inputPath.toString()), conf);
        FileSystem fsDst = FileSystem.get(URI.create(outputPath.toString()), conf);
        // Filter out files in the input directory whose names end with .abc
        FileStatus[] sourceStatus = fsSource.listStatus(inputPath, new MyPathFilter(".*\\.abc"));
        FSDataOutputStream fsdos = fsDst.create(outputPath);
        PrintStream ps = new PrintStream(System.out);
        // Read each remaining file and append its contents to the single output file
        for (FileStatus sta : sourceStatus) {
            // Print the path, size, and permissions of each file that survived the filter
            System.out.print("Path: " + sta.getPath() + "  Size: " + sta.getLen()
                    + "  Permissions: " + sta.getPermission() + "  Contents: ");
            FSDataInputStream fsdis = fsSource.open(sta.getPath());
            byte[] data = new byte[1024];
            int read = -1;
            while ((read = fsdis.read(data)) > 0) {
                ps.write(data, 0, read);
                fsdos.write(data, 0, read);
            }
            fsdis.close();
        }
        ps.close();
        fsdos.close();
    }

    public static void main(String[] args) throws IOException {
        MergeFile merge = new MergeFile(
                "hdfs://192.168.216.164:9000/user/shi/input",    // input directory
                "hdfs://192.168.216.164:9000/user/shi/merge.txt" // output file
        );
        merge.doMerge();
    }
}
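Assuming the paths in main above, you can sanity-check the merged result from the VM with:

./bin/hdfs dfs -cat /user/shi/merge.txt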
2. WordCount Example
2.1 Create a sample file wordfile1.txt
Fill it with any content.
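For example, from the Hadoop installation directory (the text is arbitrary):

echo "hello world hello hadoop" > wordfile1.txt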
2.2 Upload to HDFS
From the Hadoop installation directory:
./bin/hdfs dfs -put ./wordfile1.txt <your HDFS directory>
If the target directory doesn't exist yet, create it with the command below; -p creates any missing parent directories along the given path.
./bin/hdfs dfs -mkdir -p <your directory>
Check that the upload succeeded:
./bin/hdfs dfs -ls <your HDFS directory>
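Using the paths that appear later in this article (/user/shi/input), the concrete sequence would be:

./bin/hdfs dfs -mkdir -p /user/shi/input
./bin/hdfs dfs -put ./wordfile1.txt /user/shi/input
./bin/hdfs dfs -ls /user/shi/input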
2.3 Create a directory for the jar
Any directory will do. Note that hadoop jar reads the jar from the local filesystem, so this is an ordinary local directory inside the Hadoop installation directory, not an HDFS one. Using mine as an example:
mkdir myapp
2.4 Package the program written in IDEA (Maven example)
Code:
/**
 * @author Shi
 * @version 1.0
 * @description: TODO
 * @date 2024/11/12 13:23
 */
import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public WordCount() {
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        System.out.println("Input: " + otherArgs[0]);
        System.out.println("Output: " + otherArgs[otherArgs.length - 1]);
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        // If the output directory already exists, delete it so the job can be rerun
        Path outputPath = new Path(otherArgs[otherArgs.length - 1]);
        FileSystem fsSource = FileSystem.get(URI.create(outputPath.toString()), conf);
        if (fsSource.exists(outputPath)) {
            System.out.println("Output directory already exists, deleting: " + outputPath);
            fsSource.delete(outputPath, true);
        }
        FileOutputFormat.setOutputPath(job, outputPath);
        boolean success = job.waitForCompletion(true);
        System.out.println("Job finished");
        System.exit(success ? 0 : 1);
    }

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public TokenizerMapper() {
        }

        // Emit (word, 1) for every whitespace-separated token in the line
        public void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                this.word.set(itr.nextToken());
                context.write(this.word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public IntSumReducer() {
        }

        // Sum the counts for each word
        public void reduce(Text key, Iterable<IntWritable> values,
                           Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }
}
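The original screenshots of the packaging step aren't reproduced here, so as a sketch: run the package goal from IDEA's Maven tool window (or mvn package in the project root), and the jar appears as target/HadoopExp-1.0-SNAPSHOT.jar. Then copy it into the myapp directory created above, e.g. with scp (the user and install path below are placeholders for your own):

mvn package
scp target/HadoopExp-1.0-SNAPSHOT.jar <user>@192.168.216.164:<hadoop-install-dir>/myapp/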
2.5 Run the jar on the VM
From the Hadoop installation directory:
./bin/hadoop jar ./myapp/HadoopExp-1.0-SNAPSHOT.jar WordCount /user/shi/input/wordfile1.txt /user/shi/output/
Command breakdown:
- ./bin/hadoop jar: the Hadoop command for running a jar
- ./myapp/HadoopExp-1.0-SNAPSHOT.jar: the location of the jar; change this to your own path
- /user/shi/input/wordfile1.txt: the input file path
- /user/shi/output/: the output path for the word counts
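One caveat about how hadoop jar resolves the main class: the plain name WordCount works here only because the source above has no package declaration; if you put the class in a package, pass the fully qualified name instead (e.g. org.example.WordCount).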
Run it, then check the results. The command below prints every file under the output directory:
./bin/hdfs dfs -cat /user/shi/output/*
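Each line of the result has the form word<TAB>count: Hadoop's default TextOutputFormat writes the Text key and IntWritable value separated by a tab, one line per distinct word.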
Done!