Experiment 5: Introductory MapReduce Programming Practice (3): Mining Information from a Given Table

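1. Objectives

Learn the basic MapReduce programming approach through hands-on practice;
Learn to use MapReduce for common data-processing tasks, including data deduplication, data sorting, and data mining.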

2. Platform

Operating system: Linux (Ubuntu 16.04 or Ubuntu 18.04 recommended)
Hadoop version: 3.1.3

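3. Task

Mining information from a given table

Given a child-parent table, mine the parent-child relations in it and produce a table of grandchild-grandparent relations.

The input is saved in a file named child-parent, with the following content: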

child parent
Steven Lucy
Steven Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Frank
Jack Alice
Jack Jesse
David Alice
David Jesse
Philip David
Philip Alma
Mark David
Mark Alma

The expected output for this input file is:

grand_child    grand_parent
Mark    Jesse
Mark    Alice
Philip    Jesse
Philip    Alice
Jone    Jesse
Jone    Alice
Steven    Jesse
Steven    Alice
Steven    Frank
Steven    Mary
Jone    Frank
Jone    Mary
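
Before turning to MapReduce, it may help to see the same relation computed in memory. The following is a minimal sketch in plain Java, not the lab's Hadoop job: it joins the table with itself by looking up each parent's own parents. The class name GrandparentJoinSketch and the method grandPairs are hypothetical names introduced only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the lab's Hadoop code.
class GrandparentJoinSketch {
    /** Joins the child-parent table with itself: child -> parent -> grandparent. */
    static List<String[]> grandPairs(String[][] childParent) {
        // Index the table by child so that each person's parents can be looked up.
        Map<String, List<String>> parentsOf = new HashMap<>();
        for (String[] row : childParent) {
            parentsOf.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String[]> result = new ArrayList<>();
        for (String[] row : childParent) {
            // row[1] is a parent of row[0]; that parent's parents are row[0]'s grandparents.
            List<String> grandParents =
                    parentsOf.getOrDefault(row[1], Collections.<String>emptyList());
            for (String grandParent : grandParents) {
                result.add(new String[] {row[0], grandParent});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] table = {
            {"Steven", "Lucy"}, {"Steven", "Jack"}, {"Jone", "Lucy"}, {"Jone", "Jack"},
            {"Lucy", "Mary"}, {"Lucy", "Frank"}, {"Jack", "Alice"}, {"Jack", "Jesse"},
            {"David", "Alice"}, {"David", "Jesse"}, {"Philip", "David"}, {"Philip", "Alma"},
            {"Mark", "David"}, {"Mark", "Alma"}
        };
        // Print one grandchild/grandparent pair per line, tab-separated.
        for (String[] pair : grandPairs(table)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

On the input table this produces the same 12 pairs as the sample output, though not necessarily in the same order. The MapReduce version has to achieve the same join without a shared in-memory index, which is why it tags records and lets the shuffle group them by key.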

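4. Steps

Go to the Hadoop installation directory and start Hadoop:

cd /usr/local/hadoop
sbin/start-dfs.sh

Create a new directory, then create the file child-parent in it:

sudo mkdir Pritice3 && cd Pritice3
sudo vim child-parent

Write the input shown above into the file, then create the Java file that implements the MapReduce job:

sudo vim simple_data_mining.java

The Java code is as follows: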

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Input: a child-parent table.
 * Output: a table of grandchild-grandparent relations.
 */
public class simple_data_mining {
    // Ensures the header row is written only once.
    // This relies on all reduce() calls running in a single reducer JVM.
    public static int time = 0;

    // Map splits each input line at the first space into child and parent, then
    // emits the record twice: keyed by parent (left table) and keyed by child
    // (right table). The value must carry a flag that tells the two tables apart.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int i = line.indexOf(' ');
            if (i < 0) {
                return; // skip blank or malformed lines instead of failing
            }
            String childName = line.substring(0, i);
            String parentName = line.substring(i + 1);
            if (!childName.equals("child")) { // skip the header row
                // Left table, flag "1": keyed by parent
                context.write(new Text(parentName),
                        new Text("1+" + childName + "+" + parentName));
                // Right table, flag "2": keyed by child
                context.write(new Text(childName),
                        new Text("2+" + childName + "+" + parentName));
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            if (time == 0) { // write the header row
                context.write(new Text("grand_child"), new Text("grand_parent"));
                time++;
            }
            // Lists instead of fixed-size arrays, so a key with many records cannot overflow
            List<String> grandChildren = new ArrayList<>();
            List<String> grandParents = new ArrayList<>();
            for (Text value : values) {
                String record = value.toString();
                if (record.isEmpty()) {
                    continue;
                }
                // Each record has the form "<flag>+<child>+<parent>"
                char relationType = record.charAt(0);
                int sep = record.indexOf('+', 2);
                String childName = record.substring(2, sep);
                String parentName = record.substring(sep + 1);
                if (relationType == '1') {
                    // Left table: the child is a grandchild candidate
                    grandChildren.add(childName);
                } else {
                    // Right table: the parent is a grandparent candidate
                    grandParents.add(parentName);
                }
            }
            // Cross product of both lists; writes nothing if either list is empty
            for (String grandChild : grandChildren) {
                for (String grandParent : grandParents) {
                    context.write(new Text(grandChild), new Text(grandParent));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS replaces the deprecated key fs.default.name
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        String[] otherArgs = new String[] {"input", "output"}; // paths are hard-coded
        if (otherArgs.length != 2) {
            System.err.println("Usage: simple_data_mining <in> <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "Single table join");
        job.setJarByClass(simple_data_mining.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
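
To make the tagging scheme easier to follow, here is a minimal sketch in plain Java that imitates what the map phase emits and how the shuffle groups it, without using Hadoop at all. The class name TaggedShuffleSketch and the method mapAndGroup are hypothetical names introduced for this illustration; in a real job the order of values within a key is not guaranteed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it only imitates the map and shuffle phases.
class TaggedShuffleSketch {
    /** Emits both tagged records for every row and groups them by key. */
    static Map<String, List<String>> mapAndGroup(String[][] childParent) {
        // TreeMap keeps keys sorted, similar to how reducers see keys in order.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] row : childParent) {
            String child = row[0];
            String parent = row[1];
            // Flag "1": keyed by parent, so this row supplies a grandchild candidate.
            groups.computeIfAbsent(parent, k -> new ArrayList<>())
                  .add("1+" + child + "+" + parent);
            // Flag "2": keyed by child, so this row supplies a grandparent candidate.
            groups.computeIfAbsent(child, k -> new ArrayList<>())
                  .add("2+" + child + "+" + parent);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] table = {
            {"Steven", "Lucy"}, {"Jone", "Lucy"}, {"Lucy", "Mary"}, {"Lucy", "Frank"}
        };
        // Print each key with the list of tagged values a reducer would receive.
        for (Map.Entry<String, List<String>> entry : mapAndGroup(table).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

In this sketch the key Lucy receives two flag-1 records (from Steven and Jone) and two flag-2 records (pointing to Mary and Frank), so the reducer's cross product yields four grandchild-grandparent pairs. Keys such as Mary or Steven receive records of only one flag, so the reducer writes nothing for them.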

Grant the hadoop user ownership of the installation directory:

sudo chown -R hadoop /usr/local/hadoop

Add the jar files needed for compilation to the classpath:

vim ~/.bashrc

Append the following two lines to the end of the file:

export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH

Apply the change immediately:

source ~/.bashrc

Compile simple_data_mining.java:

javac simple_data_mining.java

Package the generated class files into a jar:

jar -cvf simple_data_mining.jar *.class

Create the HDFS home directory /user/hadoop and an input directory inside it:

/usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop
/usr/local/hadoop/bin/hdfs dfs -mkdir input

If input already exists, delete the files in it:

/usr/local/hadoop/bin/hdfs dfs -rm input/*

Upload the file child-parent to the input directory:

/usr/local/hadoop/bin/hdfs dfs -put ./child-parent input

Make sure the output directory does not exist before running the job:

/usr/local/hadoop/bin/hdfs dfs -rm -r output

Run the simple_data_mining.jar that was just built:

/usr/local/hadoop/bin/hadoop jar simple_data_mining.jar simple_data_mining

View the output:

/usr/local/hadoop/bin/hdfs dfs -cat output/*

The output is as follows:

hadoop@fzqs-Laptop:/usr/local/hadoop$ hdfs dfs -cat output/*
grand_child    grand_parent
Mark    Jesse
Mark    Alice
Philip    Jesse
Philip    Alice
Jone    Jesse
Jone    Alice
Steven    Jesse
Steven    Alice
Steven    Frank
Steven    Mary
Jone    Frank
Jone    Mary
hadoop@fzqs-Laptop:/usr/local/hadoop$

If you would rather write this in Python, see my other blog post: Experiment 5: Introductory MapReduce Programming Practice (Python implementation)


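This article is reposted from: https://blog.csdn.net/weixin_46584887/article/details/121603995
Copyright belongs to the original author, Z.Q.Feng. If there is any infringement, please contact us for removal.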
