

Word-Frequency Counting with MapReduce on a Hadoop Cluster

I. Create a New Virtual Machine

II. Configure a Static IP

  1. First, open the VMware Virtual Network Editor and note the subnet's starting IP.
  2. ![](https://img-blog.csdnimg.cn/9f9b0fc72fb345e7bb58ae06de2909c9.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_17,color_FFFFFF,t_70,g_se,x_16)
  3. 2.1 Set a static IP.
  4. Run: vi /etc/sysconfig/network-scripts/ifcfg-ens33
  5. Change BOOTPROTO=static
  6. Add IPADDR, NETMASK, GATEWAY, and DNS1 entries (a sketch follows this list).
  7. ![](https://img-blog.csdnimg.cn/de82dc4bfe9b4d98acfd63b1687b5462.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  8. ![](https://img-blog.csdnimg.cn/dedfb9effb6449febeb1be7310e4dd9a.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  9. 2.2 Run: vi /etc/sysconfig/network and add the two lines shown in the screenshots.
  10. ![](https://img-blog.csdnimg.cn/53733028997b461da21e3c07bc9752d3.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  11. ![](https://img-blog.csdnimg.cn/988e426cef9a4441997763cd903abad4.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_9,color_FFFFFF,t_70,g_se,x_16)
  12. 2.3 Run: vi /etc/hosts and add each node's IP and hostname.
  13. ![](https://img-blog.csdnimg.cn/42b69a5c48784cb29b4e03f239560e69.png)
  14. 2.4 Run: reboot to restart the virtual machine.
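A minimal sketch of the network settings, assuming hadoop102 uses 192.168.88.130 on the 192.168.88.0/24 subnet (the node IPs appear later in this guide; the gateway/DNS value is an assumption, so take the real one from the Virtual Network Editor):

```bash
# /etc/sysconfig/network-scripts/ifcfg-ens33 (additions on hadoop102)
BOOTPROTO=static
IPADDR=192.168.88.130        # hadoop103 -> .131, hadoop104 -> .132
NETMASK=255.255.255.0
GATEWAY=192.168.88.2         # assumed VMware NAT gateway; verify in the Virtual Network Editor
DNS1=192.168.88.2            # assumed; verify

# /etc/hosts (the same three lines on every node)
192.168.88.130 hadoop102
192.168.88.131 hadoop103
192.168.88.132 hadoop104
```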

III. Install the JDK

  1. 3.1 Create the module, jdk, and hadoop folders under /opt.
  2. Run: cd /opt/
  3. Run: mkdir module
  4. Run: mkdir jdk
  5. Run: mkdir hadoop
  6. 3.2 Uninstall the current JDK.
  7. Run: java -version to check the current JDK version.
  8. Run: yum remove java* to remove all installed JDKs.
  9. ![](https://img-blog.csdnimg.cn/ab75cbda67b24ff0a239fb2dd17602d7.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  10. 3.3 Connect to the virtual machine with FileZilla.
  11. Upload the JDK archive to the /opt/jdk directory on hadoop102,
  12. and the Hadoop archive to the /opt/hadoop directory on hadoop102.
  13. ![](https://img-blog.csdnimg.cn/59cce46b63e04b818203b7469f62ddb5.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  14. 3.4 Extract the archive to the target directory.
  15. Run: cd /opt/jdk
  16. Run: tar -zxvf <JDK tarball> -C /opt/module/
  17. ![](https://img-blog.csdnimg.cn/2b19a2f71f3d49f9842ce931280591bb.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  18. 3.5 Edit the profile file and make it take effect.
  19. Run: pwd to check the current directory (the path of the unpacked JDK).
  20. Run: vi /etc/profile and add JAVA_HOME at the end of the file (a sketch follows this list).
  21. ![](https://img-blog.csdnimg.cn/09db62fec3734338a08b2ea1449f028e.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_9,color_FFFFFF,t_70,g_se,x_16)
  22. Run: source /etc/profile
  23. Run: java -version
  24. ![](https://img-blog.csdnimg.cn/3d0d1f7b1b9e4d79b00ed641182948e4.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
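A sketch of the /etc/profile additions, assuming the JDK unpacked to /opt/module/jdk1.8.0_162 (use the exact directory reported by pwd):

```bash
# appended to /etc/profile
export JAVA_HOME=/opt/module/jdk1.8.0_162   # example path; confirm with pwd
export PATH=$PATH:$JAVA_HOME/bin
```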

IV. Install Hadoop

  1. 4.1 Extract Hadoop to the target directory.
  2. Change to the root directory (/).
  3. Run: mkdir kkb
  4. Run: cd kkb
  5. Run: mkdir install
  6. Run: cd /opt/hadoop
  7. Run: tar -zxvf <Hadoop tarball> -C /kkb/install
  8. ![](https://img-blog.csdnimg.cn/e5e4602bec1848e5b25b7207c7fb3531.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  9. 4.2 Edit the profile file and make it take effect.
  10. Run: vi /etc/profile and configure the HADOOP_HOME environment variables (a sketch follows this list).
  11. ![](https://img-blog.csdnimg.cn/bcdc5e2ddcb84f52903b8588af60908c.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  12. Run: source /etc/profile
  13. Run: hadoop version
  14. ![](https://img-blog.csdnimg.cn/0eb3abb3e11e4674b51036b445719f15.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
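A sketch of the HADOOP_HOME additions to /etc/profile, assuming Hadoop 3.1.4 unpacked to /kkb/install/hadoop-3.1.4 (the path used throughout the rest of this guide):

```bash
# appended to /etc/profile
export HADOOP_HOME=/kkb/install/hadoop-3.1.4
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```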

V. Clone hadoop103 and hadoop104, then change their static IPs following the same steps as for hadoop102

VI. Configure Passwordless SSH Login

  1. 6.1 Configure SSH, using hadoop102 as the example.
  2. Run: cd ~/.ssh
  3. Run: ssh-keygen -t rsa
  4. Press Enter three times to generate the key pair.
  5. 6.2 Distribute the public key, first to the node itself, then to hadoop103 and hadoop104.
  6. Run: ssh-copy-id 192.168.88.130
  7. Run: ssh-copy-id 192.168.88.131
  8. Run: ssh-copy-id 192.168.88.132
  9. 6.3 Repeat steps 6.1-6.2 on hadoop103 and hadoop104 (a quick check follows this list).
  10. ![](https://img-blog.csdnimg.cn/f5aa11e2a5504575a17c5be21c25b1db.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_9,color_FFFFFF,t_70,g_se,x_16)
  11. ![](https://img-blog.csdnimg.cn/8470272efc7b472fb2cbc9e1d1db0c3f.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_10,color_FFFFFF,t_70,g_se,x_16)
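As a quick check (not part of the original steps), each node should now reach the others without a password prompt, for example:

```bash
ssh hadoop103 hostname   # should print hadoop103 with no password prompt
```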

VII. Set Up the Cluster Distribution Script xsync

  1. 7.1 Create the xsync file under /usr/local/bin.
  2. Run: vi /usr/local/bin/xsync
  3. 7.2 The content of xsync is as follows:
```bash
#!/bin/bash
#1 Get the number of arguments; exit immediately if there are none
pcount=$#
if((pcount==0)); then
  echo no args;
  exit;
fi
#2 Get the file name
p1=$1
fname=`basename $p1`
echo fname=$fname
#3 Get the absolute path of the parent directory
pdir=`cd -P $(dirname $p1); pwd`
echo pdir=$pdir
#4 Get the current user name
user=`whoami`
#5 Loop over the target hosts
for((host=103; host<105; host++)); do
  #echo $pdir/$fname $user@hadoop$host:$pdir
  echo --------------- hadoop$host ----------------
  rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
done
```
  1. 7.3 Make the script executable.
  2. Run: chmod a+x xsync (a usage example follows)
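The script takes a single file or directory path and syncs it to hadoop103 and hadoop104; for example:

```bash
xsync /kkb/install/hadoop-3.1.4   # rsyncs the directory to the same path on hadoop103 and hadoop104
```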

VIII. Hadoop 3 Cluster Configuration

  1. 8.1 Run checknative.
  2. Run: cd /kkb/install/hadoop-3.1.4/
  3. Run: bin/hadoop checknative
  4. ![](https://img-blog.csdnimg.cn/12f58f690a964b8aa43a165501d13d2c.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  5. 8.2 Install openssl-devel.
  6. Run: yum -y install openssl-devel
  7. 8.3 Edit the HDFS and YARN configuration files.
  8. Run: cd /kkb/install/hadoop-3.1.4/etc/hadoop
  9. Run: vim hadoop-env.sh and append the following line:

```bash
export JAVA_HOME=/kkb/install/jdk1.8.0_162
```

  1. Run: vim core-site.xml and add the following content:
```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop102:8020</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/kkb/install/hadoop-3.1.4/hadoopDatas/tempDatas</value>
  </property>
  <!-- Buffer size; tune it to the server's capacity in production. Default: 4096 -->
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
  <!-- Enable the HDFS trash so deleted data can be recovered; value is in minutes. Default: 0 -->
  <property>
    <name>fs.trash.interval</name>
    <value>10080</value>
  </property>
</configuration>
```
  1. Run: vim hdfs-site.xml and add the following content:
```xml
<configuration>
  <!-- Paths where the NameNode stores metadata. In production, decide the disk mount points first; separate multiple directories with commas -->
  <!-- Dynamic commissioning/decommissioning of nodes
  <property>
    <name>dfs.hosts</name>
    <value>/kkb/install/hadoop-3.1.4/etc/hadoop/accept_host</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/kkb/install/hadoop-3.1.4/etc/hadoop/deny_host</value>
  </property>
  -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop102:9868</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop102:9870</value>
  </property>
  <!-- Where the NameNode stores the fsimage -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///kkb/install/hadoop-3.1.4/hadoopDatas/namenodeDatas</value>
  </property>
  <!-- Where the DataNode stores data. In production, decide the disk mount points first; separate multiple directories with commas -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///kkb/install/hadoop-3.1.4/hadoopDatas/datanodeDatas</value>
  </property>
  <!-- Where the NameNode stores the edits log -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>file:///kkb/install/hadoop-3.1.4/hadoopDatas/dfs/nn/edits</value>
  </property>
  <!-- Where the SecondaryNameNode stores the fsimage awaiting merge -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///kkb/install/hadoop-3.1.4/hadoopDatas/dfs/snn/name</value>
  </property>
  <!-- Where the SecondaryNameNode stores the edits log awaiting merge -->
  <property>
    <name>dfs.namenode.checkpoint.edits.dir</name>
    <value>file:///kkb/install/hadoop-3.1.4/hadoopDatas/dfs/nn/snn/edits</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
```
  1. Run: vim mapred-site.xml and add the following content:
```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop102:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop102:19888</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>
```
  1. Run: vi yarn-site.xml and add the following content:
```xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop102</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Jobs fail when vmem/pmem run short, so the resource checks are disabled here -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
```
  1. Run: vi workers and list all three nodes:
```
hadoop102
hadoop103
hadoop104
```
  1. 8.4 Create the data directories.
  2. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/tempDatas
  3. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/namenodeDatas
  4. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/datanodeDatas
  5. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/dfs/nn/edits
  6. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/dfs/snn/name
  7. Run: mkdir -p /kkb/install/hadoop-3.1.4/hadoopDatas/dfs/nn/snn/edits
  8. 8.5 Distribute the configuration with xsync.
  9. Run: xsync hadoop-3.1.4 (from the /kkb/install directory on hadoop102, to distribute Hadoop to hadoop103 and hadoop104)

IX. Start HDFS and YARN

  1. 9.1 Format the cluster on hadoop102 (format only once; do not reformat repeatedly).
  2. Run: hdfs namenode -format
  3. 9.2 In the hadoop-3.1.4 directory on hadoop102, start DFS and YARN.
  4. Run: sbin/start-dfs.sh
  5. Run: sbin/start-yarn.sh
  6. 9.3 Check the running processes with the jps command (the expected daemons are sketched after this list).
  7. ![](https://img-blog.csdnimg.cn/cb57e951621b4506afb0834f85e9206f.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_6,color_FFFFFF,t_70,g_se,x_16)![](https://img-blog.csdnimg.cn/ab14fe00cc614a4ca92ffe2512d036ff.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_7,color_FFFFFF,t_70,g_se,x_16)
  8. ![](https://img-blog.csdnimg.cn/c5001bdb7a434964b2732ffd745599c5.png)
  9. 9.4 Verify that the cluster started successfully.
  10. Open in a browser: 192.168.88.130:8088
  11. Open in a browser: 192.168.88.130:9870
  12. ![](https://img-blog.csdnimg.cn/ebe2bcbdcc3a4c95a49531d32d360afc.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  13. ![](https://img-blog.csdnimg.cn/8b6cd92534b34e21a3cab1cfcee9d0ed.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
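For step 9.3, given the configuration above (NameNode, SecondaryNameNode and ResourceManager on hadoop102; all three nodes listed in workers), jps on hadoop102 should list roughly the daemons below, while hadoop103/104 show only DataNode and NodeManager (PIDs omitted; exact output is in the screenshots):

```
NameNode
SecondaryNameNode
DataNode
ResourceManager
NodeManager
Jps
```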

X. Configure Hadoop on Windows

  1. 10.1 Edit the Windows hosts file.
  2. Path: C:\Windows\System32\drivers\etc\hosts
  3. ![](https://img-blog.csdnimg.cn/52917d929c504a4f8df2305bf25fcfa1.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  4. 10.2 Set up the Hadoop environment on Windows.
  5. Extract the same hadoop-3.1.4.tar.gz used by the cluster into a directory whose path contains no Chinese characters or spaces.
  6. 10.3 Configure the Hadoop environment variables.
  7. ![](https://img-blog.csdnimg.cn/a18668b854b44e1a80d92bc7eded68b3.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  8. ![](https://img-blog.csdnimg.cn/6d7c6135f15b4c5d88b3549d52448ab1.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_9,color_FFFFFF,t_70,g_se,x_16)
  9. 10.4 Copy the hadoop.dll file shown below into C:\Windows\System32.
  10. ![](https://img-blog.csdnimg.cn/819b5bc96a7b47a1abfb5a28b5be807e.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  11. 10.5 Copy the cluster's five configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, workers) into the Windows Hadoop directory C:\hadoop-3.1.4\etc\hadoop.
  12. 10.6 Open cmd and run a hadoop command (a quick check follows this list).
  13. ![](https://img-blog.csdnimg.cn/cb615145d04349d0aefc810de28ed55c.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_11,color_FFFFFF,t_70,g_se,x_16)
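A minimal check from cmd, assuming the environment variables from step 10.3 point at the extracted C:\hadoop-3.1.4 directory:

```
hadoop version
```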

XI. Install Maven

  1. 11.1 Download apache-maven-3.6.1-bin.zip, unzip it to a directory, and configure the environment variables.
  2. ![](https://img-blog.csdnimg.cn/2273d796aaa749b0805d77439eed8b58.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  3. ![](https://img-blog.csdnimg.cn/a48efdeb546d4aef9849f42ce1e16c69.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_10,color_FFFFFF,t_70,g_se,x_16)
  4. 11.2 Run mvn -v in cmd.
  5. ![](https://img-blog.csdnimg.cn/436eba03149440e08a666d3aea3fa962.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  6. 11.3 In the Maven installation directory, find the settings.xml file and add the content shown in the screenshots below.
  7. ![](https://img-blog.csdnimg.cn/5bf6ee3a169c4f42b0d4926f90e18e61.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  8. ![](https://img-blog.csdnimg.cn/6c392c7058aa48009ce4546997f06f2f.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  9. 11.4 Open IDEA, create a new Maven project, and configure the pom file as follows:
```xml
<properties>
    <hadoop.version>3.1.4</hadoop.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/junit/junit -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>RELEASE</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>1.2.17</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
                <!-- <verbal>true</verbal> -->
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```

XII. Implementing the Word-Count Program

  1. 12.1 Write the Mapper.
```java
package wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Get the current line of input
        String line = value.toString();
        // Break the line into words: normalize the delimiters to commas, then split
        String lineBuffer = line;
        String[] keys = new String[]{" ", "\t", " ", ".", "(", ")", "(", ")"};
        for (String k : keys) {
            lineBuffer = lineBuffer.replace(k, ",");
        }
        String[] wordsBuffer = lineBuffer.split(",");
        List<String> words = new ArrayList<>();
        for (String w : wordsBuffer) {
            if (!w.equals("")) {
                words.add(w);
            }
        }
        // Turn each word into a key-value pair
        for (String word : words) {
            // Emit the (word, 1) pair
            context.write(new Text(word), new IntWritable(1));
        }
    }
}
```
  1. 12.2 Write the Reducer.
```java
package wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // e.g. bear, List(2,3,3): each key arrives together with all of its counts
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            int count = value.get();
            sum += count;
        }
        context.write(key, new IntWritable(sum));
    }
}
```
  1. 12.3 Assemble the main program (driver).
```java
package wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int run = ToolRunner.run(new Configuration(), new WordCount(), args); // run on the cluster
        System.exit(run);
    }

    @Override
    public int run(String[] args) throws Exception {
        // args: [0] input path, [1] output path, [2] number of reduce tasks
        Job job = Job.getInstance(super.getConf(), "wordcount");
        job.setJarByClass(WordCount.class);
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setNumReduceTasks(Integer.parseInt(args[2]));
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }
}
```
  1. 12.4 Package the program: run Maven's package goal.
  2. ![](https://img-blog.csdnimg.cn/2dcbc6cb2d3d4798bc9232932430b158.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)

XIII. Run It on the Cluster

  1. 13.1 Connect to hadoop102 with FileZilla.
  2. Locate the packaged jar and the test file, and upload both to hadoop102.
  3. ![](https://img-blog.csdnimg.cn/2ca791df8d40411fb619e28eabc95528.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  4. 13.2 Upload the test file to HDFS.
  5. From the hadoop-3.1.4 directory:
  6. Run: bin/hdfs dfs -mkdir -p /test-wrh (creates the /test-wrh directory on HDFS)
  7. Run: bin/hdfs dfs -put <test file path> /test-wrh/ (uploads the test file into /test-wrh)
  8. ![](https://img-blog.csdnimg.cn/94ba732d10704ecdb085471fbd19a8bd.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  9. ![](https://img-blog.csdnimg.cn/48277b092321420ba133fc8136fa387d.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  10. 13.3 In IDEA, copy the class reference (the fully qualified name of the main class).
  11. ![](https://img-blog.csdnimg.cn/7a0c004ee11e4a858255f0224554afd2.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  12. 13.4 Run the program.
  13. From the directory containing the jar:
  14. Run: hadoop jar <jar name> <class reference> <input path> <output path> <number of reduce tasks; 3 here, one per node> (an example follows this list)
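For example (the jar name is hypothetical; the class reference comes from step 13.3, the input directory from step 13.2, and /output is an example output path that must not already exist):

```bash
hadoop jar wordcount-1.0-SNAPSHOT.jar wordcount.WordCount /test-wrh /output 3
```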

13.5 Download the results from HDFS

  1. ![](https://img-blog.csdnimg.cn/4e716e2a62d341eea3859f10c4a1c676.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_12,color_FFFFFF,t_70,g_se,x_16)
  2. Run: hadoop fs -get <output path>/part-r-00000 <local download path>

  1. 13.6 View the results.
  2. Run: vim part-r-00000
  3. ![](https://img-blog.csdnimg.cn/6bc02f11f7664a858a9ad913210bc552.png?x-oss-process=image/watermark,type_d3F5LXplbmhlaQ,shadow_50,text_Q1NETiBAd2VpeGluXzQ3MjMxNzEz,size_10,color_FFFFFF,t_70,g_se,x_16)

Reprinted from: https://blog.csdn.net/weixin_47231713/article/details/122156621
Copyright belongs to the original author, 陈信宇是大聪明. In case of infringement, please contact us for removal.
