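**Building a Hadoop Platform with Docker**
**0. Introduction**
This guide uses Docker to build a Hadoop platform, covering the installation of Docker, Java, Scala, Hadoop, HBase, and Spark.
The cluster has five machines with the hostnames h01, h02, h03, h04, and h05. h01 is the master; the others are slaves.
Recommended virtual machine configuration: 1 core with 2 threads, 8 GB of memory, and a 30 GB disk. An earlier setup with 4 GB of memory caused HBase and Spark to run abnormally.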
JDK 1.8
Scala 2.11.12
Hadoop 3.3.3
HBase 3.0.0 (the 3.0.0-alpha-3 release)
Spark 3.3.0
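**1. Docker**
**1.1 Installing Docker on Ubuntu 22.04**
On Ubuntu, Docker commands must be prefixed with sudo unless you are already logged in as root. Without sudo, Docker commands fail to run.
Install Docker from an administrator account.
mike@ubuntu2204:~$ wget -qO- https://get.docker.com/ | sh
After installation, start the Docker service with sudo.
mike@ubuntu2204:~$ sudo service docker start
List the containers currently running in Docker. Docker has only just been installed and no containers have been started, so the listing is empty, as shown below.
mike@ubuntu2204:~$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
mike@ubuntu2204:~$
**1.2 Using Docker**
Docker networks now provide DNS resolution, so a dedicated virtual network can be created for the Hadoop cluster with the command below. Pass-through, bridge, or macvlan networking could be used. Bridge mode is chosen here because it lets the five hosts reach one another as well as the host machine and the gateway, and gives them Internet access for downloading software.
mike@ubuntu2204:~$ sudo docker network create --driver=bridge hadoop
This command creates a virtual bridge network named hadoop, which provides automatic DNS resolution internally. Use the following command to list the Docker networks; the newly created hadoop bridge network appears in the output.
mike@ubuntu2204:~$ sudo docker network ls
[sudo] password for mike:
NETWORK ID NAME DRIVER SCOPE
3948edc3e8f3 bridge bridge local
337965dd9b1e hadoop bridge local
cb8f2c453adc host host local
fff4bd1c15ee mynet macvlan local
30e1132ad754 none null local
mike@ubuntu2204:~$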
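Find the Ubuntu image
Open https://hub.docker.com/, search for ubuntu, and locate the officially verified image; the first result is used here.
Click the first ubuntu entry and look through the available versions; 22.04 is chosen here.
Download the Ubuntu 22.04 image: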
mike@ubuntu2204:~$ sudo docker pull ubuntu:22.04
List the downloaded images:
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 3 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
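Start a container from the image. The prompt shows that the shell is now the container's shell. Note that in this transcript the ID after @ is identical to the image ID listed above; normally the prompt shows the container's own ID, which Docker generates independently of the image ID.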
mike@ubuntu2204:~$ sudo docker run -it ubuntu:22.04 /bin/bash
root@27941809078c:/#
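Typing exit leaves the container and stops it. It is better to press Ctrl + P followed by Ctrl + Q, which detaches from the container while leaving it running in the background.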
mike@ubuntu2204:~$
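List all containers on this machine. The container just created would appear here, running in the background. Because this tutorial was written up afterwards, only the five Hadoop containers were kept to save memory, and the original container had already been deleted.
mike@ubuntu2204:~$ sudo docker ps -a
[sudo] password for mike:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8016da5278ae newuhadoop "/bin/bash" 3 days ago Up 2 days h05
409c7e8aa2e9 newuhadoop "/bin/bash" 3 days ago Up 2 days h04
0d8af236e1e7 newuhadoop "/bin/bash" 3 days ago Up 2 days h03
72d62b7d4874 newuhadoop "/bin/bash" 3 days ago Up 2 days h02
d4d3ca3bbb61 newuhadoop "/bin/bash" 3 days ago Up 2 days 0.0.0.0:8088->8088/tcp, :::8088->8088/tcp, 0.0.0.0:9870->9870/tcp, :::9870->9870/tcp h01
mike@ubuntu2204:~$
Start a container that is in the exited state; the last argument is the container ID.
mike@ubuntu2204:~$ sudo docker start 27941809078c
Attach to a container.
mike@ubuntu2204:~$ sudo docker attach 27941809078c
Stop a container.
mike@ubuntu2204:~$ sudo docker stop 27941809078c
**2. Installing the cluster**
This part mainly installs the JDK 1.8 environment (Spark needs Scala, and Scala needs JDK 1.8) together with Hadoop, in order to build the base image.
**2.1 Installing Java and Scala**
Enter the Ubuntu container created earlier.
Switch the apt sources first.
**2.1.1 Changing the apt sources**
Back up the source list.
root@27941809078c:/# cp /etc/apt/sources.list /etc/apt/sources_init.list
root@27941809078c:/#
Delete the old source file first, since vim is not yet available at this point.
root@27941809078c:/# rm /etc/apt/sources.list
Copy the following command and press Enter to switch to the Aliyun Ubuntu 22.04 mirror in one step (you are already root at this point, so the prompt is #):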
bash -c "cat << EOF > /etc/apt/sources.list && apt update
deb http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-security main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-updates main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-proposed main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
deb-src http://mirrors.aliyun.com/ubuntu/ jammy-backports main restricted universe multiverse
EOF"
Then run apt update and apt upgrade: update refreshes the package lists, and upgrade updates the installed packages.
root@27941809078c:/# apt update
root@27941809078c:/# apt upgrade
**2.1.2 Installing Java and Scala**
Install JDK 1.8 by entering the command directly:
root@27941809078c:/# apt install openjdk-8-jdk
Test the installation:
root@27941809078c:/# java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
root@27941809078c:/#
Next, install Scala:
root@27941809078c:/# apt install scala
Test the installation by starting the Scala REPL; type :quit to exit it.
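root@27941809078c:/# scala
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_312).
Type in expressions for evaluation. Or try :help.
scala>
**2.2 Installing Hadoop**
Finish the configuration in the current container.
Export the container as an image.
Create five containers from this image and assign each one a hostname.
Enter the h01 container and start Hadoop.
**2.2.1 Installing Vim and the network tools**
Install vim for editing files.
root@27941809078c:/# apt install vim
Install the net-tools, iputils-ping, and iproute2 packages to get commands such as ping, ifconfig, and ip.
root@27941809078c:/# apt install net-tools
root@27941809078c:/# apt install iputils-ping
root@27941809078c:/# apt install iproute2
**2.2.2 Installing SSH**
Install SSH and configure passwordless login. Because all the later containers are started from the same image, they are like five locks and keys cast from one mold and can open one another. Configuring passwordless SSH login from the current container to itself is therefore enough.
Install the SSH server.
root@27941809078c:/# apt install openssh-server
Install the SSH client.
root@27941809078c:/# apt install openssh-client
Go to the current user's home directory.
root@27941809078c:/# cd ~
root@27941809078c:~#
Generate a key pair. Nothing needs to be typed; just press Enter at each prompt. The keys are created in the .ssh directory under the user's home directory. Files and directories whose names start with . are hidden from ls; use ls -al to see them.
root@27941809078c:~# ssh-keygen -t rsa -P ""
Append the public key to the authorized_keys file.
root@27941809078c:~# cat .ssh/id_rsa.pub >> .ssh/authorized_keys
root@27941809078c:~#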
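The key generation and authorization steps can also be run without any prompts, which helps when the image is rebuilt. The following is a minimal sketch rather than the procedure used in this tutorial: the variable SSH_DIR and its scratch-directory default are assumptions made for illustration, and in the container it would be set to ~/.ssh.

```shell
#!/bin/sh
# Generate a passphrase-less RSA key pair and authorize it, without any prompts.
# SSH_DIR is an illustrative variable; it defaults to a scratch directory so
# that running the sketch does not touch a real ~/.ssh.
SSH_DIR=${SSH_DIR:-$(mktemp -d)}

if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase, -f names the key file, -q suppresses output
    ssh-keygen -q -t rsa -N "" -f "$SSH_DIR/id_rsa"
    cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
    # With StrictModes (the sshd default), key files writable by others are rejected.
    chmod 700 "$SSH_DIR"
    chmod 600 "$SSH_DIR/authorized_keys"
    echo "key pair and authorized_keys created in $SSH_DIR"
else
    echo "ssh-keygen not found; install openssh-client first"
fi
```

Passing -f avoids the question about where to save the key, which -P "" alone does not suppress.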
Start the SSH service
root@27941809078c:~# service ssh start
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@27941809078c:~#
Log in to the container itself without a password
root@27941809078c:~# ssh 127.0.0.1
Welcome to Ubuntu 22.04 LTS (GNU/Linux 5.15.0-41-generic x86_64)
Documentation: https://help.ubuntu.com
Management: https://landscape.canonical.com
Support: https://ubuntu.com/advantage
This system has been minimized by removing packages and content that are
not required on a system that users do not log into.
To restore this content, you can run the 'unminimize' command.
Last login: Sun Jul 17 08:26:15 2022 from 172.18.0.1
- Starting OpenBSD Secure Shell server sshd
root@27941809078c:~#
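Modify the .bashrc file so that the SSH service starts automatically whenever a shell starts.
Open .bashrc with vim:
root@27941809078c:~# vim ~/.bashrc
Press i to put vim into insert mode; -- INSERT -- appears in the lower-left corner of the terminal. Move the cursor to the end of the file (in normal mode, Shift + G jumps straight to the last line) and add one line:
service ssh start
After the addition the file ends as follows (only the last few lines are shown):
if [ -f ~/.bash_aliases ]; then
. ~/.bash_aliases
fi
# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
#if [ -f /etc/bash_completion ] && ! shopt -oq posix; then
# . /etc/bash_completion
#fi
service ssh start
Press Esc to leave insert mode.
Type a colon : (with the English input method); a colon appears in the lower-left corner of the terminal.
Then type the three characters wq!, which form a combined command:
w means write (save)
q means quit
! means force
Press Enter to exit vim.
Passwordless SSH login is now fully configured.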
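The same .bashrc change can be made without opening vim. This is a minimal sketch, assuming it is acceptable to append the line only when it is not already present; the helper name ensure_line is hypothetical, and the demonstration writes to a scratch file rather than the real ~/.bashrc.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already there.
# ensure_line is an illustrative helper, not a standard command.
ensure_line() {
    line=$1
    file=$2
    # -q quiet, -x match the whole line, -F fixed string, -e marks the pattern
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstration on a scratch file; in the container the target would be ~/.bashrc.
demo_rc=$(mktemp)
ensure_line 'service ssh start' "$demo_rc"
ensure_line 'service ssh start' "$demo_rc"   # the second call changes nothing
cat "$demo_rc"
```

Because the check is idempotent, the command can be repeated safely each time the base image is rebuilt.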
**2.2.3 Installing Hadoop**
Download the Hadoop installation archive.
root@27941809078c:~# wget https://mirrors.aliyun.com/apache/hadoop/common/hadoop-3.3.3/hadoop-3.3.3.tar.gz
Extract it into /usr/local and rename the directory.
root@27941809078c:~# tar -zxvf hadoop-3.3.3.tar.gz -C /usr/local/
root@27941809078c:~# cd /usr/local/
root@27941809078c:/usr/local# mv hadoop-3.3.3 hadoop
root@27941809078c:/usr/local#
Edit /etc/profile and add the following environment variables to it.
First open /etc/profile with vim:
vim /etc/profile
Append the following content. JAVA_HOME is the JDK installation path; this is the path used when the JDK is installed with apt, and update-alternatives --config java shows it.
#java
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
#hadoop
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HDFS_DATANODE_USER=root
export HDFS_DATANODE_SECURE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_NAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Apply the environment variables:
root@27941809078c:/usr/local# source /etc/profile
root@27941809078c:/usr/local#
In the directory /usr/local/hadoop/etc/hadoop, modify six important configuration files.
Edit hadoop-env.sh and append the following at the end of the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
Change core-site.xml to:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://h01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop3/hadoop/tmp</value>
  </property>
</configuration>
Change hdfs-site.xml to:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop3/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop3/hadoop/hdfs/data</value>
  </property>
</configuration>
Change mapred-site.xml to:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/local/hadoop/etc/hadoop,
      /usr/local/hadoop/share/hadoop/common/*,
      /usr/local/hadoop/share/hadoop/common/lib/*,
      /usr/local/hadoop/share/hadoop/hdfs/*,
      /usr/local/hadoop/share/hadoop/hdfs/lib/*,
      /usr/local/hadoop/share/hadoop/mapreduce/*,
      /usr/local/hadoop/share/hadoop/mapreduce/lib/*,
      /usr/local/hadoop/share/hadoop/yarn/*,
      /usr/local/hadoop/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>
Change yarn-site.xml to:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>h01</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Change the workers file to:
h01
h02
h03
h04
h05
Hadoop is now configured.
**2.2.4 Starting the cluster in Docker**
First export the current container as an image, then list the images. Press Ctrl + P followed by Ctrl + Q to detach from the container and return to the host.
mike@ubuntu2204:~$ sudo docker commit -m "hadoop" -a "hadoop" 27941809078c newuhadoop
sha256:648d8e082a231919faeaa14e09f5ce369b20879544576c03ef94074daf978823
mike@ubuntu2204:~$ sudo docker images
[sudo] password for mike:
REPOSITORY TAG IMAGE ID CREATED SIZE
newuhadoop latest fe08b5527281 4 days ago 2.11GB
ubuntu 22.04 27941809078c 6 weeks ago 77.8MB
mike@ubuntu2204:~$
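The XML files from section 2.2.3 can also be written with a here-document instead of an editor, which makes it easier to repeat the configuration before committing an image. The sketch below writes only core-site.xml, using the two properties given in that section; the variable CONF_DIR and its scratch-directory default are assumptions for illustration, and in the container it would point at /usr/local/hadoop/etc/hadoop.

```shell
#!/bin/sh
# CONF_DIR is an illustrative variable; it defaults to a scratch directory
# so the sketch is safe to run outside the container.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# The quoted 'EOF' keeps the shell from expanding anything inside the XML.
cat > "$CONF_DIR/core-site.xml" << 'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://h01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop3/hadoop/tmp</value>
  </property>
</configuration>
EOF

# Quick sanity check: the NameNode address must be present in the file.
grep -q 'hdfs://h01:9000' "$CONF_DIR/core-site.xml" \
    && echo "core-site.xml written to $CONF_DIR"
```

The other site files could be produced the same way, one here-document per file.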
Open five terminals and run the following commands, one in each.
The first command starts h01, which acts as the master node, so it publishes ports for access to the web pages.
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h01" --name "h01" -p 9870:9870 -p 8088:8088 newuhadoop /bin/bash
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h01:/#
The other four commands are almost identical. Note: after starting each container, press Ctrl + P followed by Ctrl + Q to return to the host before starting the next one.
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h02" --name "h02" newuhadoop /bin/bash
[sudo] password for mike:
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h02:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h03" --name "h03" newuhadoop /bin/bash
[sudo] password for mike:
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h03:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h04" --name "h04" newuhadoop /bin/bash
[sudo] password for mike:
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h04:/#
mike@ubuntu2204:~$ sudo docker run -it --network hadoop -h "h05" --name "h05" newuhadoop /bin/bash
[sudo] password for mike:
- Starting OpenBSD Secure Shell server sshd
[ OK ]
root@h05:/#
Next, start the Hadoop cluster on h01.
Format the NameNode first; without formatting, HDFS will not start.
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
Go to Hadoop's sbin directory
root@h01:/# cd /usr/local/hadoop/sbin/
root@h01:/usr/local/hadoop/sbin#
Start Hadoop
root@h01:/usr/local/hadoop/sbin# ./start-all.sh
Starting namenodes on [h01]
h01: Warning: Permanently added 'h01,172.18.0.2' (ECDSA) to the list of known
hosts.
Starting datanodes
h05: Warning: Permanently added 'h05,172.18.0.6' (ECDSA) to the list of known
hosts.
h02: Warning: Permanently added 'h02,172.18.0.3' (ECDSA) to the list of known
hosts.
h03: Warning: Permanently added 'h03,172.18.0.4' (ECDSA) to the list of known
hosts.
h04: Warning: Permanently added 'h04,172.18.0.5' (ECDSA) to the list of known
hosts.
h03: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h05: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h02: WARNING: /usr/local/hadoop/logs does not exist. Creating.
h04: WARNING: /usr/local/hadoop/logs does not exist. Creating.
Starting secondary namenodes [h01]
Starting resourcemanager
Starting nodemanagers
root@h01:/usr/local/hadoop/sbin#
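The five docker run commands used earlier in this section to create h01 through h05 differ only in the hostname and in the port mappings for h01, so they can be produced by a loop. This is a sketch under stated assumptions: the function name print_run_commands is invented here, the -d flag stands in for detaching manually with Ctrl + P, Ctrl + Q, and the commands are only printed so they can be inspected before being run with sudo.

```shell
#!/bin/sh
# Print one "docker run" command per cluster node (h01 .. h05).
# print_run_commands is a hypothetical helper; nothing is executed here.
print_run_commands() {
    image=$1
    for i in 1 2 3 4 5; do
        host="h0$i"
        ports=""
        # Only the master publishes the HDFS (9870) and YARN (8088) web ports.
        if [ "$host" = "h01" ]; then
            ports=" -p 9870:9870 -p 8088:8088"
        fi
        # -itd keeps a terminal attached to bash but starts the container detached.
        echo "docker run -itd --network hadoop -h $host --name $host$ports $image /bin/bash"
    done
}

print_run_commands newuhadoop
```

Printing first keeps the sketch harmless; the reviewed lines can then be run one by one on the host.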
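Use jps to check which cluster processes are running. The list is not fixed and varies with the applications in use, but there should be at least three. The listing below was captured after HBase and Spark had also been started, which is why their processes appear in it.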
root@h01:~# jps
10017 HRegionServer
10609 Master
9778 HQuorumPeer
8245 SecondaryNameNode
8087 DataNode
9881 HMaster
41081 Jps
10684 Worker
7965 NameNode
8477 ResourceManager
8591 NodeManager
root@h01:~#
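Checking the jps listing by eye becomes tedious with five nodes. The following sketch reads jps output on standard input and reports any expected daemon that is missing. The function name missing_daemons and the sample input are illustrative, and the list of expected names is an assumption that should match the role of the node being checked.

```shell
#!/bin/sh
# Report expected Java daemons that do not appear in jps output.
# Usage: jps | missing_daemons NameNode DataNode ...
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    listing=$(cat)                      # jps output: "<pid> <name>" per line
    for name in "$@"; do
        # Compare the second column exactly, so that "NameNode"
        # does not match "SecondaryNameNode".
        printf '%s\n' "$listing" \
            | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
            || echo "missing: $name"
    done
}

# Sample input resembling a worker node; on a real node pipe in `jps` instead.
printf '%s\n' '8087 DataNode' '8591 NodeManager' '41081 Jps' \
    | missing_daemons DataNode NodeManager HRegionServer
```

On h01 the expected list would at least include NameNode, SecondaryNameNode, and ResourceManager, as seen in the listing above.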
Use the command ./hdfs dfsadmin -report to view the status of the distributed file system
root@h01:/usr/local/hadoop/bin# ./hdfs dfsadmin -report
Configured Capacity: 90810798080 (84.57 GB)
Present Capacity: 24106247929 (22.45 GB)
DFS Remaining: 24097781497 (22.44 GB)
DFS Used: 8466432 (8.07 MB)
DFS Used%: 0.04%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Live datanodes (5):
Name: 172.18.0.2:9866 (h01)
Hostname: h01
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 2875392 (2.74 MB)
Non DFS Used: 11887669248 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.02%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:36:54 GMT 2022
Num of Blocks: 293
Name: 172.18.0.3:9866 (h02.hadoop)
Hostname: h02
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1396736 (1.33 MB)
Non DFS Used: 11889147904 (11.07 GB)
DFS Remaining: 4846399828 (4.51 GB)
DFS Used%: 0.01%
DFS Remaining%: 26.68%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 8
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Tue Jul 19 23:51:39 GMT 2022
Num of Blocks: 153
Name: 172.18.0.4:9866 (h03.hadoop)
Hostname: h03
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1323008 (1.26 MB)
Non DFS Used: 11889221632 (11.07 GB)
DFS Remaining: 5114835114 (4.76 GB)
DFS Used%: 0.01%
DFS Remaining%: 28.16%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:14:39 GMT 2022
Num of Blocks: 151
Name: 172.18.0.5:9866 (h04.hadoop)
Hostname: h04
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1527808 (1.46 MB)
Non DFS Used: 11889016832 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 00:42:09 GMT 2022
Num of Blocks: 134
Name: 172.18.0.6:9866 (h05.hadoop)
Hostname: h05
Decommission Status : Normal
Configured Capacity: 18162159616 (16.91 GB)
DFS Used: 1343488 (1.28 MB)
Non DFS Used: 11889201152 (11.07 GB)
DFS Remaining: 4712182185 (4.39 GB)
DFS Used%: 0.01%
DFS Remaining%: 25.95%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 10
Last contact: Wed Jul 20 04:55:01 GMT 2022
Last Block Report: Wed Jul 20 02:36:21 GMT 2022
Num of Blocks: 149
root@h01:/usr/local/hadoop/bin#
Visit ports 8088 and 9870 on the host machine to see the monitoring pages.
The Hadoop cluster is now built.
**2.2.5 Running the built-in WordCount example**
Use the license file as the file to be counted.
root@h01:/usr/local/hadoop# cat LICENSE.txt > file1.txt
root@h01:/usr/local/hadoop# ls
Create an input directory in HDFS.
root@h01:/usr/local/hadoop/bin# ./hadoop fs -mkdir /input
root@h01:/usr/local/hadoop/bin#
Upload file1.txt to HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -put ../file1.txt /input
root@h01:/usr/local/hadoop/bin#
List the contents of the input directory in HDFS
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /input
Found 1 items
-rw-r--r-- 2 root supergroup 15217 2022-07-17 08:50 /input/file1.txt
root@h01:/usr/local/hadoop/bin#
Run the wordcount example program. The command and its output are as follows:
root@h01:/usr/local/hadoop/bin# ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.3.jar wordcount /input /output
2022-07-20 05:12:38,394 INFO client.DefaultNoHARMFailoverProxyProvider:
Connecting to ResourceManager at h01/172.18.0.2:8032
2022-07-20 05:12:38,816 INFO mapreduce.JobResourceUploader: Disabling Erasure
Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1658047711391_0002
2022-07-20 05:12:39,076 INFO input.FileInputFormat: Total input files to process
: 1
2022-07-20 05:12:39,198 INFO mapreduce.JobSubmitter: number of splits:1
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1658047711391_0002
2022-07-20 05:12:39,399 INFO mapreduce.JobSubmitter: Executing with tokens: []
2022-07-20 05:12:39,674 INFO conf.Configuration: resource-types.xml not found
2022-07-20 05:12:39,674 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2022-07-20 05:12:39,836 INFO impl.YarnClientImpl: Submitted application
application_1658047711391_0002
2022-07-20 05:12:39,880 INFO mapreduce.Job: The url to track the job:
http://h01:8088/proxy/application_1658047711391_0002/
2022-07-20 05:12:39,882 INFO mapreduce.Job: Running job: job_1658047711391_0002
2022-07-20 05:12:49,171 INFO mapreduce.Job: Job job_1658047711391_0002 running in
uber mode : false
2022-07-20 05:12:49,174 INFO mapreduce.Job: map 0% reduce 0%
2022-07-20 05:12:54,285 INFO mapreduce.Job: map 100% reduce 0%
2022-07-20 05:13:01,356 INFO mapreduce.Job: map 100% reduce 100%
2022-07-20 05:13:02,391 INFO mapreduce.Job: Job job_1658047711391_0002 completed
successfully
2022-07-20 05:13:02,524 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=12507
FILE: Number of bytes written=577413
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=15313
HDFS: Number of bytes written=9894
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3141
Total time spent by all reduces in occupied slots (ms)=3811
Total time spent by all map tasks (ms)=3141
Total time spent by all reduce tasks (ms)=3811
Total vcore-milliseconds taken by all map tasks=3141
Total vcore-milliseconds taken by all reduce tasks=3811
Total megabyte-milliseconds taken by all map tasks=3216384
Total megabyte-milliseconds taken by all reduce tasks=3902464
Map-Reduce Framework
Map input records=270
Map output records=1672
Map output bytes=20756
Map output materialized bytes=12507
Input split bytes=96
Combine input records=1672
Combine output records=657
Reduce input groups=657
Reduce shuffle bytes=12507
Reduce input records=657
Reduce output records=657
Spilled Records=1314
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=126
CPU time spent (ms)=1110
Physical memory (bytes) snapshot=474148864
Virtual memory (bytes) snapshot=5063700480
Total committed heap usage (bytes)=450887680
Peak Map Physical memory (bytes)=288309248
Peak Map Virtual memory (bytes)=2528395264
Peak Reduce Physical memory (bytes)=185839616
Peak Reduce Virtual memory (bytes)=2535305216
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=15217
File Output Format Counters
Bytes Written=9894
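For a small input such as file1.txt, the result of the MapReduce job can be cross-checked locally. The pipeline below is a sketch of the same idea as wordcount (split on whitespace, then count identical tokens). It is not how Hadoop implements the example, and the function name local_wordcount is made up for illustration.

```shell
#!/bin/sh
# Count whitespace-separated tokens read from standard input.
# Output format: "<word><TAB><count>", sorted by word in byte order.
local_wordcount() {
    tr -s '[:space:]' '\n' |        # put one token on each line
        grep -v '^$' |              # drop the empty line a leading blank can produce
        LC_ALL=C sort |             # group identical tokens together
        uniq -c |                   # prefix each distinct token with its count
        awk '{ print $2 "\t" $1 }'  # reorder to word, tab, count
}

printf 'the quick fox\nthe lazy dog the end\n' | local_wordcount
```

On the cluster, the output of local_wordcount < ../file1.txt could be compared with ./hadoop fs -cat /output/part-r-00000.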
View the contents of the /output directory in HDFS.
root@h01:/usr/local/hadoop/bin# ./hadoop fs -ls /output
Found 2 items
-rw-r--r-- 2 root supergroup 0 2022-07-20 05:13 /output/_SUCCESS
-rw-r--r-- 2 root supergroup 9894 2022-07-20 05:13 /output/part-r-00000
root@h01:/usr/local/hadoop/bin#
View the contents of the part-r-00000 file.
root@h01:/usr/local/hadoop/bin# ./hadoop fs -cat /output/part-r-00000
This concludes the Hadoop part.
**2.3 Installing HBase**
Install HBase on top of the Hadoop cluster.
Download HBase 3.0.0 (the 3.0.0-alpha-3 release).
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/3.0.0-alpha-3/hbase-3.0.0-alpha-3-bin.tar.gz
Extract it into /usr/local. The archive name carries the alpha-3 suffix while the rest of this guide uses /usr/local/hbase-3.0.0, so rename the extracted directory to match.
root@h01:~# tar -zxvf hbase-3.0.0-alpha-3-bin.tar.gz -C /usr/local/
root@h01:~# mv /usr/local/hbase-3.0.0-alpha-3 /usr/local/hbase-3.0.0
Edit the /etc/profile environment file and append the following lines to add the HBase environment variables:
export HBASE_HOME=/usr/local/hbase-3.0.0
export PATH=$PATH:$HBASE_HOME/bin
Apply the environment file:
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
Use ssh h02 to enter the h02 container and edit its profile in the same way, then do the same for h03, h04, and h05. In other words, every container needs those two environment lines appended to /etc/profile.
Modify the configuration in the directory /usr/local/hbase-3.0.0/conf.
Edit hbase-env.sh and append:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HBASE_MANAGES_ZK=true
Change hbase-site.xml to:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://h01:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>h01:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>h01,h02,h03,h04,h05</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/hadoop/zoodata</value>
  </property>
</configuration>
Change the regionservers file to:
h01
h02
h03
h04
h05
Use scp to copy the configured HBase directory to the other four containers
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h02:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h03:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h04:/usr/local/
root@h01:~# scp -r /usr/local/hbase-3.0.0 root@h05:/usr/local/
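The four scp commands above follow the host list already stored in the regionservers file, so they can be derived from it. A sketch that prints the commands for review: print_scp_commands is a hypothetical name, and the scratch hosts file stands in for conf/regionservers.

```shell
#!/bin/sh
# Print one scp command per host listed in a hosts file (one hostname per line),
# skipping the machine the directory is copied from.
# print_scp_commands is an illustrative helper; it executes nothing.
print_scp_commands() {
    dir=$1          # directory to replicate, e.g. /usr/local/hbase-3.0.0
    hosts_file=$2   # e.g. conf/regionservers
    self=$3         # hostname of the source node, e.g. h01
    while read -r host; do
        # Ignore blank lines and the source node itself.
        if [ -n "$host" ] && [ "$host" != "$self" ]; then
            echo "scp -r $dir root@$host:/usr/local/"
        fi
    done < "$hosts_file"
}

# Demonstration with a scratch copy of the regionservers list from this section.
demo_hosts=$(mktemp)
printf '%s\n' h01 h02 h03 h04 h05 > "$demo_hosts"
print_scp_commands /usr/local/hbase-3.0.0 "$demo_hosts" h01
```

Deriving the commands from the host list keeps the copy step in step with the configuration if a node is added later.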
Start HBase
root@h01:/usr/local/hbase-3.0.0/bin# ./start-hbase.sh
h04: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h04.out
h02: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h02.out
h03: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h03.out
h05: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h05.out
h01: running zookeeper, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-zookeeper-h01.out
running master, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase--master-h01.out
h05: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h05.out
h01: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h01.out
h04: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h04.out
h03: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h03.out
h02: running regionserver, logging to /usr/local/hbase-3.0.0/bin/../logs/hbase-root-regionserver-h02.out
root@h01:/usr/local/hbase-3.0.0/bin#
Open the HBase shell
root@h01:/usr/local/hbase-3.0.0/bin# ./hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/book.html#shell
Version 3.0.0-alpha-3, rb3657484850f9fa9679f2186bf53e7df768f21c7, Wed Jun 15
07:56:54 UTC 2022
Took 0.0017 seconds
hbase:001:0>
Testing HBase
Create the table member
hbase:006:0> create 'member','id','address','info'
Created table member
Took 0.6838 seconds
=> Hbase::Table - member
hbase:007:0>
Add data, then view the data in the table
hbase:007:0> put 'member', 'debugo','id','11'
Took 0.1258 seconds
hbase:008:0> put 'member', 'debugo','info:age','27'
Took 0.0108 seconds
hbase:009:0> count 'member'
1 row(s)
Took 0.0499 seconds
=> 1
hbase:010:0> scan 'member'
ROW COLUMN+CELL
debugo column=id:, timestamp=2022-07-20T05:37:58.720, value=11
debugo column=info:age, timestamp=2022-07-20T05:38:11.302, value=27
1 row(s)
Took 0.0384 seconds
hbase:011:0>
**2.4 Installing Spark**
Install Spark on top of Hadoop.
Download Spark 3.3.0.
root@h01:~# wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
Extract it into /usr/local.
root@h01:~# tar -zxvf spark-3.3.0-bin-hadoop3.tgz -C /usr/local/
Rename the directory.
root@h01:~# cd /usr/local/
root@h01:/usr/local# mv spark-3.3.0-bin-hadoop3 spark-3.3.0
Edit the /etc/profile environment file and append the following lines to add the Spark environment variables:
export SPARK_HOME=/usr/local/spark-3.3.0
export PATH=$PATH:$SPARK_HOME/bin
Apply the environment file:
root@h01:/usr/local# source /etc/profile
root@h01:/usr/local#
Use ssh h02 and so on to enter the other four containers and make the same change in each. In other words, every container needs those two environment lines appended to /etc/profile.
Modify the configuration in the directory /usr/local/spark-3.3.0/conf.
Rename the environment template:
root@h01:/usr/local/spark-3.3.0/conf# mv spark-env.sh.template spark-env.sh
root@h01:/usr/local/spark-3.3.0/conf#
Edit spark-env.sh and append:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SCALA_HOME=/usr/share/scala
export SPARK_MASTER_HOST=h01
export SPARK_MASTER_IP=h01
export SPARK_WORKER_MEMORY=4g
Rename the worker list template. Spark 3.x ships it as workers.template; older releases named it slaves.template.
root@h01:/usr/local/spark-3.3.0/conf# mv workers.template workers
root@h01:/usr/local/spark-3.3.0/conf#
Change the workers file to:
h01
h02
h03
h04
h05
Use scp to copy the configured Spark directory to the other four containers.
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h02:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h03:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h04:/usr/local/
root@h01:/usr/local# scp -r /usr/local/spark-3.3.0 root@h05:/usr/local/
Start Spark
root@h01:/usr/local/spark-3.3.0/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-3.3.0/logs/spark--org.apache.spark.deploy.master.Master-1-h01.out
h03: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h03.out
h02: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h02.out
h04: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h04.out
h05: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h05.out
h01: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.3.0/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-h01.out
root@h01:/usr/local/spark-3.3.0/sbin#
**3. Other**
**3.1 Reformatting HDFS**
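Reference: https://blog.csdn.net/gis_101/article/details/52821946
Reformatting deletes all data in the cluster, so consider backing up or migrating the data before formatting.
First, on the master node (the NameNode), delete the contents of Hadoop's temporary directory tmp, the NameNode metadata directory dfs/name, and the Hadoop log directory (delete the contents of the directories, not the directories themselves).
Then, on every data node (DataNode), delete the contents of the temporary directory tmp, the HDFS storage directories (dfs/name and, for block data, dfs/data), and the Hadoop log directory.
Format a new distributed file system:
root@h01:/usr/local/hadoop/bin# ./hadoop namenode -format
Notes:
The Hadoop temporary directory tmp is set by the hadoop.tmp.dir property in core-site.xml, whose default value is /tmp/hadoop-${user.name}. If hadoop.tmp.dir is not configured, Hadoop creates a directory under /tmp when formatting. For example, if Hadoop is installed and configured under the user cloud, the temporary directory is /tmp/hadoop-cloud.
The NameNode metadata directory is set by the dfs.namenode.name.dir property in hdfs-site.xml, whose default value is ${hadoop.tmp.dir}/dfs/name. If this property is not configured either, Hadoop creates the directory itself when formatting. Before formatting, be sure to clear the HDFS storage directories on all worker nodes (the DataNodes); otherwise the daemons on those nodes fail to start. The reason is that each format of the NameNode writes a new clusterID and namespaceID into the VERSION file under dfs/name/current. If a worker node still has its old current directory, formatting does not recreate it, so the clusterID and namespaceID on the worker no longer match those of the master (the NameNode), and Hadoop ultimately fails to start.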
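The ID mismatch described in the notes can be checked before starting the cluster by comparing the clusterID recorded in the VERSION files. The sketch below assumes the usual key=value layout of those files. The function names and the sample IDs are made up, and the real files would be found in the current subdirectory of the NameNode and DataNode storage directories.

```shell
#!/bin/sh
# Extract the clusterID value from a Hadoop VERSION file (key=value lines).
cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Succeed only when both files carry the same, non-empty clusterID.
same_cluster() {
    a=$(cluster_id "$1")
    b=$(cluster_id "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Demonstration with two scratch files holding made-up IDs.
nn=$(mktemp)
dn=$(mktemp)
printf 'namespaceID=1\nclusterID=CID-aaaa\n' > "$nn"
printf 'storageID=DS-1\nclusterID=CID-bbbb\n' > "$dn"
if same_cluster "$nn" "$dn"; then
    echo "clusterID matches"
else
    echo "clusterID differs: $(cluster_id "$nn") vs $(cluster_id "$dn")"
fi
```

A differing ID on a worker indicates that its storage directory predates the last format and needs to be cleared as described above.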
Copyright belongs to the original author, 汉卿HanQ. In case of infringement, please contact us for removal.