Hadoop Cluster Installation on VMware VMs
Set Up Three CentOS 7 VMs (MacBook M1, ARM architecture)
- VM1: hadoop1: 4G RAM + 20G Disk
- VM2: hadoop2: 2G RAM + 20G Disk
- VM3: hadoop3: 2G RAM + 20G Disk
Take host "hadoop2" as the VM setup example
- Select the ISO image from the host machine

- Select OS: Debian 10


- Select "Install CentOS 7"

- Select the startup disk


- Select the GNOME GUI (desktop environment)


- Select the timezone

- Enable the network and set the host name
- Note down the network interface (it can be re-checked later with the commands sketched below): ens160
- IP address: 192.168.57.135
- Default route (gateway): 192.168.57.2
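These values can also be re-checked from a terminal after logging in; a minimal sketch, assuming the interface name ens160 noted above:
# show the interface and its IP address
ip addr show ens160
# show the default route (gateway)
ip route | grep default
# show the DNS servers in use
cat /etc/resolv.conf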

- Create the hadoop user

- Begin the installation

- During installation

- When the installation finishes, click Reboot

- Accept the license agreement

- Complete the CentOS installation

- Log in to the GUI as user hadoop

- Enable date & time updates to keep time synchronized across the nodes
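The GUI toggle enables NTP under the hood; an equivalent command-line sketch, assuming the chronyd service that CentOS 7 ships by default:
# enable and start the NTP client
sudo systemctl enable chronyd
sudo systemctl start chronyd
# confirm the node is syncing against a time source
chronyc sources -v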

Configure a Static IP Address
- Use the FinalShell client to SSH into the three VMs
- Log in as the hadoop user



- Edit the network config file on all 3 machines
- hadoop1 192.168.57.134
- hadoop2 192.168.57.135
- hadoop3 192.168.57.136
# Edit ifcfg-{network interface}
sudo vim /etc/sysconfig/network-scripts/ifcfg-ens160
...
BOOTPROTO=static
...
# append the following (use each host's own IP for IPADDR)
IPADDR=192.168.57.134
GATEWAY=192.168.57.2
NETMASK=255.255.255.0
DNS1=192.168.57.2
DNS2=114.114.114.114
PREFIX=24

- Restart the network service
sudo systemctl restart network
- Check the new static IP
ifconfig

- Try to ping google.com to confirm Internet access
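For example, a quick connectivity check (each ping sends 3 packets and exits):
# Internet reachability and DNS resolution
ping -c 3 google.com
# gateway reachability
ping -c 3 192.168.57.2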

Passwordless SSH
- Map host names to IP addresses on all 3 hosts
sudo vim /etc/hosts
# append the following
192.168.57.134 hadoop1
192.168.57.135 hadoop2
192.168.57.136 hadoop3

- Generate a key pair for user "hadoop" on all 3 hosts
su hadoop
ssh-keygen -t rsa
- From each host, distribute its public key to all 3 hosts
ssh-copy-id hadoop@hadoop1
ssh-copy-id hadoop@hadoop2
ssh-copy-id hadoop@hadoop3
# check the public keys this host has collected from the other machines
cat ~/.ssh/authorized_keys

- Test passwordless SSH to the other hosts
# from hadoop1
ssh hadoop@hadoop2
# from hadoop3
ssh hadoop@hadoop1
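A compact way to confirm passwordless SSH from every host at once; run this loop on each host in turn (a minimal sketch):
# should print the three hostnames without asking for a password
for h in hadoop1 hadoop2 hadoop3; do ssh hadoop@$h hostname; done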
- Repeat the above for the root user, but only on hadoop1, because the NameNode runs on hadoop1
# on hadoop1
su -
ssh-keygen -t rsa
ssh-copy-id root@hadoop1
ssh-copy-id root@hadoop2
ssh-copy-id root@hadoop3
- Disable the firewall on all 3 hosts (important)
sudo systemctl stop firewalld
sudo systemctl disable firewalld
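A quick check that the firewall is really off:
# should report "not running"
sudo firewall-cmd --state
sudo systemctl status firewalld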
Install Java 8 and Hadoop 3.4.0
- Download the packages
- jdk (ARM64 Compressed Archive): https://www.oracle.com/java/technologies/downloads/#java8
- hadoop 3.4.0: https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0-aarch64.tar.gz
- Create /opt/software and /opt/modules on all 3 hosts
sudo mkdir /opt/modules
sudo mkdir /opt/software
sudo chown hadoop:hadoop /opt/modules
sudo chown hadoop:hadoop /opt/software
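A quick check that both directories exist and are owned by the hadoop user:
# both should list hadoop:hadoop as owner and group
ls -ld /opt/modules /opt/software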

- Upload both packages to /opt/software on hadoop1 via the FinalShell GUI
- Extract both archives to /opt/modules on hadoop1
su hadoop
tar -zxvf /opt/software/hadoop-3.4.0-aarch64.tar.gz -C /opt/modules
tar -zxvf /opt/software/jdk-8u411-linux-aarch64.tar.gz -C /opt/modules
cd /opt/modules
mv jdk1.8.0_411 jdk1.8.0
ls -l

- Change the system default Java (optional)
su -
# add this JDK to the alternatives list
update-alternatives --install /usr/bin/java java /opt/modules/jdk1.8.0/bin/java 1
update-alternatives --install /usr/bin/javac javac /opt/modules/jdk1.8.0/bin/javac 1
# select this JDK as the default
update-alternatives --config java
update-alternatives --config javac

# check default java
ls -l /etc/alternatives/java
ls -l /etc/alternatives/javac
java -version
javac -version

- Add the JDK and Hadoop to $PATH (global environment variables)
# as root
su -
vim /etc/profile
# append to the end
export JAVA_HOME=/opt/modules/jdk1.8.0
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/opt/modules/hadoop-3.4.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

source /etc/profile
# test hadoop
hadoop version

Configure HDFS, YARN, and MapReduce
- Go to the Hadoop config directory
su hadoop
cd /opt/modules/hadoop-3.4.0/etc/hadoop
- Edit the following config files
vi core-site.xml
or use VS Code or any other editor
# core-site.xml
<configuration>
  <!-- HDFS internal (RPC) address and port -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop1:9000</value>
  </property>
  <!-- base directory for HDFS data and metadata -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data</value>
  </property>
</configuration>
# hadoop-env.sh
export JAVA_HOME=/opt/modules/jdk1.8.0
# The language environment in which Hadoop runs. Use the English
# environment to ensure that logs are printed as expected.
export LANG=en_US.UTF-8
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
# export HADOOP_HOME=
export HADOOP_HOME=/opt/modules/hadoop-3.4.0
# hdfs-site.xml, HDFS configuration
<configuration>
  <!-- NameNode web UI address -->
  <property>
    <name>dfs.namenode.http-address</name>
    <value>hadoop1:9870</value>
  </property>
  <!-- SecondaryNameNode web UI address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop2:9868</value>
  </property>
</configuration>
# mapred-site.xml, MapReduce configuration
<configuration>
  <!-- run MapReduce on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
  </property>
</configuration>
# yarn-site.xml, YARN configuration
<configuration>
  <!-- hadoop1 is the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>
  <!-- enable the shuffle auxiliary service -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- class the NodeManager loads to provide the shuffle service -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- URL of the log (history) server -->
  <property>
    <name>yarn.log.server.url</name>
    <value>http://hadoop1:19888/jobhistory/logs</value>
  </property>
  <!-- retain aggregated logs for 7 days -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
  </property>
</configuration>
# workers: list all DataNode hosts, one per line, with no extra whitespace
hadoop1
hadoop2
hadoop3
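A quick sanity check of the edited files before copying them out (a minimal sketch; cat -A makes stray spaces and blank lines visible):
# workers should list exactly the three hosts, one per line
cat -A /opt/modules/hadoop-3.4.0/etc/hadoop/workers
# fs.defaultFS should point at hdfs://hadoop1:9000
grep -A 1 'fs.defaultFS' /opt/modules/hadoop-3.4.0/etc/hadoop/core-site.xml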
- Copy /opt/modules to the other 2 hosts
# on hadoop1
scp -r /opt/modules/* hadoop@hadoop2:/opt/modules
scp -r /opt/modules/* hadoop@hadoop3:/opt/modules
- Copy /etc/profile to the other 2 hosts
# on hadoop1
scp /etc/profile root@hadoop2:/etc
# on hadoop2
source /etc/profile
# on hadoop1
scp /etc/profile root@hadoop3:/etc
# on hadoop3
source /etc/profile
# change the default Java on hadoop2 and hadoop3 (optional)
su -
# add this JDK to the alternatives list
update-alternatives --install /usr/bin/java java /opt/modules/jdk1.8.0/bin/java 1
update-alternatives --install /usr/bin/javac javac /opt/modules/jdk1.8.0/bin/javac 1
# select this JDK as the default
update-alternatives --config java
update-alternatives --config javac
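A quick verification on hadoop2 and hadoop3 after copying the files and sourcing /etc/profile:
# both should print the same versions as on hadoop1
hadoop version
java -version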
- If the Hadoop configuration changes on hadoop1 later, re-sync it to the other hosts with the following commands
# on hadoop1
rsync -avz /opt/modules/hadoop-3.4.0/etc/hadoop/ hadoop@hadoop2:/opt/modules/hadoop-3.4.0/etc/hadoop/
rsync -avz /opt/modules/hadoop-3.4.0/etc/hadoop/ hadoop@hadoop3:/opt/modules/hadoop-3.4.0/etc/hadoop/
Start Hadoop (HDFS, YARN)
- hadoop1: NameNode, DataNode, ResourceManager, NodeManager
- hadoop2: SecondaryNameNode, DataNode, NodeManager
- hadoop3: DataNode, NodeManager
- Format the NameNode
# on hadoop1
hdfs namenode -format
You should see that the NameNode metadata directory has been created; check it:
# on hadoop1
cd /home/hadoop/data/dfs/name/current
cat VERSION
- Start all daemons
# on hadoop1
start-all.sh
- hadoop1: check the running processes with jps

- hadoop2: check the running processes with jps

- hadoop3: check the running processes with jps
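The processes reported by jps should roughly match the role assignment listed at the top of this section (PIDs will differ):
# on hadoop1
jps
# expect: NameNode, DataNode, ResourceManager, NodeManager, Jps
# on hadoop2
jps
# expect: SecondaryNameNode, DataNode, NodeManager, Jps
# on hadoop3
jps
# expect: DataNode, NodeManager, Jps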

- Start the MapReduce job history server
# on hadoop1 (yarn.log.server.url above points at hadoop1:19888)
mapred --daemon start historyserver
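A quick check that the history server came up:
# the JobHistoryServer process should now be listed
jps | grep JobHistoryServer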

- Check the Web UIs (from inside the VMs)
- HDFS: http://hadoop1:9870
- YARN: http://hadoop1:8088 (there should be 3 active nodes; if only one shows up, check that the firewall is disabled on all hosts)

- MapReduce History Server: http://hadoop1:19888
- To access the web pages from the host machine, add the host names to the host's /etc/hosts
sudo vi /etc/hosts
192.168.57.134 hadoop1
192.168.57.135 hadoop2
192.168.57.136 hadoop3

Run the MapReduce Example Jar
- Create an input directory in HDFS
hdfs dfs -mkdir /input
- Create a wordcount input file and upload it to the input directory
vim ~/words.txt
hello hadoop
hello world
hello hadoop
mapreduce
hdfs dfs -put ~/words.txt /input
hdfs dfs -ls /input

- Run the example program
# /output must not already exist
hadoop jar /opt/modules/hadoop-3.4.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount /input /output
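If /output is left over from a previous run the job will fail, so remove it before re-running (a small sketch):
# delete the old output directory in HDFS
hdfs dfs -rm -r /output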
- Print the wordcount output
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
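For the words.txt created above, the result should contain one tab-separated word/count pair per line, sorted by word:
# expected contents of /output/part-r-00000
# hadoop    2
# hello     3
# mapreduce 1
# world     1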

- YARN web UI and history server


Stop Hadoop
- Stop all daemons
# on hadoop1
mapred --daemon stop historyserver && stop-all.sh
- Power off the 3 machines
poweroff

- Take a snapshot of each machine
For writing Python MapReduce programs with Hadoop Streaming, see: https://blog.csdn.net/Jacob12138/article/details/138908010