0


学习大数据DAY56 业务理解和第一次接入

作业1

1 了解行业名词 ERP CRM OA MES WMS RPA SAAS 了解每个系统的功能和应用

ERP 系统,(Enterprise Resource Planning,企业资源计划系统):ERP 系统

是一种用于管理企业各类资源的软件系统,包括生产管理、采购管理、库存管理、

财务管理等功能模块,帮助企业实现资源的优化配置和管理。

CRM 系统,(Customer Relationship Management,客户关系管理系统):CRM

系统是一种用于管理客户关系的软件系统,包括客户信息管理、销售管理、客户

服务管理等功能模块,帮助企业提高客户满意度和市场竞争力。

OA 系统,(Office Automation System,办公自动化系统):OA 系统是一种用

于协调、管理和优化办公流程的软件系统,包括电子邮件、日程安排、文档管理、

工作流程管理等功能模块,帮助企业提高工作效率和管理水平。

MES 系统(Manufacturing Execution System,制造执行系统):MES 系统是一

种用于管理制造过程的软件系统,包括生产计划管理、生产调度管理、工艺管理、

质量管理等功能模块,帮助企业提高生产效率和质量水平。

WMS 系统(Warehouse Management System,仓库管理系统):WMS 系统是一种用

于管理仓库的软件系统,包括入库管理、出库管理、库存管理、配送管理等功能

模块,帮助企业提高仓库管理效率和准确性。

SaaS(Software as a service),软件即服务,即通过网络提供软件服务,SaaS

平台供应商将应用软件统一部署在自己的服务器上,客户可以根据工作实际需

求,通过互联网向厂商定购所需的应用软件服务,按定购的服务多少和时间长短

向厂商支付费用,并通过互联网获得 Saas 平台供应商提供的服务。

2 模型设计 雪花模型 星型模型 ER 图 逻辑模型图...

星座模型

雪花模型

ER 图

逻辑模型图

3 检查自己的 Hadoop 集群,正常标准 444 进程。 myhadoop status

作业2

1 Hive 有哪些文件格式,为什么不用 textfile 格式

TextFile:这是 Hadoop 的默认文件格式,它将数据以行的形式存储。每行由换

行符(\n)分隔。

SequenceFile:这是 Hadoop 自带的一种二进制文件格式,它可以存储键值对,

序列化后存储。它支持压缩,而且有同步标记,容易分块,便于并行处理。

RCFile(Record Columnar File):这是一种列存储格式,它将数据首先按列存

储,然后在每一列中再按行存储。这种格式适用于读操作多于写操作的场景,可

以有效提高查询性能。

ORC(Optimized Row Columnar):这是 Hive 中支持的另一种列存储格式,相比

于 RCFile,ORC 文件格式具有更高的压缩比和查询效率。它支持 Zlib、Snappy

和 LZ4 压缩方式,并且提供了更高级别的向量化读取、投影和过滤优化。

Parquet:这是另一种列存储格式,它旨在提供跨平台的文件格式,可以很好地

与 Hadoop 生态系统中的其他组件(如 Impala、Presto 等)协同工作。Parquet

支持嵌套数据结构,并且可以与多种语言的客户端库一起使用。

不使用 TextFile 格式的原因主要有以下几点:

性能问题:TextFile 格式数据没有压缩,导致占用更多的存储空间,这会增加

存储成本和网络传输的负担。

读写效率低:TextFile 格式的数据不支持列存储,这对于列式存储优化的 Hive

查询引擎来说,效率较低,因为它不能有效地跳过不需要的列,从而增加了查询

时的 I/O 开销。

数据组织效率:由于是按行存储,对于执行大量列筛选和聚合操作的查询,性能

不够优化。列存储格式如 ORC 和 Parquet 可以更有效地利用列索引、压缩和编码

技术,从而加快数据处理速度。

压缩支持:Hive 中的其他文件格式如 ORC 和 Parquet 提供了对数据压缩的支持,

可以有效减少存储空间并提高处理速度,而 TextFile 格式不支持压缩。

2 完成环境的部署

**Python 安装: **

hadoop100,hadoop101,hadoop102 都要安装

阿里云下载 python 安装包

wget

https://mirrors.aliyun.com/python-release/source/Python-3.9.10.tgz

在 hadoop100 下载完成后,通过 scp 传输到另外两个虚拟机

scp /root/Python-3.9.10.tgz root@192.168.200.101:/rootscp /root/Python-3.9.10.tgz root@192.168.200.102:/root

我的 hadoop100 的 python 在上个阶段就装好了,直接在 hadoop101 和 hadoop102

再安装一次:

tar -zxvf Python-3.9.10.tgz

cd Python-3.9.10

安装依赖包:

yum install -y gcc patch libffi-devel python-devel zlib-devel

bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel

tk-devel gdbm-devel db4-devel libpcap-devel xz-devel

配置环境

./configure --prefix=/usr/include/openssl

make && make install

建立安装快捷指令

ln -s /usr/include/openssl/bin/python3 /usr/bin/python3

ln -s /usr/include/openssl/bin/pip3 /usr/bin/pip3

python3 -m pip install --upgrade pip

pip3 config set global.index-url

https://pypi.tuna.tsinghua.edu.cn/simple

pip3 install -i https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple

pymysql pyspark requests bs4 hdfs pandas numpy openpyxl

pip3 install urllib3==1.26.15

安装完成验证:

**部署 Hive: **

同步 Hive:

xsync /opt/module/apache-hive-3.1.2-bin

同步环境变量:

xsync /etc/profile.d/my_env.sh

部署调度平台:

cd /opt

mkdir xxljob

将‘Z:\班级\09_医药项目\xxl-job-student-20221220’中的所有文件上传到

xxljob 中

同步到其它虚拟机:

xsync xxljob

确保 mysql 账号密码均为 root

mysql 中执行 xxljob 建表文件

source /opt/xxljob/tables_xxl_job.sql;

编写启停脚本:

mkdir /root/bin

vim /root/bin/xxl

内容:

#!/bin/bash
act=$1
start()
{
echo "starting xxl-job"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-admin-2.3.0.jar > xxl-job.log 2>&1 &"
ssh root@hadoop100 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop101 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
ssh root@hadoop102 "cd /opt/xxljob; nohup java -jar
xxl-job-executor-sample-springboot-2.3.0.jar >
xxl-job-executor.log 2>&1 &"
}
stop(){
echo "stopping xxl-job"
ssh root@hadoop100 "ps -aux | grep xxl-job-admin | grep-v grep | awk '{print \$2}' | xargs kill -9"
ssh
root@hadoop100
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep | awk '{print \$2}' |
xargs kill -9"
}
status(){
echo "=============== hadoop102 ==============="
ssh
root@hadoop102
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop101 ==============="
ssh
root@hadoop101
"ps
-aux
|
grep
xxl-job-executor-sample | grep -v grep"
echo "=============== hadoop100 ==============="
ssh root@hadoop100 "ps -aux | grep xxl-job | grep -v
grep"
}
case $act in
start)
start
status
;;
stop)
stop
status
;;
restart)
stop
start
;;
status)
status
;;
esac

增加执行权限:

chmod +x /root/bin/xxl

验证:

xxl start

netstat -nltp | grep 8080

安装 DataX:

wget

https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/202308/datax.tar.gz

tar -zxvf datax.tar.gz

同步到其它节点:

xsync /opt/datax

**3 完成第一个接入任务 **

客户提供的连接信息

需求:使用 DataX 每天凌晨 1 点定时抽取 c_org_busi 门店信息表

编写配置文件

打开 VScode:

mkdir -p /zhiyun/shihaihong

cd /zhiyun/shihaihong

mkdir data jobs shell sql python读取

mysql https://github.com/alibaba/DataX/blob/master/mysqlreader/doc

/mysqlreader.md

写入

hive https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdf

swriter.md

job 目录中的 c_org_busi.json 文件下:

{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": ["jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"print":true
}
}
}
]
}
}

运行抽取:

python /opt/datax/bin/datax.py /zhiyun/shihaihong/jobs/c_org_busi.json

写入到 Hive

json 文件改为:

{
"job": {"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc",
"path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
{"name":"id","type": "int"},
{"name":"busno","type": "string"},{"name":"orgname","type": "string"},
{"name":"orgsubno","type":
"string"},
{"name":"orgtype","type": "string"},
{"name":"salegroup","type":
"string"},
{"name":"org_tran_code","type":
"string"},
{"name":"accno","type": "string"},
{"name":"sendtype","type":
"string"},
{"name":"sendday","type": "string"},
{"name":"maxday","type": "string"},
{"name":"minday","type": "string"},
{"name":"notes","type": "string"},
{"name":"stamp","type": "string"},
{"name":"status","type": "string"},
{"name":"customid","type":
"string"},
{"name":"whl_vendorno","type":
"string"},
{"name":"whlgroup","type":
"string"},
{"name":"rate","type": "string"},
{"name":"creditamt","type":
"string"},
{"name":"creditday","type":
"string"},
{"name":"peoples","type": "string"},
{"name":"area","type": "string"},
{"name":"abc","type": "string"},
{"name":"address","type": "string"},
{"name":"tel","type": "string"},
{"name":"principal","type":
"string"},
{"name":"identity_card","type":
"string"},
{"name":"mobil","type": "string"},
{"name":"corporation","type":
"string"},
{"name":"saler","type": "string"},
{"name":"createtime","type":
"string"},{"name":"bank","type": "string"},
{"name":"bankno","type": "string"},
{"name":"bak1","type": "string"},
{"name":"bak2","type": "string"},
{"name":"a_bak1","type": "string"},
{"name":"aa_bak1","type": "string"},
{"name":"b_bak1","type": "string"},
{"name":"bb_bak1","type": "string"},
{"name":"y_bak1","type": "string"},
{"name":"t_bak1","type": "string"},
{"name":"ym_bak1","type": "string"},
{"name":"tm_bak1","type": "string"},
{"name":"supervise_code","type":
"string"},
{"name":"monthrent","type":
"string"},
{"name":"wms_warehid","type":
"string"},
{"name":"settlement_cycle","type":
"string"},
{"name":"apply_cycle","type":
"string"},
{"name":"applydate","type":
"string"},
{"name":"accounttype","type":
"string"},
{"name":"applydate_last","type":
"string"},
{"name":"paymode","type": "string"},
{"name":"yaolian_flag","type":
"string"},
{"name":"org_longitude","type":
"string"},
{"name":"org_latitude","type":
"string"},
{"name":"org_province","type":
"string"},
{"name":"org_city","type":
"string"},
{"name":"org_area","type":
"string"},
{"name":"business_time","type":
"string"},{"name":"yaolian_group","type":
"string"},
{"name":"pacard_storeid","type":
"string"},
{"name":"opening_time","type":
"string"},
{"name":"ret_ent_id","type":
"string"},
{"name":"ent_id","type": "string"}
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}

打开 myhadoop 集群和 hive:

创建 hdfs 文件:

hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi

运行抽取:

python

/opt/datax/bin/datax.py

/zhiyun/shihaihong/jobs/c_org_busi.json

Hive 建表

-- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,
stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,
org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi";

验证数据

**调度脚本 **

自动运行,自动更新

c_org_busi.sh:

#!/bin/bash
# 作用: 完成从编写配置文件到验证数据的整个过程
# 需要在任何节点都可以执行
echo "开始抽取 c_org_busi 门店信息表"
echo "生成配置文件"mkdir -p /zhiyun/shihaihong/jobs
echo '
{
"job": {
"setting": {
"speed": {
"channel": 3
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "zhiyun",
"password": "zhiyun",
"column": [
"*"
],
"connection": [
{
"table": [
"c_org_busi"
],
"jdbcUrl": [
"jdbc:mysql://zhiyun.pub:233
06/erp"
]
}
]
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"defaultFS": "hdfs://hadoop100:8020",
"fileType": "orc","path":
"/zhiyun/shihaihong/ods/c_org_busi",
"fileName": "c_org_busi.data",
"column": [
...
],
"writeMode": "truncate",
"fieldDelimiter": "\t"
}
}
}
]
}
}' > /zhiyun/shihaihong/jobs/c_org_busi.json
echo "生成 HDFS 路径"
hadoop fs -mkdir -p /zhiyun/shihaihong/ods/c_org_busi
echo "抽取数据"
python /opt/datax/bin/datax.py
/zhiyun/shihaihong/jobs/c_org_busi.json
echo "hive 建表"
hive -e '
-- 创建数据库
create database if not exists ods_shihaihong location
"/zhiyun/shihaihong/ods";
-- 创建对应的数据表
-- ODS 表都应该是外部表 防止建错表然后删表的情况
create external table if not exists ods_shihaihong.c_org_busi(
id int,
busno string,
orgname string,
orgsubno string,
orgtype string,
salegroup string,
org_tran_code string,
accno string,
sendtype string,
sendday string,
maxday string,
minday string,
notes string,stamp string,
status string,
customid string,
whl_vendorno string,
whlgroup string,
rate string,
creditamt string,
creditday string,
peoples string,
area string,
abc string,
address string,
tel string,
principal string,
identity_card string,
mobil string,
corporation string,
saler string,
createtime string,
bank string,
bankno string,
bak1 string,
bak2 string,
a_bak1 string,
aa_bak1 string,
b_bak1 string,
bb_bak1 string,
y_bak1 string,
t_bak1 string,
ym_bak1 string,
tm_bak1 string,
supervise_code string,
monthrent string,
wms_warehid string,
settlement_cycle string,
apply_cycle string,
applydate string,
accounttype string,
applydate_last string,
paymode string,
yaolian_flag string,
org_longitude string,
org_latitude string,org_province string,
org_city string,
org_area string,
business_time string,
yaolian_group string,
pacard_storeid string,
opening_time string,
ret_ent_id string,
ent_id string
) row format delimited fields terminated by "\t"
lines terminated by "\n"
stored as orc
location "/zhiyun/shihaihong/ods/c_org_busi";
'
echo "验证数据"
hive -e '
select count(1) from ods_shihaihong.c_org_busi;
select * from ods_shihaihong.c_org_busi limit 2;
'
echo "抽取完成!"

赋权:

chmod u=rwx c_org_busi.sh

./c_org_busi.sh

设置定时:

执行一次,查看执行日志:

执行一次

4 作业标准

4.1 所有截图必须全屏截图

4.2 把调度平台的日志复制到作业中

标签: 学习 大数据 hive

本文转载自: https://blog.csdn.net/shh2000424/article/details/142184937
版权归原作者 工科小石头 所有, 如有侵权,请联系我们删除。

“学习大数据DAY56 业务理解和第一次接入”的评论:

还没有评论