Hive Essentials

Hive

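1. Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It translates SQL into MapReduce (MR) or Spark jobs, so it can also be seen as a SQL client for MapReduce or Spark. Developing directly against MR is hard and has a steep learning curve, which is why Hive offers a SQL-like syntax instead.

Features:

Uses HDFS for the underlying data storage and MR to query and analyze the data
Maps structured data files to database tables and provides SQL-like querying
Does not support row-level updates or deletes by default (these require ACID transactional tables, available since Hive 0.14)

Supported execution engines: MR, Tez, Spark (not covered here)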

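2. Hive table types

Hive data types: INT, FLOAT, DOUBLE, STRING, BOOLEAN, TIMESTAMP, BINARY, STRUCT, MAP, ARRAY, etc.

There is no fixed data format; the user only has to specify the row delimiter, the column delimiter, and how the file is read

TIPS: Hive table metadata is kept in a traditional relational database (embedded Derby by default, MySQL is the common choice in production, and other databases such as PostgreSQL are also supported). Why a relational database? Left for the reader to work out…

Internal tables

Tables are created as internal (managed) tables by default

Internal tables are stored under the path set by hive.metastore.warehouse.dir; use location to override the default path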

create table if not exists person (
    name string, sex int, age int comment 'life span'
) row format delimited fields terminated by ','
stored as textfile;

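External tables

A Hive external table points at data that already exists (stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations)

A Hive external table generally cannot point directly at a non-HDFS folder on the local Linux machine

Dropping an external table deletes only the metadata, not the table data

If partition directories of an external table are added or changed outside Hive, run MSCK REPAIR TABLE table_name to refresh the partition metadata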

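-- the external keyword is what makes this an external table
create external table if not exists person (
    name string, sex int, age int comment 'life span'
) row format delimited fields terminated by ','
stored as textfile
location '/data/person'; -- location must be a directory, not a file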

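Partitioned tables

Data is split into folders, one folder per partition

A partition column must not have the same name as a column of the table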

create table if not exists person (
    name string, sex int, age int comment 'life span'
)
partitioned by (century string, country string) -- must come before row format
row format delimited fields terminated by ','
stored as sequencefile;
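To make the "one folder per partition" rule concrete, here is a minimal Python sketch of how a partition directory is named. The warehouse path and the partition values are assumptions chosen for illustration; only the column=value folder naming reflects how Hive lays out partitions.

```python
from pathlib import PurePosixPath


def partition_dir(table_dir, partition_values):
    """Build the folder for one partition: one nested column=value level per partition column."""
    path = PurePosixPath(table_dir)
    for column, value in partition_values:
        path = path / f"{column}={value}"
    return str(path)


# Assumed table directory and partition values, for illustration only.
example = partition_dir(
    "/user/hive/warehouse/person",
    [("century", "21"), ("country", "cn")],
)
print(example)
```

A query that filters on century and country only has to read the matching folder, which is why partitioning cuts down the amount of data scanned.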

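Bucketed tables

Data is split into files: the specified column is hashed, and rows are distributed across multiple files according to the hash value

This further optimizes queries and narrows the range of data scanned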

create table if not exists person (
    name string, sex int, age int comment 'life span'
)
clustered by (name) sorted by (age) into 12 buckets -- must come before row format
row format delimited fields terminated by ','
stored as textfile;
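The bucketing rule above (hash the column, then pick a file by the hash value) can be sketched in a few lines of Python. This is an illustration under assumptions: a Java-style string hash stands in for Hive's hash function, which in practice depends on the Hive version and the column type.

```python
def java_style_hash(text):
    """Java-style 31*h + char hash, kept to 32 bits (a stand-in, not Hive's exact function)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Pick a bucket: drop the sign bit, then take the hash modulo the bucket count."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


# Equal values always map to the same bucket file.
names = ["a", "ab", "a"]
buckets = [bucket_for(name, 12) for name in names]
print(buckets)
```

Because equal values always land in the same bucket, a lookup or join on the bucketing column only needs to read the matching bucket files.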

Views

create view person_view as select * from person;
alter view person_view as [select statement];

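3. Hive storage formats

Row-oriented storage

textfile, sequencefile
Values of the same row are stored physically next to each other

Columnar storage

orc (compressed with zlib by default), parquet
Faster for queries that read the same column across many rows
Makes it possible to apply compression tailored to each column's data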
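The difference between the two layouts can be shown with a toy Python model. The sample rows are made up, and real ORC/Parquet files add stripes, indexes and encodings on top of this; the sketch only shows why reading one column is cheaper in a columnar layout.

```python
# Three made-up rows of (name, sex, age).
rows = [("alice", 0, 30), ("bob", 1, 25), ("carol", 0, 41)]

# Row-oriented: all values of one row sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: all values of one column sit next to each other.
column_layout = [value for column in zip(*rows) for value in column]


def ages_from_rows(layout, width=3, age_index=2):
    """Row layout: the ages are scattered, one every `width` values."""
    return layout[age_index::width]


def ages_from_columns(layout, num_rows=3, age_index=2):
    """Column layout: the ages form one contiguous run."""
    start = age_index * num_rows
    return layout[start:start + num_rows]


print(ages_from_rows(row_layout), ages_from_columns(column_layout))
```

In the columnar layout the ages are one contiguous run of same-typed values, which is also what makes column-specific compression effective.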

Compression formats

zlib, snappy, gzip
Compression saves network bandwidth during MR processing

orc/parquet + snappy is the usual choice

create table if not exists person (
    name string, sex int, age int comment 'life span'
) row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY")
;

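A Hive table created as ORC + Snappy can still be split by MR jobs, because ORC files are split at stripe boundaries and compression is applied inside each stripe. The combination gives efficient compression while keeping files splittable, which suits big-data workloads well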

4. Hive SQL

A quick tour

-- databases
create database if not exists [name];
create database [name] location [hdfs dir];
alter database [name] set dbproperties('createtime'='9999-12-31');
desc database [name];
drop database [name] [cascade];

-- loading data
-- with local: the file is copied from the local file system; without it: the file is moved within HDFS
load data [local] inpath [path] [overwrite] into table [tabname] [partition(...)];

-- partitions
alter table [name] add partition(month='202409') partition(month='202410');
alter table [name] drop partition(month='202409');

-- buckets
-- populate a bucketed table with insert overwrite ... select from a regular table, so rows get hashed into buckets
insert overwrite table [name] [select query] cluster by col_name;

-- tables
alter table [old] rename to [new];
alter table [name] add columns(col1 string,col2 int);
alter table [name] change column col1 col1_new int;
drop table [name];
truncate table [name];
insert into table [name] partition(...) values (...);
create table [name1] like [name2]; -- create a table with the structure of an existing table

-- importing/exporting hive tables
export table [name] to [dir];
import table [name] from [dir];
insert overwrite [local] directory [dir] [row format delimited fields terminated by '\t' collection items terminated by '#']
[select query statement]; -- formatted export

DQL

-- basic structure
select [distinct | all] ... from A join B on ... where ... group by ... having ...
[cluster by | distribute by + sort by] ... order by ... limit num

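cluster by = distribute by + sort by on the same columns
order by produces a total order and uses a single reducer; sort by orders rows within each reducer only, so global order is not guaranteed
distribute by sends rows to different reducers according to the specified columns
Each join launches a job (consecutive joins on the same key can be merged into one job); the condition after on may be a non-equi join as of Hive 2.2.0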
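The difference between distribute by + sort by and order by can be modelled in plain Python. This is a sketch, not Hive's implementation: the sample rows are made up and the integer key modulo the reducer count stands in for Hive's hash partitioning.

```python
def distribute_by(rows, key, num_reducers):
    """Send each row to a reducer chosen from its key (key % n stands in for hashing)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[key(row) % num_reducers].append(row)
    return reducers


def sort_by(reducers, key):
    """Sort inside each reducer independently; nothing is compared across reducers."""
    return [sorted(part, key=key) for part in reducers]


def order_by(rows, key):
    """Total order: every row passes through one sort, like a single reducer."""
    return sorted(rows, key=key)


# Made-up rows of (id, amount).
rows = [(1, 10), (2, 50), (3, 40), (4, 20), (5, 30)]

parts = sort_by(distribute_by(rows, key=lambda r: r[0], num_reducers=2),
                key=lambda r: r[1])
concatenated = [amount for part in parts for _, amount in part]
total = [amount for _, amount in order_by(rows, key=lambda r: r[1])]
print(parts, concatenated, total)
```

Each reducer's output is sorted on its own, but concatenating the reducer outputs is not sorted, which is exactly the gap between sort by and order by.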

5. Common Hive functions

lateral view

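Calls a UDTF (User-Defined Table-Generating Function, such as explode) for every row. The UDTF turns one row into one or more rows, and the lateral view combines the results into a virtual table. The lateral view also keeps track of which exploded rows came from which source row, so joining it back to the original table gives something like a Cartesian product (each row combined with its own exploded values).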

-- one row to many rows
select a.team_name, b.champion_year from the_nba_championship a
lateral view explode(split(a.champion_year, ',')) b as champion_year;
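What the lateral view does per row can be modelled in plain Python. This is a sketch with made-up team names and years; lateral_view_explode is a hypothetical helper, not a Hive API.

```python
def lateral_view_explode(rows, array_of):
    """Pair every source row with each element of its own array, and with nothing else."""
    out = []
    for row in rows:
        for element in array_of(row):
            out.append((row, element))
    return out


# Made-up sample data, for illustration only.
teams = [
    {"team_name": "Team A", "champion_year": "1991,1992,1993"},
    {"team_name": "Team B", "champion_year": "2015"},
]

result = [
    (row["team_name"], year)
    for row, year in lateral_view_explode(
        teams, lambda r: r["champion_year"].split(",")
    )
]
print(result)
```

The result resembles a Cartesian product, but each row is combined only with its own exploded values, never with the values of other rows.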

collect_list

Returns the values of a column within each group as an array (array type), duplicates included; collect_set does the same but de-duplicates the values

-- many rows to one row
select col1, col2, concat_ws('-', collect_list(cast(col3 as string))) from row2col2 group by col1, col2;
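The behaviour of collect_list, collect_set and concat_ws can be imitated with a small Python sketch. The sample rows are made up, and the helper collect is hypothetical. Note that Hive does not promise any particular element order inside the array, while this model simply keeps input order.

```python
from collections import defaultdict


def collect(rows, as_set=False):
    """Group col3 by (col1, col2); as_set=True drops duplicates like collect_set."""
    groups = defaultdict(list)
    for col1, col2, col3 in rows:
        values = groups[(col1, col2)]
        if as_set and col3 in values:
            continue
        values.append(col3)
    return dict(groups)


# Made-up rows of (col1, col2, col3).
rows = [("a", "b", 1), ("a", "b", 2), ("a", "b", 1), ("c", "d", 3)]

listed = collect(rows)
unique = collect(rows, as_set=True)
# Equivalent of concat_ws('-', collect_list(cast(col3 as string))).
joined = {key: "-".join(str(v) for v in values) for key, values in listed.items()}
print(listed, unique, joined)
```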

reflect

Lets SQL call Java's built-in methods

select reflect("java.lang.Math", "max", col1, col2) from a;

Window functions

Largely the same as window functions in MySQL

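grouping sets, cube, rollup

GROUP BY month,day grouping sets (month,day,(month,day)) == the union of GROUP BY month, GROUP BY day, and GROUP BY month,day
GROUP BY month,day with cube = aggregate over every combination of the GROUP BY dimensions (including no grouping at all)
GROUP BY month,day with rollup = a subset of cube: hierarchical aggregation led by the leftmost dimension, i.e. (month,day), (month), ()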
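The three variants differ only in which grouping sets they expand to, which a short Python sketch can enumerate. The rows and the amount column are made up; cube, rollup and aggregate are hypothetical helpers that model the semantics, not Hive functions.

```python
from itertools import combinations


def cube(dims):
    """All combinations of the dimensions, from the full set down to the empty set."""
    return [combo for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]


def rollup(dims):
    """Prefixes of the dimension list only: (a, b), (a,), ()."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]


def aggregate(rows, dims, grouping_sets):
    """Sum amount once per grouping set; dimensions outside the set become None (NULL)."""
    result = {}
    for gset in grouping_sets:
        for row in rows:
            key = tuple(row[d] if d in gset else None for d in dims)
            result[key] = result.get(key, 0) + row["amount"]
    return result


dims = ("month", "day")
# Made-up sample rows.
rows = [
    {"month": "2024-09", "day": "01", "amount": 10},
    {"month": "2024-09", "day": "02", "amount": 5},
    {"month": "2024-10", "day": "01", "amount": 7},
]

rolled = aggregate(rows, dims, rollup(dims))
print(cube(dims), rollup(dims), rolled)
```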

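6. Hive execution plans

explain

A Hive query is converted into a sequence of one or more stages (a DAG, directed acyclic graph). A stage can be an MR computation stage, a metastore stage, or a file system operation stage.

Questions an execution plan helps answer:

Does a join filter out null values?
Does group by sort?
Which of two SQL statements runs more efficiently?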

STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
STAGE PLANS:
    Stage: Stage-1
        Map Reduce
            [TableScan\Select Operator\Group By Operator\...]
            Map Operator Tree:
                ...
            Reduce Operator Tree:
                ...
    Stage: Stage-0
        ...

explain dependency

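Shows the data sources a query needs (partitions are listed for partitioned tables)

explain authorization

Shows the query's data inputs, output location, current user, and operation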

7. How SQL is compiled into MR

How Join is implemented

select a.col,b.col from a join b on a.id=b.id

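MAP → read the id and col columns of tables a and b; both tables emit key-value pairs with id as the key and (tag, col) as the value, where tag marks the source table, e.g. tag 1 for every row of a and tag 2 for every row of b
Shuffle → partition the pairs by the hash of the key, so rows with the same join key (same id) always land in the same reducer
Reduce → use the tag to tell the two tables apart and, for each key, build the Cartesian product of a.col and b.col; each reducer ends up writing one output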
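The three phases above can be traced with a small in-memory Python model. This is a sketch under assumptions: the two tables hold made-up rows, and Python's hash of the integer id modulo the reducer count stands in for MR's partitioner.

```python
from collections import defaultdict


def map_phase(table, tag):
    """Emit (id, (tag, col)) so the reducer can tell which table a value came from."""
    for row_id, col in table:
        yield row_id, (tag, col)


def shuffle(pairs, num_reducers):
    """Partition by hash of the key, then group values by key inside each reducer."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[hash(key) % num_reducers][key].append(value)
    return reducers


def reduce_phase(groups):
    """For each key, cross the a-side values with the b-side values."""
    out = []
    for values in groups.values():
        left = [col for tag, col in values if tag == 1]
        right = [col for tag, col in values if tag == 2]
        out.extend((a_col, b_col) for a_col in left for b_col in right)
    return out


# Made-up rows of (id, col) for tables a and b.
a = [(1, "a1"), (2, "a2"), (2, "a2b"), (3, "a3")]
b = [(2, "b2"), (3, "b3"), (3, "b3b"), (4, "b4")]

pairs = list(map_phase(a, 1)) + list(map_phase(b, 2))
result = [row for groups in shuffle(pairs, 2) for row in reduce_phase(groups)]
print(result)
```

Keys that appear in only one table (id 1 and id 4 here) produce an empty product, which is how the inner join drops unmatched rows.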

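How Group By is implemented

select col1, col2, sum(money) from a group by col1, col2

MAP → read the col1, col2 and money columns of table a; emit key-value pairs with the group by columns (col1, col2) as the key and money as the value, pre-aggregating on the map side first
Shuffle → partition the pairs by the hash of the key, so identical keys always land in the same reducer
Reduce → sum the money values of each key; each reducer ends up writing one output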
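The same flow, including the map-side pre-aggregation, can be sketched in Python. The input splits are made up, and zlib.crc32 over the key's text form is an assumed stand-in for MR's partitioner, chosen only because it is deterministic.

```python
import zlib
from collections import defaultdict


def map_with_preaggregation(split):
    """Map side: sum money per (col1, col2) within one input split before shuffling."""
    partial = defaultdict(int)
    for col1, col2, money in split:
        partial[(col1, col2)] += money
    return list(partial.items())


def shuffle(map_outputs, num_reducers):
    """Route each key to one reducer and group its partial sums there."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, partial_sum in output:
            index = zlib.crc32(repr(key).encode()) % num_reducers
            reducers[index][key].append(partial_sum)
    return reducers


def reduce_phase(groups):
    """Reduce side: add up the partial sums of each key."""
    return {key: sum(partials) for key, partials in groups.items()}


# Two made-up input splits of (col1, col2, money).
splits = [
    [("x", "p", 10), ("x", "p", 5), ("y", "q", 1)],
    [("x", "p", 2), ("y", "q", 4), ("y", "r", 7)],
]

map_outputs = [map_with_preaggregation(split) for split in splits]
final = {}
for groups in shuffle(map_outputs, 2):
    final.update(reduce_phase(groups))
print(map_outputs[0], final)
```

Pre-aggregation means the first split ships two partial sums instead of three rows, which is the point of doing part of the work on the map side.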

How Distinct is implemented

single distinct
multiple distinct

TODO: implementation details to be added…

TODO: HQL execution internals and architecture…

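Tags: hive hadoop data warehouse

Reposted from: https://blog.csdn.net/m0_46877917/article/details/143405949
Copyright belongs to the original author, ArliKache. If this infringes your rights, please contact us to have it removed.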
