

项目:千亿级离线数仓项目

一、项目介绍

1. 新零售行业背景

阶段一: 百货商店

阶段二: 超级市场(超市)
阶段三: 连锁商店

-----以上为线下的店铺------

阶段四: 电子商务(网络销售)

阶段五: 新零售(线上 + 线下 + 物流)

项目真实来源: 永辉超市

2. 业务需求

2.1 业务系统流程

2.1.1 商品发布流程

2.1.2 单店铺订单流程

2.1.3 购物车订单流程图

2.1.4 配送流程图

2.1.5 退货业务流程图

2.2 原始数据

原始数据按照子系统,划分为订单模块、支付模块、店铺模块、商品模块、用户模块、系统配置模块、广告模块、促销模块和配送模块等。

2.3 研发阶段

2.3.1 第一阶段
2.3.1.1 目标
  • 完成10个hadoop节点的搭建工作
  • 完成调度平台的搭建工作
  • 完成基础数据迁移 mysql--> Hadoop平台
  • 完成销售模块的数据建模
  • 能够满足业务对于基础销售的需求
2.3.1.2 资源

1.人员

项目经理一名

数据开发工程师三名

数据分析师2名


2.时间
3个月

数据分析师常见岗位要求:

  • 熟悉行业业务、对数据敏感;
  • 深入理解业务,能够基于数据分析得到有价值的信息,为业务发展提供策略和建议;
  • 独立研究数据挖掘模型,参与数据挖掘模型的构建、维护、部署和评估,引入数据分析方法及模型到实际分析工作当中;
  • 负责公司数据整理和发掘数据价值,设计数据产品并推动落地;
  • 熟悉SQL、hive、excel等数据查询及分析工具,对Hadoop、Spark、flink、Hive等大数据技术有一定了解者优先;
  • 有较强的探索精神和求知欲,对数据敏感、具备较强的数据分析能力、逻辑思考能力、沟通能力和团队意识,能够承受工作压力。
2.3.2 第二阶段
2.3.2.1 目标
  • 完成32个hadoop节点的扩容工作
  • 完成相应内存计算平台presto集群的搭建
  • 完成整个源系统的数据抽取工作
  • 完成销售模块、用户模块、商品模块、促销模块的数据建模工作
  • 满足公司日常运营的80%的数据需求和报表需求
  • 支撑财务的成本和利润的核算
  • 完成准实时数据的数据应用开发工作
2.3.2.2 资源

1.人员

项目经理一名

数据开发工程师四名

数据分析师3名


2.时间
4个月

2.4 报表需求

3. 数据和集群规模

3.1 性能指标

对于报表展现的内容刷新,页面数据的请求到展现的过程总体时间不能超过5秒。

3.2 数据量

1 全量
经过4年左右的业务发展,整个数据平台的数据为35T左右,冗余存储量为105T。
2 增量
每日增量25G左右。

3.3 集群规模

42台服务器

3.4 项目架构

  • Apache sqoop:用于关系型数据库和大数据生态圈之间数据导入导出的工具
  • Apache Hive: 构建集团的数据中心平台(数仓大平台),对数据基于业务最终形成业务宽表
  • Presto:基于内存的计算引擎,完成基于主题的统计分析,形成一个个的数据集市
  • Apache oozie: 工作流调度工具 => 自动调度
  • HUE: 通过HUE可以操作HADOOP,HIVE,oozie
  • cloudera manager: 大数据统一监控软件

架构:

  • 基于cloudera manager构建的大数据分析平台;
  • 在此平台基础上,构建有 HDFS,YARN,zookeeper,sqoop,oozie,HUE,HIVE 等相关的大数据组件;
  • 同时为了提升分析的效率,引入presto来进行分析处理操作;
  • 使用FineBi实现图表展示操作;
  • 整个分析工作是一个周而复始、不断重复执行的过程,采用oozie完成任务的调度工作。

数据流转的流程:

  • 整个项目的数据源都是集中在MySQL中的;
  • 通过sqoop完成数据的导入操作,将数据导入到HDFS中;
  • 使用HIVE构建相关的表,建立数仓体系,在HIVE进行分层处理;
  • 在进行统计分析的时候,采用presto提升分析的效率,将分析的结果导出到Mysql中;
  • 最后使用fineBi完成报表展示操作;
  • 整个项目基于cloudera manager进行监控管理,使用oozie完成工作流的调度操作。

4. 安装ClouderaManager

连接cloudera manager

http://hadoop01:7180

用户名:admin
密码:admin

注意:

  • 后续关闭虚拟机: 务必使用 shutdown -h now 或 init 0,万万不可直接强制关机;
  • 重启虚拟机:使用reboot命令;
  • 长时间不使用虚拟机,建议将其关机(尤其是使用机械硬盘的)比如中午的时候;

5. 数据仓库

定义

存储数据的仓库,主要用于存储过去历史发生过的数据,面向主题,对数据进行统计分析的操作,从而能够对未来提供决策支持


  • ODS负责数据的最初采集和存储;
  • DWD进行数据清洗和加工;
  • DWS按主题进行轻度汇总,形成主题统计宽表;
  • DWT在DWS基础上进一步汇总,形成累计型的主题宽表;
  • ADS为特定应用提供数据展示和分析。

特点

数据仓库既不生产数据,也不消耗数据,数据来源于各个数据源

四大特征

  • 面向主题:分析什么,什么就是我们的主题。
  • 集成性:数据从各个数据源汇聚而来,数据的结构都不一定一样。
  • 非易失性(稳定性):存储的都是过去历史的数据,不会发生变更,甚至某些数据仓库都不支持修改操作
  • 时变性:随着时间推移,将最近发生的数据也需要放置到数据仓库中,同时分析的方案也无法满足当前需求,需要变更分析的手段

OLAP 和 OLTP区别

ETL

ETL:抽取 转换 加载

狭义上ETL:
指的是将数据从ODS层抽取出来,对ODS层的数据进行清洗转换处理,将清洗转换后的数据加载到DW层的过程

广义上ETL:

    指的是数仓的全过程

数据仓库 和 数据集市

数据仓库是包含数据集市的,在一个数据仓库中可以有多个数据集市

数据仓库:一般指的构建集团数据中心,基于业务形成各种业务的宽表

数据集市:基于部门或者基于主题,形成主题或者部门相关的统计宽表

6. 维度分析

6.1 维度

看待问题角度,当我们对一个主题进行分析的时候,可以从不同的角度来分析,这些角度就是维度。

比如说:对订单进行分析 可以从 用户,时间,地区,商家,商圈...


维度的分类:

  • 定性维度:一般指的是“每个、各个”这类维度,比如 统计每天、每小时、各个用户... 这种维度在编写SQL时,一般放置在 group by 中
  • 定量维度:一般指的是统计某个范围或者某个具体的值,比如 统计年龄在18~60岁、时间为2021年度,这种维度在编写SQL时,一般放置在 where 条件中

上卷和下钻

比如说 我们以天作为标准:
上卷:统计 周、月、年

下钻:统计 小时
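
下面用一段SQL示意上卷和下钻(sales表及其中的 sale_month、sale_day、sale_hour、price 字段均为假设,仅用于说明粒度的变化):

-- 以天为基准的统计
select sale_day, sum(price) from sales group by sale_day;
-- 上卷:粒度变粗,按月统计
select sale_month, sum(price) from sales group by sale_month;
-- 下钻:粒度变细,按小时统计
select sale_day, sale_hour, sum(price) from sales group by sale_day, sale_hour;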

分层或者分级

比如说:以地区为例,将地区划分为省份、市、县/区

6.2 指标

衡量事物发展的标准,也叫度量;

简单来说:在根据维度进行分析的时候,必然要分析出一些结果,这个结果就是度量

如:count() , sum() , avg() 等


指标的分类:

  • 绝对指标:指的是统计计算出一个具体的值,

      比如说 销售额、订单量,
      count()、sum()、min()、max()、avg();

  • 相对指标:指的是统计相对的结果,比如说 同比增长、环比增长、流失率、增长率...

      这些指标在计算的时候,是可以不需要对全部数据进行统计可以通过抽样的方式来计算即可。
    

6.3 示例:

需求:
请统计在2021年度,来自于北京 女性 未婚,年龄在 18 ~28 之间的每天的销售总额是多少?

分析:
涉及维度:时间维度,地区维度,性别,婚姻状态,年龄

            定性维度:每天

            定量维度:2021,北京 女性 未婚 年龄

    涉及指标: 销售额(绝对指标)

SQL:

select day, sum(price) from 表 where 时间=2021 and address='北京' and sex='女性' and status='未婚' and age between 18 and 28 group by day;

7. 数仓建模

何为建模: 如何在数据仓库中构建表,是一套用于规范化建表的理论

三范式建模:

     主要应用在传统的数据仓库中,指的是数据存储在关系型数据库中,要求在建表的时候,表必须有主键;同时表中尽量避免数据冗余的发生,尽可能拆分表

维度建模:

    主要应用于新型的数据仓库中,指的是数据存储在专门用于OLAP(面向分析)的数据库中,只要是利于分析的建模方案,都认为是OK的,可以允许冗余的发生。

后续主要采用维度建模的思想来构建相关的表,在维度建模中,主要规定了两种表模型: 事实表 和 维度表

7.1 事实表

  • 事实表:对应的是主题,要统计的主题是什么,对应的事实就是什么,而主题所对应的表,就是事实表
  • 事实表一般是一组其他表主键(外键)的聚集
  • 事实表一般是反映用户某种行为的表

比如说:订单表,收藏表,登录表


事实表分类:

  • 事务事实表 :最初始确定的事实表 其实就是事务事实表
  • 周期快照事实表:指的是对数据进行提前聚合后的表,比如将事实表按照天聚合统计后得到的结果表
  • 累计快照事实表:每一条数据,记录了一个事件从开始到结束的完整流程,一般由多个时间字段组成

7.2 维度表

  • 维度表:当对事实表进行统计分析的时候,可能需要关联一些其他表进行辅助,这些表其实就是维度表
  • 维度表一般是由平台或者商家来构建的表,与用户无关,不会反映用户的行为

比如说:地区表,商品表 时间表,分类表


维度表分类:
高基数维度表:

    如果数据量达到几万、几十万甚至几百万,一般将这样的维度表称为高基数维度表

    比如: 商品表、用户表

低基数维度表:

    如果数据量只有几条 或者 几十条 或者几千条,这样称为低基数维度表

    比如:地区表、 时间表 、分类表 、配置表

7.4 数仓发展模型

7.5 渐变维SCD

维度可以根据变化剧烈程度主要分为无变化维度和变化维度。例如一个人的相关信息,身份证号、姓名和性别等信息数据属于不变的部分;而婚姻状态、工作经历、工作单位和培训经历等属于可能会变化的字段。
大多数维度数据随时间的迁移是缓慢变化的。比如增加了新的产品,或者产品的ID号码修改了

解决方式

  • SCD1:不维护历史变更行为,直接对过去数据进行覆盖即可。此种操作 仅适用于错误数据的处理

  • SCD2:维护历史变更行为。处理方式:在表中新增两个字段,一个是起始时间,一个是结束时间;当数据发生变更后,将之前的数据设置为过期,将变更后完整的数据添加到表中,并重新记录其起始和结束时间,这种方案称为 拉链表
      - 好处:可以维护更多历史版本的数据,处理起来也比较简单(利于维护)
      - 弊端:造成数据冗余存储,大量占用磁盘空间

  • SCD3:维护历史变化。处理方式:当表中有字段发生变更后,新增一列,将变更后的数据存储到这一列中即可
      - 好处:减少数据冗余存储
      - 弊端:只能维护少量的历史版本,而且维护不方便,效率比较低
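
下面用一个简化的示意说明SCD2拉链表的结构与查询方式(表名 dim_demo 及其中数据均为假设,仅作示意):

-- 拉链表示意数据:每条记录带有起始时间和结束时间
-- id  name    start_date    end_date
-- 1   张三    2021-01-01    2021-06-30     (历史版本,已过期)
-- 1   张三丰  2021-07-01    9999-99-99     (当前有效版本)

-- 查询当前有效的数据:取结束时间为极大值的记录
select * from dim_demo where end_date = '9999-99-99';
-- 查询某个历史时点(如2021-03-01)的数据快照
select * from dim_demo where start_date <= '2021-03-01' and end_date >= '2021-03-01';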

8. 数仓分层

ODS:源数据层(临时存储层)

    作用:对接数据源,用于将数据源的数据完整的导入到ODS层中,一般ODS层的数据和数据源的数据保持一致,类似于一种数据迁移的操作,一般在ODS层建表的时候,会额外增加一个 日期的分区,用于标记何时进行数据采集。

DW:数据仓库层

    作用:用于进行数据统计分析的操作,数据来源于 ODS层

APP(DA | ADS | RPT | ST):数据应用层(数据展示层)

    作用:存储分析的结果信息,用于对接相关的应用,比如BI图表

以新零售项目为例:

ODS:源数据层(临时存储层)

    **作用**:对接数据源,用于将数据源的数据完整的导入到ODS层中,一般ODS层的数据和数据源的数据保持一致,类似于一种数据迁移的操作,一般在ODS层建表的时候,会额外增加一个 日期的分区,用于标记何时进行数据采集。

DW:数据仓库层

    **DWD**:明细层

            **作用**:和ODS层保持相同的粒度,不会对数据进行聚合操作,只需进行清洗、转换工作,保证数据质量,利于后续分析。

            **清洗**:过滤掉一些无用数据
             **转换**:格式转换 或者一个字段 转换为多个字段, json转换

    **DWM**:中间层

            **作用**:进行维度退化的操作,根据业务模块,形成业务宽表

            **维度退化**:

                    以订单表为例,在订单表中,有用户id、商品id、商家id、地区id等信息。如果需要按照用户、商品、商家和地区来统计,此时需要关联用户表、商品表、商家表、地区表

            提前先将这些维度表中可能需要使用的字段**合并到事实表**中,让**事实表变得更宽**,后续在统计的时候,只需要关联订单表即可(见下方SQL示意)
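
下面是维度退化的一个SQL示意(表名、字段均为假设):先把维度表中常用的字段并入订单宽表,之后统计时只查这张宽表即可:

insert into table dwm_order_wide
select
    o.id,
    o.buyer_id,
    u.user_name,        -- 来自用户维度表
    o.goods_id,
    g.goods_name,       -- 来自商品维度表
    o.store_id,
    s.store_name,       -- 来自店铺维度表
    o.order_amount
from dwd_order o
left join dim_user  u on o.buyer_id = u.id
left join dim_goods g on o.goods_id = g.id
left join dim_store s on o.store_id = s.id;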

    **DWS**:业务层

            **作用**:用于进行提前聚合操作,形成基础主题统计宽表指标数据

            **例如**:需求要求统计 每年、每月、每日的销售额。那么在DWS层,可以先按照日形成统计结果数据

DM:数据集市层
作用:基于主题,形成数据集市,对指标进行细化统计过程

    **例如**:需要将每年、每月、每日的销售额全部记录在DM层中,此时我们只需要对DWS层进行上卷统计即可
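
以销售额为例,下面给出DWS按日聚合、DM基于DWS上卷到月的SQL示意(表名、字段均为假设):

-- DWS层:先按日形成基础统计宽表
insert into table dws_sale_daycount
select dt, sum(order_amount) as sale_amount
from dwm_order_wide
group by dt;

-- DM层:基于DWS按月上卷,年同理
insert into table dm_sale_monthcount
select substr(dt, 1, 7) as year_month, sum(sale_amount) as sale_amount
from dws_sale_daycount
group by substr(dt, 1, 7);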

APP(DA | ADS | RPT | ST):数据应用层(数据展示层)

    作用:存储分析的结果信息,用于对接相关的应用,比如BI图表

9. 数仓的工具的基本使用

9.1 HUE 的基本使用

HUE: Hadoop User Experience(Hadoop 用户体验)

HUE: 本质上就是大集成者,将大数据中各种软件的操作界面 集成在一起,通过HUE 完成对大数据相关的组件操作

进入HUE

用户名:hue 密码:hue

9.2 HUE 操作HDFS

方式1:

方式2:

9.3 HUE 操作Hive

9.4 HUE操作oozie

oozie:是一个工作流的调度工具,实现对工作流的定时调度操作

何为工作流:指业务过程的部分或整体在计算机应用环境下的自动化

工作流一般要满足以下几个特征:

  1. 一个流程是可以被拆解为多个阶段(步骤)
  2. 多个阶段之间存在依赖关系,前序没有执行,后续无法执行
  3. 整个流程需要周而复始、不断地执行

先回答:大数据的工作流程有哪些呢?

  1. 确定数据源
  2. 数据的存储
  3. 数据的预处理
  4. 数据的分析处理
  5. 数据的应用

能够实现工作流的大数据组件有哪些?

oozie:

    是apache旗下一款工作流的调度工具,出现时间也比较久了。oozie如果单独使用,是非常麻烦的,提供的管理界面仅仅只能查看一些状态,无法对工作流进行操作,所有的操作都需要通过配置 XML文档来完成

    但是由于oozie是apache大数据全家桶的一员,HUE在集成调度工具时,优先选择自家产品;使用HUE,用户只需要通过鼠标点一点即可完成oozie工作流的配置了

azkaban:

    属于领英(LinkedIn)公司旗下的一款工作流调度工具,开源免费。azkaban简单来说就是一个shell脚本调度工具,提供了统一的工作流界面,可以直接在界面上提交工作流,并对工作流进行监控;整个工作流的配置,仅需要简单配置几行类似于properties的文件即可完成

如何使用oozie:

开启 HUE 对 oozie 的支持

9.4.1 配置工作流

9.4.2 定时配置

9.5 sqoop的基本使用操作

sqoop是apache旗下的顶级项目,主要用于 RDBMS 和大数据生态圈之间的数据导入导出,从RDBMS到大数据生态圈是导入操作,反之为导出操作

sqoop本质上也是一款翻译软件,将sqoop的命令翻译为 MR程序

关于使用sqoop 将数据导入到HIVE,支持两种导入方案:原生导入方案 和 hcatalog方式

区别点:

如果使用原生导入方式导入HIVE,仅支持 textFile 的存储格式;hcatalog支持的数据存储方案比较多:textFile,ORC,sequence,parquet...

原生方式支持数据覆盖导入
hcatalog仅支持追加导入

原生方式在导入的时候,是根据字段的顺序导入到HIVE中;hcatalog在导入的时候,是根据字段的名称导入的

(此部分建议在导入到HIVE时,hive表字段的顺序和mysql表字段顺序保持一致,名称也保持一致)

后续主要采用 hcatalog 的导入方式,因为建表的时候,主要存储格式为ORC

9.5.1 基本使用操作

sqoop help

如何查看某个操作下相关的参数信息:

sqoop 操作 --help

查询mysql中所有的库有哪些?

思考:连接mysql需要知道什么信息呢?

1-用户名 2-密码 3- 连接地址

sqoop list-databases --connect jdbc:mysql://hadoop01:3306 --username root --password 123456

查询mysql中scm库下所有的表

sqoop list-tables \
--connect jdbc:mysql://hadoop01:3306/scm \
--username root \
--password 123456

\ :表示命令未结束,换行后继续书写

9.5.2 数据全量导入操作

如何全量将数据导入到HDFS中

需求一: 将 emp表(mysql)中数据导入到HDFS中

方式1:

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp

说明:

  • 当不指定导出路径的时候,默认会将数据导入到当前操作用户的HDFS的家目录下,在此目录下以表名创建一个文件夹,将数据放置到这个文件夹中;
  • 发现在导入数据的时候,有多少条数据,就会运行多少个mapTask,最高和cpu核数相等;
  • 数据之间的分隔符号为 逗号

方式2:

思考:是否可以将其导入到其他位置呢? --target-dir 和--delete-target-dir

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--delete-target-dir \
--target-dir /sqoop_works/emp

说明:
--target-dir:指定将数据导入到HDFS的哪个位置

--delete-target-dir:如果目标路径已存在,先删除

思考:是否可以设置其mapTask的数量呢? -m和 --split-by

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--delete-target-dir \
--target-dir /sqoop_works/emp \
-m 1 \
--split-by id

说明:
如果 -m 为1,表示只运行一个mapTask,此时可以省略 --split-by

--split-by 表示按照哪个字段切分数据表,一般设置为主键字段;如果主键有多个,那么就写多个字段,用逗号隔开

思考:是否可以调整分隔符号呢?比如说 设置为| 参数:--fields-terminated-by

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--delete-target-dir \
--target-dir /sqoop_works/emp \
--fields-terminated-by '|' \
-m 1 \
--split-by id

如何全量将数据导入到Hive中

  • 在hive中建相关表 emp_add
  • 编写 sqoop命令 完成数据导入操作

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--fields-terminated-by '\t' \
--hcatalog-database 'day03_xs' \
--hcatalog-table 'emp_add' \
-m 1

注意:
由于 hive表的存储格式为 orc,所以无法使用sqoop的原生导入方案,必须使用hcatalog

9.5.3 数据条件导入数据

  • 方式一:通过 where条件的方式,将部分数据导入到HDFS中

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--where 'id > 1205 and ...' \
--delete-target-dir \
--target-dir /sqoop_works/emp \
-m 1

  • 方式二: 通过 SQL的方式,将部分数据导入到HDFS中:

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--query 'select * from emp where 1=1 and $CONDITIONS' \
--delete-target-dir \
--target-dir /sqoop_works/emp \
-m 1

注意:

  • 当使用 --query 方式的时候,不允许再使用 --table,因为 SQL中已经明确需要导入哪个表的数据
  • 当使用 --query 方式的时候,编写的SQL语句必须添加where条件,条件最后必须要跟 $CONDITIONS,如果使用双引号包裹SQL,$前面必须加一个 \

如何导入HIVE呢?以其中SQL方式为例

sqoop import \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--query 'select * from emp where 1=1 and $CONDITIONS' \
--fields-terminated-by '\t' \
--hcatalog-database 'day03_xs' \
--hcatalog-table 'emp_add' \
-m 1

9.5.4 数据全量导出操作
  • 需求: 将HIVE中 emp_add表中所有的数据全量导出MySQL中

步骤一:在MySQL中创建目标表

步骤二:

sqoop export \
--connect jdbc:mysql://hadoop01:3306/test \
--username root \
--password 123456 \
--table emp \
--fields-terminated-by '\t' \
--hcatalog-database 'day03_xs' \
--hcatalog-table 'emp_add' \
-m 1

9.5.5 相关的sqoop参数

二、项目开始

1. 业务数据准备

将数据表导入进mysql中

2. hive的基础优化(都不要去调整)

3. 完成ODS层数据采集操作

  • 作用:对接数据源,一般和数据源保持相同的粒度;
  • ODS层:处于在HIVE端;
  • 业务数据:MySQL;
  • 目标:将MySQL中业务库的表数据 导入到 ODS层中;
  • 技术:sqoop 完成导入的操作;

3.1 数据存储格式和压缩方案

存储格式:

在hive中,数据存储格式主要分为两大类: 行式存储 和 列式存储

    行式存储(textFile):
            优点:可读性较好,执行 select * 效率比较高
            弊端:耗费磁盘资源,执行 select 字段 效率比较低

    列式存储(ORC):
            优点:节省磁盘空间,执行 select 字段 效率比较高
            弊端:执行 select * 效率比较低

ORC是兼具行式存储优势又具有列式存储优势,数据按行分块,每块中按列存储数据,同时在每个块内部,对数据构建索引,提升查询的效率

压缩方案:

思考:压缩有什么用?能够在有限的空间下,存储更多的数据

在进行压缩的时候,压缩的方案其实有很多种:ZIP(GZIP),SNAPPY,LZO,ZLIB....

思考:具体使用哪种压缩方案呢? 性价比比较高的(压缩比,解压缩的性能)

    zlib(gzip):具有良好的压缩比,但是解压缩的性能一般

    snappy:具有良好的解压缩性能,同时具有较好的压缩比;弊端:没有zlib的压缩比好,同时hadoop原生默认不支持snappy压缩

本项目采用哪种方案?主要采用 snappy

    在 ODS层,一般会使用 zlib

    在其他层次中,一般采用 snappy

说明:

  • 如果读取次数较少,写入量较大,优先保证压缩比 ---zlib(gzip) 比如说 ODS层
  • 如果读取次数比较高,优先保障解压缩性能 --snappy 比如说 DW层相关的表
  • 如果不清楚,建议使用snappy,或者如果空间足够,统一采用snappy也没有问题

创建表的时候,选择内部表,还是外部表?

判断标准: 对表数据是否有管理的权限

有权限删除数据,那么我们可以构建内部表,当然也可以构建外部表

如果没有权限删除数据,只能构建外部表

创建表的时候,是否需要构建分区表呢?

一般情况下,都是分区表(分区字段大多以时间为主)
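
综合上面的选择(ORC列式存储、ODS层使用zlib压缩、按日期分区),一张ODS表的建表语句大致如下(表名、字段从简,仅作示意):

create table t_demo_ods (
    id   string comment '主键',
    name string comment '名称'
)
comment '示例表'
partitioned by (dt string)                 -- 按采集日期分区
row format delimited fields terminated by '\t'
stored as orc                              -- 列式存储
tblproperties ('orc.compress' = 'ZLIB');   -- ODS层采用zlib压缩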

3.2 ODS层数据同步

全量覆盖

适用于:
表数据变更的频次并不多,不需要记录其历史数据,而且整个表数据量相对较少,这个时候可以采用全量覆盖

操作方式:
每次同步数据,都是先将数仓中原有的表数据全部删除,然后重新从业务端导入即可;建表的时候,不需要构建分区表

比如说:地区表,时间表

仅新增同步

适用于:
业务端数据只会有新增的操作,不会有变更的时候,数据量比较多

操作方式:
在数仓中建表的时候,需要构建分区表,分区字段和同步数据的周期是一致的,

比如说: 每天都需要同步数据,分区字段需要按天;如果每月同步一次,分区字段按照月

    每次进行同步的时候,将对应周期下的新增数据放置到对应日期分区下

**比如说:**登录日志表,访问日志表

新增及更新同步

适用于:
业务端表数据既有更新操作,又有新增操作的时候,而且数据量比较多

操作方式:
在数仓中建表的时候,需要构建分区表,分区字段和同步数据的周期是一致的。比如说: 每天都需要同步数据,分区字段需要按天;如果每月同步一次,分区字段按照月

    每次进行同步的时候,将对应周期下的新增数据和更新数据放置到对应日期分区下即可

比如说:
订单表,商品表,用户表...

全量同步

适用于:
业务端数据量不是特别大,但是也存在更新和新增,而且不需要保留太多的历史版本

操作方式:
在数仓中建表的时候,需要构建分区表,分区字段和同步数据的周期是一致的。比如说:每天都需要同步数据,分区字段需要按天;如果每月同步一次,分区字段按照月

    每次导入都是导入截止当前时间的全量数据,定期将历史的日期数据删除即可

3.3 中文乱码问题

注意:在后续hive中建表的时候,如果表字段说明信息是中文,可能hive会出现乱码情况

在mysql的hive元数据库中,执行以下SQL即可:

use hive;
alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;
alter table PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8 ;
alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;
alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;

3.4 创建ODS层相关的表

1-创建库

create database yp_ods;

2-构建ODS层表

2.1-全量覆盖表:

--区域表: t_district

DROP TABLE if exists yp_ods.t_district;
CREATE TABLE yp_ods.t_district
(
    `id` string COMMENT '主键ID',
    `code` string COMMENT '区域编码',
    `name` string COMMENT '区域名称',
    `pid`  int COMMENT '父级ID',
    `alias` string COMMENT '别名'
)
comment '区域字典表'
row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress'='ZLIB');

--日期表: t_date

drop table yp_ods.t_date;
CREATE TABLE yp_ods.t_date
(
    dim_date_id           string COMMENT '日期',
    date_code             string COMMENT '日期编码',
    lunar_calendar        string COMMENT '农历',
    year_code             string COMMENT '年code',
    year_name             string COMMENT '年名称',
    month_code            string COMMENT '月份编码',
    month_name            string COMMENT '月份名称',
    quanter_code          string COMMENT '季度编码',
    quanter_name          string COMMENT '季度名称',
    year_month            string COMMENT '年月',
    year_week_code        string COMMENT '一年中第几周',
    year_week_name        string COMMENT '一年中第几周名称',
    year_week_code_cn     string COMMENT '一年中第几周(中国)',
    year_week_name_cn     string COMMENT '一年中第几周名称(中国',
    week_day_code         string COMMENT '周几code',
    week_day_name         string COMMENT '周几名称',
    day_week              string COMMENT '周',
    day_week_cn           string COMMENT '周(中国)',
    day_week_num          string COMMENT '一周第几天',
    day_week_num_cn       string COMMENT '一周第几天(中国)',
    day_month_num         string COMMENT '一月第几天',
    day_year_num          string COMMENT '一年第几天',
    date_id_wow           string COMMENT '与本周环比的上周日期',
    date_id_mom           string COMMENT '与本月环比的上月日期',
    date_id_wyw           string COMMENT '与本周同比的上年日期',
    date_id_mym           string COMMENT '与本月同比的上年日期',
    first_date_id_month   string COMMENT '本月第一天日期',
    last_date_id_month    string COMMENT '本月最后一天日期',
    half_year_code        string COMMENT '半年code',
    half_year_name        string COMMENT '半年名称',
    season_code           string COMMENT '季节编码',
    season_name           string COMMENT '季节名称',
    is_weekend            string COMMENT '是否周末(周六和周日)',
    official_holiday_code string COMMENT '法定节假日编码',
    official_holiday_name string COMMENT '法定节假日',
    festival_code         string COMMENT '节日编码',
    festival_name         string COMMENT '节日',
    custom_festival_code  string COMMENT '自定义节日编码',
    custom_festival_name  string COMMENT '自定义节日',
    update_time           string COMMENT '更新时间'
)
COMMENT '时间维度表'
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'ZLIB');
    2.2-仅新增同步表:

--商品评价明细表:t_goods_evaluation_detail

DROP TABLE if exists yp_ods.t_goods_evaluation_detail;
CREATE TABLE yp_ods.t_goods_evaluation_detail
(
    `id`                              string,
    `user_id`                         string COMMENT '评论人id',
    `store_id`                        string COMMENT '店铺id',
    `goods_id`                        string COMMENT '商品id',
    `order_id`                        string COMMENT '订单id',
    `order_goods_id`                  string COMMENT '订单商品表id',
    `geval_scores_goods`              INT COMMENT '商品评分0-10分',
    `geval_content`                   string,
    `geval_content_superaddition`     string COMMENT '追加评论',
    `geval_addtime`                   string COMMENT '评论时间',
    `geval_addtime_superaddition`     string COMMENT '追加评论时间',
    `geval_state`                     TINYINT COMMENT '评价状态 1-正常 0-禁止显示',
    `geval_remark`                    string COMMENT '管理员对评价的处理备注',
    `revert_state`                    TINYINT COMMENT '回复状态0未回复1已回复',
    `geval_explain`                   string COMMENT '管理员回复内容',
    `geval_explain_superaddition`     string COMMENT '管理员追加回复内容',
    `geval_explaintime`               string COMMENT '管理员回复时间',
    `geval_explaintime_superaddition` string COMMENT '管理员追加回复时间',
    `create_user`                     string,
    `create_time`                     string,
    `update_user`                     string,
    `update_time`                     string,
    `is_valid`                        TINYINT COMMENT '0 :失效,1 :开启'
)
comment '商品评价明细'
partitioned by (dt string) row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress'='ZLIB');

--登录记录表:t_user_login

DROP TABLE if exists yp_ods.t_user_login;
CREATE TABLE yp_ods.t_user_login(
   id string,
   login_user string,
   login_type string COMMENT '登录类型(登陆时使用)',
   client_id string COMMENT '推送标示id(登录、第三方登录、注册、支付回调、给用户推送消息时使用)',
   login_time string,
   login_ip string,
   logout_time string
) 
COMMENT '用户登录记录表'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc tblproperties ('orc.compress' = 'ZLIB');

--订单组支付表:t_order_pay

DROP TABLE if exists yp_ods.t_order_pay;
CREATE TABLE yp_ods.t_order_pay
(
    id               string,
    group_id         string COMMENT '关联shop_order_group的group_id,一对多订单',
    order_pay_amount DECIMAL(11,2) COMMENT '订单总金额;',
    create_date      string COMMENT '订单创建的时间,需要根据订单创建时间进行判断订单是否已经失效',
    create_user      string,
    create_time      string,
    update_user      string,
    update_time      string,
    is_valid         TINYINT COMMENT '是否有效  0: false; 1: true;   订单是否有效的标志'
)
comment '订单支付表'
partitioned by (dt string) row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress' = 'ZLIB');

DROP TABLE if exists yp_ods.t_shop_order_goods_details;
CREATE TABLE yp_ods.t_shop_order_goods_details
(
    `id`                  string COMMENT 'id主键',
    `order_id`            string COMMENT '对应订单表的id',
    `shop_store_id`       string COMMENT '卖家店铺ID',
    `buyer_id`            string COMMENT '购买用户ID',
    `goods_id`            string COMMENT '购买商品的id',
    `buy_num`             INT COMMENT '购买商品的数量',
    `goods_price`         DECIMAL(11,2) COMMENT '购买商品的价格',
    `total_price`         DECIMAL(11,2) COMMENT '购买商品的价格 = 商品的数量 * 商品的单价 ',
    `goods_name`          string COMMENT '商品的名称',
    `goods_image`         string COMMENT '商品的图片',
    `goods_specification` string COMMENT '商品规格',
    `goods_weight`        INT,
    `goods_unit`          string COMMENT '商品计量单位',
    `goods_type`          string COMMENT '商品分类     ytgj:进口商品    ytsc:普通商品     hots爆品',
    `refund_order_id`     string COMMENT '退款订单的id',
    `goods_brokerage`     DECIMAL(11,2) COMMENT '商家设置的商品分润的金额',
    `is_refund`           TINYINT COMMENT '0.不退款; 1.退款',
    `create_user`         string,
    `create_time`         string,
    `update_user`         string,
    `update_time`         string,
    `is_valid`            TINYINT COMMENT '是否有效  0: false; 1: true'
)
 comment '订单和商品的中间表'
 partitioned by (dt string) row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress' = 'ZLIB');
    2.3-新增及更新同步

--店铺表:t_store

DROP TABLE if exists yp_ods.t_store;
CREATE TABLE yp_ods.t_store
(
    `id`                 string COMMENT '主键',
    `user_id`            string,
    `store_avatar`       string COMMENT '店铺头像',
    `address_info`       string COMMENT '店铺详细地址',
    `name`               string COMMENT '店铺名称',
    `store_phone`        string COMMENT '联系电话',
    `province_id`        INT COMMENT '店铺所在省份ID',
    `city_id`            INT COMMENT '店铺所在城市ID',
    `area_id`            INT COMMENT '店铺所在县ID',
    `mb_title_img`       string COMMENT '手机店铺 页头背景图',
    `store_description` string COMMENT '店铺描述',
    `notice`             string COMMENT '店铺公告',
    `is_pay_bond`        TINYINT COMMENT '是否有交过保证金 1:是0:否',
    `trade_area_id`      string COMMENT '归属商圈ID',
    `delivery_method`    TINYINT COMMENT '配送方式  1 :自提 ;3 :自提加配送均可; 2 : 商家配送',
    `origin_price`       DECIMAL,
    `free_price`         DECIMAL,
    `store_type`         INT COMMENT '店铺类型 22天街网店 23实体店 24直营店铺 33会员专区店',
    `store_label`        string COMMENT '店铺logo',
    `search_key`         string COMMENT '店铺搜索关键字',
    `end_time`           string COMMENT '营业结束时间',
    `start_time`         string COMMENT '营业开始时间',
    `operating_status`   TINYINT COMMENT '营业状态  0 :未营业 ;1 :正在营业',
    `create_user`        string,
    `create_time`        string,
    `update_user`        string,
    `update_time`        string,
    `is_valid`           TINYINT COMMENT '0关闭,1开启,3店铺申请中',
    `state`              string COMMENT '可使用的支付类型:MONEY金钱支付;CASHCOUPON现金券支付',
    `idCard`             string COMMENT '身份证',
    `deposit_amount`     DECIMAL(11,2) COMMENT '商圈认购费用总额',
    `delivery_config_id` string COMMENT '配送配置表关联ID',
    `aip_user_id`        string COMMENT '通联支付标识ID',
    `search_name`        string COMMENT '模糊搜索名称字段:名称_+真实名称',
    `automatic_order`    TINYINT COMMENT '是否开启自动接单功能 1:是  0 :否',
    `is_primary`         TINYINT COMMENT '是否是总店 1: 是 2: 不是',
    `parent_store_id`    string COMMENT '父级店铺的id,只有当is_primary类型为2时有效'
)
comment '店铺表'
partitioned by (dt string) row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress'='ZLIB');

--商圈表:t_trade_area

DROP TABLE if exists yp_ods.t_trade_area;
CREATE TABLE yp_ods.t_trade_area
(
    `id`                  string COMMENT '主键',
    `user_id`             string COMMENT '用户ID',
    `user_allinpay_id`    string COMMENT '通联用户表id',
    `trade_avatar`        string COMMENT '商圈logo',
    `name`                string COMMENT '商圈名称',
    `notice`              string COMMENT '商圈公告',
    `distric_province_id` INT COMMENT '商圈所在省份ID',
    `distric_city_id`     INT COMMENT '商圈所在城市ID',
    `distric_area_id`     INT COMMENT '商圈所在县ID',
    `address`             string COMMENT '商圈地址',
    `radius`              double COMMENT '半径',
    `mb_title_img`        string COMMENT '手机商圈 页头背景图',
    `deposit_amount`      DECIMAL(11,2) COMMENT '商圈认购费用总额',
    `hava_deposit`        INT COMMENT '是否有交过保证金 1:是0:否',
    `state`               TINYINT COMMENT '申请商圈状态 -1 :未认购 ;0 :申请中;1 :已认购 ;',
    `search_key`          string COMMENT '商圈搜索关键字',
    `create_user`         string,
    `create_time`         string,
    `update_user`         string,
    `update_time`         string,
    `is_valid`            TINYINT COMMENT '是否有效  0: false; 1: true'
)
comment '商圈表'
partitioned by (dt string) row format delimited fields terminated by '\t' stored as orc tblproperties ('orc.compress'='ZLIB');

......

3.5 通过sqoop将数据导入到ODS层

1-采用全量覆盖导入操作:

日期表: t_date

sqoop import \
--connect jdbc:mysql://hadoop01:3306/yipin \
--username root \
--password 123456 \
--query 'select * from t_date where 1=1 and $CONDITIONS' \
--hcatalog-database 'yp_ods' \
--hcatalog-table 't_date' \
--fields-terminated-by '\t' \
-m 1

2-仅新增同步方式的表 导入操作:

订单评价表 :t_goods_evaluation

sqoop import \
--connect jdbc:mysql://hadoop01:3306/yipin \
--username root \
--password 123456 \
--query "select *, '2021-03-03' as dt from t_goods_evaluation where 1=1 and create_time between '2010-01-01 00:00:00' and '2021-01-01 23:59:59' and \$CONDITIONS" \
--hcatalog-database 'yp_ods' \
--hcatalog-table 't_goods_evaluation' \
--fields-terminated-by '\t' \
-m 1

3-新增及更新同步方式的表导入操作:

订单表 t_shop_order

sqoop import \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin' \
--username root \
--password 123456 \
--query "select *, '2021-03-03' as dt from t_shop_order where 1=1 and ((create_time between '2010-01-01 00:00:00' and '2021-01-01 23:59:59') or (update_time between '2010-01-01 00:00:00' and '2021-01-01 23:59:59')) and \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_order \
-m 1

目前书写的这些sqoop脚本是比较死板的。原因:其中的日期本应该是一个变量,随着时间的推移,每天都应该自动指向上一天的日期数据,但是目前都写死了,每天都需要手动修改,这种方式并不是我们想要的。如何解决这个问题呢?希望除了能够自动获取上一天的日期,还能支持根据指定的日期导入相关的数据。可以基于SHELL脚本来实现,后续再将shell脚本通过oozie完成自动化调度操作

1-思考:在shell执行的时候,是否支持读取到外部传递的参数?完全支持的

2-如何通过shell读取上一天的日期呢?

    获取当前日期:date
             2022年04月28日星期四10:23:06 CST
    获取上一天日期:date -d '-1 day'

    获取上一小时日期:date -d '-1 hour'

3-如何让日期数据按照特定的格式输出呢?

    date -d '-1 day' +'%Y-%m-%d %H:%M:%S'

    2022-04-27 10:27:20

4-如果外部传递了参数,shell内部如何接收呢?

    $#:获取当前外部一共传递了多少个参数

    $N:N 表示数字,比如 $1 获取第一个参数

5-在编写一个shell脚本时,默认第一行书写 #!/bin/bash,用于标识这是一个shell脚本,采用bash解释器

    运行一个shell脚本方式:sh 脚本

6-如果外部传递了参数,按照指定的参数日期进行数据采集,如果没有传递,使用上一天的日期即可

#注意:[]内部两端都要有空格
if [ $# == 1 ]
#等号两端不允许出现空格
then dateStr=$1
#反引号(`):内部的内容会先执行
else dateStr=`date -d '-1 day' +'%Y-%m-%d'`
fi
# ${变量}: 用于获取变量的值
echo ${dateStr}

shell操作hive

hive -S -e 'sql语句'

-S:表示静默执行,避免输出太多的日志
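
下面是shell中调用hive的一个简单示意(SQL仅作示意),结合前面获取到的日期变量使用:

#!/bin/bash
dateStr=`date -d '-1 day' +'%Y-%m-%d'`
# -S 静默执行,-e 后面跟要执行的SQL语句
hive -S -e "select count(*) from yp_ods.t_user_login where dt='${dateStr}';"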

编写shell脚本

#! /bin/bash
#SQOOP_HOME=/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/bin/sqoop
SQOOP_HOME=/usr/bin/sqoop
if [[ $1 == "" ]];then
   TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
else
   TD_DATE=$1
fi

echo '========================================'
echo '==============开始增量导入==============='
echo '========================================'

# 全量表
/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select * from t_district where 1=1 and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_district \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select * from t_date where 1=1 and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_date \
-m 1
wait

# 增量表
/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_goods_evaluation where 1=1 and (create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_goods_evaluation \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_user_login where 1=1 and (login_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_user_login \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_order_pay where 1=1 and (create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_order_pay \
-m 1
wait

# 新增和更新同步
/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_store where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_store \
-m 1
wait

/usr/bin/sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true -Dorg.apache.sqoop.db.type=mysql \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_trade_area where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS order by id" \
--hcatalog-database yp_ods \
--hcatalog-table t_trade_area \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_location where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_location \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_goods where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_goods \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_goods_class where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_goods_class \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_brand where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_brand \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_shop_order where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_order \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_shop_order_address_detail where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_order_address_detail \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_order_settle where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_order_settle \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_refund_order where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_refund_order \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_shop_order_group where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_order_group \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_shop_order_goods_details where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_order_goods_details \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_shop_cart where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_shop_cart \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_store_collect where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_store_collect \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_goods_collect where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_goods_collect \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_order_delievery_item where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_order_delievery_item \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_goods_evaluation_detail where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_goods_evaluation_detail \
-m 1
wait

/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?enabledTLSProtocols=TLSv1.2&useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_trade_record where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_trade_record \
-m 1
wait

echo '========================================'
echo '=================success==============='
echo '========================================'

3.6 oozie定时

目前脚本确实写好了,但是还需要每天手动执行一次才可以,而且这个执行操作必须在凌晨或者深夜进行(因为比较影响业务端资源),显然不太合适。此时需要让程序能够定时地、周期性地运行,所以需要使用oozie完成定时调度操作

1-配置工作流

2-配置计划

4. 分桶表

在创建表的时候,指定分桶字段,并设置分多少个桶;在添加数据的时候,hive会根据设置的分桶字段,将数据划分到N个桶(文件)中,默认情况采用HASH分桶方案。分多少个桶,取决于建表时设置的分桶数量;分了多少个桶,最终翻译成的MR也就会运行多少个reduce程序(HIVE的分桶本质上就是MR的分区操作)

注意:

  • 如果使用 apache 版本的HIVE,默认情况下,是可以通过 load data 方式来加载数据,只不过没有分桶的效果;
  • 但是对于 CDH版本中,是不允许通过 load data 方式来加载的:在CDH中默认开启了一个参数,禁止采用load data方式向桶表添加数据:set hive.strict.checks.bucketing =true;

如果 现有一个文本文件数据,需要加载到分桶表,如何解决呢?

  • 第一步:基于桶表创建一张临时表,此表和桶表保持相同字段,唯一区别是当前这个表不是一个桶表
  • 第二步:将数据先加载到这个临时表中
  • 第三步:基于临时表,使用 insert into | overwrite + select 将数据添加到桶表(见下方示意)

注意:sqoop不支持桶表数据导入操作
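
下面是上述三步加载桶表的一个简化示意(表名、字段、路径均为假设):

-- 桶表:按 id 分桶,分为 4 个桶
create table t_bucket (id int, name string)
clustered by (id) into 4 buckets
row format delimited fields terminated by ',';

-- 第一步:创建字段相同、但不分桶的临时表
create table t_bucket_tmp (id int, name string)
row format delimited fields terminated by ',';

-- 第二步:将文本文件先加载到临时表
load data inpath '/data/demo.txt' into table t_bucket_tmp;

-- 第三步:通过 insert + select 将数据写入桶表
-- (较老版本的hive可能还需要 set hive.enforce.bucketing=true;)
insert overwrite table t_bucket select * from t_bucket_tmp;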

桶表有什么用呢?

  • 进行数据采样工作
      - 当表的数据量比较庞大的时候,在编写SQL语句后,需要首先测试SQL是否可以正常执行;由于表数据量比较庞大,测试一条SQL时整个运行时间比较久,为了提升测试效率,可以从整个表中抽样出一部分数据进行测试
      - 校验数据的可行性(质量校验)
      - 进行统计分析的时候,并不需要统计出具体的指标,可能统计的都是一些相对性指标,比如说一些比率问题,可以通过抽样的方式来计算
  • 提升查询的效率(更主要是提升JOIN的效率)

4.1 如何提升join效率

4.2 数据采样

采样函数:
    tablesample(bucket x out of y on column)

使用位置:紧跟在表名的后面,如果表名有别名,必须放置在别名的前面

说明:

    x:从第几个桶开始进行采样

    y:抽样比例
     column:分桶的字段,可以省略

注意:
    x 不能大于 y
    y 必须是表的分桶数量的倍数或者因子

案例

1.假设 A表有10个桶,请分析,下面的采样函数,会将哪些桶抽取出来呢?

    tablesample(bucket 2 out of 5 on xxx)

会抽取出几个桶的数据呢?

  总桶数 / 抽样比例 = 10 / 5 = 2个桶

抽取哪几个桶呢?从第x个桶开始,之后每次加y:(x, x+y)

  (2,7)

2.假设 A 表有20个桶,请分析,下面的抽样函数,会将哪些桶抽取出来呢?

    tablesample(bucket 4 out of 4 on xxx)

会抽取出几个桶的数据呢?

      总桶数 / 抽样比例 = 20 / 4 = 5个桶

抽取哪几个桶呢?
4,8,12,16,20

    tablesample(bucket 8 out of 40 on xxx)

会抽取出几个桶的数据呢?

    总桶数 / 抽样比例 = 20 / 40 = 二分之一个桶

抽取哪几个桶呢?

    8号桶的一半
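
对应的采样查询写法大致如下(以上面假设的A表为例,分桶字段假设为id):

-- 按 2 out of 5 抽样,10个桶的表会抽取第2、7号桶的数据
select * from A tablesample(bucket 2 out of 5 on id);

-- 表有别名时,采样函数必须写在别名前面
select a.* from A tablesample(bucket 2 out of 5 on id) a;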

4.3 执行计划

用户提交HiveSQL查询后,Hive会把查询语句转换为MapReduce作业,并自动完成整个执行过程,一般情况下,我们并不需要知道内部是如何运行的。
执行计划可以告诉我们查询过程的关键信息,用来帮助我们判定优化措施是否已经生效。
语法:

EXPLAIN [EXTENDED] query
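
例如,查看一条统计SQL的执行计划(表沿用前面ODS层的登录记录表,仅作示意):

EXPLAIN
select dt, count(*) from yp_ods.t_user_login group by dt;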

5. DWD层

DWD层作用:DWD层 和 ODS层保持相同粒度,从ODS层将数据抽取出来,对数据进行清洗转换的操作,将清洗转换后的数据灌入到DWD层中

1-构建DWD层的库

CREATE DATABASE yp_dwd;

说明: 关于在DWD层,一般主要做哪些清洗和转换的操作呢?

关于当前项目的DWD层构建:

  • 由于粒度是一致的,所以DWD层表数量以及表的结构基本上和ODS层是一致的;
  • 在DWD层建表的时候,将压缩方案从原有的zlib更改为snappy,便于后续读取操作;
  • 对于同步方式为新增及更新的表,由于需要在DWD层中对历史数据进行拉链处理操作,所以在DWD层建表的时候,会新增两个字段:start_date(拉链开始时间)和 end_date(拉链结束时间),其中会将start_date作为分区字段;

在实际生产环境中,一般需要做哪些清洗转换的操作呢?

  • 去除无用的空值、缺失值
  • 去重
  • 过滤掉一些已经标记为删除的数据
  • 如果发现一个字段中涵盖了多个字段的信息,一般需要将其拆分为多个字段来分别处理,比如说:日期,ODS层中日期值可能为 2022-01-01 14:25:30(含年月日时分秒);可以将其拆解为年字段、月字段、日字段、小时字段、季度字段...
  • 原有数据可能是通过数字来表示一种含义,可能需要将其转换为具体的内容,比如说:数据中用 1 表示男性、0 表示女性,直接将其转换为男和女
  • 维度退化操作,将多个相关的表合并为一个表(此种一般需要JOIN大量的表来处理)。注意:此操作一般会独立处理,不会和清洗转换放置在一块,除非非常简单
  • JSON数据的拉平操作:比如说一个字段为 content 字段,字段里面数据格式为 {'name':'张三','address':'北京'},此时需要将其拉宽拉平,形成两个新的字段:content_name、content_address

目前,在我们项目中,基本上不做任何的转换操作,主要原因是当前这份数据本身就是一些测试数据,里面将大量的敏感数据做了脱敏,导致一旦进行清洗处理,可能什么数据都不剩下了
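
如果需要做清洗转换,DWD层的典型写法大致如下(表名、字段、条件均为假设,仅作示意):

insert overwrite table yp_dwd.fact_demo partition (dt = '2021-03-03')
select
    id,
    case when sex = 1 then '男' else '女' end as sex,   -- 数字转换为具体含义
    substr(create_time, 1, 10) as create_date            -- 从时间字段中拆出日期
from yp_ods.t_demo
where dt = '2021-03-03'
  and is_valid = 1            -- 过滤标记为无效/删除的数据
  and id is not null;         -- 去除无用的空值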

5.1 建表操作

事实表:

--订单事实表(拉链表)
DROP TABLE if EXISTS yp_dwd.fact_shop_order;
CREATE TABLE yp_dwd.fact_shop_order(
  id string COMMENT '根据一定规则生成的订单编号', 
  order_num string COMMENT '订单序号', 
  buyer_id string COMMENT '买家的userId', 
  store_id string COMMENT '店铺的id', 
  order_from string COMMENT '此字段可以转换 1.安卓\; 2.ios\; 3.小程序H5 \; 4.PC', 
  order_state int COMMENT '订单状态:1.已下单\; 2.已付款, 3. 已确认 \;4.配送\; 5.已完成\; 6.退款\;7.已取消', 
  create_date string COMMENT '下单时间', 
  finnshed_time timestamp COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价', 
  is_settlement tinyint COMMENT '是否结算\;0.待结算订单\; 1.已结算订单\;', 
  is_delete tinyint COMMENT '订单评价的状态:0.未删除\;  1.已删除\;(默认0)', 
  evaluation_state tinyint COMMENT '订单评价的状态:0.未评价\;  1.已评价\;(默认0)', 
  way string COMMENT '取货方式:SELF自提\;SHOP店铺负责配送', 
  is_stock_up int COMMENT '是否需要备货 0:不需要    1:需要    2:平台确认备货  3:已完成备货 4平台已经将货物送至店铺 ', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志',
  end_date string COMMENT '拉链结束日期')
COMMENT '订单表'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单详情表(拉链表)
DROP TABLE if EXISTS yp_dwd.fact_shop_order_address_detail;
CREATE TABLE yp_dwd.fact_shop_order_address_detail(
  id string COMMENT '关联订单的id', 
  order_amount decimal(36,2) COMMENT '订单总金额:购买总金额-优惠金额', 
  discount_amount decimal(36,2) COMMENT '优惠金额', 
  goods_amount decimal(36,2) COMMENT '用户购买的商品的总金额+运费', 
  is_delivery string COMMENT '0.自提;1.配送', 
  buyer_notes string COMMENT '买家备注留言', 
  pay_time string, 
  receive_time string, 
  delivery_begin_time string, 
  arrive_store_time string, 
  arrive_time string COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志',
  end_date string COMMENT '拉链结束日期')
COMMENT '订单详情表'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单结算表
DROP TABLE if exists yp_dwd.fact_order_settle;
CREATE TABLE yp_dwd.fact_order_settle(
  id string COMMENT '结算单号', 
  order_id string, 
  settlement_create_date string COMMENT '用户申请结算的时间', 
  settlement_amount decimal(36,2) COMMENT '如果发生退款,则结算的金额 = 订单的总金额 - 退款的金额', 
  dispatcher_user_id string COMMENT '配送员id', 
  dispatcher_money decimal(36,2) COMMENT '配送员的配送费(配送员的运费(如果退货方式为1:则买家支付配送费))', 
  circle_master_user_id string COMMENT '圈主id', 
  circle_master_money decimal(36,2) COMMENT '圈主分润的金额', 
  plat_fee decimal(36,2) COMMENT '平台应得的分润', 
  store_money decimal(36,2) COMMENT '商家应得的订单金额', 
  status tinyint COMMENT '0.待结算;1.待审核 \; 2.完成结算;3.拒绝结算', 
  note string COMMENT '原因', 
  settle_time string COMMENT ' 结算时间', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志', 
  first_commission_user_id string COMMENT '一级分佣用户', 
  first_commission_money decimal(36,2) COMMENT '一级分佣金额', 
  second_commission_user_id string COMMENT '二级分佣用户', 
  second_commission_money decimal(36,2) COMMENT '二级分佣金额',
  end_date string COMMENT '拉链结束日期')
COMMENT '订单结算表'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--退款订单表(拉链表)
DROP TABLE if exists yp_dwd.fact_refund_order;
CREATE TABLE yp_dwd.fact_refund_order
(
    id                   string COMMENT '退款单号',
    order_id             string COMMENT '订单的id',
    apply_date           string COMMENT '用户申请退款的时间',
    modify_date          string COMMENT '退款订单更新时间',
    refund_reason        string COMMENT '买家退款原因',
    refund_amount        DECIMAL(11,2) COMMENT '订单退款的金额',
    refund_state         TINYINT COMMENT '1.申请退款;2.拒绝退款; 3.同意退款,配送员配送; 4:商家同意退款,用户亲自送货 ;5.退款完成',
    refuse_refund_reason string COMMENT '商家拒绝退款原因',
    refund_goods_type    string COMMENT '1.上门取货(买家承担运费); 2.买家送达;',
    refund_shipping_fee  DECIMAL(11,2) COMMENT '配送员的运费(如果退货方式为1:则买家支付配送费)',
    create_user          string,
    create_time          string,
    update_user          string,
    update_time          string,
    is_valid             TINYINT COMMENT '是否有效  0: false; 1: true;   订单是否有效的标志',
      end_date string COMMENT '拉链结束日期'
)
comment '退款订单表'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单组表(拉链表)
DROP TABLE if EXISTS yp_dwd.fact_shop_order_group;
CREATE TABLE yp_dwd.fact_shop_order_group(
  id string, 
  order_id string COMMENT '订单id', 
  group_id string COMMENT '订单分组id', 
  is_pay tinyint COMMENT '是否已支付,0未支付,1已支付', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint,
  end_date string COMMENT '拉链结束日期')
COMMENT '订单组'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单组支付(拉链表)
DROP TABLE if EXISTS yp_dwd.fact_order_pay;
CREATE TABLE yp_dwd.fact_order_pay(
  id string, 
  group_id string COMMENT '关联shop_order_group的group_id,一对多订单', 
  order_pay_amount decimal(36,2) COMMENT '订单总金额\;', 
  create_date string COMMENT '订单创建的时间,需要根据订单创建时间进行判断订单是否已经失效', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志',
  end_date string COMMENT '拉链结束日期')
COMMENT '订单组支付表'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单商品快照(拉链表)
DROP TABLE if EXISTS yp_dwd.fact_shop_order_goods_details;
CREATE TABLE yp_dwd.fact_shop_order_goods_details(
  id string COMMENT 'id主键', 
  order_id string COMMENT '对应订单表的id', 
  shop_store_id string COMMENT '卖家店铺ID', 
  buyer_id string COMMENT '购买用户ID', 
  goods_id string COMMENT '购买商品的id', 
  buy_num int COMMENT '购买商品的数量', 
  goods_price decimal(36,2) COMMENT '购买商品的价格', 
  total_price decimal(36,2) COMMENT '购买商品的价格 = 商品的数量 * 商品的单价 ', 
  goods_name string COMMENT '商品的名称', 
  goods_image string COMMENT '商品的图片', 
  goods_specification string COMMENT '商品规格', 
  goods_weight int, 
  goods_unit string COMMENT '商品计量单位', 
  goods_type string COMMENT '商品分类     ytgj:进口商品    ytsc:普通商品     hots爆品', 
  refund_order_id string COMMENT '退款订单的id', 
  goods_brokerage decimal(36,2) COMMENT '商家设置的商品分润的金额', 
  is_refund tinyint COMMENT '0.不退款\; 1.退款', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true',
  end_date string COMMENT '拉链结束日期')
COMMENT '订单商品快照'
partitioned by (start_date string)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--订单评价表(增量表,与ODS一致)
DROP TABLE if EXISTS yp_dwd.fact_goods_evaluation;
CREATE TABLE yp_dwd.fact_goods_evaluation(
  id string, 
  user_id string COMMENT '评论人id', 
  store_id string COMMENT '店铺id', 
  order_id string COMMENT '订单id', 
  geval_scores int COMMENT '综合评分', 
  geval_scores_speed int COMMENT '送货速度评分0-5分(配送评分)', 
  geval_scores_service int COMMENT '服务评分0-5分', 
  geval_isanony tinyint COMMENT '0-匿名评价,1-非匿名', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启')
COMMENT '订单评价表'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--商品评价表(增量表,与ODS一致)
DROP TABLE if EXISTS yp_dwd.fact_goods_evaluation_detail;
CREATE TABLE yp_dwd.fact_goods_evaluation_detail(
  id string, 
  user_id string COMMENT '评论人id', 
  store_id string COMMENT '店铺id', 
  goods_id string COMMENT '商品id', 
  order_id string COMMENT '订单id', 
  order_goods_id string COMMENT '订单商品表id', 
  geval_scores_goods int COMMENT '商品评分0-10分', 
  geval_content string, 
  geval_content_superaddition string COMMENT '追加评论', 
  geval_addtime string COMMENT '评论时间', 
  geval_addtime_superaddition string COMMENT '追加评论时间', 
  geval_state tinyint COMMENT '评价状态 1-正常 0-禁止显示', 
  geval_remark string COMMENT '管理员对评价的处理备注', 
  revert_state tinyint COMMENT '回复状态0未回复1已回复', 
  geval_explain string COMMENT '管理员回复内容', 
  geval_explain_superaddition string COMMENT '管理员追加回复内容', 
  geval_explaintime string COMMENT '管理员回复时间', 
  geval_explaintime_superaddition string COMMENT '管理员追加回复时间', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启')
COMMENT '商品评价明细'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--配送表(增量表,与ODS一致)
DROP TABLE if EXISTS yp_dwd.fact_order_delievery_item;
CREATE TABLE yp_dwd.fact_order_delievery_item(
  id string COMMENT '主键id', 
  shop_order_id string COMMENT '订单表ID', 
  refund_order_id string, 
  dispatcher_order_type tinyint COMMENT '配送订单类型1.支付单\; 2.退款单', 
  shop_store_id string COMMENT '卖家店铺ID', 
  buyer_id string COMMENT '购买用户ID', 
  circle_master_user_id string COMMENT '圈主ID', 
  dispatcher_user_id string COMMENT '配送员ID', 
  dispatcher_order_state tinyint COMMENT '配送订单状态:0.待接单.1.已接单,2.已到店.3.配送中 4.商家普通提货码完成订单.5.商家万能提货码完成订单。6,买家完成订单', 
  order_goods_num tinyint COMMENT '订单商品的个数', 
  delivery_fee decimal(36,2) COMMENT '配送员的运费', 
  distance int COMMENT '配送距离', 
  dispatcher_code string COMMENT '收货码', 
  receiver_name string COMMENT '收货人姓名', 
  receiver_phone string COMMENT '收货人电话', 
  sender_name string COMMENT '发货人姓名', 
  sender_phone string COMMENT '发货人电话', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true')
COMMENT '订单配送详细信息表'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--登录记录表(增量表,与ODS一致)
DROP TABLE if exists yp_dwd.fact_user_login;
CREATE TABLE yp_dwd.fact_user_login(
    id string,
    login_user string,
    login_type string COMMENT '登录类型(登陆时使用)',
    client_id string COMMENT '推送标示id(登录、第三方登录、注册、支付回调、给用户推送消息时使用)',
    login_time string,
    login_ip string,
    logout_time string
)
COMMENT '用户登录记录表'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

--购物车(拉链表)
DROP TABLE if exists yp_dwd.fact_shop_cart;
CREATE TABLE yp_dwd.fact_shop_cart
(
    id            string COMMENT '主键id',
    shop_store_id string COMMENT '卖家店铺ID',
    buyer_id      string COMMENT '购买用户ID',
    goods_id      string COMMENT '购买商品的id',
    buy_num       INT COMMENT '购买商品的数量',
    create_user   string,
    create_time   string,
    update_user   string,
    update_time   string,
    is_valid      TINYINT COMMENT '是否有效  0: false; 1: true',
  end_date string COMMENT '拉链结束日期')
comment '购物车'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

--收藏店铺记录(拉链表)
DROP TABLE if exists yp_dwd.fact_store_collect;
CREATE TABLE yp_dwd.fact_store_collect
(
    id          string,
    user_id     string COMMENT '收藏人id',
    store_id    string COMMENT '店铺id',
    create_user string,
    create_time string,
    update_user string,
    update_time string,
    is_valid    TINYINT COMMENT '0 :失效,1 :开启',
  end_date string COMMENT '拉链结束日期')
comment '收藏店铺记录表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

--商品收藏(拉链表)
DROP TABLE if exists yp_dwd.fact_goods_collect;
CREATE TABLE yp_dwd.fact_goods_collect
(
    id          string,
    user_id     string COMMENT '收藏人id',
    goods_id    string COMMENT '商品id',
    store_id    string COMMENT '通过哪个店铺收藏的(因主店分店概念存在需要)',
    create_user string,
    create_time string,
    update_user string,
    update_time string,
    is_valid    TINYINT COMMENT '0 :失效,1 :开启',
  end_date string COMMENT '拉链结束日期')
comment '收藏商品记录表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

--交易记录(增量表,和ODS一致)
DROP TABLE if exists yp_dwd.fact_trade_record;
CREATE TABLE yp_dwd.fact_trade_record
(
    id                   string COMMENT '交易单号',
    external_trade_no    string COMMENT '(支付,结算.退款)第三方交易单号',
    relation_id          string COMMENT '关联单号',
    trade_type           TINYINT COMMENT '1.支付订单; 2.结算订单; 3.退款订单;4.充值单;5.提现单;6.分销单;7缴纳保证金单8退还保证金单9,冻结通联订单,10通联通账户余额充值,11.扫码单',
    status               TINYINT COMMENT '1.成功;2.失败;3.进行中',
    finnshed_time        string COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价',
    fail_reason          string COMMENT '交易失败的原因',
    payment_type         string COMMENT '支付方式:小程序,app微信,支付宝,快捷支付,钱包,银行卡,消费券',
    trade_before_balance DECIMAL(11,2) COMMENT '交易前余额',
    trade_true_amount    DECIMAL(11,2) COMMENT '交易实际支付金额,第三方平台扣除优惠以后实际支付金额',
    trade_after_balance  DECIMAL(11,2) COMMENT '交易后余额',
    note                 string COMMENT '业务说明',
    user_card            string COMMENT '第三方平台账户标识/多钱包用户钱包id',
    user_id              string COMMENT '用户id',
    aip_user_id          string COMMENT '钱包id',
    create_user          string,
    create_time          string,
    update_user          string,
    update_time          string,
    is_valid             TINYINT COMMENT '是否有效  0: false; 1: true;   订单是否有效的标志')
comment '交易记录'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

维度表

--区域字典表(全量覆盖)
DROP TABLE if EXISTS yp_dwd.dim_district;
CREATE TABLE yp_dwd.dim_district(
  id string COMMENT '主键ID', 
  code string COMMENT '区域编码', 
  name string COMMENT '区域名称', 
  pid string COMMENT '父级ID', 
  alias string COMMENT '别名')
COMMENT '区域字典表'
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--时间维度
drop table yp_dwd.dim_date;
CREATE TABLE yp_dwd.dim_date
(
    dim_date_id           string COMMENT '日期',
    date_code             string COMMENT '日期编码',
    lunar_calendar        string COMMENT '农历',
    year_code             string COMMENT '年code',
    year_name             string COMMENT '年名称',
    month_code            string COMMENT '月份编码',
    month_name            string COMMENT '月份名称',
    quanter_code          string COMMENT '季度编码',
    quanter_name          string COMMENT '季度名称',
    year_month            string COMMENT '年月',
    year_week_code        string COMMENT '一年中第几周',
    year_week_name        string COMMENT '一年中第几周名称',
    year_week_code_cn     string COMMENT '一年中第几周(中国)',
    year_week_name_cn     string COMMENT '一年中第几周名称(中国',
    week_day_code         string COMMENT '周几code',
    week_day_name         string COMMENT '周几名称',
    day_week              string COMMENT '周',
    day_week_cn           string COMMENT '周(中国)',
    day_week_num          string COMMENT '一周第几天',
    day_week_num_cn       string COMMENT '一周第几天(中国)',
    day_month_num         string COMMENT '一月第几天',
    day_year_num          string COMMENT '一年第几天',
    date_id_wow           string COMMENT '与本周环比的上周日期',
    date_id_mom           string COMMENT '与本月环比的上月日期',
    date_id_wyw           string COMMENT '与本周同比的上年日期',
    date_id_mym           string COMMENT '与本月同比的上年日期',
    first_date_id_month   string COMMENT '本月第一天日期',
    last_date_id_month    string COMMENT '本月最后一天日期',
    half_year_code        string COMMENT '半年code',
    half_year_name        string COMMENT '半年名称',
    season_code           string COMMENT '季节编码',
    season_name           string COMMENT '季节名称',
    is_weekend            string COMMENT '是否周末(周六和周日)',
    official_holiday_code string COMMENT '法定节假日编码',
    official_holiday_name string COMMENT '法定节假日',
    festival_code         string COMMENT '节日编码',
    festival_name         string COMMENT '节日',
    custom_festival_code  string COMMENT '自定义节日编码',
    custom_festival_name  string COMMENT '自定义节日',
    update_time           string COMMENT '更新时间'
)
COMMENT '时间维度表'
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--店铺(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_store;
CREATE TABLE yp_dwd.dim_store(
  id string COMMENT '主键', 
  user_id string, 
  store_avatar string COMMENT '店铺头像', 
  address_info string COMMENT '店铺详细地址', 
  name string COMMENT '店铺名称', 
  store_phone string COMMENT '联系电话', 
  province_id int COMMENT '店铺所在省份ID', 
  city_id int COMMENT '店铺所在城市ID', 
  area_id int COMMENT '店铺所在县ID', 
  mb_title_img string COMMENT '手机店铺 页头背景图', 
  store_description string COMMENT '店铺描述', 
  notice string COMMENT '店铺公告', 
  is_pay_bond tinyint COMMENT '是否有交过保证金 1:是0:否', 
  trade_area_id string COMMENT '归属商圈ID', 
  delivery_method tinyint COMMENT '配送方式  1 :自提 ;3 :自提加配送均可\; 2 : 商家配送', 
  origin_price decimal(36,2), 
  free_price decimal(36,2), 
  store_type int COMMENT '店铺类型 22天街网店 23实体店 24直营店铺 33会员专区店', 
  store_label string COMMENT '店铺logo', 
  search_key string COMMENT '店铺搜索关键字', 
  end_time string COMMENT '营业结束时间', 
  start_time string COMMENT '营业开始时间', 
  operating_status tinyint COMMENT '营业状态  0 :未营业 ;1 :正在营业', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0关闭,1开启,3店铺申请中', 
  state string COMMENT '可使用的支付类型:MONEY金钱支付\;CASHCOUPON现金券支付', 
  idcard string COMMENT '身份证', 
  deposit_amount decimal(36,2) COMMENT '商圈认购费用总额', 
  delivery_config_id string COMMENT '配送配置表关联ID', 
  aip_user_id string COMMENT '通联支付标识ID', 
  search_name string COMMENT '模糊搜索名称字段:名称_+真实名称', 
  automatic_order tinyint COMMENT '是否开启自动接单功能 1:是  0 :否', 
  is_primary tinyint COMMENT '是否是总店 1: 是 2: 不是', 
  parent_store_id string COMMENT '父级店铺的id,只有当is_primary类型为2时有效',
  end_date string COMMENT '拉链结束日期')
COMMENT '店铺表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--商圈(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_trade_area;
CREATE TABLE yp_dwd.dim_trade_area(
  id string COMMENT '主键', 
  user_id string COMMENT '用户ID', 
  user_allinpay_id string COMMENT '通联用户表id', 
  trade_avatar string COMMENT '商圈logo', 
  name string COMMENT '商圈名称', 
  notice string COMMENT '商圈公告', 
  distric_province_id int COMMENT '商圈所在省份ID', 
  distric_city_id int COMMENT '商圈所在城市ID', 
  distric_area_id int COMMENT '商圈所在县ID', 
  address string COMMENT '商圈地址', 
  radius double COMMENT '半径', 
  mb_title_img string COMMENT '手机商圈 页头背景图', 
  deposit_amount decimal(36,2) COMMENT '商圈认购费用总额', 
  hava_deposit int COMMENT '是否有交过保证金 1:是0:否', 
  state tinyint COMMENT '申请商圈状态 -1 :未认购 ;0 :申请中;1 :已认购;', 
  search_key string COMMENT '商圈搜索关键字', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true',
  end_date string COMMENT '拉链结束日期')
COMMENT '商圈表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--地址信息表(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_location;
CREATE TABLE yp_dwd.dim_location(
  id string COMMENT '主键', 
  type int COMMENT '类型   1:商圈地址;2:店铺地址;3.用户地址管理\;4.订单买家地址信息\;5.订单卖家地址信息', 
  correlation_id string COMMENT '关联表id', 
  address string COMMENT '地图地址详情', 
  latitude double COMMENT '纬度', 
  longitude double COMMENT '经度', 
  street_number string COMMENT '门牌', 
  street string COMMENT '街道', 
  district string COMMENT '区县', 
  city string COMMENT '城市', 
  province string COMMENT '省份', 
  business string COMMENT '百度商圈字段,代表此点所属的商圈', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true', 
  adcode string COMMENT '百度adcode,对应区县code',
  end_date string COMMENT '拉链结束日期')
COMMENT '地址信息'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--商品SKU表(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_goods;
CREATE TABLE yp_dwd.dim_goods(
  id string, 
  store_id string COMMENT '所属商店ID', 
  class_id string COMMENT '分类id:只保存最后一层分类id', 
  store_class_id string COMMENT '店铺分类id', 
  brand_id string COMMENT '品牌id', 
  goods_name string COMMENT '商品名称', 
  goods_specification string COMMENT '商品规格', 
  search_name string COMMENT '模糊搜索名称字段:名称_+真实名称', 
  goods_sort int COMMENT '商品排序', 
  goods_market_price decimal(36,2) COMMENT '商品市场价', 
  goods_price decimal(36,2) COMMENT '商品销售价格(原价)', 
  goods_promotion_price decimal(36,2) COMMENT '商品促销价格(售价)', 
  goods_storage int COMMENT '商品库存', 
  goods_limit_num int COMMENT '购买限制数量', 
  goods_unit string COMMENT '计量单位', 
  goods_state tinyint COMMENT '商品状态 1正常,2下架,3违规(禁售)', 
  goods_verify tinyint COMMENT '商品审核状态: 1通过,2未通过,3审核中', 
  activity_type tinyint COMMENT '活动类型:0无活动1促销2秒杀3折扣', 
  discount int COMMENT '商品折扣(%)', 
  seckill_begin_time string COMMENT '秒杀开始时间', 
  seckill_end_time string COMMENT '秒杀结束时间', 
  seckill_total_pay_num int COMMENT '已秒杀数量', 
  seckill_total_num int COMMENT '秒杀总数限制', 
  seckill_price decimal(36,2) COMMENT '秒杀价格', 
  top_it tinyint COMMENT '商品置顶:1-是,0-否', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启',
  end_date string COMMENT '拉链结束日期')
COMMENT '商品表_店铺(SKU)'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--商品分类(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_goods_class;
CREATE TABLE yp_dwd.dim_goods_class(
  id string, 
  store_id string COMMENT '店铺id', 
  class_id string COMMENT '对应的平台分类表id', 
  name string COMMENT '店铺内分类名字', 
  parent_id string COMMENT '父id', 
  level tinyint COMMENT '分类层级', 
  is_parent_node tinyint COMMENT '是否为父节点:1是0否', 
  background_img string COMMENT '背景图片', 
  img string COMMENT '分类图片', 
  keywords string COMMENT '关键词', 
  title string COMMENT '搜索标题', 
  sort int COMMENT '排序', 
  note string COMMENT '类型描述', 
  url string COMMENT '分类的链接', 
  is_use tinyint COMMENT '是否使用:0否,1是', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启',
  end_date string COMMENT '拉链结束日期')
COMMENT '商品分类表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

--品牌表(拉链表)
DROP TABLE if EXISTS yp_dwd.dim_brand;
CREATE TABLE yp_dwd.dim_brand(
  id string, 
  store_id string COMMENT '店铺id', 
  brand_pt_id string COMMENT '平台品牌库品牌Id', 
  brand_name string COMMENT '品牌名称', 
  brand_image string COMMENT '品牌图片', 
  initial string COMMENT '品牌首字母', 
  sort int COMMENT '排序', 
  is_use tinyint COMMENT '0禁用1启用', 
  goods_state tinyint COMMENT '商品品牌审核状态 1 审核中,2 通过,3 拒绝', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启',
  end_date string COMMENT '拉链结束日期')
COMMENT '品牌(店铺)'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

5.2 DWD层数据导入

事实表

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

--===========订单事实表(拉链表)===========
INSERT overwrite TABLE yp_dwd.fact_shop_order PARTITION (start_date)
SELECT 
   id,
   order_num,
   buyer_id,
   store_id,
   case order_from 
      when 1
      then 'android'
      when 2
      then 'ios'
      when 3
      then 'miniapp'
      when 4
      then 'pcweb'
      else 'other'
      end
      as order_from,
   order_state,
   create_date,
   finnshed_time,
   is_settlement,
   is_delete,
   evaluation_state,
   way,
   is_stock_up,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   '9999-99-99' end_date,
   substr(create_time, 1, 10) as start_date
FROM yp_ods.t_shop_order
order by id;

--订单详情表(拉链表)
INSERT overwrite TABLE yp_dwd.fact_shop_order_address_detail PARTITION (start_date)
SELECT 
    id,
    order_amount,
    discount_amount,
    goods_amount,
    is_delivery,
    buyer_notes,
    pay_time,
    receive_time,
    delivery_begin_time,
    arrive_store_time,
    arrive_time,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_shop_order_address_detail
order by id;

--订单结算表
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT overwrite TABLE  yp_dwd.fact_order_settle PARTITION (start_date)
SELECT
    id
    ,order_id
    ,settlement_create_date
    ,settlement_amount
    ,dispatcher_user_id
    ,dispatcher_money
    ,circle_master_user_id
    ,circle_master_money
    ,plat_fee
    ,store_money
    ,status
    ,note
    ,settle_time
    ,create_user
    ,create_time
    ,update_user
    ,update_time
    ,is_valid
    ,first_commission_user_id
    ,first_commission_money
    ,second_commission_user_id
    ,second_commission_money
    ,'9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_order_settle;

--订单退款表
INSERT overwrite TABLE yp_dwd.fact_refund_order PARTITION (start_date)
SELECT
    id
    ,order_id
    ,apply_date
    ,modify_date
    ,refund_reason
    ,refund_amount
    ,refund_state
    ,refuse_refund_reason
    ,refund_goods_type
    ,refund_shipping_fee
    ,create_user
    ,create_time
    ,update_user
    ,update_time
    ,is_valid
    ,'9999-99-99' end_date
    ,substr(create_time, 1, 10) as start_date
FROM yp_ods.t_refund_order;

--订单组表(拉链表)
INSERT overwrite TABLE yp_dwd.fact_shop_order_group PARTITION (start_date)
SELECT
    id,
    order_id,
    group_id,
    is_pay,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_shop_order_group;

--订单组支付表
INSERT overwrite TABLE yp_dwd.fact_order_pay PARTITION (start_date)
SELECT
    id
    ,group_id
    ,order_pay_amount
    ,create_date
    ,create_user
    ,create_time
    ,update_user
    ,update_time
    ,is_valid
    ,'9999-99-99' end_date
    ,substr(create_time, 1, 10) as start_date
FROM yp_ods.t_order_pay;

--订单商品快照(拉链表)
INSERT overwrite TABLE yp_dwd.fact_shop_order_goods_details PARTITION (start_date)
SELECT
    id,
    order_id,
    shop_store_id,
    buyer_id,
    goods_id,
    buy_num,
    goods_price,
    total_price,
    goods_name,
    goods_image,
    goods_specification,
    goods_weight,
    goods_unit,
    goods_type,
    refund_order_id,
    goods_brokerage,
    is_refund,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM
yp_ods.t_shop_order_goods_details;

--购物车(拉链表)
INSERT overwrite TABLE yp_dwd.fact_shop_cart PARTITION (start_date)
SELECT
    id,
    shop_store_id,
    buyer_id,
    goods_id,
    buy_num,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM
yp_ods.t_shop_cart;

--店铺收藏(拉链表)
INSERT overwrite TABLE yp_dwd.fact_store_collect PARTITION (start_date)
SELECT
    id,
    user_id,
    store_id,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_store_collect;

--商品收藏(拉链表)
INSERT overwrite TABLE yp_dwd.fact_goods_collect PARTITION (start_date)
SELECT
    id,
    user_id,
    goods_id,
    store_id,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_goods_collect;

--===========增量表,只会新增不会更新===========
--订单评价表(增量表,与ODS一致,可以做适当的清洗)
INSERT overwrite TABLE yp_dwd.fact_goods_evaluation PARTITION(dt)
select 
    id,
    user_id,
    store_id,
    order_id,
    geval_scores,
    geval_scores_speed,
    geval_scores_service,
    geval_isanony,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    dt
from yp_ods.t_goods_evaluation;

--评价明细表(增量表,与ODS一致)
INSERT overwrite TABLE yp_dwd.fact_goods_evaluation_detail PARTITION(start_date)
select 
   id,
   user_id,
   store_id,
   goods_id,
   order_id,
   order_goods_id,
   GEVAL_scores_goods,
   geval_content,
   geval_content_superaddition,
   geval_addtime,
   geval_addtime_superaddition,
   geval_state,
   geval_remark,
   revert_state,
   geval_explain,
   geval_explain_superaddition,
   geval_explaintime,
   geval_explaintime_superaddition,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   '9999-99-99' end_date,
   substr(create_time, 1, 10) as start_date
from yp_ods.t_goods_evaluation_detail;

--配送表(增量表,与ODS一致)
INSERT overwrite TABLE yp_dwd.fact_order_delievery_item PARTITION(start_date)
select
   id,
   shop_order_id,
   refund_order_id,
   dispatcher_order_type,
   shop_store_id,
   buyer_id,
   circle_master_user_id,
   dispatcher_user_id,
   dispatcher_order_state,
   order_goods_num,
   delivery_fee,
   distance,
   dispatcher_code,
   receiver_name,
   receiver_phone,
   sender_name,
   sender_phone,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   '9999-99-99' end_date,
   substr(create_time, 1, 10) as start_date
FROM yp_ods.t_order_delievery_item;

--用户登录记录表(增量表,与ODS一致)
INSERT overwrite TABLE yp_dwd.fact_user_login PARTITION(dt)
select
    id,
    login_user,
    login_type,
    client_id,
    login_time,
    login_ip,
    logout_time,
    SUBSTRING(login_time, 1, 10) as dt
FROM yp_ods.t_user_login;

--交易记录(增量表)
INSERT overwrite TABLE yp_dwd.fact_trade_record PARTITION (dt)
SELECT
   id,
   external_trade_no,
   relation_id,
   trade_type,
   status,
   finnshed_time,
   fail_reason,
   payment_type,
   trade_before_balance,
   trade_true_amount,
   trade_after_balance,
   note,
   user_card,
   user_id,
   aip_user_id,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   substr(create_time, 1, 10) as dt
FROM yp_ods.t_trade_record;

维度表

--分区
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--分桶
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

--===========拉链表===========
--店铺
INSERT overwrite TABLE yp_dwd.dim_store PARTITION (start_date)
select 
    id,
    user_id,
    store_avatar,
    address_info,
    name,
    store_phone,
    province_id,
    city_id,
    area_id,
    mb_title_img,
    store_description,
    notice,
    is_pay_bond,
    trade_area_id,
    delivery_method,
    origin_price,
    free_price,
    store_type,
    store_label,
    search_key,
    end_time,
    start_time,
    operating_status,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    state,
    idcard,
    deposit_amount,
    delivery_config_id,
    aip_user_id,
    search_name,
    automatic_order,
    is_primary,
    parent_store_id,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
from yp_ods.t_store;

--商圈
INSERT overwrite TABLE yp_dwd.dim_trade_area PARTITION(start_date)
SELECT 
    id,
    user_id,
    user_allinpay_id,
    trade_avatar,
    name,
    notice,
    distric_province_id,
    distric_city_id,
    distric_area_id,
    address,
    radius,
    mb_title_img,
    deposit_amount,
    hava_deposit,
    state,
    search_key,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_trade_area;

--地址信息表(拉链表)
INSERT overwrite TABLE yp_dwd.dim_location PARTITION(start_date)
SELECT
    id,
    type,
    correlation_id,
    address,
    latitude,
    longitude,
    street_number,
    street,
    district,
    city,
    province,
    business,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    adcode,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_location;

--商品SKU表(拉链表)
INSERT overwrite TABLE yp_dwd.dim_goods PARTITION(start_date)
SELECT
    id,
    store_id,
    class_id,
    store_class_id,
    brand_id,
    goods_name,
    goods_specification,
    search_name,
    goods_sort,
    goods_market_price,
    goods_price,
    goods_promotion_price,
    goods_storage,
    goods_limit_num,
    goods_unit,
    goods_state,
    goods_verify,
    activity_type,
    discount,
    seckill_begin_time,
    seckill_end_time,
    seckill_total_pay_num,
    seckill_total_num,
    seckill_price,
    top_it,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM
yp_ods.t_goods;

--商品分类(拉链表)
INSERT overwrite TABLE yp_dwd.dim_goods_class PARTITION(start_date)
SELECT
    id,
    store_id,
    class_id,
    name,
    parent_id,
    level,
    is_parent_node,
    background_img,
    img,
    keywords,
    title,
    sort,
    note,
    url,
    is_use,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_goods_class;

--品牌表(拉链表)
INSERT overwrite TABLE yp_dwd.dim_brand PARTITION(start_date)
SELECT
    id,
    store_id,
    brand_pt_id,
    brand_name,
    brand_image,
    initial,
    sort,
    is_use,
    goods_state,
    create_user,
    create_time,
    update_user,
    update_time,
    is_valid,
    '9999-99-99' end_date,
    substr(create_time, 1, 10) as start_date
FROM yp_ods.t_brand;

--===========全量覆盖===========
--区域字典表
INSERT overwrite TABLE yp_dwd.dim_district
select * from yp_ods.t_district
WHERE code IS NOT NULL AND name IS NOT NULL;

--时间维度表
INSERT overwrite TABLE yp_dwd.dim_date
select
    concat(substr(dim_date_id,1,4), '-', substr(dim_date_id,5,2), '-', substr(dim_date_id,7,2)) as dim_date_id,
    date_code,
    concat(substr(lunar_calendar,1,4), '-', substr(lunar_calendar,5,2), '-', substr(lunar_calendar,7,2)) as lunar_calendar,
    year_code,
    year_name,
    month_code,
    month_name,
    quanter_code,
    quanter_name,
    concat(substr(year_month,1,4), '-', substr(year_month,5,2)) as year_month,
    year_week_code,
    concat(substr(year_week_name,1,4), '-', substr(year_week_name,5,2)) as year_week_name,
    year_week_code_cn,
    concat(substr(year_week_name_cn,1,4), '-', substr(year_week_name_cn,5,2)) as year_week_name_cn,
    week_day_code,
    week_day_name,
    day_week,
    day_week_cn,
    day_week_num,
    day_week_num_cn,
    day_month_num,
    day_year_num,
    concat(substr(date_id_wow,1,4), '-', substr(date_id_wow,5,2), '-', substr(date_id_wow,7,2)) as date_id_wow,
    concat(substr(date_id_mom,1,4), '-', substr(date_id_mom,5,2), '-', substr(date_id_mom,7,2)) as date_id_mom,
    date_id_wyw,
    concat(substr(date_id_mym,1,4), '-', substr(date_id_mym,5,2), '-', substr(date_id_mym,7,2)) as date_id_mym,
    concat(substr(first_date_id_month,1,4), '-', substr(first_date_id_month,5,2), '-', substr(first_date_id_month,7,2)) as first_date_id_month,
    concat(substr(last_date_id_month,1,4), '-', substr(last_date_id_month,5,2), '-', substr(last_date_id_month,7,2)) as last_date_id_month,
    half_year_code,
    half_year_name,
    season_code,
    season_name,
    is_weekend,
    official_holiday_code,
    official_holiday_name,
    festival_code,
    festival_name,
    custom_festival_code,
    custom_festival_name,
    update_time
from yp_ods.t_date;

5.3 DWD层拉链表
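
课程中此节未展开具体脚本。下面给出一个示意性的拉链表更新思路(仅为假设性示例:yp_dwd.zipper_demo 与增量表 yp_dwd.zipper_demo_new 均为虚构的简化表,只保留 id、col、end_date 三个字段,按 start_date 分区;实际的表结构与合并细节以项目脚本为准):旧数据中被新数据覆盖的 id,将其开链记录(end_date='9999-99-99')关闭;新数据全部以 '9999-99-99' 开链写入。

INSERT OVERWRITE TABLE yp_dwd.zipper_demo PARTITION (start_date)
SELECT id, col, end_date, start_date
FROM (
    -- 旧数据:被新数据覆盖的开链记录,end_date 置为新数据开链日的前一天,其余保持不变
    SELECT
        t_old.id,
        t_old.col,
        CASE WHEN t_old.end_date = '9999-99-99' AND t_new.id IS NOT NULL
             THEN date_sub(t_new.start_date, 1)
             ELSE t_old.end_date
        END AS end_date,
        t_old.start_date
    FROM yp_dwd.zipper_demo t_old
    LEFT JOIN yp_dwd.zipper_demo_new t_new ON t_old.id = t_new.id
    UNION ALL
    -- 新数据:全部以 '9999-99-99' 开链写入
    SELECT id, col, '9999-99-99' AS end_date, start_date
    FROM yp_dwd.zipper_demo_new
) merged;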

5.4 JOIN优化

5.4.1 原生reduce端Join实现流程

思考:这种reduce端Join操作,存在哪些弊端呢?
1-可能会存在数据倾斜的问题(某几个reduce接收的数据量远远大于其他reduce接收的数据量)
2-所有的数据处理操作全部都压在reduce中进行,而reduce的数量相比Map来说少得多,导致整个reduce的压力比较大

5.4.2 MapJoin

Map Join:

    每一个MapTask在读取数据的时候,每读取一条数据,就会和内存中的小表数据进行匹配,如果能匹配上,将匹配上的数据合并在一起输出即可。好处:将原有reduce端Join的问题全部解决。

弊端:
1-比较消耗内存
2-要求整个Join中,必须要有一个小表,否则无法放入内存中

仅适用于:小表 join 大表 | 大表 join 小表

如何使用呢?

set hive.auto.convert.join=true;   --开启 map join 的支持,默认值为 true
set hive.auto.convert.join.noconditionaltask.size=28971520;   --设置小表数据量的最大阈值,约28M
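
下面给出一个示意性的使用场景(SQL 仅作演示:区域字典表 dim_district 作为小表、后文 DWB 层的店铺宽表 dwb_shop_detail 作为大表;当小表数据量小于上述阈值时,开启参数后 Hive 会自动把 reduce 端 Join 转为 Map Join):

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=28971520;

-- 小表 yp_dwd.dim_district(区域字典)会被加载进各个 MapTask 的内存中,与大表 yp_dwb.dwb_shop_detail 在 map 端完成关联
SELECT s.id, s.store_name, d.name AS area_name
FROM yp_dwb.dwb_shop_detail s
JOIN yp_dwd.dim_district d ON s.area_id = d.id;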

5.4.3 Bucket Map Join

中型表 和 大表 join:

方案一:如果中型表能对数据进行提前过滤,尽量提前过滤;过滤后,有可能满足Map Join的条件(并不一定可用)
方案二:Bucket Map Join(建表与参数示例见下方)
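
下面是一个示意性的 Bucket Map Join 建表与参数设置(表名 demo_mid_table、demo_big_table 以及桶数均为假设,仅用于说明:两张表都按 join 列分桶,且大表的桶数是中型表桶数的整数倍):

-- 中型表:按 join 列 user_id 分桶
CREATE TABLE demo_mid_table(
  user_id string,
  amount decimal(36,2)
)
CLUSTERED BY (user_id) INTO 4 BUCKETS
stored as orc;

-- 大表:桶数为中型表桶数的整数倍
CREATE TABLE demo_big_table(
  user_id string,
  order_id string
)
CLUSTERED BY (user_id) INTO 8 BUCKETS
stored as orc;

-- 开启 bucket map join
set hive.optimize.bucketmapjoin = true;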

5.4.4 SMB JOIN

使用条件:
1-两个表必须都是分桶表
2-开启 SMB Join 支持:
set hive.auto.convert.sortmerge.join=true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

3-两个表的分桶的数量是一致的
4-分桶列 必须是 join的 on条件的列,同时必须保证按照分桶列进行排序操作

    --开启强制排序
     set hive.enforce.sorting=true;
     --在建分桶表使用:必须使用sorted by()

5-应用在Bucket Map Join 场景中

    --开启 bucket map join
     set hive.optimize.bucketmapjoin = true; 

6-必须开启HIVE自动尝试使用SMB 方案

    set hive.optimize.bucketmapjoin.sortedmerge = true;
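
结合上述条件,下面是一个示意性的 SMB Join 用法(demo_smb_a、demo_smb_b 两张表及桶数均为假设:两表桶数一致,按 join 列分桶并排序,开启相关参数后直接按分桶列 join 即可):

CREATE TABLE demo_smb_a(
  user_id string,
  amount decimal(36,2)
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 8 BUCKETS
stored as orc;

CREATE TABLE demo_smb_b(
  user_id string,
  order_id string
)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 8 BUCKETS
stored as orc;

set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

SELECT a.user_id, a.amount, b.order_id
FROM demo_smb_a a
JOIN demo_smb_b b ON a.user_id = b.user_id;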

6. DWB层

DWB层作用:维度退化操作(降维)

  • 指的是将各个维度表或者事实表的核心字段全部汇聚到一张表中,形成一个宽表,这样在后续进行统计分析的时候,只需要操作这张合并后的大宽表即可。
  • 对于当前项目,此处的合并宽表过程与主题无关(没有直接关系),更多是基于业务模块,形成业务模块相关的一些宽表。
  • 对于一些其他的项目,可能从一开始就直接按主题进行处理,所以在那些项目中,可能会直接基于主题形成主题相关的宽表。

建库

create database if not exists yp_dwb;

6.1 订单业务宽表

建表

DROP TABLE if EXISTS yp_dwb.dwb_order_detail;
CREATE TABLE yp_dwb.dwb_order_detail(
  order_id string COMMENT '根据一定规则生成的订单编号',
  order_num string COMMENT '订单序号',
  buyer_id string COMMENT '买家的userId',
  store_id string COMMENT '店铺的id',
  order_from string COMMENT '渠道类型:android、ios、miniapp、pcweb、other',
  order_state int COMMENT '订单状态:1.已下单\; 2.已付款, 3. 已确认 \;4.配送\; 5.已完成\; 6.退款\;7.已取消',
  create_date string COMMENT '下单时间',
  finnshed_time timestamp COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价',
  is_settlement tinyint COMMENT '是否结算\;0.待结算订单\; 1.已结算订单\;',
  is_delete tinyint COMMENT '订单评价的状态:0.未删除\;  1.已删除\;(默认0)',
  evaluation_state tinyint COMMENT '订单评价的状态:0.未评价\;  1.已评价\;(默认0)',
  way string COMMENT '取货方式:SELF自提\;SHOP店铺负责配送',
  is_stock_up int COMMENT '是否需要备货 0:不需要    1:需要    2:平台确认备货  3:已完成备货 4平台已经将货物送至店铺 ',
--  订单副表
  order_amount decimal(36,2) COMMENT '订单总金额:购买总金额-优惠金额',
  discount_amount decimal(36,2) COMMENT '优惠金额',
  goods_amount decimal(36,2) COMMENT '用户购买的商品的总金额+运费',
  is_delivery string COMMENT '0.自提;1.配送',
  buyer_notes string COMMENT '买家备注留言',
  pay_time string,
  receive_time string,
  delivery_begin_time string,
  arrive_store_time string,
  arrive_time string COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价',
  create_user string,
  create_time string,
  update_user string,
  update_time string,
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志',
--  订单组
  group_id string COMMENT '订单分组id',
  is_pay tinyint COMMENT '订单组是否已支付,0未支付,1已支付',
--  订单组支付
  group_pay_amount decimal(36,2) COMMENT '订单总金额\;',
--  退款单
  refund_id string COMMENT '退款单号',
  apply_date string COMMENT '用户申请退款的时间',
  refund_reason string COMMENT '买家退款原因',
  refund_amount decimal(36,2) COMMENT '订单退款的金额',
  refund_state tinyint COMMENT '1.申请退款\;2.拒绝退款\; 3.同意退款,配送员配送\; 4:商家同意退款,用户亲自送货 \;5.退款完成',
--  结算单
  settle_id string COMMENT '结算单号',
  settlement_amount decimal(36,2) COMMENT '如果发生退款,则结算的金额 = 订单的总金额 - 退款的金额',
  dispatcher_user_id string COMMENT '配送员id',
  dispatcher_money decimal(36,2) COMMENT '配送员的配送费(配送员的运费(如果退货方式为1:则买家支付配送费))',
  circle_master_user_id string COMMENT '圈主id',
  circle_master_money decimal(36,2) COMMENT '圈主分润的金额',
  plat_fee decimal(36,2) COMMENT '平台应得的分润',
  store_money decimal(36,2) COMMENT '商家应得的订单金额',
  status tinyint COMMENT '0.待结算;1.待审核 \; 2.完成结算;3.拒绝结算',
  settle_time string COMMENT ' 结算时间',
-- 订单评价
  evaluation_id string,
  evaluation_user_id string COMMENT '评论人id',
  geval_scores int COMMENT '综合评分',
  geval_scores_speed int COMMENT '送货速度评分0-5分(配送评分)',
  geval_scores_service int COMMENT '服务评分0-5分',
  geval_isanony tinyint COMMENT '0-匿名评价,1-非匿名',
  evaluation_time string,
-- 订单配送
  delievery_id string COMMENT '主键id',
  dispatcher_order_state tinyint COMMENT '配送订单状态:0.待接单.1.已接单,2.已到店.3.配送中 4.商家普通提货码完成订单.5.商家万能提货码完成订单。6,买家完成订单',
  delivery_fee decimal(36,2) COMMENT '配送员的运费',
  distance int COMMENT '配送距离',
  dispatcher_code string COMMENT '收货码',
  receiver_name string COMMENT '收货人姓名',
  receiver_phone string COMMENT '收货人电话',
  sender_name string COMMENT '发货人姓名',
  sender_phone string COMMENT '发货人电话',
  delievery_create_time string,
-- 商品快照
  order_goods_id string COMMENT '商品快照id',
  goods_id string COMMENT '购买商品的id',
  buy_num int COMMENT '购买商品的数量',
  goods_price decimal(36,2) COMMENT '购买商品的价格',
  total_price decimal(36,2) COMMENT '购买商品的价格 = 商品的数量 * 商品的单价 ',
  goods_name string COMMENT '商品的名称',
  goods_specification string COMMENT '商品规格',
  goods_type string COMMENT '商品分类     ytgj:进口商品    ytsc:普通商品     hots爆品',
  goods_brokerage decimal(36,2) COMMENT '商家设置的商品分润的金额',
  is_goods_refund tinyint COMMENT '0.不退款\; 1.退款'
)
COMMENT '订单明细表'
PARTITIONED BY(dt STRING)
row format delimited fields terminated by '\t'
stored as orc
tblproperties ('orc.compress' = 'SNAPPY');

插入数据

insert into yp_dwb.dwb_order_detail partition (dt)
select
    o.id as order_id,
    o.order_num,
    o.buyer_id,
    o.store_id,
    o.order_from,
    o.order_state,
    o.create_date,
    o.finnshed_time,
    o.is_settlement,
    o.is_delete,
    o.evaluation_state,
    o.way,
    o.is_stock_up,
    od.order_amount,
    od.discount_amount,
    od.goods_amount,
    od.is_delivery,
    od.buyer_notes,
    od.pay_time,
    od.receive_time,
    od.delivery_begin_time,
    od.arrive_store_time,
    od.arrive_time,
    od.create_user,
    od.create_time,
    od.update_user,
    od.update_time,
    od.is_valid,
    og.group_id,
    og.is_pay,
    op.order_pay_amount as group_pay_amount,
    refund.id as refund_id,
    refund.apply_date,
    refund.refund_reason,
    refund.refund_amount,
    refund.refund_state,
    os.id as settle_id,
    os.settlement_amount,
    os.dispatcher_user_id,
    os.dispatcher_money,
    os.circle_master_user_id,
    os.circle_master_money,
    os.plat_fee,
    os.store_money,
    os.status,
    os.settle_time,

    e.id,
    e.user_id,
    e.geval_scores,
    e.geval_scores_speed,
    e.geval_scores_service,
    e.geval_isanony,
    e.create_time,
    d.id,
    d.dispatcher_order_state,
    d.delivery_fee,
    d.distance,
    d.dispatcher_code,
    d.receiver_name,
    d.receiver_phone,
    d.sender_name,
    d.sender_phone,
    d.create_time,

    ogoods.id as order_goods_id,
    ogoods.goods_id,
    ogoods.buy_num,
    ogoods.goods_price,
    ogoods.total_price,
    ogoods.goods_name,
    ogoods.goods_specification,
    ogoods.goods_type,
    ogoods.goods_brokerage,
    ogoods.is_refund as is_goods_refund,
    SUBSTRING(o.create_date,1,10) as dt
--订单表
from yp_dwd.fact_shop_order o
--订单副表
left join yp_dwd.fact_shop_order_address_detail od on o.id = od.id and od.end_date='9999-99-99'
--订单组表
left join yp_dwd.fact_shop_order_group og on o.id = og.order_id and og.end_date='9999-99-99'
--订单组支付表
left join yp_dwd.fact_order_pay op on og.group_id = op.group_id
--退款单表
left join yp_dwd.fact_refund_order refund on refund.order_id=o.id and refund.end_date='9999-99-99'
--结算单表
left join yp_dwd.fact_order_settle os on os.order_id=o.id and os.end_date='9999-99-99'
--订单商品快照
left join yp_dwd.fact_shop_order_goods_details ogoods on ogoods.order_id=o.id and ogoods.end_date='9999-99-99'
--订单评价表
left join yp_dwd.fact_goods_evaluation e on e.order_id=o.id and e.is_valid=1
--订单配送表
left join yp_dwd.fact_order_delievery_item d on d.shop_order_id=o.id and d.dispatcher_order_type=1 and d.is_valid=1 and d.end_date='9999-99-99'
where o.end_date='9999-99-99'
--and og.group_id='dd2019022817332967452f41f'
--and o.id='dd19040334655236d5'
;

6.2 店铺明细宽表

建表

DROP TABLE if EXISTS yp_dwb.dwb_shop_detail;
CREATE TABLE yp_dwb.dwb_shop_detail(
--  店铺
  id string, 
  address_info string COMMENT '店铺详细地址', 
  store_name string COMMENT '店铺名称', 
  is_pay_bond tinyint COMMENT '是否有交过保证金 1:是0:否', 
  trade_area_id string COMMENT '归属商圈ID', 
  delivery_method tinyint COMMENT '配送方式  1 :自提 ;3 :自提加配送均可\; 2 : 商家配送', 
  store_type int COMMENT '店铺类型 22天街网店 23实体店 24直营店铺 33会员专区店', 
  is_primary tinyint COMMENT '是否是总店 1: 是 2: 不是', 
  parent_store_id string COMMENT '父级店铺的id,只有当is_primary类型为2时有效', 
--  商圈
  trade_area_name string COMMENT '商圈名称',
--  区域-店铺
  province_id string COMMENT '店铺所在省份ID', 
  city_id string COMMENT '店铺所在城市ID', 
  area_id string COMMENT '店铺所在县ID', 
  province_name string COMMENT '省份名称', 
  city_name string COMMENT '城市名称', 
  area_name string COMMENT '县名称'
  )
COMMENT '店铺明细表'
PARTITIONED BY(dt STRING)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

插入数据

insert into yp_dwb.dwb_shop_detail partition (dt)
select 
    s.id,
    s.address_info,
    s.name as store_name,
    s.is_pay_bond,
    s.trade_area_id,
    s.delivery_method,
    s.store_type,
    s.is_primary,
    s.parent_store_id,
    ta.name as trade_area_name,
    pro.id province_id,
    city.id city_id,
    area.id area_id,
    pro.name province_name,
    city.name city_name,
    area.name area_name,
    SUBSTRING(s.create_time,1,10) dt
--店铺
from yp_dwd.dim_store s
--商圈
left join yp_dwd.dim_trade_area ta on s.trade_area_id=ta.id and ta.end_date='9999-99-99'
--     地区
left join yp_dwd.dim_location lo on lo.type=2 and lo.correlation_id=s.id and lo.end_date='9999-99-99'
left join yp_dwd.dim_district area on area.code=lo.adcode
left join yp_dwd.dim_district city on area.pid=city.id
left join yp_dwd.dim_district pro on city.pid=pro.id
where s.end_date='9999-99-99'
;

6.3 商品明细宽表

建表

DROP TABLE if EXISTS yp_dwb.dwb_goods_detail;
CREATE TABLE yp_dwb.dwb_goods_detail(
  id string, 
  store_id string COMMENT '所属商店ID', 
  class_id string COMMENT '分类id:只保存最后一层分类id', 
  store_class_id string COMMENT '店铺分类id', 
  brand_id string COMMENT '品牌id', 
  goods_name string COMMENT '商品名称', 
  goods_specification string COMMENT '商品规格', 
  search_name string COMMENT '模糊搜索名称字段:名称_+真实名称', 
  goods_sort int COMMENT '商品排序', 
  goods_market_price decimal(36,2) COMMENT '商品市场价', 
  goods_price decimal(36,2) COMMENT '商品销售价格(原价)', 
  goods_promotion_price decimal(36,2) COMMENT '商品促销价格(售价)', 
  goods_storage int COMMENT '商品库存', 
  goods_limit_num int COMMENT '购买限制数量', 
  goods_unit string COMMENT '计量单位', 
  goods_state tinyint COMMENT '商品状态 1正常,2下架,3违规(禁售)', 
  goods_verify tinyint COMMENT '商品审核状态: 1通过,2未通过,3审核中', 
  activity_type tinyint COMMENT '活动类型:0无活动1促销2秒杀3折扣', 
  discount int COMMENT '商品折扣(%)', 
  seckill_begin_time string COMMENT '秒杀开始时间', 
  seckill_end_time string COMMENT '秒杀结束时间', 
  seckill_total_pay_num int COMMENT '已秒杀数量', 
  seckill_total_num int COMMENT '秒杀总数限制', 
  seckill_price decimal(36,2) COMMENT '秒杀价格', 
  top_it tinyint COMMENT '商品置顶:1-是,0-否', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启', 
--  商品小类
  min_class_id string COMMENT '分类id:只保存最后一层分类id', 
  min_class_name string COMMENT '店铺内分类名字', 
--  商品中类
  mid_class_id string COMMENT '分类id:只保存最后一层分类id', 
  mid_class_name string COMMENT '店铺内分类名字', 
--  商品大类
  max_class_id string COMMENT '分类id:只保存最后一层分类id', 
  max_class_name string COMMENT '店铺内分类名字', 
--  品牌
  brand_name string COMMENT '品牌名称'
  )
COMMENT '商品明细表'
PARTITIONED BY(dt STRING)
row format delimited fields terminated by '\t' 
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

插入数据

INSERT into yp_dwb.dwb_goods_detail partition (dt)
SELECT
    goods.id,
    goods.store_id,
    goods.class_id,
    goods.store_class_id,
    goods.brand_id,
    goods.goods_name,
    goods.goods_specification,
    goods.search_name,
    goods.goods_sort,
    goods.goods_market_price,
    goods.goods_price,
    goods.goods_promotion_price,
    goods.goods_storage,
    goods.goods_limit_num,
    goods.goods_unit,
    goods.goods_state,
    goods.goods_verify,
    goods.activity_type,
    goods.discount,
    goods.seckill_begin_time,
    goods.seckill_end_time,
    goods.seckill_total_pay_num,
    goods.seckill_total_num,
    goods.seckill_price,
    goods.top_it,
    goods.create_user,
    goods.create_time,
    goods.update_user,
    goods.update_time,
    goods.is_valid,
    CASE class1.level WHEN 3
        THEN class1.id
        ELSE NULL
        END as min_class_id,
    CASE class1.level WHEN 3
        THEN class1.name
        ELSE NULL
        END as min_class_name,
    CASE WHEN class1.level=2
        THEN class1.id
        WHEN class2.level=2
        THEN class2.id
        ELSE NULL
        END as mid_class_id,
    CASE WHEN class1.level=2
        THEN class1.name
        WHEN class2.level=2
        THEN class2.name
        ELSE NULL
        END as mid_class_name,
    CASE WHEN class1.level=1
        THEN class1.id
        WHEN class2.level=1
        THEN class2.id
        WHEN class3.level=1
        THEN class3.id
        ELSE NULL
        END as max_class_id,
    CASE WHEN class1.level=1
        THEN class1.name
        WHEN class2.level=1
        THEN class2.name
        WHEN class3.level=1
        THEN class3.name
        ELSE NULL
        END as max_class_name,
    brand.brand_name,
    SUBSTRING(goods.create_time,1,10) dt
--SKU
FROM yp_dwd.dim_goods goods
--商品分类
left join yp_dwd.dim_goods_class class1 on goods.store_class_id = class1.id AND class1.end_date='9999-99-99'
left join yp_dwd.dim_goods_class class2 on class1.parent_id = class2.id AND class2.end_date='9999-99-99'
left join yp_dwd.dim_goods_class class3 on class2.parent_id = class3.id AND class3.end_date='9999-99-99'
--品牌
left join yp_dwd.dim_brand brand on goods.brand_id=brand.id AND brand.end_date='9999-99-99'
WHERE goods.end_date='9999-99-99'
;

7. Hive 索引

提升查询效率

7.1 HIVE的原始索引(废弃)

hive的原始索引可以针对某一列或某几列构建索引信息,构建后可以提升对这些列进行查询时的效率

存在弊端:
hive原始索引不会自动更新,每次表中数据发生变化后,都需要手动重建索引,比较耗费时间和资源,整体性能提升一般

所以在HIVE3.x版本后,已经直接将这种索引废弃掉了,无法使用,而且官方描述在hive1.x 和 hive2.x版本中,也不建议优先使用原始索引

7.2 HIVE的row group index索引

row group index: 行组索引

条件:

  • 要求表的存储类型为ORC存储格式;

  • 在创建表的时候,必须开启row group index索引支持:'orc.create.index'='true';

  • 在插入数据的时候,必须保证需要作为索引的列按序插入数据;

适用于:数值类型的字段,并且在查询中对其进行 >、<、= 等比较过滤操作

思路:
插入数据到ORC表后,数据会自动划分为多个stripe片段,每个片段内部会保存每个字段的最小值、最大值。这样,当执行 >、<、= 的条件筛选操作时,可以根据最小最大值锁定相关的stripe片段,从而减少数据扫描量,提升效率

操作

CREATE TABLE lxw1234_orc2 
(字段列表) 
stored AS ORC 
TBLPROPERTIES(
    'orc.compress'='SNAPPY',
    --开启行组索引
    'orc.create.index'='true'
)

插入数据的时候,需要保证数据有序的

insert overwrite table lxw1234_orc2 
SELECT CAST(siteid As INT)As id, pcid FROM lxw1234_text
-- 插入的数据保持排序(可以使用全局排序,也可以使用局部排序,只需要保证一定有序即可)
DISTRIBUTE BY id sort By id;

使用:

set hive.optimize.index.filter=true;
SELECT COUNT(1) FROM lxw1234_orc2 WHERE id >= 1382 AND id <= 1399;

7.3 HIVE的bloom filter index索引

bloom filter index(布隆过滤索引)

条件:

  • 要求表的存储类型为 ORC存储方案
  • 在建表的时候,必须设置为哪些列构建布隆索引
  • 仅适合于等值过滤查询操作

思路:
在开启布隆过滤索引后,可以针对某一列或者某几列来建立索引。构建索引后,会将这一列数据的值存储在对应stripe片段的索引信息中。这样当进行等值查询的时候,首先会到每一个stripe片段的索引中判断是否有这个值,如果没有,直接跳过该stripe,从而减少数据扫描量,提升效率

操作:

CREATE TABLE lxw1234_orc2 (字段列表...)
stored As ORC
TBLPROPERTIES(
    'orc.compress'='SNAPPY',
--开启 行组索引(可选的,支持全部都打开,也可以仅开启一个)
    'orc.create.index'='true',
--pcid字段开启BloomFilter索引
    'orc.bloom.filter.columns'='pcid,字段2,字段3..'
)

插入数据:没有要求,当然如果开启行组索引,可以将需要使用行组索引的字段,进行有序插入即可

使用:

SET hive.optimize.index.filter=true;
SELECT COUNT(1) 
FROM lxw1234_orc2 
WHERE id >= 0 AND id <= 1000 AND pcid IN ('0005E26F0DCCDB56F9841C','A');

在什么时候可以使用呢?

  • 对于行组索引:我们建议只要数据存储格式为ORC,就将这种索引全部打开;至于导入数据的时候,如果能保证有序那最好,如果保证不了也无所谓,大不了这个索引的效率不是特别好
  • 对于布隆过滤索引:建议将后续会大量的用于等值连接的操作字段,建立成布隆索引

8. 如何解决数据倾斜问题

在hive中,执行一条SQL语句,最终会被翻译为MR。MR中mapTask和reduceTask都可能存在多个,数据倾斜主要指的是:reduce阶段有多个reduce时,每个reduce拿到的数据量并不均衡,某一个或者某几个reduce拿到了比其他reduce更多的数据,处理数据的压力都集中在这几个reduce上,形成数据倾斜问题,导致执行时间变长,影响执行效率

那么倾斜主要发生在执行SQL的什么阶段呢?执行JOIN操作以及执行group by的时候

8.1 Join数据倾斜

在前面讲解reduce端Join的时候,描述过reduce端Join的问题,其中就包含reduce端Join存在数据倾斜的问题

** 解决方案一**

可以通过 Map Join、Bucket Map Join 以及 SMB Join 解决
注意:
通过 Map Join、Bucket Map Join、SMB Join 来解决数据倾斜是存在使用条件的,如果无法满足这些条件,就无法使用这种处理方案

** 解决方案二**

思路:
将那些产生倾斜的key和对应v2的数据,从当前这个MR中移出去,单独找一个MR来处理即可,处理后,和之前的MR进行汇总结果即可

关键问题:
如何找到那些存在倾斜的key呢?

运行期处理方案:

思路:在执行MR的时候,会动态统计每一个k2的值出现重复的次数,当这个重复的次数达到一定的阈值后,认为当前这个k2的数据存在数据倾斜,自动将其剔除,交由给一个单独的MR来处理即可,两个MR处理完成后,将结果基于union all 合并在一起即可

实操:

  •   set hive.optimize.skewjoin=true;   开启运行期处理倾斜参数 
    
  •     set hive.skewjoin.key=100000;   阈值,此参数在实际生产环境中,需要调整在一个合理的值
    

判断依据:查看join字段中各个值重复出现的数量,然后选择一个合理的阈值
比如:查看各个id值的重复条数(如id为1的约有100w条、id为3的约有500w条),据此选择一个合理的阈值即可
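
下面给出一个示意性的开启方式(SQL 仅作演示,两张宽表来自后文 DWB 层;当 join 键中某些值的重复次数超过阈值时,这部分数据会被交由单独的任务处理):

set hive.optimize.skewjoin=true;
set hive.skewjoin.key=100000;

SELECT o.order_id, s.store_name
FROM yp_dwb.dwb_order_detail o
JOIN yp_dwb.dwb_shop_detail s ON o.store_id = s.id;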

编译期处理方案:
** 思路**: 在创建这个表的时候,我们就可以预知后续插入到这个表中的数据里哪些key的值会产生倾斜,在建表的时候将其提前配置好即可;在后续运行的时候,程序会自动将设置的key的数据单独交给一个MR来处理,处理完成后,再和原有结果进行union all合并操作

** 实操:**
set hive.optimize.skewjoin.compiletime=true; --开启编译期处理倾斜参数

CREATE TABLE list_bucket_single (key STRING, value STRING)
--倾斜的字段和需要拆分的key值
SKEWED BY (key) ON (1, 5, 6)
--为倾斜值创建子目录单独存放
[STORED AS DIRECTORIES];

适用于: 提前知道哪些key存在倾斜

在实际生产环境中,应该使用哪种方式呢?

    两种方式都会使用。一般来说,会将两个都开启:编译期能明确的,在编译期将其设置好;编译期不清楚的,通过运行期动态捕获即可

8.2 group by 数据倾斜

方案一:规约(提前聚合)操作

如何配置呢?
只需要在HIVE中开启combiner提前聚合配置参数即可:

** set hive.map.aggr=true;**

方案二:负载均衡的解决方案(需要运行两个MR来处理)

mapTask执行完成后,在向reduce分发数据时,默认情况下会将相同k2的数据发往同一个reduce;该方案先采用随机分发,保证每一个reduce拿到数量相当的数据并先做局部聚合,再由第二个MR按照k2进行最终汇总
(负载均衡过程:让每一个reduce接收到数量相当的数据)
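
两种方案对应的参数及一个示意性的统计语句如下(SQL 仅作演示,统计各店铺订单量,表为后文 DWB 层的订单宽表):

-- 方案一:map 端提前聚合(combiner)
set hive.map.aggr=true;
-- 方案二:负载均衡(两个MR)
set hive.groupby.skewindata=true;

SELECT store_id, count(order_id) AS order_cnt
FROM yp_dwb.dwb_order_detail
GROUP BY store_id;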

说明:

  • 方案二比方案一更能彻底解决数据倾斜问题,因为其处理数据的范围更大,是针对整个数据集来处理的;而方案一只是在每个MapTask内部处理,仅仅是局部处理
  • 这两种方式: 建议在生产中,优先使用第一种,如果第一种无法解决,尝试使用第二种解决

8.3 如何发现数据倾斜

倾斜发生后会出现的问题:程序迟迟无法结束;或者翻译出的MR中reduceTask有多个,大部分的reduceTask都执行完成了,只有其中一个或者几个没有执行完成,此时认为发生了数据倾斜
关键点:如何查看每一个reduceTask的执行时间

方式一: 通过Yarn查看(运行过程中)或者jobhistory查看

目前,我们这里可能只有一个reduce,但是实际生产环境中,此位置可能会有多个reduceTask。我们需要观察每个reduceTask的执行时间,如果发现其中一个或者几个reduce的执行时间远远大于其他reduceTask的执行时间,那么说明存在数据倾斜的问题

9. DWS

DWS层: 业务层
基于主题统计分析,此层一般适用于进行细化粒度的聚合统计操作,主要为了服务后续上卷统计过程(提前聚合操作)

比如说:
以年、月、日来统计为例:在DWS层,仅需要按照日统计相关的指标即可(提前聚合操作);后续在DM层进行上卷操作,将月和年基于日统计宽表计算得出
注意:
如果要进行提前聚合操作,不能对数据进行去重统计

本次DWS层,共计有三个主题需要进行统计:销售主题、商品主题和用户主题
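
作为上卷思路的一个补充示意:后续 DM 层可以直接基于 9.1 将要创建的销售日统计宽表,上卷出月粒度的结果(示意 SQL,仅演示按城市维度上卷的思路,DM 层的建表与完整口径不在本节展开):

-- 基于 dws_sale_daycount 的日数据,上卷统计每个城市每月的销售收入与订单量
SELECT
    substr(dt, 1, 7) AS month,
    city_id,
    city_name,
    sum(sale_amt)    AS sale_amt,
    sum(order_cnt)   AS order_cnt
FROM yp_dws.dws_sale_daycount
WHERE group_type = 'city'
GROUP BY substr(dt, 1, 7), city_id, city_name;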

9.1 销售主题的日统计宽表

可分析的主要指标有:

    销售收入、平台收入、配送成交额、小程序成交额、安卓APP成交额、苹果APP成交额、PC商城成交额、订单量、参评单量、差评单量、配送单量、退款单量、小程序订单量、安卓APP订单量、苹果APP订单量、PC商城订单量。

维度有:

    日期、城市、商圈、店铺、品牌、大类、中类、小类

维度组合:

  • 日期:日
  • 日期 + 城市
  • 日期 + 城市 + 商圈
  • 日期 + 城市 + 商圏 + 店铺
  • 日期+品牌
  • 日期 + 大类
  • 日期 + 大类 + 中类
  • 日期 + 大类 + 中类 + 小类

16*8=128 个需求指标结果
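
本项目采用多条 INSERT 分别计算各个维度组合。作为补充说明,Hive 也支持用 GROUPING SETS 在一条语句中一次性计算多个维度组合;下面只是一个极简的示意(未做订单去重,也不是本项目实际采用的写法),仅帮助理解维度组合的含义:

-- 示意:一次性得到 日期、日期+城市 两种组合的销售收入
SELECT o.dt, s.city_id, sum(o.order_amount) AS sale_amt
FROM yp_dwb.dwb_order_detail o
LEFT JOIN yp_dwb.dwb_shop_detail s ON o.store_id = s.id
GROUP BY o.dt, s.city_id
GROUPING SETS ((o.dt), (o.dt, s.city_id));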

维度字段:
日期:dwb_order_detail.dt
城市:dwb_shop_detail.city_id 和 city_name
商圈:dwb_shop_detail.trade_area_id 和 trade_area_name
店铺:dwb_shop_detail.id 和 store_name
品牌:dwb_goods_detail.brand_id 和 brand_name
大类:dwb_goods_detail.max_class_id 和 max_class_name
中类:dwb_goods_detail.mid_class_id 和 mid_class_name
小类:dwb_goods_detail.min_class_id 和 min_class_name

建库建表

create database yp_dws;

-- 销售主题日统计宽表
DROP TABLE IF EXISTS yp_dws.dws_sale_daycount;
CREATE TABLE yp_dws.dws_sale_daycount(
    province_id string COMMENT '省份id',
    province_name string COMMENT '省份name',
   city_id string COMMENT '城市id',
   city_name string COMMENT '城市name',
   trade_area_id string COMMENT '商圈id',
   trade_area_name string COMMENT '商圈名称',
   store_id string COMMENT '店铺的id',
   store_name string COMMENT '店铺名称',
   brand_id string COMMENT '品牌id',
   brand_name string COMMENT '品牌名称',
   max_class_id string COMMENT '商品大类id',
   max_class_name string COMMENT '大类名称',
   mid_class_id string COMMENT '中类id', 
   mid_class_name string COMMENT '中类名称',
   min_class_id string COMMENT '小类id', 
   min_class_name string COMMENT '小类名称',
   group_type string COMMENT '分组类型:store,trade_area,city,brand,min_class,mid_class,max_class,all',
   --   =======日统计=======
   --   销售收入
   sale_amt DECIMAL(38,2) COMMENT '销售收入',
   --   平台收入
   plat_amt DECIMAL(38,2) COMMENT '平台收入',
   -- 配送成交额
   deliver_sale_amt DECIMAL(38,2) COMMENT '配送成交额',
   -- 小程序成交额
   mini_app_sale_amt DECIMAL(38,2) COMMENT '小程序成交额',
   -- 安卓APP成交额
   android_sale_amt DECIMAL(38,2) COMMENT '安卓APP成交额',
   --  苹果APP成交额
   ios_sale_amt DECIMAL(38,2) COMMENT '苹果APP成交额',
   -- PC商城成交额
   pcweb_sale_amt DECIMAL(38,2) COMMENT 'PC商城成交额',
   -- 成交单量
   order_cnt BIGINT COMMENT '成交单量',
   -- 参评单量
   eva_order_cnt BIGINT COMMENT '参评单量comment=>cmt',
   -- 差评单量
   bad_eva_order_cnt BIGINT COMMENT '差评单量negtive-comment=>ncmt',
   -- 配送成交单量
   deliver_order_cnt BIGINT COMMENT '配送单量',
   -- 退款单量
   refund_order_cnt BIGINT COMMENT '退款单量',
   -- 小程序成交单量
   miniapp_order_cnt BIGINT COMMENT '小程序成交单量',
   -- 安卓APP订单量
   android_order_cnt BIGINT COMMENT '安卓APP订单量',
   -- 苹果APP订单量
   ios_order_cnt BIGINT COMMENT '苹果APP订单量',
   -- PC商城成交单量
   pcweb_order_cnt BIGINT COMMENT 'PC商城成交单量'
)
COMMENT '销售主题日统计宽表' 
PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');
  • 根据 日期 +城市,统计相关的指标


分析数据灌入dws层

--开启动态分区支持:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions.pernode=10000;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.created.files=150000;
--hive压缩
set hive.exec.compress.intermediate=true;
set hive.exec.compress.output=true;
--写入时压缩生效
set hive.exec.orc.compression.strategy=COMPRESSION;
--map join 优化操作
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=20971520;
--group by 数据倾斜的优化
-- set hive.map.aggr=true; --方案1
set hive.groupby.skewindata=true; --方案2

--日期+城市

--第一步:先去重
WITH A as ( 
    SELECT 
        --维度字段
        o.dt,
        s.province_id,
        s.province_name,
        s.city_id,
        s.city_name,
        
        --指标字段
        o.order_id, --订单id
        o.order_amount, --订单额
        o.plat_fee, --平台分润
        o.delivery_fee, -- 配送费
        
        --判断字段
        o.order_from, --渠道类型
        o.evaluation_id, --评价id(是否参加评价)
        o.geval_scores, --综合评分
        o.delievery_id, --配送id(是否配送)
        o.refund_id, --退款id(判断是否退款)
        
        --去重操作
        row_number() OVER(PARTITION BY o.order_id) as rn
    
    from yp_dwb.dwb_order_detail o
        LEFT JOIN yp_dwb.dwb_shop_detail s on o.store_id = s.id
        LEFT JOIN yp_dwb.dwb_goods_detail g on o.goods_id = g.id
    --确保订单为支付状态
    WHERE o.is_pay = 1 and o.order_state NOT IN (1,7)
)
INSERT OVERWRITE TABLE yp_dws.dws_sale_daycount partition(dt)
SELECT 
    province_id,
    province_name,
    city_id,
    city_name,
    '-1' as trade_area_id,
    '-1' as trade_area_name,
    '-1' as store_id,
    '-1' as store_name,
    '-1' as brand_id,
    '-1' as brand_name,
    '-1' as max_class_id,
    '-1' as max_class_name,
    '-1' as mid_class_id,
    '-1' as mid_class_name,
    '-1' as min_class_id,
    '-1' as min_class_name,
    'city' as group_type,
    
    -----------------指标统计--------------------
    --额度
    --销售收入
    --coalesce 返回第一个不为空的,都为空则返回0
    sum(coalesce(order_amount,0)) as sale_amt,
    --平台收入
    sum(coalesce(plat_fee,0)) as plat_amt,
    --配送成交额
    sum(coalesce(delivery_fee,0)) as delivery_sale_amt,
    --小程序成交额
    sum(
        if(order_from = 'miniapp',
            coalesce(order_amount,0),
            0
        )
    ) as mini_app_sale_amt,
    --安卓APP成交额
    sum(
        if(order_from = 'android',
            coalesce(order_amount,0),
            0
        )
    ) as android_sale_amt,
    --苹果APP成交额
    sum(
        if(order_from = 'ios',
            coalesce(order_amount,0),
            0
        )
    ) as ios_sale_amt,
    --pc商城成交额
    sum(
        if(order_from = 'pcweb',
            coalesce(order_amount,0),
            0
        )
    ) as pcweb_sale_amt,
    
    --数量
    --成交单量
    count(order_id) as order_cnt,
    --参评单量(订单被评价了,就说明这个订单属于参评订单了,需要进行统计)
    count(
        if(
            evaluation_id is not null,
            order_id,
            null
        )
    ) as eva_dorder_cnt,
    --差评单量
    count(
        if(
            evaluation_id is not null and geval_scores <= 6,
            order_id,
            null
        )
    ) as bad_eva_order_cnt,
    --配送单量
    count(
        if(
            delievery_id is NOT NULL,
            order_id,
            NULL
        )
    ) as deliver_order_cnt,
    --退款单量
    count(
        if(
            refund_id is NOT NULL,
            order_id,
            NULL
        )
    ) as refund_order_cnt,
    
    --小程序订单量
    count(
        if(
            order_from = 'miniapp',
            order_id,
            null
        )
    ) as miniapp_order_cnt,
    --安卓订单量
    count(
        if(
            order_from = 'android',
            order_id,
            null
        )
    ) as android_order_cnt, 
    --苹果订单量
    count(
        if(
            order_from = 'ios',
            order_id,
            null
        )
    ) as ios_order_cnt , 
    --pc商城订单量
    count(
        if(
            order_from = 'pcweb',
            order_id,
            null
        )
    ) as pcweb_order_cnt,
    
    dt --分区字段
from A where A.rn =1 
GROUP BY dt,province_id,province_name,city_id,city_name;
  • 日期+城市+商圈(与上一个类似)

一个子订单一定属于一家店铺;一个店铺一定位于某一个城市,并且一定归属于某一个商圈;一个商圈下可以有多个店铺,但是一个店铺只能属于一个商圈;一个店铺可以有多笔订单,但是一个子订单只能属于一个店铺

店铺和子订单关系: 多对一关系
店铺和城市:一对多关系
店铺和商圈:多对一关系

子订单 和 商圈关系:1个订单只能对应一个商圈 1-1

子订单 和 城市关系:1个订单只能对应一个城市 1-1

子订单和城市和商圈: 1-1-1

--第一步:先去重
WITH A as ( 
    SELECT 
        --维度字段
        o.dt,
        s.province_id,
        s.province_name,
        s.city_id,
        s.city_name,
        s.trade_area_id, --商圈id
        s.trade_area_name, --商圈name
        
        --指标字段
        o.order_id, --订单id
        o.order_amount, --订单额
        o.plat_fee, --平台分润
        o.delivery_fee, -- 配送费
        
        --判断字段
        o.order_from, --渠道类型
        o.evaluation_id, --评价id(是否参加评价)
        o.geval_scores, --综合评分
        o.delievery_id, --配送id(是否配送)
        o.refund_id, --退款id(判断是否退款)
        
        --去重操作
        row_number() OVER(PARTITION BY o.order_id) as rn
    
    from yp_dwb.dwb_order_detail o
        LEFT JOIN yp_dwb.dwb_shop_detail s on o.store_id = s.id
        LEFT JOIN yp_dwb.dwb_goods_detail g on o.goods_id = g.id
    --确保订单为支付状态
    WHERE o.is_pay = 1 and o.order_state NOT IN (1,7)
)
INSERT INTO TABLE yp_dws.dws_sale_daycount partition(dt)
SELECT 
    province_id,
    province_name,
    city_id,
    city_name,
    trade_area_id,
    trade_area_name,
    '-1' as store_id,
    '-1' as store_name,
    '-1' as brand_id,
    '-1' as brand_name,
    '-1' as max_class_id,
    '-1' as max_class_name,
    '-1' as mid_class_id,
    '-1' as mid_class_name,
    '-1' as min_class_id,
    '-1' as min_class_name,
    'trade_area' as group_type,
    
    -----------------指标统计--------------------
    --额度
    --销售收入
    --coalesce 返回第一个不为空的,都为空则返回0
    sum(coalesce(order_amount,0)) as sale_amt,
    --平台收入
    sum(coalesce(plat_fee,0)) as plat_amt,
    --配送成交额
    sum(coalesce(delivery_fee,0)) as delivery_sale_amt,
    --小程序成交额
    sum(
        if(order_from = 'miniapp',
            coalesce(order_amount,0),
            0
        )
    ) as mini_app_sale_amt,
    --安卓APP成交额
    sum(
        if(order_from = 'android',
            coalesce(order_amount,0),
            0
        )
    ) as android_sale_amt,
    --苹果APP成交额
    sum(
        if(order_from = 'ios',
            coalesce(order_amount,0),
            0
        )
    ) as ios_sale_amt,
    --pc商城成交额
    sum(
        if(order_from = 'pcweb',
            coalesce(order_amount,0),
            0
        )
    ) as pcweb_sale_amt,
    
    --数量
    --成交单量
    count(order_id) as order_cnt,
    --参评单量(订单被评价了,就说明这个订单属于参评订单了,需要进行统计)
    count(
        if(
            evaluation_id is not null,
            order_id,
            null
        )
    ) as eva_dorder_cnt,
    --差评单量
    count(
        if(
            evaluation_id is not null and geval_scores <= 6,
            order_id,
            null
        )
    ) as bad_eva_order_cnt,
    --配送单量
    count(
        if(
            delievery_id is NOT NULL,
            order_id,
            NULL
        )
    ) as deliver_order_cnt,
    --退款单量
    count(
        if(
            refund_id is NOT NULL,
            order_id,
            NULL
        )
    ) as refund_order_cnt,
    
    --小程序订单量
    count(
        if(
            order_from = 'miniapp',
            order_id,
            null
        )
    ) as miniapp_order_cnt,
    --安卓订单量
    count(
        if(
            order_from = 'android',
            order_id,
            null
        )
    ) as android_order_cnt, 
    --苹果订单量
    count(
        if(
            order_from = 'ios',
            order_id,
            null
        )
    ) as ios_order_cnt , 
    --pc商城订单量
    count(
        if(
            order_from = 'pcweb',
            order_id,
            null
        )
    ) as pcweb_order_cnt,
    
    dt --分区字段
from A where A.rn =1 
GROUP BY dt,province_id,province_name,city_id,city_name,trade_area_id,trade_area_name;

SELECT * FROM yp_dws.dws_sale_daycount LIMIT 10;
  • 日期 +品牌:(需求思考,处理逻辑是不一样的,因为 一个订单中可以有多个品牌,一个品牌可以有多个订单)

订单和品牌之间的关系: 一个订单下,可以多个品牌的数据,一个品牌也可以有多个订单数据, 之间的关系为 多对多

--该查询对不同表达式使用了count(DISTINCT ...),与hive.groupby.skewindata=true不能同时使用,因此这里先关闭该参数
set hive.groupby.skewindata=false;
--第一步:先去重
WITH A as ( 
    SELECT 
        --维度字段
        o.dt,
        s.province_id,
        s.province_name,
        s.city_id,
        s.city_name,
        s.trade_area_id, --商圈id
        s.trade_area_name, --商圈name
        g.brand_id,
        g.brand_name, --品牌
        
        --指标字段
        o.order_id, --订单id
        o.order_amount, --订单额
        o.plat_fee, --平台分润
        o.delivery_fee, -- 配送费
        o.total_price, --商品价格
        
        --判断字段
        o.order_from, --渠道类型
        o.evaluation_id, --评价id(是否参加评价)
        o.geval_scores, --综合评分
        o.delievery_id, --配送id(是否配送)
        o.refund_id, --退款id(判断是否退款)
        
        --去重操作
        row_number() OVER(PARTITION BY o.order_id) as rn ,--计算 日期,日期+城市,日期+城市+商圈,日期+城市+商圈+店铺
        row_number() over(partition by o.order_id, o.goods_id,g.brand_id)as rn1 --日期+品牌
    
    from yp_dwb.dwb_order_detail o
        LEFT JOIN yp_dwb.dwb_shop_detail s on o.store_id = s.id
        LEFT JOIN yp_dwb.dwb_goods_detail g on o.goods_id = g.id
    --确保订单为支付状态
    WHERE o.is_pay = 1 and o.order_state NOT IN (1,7)
)
insert into TABLE yp_dws.dws_sale_daycount partition(dt) 
SELECT 
   '-1' as province_id,
   '-1' as province_name,
   '-1' as city_id,
   '-1' as city_name,
   '-1' as trade_area_id,
   '-1' as trade_area_name,
    '-1' as store_id,
    '-1' as store_name,
    brand_id,
    brand_name,
    '-1' as max_class_id,
    '-1' as max_class_name,
    '-1' as mid_class_id,
    '-1' as mid_class_name,
    '-1' as min_class_id,
    '-1' as min_class_name,
    'brand' as group_type,
    
-----------------指标统计--------------------
    --销售收入
    --coalesce 返回第一个不为空的,都为空则返回0
    sum(coalesce(total_price,0)) as sale_amt,
    --平台收入 无法计算,因为都是基于订单,而不是基于品牌
    null as plat_amt,
    --配送成交额
    null as delivery_sale_amt,
    --小程序成交额
    sum(
        if(order_from = 'miniapp',
            coalesce(total_price,0),
            0
        )
    ) as mini_app_sale_amt,
    --安卓APP成交额
    sum(
        if(order_from = 'android',
            coalesce(total_price,0),
            0
        )
    ) as android_sale_amt,
    --苹果APP成交额
    sum(
        if(order_from = 'ios',
            coalesce(total_price,0),
            0
        )
    ) as ios_sale_amt,
    --pc商城成交额
    sum(
        if(order_from = 'pcweb',
            coalesce(total_price,0),
            0
        )
    ) as pcweb_sale_amt,
    
    --数量
    --成交单量
    count(DISTINCT order_id) as order_cnt,
    --参评单量(订单被评价了,就说明这个订单属于参评订单了,需要进行统计)
    count(DISTINCT
        if(
            evaluation_id is not null,
            order_id,
            null
        )
    ) as eva_dorder_cnt,
    --差评单量
    count(DISTINCT
        if(
            evaluation_id is not null and geval_scores <= 6,
            order_id,
            null
        )
    ) as bad_eva_order_cnt,
    --配送单量
    count(DISTINCT
        if(
            delievery_id is NOT NULL,
            order_id,
            NULL
        )
    ) as deliver_order_cnt,
    --退款单量
    count(DISTINCT
        if(
            refund_id is NOT NULL,
            order_id,
            NULL
        )
    ) as refund_order_cnt,
    
    --小程序订单量
    count(DISTINCT
        if(
            order_from = 'miniapp',
            order_id,
            null
        )
    ) as miniapp_order_cnt,
    --安卓订单量
    count(DISTINCT
        if(
            order_from = 'android',
            order_id,
            null
        )
    ) as android_order_cnt, 
    --苹果订单量
    count(DISTINCT
        if(
            order_from = 'ios',
            order_id,
            null
        )
    ) as ios_order_cnt , 
    --pc商城订单量
    count(DISTINCT
        if(
            order_from = 'pcweb',
            order_id,
            null
        )
    ) as pcweb_order_cnt,
    
    dt --分区字段
    
from A WHERE rn1 = 1
GROUP BY dt,brand_id,brand_name;

9.2 hive的优化

9.3 商品主题的日统计宽表

--商品主题日统计宽表
drop table if exists yp_dws.dws_sku_daycount;
create table yp_dws.dws_sku_daycount 
(
   sku_id string comment 'sku_id',
   sku_name string comment '商品名称',
    order_count bigint comment '被下单次数',
    order_num bigint comment '被下单件数',
    order_amount decimal(38,2) comment '被下单金额',
    payment_count bigint  comment '被支付次数',
    payment_num bigint comment '被支付件数',
    payment_amount decimal(38,2) comment '被支付金额',
    refund_count bigint  comment '被退款次数',
    refund_num bigint comment '被退款件数',
    refund_amount  decimal(38,2) comment '被退款金额',
    cart_count bigint comment '被加入购物车次数',
    cart_num bigint comment '被加入购物车件数',
    favor_count bigint comment '被收藏次数',
    evaluation_good_count bigint comment '好评数',
    evaluation_mid_count bigint comment '中评数',
    evaluation_bad_count bigint comment '差评数'
) COMMENT '每日商品行为'
PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');
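
课程中该表的灌入脚本此处未展开。下面给出其中几个指标的示意性取数(仅演示思路:加购次数/件数来自 DWD 层购物车表,被收藏次数来自商品收藏表,其余指标同理可从对应的事实表统计得到):

-- 示意:按商品分别统计加购、收藏指标,再按 goods_id 合并
SELECT
    coalesce(c.goods_id, f.goods_id) AS sku_id,
    coalesce(c.cart_count, 0)        AS cart_count,
    coalesce(c.cart_num, 0)          AS cart_num,
    coalesce(f.favor_count, 0)       AS favor_count
FROM (
    SELECT goods_id, count(id) AS cart_count, sum(buy_num) AS cart_num
    FROM yp_dwd.fact_shop_cart
    WHERE end_date = '9999-99-99' AND is_valid = 1
    GROUP BY goods_id
) c
FULL OUTER JOIN (
    SELECT goods_id, count(id) AS favor_count
    FROM yp_dwd.fact_goods_collect
    WHERE end_date = '9999-99-99' AND is_valid = 1
    GROUP BY goods_id
) f ON c.goods_id = f.goods_id;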

9.4 用户主题的日统计宽表

--用户主题日统计宽表
drop table if exists yp_dws.dws_user_daycount;
create table yp_dws.dws_user_daycount
(
   user_id string comment '用户 id',
    login_count bigint comment '登录次数',
    store_collect_count bigint comment '店铺收藏数量',
    goods_collect_count bigint comment '商品收藏数量',
    cart_count bigint comment '加入购物车次数',
    cart_amount decimal(38,2) comment '加入购物车金额',
    order_count bigint comment '下单次数',
    order_amount    decimal(38,2)  comment '下单金额',
    payment_count   bigint      comment '支付次数',
    payment_amount  decimal(38,2) comment '支付金额'
) COMMENT '每日用户行为'
PARTITIONED BY(dt STRING)
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');
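
同样,这里给出其中两个指标的示意性取数(仅演示思路:登录次数来自 DWD 层的用户登录记录表,店铺收藏数量来自店铺收藏表,其余指标同理):

-- 每个用户每天的登录次数
SELECT login_user AS user_id,
       substr(login_time, 1, 10) AS dt,
       count(id) AS login_count
FROM yp_dwd.fact_user_login
GROUP BY login_user, substr(login_time, 1, 10);

-- 每个用户每天新增的店铺收藏数量
SELECT user_id,
       substr(create_time, 1, 10) AS dt,
       count(id) AS store_collect_count
FROM yp_dwd.fact_store_collect
WHERE end_date = '9999-99-99' AND is_valid = 1
GROUP BY user_id, substr(create_time, 1, 10);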

10. Presto

Presto是Facebook开源的分布式SQL查询引擎,Presto可以独立提供计算分析操作,不需要依赖于其他的计算引擎,

而HIVE仅仅是一个工具,最终计算是依赖于MR或者其他的执行引擎

Presto可以对接多种数据源,可以从不同的数据源中读取数据进行分析处理,一条presto查询可以将多个数据源进行合并,可以跨越多个不同的组织进行分析
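
例如,完成下文的安装和 hive connector 配置后,就可以在 Presto 中用 catalog.schema.table 的三段式名称直接查询 Hive 中的表(示意查询,表为前文 DWS 层的销售日统计宽表):

-- 在 presto-cli 中执行:通过 hive catalog 访问 yp_dws 库下的表
SELECT dt, city_name, sale_amt
FROM hive.yp_dws.dws_sale_daycount
WHERE group_type = 'city'
LIMIT 10;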

10.1 安装

安装Java

yum install java-1.8.0-openjdk* -y

解压

mkdir -p /export/server
tar -xzvf presto-server-0.245.1.tar.gz -C /export/server

cd /export/server
创建软连接
ln -s presto-server-0.245.1 presto

创建目录

cd /export/server/presto
mkdir -p data

mkdir -p etc

修改配置文件

      1. 添加node.properties

cd /export/server/presto/etc

vim node.properties

内容如下:

node.environment=presto_cluster

node.id=presto_hadoop01

node.data-dir=/export/server/presto/data

node.environment:环境的名称。群集中的所有Presto节点必须具有相同的环境名称。

node.id:此Presto安装的唯一标识符。这对于每个节点都必须是唯一的。在重新启动或升级Presto时,此标识符应保持一致。如果在一台计算机上运行多个Presto安装(即同一台计算机上有多个节点),则每个安装必须具有唯一的标识符。

node.data-dir:数据目录的位置(文件系统路径)。Presto将在此处存储日志和其他数据。

      2. JVM虚拟机配置: jvm.config

cd /export/server/presto/etc

vim jvm.config

内容如下:

-server

-Xmx5G

-XX:+UseG1GC

-XX:G1HeapRegionSize=32M

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

由于OutOfMemoryError将会导致JVM处于不一致状态,所以遇到这种错误的时候我们一般的处理措施就是记录下dump heap中的信息(用于debugging),然后强制终止进程。

      3. Presto服务配置: config.properties

cd /export/server/presto/etc

vim config.properties

内容如下:

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8090

query.max-memory=4GB

query.max-memory-per-node=1GB

query.max-total-memory-per-node=2GB

discovery-server.enabled=true

discovery.uri=http://192.168.88.80:8090

注意:

  当前配置表示既是coordinator 又是worker

当节点仅作为coordinator的时候, 需要将node-scheduler.include-coordinator改为false

当节点仅作为worker角色时, 需要将coordinator改为false, 并删除discovery-server.enabled和node-scheduler.include-coordinator配置

coordinator:允许此Presto实例充当coordinator协调器角色(接受来自客户端的查询并管理查询执行)。

node-scheduler.include-coordinator:允许此Presto实例充当coordinator&worker角色。对于较大的群集,coordinator上的worker工作可能会影响查询性能,因为两者互相争抢计算机的资源会导致调度的关键任务受到影响。

http-server.http.port:指定HTTP服务器的端口。Presto使用HTTP进行内部和外部所有通信。

query.max-memory:单个query操作可以使用的最大集群内存量。

query.max-memory-per-node:单个query操作在单个节点上用户内存能用的最大值。

query.max-total-memory-per-node:单个query操作可在单个节点上使用的最大用户内存量和系统内存量,其中系统内存是读取器、写入器和网络缓冲区等在执行期间使用的内存。

discovery-server.enabled:Presto使用发现服务Discovery service来查找群集中的所有节点。每个Presto实例在启动时都会向Discovery服务注册。为了简化部署并避免运行其他服务,Presto协调器coordinator可以运行Discovery服务的嵌入式版本。它与Presto共享HTTP服务器,因此使用相同的端口。

discovery.uri:Discovery服务的URI地址。由于启用了Presto coordinator内嵌的Discovery 服务,因此这个uri就是Presto coordinator的uri。修改example.net:8090,根据你的实际环境设置该URI。此URI不得以“/“结尾。
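
Following the role notes above, a minimal config.properties sketch for a worker-only node might look like the following (the coordinator address reuses 192.168.88.80 from this setup; adjust the port and memory values to your environment):

coordinator=false
http-server.http.port=8090
query.max-memory=4GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://192.168.88.80:8090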

      4. 日志级别: log.properties

cd /export/server/presto/etc

vim log.properties

内容如下:

com.facebook.presto=INFO

这会将com.facebook.presto.server和com.facebook.presto.hive的日志级别都设置为INFO。

共有四个级别:DEBUG,INFO,WARN和ERROR。

      5. 连接器配置

Presto通过catalogs中的连接器connectors访问数据。connector提供了对应catalog中的所有schema和table。比如,如果在catalog中配置了Hive connector,并且此Hive的’web’数据库中有一个’clicks’表,该表在Presto中就可以通过hive.web.clicks来访问。

通过在etc/catalog目录中创建配置文件来注册connector。比如,通过创建etc/catalog/hive.properties,即可用来注册hive的connector:

mkdir -p /export/server/presto/etc/catalog

cd /export/server/presto/etc/catalog

vim hive.properties

内容如下:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://192.168.88.80:9083

启动presto集群

cd /export/server/presto

执行:

bin/launcher start

注意: 三台都需要执行

查看web监控界面

访问web:http://192.168.88.80:8090/ui/

主页面显示了正在执行的查询数,正常活动的Worker数,排队的查询数,阻塞的查询数,并行度等等;以及每个查询的列表区域(包括查询的ID,查询语句,查询状态,用户名,数据源等等)。正在查询的排在最上面,紧接着依次为最近完成的查询,失败的查询等。

查询的状态有以下几种:

QUEUED-查询已经被接受,正等待执行

PLANNING-查询在计划中

STARTING-查询已经开始执行

RUNNING-查询已经运行,至少有一个task开始执行

BLOCKED-查询被阻塞,并且在等待资源(缓存空间,内存,切片)

FINISHING-查询正完成(比如commit for autocommit queries)

FINISHED-查询已经完成(比如数据已输出)

FAILED-查询执行失败

presto_cli 安装

Presto CLI提供了基于终端的交互式命令程序,用于运行查询。Presto CLI是一个可执行的JAR文件,这意味着它的行为类似于普通的UNIX可执行文件。

下载presto-cli-0.245.1-executable.jar(https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.245.1/presto-cli-0.245.1-executable.jar),将其重命名为presto,使用chmod +x分配执行权限后,运行:

#上传presto-cli-0.245.1-executable.jar到/export/server/presto/bin

mv presto-cli-0.245.1-executable.jar presto

chmod +x presto

./presto --server localhost:8090 --catalog hive --schema yp_ods
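
Once connected, a quick sanity check could look like the following sketch (the last statement uses a placeholder table name; substitute any real table from the yp_ods schema):

-- inside the presto CLI prompt
show catalogs;        -- the hive catalog configured above should be listed
show tables;          -- tables of the current schema (yp_ods)
select count(*) from some_ods_table;   -- placeholder name, replace with a real ODS table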

10.2 presto的架构

10.3 presto的相关时间函数

Presto is strict about data types. For example, if a column is of type date, then the value used in a filter condition on that column must also be of type date.

  • date_format(timestamp,format):将一个带有年月日 时分秒的日期对象 转换为字符串。
  • date_parse(string,format):timestamp:将带有年月日 时分秒的日期字符串 转换为 日期对象。
  • date(日期对象):date :将带有年月日 时分秒的日期对象,转换为仅包含年月日的日期对象。

说明
date 类型:表示只有 **年 月 日**

**timestamp** 类型:表示 **年 月 日 时 分 秒**

format格式 :
年:%Y
月:%m
日:%d

时:%H
分:%i
秒:%s
周几:%w(0..6)
--2020-10-10
--timestamp '日期字符串':直接将标准格式的日期字符串转换为日期对象
select date_format(timestamp '2020-10-10 12:50:50', '%Y-%m-%d');

--2020-10-10
--date_parse:可以将指定日期格式的字符串转换为日期对象
select date_format(date_parse('2020-10-10 12:50:50', '%Y-%m-%d %H:%i:%s'), '%Y-%m-%d');

日期计算的操作:

  • 根据 unit设置时间单位,对timestamp的时间数据 进行 value计算 ,计算为+号

      **date_add(unit,value,timestamp)→ [same as input]**
    
  • 根据 unit单位,时间timestamp2-时间timestamp1 ,得出差值

**date_diff(unit, timestamp1, timestamp2) → bigint**

--2021-05-05 15:20:21
select date_add('hour',3,timestamp'2021-05-05 12:20:21');

select
date_diff(
'day', 
timestamp '2020-07-30 12:20:21',
timestamp '2021-05-01 12:20:21'
) --275

10.4 presto的内存调整

  • Recommended JVM heap size per node: about 80% of the node's remaining memory
  • memory.heap-headroom-per-node (memory reserved for third-party libraries): about 15% of the JVM heap is recommended; the default is 30%
  • Do not size the configuration right at the limit; leave a margin of roughly 5%~10% to avoid problems
  • The split between user memory and system memory usually follows an 8/2 or 7/3 rule

Rule of thumb:
With about 35 TB of data, roughly 30 Presto nodes are used (each node: 128 GB RAM + 8-core CPU).
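
As an illustration only (numbers derived from the percentages above, not values mandated by the course), these rules could translate to something like the following on a 128 GB node:

# jvm.config: ~80% of 128 GB, rounded down to keep extra margin
-Xmx96G

# config.properties (assumed values; tune per workload)
memory.heap-headroom-per-node=15GB
query.max-memory-per-node=56GB
query.max-total-memory-per-node=80GB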

sql 的优化

10.5 Presto高级语法

10.5.1 Grouping Sets, CUBE and ROLLUP
  • Grouping sets:

      When a table needs to be grouped by several different column combinations and all of those grouped results have to be unioned into one result set, grouping sets lets you write this in a single query: simply list every grouping combination inside grouping sets(...). (A combined example appears after section 10.5.2 below.)

  • CUBE

CUBE generates every possible grouping-sets combination of the given columns.

  • ROLLUP

ROLLUP first groups by all of the listed columns together, then drops the trailing column one at a time and groups again, until the grouping is empty.

10.5.2 Grouping

grouping(col1, col2, ...): used to tell whether a result row was produced by grouping on the given column(s); in Presto it returns a bitmask in which a bit is 0 when the corresponding column is part of the grouping and 1 when it is not.
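
A minimal sketch (hypothetical table t with grouping columns a, b and a measure m) showing grouping sets together with grouping():

select
    a, b,
    sum(m) as total,
    grouping(a, b) as g   -- 0: grouped by (a,b); 1: grouped by (a); 3: grand total
from t
group by grouping sets ((a, b), (a), ());

-- for comparison:
-- cube(a, b)   is equivalent to grouping sets ((a, b), (a), (b), ())
-- rollup(a, b) is equivalent to grouping sets ((a, b), (a), ())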

11. 基于presto统计DWS层

11.1 销售主题

-- 第一步: 对数据进行去重操作,通过row_number实现
insert into hive.yp_dws.dws_sale_daycount
with t1 as (
    select
        --维度字段
        o.dt, --时间
        s.province_id,
        s.province_name,
        s.city_id,
        s.city_name, --城市维度
        s.trade_area_id,
        s.trade_area_name, --商圈
        o.store_id,
        s.store_name, --店铺
        g.brand_id,
        g.brand_name, --品牌
        g.max_class_id,
        g.max_class_name, --大类
        g.mid_class_id,
        g.mid_class_name, --中类
        g.min_class_id,
        g.min_class_name, --小类

        --指标字段
        o.order_id, --订单id,订单量
        o.order_amount, --订单销售收入
        o.total_price, --商品销售收入
        o.plat_fee, --平台分润
        o.delivery_fee, --配送费用

        --判断字段
        o.order_from, --渠道(小程序,安卓,苹果,pc)
        o.evaluation_id, --评价表id 是否有参评
        o.delievery_id, --配送id 是否有配送
        o.geval_scores, --综合评分:10分制
        o.refund_id, --退款id 是否有退款

        --去重操作
        --计算日期,日期+城市,日期+城市+商圈,日期+城市+商圈+店铺
        row_number() over(partition by o.order_id) as order_rn,
        --计算日期加品牌:第一个去重计算订单量第二个去重计算销售额
        row_number() over(partition by o.order_id,g.brand_id) as brand_rn,
        row_number() over(partition by o.order_id,o.goods_id,g.brand_id) as brand_goods_rn,
        --计算日期+大类
        row_number() over(partition by o.order_id,g.max_class_id) as max_class_rn,
        row_number() over(partition by o.order_id,o.goods_id,g.max_class_id) as max_class_goods_rn,
         --计算日期+大类+中类
        row_number() over(partition by o.order_id,g.max_class_id,g.mid_class_id) as mid_class_rn,
        row_number() over(partition by o.order_id,o.goods_id,g.max_class_id,g.mid_class_id) as mid_class_goods_rn,
         --计算日期+大类+中类+小类
        row_number() over(partition by o.order_id,g.max_class_id,g.mid_class_id,g.min_class_id) as min_class_rn,
        row_number() over(partition by o.order_id,o.goods_id,g.max_class_id,g.mid_class_id,g.min_class_id) as min_class_goods_rn

    from yp_dwb.dwb_order_detail o
    left join yp_dwb.dwb_shop_detail s on o.store_id = s.id
    left join yp_dwb.dwb_goods_detail g on o.goods_id = g.id
    where o.is_pay = 1 and o.order_state not in (1,7)
)
select
    --维度
    province_id,
    province_name,
    city_id,
    city_name,
    trade_area_id,
    trade_area_name,
    store_id,
    store_name,
    brand_id,
    brand_name,
    max_class_id,
    max_class_name,
    mid_class_id,
    mid_class_name,
    min_class_id,
    min_class_name,
    -- group_type字段的值,需要根据不同的维度分组,标上不同的值
--    分组类型:store,trade_area,city,brand,min_class,mid_class,max_class,all
    CASE WHEN grouping(city_id, trade_area_id, store_id)=0
        THEN 'store'
        WHEN grouping(city_id, trade_area_id)=0
        THEN 'trade_area'
        WHEN grouping(city_id)=0
        THEN 'city'
        WHEN grouping(brand_id)=0
        THEN 'brand'
        WHEN grouping(max_class_id, mid_class_id, min_class_id)=0
        THEN 'min_class'
        WHEN grouping(max_class_id, mid_class_id)=0
        THEN 'mid_class'
        WHEN grouping(max_class_id)=0
        THEN 'max_class'
        ELSE 'all'
        END as group_type,
    --销售收入
    case
        when grouping(store_id)= 0 then sum(if(order_rn = 1 and store_id is not null ,order_amount,0) )
        when grouping(trade_area_id)=0 then  sum(if(order_rn = 1 and trade_area_id is not null ,order_amount,0) )
        when grouping(city_id)=0 then sum(if(order_rn = 1 and city_id is not null ,order_amount,0) )
        when grouping(brand_id)= 0 then sum(if(brand_goods_rn = 1 and brand_id is not null ,total_price,0) )
        when grouping(min_class_id)=0 then sum(if(min_class_goods_rn = 1 and min_class_id is not null ,total_price,0) )
        when grouping(mid_class_id)=0 then sum(if(mid_class_goods_rn = 1 and mid_class_id is not null ,total_price,0) )
        when grouping(max_class_id)=0 then sum(if(max_class_goods_rn = 1 and max_class_id is not null ,total_price,0) )
        when grouping(dt)=0 then sum(if(order_rn = 1 and dt is not null ,order_amount,0) )
        else null
    end as sale_amt,
    --平台收入
    case
        when grouping(store_id)= 0 then sum(if(order_rn = 1 and store_id is not null ,plat_fee,0) )
        when grouping(trade_area_id)=0 then  sum(if(order_rn = 1 and trade_area_id is not null ,plat_fee,0) )
        when grouping(city_id)=0 then sum(if(order_rn = 1 and city_id is not null ,plat_fee,0) )
        when grouping(brand_id)= 0 then null
        when grouping(min_class_id)=0 then null
        when grouping(mid_class_id)=0 then null
        when grouping(max_class_id)=0 then null
        when grouping(dt)=0 then sum(if(order_rn = 1 and dt is not null ,plat_fee,0) )
        else null
    end as plat_amt,
    --配送成交
    case
        when grouping(store_id) = 0 then sum(if(order_rn = 1 and store_id is not null, delivery_fee, 0))
        when grouping(trade_area_id) = 0 then sum(if(order_rn = 1 and trade_area_id is not null, delivery_fee, 0))
        when grouping(city_id) = 0 then sum(if(order_rn = 1 and city_id is not null, delivery_fee, 0))
        when grouping(brand_id) = 0 then null
        when grouping(min_class_id) = 0 then null
        when grouping(mid_class_id) = 0 then null
        when grouping(max_class_id) = 0 then null
        when grouping(dt) = 0 then sum(if(order_rn = 1 and dt is not null, delivery_fee, 0))
        else null
        end as delivery_amt,
    --小程序成交额
    case
        when grouping(store_id) = 0
            then sum(if(order_rn = 1 and store_id is not null and order_from = 'miniapp', order_amount, 0))
        when grouping(trade_area_id) = 0
            then sum(if(order_rn = 1 and trade_area_id is not null and order_from = 'miniapp', order_amount, 0))
        when grouping(city_id) = 0
            then sum(if(order_rn = 1 and city_id is not null and order_from = 'miniapp', order_amount, 0))
        when grouping(brand_id) = 0
            then sum(if(brand_goods_rn = 1 and brand_id is not null and order_from = 'miniapp', total_price, 0))
        when grouping(min_class_id) = 0
            then sum(if(min_class_goods_rn = 1 and min_class_id is not null and order_from = 'miniapp', total_price, 0))
        when grouping(mid_class_id) = 0
            then sum(if(mid_class_goods_rn = 1 and mid_class_id is not null and order_from = 'miniapp', total_price, 0))
        when grouping(max_class_id) = 0
            then sum(if(max_class_goods_rn = 1 and max_class_id is not null and order_from = 'miniapp', total_price, 0))
        when grouping(dt) = 0
            then sum(if(order_rn = 1 and dt is not null and order_from = 'miniapp', order_amount, 0))
        else null
        end as miniapp_sale_amt,
        --安卓
    case
        when grouping(store_id) = 0
            then sum(if(order_rn = 1 and store_id is not null and order_from = 'android', order_amount, 0))
        when grouping(trade_area_id) = 0
            then sum(if(order_rn = 1 and trade_area_id is not null and order_from = 'android', order_amount, 0))
        when grouping(city_id) = 0
            then sum(if(order_rn = 1 and city_id is not null and order_from = 'android', order_amount, 0))
        when grouping(brand_id) = 0
            then sum(if(brand_goods_rn = 1 and brand_id is not null and order_from = 'android', total_price, 0))
        when grouping(min_class_id) = 0
            then sum(if(min_class_goods_rn = 1 and min_class_id is not null and order_from = 'android', total_price, 0))
        when grouping(mid_class_id) = 0
            then sum(if(mid_class_goods_rn = 1 and mid_class_id is not null and order_from = 'android', total_price, 0))
        when grouping(max_class_id) = 0
            then sum(if(max_class_goods_rn = 1 and max_class_id is not null and order_from = 'android', total_price, 0))
        when grouping(dt) = 0
            then sum(if(order_rn = 1 and dt is not null and order_from = 'android', order_amount, 0))
        else null
        end as android_sale_amt,
        --苹果
    case
        when grouping(store_id) = 0
            then sum(if(order_rn = 1 and store_id is not null and order_from = 'ios', order_amount, 0))
        when grouping(trade_area_id) = 0
            then sum(if(order_rn = 1 and trade_area_id is not null and order_from = 'ios', order_amount, 0))
        when grouping(city_id) = 0
            then sum(if(order_rn = 1 and city_id is not null and order_from = 'ios', order_amount, 0))
        when grouping(brand_id) = 0
            then sum(if(brand_goods_rn = 1 and brand_id is not null and order_from = 'ios', total_price, 0))
        when grouping(min_class_id) = 0
            then sum(if(min_class_goods_rn = 1 and min_class_id is not null and order_from = 'ios', total_price, 0))
        when grouping(mid_class_id) = 0
            then sum(if(mid_class_goods_rn = 1 and mid_class_id is not null and order_from = 'ios', total_price, 0))
        when grouping(max_class_id) = 0
            then sum(if(max_class_goods_rn = 1 and max_class_id is not null and order_from = 'ios', total_price, 0))
        when grouping(dt) = 0
            then sum(if(order_rn = 1 and dt is not null and order_from = 'ios', order_amount, 0))
        else null
        end as ios_sale_amt,
        --pc
     case
        when grouping(store_id) = 0
            then sum(if(order_rn = 1 and store_id is not null and order_from = 'pcweb', order_amount, 0))
        when grouping(trade_area_id) = 0
            then sum(if(order_rn = 1 and trade_area_id is not null and order_from = 'pcweb', order_amount, 0))
        when grouping(city_id) = 0
            then sum(if(order_rn = 1 and city_id is not null and order_from = 'pcweb', order_amount, 0))
        when grouping(brand_id) = 0
            then sum(if(brand_goods_rn = 1 and brand_id is not null and order_from = 'pcweb', total_price, 0))
        when grouping(min_class_id) = 0
            then sum(if(min_class_goods_rn = 1 and min_class_id is not null and order_from = 'pcweb', total_price, 0))
        when grouping(mid_class_id) = 0
            then sum(if(mid_class_goods_rn = 1 and mid_class_id is not null and order_from = 'pcweb', total_price, 0))
        when grouping(max_class_id) = 0
            then sum(if(max_class_goods_rn = 1 and max_class_id is not null and order_from = 'pcweb', total_price, 0))
        when grouping(dt) = 0
            then sum(if(order_rn = 1 and dt is not null and order_from = 'pcweb', order_amount, 0))
        else null
        end as pcweb_sale_amt,
    --订单量
    --成交单量
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null,order_id,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1,order_id,null))
        else null
        end as order_cnt,
    --参评
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and evaluation_id is not null ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and evaluation_id is not null,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and evaluation_id is not null,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and evaluation_id is not null,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and evaluation_id is not null,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and evaluation_id is not null,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and evaluation_id is not null,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and evaluation_id is not null,order_id,null))
        else null
        end as eva_order_cnt,
        --差评
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and evaluation_id is not null and geval_scores <6,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and evaluation_id is not null and geval_scores <6,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and evaluation_id is not null and geval_scores <6,order_id,null))
        else null
        end as bad_eva_order_cnt,
    --配送成交
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and delievery_id is not null ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and delievery_id is not null,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and delievery_id is not null,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and delievery_id is not null,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and delievery_id is not null,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and delievery_id is not null,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and delievery_id is not null,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and delievery_id is not null,order_id,null))
        else null
        end as deliver_order_cnt,
    --退款
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and refund_id is not null ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and refund_id is not null,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and refund_id is not null,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and refund_id is not null,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and refund_id is not null,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and refund_id is not null,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and refund_id is not null,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and refund_id is not null,order_id,null))
        else null
        end as refund_order_cnt,
    --小程序
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and order_from='miniapp' ,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and order_from='miniapp' ,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and order_from='miniapp' ,order_id,null))
        else null
        end as miniapp_order_cnt,
    --安卓
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and order_from='android' ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and order_from='android' ,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and order_from='android' ,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and order_from='android' ,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and order_from='android' ,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and order_from='android' ,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and order_from='android' ,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and order_from='android' ,order_id,null))
        else null
        end as android_order_cnt,
    --苹果
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and order_from='ios' ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and order_from='ios' ,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and order_from='ios' ,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and order_from='ios' ,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and order_from='ios' ,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and order_from='ios' ,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and order_from='ios' ,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and order_from='ios' ,order_id,null))
        else null
        end as ios_order_cnt,
    --pc
    case
        when grouping(store_id) = 0
            then count(if(order_rn = 1 and store_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(trade_area_id) = 0
            then count(if(order_rn = 1 and trade_area_id is not null and order_from='pcweb' ,order_id ,null))
        when grouping(city_id) = 0
            then count(if(order_rn = 1 and city_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(brand_id) = 0
            then count(if(brand_rn = 1 and brand_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(min_class_id) = 0
            then count(if(min_class_rn = 1 and min_class_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(mid_class_id) = 0
            then count(if(mid_class_rn = 1 and mid_class_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(max_class_id) = 0
            then count(if(max_class_rn = 1 and max_class_id is not null and order_from='pcweb' ,order_id,null))
        when grouping(dt) = 0
            then count(if(order_rn = 1 and order_from='pcweb' ,order_id,null))
        else null
        end as pcweb_order_cnt,
    dt

from t1
group by grouping sets (
    dt,
    (dt,province_id,province_name,city_id,city_name),
    (dt,province_id,province_name,city_id,city_name,trade_area_id,trade_area_name),
    (dt,province_id,province_name,city_id,city_name,trade_area_id,trade_area_name,store_id,store_name),
    (dt,brand_id,brand_name),
    (dt,max_class_id,max_class_name),
    (dt,max_class_id,max_class_name,mid_class_id,mid_class_name),
    (dt,max_class_id,max_class_name,mid_class_id,mid_class_name,min_class_id,min_class_name)
);

select count(1) from yp_dws.dws_sale_daycount; --557
select count(1) from yp_dwb.dwb_order_detail; --271
select count(1) from yp_dwb.dwb_goods_detail; --2453
select count(1) from yp_dwb.dwb_shop_detail; --352
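
As an extra sanity check (a suggestion beyond the original verification queries), the distribution of rows per aggregation level can also be inspected:

select group_type, count(1) as cnt
from yp_dws.dws_sale_daycount
group by group_type;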

11.2 商品主题

-- 对订单表进行指标计算的时候,需要先去重(由于我们是基于商品统计, 需要保证每笔订单中有不同的商品, 相同商品去除掉)
insert into hive.yp_dws.dws_sku_daycount
with t1 as (
    select
    --维度字段
        dt,
        goods_id,
        goods_name,
        pay_time, --支付时间
        apply_date, --退款时间

    --指标字段
        order_id, --计算次数
        buy_num, --商品数量
        total_price, --价格

    --判断字段
        is_pay, --是否支付
        order_state, --订单状态
        refund_id, --退款id
        refund_state, --退款状态
    --去重
        row_number() over(partition by order_id,goods_id) as rn1
    from yp_dwb.dwb_order_detail
),
--下单次数、下单件数、下单金额:
order_goods_1 as (
    select
        dt,
        goods_id,
        goods_name,
        count(order_id) as order_count,
        sum(buy_num) as order_num,
        sum(total_price) as order_amount
    from t1 where rn1 =1
    group by goods_id,goods_name,dt
),
---被支付次数、被支付件数、被支付金额:
payment_2 as (
    select
        substr(pay_time,1,10) as dt,
        goods_id,
        goods_name,
        count(order_id) as payment_count,
        sum(buy_num) as payment_num,
        sum(total_price) as payment_amount

    from t1 where rn1=1 and is_pay = 1 and order_state not in (1,7)
    group by substr(pay_time,1,10),goods_id,goods_name
),
refund_3 as (
    select
        substr(apply_date,1,10) as dt,
        goods_id,
        goods_name,
        count(order_id) as refund_count,
        sum(buy_num) as refund_num,
        sum(total_price) as refund_amount

    from t1 where rn1 = 1 and refund_id is not null  and refund_state =5
    group by substr(apply_date,1,10),goods_id,goods_name
),
--被收藏的次数
favor_4 as (
    select
        substr(c.create_time,1,10) as dt,
        c.goods_id,
        g.goods_name,
        count(c.id) as favor_count
    from yp_dwd.fact_goods_collect c
    join yp_dwb.dwb_goods_detail g on c.goods_id = g.id and c.end_date ='9999-99-99'
    group by substr(c.create_time,1,10),c.goods_id,g.goods_name
),
--加入购物车次数和件数
cart_5 as (
    select
        substr(c.create_time,1,10) as dt,
        c.goods_id,
        g.goods_name,
        count(c.id) as cart_count,
        sum(c.buy_num) as cart_num
    from yp_dwd.fact_shop_cart c join yp_dwb.dwb_goods_detail g
    on c.goods_id = g.id and c.end_date ='9999-99-99'
    group by substr(c.create_time,1,10), c.goods_id, g.goods_name
),
--好中差评数量
eva_6 as (
    select
        substr(eva.create_time,1,10) as dt,
        eva.goods_id,
        g.goods_name,
        count(if(eva.geval_scores_goods >8,eva.goods_id,null)) as evaluation_good_count,
        count(if(eva.geval_scores_goods between 6 and 8,eva.goods_id,null))as evaluation_mid_count,
        count(if(eva.geval_scores_goods <6,eva.goods_id,null)) as evaluation_bad_count

    from yp_dwd.fact_goods_evaluation_detail eva join yp_dwb.dwb_goods_detail g
    on eva.goods_id = g.id and eva.end_date = '9999-99-99'
    group by substr(eva.create_time,1,10),eva.goods_id,g.goods_name
),
--将所有临时表灌入到表中
t7 as (
    select
        coalesce(t1.dt,t2.dt,t3.dt,t4.dt,t5.dt,t6.dt) as dt,
        coalesce(t1.goods_id,t2.goods_id,t3.goods_id,t4.goods_id,t5.goods_id,t6.goods_id) as sku_id,
        coalesce(t1.goods_name,t2.goods_name,t3.goods_name,t4.goods_name,t5.goods_name,t6.goods_name) as sku_name,
        coalesce(t1.order_count,0) as order_count,
        coalesce(t1.order_num,0) as order_num,
        coalesce(t1.order_amount,0) as order_amount,
        coalesce(t2.payment_count,0) as payment_count,
        coalesce(t2.payment_num,0) as payment_num,
        coalesce(t2.payment_amount,0) as payment_amount,
        coalesce(t3.refund_count,0) as refund_count,
        coalesce(t3.refund_num,0) as refund_num,
        coalesce(t3.refund_amount,0) as refund_amount,
        coalesce(t5.cart_count,0) as cart_count,
        coalesce(t5.cart_num,0) as cart_num,
        coalesce(t4.favor_count,0) as favor_count,
        coalesce(t6.evaluation_good_count,0) as evaluation_good_count,
        coalesce(t6.evaluation_mid_count,0) as evaluation_mid_count,
        coalesce(t6.evaluation_bad_count,0) as evaluation_bad_count

    from order_goods_1 t1
        full join payment_2 t2 on t1.dt=t2.dt and t1.goods_id = t2.goods_id
        full join refund_3 t3 on t3.dt = t2.dt and t3.goods_id = t2.goods_id
        full join favor_4 t4 on t4.dt = t3.dt and t4.goods_id = t3.goods_id
        full join cart_5 t5 on t5.dt = t4.dt and t5.goods_id = t4.goods_id
        full join eva_6 t6 on t6.dt = t5.dt and t6.goods_id = t5.goods_id
)
select
    sku_id,
    sku_name,
    sum(order_count) as order_count,
    sum(order_num) as order_num,
    sum(order_amount) as order_amount,
    sum(payment_count) as payment_count,
    sum(payment_num) as payment_num,
    sum(payment_amount) as payment_amount,
    sum(refund_count) as refund_count,
    sum(refund_num) as refund_num,
    sum(refund_amount) as refund_amount,
    sum(cart_count) as cart_count,
    sum(cart_num) as cart_num,
    sum(favor_count) as favor_count,
    sum(evaluation_good_count) as evaluation_good_count,
    sum(evaluation_mid_count) as evaluation_mid_count,
    sum(evaluation_bad_count) as evaluation_bad_count,
    dt
from t7
group by dt,sku_id,sku_name;
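
Mirroring the checks done for the sales theme, a quick verification of the result (a suggested check, not part of the original script) could be:

select count(1) from yp_dws.dws_sku_daycount;
select dt, count(1) from yp_dws.dws_sku_daycount group by dt order by dt;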

12. DM

建库

CREATE DATABASE IF NOT EXISTS yp_dm;

12.1 销售主题宽表

建表

--  销售主题宽表
DROP TABLE IF EXISTS yp_dm.dm_sale;
CREATE TABLE yp_dm.dm_sale(
   date_time string COMMENT '统计日期,不能用来分组统计',
   time_type string COMMENT '统计时间维度:year、month、week、date',
   year_code string COMMENT '年code',
   year_month string COMMENT '年月',
   month_code string COMMENT '月份编码', 
   day_month_num string COMMENT '一月第几天', 
   dim_date_id string COMMENT '日期',
   year_week_name_cn string COMMENT '年中第几周',
   
   group_type string COMMENT '分组类型:store,trade_area,city,brand,min_class,mid_class,max_class,all',
   city_id string COMMENT '城市id',
   city_name string COMMENT '城市name',
   trade_area_id string COMMENT '商圈id',
   trade_area_name string COMMENT '商圈名称',
   store_id string COMMENT '店铺的id',
   store_name string COMMENT '店铺名称',
   brand_id string COMMENT '品牌id',
   brand_name string COMMENT '品牌名称',
   max_class_id string COMMENT '商品大类id',
   max_class_name string COMMENT '大类名称',
   mid_class_id string COMMENT '中类id', 
   mid_class_name string COMMENT '中类名称',
   min_class_id string COMMENT '小类id', 
   min_class_name string COMMENT '小类名称',
   --    =======统计=======
   --    销售收入
   sale_amt DECIMAL(38,2) COMMENT '销售收入',
   --    平台收入
   plat_amt DECIMAL(38,2) COMMENT '平台收入',
   --  配送成交额
   deliver_sale_amt DECIMAL(38,2) COMMENT '配送成交额',
   --  小程序成交额
   mini_app_sale_amt DECIMAL(38,2) COMMENT '小程序成交额',
   --  安卓APP成交额
   android_sale_amt DECIMAL(38,2) COMMENT '安卓APP成交额',
   --   苹果APP成交额
   ios_sale_amt DECIMAL(38,2) COMMENT '苹果APP成交额',
   --  PC商城成交额
   pcweb_sale_amt DECIMAL(38,2) COMMENT 'PC商城成交额',
   --  成交单量
   order_cnt BIGINT COMMENT '成交单量',
   --  参评单量
   eva_order_cnt BIGINT COMMENT '参评单量comment=>cmt',
   --  差评单量
   bad_eva_order_cnt BIGINT COMMENT '差评单量negtive-comment=>ncmt',
   --  配送成交单量
   deliver_order_cnt BIGINT COMMENT '配送单量',
   --  退款单量
   refund_order_cnt BIGINT COMMENT '退款单量',
   --  小程序成交单量
   miniapp_order_cnt BIGINT COMMENT '小程序成交单量',
   --  安卓APP订单量
   android_order_cnt BIGINT COMMENT '安卓APP订单量',
   --  苹果APP订单量
   ios_order_cnt BIGINT COMMENT '苹果APP订单量',
   --  PC商城成交单量
   pcweb_order_cnt BIGINT COMMENT 'PC商城成交单量'
)
COMMENT '销售主题宽表' 
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

插入数据

  • 统计分析:先按照天来统计(前面已经统计过了)
insert into hive.yp_dm.dm_sale
select
    t1.dt as date_time,
    'date' as time_type,
    t2.year_code,
    t2.year_month,
    t2.month_code,
    t2.day_month_num,
    t2.dim_date_id,
    t2.year_week_name_cn,

    --维度
    t1.group_type,
    t1.city_id,
    t1.city_name,
    t1.trade_area_id,
    t1.trade_area_name,
    t1.store_id,
    t1.store_name,
    t1.brand_id,
    t1.brand_name,
    t1.max_class_id,
    t1.max_class_name,
    t1.mid_class_id,
    t1.mid_class_name,
    t1.min_class_id,
    t1.min_class_name,

    --指标
    t1.sale_amt,
    t1.plat_amt,
    t1.deliver_sale_amt,
    t1.mini_app_sale_amt,
    t1.android_sale_amt,
    t1.ios_sale_amt,
    t1.pcweb_sale_amt,
    t1.order_cnt,
    t1.eva_order_cnt,
    t1.bad_eva_order_cnt,
    t1.deliver_order_cnt,
    t1.refund_order_cnt,
    t1.miniapp_order_cnt,
    t1.android_order_cnt,
    t1.ios_order_cnt,
    t1.pcweb_order_cnt

from yp_dws.dws_sale_daycount t1
    left join hive.yp_dwd.dim_date t2 on t1.dt = t2.date_code;

select count(1) from yp_dm.dm_sale;
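
The week-level statement below tells the aggregation levels apart with grouping() over eight columns. As a reading aid (an interpretation based on Presto's grouping() semantics, where the first argument is the most significant bit and 0 means the column is part of the grouping), the constants map as follows:

-- grouping(year_week_name_cn, city_id, trade_area_id, store_id, brand_id, max_class_id, mid_class_id, min_class_id)
-- 00001111 = 15  -> week + city + trade_area + store    (store level)
-- 00011111 = 31  -> week + city + trade_area            (trade_area level)
-- 00111111 = 63  -> week + city                         (city level)
-- 01110111 = 119 -> week + brand                        (brand level)
-- 01111000 = 120 -> week + max + mid + min class        (min_class level)
-- 01111001 = 121 -> week + max + mid class              (mid_class level)
-- 01111011 = 123 -> week + max class                    (max_class level)
-- 01111111 = 127 -> week only                           ('all' level)
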
  • 统计分析:按周来统计

insert into hive.yp_dm.dm_sale
with dim_date as (
    select
        date_code,
        year_code,
        year_month,
        month_code,
        day_month_num,
        dim_date_id,
        year_week_name_cn
    from yp_dwd.dim_date
),
t1 as (
select
    '2024-08-12' as date_time,
    'week' as time_type,
    null as year_code,
    null as year_month,
    null as month_code,
    null as day_month_num,
    null as dim_date_id,
    year_week_name_cn,

    --维度
    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then 'store'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then 'trade_area'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then 'city'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then 'brand'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then 'min_class'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then 'mid_class'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then 'max_class'
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then 'all'
        else null
    end as group_type,

    s.city_id,
    s.city_name,
    s.trade_area_id,
    s.trade_area_name,
    s.store_id,
    s.store_name,
    s.brand_id,
    s.brand_name,
    s.max_class_id,
    s.max_class_name,
    s.mid_class_id,
    s.mid_class_name,
    s.min_class_id,
    s.min_class_name,

    --指标
     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.sale_amt)
        else null
     end as sale_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.plat_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.plat_amt)
        else null
     end as plat_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.deliver_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.deliver_sale_amt)
        else null
     end as deliver_sale_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.mini_app_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.mini_app_sale_amt)
        else null
     end as mini_app_sale_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.android_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.android_sale_amt)
        else null
     end as android_sale_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.ios_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.ios_sale_amt)
        else null
     end as ios_sale_amt,

     case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.pcweb_sale_amt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.pcweb_sale_amt)
        else null
     end as pcweb_sale_amt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.order_cnt)
        else null
     end as order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.eva_order_cnt)
        else null
     end as eva_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.bad_eva_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.bad_eva_order_cnt)
        else null
     end as bad_eva_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.deliver_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.deliver_order_cnt)
        else null
     end as deliver_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.refund_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.refund_order_cnt)
        else null
     end as refund_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.miniapp_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.miniapp_order_cnt)
        else null
     end as miniapp_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.android_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.android_order_cnt)
        else null
     end as android_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.ios_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.ios_order_cnt)
        else null
     end as ios_order_cnt,

    case
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=15 and s.group_type = 'store'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=31 and s.group_type = 'trade_area'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=63 and s.group_type = 'city'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=119 and s.group_type = 'brand'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=120 and s.group_type = 'min_class'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=121 and s.group_type = 'mid_class'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=123 and s.group_type = 'max_class'
            then sum(s.pcweb_order_cnt)
        when grouping(d.year_week_name_cn,s.city_id,s.trade_area_id,s.store_id,s.brand_id,s.max_class_id,mid_class_id,s.min_class_id)=127 and s.group_type = 'all'
            then sum(s.pcweb_order_cnt)
        else null
     end as pcweb_order_cnt

from yp_dws.dws_sale_daycount s left join yp_dwd.dim_date d on s.dt = d.date_code
group by
    grouping sets (
    --日期+group_type
    (d.year_code,d.year_week_name_cn,s.group_type),
    --日期+城市+group_type
    (d.year_code,d.year_week_name_cn,s.city_id,s.city_name,s.group_type),
    --日期+城市+商圈+group_type
    (d.year_code,d.year_week_name_cn,s.city_id,s.city_name,s.trade_area_id,s.trade_area_name,s.group_type),
    --日期+城市+商圈+店铺+group_type
    (d.year_code,d.year_week_name_cn,s.city_id,s.city_name,s.trade_area_id,s.trade_area_name,s.store_id,s.store_name,s.group_type),
    --日期+品牌
    (d.year_code,d.year_week_name_cn,s.brand_id,s.brand_name,s.group_type),
    --日期+大类
    (d.year_code,d.year_week_name_cn,s.max_class_id,s.max_class_name,s.group_type),
    --日期+大类+中类
    (d.year_code,d.year_week_name_cn,s.max_class_id,s.max_class_name,mid_class_id,s.mid_class_name,s.group_type),
    --日期+大类+中类+小类
    (d.year_code,d.year_week_name_cn,s.max_class_id,s.max_class_name,mid_class_id,s.mid_class_name,s.min_class_id,s.min_class_name,s.group_type)
    )
)
select * from t1 where group_type is not null;
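
上面 CASE 中的 15、31、63、119、120、121、123、127 等数值来自 grouping() 函数的位运算结果: 最左侧参数对应最高位, 未参与当前分组的列对应位为 1, 参与分组的列对应位为 0。下面是一个极简的演示查询(假设性示例, 仅用于说明位值的计算方式, 字段以实际表为准):

-- 假设性演示: grouping() 的位值计算规则
select
    grouping(d.year_week_name_cn, s.city_id, s.max_class_id) as grouping_val,
    d.year_week_name_cn,
    s.city_id,
    s.max_class_id,
    sum(s.order_cnt) as order_cnt
from yp_dws.dws_sale_daycount s
left join yp_dwd.dim_date d on s.dt = d.date_code
group by grouping sets (
    (d.year_week_name_cn),                  -- 预期 grouping_val = 3 (二进制 011): 城市、大类均未分组
    (d.year_week_name_cn, s.city_id),       -- 预期 grouping_val = 1 (二进制 001): 仅大类未分组
    (d.year_week_name_cn, s.max_class_id)   -- 预期 grouping_val = 2 (二进制 010): 仅城市未分组
);
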
insert into yp_dm.dm_sale
-- 获取日期数据(周、月的环比/同比日期)
with dt1 as (
  select
   dim_date_id, date_code
    ,date_id_mom -- 与本月环比的上月日期
    ,date_id_mym -- 与本月同比的上年日期
    ,year_code
    ,month_code
    ,year_month     --年月
    ,day_month_num --几号
    ,week_day_code --周几
    ,year_week_name_cn  --年周
from yp_dwd.dim_date
)
select
-- 统计日期
   '2021-03-17' as date_time,
-- 时间维度      year、month、date
   case when grouping(dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id) = 0
      then 'date'
       when grouping(dt1.year_code, dt1.year_week_name_cn) = 0
      then 'week'
      when grouping(dt1.year_code, dt1.month_code, dt1.year_month) = 0
      then 'month'
      when grouping(dt1.year_code) = 0
      then 'year'
   end
   as time_type,
   dt1.year_code,
   dt1.year_month,
   dt1.month_code,
   dt1.day_month_num, --几号
   dt1.dim_date_id,
    dt1.year_week_name_cn,  --第几周
-- 产品维度类型:store,trade_area,city,brand,min_class,mid_class,max_class,all
   CASE WHEN grouping(dc.city_id, dc.trade_area_id, dc.store_id)=0
         THEN 'store'
         WHEN grouping(dc.city_id, dc.trade_area_id)=0
         THEN 'trade_area'
         WHEN grouping(dc.city_id)=0
         THEN 'city'
         WHEN grouping(dc.brand_id)=0
         THEN 'brand'
         WHEN grouping(dc.max_class_id, dc.mid_class_id, dc.min_class_id)=0
         THEN 'min_class'
         WHEN grouping(dc.max_class_id, dc.mid_class_id)=0
         THEN 'mid_class'
         WHEN grouping(dc.max_class_id)=0
         THEN 'max_class'
         ELSE 'all'
         END as group_type,
   dc.city_id,
   dc.city_name,
   dc.trade_area_id,
   dc.trade_area_name,
   dc.store_id,
   dc.store_name,
   dc.brand_id,
   dc.brand_name,
   dc.max_class_id,
   dc.max_class_name,
   dc.mid_class_id,
   dc.mid_class_name,
   dc.min_class_id,
   dc.min_class_name,
-- 统计值
    sum(dc.sale_amt) as sale_amt,
   sum(dc.plat_amt) as plat_amt,
   sum(dc.deliver_sale_amt) as deliver_sale_amt,
   sum(dc.mini_app_sale_amt) as mini_app_sale_amt,
   sum(dc.android_sale_amt) as android_sale_amt,
   sum(dc.ios_sale_amt) as ios_sale_amt,
   sum(dc.pcweb_sale_amt) as pcweb_sale_amt,

   sum(dc.order_cnt) as order_cnt,
   sum(dc.eva_order_cnt) as eva_order_cnt,
   sum(dc.bad_eva_order_cnt) as bad_eva_order_cnt,
   sum(dc.deliver_order_cnt) as deliver_order_cnt,
   sum(dc.refund_order_cnt) as refund_order_cnt,
   sum(dc.miniapp_order_cnt) as miniapp_order_cnt,
   sum(dc.android_order_cnt) as android_order_cnt,
   sum(dc.ios_order_cnt) as ios_order_cnt,
   sum(dc.pcweb_order_cnt) as pcweb_order_cnt
from yp_dws.dws_sale_daycount dc
   left join dt1 on dc.dt = dt1.date_code
--WHERE dc.dt >= '2019-01-01'
group by
grouping sets (
-- 年,注意养成加小括号的习惯
   (dt1.year_code),
   (dt1.year_code, city_id, city_name),
   (dt1.year_code, city_id, city_name, trade_area_id, trade_area_name),
   (dt1.year_code, city_id, city_name, trade_area_id, trade_area_name, store_id, store_name),
    (dt1.year_code, brand_id, brand_name),
    (dt1.year_code, max_class_id, max_class_name),
    (dt1.year_code, max_class_id, max_class_name,mid_class_id, mid_class_name),
    (dt1.year_code, max_class_id, max_class_name,mid_class_id, mid_class_name,min_class_id, min_class_name),
--  月
   (dt1.year_code, dt1.month_code, dt1.year_month),
   (dt1.year_code, dt1.month_code, dt1.year_month, city_id, city_name),
   (dt1.year_code, dt1.month_code, dt1.year_month, city_id, city_name, trade_area_id, trade_area_name),
   (dt1.year_code, dt1.month_code, dt1.year_month, city_id, city_name, trade_area_id, trade_area_name, store_id, store_name),
    (dt1.year_code, dt1.month_code, dt1.year_month, brand_id, brand_name),
    (dt1.year_code, dt1.month_code, dt1.year_month, max_class_id, max_class_name),
    (dt1.year_code, dt1.month_code, dt1.year_month, max_class_id, max_class_name,mid_class_id, mid_class_name),
    (dt1.year_code, dt1.month_code, dt1.year_month, max_class_id, max_class_name,mid_class_id, mid_class_name,min_class_id, min_class_name),
-- 日
   (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id),
   (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, city_id, city_name),
   (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, city_id, city_name, trade_area_id, trade_area_name),
   (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, city_id, city_name, trade_area_id, trade_area_name, store_id, store_name),
    (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, brand_id, brand_name),
    (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, max_class_id, max_class_name),
    (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, max_class_id, max_class_name,mid_class_id, mid_class_name),
    (dt1.year_code, dt1.month_code, dt1.day_month_num, dt1.dim_date_id, max_class_id, max_class_name,mid_class_id, mid_class_name,min_class_id, min_class_name),
--  周
   (dt1.year_code, dt1.year_week_name_cn),
   (dt1.year_code, dt1.year_week_name_cn, city_id, city_name),
   (dt1.year_code, dt1.year_week_name_cn, city_id, city_name, trade_area_id, trade_area_name),
   (dt1.year_code, dt1.year_week_name_cn, city_id, city_name, trade_area_id, trade_area_name, store_id, store_name),
    (dt1.year_code, dt1.year_week_name_cn, brand_id, brand_name),
    (dt1.year_code, dt1.year_week_name_cn, max_class_id, max_class_name),
    (dt1.year_code, dt1.year_week_name_cn, max_class_id, max_class_name,mid_class_id, mid_class_name),
    (dt1.year_code, dt1.year_week_name_cn, max_class_id, max_class_name,mid_class_id, mid_class_name,min_class_id, min_class_name)
)
-- order by time_type desc
;
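
上面的 grouping sets 共包含 年/月/日/周 4 个时间粒度 × 8 种维度组合 = 32 个分组集。导入完成后, 可以用下面的查询粗略校验各组合是否都有数据(假设性校验示例):

-- 假设性校验示例: 检查 dm_sale 中各 time_type / group_type 组合的行数
select time_type, group_type, count(*) as row_cnt
from yp_dm.dm_sale
group by time_type, group_type
order by time_type, group_type;
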

12.2 商品主题宽表

--  商品主题宽表
drop table if exists yp_dm.dm_sku;
create table yp_dm.dm_sku
(

    sku_id string comment 'sku_id',
    sku_name string comment '商品名称',
    order_last_30d_count bigint comment '最近30日被下单次数',
    order_last_30d_num bigint comment '最近30日被下单件数',
    order_last_30d_amount decimal(38,2)  comment '最近30日被下单金额',
    order_count bigint comment '累积被下单次数',
    order_num bigint comment '累积被下单件数',
    order_amount decimal(38,2) comment '累积被下单金额',
    payment_last_30d_count   bigint  comment '最近30日被支付次数',
    payment_last_30d_num bigint comment '最近30日被支付件数',
    payment_last_30d_amount  decimal(38,2) comment '最近30日被支付金额',
    payment_count   bigint  comment '累积被支付次数',
    payment_num bigint comment '累积被支付件数',
    payment_amount  decimal(38,2) comment '累积被支付金额',
    refund_last_30d_count bigint comment '最近三十日退款次数',
    refund_last_30d_num bigint comment '最近三十日退款件数',
    refund_last_30d_amount decimal(38,2) comment '最近三十日退款金额',
    refund_count bigint comment '累积退款次数',
    refund_num bigint comment '累积退款件数',
    refund_amount decimal(38,2) comment '累积退款金额',
    cart_last_30d_count bigint comment '最近30日被加入购物车次数',
    cart_last_30d_num bigint comment '最近30日被加入购物车件数',
    cart_count bigint comment '累积被加入购物车次数',
    cart_num bigint comment '累积被加入购物车件数',
    favor_last_30d_count bigint comment '最近30日被收藏次数',
    favor_count bigint comment '累积被收藏次数',
    evaluation_last_30d_good_count bigint comment '最近30日好评数',
    evaluation_last_30d_mid_count bigint comment '最近30日中评数',
    evaluation_last_30d_bad_count bigint comment '最近30日差评数',
    evaluation_good_count bigint comment '累积好评数',
    evaluation_mid_count bigint comment '累积中评数',
    evaluation_bad_count bigint comment '累积差评数'
)
COMMENT '商品主题宽表'
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--插入数据
-- 首次执行,需要计算总累计值
insert into yp_dm.dm_sku
with all_count as (
    select
        sum(order_count) as order_count,
        sum(order_num) as order_num,
        sum(order_amount) as order_amount,
        sum(payment_count) payment_count,
        sum(payment_num) payment_num,
        sum(payment_amount) payment_amount,
        sum(refund_count) refund_count,
        sum(refund_num) refund_num,
        sum(refund_amount) refund_amount,
        sum(cart_count) cart_count,
        sum(cart_num) cart_num,
        sum(favor_count) favor_count,
        sum(evaluation_good_count)   evaluation_good_count,
        sum(evaluation_mid_count)    evaluation_mid_count,
        sum(evaluation_bad_count)    evaluation_bad_count,
       sku_id, sku_name
    from yp_dws.dws_sku_daycount
--     where order_count > 0
    group by sku_id, sku_name
),
last_30d as (
    select
        sum(order_count) order_last_30d_count,
        sum(order_num) order_last_30d_num,
        sum(order_amount) as order_last_30d_amount,

        sum(payment_count) payment_last_30d_count,
        sum(payment_num) payment_last_30d_num,
        sum(payment_amount) payment_last_30d_amount,

        sum(refund_count) refund_last_30d_count,
        sum(refund_num) refund_last_30d_num,
        sum(refund_amount) refund_last_30d_amount,

        sum(cart_count) cart_last_30d_count,
        sum(cart_num) cart_last_30d_num,

        sum(favor_count) favor_last_30d_count,

        sum(evaluation_good_count) evaluation_last_30d_good_count,
        sum(evaluation_mid_count)  evaluation_last_30d_mid_count,
        sum(evaluation_bad_count)  evaluation_last_30d_bad_count,

        sku_id, sku_name
    from yp_dws.dws_sku_daycount
    where dt>=cast(date_add('day', -30, date '2019-05-07') as varchar)
    group by sku_id, sku_name
)
select
    ac.sku_id,
    ac.sku_name,
    l30.order_last_30d_count,
    l30.order_last_30d_num,
    l30.order_last_30d_amount,
    ac.order_count,
    ac.order_num,
    ac.order_amount,
    l30.payment_last_30d_count,
    l30.payment_last_30d_num,
    l30.payment_last_30d_amount,
    ac.payment_count,
    ac.payment_num,
    ac.payment_amount,
    l30.refund_last_30d_count,
    l30.refund_last_30d_num,
    l30.refund_last_30d_amount,
    ac.refund_count,
    ac.refund_num,
    ac.refund_amount,
    l30.cart_last_30d_count,
    l30.cart_last_30d_num,
    ac.cart_count,
    ac.cart_num,
    l30.favor_last_30d_count,
    ac.favor_count,
    l30.evaluation_last_30d_good_count,
    l30.evaluation_last_30d_mid_count,
    l30.evaluation_last_30d_bad_count,
    ac.evaluation_good_count,
    ac.evaluation_mid_count,
    ac.evaluation_bad_count
from all_count ac
left join last_30d l30 on ac.sku_id=l30.sku_id;
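
首次加载完成后, 可以做一个简单的一致性抽查: 累积值理论上不应小于最近30日的值(假设性校验示例):

-- 假设性校验示例: 抽查累积下单次数小于最近30日下单次数的异常记录
select sku_id, sku_name, order_count, order_last_30d_count
from yp_dm.dm_sku
where order_count < order_last_30d_count
limit 10;
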

--每日循环执行
--1.重建临时表
drop table if exists yp_dm.dm_sku_tmp;
create table yp_dm.dm_sku_tmp
(

    sku_id string comment 'sku_id',
    sku_name string comment '商品名称',
    order_last_30d_count bigint comment '最近30日被下单次数',
    order_last_30d_num bigint comment '最近30日被下单件数',
    order_last_30d_amount decimal(38,2)  comment '最近30日被下单金额',
    order_count bigint comment '累积被下单次数',
    order_num bigint comment '累积被下单件数',
    order_amount decimal(38,2) comment '累积被下单金额',
    payment_last_30d_count   bigint  comment '最近30日被支付次数',
    payment_last_30d_num bigint comment '最近30日被支付件数',
    payment_last_30d_amount  decimal(38,2) comment '最近30日被支付金额',
    payment_count   bigint  comment '累积被支付次数',
    payment_num bigint comment '累积被支付件数',
    payment_amount  decimal(38,2) comment '累积被支付金额',
    refund_last_30d_count bigint comment '最近三十日退款次数',
    refund_last_30d_num bigint comment '最近三十日退款件数',
    refund_last_30d_amount decimal(38,2) comment '最近三十日退款金额',
    refund_count bigint comment '累积退款次数',
    refund_num bigint comment '累积退款件数',
    refund_amount decimal(38,2) comment '累积退款金额',
    cart_last_30d_count bigint comment '最近30日被加入购物车次数',
    cart_last_30d_num bigint comment '最近30日被加入购物车件数',
    cart_count bigint comment '累积被加入购物车次数',
    cart_num bigint comment '累积被加入购物车件数',
    favor_last_30d_count bigint comment '最近30日被收藏次数',
    favor_count bigint comment '累积被收藏次数',
    evaluation_last_30d_good_count bigint comment '最近30日好评数',
    evaluation_last_30d_mid_count bigint comment '最近30日中评数',
    evaluation_last_30d_bad_count bigint comment '最近30日差评数',
    evaluation_good_count bigint comment '累积好评数',
    evaluation_mid_count bigint comment '累积中评数',
    evaluation_bad_count bigint comment '累积差评数'
)
COMMENT '商品主题宽表'
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--2.合并新旧数据
insert into yp_dm.dm_sku_tmp
select
    coalesce(new.sku_id,old.sku_id) sku_id,
    coalesce(new.sku_name,old.sku_name) sku_name,
--        订单 30天数据
    coalesce(new.order_count30,0) order_last_30d_count,
    coalesce(new.order_num30,0) order_last_30d_num,
    coalesce(new.order_amount30,0) order_last_30d_amount,
--        订单 累积历史数据
    coalesce(old.order_count,0) + coalesce(new.order_count,0) order_count,
    coalesce(old.order_num,0) + coalesce(new.order_num,0) order_num,
    coalesce(old.order_amount,0) + coalesce(new.order_amount,0) order_amount,
--        支付单 30天数据
    coalesce(new.payment_count30,0) payment_last_30d_count,
    coalesce(new.payment_num30,0) payment_last_30d_num,
    coalesce(new.payment_amount30,0) payment_last_30d_amount,
--        支付单 累积历史数据
    coalesce(old.payment_count,0) + coalesce(new.payment_count,0) payment_count,
    coalesce(old.payment_num,0) + coalesce(new.payment_num,0) payment_num,
    coalesce(old.payment_amount,0) + coalesce(new.payment_amount,0) payment_amount,
--        退款单 30天数据
    coalesce(new.refund_count30,0) refund_last_30d_count,
    coalesce(new.refund_num30,0) refund_last_30d_num,
    coalesce(new.refund_amount30,0) refund_last_30d_amount,
--        退款单 累积历史数据
    coalesce(old.refund_count,0) + coalesce(new.refund_count,0) refund_count,
    coalesce(old.refund_num,0) + coalesce(new.refund_num,0) refund_num,
    coalesce(old.refund_amount,0) + coalesce(new.refund_amount,0) refund_amount,
--        购物车 30天数据
    coalesce(new.cart_count30,0) cart_last_30d_count,
    coalesce(new.cart_num30,0) cart_last_30d_num,
--        购物车 累积历史数据
    coalesce(old.cart_count,0) + coalesce(new.cart_count,0) cart_count,
    coalesce(old.cart_num,0) + coalesce(new.cart_num,0) cart_num,
--        收藏 30天数据
    coalesce(new.favor_count30,0) favor_last_30d_count,
--        收藏 累积历史数据
    coalesce(old.favor_count,0) + coalesce(new.favor_count,0) favor_count,
--        评论 30天数据
    coalesce(new.evaluation_good_count30,0) evaluation_last_30d_good_count,
    coalesce(new.evaluation_mid_count30,0) evaluation_last_30d_mid_count,
    coalesce(new.evaluation_bad_count30,0) evaluation_last_30d_bad_count,
--        评论 累积历史数据
    coalesce(old.evaluation_good_count,0) + coalesce(new.evaluation_good_count,0) evaluation_good_count,
    coalesce(old.evaluation_mid_count,0) + coalesce(new.evaluation_mid_count,0) evaluation_mid_count,
    coalesce(old.evaluation_bad_count,0) + coalesce(new.evaluation_bad_count,0) evaluation_bad_count
from
(
--     dm旧数据
    select
        sku_id, sku_name,
        order_last_30d_count,
        order_last_30d_num,
        order_last_30d_amount,
        order_count,
        order_num,
        order_amount  ,
        payment_last_30d_count,
        payment_last_30d_num,
        payment_last_30d_amount,
        payment_count,
        payment_num,
        payment_amount,
        refund_last_30d_count,
        refund_last_30d_num,
        refund_last_30d_amount,
        refund_count,
        refund_num,
        refund_amount,
        cart_last_30d_count,
        cart_last_30d_num,
        cart_count,
        cart_num,
        favor_last_30d_count,
        favor_count,
        evaluation_last_30d_good_count,
        evaluation_last_30d_mid_count,
        evaluation_last_30d_bad_count,
        evaluation_good_count,
        evaluation_mid_count,
        evaluation_bad_count
    from yp_dm.dm_sku
)old
full outer join
(
--     30天 和 昨天 的dws新数据
    select
        sku_id,
        sku_name,
        sum(if(dt='2019-05-07', order_count,0 )) order_count,
        sum(if(dt='2019-05-07',order_num ,0 ))  order_num,
        sum(if(dt='2019-05-07',order_amount,0 )) order_amount ,
        sum(if(dt='2019-05-07',payment_count,0 )) payment_count,
        sum(if(dt='2019-05-07',payment_num,0 )) payment_num,
        sum(if(dt='2019-05-07',payment_amount,0 )) payment_amount,
        sum(if(dt='2019-05-07',refund_count,0 )) refund_count,
        sum(if(dt='2019-05-07',refund_num,0 )) refund_num,
        sum(if(dt='2019-05-07',refund_amount,0 )) refund_amount,
        sum(if(dt='2019-05-07',cart_count,0 )) cart_count,
        sum(if(dt='2019-05-07',cart_num,0 )) cart_num,
        sum(if(dt='2019-05-07',favor_count,0 )) favor_count,
        sum(if(dt='2019-05-07',evaluation_good_count,0 )) evaluation_good_count,
        sum(if(dt='2019-05-07',evaluation_mid_count,0 ) ) evaluation_mid_count ,
        sum(if(dt='2019-05-07',evaluation_bad_count,0 )) evaluation_bad_count,
        sum(order_count) order_count30 ,
        sum(order_num) order_num30,
        sum(order_amount) order_amount30,
        sum(payment_count) payment_count30,
        sum(payment_num) payment_num30,
        sum(payment_amount) payment_amount30,
        sum(refund_count) refund_count30,
        sum(refund_num) refund_num30,
        sum(refund_amount) refund_amount30,
        sum(cart_count) cart_count30,
        sum(cart_num) cart_num30,
        sum(favor_count) favor_count30,
        sum(evaluation_good_count) evaluation_good_count30,
        sum(evaluation_mid_count) evaluation_mid_count30,
        sum(evaluation_bad_count) evaluation_bad_count30
    from yp_dws.dws_sku_daycount
    where dt >= cast(date_add('day', -30, date '2019-05-07') as varchar)
    group by sku_id, sku_name
)new
on new.sku_id = old.sku_id;

--3.临时表覆盖宽表
delete from yp_dm.dm_sku;
insert into yp_dm.dm_sku
select * from yp_dm.dm_sku_tmp;
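
每日合并的核心是 full outer join + coalesce: 新旧任意一侧缺失记录时都能保留, 累计值按 0 补齐后相加。下面是一个脱离业务表的极简示意(假设性小例子):

-- 假设性小例子: full outer join + coalesce 合并新旧累计值的通用写法
with old_data as (
    select 'sku_1' as sku_id, 10 as order_count
),
new_data as (
    select 'sku_1' as sku_id, 3 as order_count
    union all
    select 'sku_2' as sku_id, 5 as order_count
)
select
    coalesce(n.sku_id, o.sku_id) as sku_id,
    coalesce(o.order_count, 0) + coalesce(n.order_count, 0) as order_count   -- sku_1 -> 13, sku_2 -> 5
from old_data o
full outer join new_data n on o.sku_id = n.sku_id;
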
12.2.1 增量

12.3 用户主题宽表

--  用户主题宽表
drop table if exists yp_dm.dm_user;
create table yp_dm.dm_user
(
   date_time string COMMENT '统计日期',
    user_id string  comment '用户id',
--      登录
    login_date_first string  comment '首次登录时间',
    login_date_last string  comment '末次登录时间',
    login_count bigint comment '累积登录天数',
    login_last_30d_count bigint comment '最近30日登录天数',
    
--      store_collect_date_first bigint comment '首次店铺收藏时间',
--      store_collect_date_last bigint comment '末次店铺收藏时间',
--      store_collect_count bigint comment '累积店铺收藏数量',
--      store_collect_last_30d_count bigint comment '最近30日店铺收藏数量',
-- 
--      goods_collect_date_first bigint comment '首次商品收藏时间',
--      goods_collect_date_last bigint comment '末次商品收藏时间',
--      goods_collect_count bigint comment '累积商品收藏数量',
--      goods_collect_last_30d_count bigint comment '最近30日商品收藏数量',

    -- 购物车
    cart_date_first string comment '首次加入购物车时间',
    cart_date_last string comment '末次加入购物车时间',
    cart_count bigint comment '累积加入购物车次数',
    cart_amount decimal(38,2) comment '累积加入购物车金额',
    cart_last_30d_count bigint comment '最近30日加入购物车次数',
    cart_last_30d_amount decimal(38,2) comment '最近30日加入购物车金额',
   -- 订单
    order_date_first string  comment '首次下单时间',
    order_date_last string  comment '末次下单时间',
    order_count bigint comment '累积下单次数',
    order_amount decimal(38,2) comment '累积下单金额',
    order_last_30d_count bigint comment '最近30日下单次数',
    order_last_30d_amount decimal(38,2) comment '最近30日下单金额',
   -- 支付
    payment_date_first string  comment '首次支付时间',
    payment_date_last string  comment '末次支付时间',
    payment_count bigint comment '累积支付次数',
    payment_amount decimal(38,2) comment '累积支付金额',
    payment_last_30d_count bigint comment '最近30日支付次数',
    payment_last_30d_amount decimal(38,2) comment '最近30日支付金额'
)
COMMENT '用户主题宽表'
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--插入数据
--首次执行,计算总累计值
insert into yp_dm.dm_user
with login_count as (
    select
        min(dt) as login_date_first,
        max (dt) as login_date_last,
        sum(if(login_count>0,1,0)) as login_count,
       user_id
    from yp_dws.dws_user_daycount
    where login_count > 0
    group by user_id
),
cart_count as (
    select
        min(dt) as cart_date_first,
        max(dt) as cart_date_last,
        sum(cart_count) as cart_count,
        sum(cart_amount) as cart_amount,
       user_id
    from yp_dws.dws_user_daycount
    where cart_count > 0
    group by user_id
),
order_count as (
    select
        min(dt) as order_date_first,
        max(dt) as order_date_last,
        sum(order_count) as order_count,
        sum(order_amount) as order_amount,
       user_id
    from yp_dws.dws_user_daycount
    where order_count > 0
    group by user_id
),
payment_count as (
    select
        min(dt) as payment_date_first,
        max(dt) as payment_date_last,
        sum(payment_count) as payment_count,
        sum(payment_amount) as payment_amount,
       user_id
    from yp_dws.dws_user_daycount
    where payment_count > 0
    group by user_id
),
last_30d as (
    select
        user_id,
        sum(if(login_count>0,1,0)) login_last_30d_count,
        sum(cart_count) cart_last_30d_count,
        sum(cart_amount) cart_last_30d_amount,
        sum(order_count) order_last_30d_count,
        sum(order_amount) order_last_30d_amount,
        sum(payment_count) payment_last_30d_count,
        sum(payment_amount) payment_last_30d_amount
    from yp_dws.dws_user_daycount
    where dt>=cast(date_add('day', -30, date '2019-05-07') as varchar)
    group by user_id
)
select
    '2019-05-07' date_time,
    last30.user_id,
--    登录
    l.login_date_first,
    l.login_date_last,
    l.login_count,
    last30.login_last_30d_count,
--    购物车
    cc.cart_date_first,
    cc.cart_date_last,
    cc.cart_count,
    cc.cart_amount,
    last30.cart_last_30d_count,
    last30.cart_last_30d_amount,
--    订单
    o.order_date_first,
    o.order_date_last,
    o.order_count,
    o.order_amount,
    last30.order_last_30d_count,
    last30.order_last_30d_amount,
--    支付
    p.payment_date_first,
    p.payment_date_last,
    p.payment_count,
    p.payment_amount,
    last30.payment_last_30d_count,
    last30.payment_last_30d_amount
from last_30d last30
left join login_count l on last30.user_id=l.user_id
left join order_count o on last30.user_id=o.user_id
left join payment_count p on last30.user_id=p.user_id
left join cart_count cc on last30.user_id=cc.user_id
;
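
首次加载以 last_30d 为主表做 left join, 因此进入宽表的是最近30天有行为的用户。可以用下面的查询对比宽表用户数与 DWS 层最近30天的活跃用户数(假设性校验示例):

-- 假设性校验示例: 对比 dm_user 用户数与 DWS 层最近30天活跃用户数
select
    (select count(distinct user_id) from yp_dm.dm_user) as dm_user_cnt,
    (select count(distinct user_id)
       from yp_dws.dws_user_daycount
      where dt >= cast(date_add('day', -30, date '2019-05-07') as varchar)) as dws_active_user_cnt;
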

--每日循环执行
--1.建立临时表(Hive)
drop table if exists yp_dm.dm_user_tmp;
create table yp_dm.dm_user_tmp
(
   date_time string COMMENT '统计日期',
    user_id string  comment '用户id',
--    登录
    login_date_first string  comment '首次登录时间',
    login_date_last string  comment '末次登录时间',
    login_count bigint comment '累积登录天数',
    login_last_30d_count bigint comment '最近30日登录天数',

    --购物车
    cart_date_first string comment '首次加入购物车时间',
    cart_date_last string comment '末次加入购物车时间',
    cart_count bigint comment '累积加入购物车次数',
    cart_amount decimal(38,2) comment '累积加入购物车金额',
    cart_last_30d_count bigint comment '最近30日加入购物车次数',
    cart_last_30d_amount decimal(38,2) comment '最近30日加入购物车金额',
   --订单
    order_date_first string  comment '首次下单时间',
    order_date_last string  comment '末次下单时间',
    order_count bigint comment '累积下单次数',
    order_amount decimal(38,2) comment '累积下单金额',
    order_last_30d_count bigint comment '最近30日下单次数',
    order_last_30d_amount decimal(38,2) comment '最近30日下单金额',
   --支付
    payment_date_first string  comment '首次支付时间',
    payment_date_last string  comment '末次支付时间',
    payment_count bigint comment '累积支付次数',
    payment_amount decimal(38,2) comment '累积支付金额',
    payment_last_30d_count bigint comment '最近30日支付次数',
    payment_last_30d_amount decimal(38,2) comment '最近30日支付金额'
)
COMMENT '用户主题宽表'
ROW format delimited fields terminated BY '\t'
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--2.合并新旧数据
insert into yp_dm.dm_user_tmp
select
    '2019-05-08' date_time,
    coalesce(new.user_id,old.user_id) user_id,
--     登录
    if(old.login_date_first is null and new.login_count>0,'2019-05-08',old.login_date_first) login_date_first,
    if(new.login_count>0,'2019-05-08',old.login_date_last) login_date_last,
    coalesce(old.login_count,0)+if(new.login_count>0,1,0) login_count,
    coalesce(new.login_last_30d_count,0) login_last_30d_count,
--     购物车
    if(old.cart_date_first is null and new.cart_count>0,'2019-05-08',old.cart_date_first) cart_date_first,
    if(new.cart_count>0,'2019-05-08',old.cart_date_last) cart_date_last,
    coalesce(old.cart_count,0)+coalesce(new.cart_count,0) cart_count,
    coalesce(old.cart_amount,0)+coalesce(new.cart_amount,0) cart_amount,
    coalesce(new.cart_last_30d_count,0) cart_last_30d_count,
    coalesce(new.cart_last_30d_amount,0) cart_last_30d_amount,
--     订单
    if(old.order_date_first is null and new.order_count>0,'2019-05-08',old.order_date_first) order_date_first,
    if(new.order_count>0,'2019-05-08',old.order_date_last) order_date_last,
    coalesce(old.order_count,0)+coalesce(new.order_count,0) order_count,
    coalesce(old.order_amount,0)+coalesce(new.order_amount,0) order_amount,
    coalesce(new.order_last_30d_count,0) order_last_30d_count,
    coalesce(new.order_last_30d_amount,0) order_last_30d_amount,
--     支付
    if(old.payment_date_first is null and new.payment_count>0,'2019-05-08',old.payment_date_first) payment_date_first,
    if(new.payment_count>0,'2019-05-08',old.payment_date_last) payment_date_last,
    coalesce(old.payment_count,0)+coalesce(new.payment_count,0) payment_count,
    coalesce(old.payment_amount,0)+coalesce(new.payment_amount,0) payment_amount,
    coalesce(new.payment_last_30d_count,0) payment_last_30d_count,
    coalesce(new.payment_last_30d_amount,0) payment_last_30d_amount
from
(
    select * from yp_dm.dm_user
    where date_time=cast((date '2019-05-08' - interval '1' day) as varchar)
) old
full outer join
(
    select
        user_id,
--         登录次数
        sum(if(dt='2019-05-08',login_count,0)) login_count,
--         收藏
        sum(if(dt='2019-05-08',store_collect_count,0)) store_collect_count,
        sum(if(dt='2019-05-08',goods_collect_count,0)) goods_collect_count,
--         购物车
        sum(if(dt='2019-05-08',cart_count,0)) cart_count,
        sum(if(dt='2019-05-08',cart_amount,0)) cart_amount,
--         订单
        sum(if(dt='2019-05-08',order_count,0)) order_count,
        sum(if(dt='2019-05-08',order_amount,0)) order_amount,
--         支付
        sum(if(dt='2019-05-08',payment_count,0)) payment_count,
        sum(if(dt='2019-05-08',payment_amount,0)) payment_amount,
--         30天
        sum(if(login_count>0,1,0)) login_last_30d_count,
        sum(store_collect_count) store_collect_last_30d_count,
        sum(goods_collect_count) goods_collect_last_30d_count,
        sum(cart_count) cart_last_30d_count,
        sum(cart_amount) cart_last_30d_amount,
        sum(order_count) order_last_30d_count,
        sum(order_amount) order_last_30d_amount,
        sum(payment_count) payment_last_30d_count,
        sum(payment_amount) payment_last_30d_amount
    from yp_dws.dws_user_daycount
    where dt>=cast(date_add('day', -30, date '2019-05-08') as varchar)
    group by user_id
) new
on old.user_id=new.user_id;

--3.临时表覆盖宽表
delete from yp_dm.dm_user;
insert into yp_dm.dm_user
select * from yp_dm.dm_user_tmp;
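
每日合并时, 首次/末次时间的更新依赖两个 if 判断: 旧值为空且当天有行为则记为首次, 当天有行为则刷新末次。下面用一个脱离业务表的小例子说明(假设性示意):

-- 假设性示意: 首次/末次登录时间的更新逻辑
with old_data as (
    select cast(null as varchar) as login_date_first,
           '2019-05-01' as login_date_last,
           5 as login_count
),
new_data as (
    select 1 as login_count
)
select
    if(o.login_date_first is null and n.login_count > 0, '2019-05-08', o.login_date_first) as login_date_first, -- 结果: 2019-05-08
    if(n.login_count > 0, '2019-05-08', o.login_date_last) as login_date_last                                   -- 结果: 2019-05-08
from old_data o
cross join new_data n;
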

12.4 增量数据

13. RPT

图表需求一:按月统计,各个门店的月销售单量。

--门店月销售单量排行
DROP TABLE IF EXISTS yp_rpt.rpt_sale_store_cnt_month;
CREATE TABLE yp_rpt.rpt_sale_store_cnt_month(
   date_time string COMMENT '统计日期,不能用来分组统计',
   year_code string COMMENT '年code',
   year_month string COMMENT '年月',
   
   city_id string COMMENT '城市id',
   city_name string COMMENT '城市name',
   trade_area_id string COMMENT '商圈id',
   trade_area_name string COMMENT '商圈名称',
   store_id string COMMENT '店铺的id',
   store_name string COMMENT '店铺名称',
   
   order_store_cnt BIGINT COMMENT '店铺成交单量',
   miniapp_order_store_cnt BIGINT COMMENT '店铺小程序成交单量',
   android_order_store_cnt BIGINT COMMENT '店铺安卓APP成交单量',
   ios_order_store_cnt BIGINT COMMENT '店铺苹果APP成交单量',
   pcweb_order_store_cnt BIGINT COMMENT '店铺PC商城成交单量'
)
COMMENT '门店月销售单量排行' 
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--日销售曲线
DROP TABLE IF EXISTS yp_rpt.rpt_sale_day;
CREATE TABLE yp_rpt.rpt_sale_day(
   date_time string COMMENT '统计日期,不能用来分组统计',
   year_code string COMMENT '年code',
   month_code string COMMENT '月份编码', 
   day_month_num string COMMENT '一月第几天', 
   dim_date_id string COMMENT '日期',

   sale_amt DECIMAL(38,2) COMMENT '销售收入',
   order_cnt BIGINT COMMENT '成交单量'
)
COMMENT '日销售曲线' 
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--渠道销量占比
DROP TABLE IF EXISTS yp_rpt.rpt_sale_fromtype_ratio;
CREATE TABLE yp_rpt.rpt_sale_fromtype_ratio(
   date_time string COMMENT '统计日期,不能用来分组统计',
   time_type string COMMENT '统计时间维度:year、month、day',
   year_code string COMMENT '年code',
   year_month string COMMENT '年月',
   dim_date_id string COMMENT '日期',
   
   order_cnt BIGINT COMMENT '成交单量',
   miniapp_order_cnt BIGINT COMMENT '小程序成交单量',
   miniapp_order_ratio DECIMAL(5,2) COMMENT '小程序成交量占比',
   android_order_cnt BIGINT COMMENT '安卓APP订单量',
   android_order_ratio DECIMAL(5,2) COMMENT '安卓APP订单量占比',
   ios_order_cnt BIGINT COMMENT '苹果APP订单量',
   ios_order_ratio DECIMAL(5,2) COMMENT '苹果APP订单量占比',
   pcweb_order_cnt BIGINT COMMENT 'PC商城成交单量',
   pcweb_order_ratio DECIMAL(5,2) COMMENT 'PC商城成交单量占比'
)
COMMENT '渠道销量占比' 
ROW format delimited fields terminated BY '\t' 
stored AS orc tblproperties ('orc.compress' = 'SNAPPY');

--插入数据
--门店月销售单量排行
insert into yp_rpt.rpt_sale_store_cnt_month
select 
   date_time,
   year_code,
   year_month,
   city_id,
   city_name,
   trade_area_id,
   trade_area_name,
   store_id,
   store_name,
   order_cnt,
   miniapp_order_cnt,
   android_order_cnt,
   ios_order_cnt,
   pcweb_order_cnt
from yp_dm.dm_sale 
where time_type ='month' and group_type='store' and store_id is not null 
order by order_cnt desc;

--日销售曲线
insert into yp_rpt.rpt_sale_day
select 
   date_time,
   year_code,
   month_code,
   day_month_num,
   dim_date_id,
   sale_amt,
   order_cnt
from yp_dm.dm_sale 
where time_type ='date' and group_type='all'
--按照日期排序显示曲线
order by dim_date_id;

--渠道销量占比
insert into yp_rpt.rpt_sale_fromtype_ratio
select 
   date_time,
   time_type,
   year_code,
   year_month,
   dim_date_id,
   
   order_cnt,
    miniapp_order_cnt,
    cast(cast(miniapp_order_cnt as DECIMAL(38,4)) / order_cnt * 100 as DECIMAL(5,2)) as miniapp_order_ratio,
    android_order_cnt,
    cast(cast(android_order_cnt as DECIMAL(38,4)) / order_cnt * 100 as DECIMAL(5,2)) as android_order_ratio,
    ios_order_cnt,
    cast(cast(ios_order_cnt as DECIMAL(38,4)) / order_cnt * 100 as DECIMAL(5,2)) as ios_order_ratio,
    pcweb_order_cnt,
    cast(cast(pcweb_order_cnt as DECIMAL(38,4)) / order_cnt * 100 as DECIMAL(5,2)) as pcweb_order_ratio
from yp_dm.dm_sale
where group_type = 'all';
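
上面的占比计算直接除以 order_cnt, 如果某些分组的成交单量为 0 会触发除零错误。可以(假设性写法)用 nullif 做保护:

-- 假设性写法: 防止 order_cnt 为 0 时除零
select
    order_cnt,
    cast(cast(miniapp_order_cnt as DECIMAL(38,4)) / nullif(order_cnt, 0) * 100 as DECIMAL(5,2)) as miniapp_order_ratio  -- order_cnt 为 0 时结果为 null
from yp_dm.dm_sale
where group_type = 'all'
limit 10;
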

需求二:统计出总退款率最高的Top10商品。

--创建表
CREATE TABLE IF NOT EXISTS yp_rpt.rpt_goods_refund_topN(
    sku_id STRING COMMENT '商品ID',
    sku_name string COMMENT '商品名称',
    refund_radio DECIMAL(38,2) COMMENT '最近30天的退款率'
)
comment '各个商品TOP100退款率'
PARTITIONED BY (date_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS ORC TBLPROPERTIES('orc.compress'='SNAPPY');

--插入数据

-- 需求二:统计出总退款率最高的Top10商品。
-- 退款率: 退款的订单数量 / 总的支付的数量=退款率
insert into hive.yp_rpt.rpt_goods_refund_topn
select sku_id,
       sku_name,
       --最近30天的累积退款率
       if(
            payment_last_30d_count > 0,
            cast(refund_last_30d_count as decimal(38,2)) / cast(payment_last_30d_count as decimal(38,2)),
            0
       ) as refund_radio,
    '2024-08-12' as date_time
from yp_dm.dm_sku where payment_last_30d_count >= refund_last_30d_count
order by refund_radio desc limit 100;

select * from yp_rpt.rpt_goods_refund_topn;
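
需求描述为 Top10, 而上面的脚本写的是 limit 100(与表注释中的 TOP100 对应)。如果确实只取前 10 名, 也可以(假设性写法)用窗口函数保留并列名次:

-- 假设性写法: 取退款率前 10 名, 并列名次一并保留
select sku_id, sku_name, refund_radio
from (
    select sku_id, sku_name, refund_radio,
           rank() over (order by refund_radio desc) as rk
    from yp_rpt.rpt_goods_refund_topn
) t
where rk <= 10;
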

14. 数据展示

14.1 数据导出

--需求1
drop table if exists mysql.yp_olap.rpt_sale_store_cnt_month;
CREATE TABLE mysql.yp_olap.rpt_sale_store_cnt_month(
  date_time varchar,
  year_code varchar COMMENT '年code',
  year_month varchar COMMENT '年月',
  city_id varchar COMMENT '城市id',
  city_name varchar COMMENT '城市name',
  trade_area_id varchar COMMENT '商圈id',
  trade_area_name varchar COMMENT '商圈名称',
  store_id varchar COMMENT '店铺的id',
  store_name varchar COMMENT '店铺名称',
  order_store_cnt BIGINT COMMENT '店铺销售单量',
  miniapp_order_store_cnt BIGINT COMMENT '店铺小程序销售单量',
  android_order_store_cnt BIGINT COMMENT '店铺android销售单量',
  ios_order_store_cnt BIGINT COMMENT '店铺ios销售单量',
  pcweb_order_store_cnt BIGINT COMMENT '店铺pcweb销售单量',
  sale_amt DECIMAL(38,4) COMMENT '销售收入',
  mini_app_sale_amt DECIMAL(38,4) COMMENT '小程序成交额',
  android_sale_amt DECIMAL(38,4) COMMENT '安卓APP成交额',
  ios_sale_amt DECIMAL(38,4) COMMENT '苹果APP成交额',
  pcweb_sale_amt DECIMAL(38,4) COMMENT 'PC商城成交额'
)
COMMENT '门店月销售单量排行' ;

-- 插入数据
insert into mysql.yp_olap.rpt_sale_store_cnt_month
select *
from hive.yp_rpt.rpt_sale_store_cnt_month;
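
注意: 上面的 MySQL 目标表比 Hive 端的 rpt 表多出 5 个销售额字段, select * 可能会因为列数不一致而报错。更稳妥的做法(假设性示例)是显式列出两侧一致的列:

-- 假设性示例: 显式列出列清单, 避免 select * 因列数/列序不一致报错
insert into mysql.yp_olap.rpt_sale_store_cnt_month
    (date_time, year_code, year_month, city_id, city_name,
     trade_area_id, trade_area_name, store_id, store_name,
     order_store_cnt, miniapp_order_store_cnt, android_order_store_cnt,
     ios_order_store_cnt, pcweb_order_store_cnt)
select date_time, year_code, year_month, city_id, city_name,
       trade_area_id, trade_area_name, store_id, store_name,
       order_store_cnt, miniapp_order_store_cnt, android_order_store_cnt,
       ios_order_store_cnt, pcweb_order_store_cnt
from hive.yp_rpt.rpt_sale_store_cnt_month;
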

--需求2

14.2 数据可视化

三、总结

项目总结说明
项目介绍
项目名称: 亿品新零售大数据分析项目
行业: 零售行业
行业背景说明
百货商店: 纯线下销售模式
超级市场: 纯线下销售模式
连锁超市: 纯线下销售模式
电子商务: 纯线上销售模式
新零售: 线上 + 线下 + 现代物流
目前: 各个零售产业都在向 线上 + 线下 + 物流 的方式转型
为什么产生了项目:
行业业务介绍
商品上架流程
单店铺订单流程
多店铺订单流程
配送流程
退货流程
项目架构选型
基于cloudera manager 平台构建的大数据分析平台: 搭建有HDFS, YARN , HIVE , OOZIE, HUE , SQOOP .....
项目数据流转流程:
项目基本情况说明
项目的周期
阶段一
基本数仓平台构建
完成数据采集导入操作 (某个业务模块的数据) --- 订单模块
完成某个主题的开发操作: 销售主题
阶段二
1- 服务器扩容
2- 完成其他业务模块数据导入操作: 店铺模块. 商品模块, 用户模块
3-完成各项主题的统计: 销售主题, 商品主题, 用户主题开发
阶段三: 不太清楚, 因为当时直接离开了, 去往其他项目组; 后续听同事说还有配送主题、平台主题指标 ...
人员配置:
阶段一人员配置:
阶段二人员配置:
项目数据体量(永辉): 35TB 冗余存储105T
商品表: 20w左右的商品体量
订单量: 单店一天的订单量大约3000左右, 周六日在1w左右; 永辉在北京50个店, 全国店铺有多少, 8年数据量
会员量: 几十万左右
集群环境规模:
机器共计 50台
各个机器分配: 可以参考第一天的集群规模图
学习用的项目环境构建: 项目云平台账号将于1个月后删除
数仓相关内容
数仓基本概念
什么是数仓
数仓的特点
数仓的四大特征
什么是ETL
数据仓库和数据集市区别
维度分析
维度
什么是维度
维度的分类
定性维度
定量维度
分层分级
下钻 和 上卷
指标
什么是指标
指标分类
绝对指标
相对指标
数仓建模
常见数仓建模的模型
三范式建模
维度建模
维度建模的两种表
事实表
如何区分事实表
事实表的分类
事务事实表
周期快照事实表
累计快照事实表
维度表
如何区分维度表
维度表分类
高基数维度表
低基数的维度表
数仓发展的模型
星型模型
特点
反映数仓发展阶段: 初期
雪花模型
特点
反映数仓发展阶段: 走向畸形的阶段, 不建议出现, 尽量少用
星座模型
特点
反映数仓发展阶段: 中后期
缓慢渐变维:
缓慢渐变维: 为了解决历史数据是否需要存储的问题
SCD1
特点是什么
SCD2: 拉链表
特点是什么
SCD3:
特点是什么
数仓分层架构
通用分层
ODS
DW
DA|ADS|APP|RPT
项目分层
ODS
DW
DWD
DWB
DWS
DM
RPT
数仓相关工具的使用
如何使用HUE
如何使用sqoop
如何使用oozie
数仓开发操作
构建集团数据中心
ODS层:
全量
增量
基于 sqoop 完成数据采集工作
需要注意事项:
1- 在构建ODS层表的时候, 需要考虑哪些问题:
2- 在进行增量导入的时候: 不同的同步方式, 增量SQL也是不同的
数据同步方式:
全量覆盖同步
仅新增同步
新增及更新同步
全量同步
hive的基础调优(了解即可, 在构建环境的时候, 基本配置完成了)
hdfs的副本条件
yarn的内存和cpu的条件
MR的内存的调整
hive的线程的调整
hive的压缩的配置调整
DWD层:
此层主要是进行清洗转换处理, 以及拉链数据实现操作
清洗转换
清洗:
对一些结果不完整的数据进行过滤
对一些已经标记删除的数据进行过滤
对一些无用字段的清除
转换
对字段进行拉宽操作: 比如 日期拉宽
对一些状态码转换为能够直接识别的内容
如果是Json数据, 会进行Json拉平 (拉宽操作)
......
如何实现拉链表
left join
union all
DWB层
此层主要是进行基于业务形成业务宽表过程
一共有哪些业务模块
订单模块
商品模块
店铺模块
多表优化方案
map join
bucket map join
SMB Map Join
索引优化方案
join的数据倾斜的解决方案:
针对ORC优化方案
列值裁剪
批量读取
索引优化
基于主题构建各个数据集市
DWS层:
此层主要基于主题, 进行提前聚合统计操作
当前项目 由于最细粒度为天, 此层主要是基于天 形成 日统计宽表
相关的主题:
销售主题
商品主题
用户主题
join和group by 数据倾斜
针对ORC优化方案
列值裁剪
批量读取
索引优化
讲解presto提供grouping sets
多表进行full join 特殊点
DM层
上卷统计操作
上卷统计时 grouping 所存在的问题: 需要按最细粒度判断各个分组情况
RPT层
大致描述做了哪些图表指标即可
描述基于presto完成直接数据导出操作
数据展示: 大致简单说明, 基于 Spring Boot + Vue 构建前后端分离的图表项目
各个层次增量
都是基于shell脚本实现, 最后通过oozie完成了整体调度操作

四、相关面试题目

  • 1- 请简单介绍一下最近做的一个项目 (5分钟左右)
  体现到以下这么几个点:  
         项目的基本介绍(项目背景、主要目的是什么), 项目的架构, 项目的流转流程, 我主要负责了哪部分内容
         
         
 项目架构 :
     基于cloudera manager构建的大数据分析平台, 在此平台上, 搭建有 zookeeper, HDFS, YARN, HIVE, OOZIE, SQOOP
     同时还使用presto加快hive数据分析操作
 
 项目的流转流程:
     通过sqoop将mysql中数据导入到hive中, 在hive中对数据进行清洗转换处理工作, 构建为集团数据中心; 将处理后的数据对接presto, 进行数据分析操作, 将分析后的结果通过presto导出到mysql, 最终通过图表进行数据展示操作; 整个统计过程周而复始、持续运行, 所以加入oozie完成定时化自动调度工作
     
     
 
 负责点:   只需要描述出在这一部分做的一些大致事情即可, 不需要详细描述具体流程 (主要将自己熟悉点,说出来, 便于面试官去问这一部分内容)
     先说整个项目大致分为两个部分, 一个是构建集团数据中心, 一个是基于主题构建各个数据集市
     
     在集团数据中心部分: 选择其中一个或两个业务模块讲解
     
     在数据集市部分: 选择一个或两个主题模块讲解
  • 2- 面试官会结合回答 挑选一些本人负责点, 来进行深入问答 : 建议写成话术
 重点关注: 
         DWD层:
         DWB层:
         DWS层: 
     
     在讲述的时候, 一定要紧密贴合项目, 具体描述出项目整个实施过程, 万万不可描述得非常空洞, 只停留在层次作用的介绍
     
     一般会问一到二个层次
  • 3- 在整个项目实施过程, 有没有遇到过问题 (你认为整个项目中最有难度点, 闪光点在哪里 ?)
 此部分, 在回答的时候, 一定要往特殊的问题上说, 万万不能说一些语法性错误、表选择性错误等比较低级的错误
     
     提供可参考问题点: 
             1- DWB 或者 DWS层的 Join的数据倾斜
             2- DWS层 group by  数据倾斜
             3- DWB层 Join 优化方案
             4- DWS层 多表进行full join情况
             5- DWS层  当需要多表进行多次分组情况
  • 4- 项目的基本情况问题:
     目的是考验是否真实做过这个项目

本文转载自: https://blog.csdn.net/u014142328/article/details/140867550
版权归原作者 叫我王富贵i 所有, 如有侵权,请联系我们删除。
