Starting from Scratch: A Magical Journey into Big Data with Hudi

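Preface

As big data grows rapidly, enterprises face mounting challenges in processing and storing data. Apache Hudi (Hadoop Upserts Deletes and Incrementals) is a modern big data framework that addresses these challenges by providing an efficient data lake solution. This article introduces Hudi's basic concepts, core features, and use cases, and then walks through building Hudi from source.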

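What is Hudi?

Hudi is an open-source distributed data lake framework for managing and storing large-scale datasets. It lets users apply incremental updates and deletes to data in the lake and query it efficiently, while keeping good read and write performance. Its design goal is to simplify data management so that enterprises can ingest and process data faster.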

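1.1 Hudi Features

Incremental updates and deletes: Hudi supports incremental update and delete operations on data, so changes can be applied without rewriting the entire dataset.
Efficient reads: Hudi stores data in columnar file formats such as Parquet and adds its own indexing mechanism on top, which speeds up reads. Queries can quickly fetch the latest version of the data, improving analysis efficiency.
Transaction support: Hudi supports ACID (atomicity, consistency, isolation, durability) transactions, preserving data integrity under concurrent operations. This matters most when several data processing jobs run at the same time.
Integration and compatibility: Hudi integrates with many tools in the big data ecosystem (such as Apache Spark, Apache Hive, and Presto), so users can keep their existing technology stack and reduce the learning cost.
Time travel: Hudi provides time travel, which lets users query historical versions of the data. This is useful for data audits and retrospective analysis.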

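1.2 Use Cases

Data lake management: Enterprises can use Hudi to manage the data in their data lake, with efficient updates and queries.

Real-time data processing: Hudi is well suited to real-time data streams; enterprises can write streaming data into the data lake quickly and analyze it incrementally.

Data governance and compliance: With Hudi's time travel, enterprises can version their data, which supports compliance and traceability.

Data analysis and mining: Hudi's optimized read and write performance lets data scientists and analysts analyze and mine data faster.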

Setting Up Hudi

The component versions used in this tutorial are as follows:
Framework    Version
Hadoop       3.1.3
Hive         3.1.2
Spark        3.1.1, Scala 2.12
Flink        1.13, Scala 2.12

2.1 Installing Maven

(1) Upload apache-maven-3.6.1-bin.tar.gz to the /opt/software directory, then extract and rename it

tar -zxvf /opt/software/apache-maven-3.6.1-bin.tar.gz -C /opt/module/

mv /opt/module/apache-maven-3.6.1 /opt/module/maven-3.6.1

(2) Add the environment variables to /etc/profile

sudo vim /etc/profile

#MAVEN_HOME
export MAVEN_HOME=/opt/module/maven-3.6.1
export PATH=$PATH:$MAVEN_HOME/bin

(3) Verify the installation

source /etc/profile

mvn -v
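
If mvn is not found after sourcing the profile, it can help to check whether the Maven bin directory really ended up on PATH. The following is a minimal sketch of such a check; the helper name path_contains is made up for illustration and is not part of Maven.

```shell
# Hypothetical helper: succeed if directory $1 is one of the entries
# of the colon-separated, PATH-style list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether $MAVEN_HOME/bin is on the current PATH.
if [ -n "${MAVEN_HOME:-}" ] && path_contains "${MAVEN_HOME}/bin" "${PATH}"; then
  echo "Maven bin directory is on PATH"
else
  echo "Maven bin directory is NOT on PATH; re-check /etc/profile"
fi
```

Wrapping both sides in colons makes the match exact, so /opt/module/maven-3.6.1/bin is not confused with a longer path that merely starts with the same prefix.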

2) Switch to the Aliyun mirror
(1) Edit settings.xml and point it at the Aliyun repository

vim /opt/module/maven-3.6.1/conf/settings.xml

<!-- Add the Aliyun mirror -->
<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>central</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

2.2 Building Hudi

2.2.1 Upload the source package
Download it from GitHub: https://github.com/apache/hudi/. Upload hudi-0.12.0.src.tgz to /opt/software and extract it

tar -zxvf /opt/software/hudi-0.12.0.src.tgz -C /opt/software

2.2.2 Modify the pom file

vim /opt/software/hudi-0.12.0/pom.xml

1) Add a repository to speed up dependency downloads

<repository>
  <id>nexus-aliyun</id>
  <name>nexus-aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

2) Change the versions of the dependent components

<hadoop.version>3.1.3</hadoop.version>
<hive.version>3.1.2</hive.version>
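
Before starting a long build, it can be worth confirming that the two version properties were really changed. The sketch below reads a property value back out of the pom with sed; the helper name pom_property is hypothetical, and it only prints the first match, so it assumes the property you care about is the first occurrence in the file.

```shell
# Hypothetical helper: print the text of the first <NAME>...</NAME> element.
# $1 = property name (e.g. hadoop.version), $2 = path to a pom.xml
pom_property() {
  sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p" "$2" | head -n 1
}

# Show the values in the pom edited above, if it exists on this machine.
POM=/opt/software/hudi-0.12.0/pom.xml
if [ -f "$POM" ]; then
  echo "hadoop.version = $(pom_property hadoop.version "$POM")"
  echo "hive.version   = $(pom_property hive.version "$POM")"
fi
```

For the edits in this step, the two lines should report 3.1.3 and 3.1.2 respectively.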

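2.2.2 Modify the source code for Hadoop 3.x compatibility
Hudi depends on Hadoop 2.x by default. To make it compatible with Hadoop 3.x, changing the version is not enough; the following code must also be modified: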

vim /opt/software/hudi-0.12.0/hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieParquetDataBlock.java

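On line 110, the call originally takes a single argument; add null as the second argument.
Without this change, the build fails because of a compatibility problem between Hadoop 2.x and 3.x.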

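2.2.3 Manually install the Kafka dependencies
Several Kafka dependencies have to be installed manually; otherwise the build fails.

1) Download the jars
Download from: http://packages.confluent.io/archive/5.3/confluent-5.3.4-2.12.zip
After unzipping, locate the following jars and upload them to the server hadoop102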

common-config-5.3.4.jar
common-utils-5.3.4.jar
kafka-avro-serializer-5.3.4.jar
kafka-schema-registry-client-5.3.4.jar

2) Install them into the local Maven repository

mvn install:install-file -DgroupId=io.confluent -DartifactId=common-config -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-config-5.3.4.jar

mvn install:install-file -DgroupId=io.confluent -DartifactId=common-utils -Dversion=5.3.4 -Dpackaging=jar -Dfile=./common-utils-5.3.4.jar

mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-avro-serializer -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-avro-serializer-5.3.4.jar

mvn install:install-file -DgroupId=io.confluent -DartifactId=kafka-schema-registry-client -Dversion=5.3.4 -Dpackaging=jar -Dfile=./kafka-schema-registry-client-5.3.4.jar
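
The four commands above differ only in the artifact name, so they can also be generated in a loop. The following is a sketch, not part of the original tutorial: the function name confluent_install_cmds is made up, and it only prints the commands so they can be reviewed before being run.

```shell
# Version and artifact IDs are taken from the jar names listed above.
CONFLUENT_VERSION=5.3.4
CONFLUENT_ARTIFACTS="common-config common-utils kafka-avro-serializer kafka-schema-registry-client"

# Hypothetical helper: print one "mvn install:install-file" command per artifact.
# $1 = directory holding the jars (defaults to the current directory).
confluent_install_cmds() {
  jar_dir="${1:-.}"
  for artifact in $CONFLUENT_ARTIFACTS; do
    echo "mvn install:install-file -DgroupId=io.confluent -DartifactId=${artifact} -Dversion=${CONFLUENT_VERSION} -Dpackaging=jar -Dfile=${jar_dir}/${artifact}-${CONFLUENT_VERSION}.jar"
  done
}

# Dry run: print the commands for jars in the current directory.
confluent_install_cmds .
# To execute them after reviewing: confluent_install_cmds . | sh
```

Printing first and piping to sh only afterwards keeps the sketch safe to run on a machine where the jars have not been uploaded yet.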

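2.2.4 Resolve the dependency conflict in the Spark module
The Hive version was changed to 3.1.2, which brings in Jetty 9.3.x, while Hudi itself uses Jetty 9.4.x, so the two conflict.
1) Edit the pom file of hudi-spark-bundle to exclude the older Jetty and add the Jetty version specified by Hudi: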

vim /opt/software/hudi-0.12.0/packaging/hudi-spark-bundle/pom.xml

Around line 382, modify as follows:

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service-rpc</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${spark.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Add the Jetty version configured by Hudi -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>

Otherwise, inserting data into a Hudi table with Spark fails with the following error:

java.lang.NoSuchMethodError:
org.apache.hudi.org.apache.jetty.server.session.SessionHandler.setHttpOnly(Z)V


2) Edit the pom file of hudi-utilities-bundle to exclude the older Jetty and add the Jetty version specified by Hudi:

vim /opt/software/hudi-0.12.0/packaging/hudi-utilities-bundle/pom.xml

Around line 405, modify as follows:

<!-- Hoodie -->
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-common</artifactId>
  <version>${project.version}</version>

  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-client-common</artifactId>
  <version>${project.version}</version>

  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <artifactId>servlet-api</artifactId>
      <groupId>javax.servlet</groupId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.pentaho</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-service-rpc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>
</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.datanucleus</groupId>
      <artifactId>datanucleus-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <exclusion>
      <artifactId>guava</artifactId>
      <groupId>com.google.guava</groupId>
    </exclusion>
  </exclusions>

</dependency>

<dependency>
  <groupId>${hive.groupid}</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive.version}</version>
  <scope>${utilities.bundle.hive.scope}</scope>

  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty.orbit</groupId>
      <artifactId>javax.servlet</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Add the Jetty version configured by Hudi -->
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
</dependency>

Otherwise, inserting data into a Hudi table with the DeltaStreamer tool also fails with the Jetty error.
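2.2.5 Run the build command
Note that -Dspark3.2 builds against Spark 3.2.x, whereas the version list at the start of this tutorial gives Spark 3.1.1; use the Spark profile flag that matches your cluster.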

mvn clean package -DskipTests -Dspark3.2 -Dflink1.13 -Dscala-2.12 -Dhadoop.version=3.1.3 -Pflink-bundle-shade-hive3
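
After the build, the bundle jars are usually what gets copied to Spark or Flink. The sketch below lists them, assuming the standard Maven layout in which each module under packaging/ writes its output to its own target directory. The helper name list_bundle_jars is hypothetical, and the "original-" and "-sources" filters are assumptions about by-products you would normally want to skip.

```shell
# Hypothetical helper: list bundle jars under <source root>/packaging/*/target.
# $1 = Hudi source root (defaults to the path used in this tutorial).
list_bundle_jars() {
  src_root="${1:-/opt/software/hudi-0.12.0}"
  find "${src_root}/packaging" -path '*/target/*' -name '*bundle*.jar' \
    ! -name 'original-*' ! -name '*-sources.jar' 2>/dev/null | sort
}

# Prints nothing if the source tree has not been built on this machine.
list_bundle_jars
```

An empty result after a build that reported success suggests the source root passed to the function is wrong.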

2.2.6 Build succeeded
After the build completes, being able to launch hudi-cli confirms that it succeeded.

Tags: big data, scala

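This article is reposted from: https://blog.csdn.net/m0_74453787/article/details/142696093
Copyright belongs to the original author, 霸气的哦尼酱. If there is any infringement, please contact us for removal.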
