0


使用Flink MySQL cdc分别sink到ES、Kafka、Hudi

环境说明

版本问题

flink cdc与flink对应版本

https://ververica.github.io/flink-cdc-connectors/master/content/about.html#supported-flink-versions

hudi与flink版本

https://github.com/apache/hudi/issues/3689

编译所需jar包

编译flink-cdc-connectors

  • 进入flink-cdc-connectors目录,修改flink-cdc-connectors/pom.xml文件中flink的版本号为1.13.1

<flink.version>1.13.1</flink.version>

- maven编译

mvn clean package -DskipTests

- 编译完成,将以下jar包拷贝到flink-1.13.1/lib下

flink-cdc-connectors/flink-sql-connector-mysql-cdc/target/flink-sql-connector-mysql-cdc-2.1-SNAPSHOT.jar

flink-cdc-connectors/flink-sql-connector-postgres-cdc/target/flink-sql-connector-postgres-cdc-2.1-SNAPSHOT.jar

flink-cdc-connectors/flink-format-changelog-json/target/flink-format-changelog-json-2.1-SNAPSHOT.jar

编译hudi

  • 进入hudi目录,修改hudi/pom.xml,修改对应组件的版本,由于flink使用的是scala-2.11版本,spark3.x版本以上默认使用scala-2.12预编译,为了节省时间,我们在此使用spark2.4.8以scala-2.11预编译的版本,对应的hadoop版本为2.7

<flink.version>1.13.1</flink.version>
<hadoop.version>2.7.3</hadoop.version>
<spark2.version>2.4.8</spark2.version>

- maven编译

mvn clean package -DskipTests

- 编译完成,将以下jar包拷贝到flink-1.13.1/lib下

hudi/packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.11-0.10.0-SNAPSHOT.jar

- 将以下jar包拷贝到spark-2.4.8-bin-hadoop2.7/jars下

hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.10.0-SNAPSHOT.jar

- 将以下jar包拷贝到hadoop-2.7.3/share/hadoop/hdfs/下

hudi/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.10.0-SNAPSHOT.jar

- 另外为了把数据写入ES和kafka中,需要下载以下jar包,并拷贝到flink-1.13.1/lib下

flink-sql-connector-elasticsearch7_2.11-1.13.1.jar

flink-sql-connector-kafka_2.11-1.13.1.jar

- flink sink到hudi还需要以下jar包

flink-shaded-hadoop-2-uber-2.7.5-10.0.jar

- 所有依赖放置完成后的flink-1.13.1/lib如下

Docker中创建所需的组件

  • 创建一个目录,创建一个新文件docker-compose.yml,内容如下,包含postgres mysql es kibana kafka zookeeper组件:
  1. version: '2.1'
  2. services:
  3. postgres:
  4. image: debezium/example-postgres:1.1
  5. ports:
  6. - "5432:5432"
  7. environment:
  8. - POSTGRES_PASSWORD=1234
  9. - POSTGRES_DB=postgres
  10. - POSTGRES_USER=postgres
  11. - POSTGRES_PASSWORD=postgres
  12. mysql:
  13. image: debezium/example-mysql:1.1
  14. ports:
  15. - "3306:3306"
  16. environment:
  17. - MYSQL_ROOT_PASSWORD=123456
  18. - MYSQL_USER=mysqluser
  19. - MYSQL_PASSWORD=mysqlpw
  20. elasticsearch:
  21. image: elastic/elasticsearch:7.6.0
  22. environment:
  23. - cluster.name=docker-cluster
  24. - bootstrap.memory_lock=true
  25. - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
  26. - discovery.type=single-node
  27. ports:
  28. - "9200:9200"
  29. - "9300:9300"
  30. ulimits:
  31. memlock:
  32. soft: -1
  33. hard: -1
  34. nofile:
  35. soft: 65536
  36. hard: 65536
  37. kibana:
  38. image: elastic/kibana:7.6.0
  39. ports:
  40. - "5601:5601"
  41. zookeeper:
  42. image: wurstmeister/zookeeper:3.4.6
  43. ports:
  44. - "2181:2181"
  45. kafka:
  46. image: wurstmeister/kafka:2.12-2.2.1
  47. ports:
  48. - "9092:9092"
  49. - "9094:9094"
  50. depends_on:
  51. - zookeeper
  52. environment:
  53. - KAFKA_ADVERTISED_LISTENERS=INSIDE://:9094,OUTSIDE://localhost:9092
  54. - KAFKA_LISTENERS=INSIDE://:9094,OUTSIDE://:9092
  55. - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
  56. - KAFKA_INTER_BROKER_LISTENER_NAME=INSIDE
  57. - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
  58. - KAFKA_CREATE_TOPICS="user_behavior:1:1"
  59. volumes:
  60. - /var/run/docker.sock:/var/run/docker.sock

- 进入目录,启动并初始化所有容器

docker-compose build
docker-compose up -d

- 打开docker 应用,可以看到组件已经全部启动

初始化数据

- 通过docker命令进入mysql容器内部

docker-compose exec mysql mysql -uroot -p123456

- 初始化MySQL数据

  1. -- MySQL
  2. CREATE DATABASE mydb;
  3. USE mydb;
  4. CREATE TABLE products (
  5. id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  6. name VARCHAR(255) NOT NULL,
  7. description VARCHAR(512)
  8. );
  9. ALTER TABLE products AUTO_INCREMENT = 101;
  10. INSERT INTO products
  11. VALUES (default,"scooter","Small 2-wheel scooter"),
  12. (default,"car battery","12V car battery"),
  13. (default,"12-pack drill bits","12-pack of drill bits with sizes ranging from #40 to #3"),
  14. (default,"hammer","12oz carpenter's hammer"),
  15. (default,"hammer","14oz carpenter's hammer"),
  16. (default,"hammer","16oz carpenter's hammer"),
  17. (default,"rocks","box of assorted rocks"),
  18. (default,"jacket","water resistent black wind breaker"),
  19. (default,"spare tire","24 inch spare tire");
  20. CREATE TABLE orders (
  21. order_id INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
  22. order_date DATETIME NOT NULL,
  23. customer_name VARCHAR(255) NOT NULL,
  24. price DECIMAL(10, 5) NOT NULL,
  25. product_id INTEGER NOT NULL,
  26. order_status BOOLEAN NOT NULL -- 是否下单
  27. ) AUTO_INCREMENT = 10001;
  28. INSERT INTO orders
  29. VALUES (default, '2020-07-30 10:08:22', 'Jark', 50.50, 102, false),
  30. (default, '2020-07-30 10:11:09', 'Sally', 15.00, 105, false),
  31. (default, '2020-07-30 12:00:30', 'Edward', 25.25, 106, false);

- 通过docker命令进入postgres容器内部

docker-compose exec postgres psql -h localhost -U postgres

- 初始化postgres数据

  1. -- PG
  2. CREATE TABLE shipments (
  3. shipment_id SERIAL NOT NULL PRIMARY KEY,
  4. order_id SERIAL NOT NULL,
  5. origin VARCHAR(255) NOT NULL,
  6. destination VARCHAR(255) NOT NULL,
  7. is_arrived BOOLEAN NOT NULL
  8. );
  9. ALTER SEQUENCE public.shipments_shipment_id_seq RESTART WITH 1001;
  10. ALTER TABLE public.shipments REPLICA IDENTITY FULL;
  11. INSERT INTO shipments
  12. VALUES (default,10001,'Beijing','Shanghai',false),
  13. (default,10002,'Hangzhou','Shanghai',false),
  14. (default,10003,'Shanghai','Hangzhou',false);

修改环境变量、hadoop配置文件并启动hadoop

- 修改环境变量vim ~/.zshrc 或者 vim /etc/profile

  1. export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8/Contents/Home
  2. export PATH=$PATH:$JAVA_HOME/bin
  3. export SCALA_HOME=/xxx/xxx/xxx/xxx/xxx/xxx/scala
  4. export PATH=$PATH:$SCALA_HOME/bin
  5. export SPARK_HOME=/xxx/xxx/xxx/xxx/xxx/xxx/spark
  6. export PATH=$PATH:$SPARK_HOME/bin
  7. export MAVEN_HOME=/xxx/xxx/xxx/xxx/xxx/xxx/apache-maven
  8. export PATH=$PATH:$MAVEN_HOME/bin
  9. export HADOOP_HOME=/xxx/xxx/xxx/xxx/xxx/xxx/hadoop
  10. export PATH=$PATH:$HADOOP_HOME/bin
  11. export FLINK_HOME=/xxx/xxx/xxx/xxx/xxx/xxx/flink
  12. export PATH=$PATH:$FLINK_HOME/bin
  13. export PATH=/xxx/xxx/xxx/xxx/xxx/xxx/protobuf/bin:$PATH

修改完成执行命令 source ~/.zshrc 或者source /etc/profile让环境变量生效

- 修改hadoop-2.7.3/etc/hadoop/core-site.xml,如果目录不存在,给创建上

  1. <configuration>
  2. <property>
  3. <name>hadoop.tmp.dir</name>
  4. <value>file:/xxx/xxx/xxx/xxx/xxx/hadoop-2.7.3/tmp</value>
  5. </property>
  6. <property>
  7. <name>fs.defaultFS</name>
  8. <value>hdfs://localhost:8020</value>
  9. </property>
  10. </configuration>
  • 修改hadoop-2.7.3/etc/hadoop/hdfs-site.xml,如果目录不存在,给创建上
  1. <configuration>
  2. <property>
  3. <name>dfs.replication</name>
  4. <value>1</value>
  5. </property>
  6. <property>
  7. <name>dfs.namenode.name.dir</name>
  8. <value>file:/xxx/xxx/xxx/xxx/xxx/hadoop-2.7.3/dfs/name</value>
  9. </property>
  10. <property>
  11. <name>dfs.namenode.data.dir</name>
  12. <value>file:/xxx/xxx/xxx/xxx/xxx/hadoop-2.7.3/dfs/data</value>
  13. </property>
  14. </configuration>
  • 修改hadoop-2.7.3/etc/hadoop/yarn-site.xml
  1. <configuration>
  2. <!-- Site specific YARN configuration properties -->
  3. <property>
  4. <name>yarn.nodemanager.aux-services</name>
  5. <value>mapreduce_shuffle</value>
  6. </property>
  7. <property>
  8. <name>yarn.nodemanager.env-whitelist</name>
  9. <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  10. </property>
  11. </configuration>
  • 修改hadoop-2.7.3/etc/hadoop/mapred-site.xml
  1. <configuration>
  2. <property>
  3. <name>mapreduce.framework.name
  4. </name>
  5. <value>yarn</value>
  6. </property>
  7. </configuration>
  • 启动,hadoop目录下执行
  1. ./sbin/start-all.sh
  2. # 格式化namenode
  3. hdfs namenode -format

如果不能发现Java环境变量,建议直接修改hadoop-2.7.3/etc/hadoop/hadoop-env.sh

  1. # export JAVA_HOME=${JAVA_HOME}
  2. export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8/Contents/Home
  • 启动完成,通过jps命令查看组件是否已启动,NameNode、SecondaryNameNode、DataNode、ResourceManager、NodeManager均已启动,启动完成,可以登录浏览器:http://localhost:50070/查看hdfs状态

➜ hadoop-2.7.3 jps
63696 NodeManager
63969 Jps
62178 DataNode
62805 SecondaryNameNode
63510 ResourceManager
2392
61789 NameNode

修改flink配置文件、并启动

  • 修改flink-1.13.1/conf/flink-conf.yaml,增加slots数量,修改jobmanager的端口

taskmanager.numberOfTaskSlots: 1

taskmanager.numberOfTaskSlots: 6

  • 启动flink

./bin/start-cluster.sh

➜ flink-1.13.1 jps
2320 SecondaryNameNode
3137 ResourceManager
3489 NodeManager
85906 TaskManagerRunner
16275 NameNode
1813 DataNode
2392
85565 StandaloneSessionClusterEntrypoint
86431 Jps

  • 启动flink的sql client

./bin/sql-client.sh

创建flink cdc表,分别以mysql、pgsql作为数据源,将数据写入到es中和kafka中

分别创建mysql和pgsql的flink-sql表

  1. --Flink SQL
  2. -- 设置 checkpoint 间隔为 3
  3. Flink SQL> SET execution.checkpointing.interval = 3s;
  4. Flink SQL> set dfs.client.block.write.replace-datanode-on-failure.enable = ture;
  5. Flink SQL> set dfs.client.block.write.replace-datanode-on-failure.policy = NEVER;
  6. Flink SQL> CREATE TABLE products (
  7. id INT,
  8. name STRING,
  9. description STRING,
  10. PRIMARY KEY (id) NOT ENFORCED
  11. ) WITH (
  12. 'connector' = 'mysql-cdc',
  13. 'hostname' = 'localhost',
  14. 'port' = '3306',
  15. 'username' = 'root',
  16. 'password' = '123456',
  17. 'database-name' = 'mydb',
  18. 'table-name' = 'products'
  19. );
  20. Flink SQL> CREATE TABLE orders (
  21. order_id INT,
  22. order_date TIMESTAMP(0),
  23. customer_name STRING,
  24. price DECIMAL(10, 5),
  25. product_id INT,
  26. order_status BOOLEAN,
  27. PRIMARY KEY (order_id) NOT ENFORCED
  28. ) WITH (
  29. 'connector' = 'mysql-cdc',
  30. 'hostname' = 'localhost',
  31. 'port' = '3306',
  32. 'username' = 'root',
  33. 'password' = '123456',
  34. 'database-name' = 'mydb',
  35. 'table-name' = 'orders'
  36. );
  37. Flink SQL> CREATE TABLE shipments (
  38. shipment_id INT,
  39. order_id INT,
  40. origin STRING,
  41. destination STRING,
  42. is_arrived BOOLEAN,
  43. PRIMARY KEY (shipment_id) NOT ENFORCED
  44. ) WITH (
  45. 'connector' = 'postgres-cdc',
  46. 'hostname' = 'localhost',
  47. 'port' = '5432',
  48. 'username' = 'postgres',
  49. 'password' = 'postgres',
  50. 'database-name' = 'postgres',
  51. 'schema-name' = 'public',
  52. 'table-name' = 'shipments'
  53. );

sink到es

  1. Flink SQL> CREATE TABLE enriched_orders (
  2. order_id INT,
  3. order_date TIMESTAMP(0),
  4. customer_name STRING,
  5. price DECIMAL(10, 5),
  6. product_id INT,
  7. order_status BOOLEAN,
  8. product_name STRING,
  9. product_description STRING,
  10. shipment_id INT,
  11. origin STRING,
  12. destination STRING,
  13. is_arrived BOOLEAN,
  14. PRIMARY KEY (order_id) NOT ENFORCED
  15. ) WITH (
  16. 'connector' = 'elasticsearch-7',
  17. 'hosts' = 'http://localhost:9200',
  18. 'index' = 'enriched_orders'
  19. );
  20. Flink SQL> INSERT INTO enriched_orders
  21. SELECT o.*, p.name, p.description, s.shipment_id, s.origin, s.destination, s.is_arrived
  22. FROM orders AS o
  23. LEFT JOIN products AS p ON o.product_id = p.id
  24. LEFT JOIN shipments AS s ON o.order_id = s.order_id;

  • 修改 mysql 和 postgres 里面的数据,观察 elasticsearch 里的结果
  1. -- 修改MySQL
  2. INSERT INTO orders
  3. VALUES (default, '2020-07-30 15:22:00', 'Jark', 29.71, 104, false);
  4. -- 修改PG
  5. INSERT INTO shipments
  6. VALUES (default,10004,'Shanghai','Beijing',false);
  7. -- 修改MySQL
  8. UPDATE orders SET order_status = true WHERE order_id = 10004;
  9. -- 修改PG
  10. UPDATE shipments SET is_arrived = true WHERE shipment_id = 1004;
  11. -- 修改MySQL
  12. DELETE FROM orders WHERE order_id = 10004;

sink到kafka

  1. --Flink SQL
  2. Flink SQL> CREATE TABLE kafka_gmv (
  3. day_str STRING,
  4. gmv DECIMAL(10, 5)
  5. ) WITH (
  6. 'connector' = 'kafka',
  7. 'topic' = 'kafka_gmv',
  8. 'scan.startup.mode' = 'earliest-offset',
  9. 'properties.bootstrap.servers' = 'localhost:9092',
  10. 'format' = 'changelog-json'
  11. );
  12. Flink SQL> INSERT INTO kafka_gmv
  13. SELECT DATE_FORMAT(order_date, 'yyyy-MM-dd') as day_str, SUM(price) as gmv
  14. FROM orders
  15. WHERE order_status = true
  16. GROUP BY DATE_FORMAT(order_date, 'yyyy-MM-dd');
  17. -- 读取 Kafka changelog 数据,观察 materialize 后的结果
  18. Flink SQL> SELECT * FROM kafka_gmv;
  • 观察kafka的数据
  1. docker-compose exec kafka bash -c 'kafka-console-consumer.sh --topic kafka_gmv --bootstrap-server kafka:9092 --from-beginning'
  • 更新 orders 数据,观察SQL CLI 和 kafka console 的输出
  1. -- 更新MySQL
  2. UPDATE orders SET order_status = true WHERE order_id = 10001;
  3. UPDATE orders SET order_status = true WHERE order_id = 10002;
  4. UPDATE orders SET order_status = true WHERE order_id = 10003;
  5. INSERT INTO orders
  6. VALUES (default, '2020-07-30 17:33:00', 'Timo', 50.00, 104, true);
  7. UPDATE orders SET price = 40.00 WHERE order_id = 10005;
  8. DELETE FROM orders WHERE order_id = 10005;

sink到hudi

  • 创建flink-hudi表
  1. Flink SQL> CREATE TABLE enriched_orders_hudi (
  2. order_id INT PRIMARY KEY NOT ENFORCED,
  3. order_date TIMESTAMP(3),
  4. customer_name STRING,
  5. price DOUBLE,
  6. product_id INT,
  7. order_status BOOLEAN,
  8. `partition` VARCHAR(20)
  9. ) PARTITIONED BY (`partition`) WITH (
  10. 'connector' = 'hudi'
  11. ,'write.precombine.field' = 'order_date'
  12. ,'table.type' = 'MERGE_ON_READ'
  13. ,'path' = 'hdfs://localhost:9000/hudi/enriched_orders_hudi'
  14. ,'compaction.tasks' = '1'
  15. ,'compaction.trigger.strategy' = 'num_or_time'
  16. ,'compaction.delta_commits' = '10'
  17. ,'compaction.delta_seconds' = '10'
  18. ,'read.tasks' = '1'
  19. ,'read.streaming.enabled' = 'true'
  20. ,'hoodie.datasource.query.type' = 'snapshot'
  21. ,'read.streaming.check-interval' = '10'
  22. ,'hoodie.datasource.merge.type' = 'payload_combine'
  23. ,'read.utc-timezone' = 'false'
  24. );
  25. -- 插入数据
  26. Flink SQL> INSERT INTO enriched_orders_hudi
  27. SELECT *,DATE_FORMAT(order_date, 'yyyyMMdd')
  28. FROM orders;
  29. -- 查看表是否有数据
  30. Flink SQL> select customer_name,sum(price) as sum_price from enriched_orders_hudi group by customer_name;

spark-sql使用hudi已有数据,创建spark-sql hudi表

  • 启动spark-sql,必须用带以下参数的方式启动
  1. spark-sql --packages org.apache.hudi:hudi-spark-bundle_2.11:0.9.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  2. --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  3. --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
  • 创建spark-sql hudi表,不指定schema,自适应,分区,主键为可选
  1. create table if not exists enriched_orders_hudi
  2. using hudi
  3. location 'hdfs://localhost:9000/hudi/enriched_orders_hudi'
  4. options (
  5. type = 'mor',
  6. primaryKey = 'order_id',
  7. preCombineField = 'order_date'
  8. )
  9. partitioned by (`partition`);
  • 验证
  1. -- mysqlorders表中数据
  2. INSERT INTO orders
  3. VALUES
  4. (default, default, 'A君', 500.50, 102, false),
  5. (default, default, 'B君', 100.00, 105, false);
  6. INSERT INTO orders
  7. VALUES
  8. (default, default, 'C君', 500.50, 102, false),
  9. (default, default, 'D君', 100.00, 105, false);
  10. -- 分别在flink sql client spark-sql cli执行
  11. Flink SQL> select customer_name,sum(price) as sum_price from enriched_orders_hudi group by customer_name;
  12. spark-sql> select customer_name,sum(price) as sum_price from enriched_orders_hudi group by customer_name;
  13. -- 效果分别如下两图
  14. -- 更新、删除自行演示

遇到的问题

  • 由于我是单节点的hadoop,单个datanode会触发数据写入hdfs失败,报错信息如下:

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

网上文章大部分都是让修改hdfs-site.xml添加两个配置项:

  1. <property>
  2. <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  3. <value>true</value>
  4. </property>
  5. <property>
  6. <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  7. <value>NEVER</value>
  8. </property>

但并不起效,因为导致这一问题的原因是客户端产生的不是服务端产生的,所以,需要修改hdfs的客户端,那么当前使用的是spark-sql,因此客户端就是spark-sql所持有的hdfs client,因此需要在启动spark-sql之后设置两个参数:

  1. spark-sql> set dfs.client.block.write.replace-datanode-on-failure.enable = ture;
  2. spark-sql> set dfs.client.block.write.replace-datanode-on-failure.policy = NEVER;

报错问题解决,并且可以通过insert into往hudi插入数据,通过select * from table查到hudi中的数据。

  • http-request java.lang.NoSuchMethodError: io.javalin.core.CachedRequestWrapper.getContentLengthLong()J

解决办法:

直接给官方提issue https://github.com/apache/hudi/issues/3797,问题解决


本文转载自: https://blog.csdn.net/saqin6255/article/details/130666787
版权归原作者 白杨Shayne 所有, 如有侵权,请联系我们删除。

“使用Flink MySQL cdc分别sink到ES、Kafka、Hudi”的评论:

还没有评论