spark-sql: problems with insert overwrite on partitioned tables

1. Background

When using spark-sql to insert overwrite a partitioned table, I ran into two rather troublesome problems:

  1. Selecting from the target table and then insert overwrite-ing that same table fails with: Error in query: Cannot overwrite a path that is also being read from.
  2. Selecting from another table and then insert overwrite-ing the target table deletes all of the target table's other partitions.

2. Problem description

2.1 Code example

  drop table pt_table_test1;
  create table pt_table_test1 (
      id int,
      region string,
      dt string
  ) using parquet
  partitioned by (region, dt)
  ;
  drop table pt_table_test2;
  create table pt_table_test2 (
      id int,
      region string,
      dt string
  ) using parquet
  partitioned by (region, dt)
  ;
  -- allow all partition columns to be derived dynamically from the data
  set hive.exec.dynamic.partition = true;
  set hive.exec.dynamic.partition.mode = nonstrict;
  truncate table pt_table_test1;
  insert into table pt_table_test1 values (1,'id', '2022-10-01'),(2,'id', '2022-10-02'),(3,'ph', '2022-10-03'),(1,'sg', '2022-10-01'),(2,'sg', '2022-10-02'),(3,'ph', '2022-10-03');
  select * from pt_table_test1;
  -- problem 1: reading from and overwriting the same table throws
  -- "Cannot overwrite a path that is also being read from"
  insert overwrite table pt_table_test1 select * from pt_table_test1 where dt = '2022-10-01';
  select * from pt_table_test1;
  truncate table pt_table_test2;
  insert into table pt_table_test2 values (2,'id', '2022-10-01'),(2,'id', '2022-10-02'),(2,'sg', '2022-10-01'),(2,'sg', '2022-10-02');
  -- problem 2: every partition of pt_table_test1 is dropped, not just the ones written
  insert overwrite table pt_table_test1 select * from pt_table_test2 where id = 2;
  select * from pt_table_test1;

2.2 Error demonstration
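
A sketch of what the spark-sql session from 2.1 shows (reconstructed from the behavior described above, output abbreviated; exact formatting varies by Spark version):

  spark-sql> insert overwrite table pt_table_test1 select * from pt_table_test1 where dt = '2022-10-01';
  Error in query: Cannot overwrite a path that is also being read from.
  spark-sql> insert overwrite table pt_table_test1 select * from pt_table_test2 where id = 2;
  spark-sql> select * from pt_table_test1;
  2    id    2022-10-01
  2    id    2022-10-02
  2    sg    2022-10-01
  2    sg    2022-10-02
  -- only the rows just written remain; the original id=1/3 and region=ph rows are gone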

3. Solution

I vaguely remembered running into these two problems before, so experience suggested they should be fixable. I dug up a table that had always worked, compared it with the new one, and found that the difference was in how the table was created:

  • Wrong: specifying the table's file format with using parquet.
  • Right: specifying the table's file format with stored as parquet.

3.1 Example code

  -- identical to the script in 2.1 except that both tables use "stored as parquet"
  drop table pt_table_test1;
  create table pt_table_test1 (
      id int,
      region string,
      dt string
  ) stored as parquet
  partitioned by (region, dt)
  ;
  drop table pt_table_test2;
  create table pt_table_test2 (
      id int,
      region string,
      dt string
  ) stored as parquet
  partitioned by (region, dt)
  ;
  set hive.exec.dynamic.partition = true;
  set hive.exec.dynamic.partition.mode = nonstrict;
  truncate table pt_table_test1;
  insert into table pt_table_test1 values (1,'id', '2022-10-01'),(1,'id', '2022-10-02'),(1,'ph', '2022-10-03'),(1,'sg', '2022-10-01'),(1,'sg', '2022-10-02'),(1,'ph', '2022-10-03');
  select * from pt_table_test1;
  -- now succeeds even though the table is read from and overwritten at once
  insert overwrite table pt_table_test1 select * from pt_table_test1 where dt = '2022-10-01';
  select * from pt_table_test1;
  truncate table pt_table_test2;
  insert into table pt_table_test2 values (2,'id', '2022-10-01'),(2,'id', '2022-10-02'),(2,'sg', '2022-10-01'),(2,'sg', '2022-10-02');
  -- now only the partitions produced by the query are overwritten;
  -- the region=ph partition survives
  insert overwrite table pt_table_test1 select * from pt_table_test2 where id = 2;
  select * from pt_table_test1;

3.2 Correct demonstration
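
For comparison, a sketch of the expected final state (reconstructed under Hive's dynamic-partition overwrite semantics, where only the partitions written by the query are replaced):

  spark-sql> insert overwrite table pt_table_test1 select * from pt_table_test2 where id = 2;
  spark-sql> select * from pt_table_test1;
  2    id    2022-10-01
  2    id    2022-10-02
  2    sg    2022-10-01
  2    sg    2022-10-02
  1    ph    2022-10-03
  1    ph    2022-10-03
  -- the region=ph / dt=2022-10-03 partition is untouched this time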

4. using parquet vs. stored as parquet

Comparing the two ways of creating the table:

  1. Whichever way the table is created, using parquet or stored as parquet, show create table prints USING parquet.
  2. With stored as parquet, the show create table output additionally contains a TBLPROPERTIES clause.
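
The reason the two DDL forms behave differently is that using parquet creates a Spark datasource table, while stored as parquet creates a Hive-format table, and Spark applies different insert overwrite semantics to each. If you need to keep using parquet tables, one possible alternative, a sketch assuming Spark 2.3 or later where the spark.sql.sources.partitionOverwriteMode setting exists, is to switch datasource tables from the static overwrite default (which, with no static partition spec, replaces the whole table) to dynamic overwrite:

  -- assumption: Spark 2.3+; in dynamic mode a datasource table only replaces
  -- the partitions that the query actually produces
  set spark.sql.sources.partitionOverwriteMode = dynamic;
  insert overwrite table pt_table_test1 select * from pt_table_test2 where id = 2;

Whether dynamic mode also lifts the "Cannot overwrite a path that is also being read from" restriction for self-overwrites depends on the Spark version, so the stored as parquet approach shown above remains the safer fix.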

Reposted from: https://blog.csdn.net/L13763338360/article/details/127653946
Copyright belongs to the original author, L13763338360. In case of infringement, please contact us for removal.
