

The Spark Job Commit Mechanism - Why Concurrent Spark Updates to an ORC Table Fail, and How to Fix It

1 Problem Symptoms

When multiple Spark jobs concurrently update the same ORC table, some of the jobs may fail and exit because certain temporary files no longer exist. A typical error log looks like this:

org.apache.spark.SparkException: Job aborted.
Caused by: java.io.FileNotFoundException: File hdfs://kxc-cluster/user/hive/warehouse/hstest_dev.db/test_test_test/_temporary/0/task_202309041054037826600124725546762_0176_m_000002/g=2 does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:981)

2 Root Cause

The root cause is that Spark does not support concurrent updates to the same ORC/Parquet non-partitioned table, or to the same partition of an ORC/Parquet partitioned table; it does not even support concurrently updating different partitions of an ORC/Parquet partitioned table in static-partition mode. The underlying reason lies in the two-phase commit algorithm used by Spark jobs, which is detailed in the later sections.

3 Solutions

  • Solution 1: for partitioned tables, prefer dynamic-partition mode over static-partition mode. For example, use insert overwrite table table1 partition (part_date) select client_id, 20230911 as part_date from table0 instead of insert overwrite table table1 partition (part_date=20230911) select client_id from table0. In this mode each job gets its own independent staging directory, such as .spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66, so concurrent jobs do not conflict;
  • Solution 2: configure Spark to use the Hive SerDe writer instead of Spark's built-in data source writer, i.e. set spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false. The write then goes through the Hive SerDe commit algorithm, and each job again gets its own independent staging directory, such as .hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59, so concurrent jobs do not conflict;
  • Solution 3: configure FileOutputCommitter not to clean up the temporary directory, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true (all three options are sketched below).
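
For reference, here is a minimal spark-shell style sketch of the three workarounds. It reuses the table names table0/table1 and the partition column part_date from the examples above; whether the convertMetastore settings and the cleanup flag can be changed at session level or must be set at application launch (spark-defaults.conf, kyuubi defaults, or --conf at submit time) depends on the Spark version, so treat this as illustrative rather than authoritative:

// Workaround 1: dynamic-partition insert - each job stages its output under
// its own <table>/.spark-staging-<jobId> directory, so concurrent jobs do not collide.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")  // may be required depending on the environment
spark.sql(
  """INSERT OVERWRITE TABLE table1 PARTITION (part_date)
    |SELECT client_id, 20230911 AS part_date FROM table0""".stripMargin)

// Workaround 2: fall back to the Hive SerDe writer, which stages output under
// a per-job .hive-staging_hive_* directory instead of <table>/_temporary.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.conf.set("spark.sql.hive.convertInsertingPartitionedTable", "false")

// Workaround 3: keep FileOutputCommitter from deleting <table>/_temporary when a
// job commits; leftover directories must then be cleaned up asynchronously.
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped", "true")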

The limitations of each of these options are as follows:

  • Solution 1 only applies to partitioned tables;
  • Solution 2 may limit Spark/Hive interoperability: whether Spark and Hive can correctly read and write the data each other produces depends on whether the Spark and Hive versions are compatible (and on how certain related parameters are configured);
  • Solution 3 requires cleaning up the temporary directories asynchronously and manually; otherwise, empty directories (not empty files) accumulate under the table directory over time;

4 Technical Background - Overview

Spark jobs use a two-phase commit mechanism: tasks and the job are committed separately. The details are as follows:

  • When the job starts, a temporary directory ${output.dir}/_temporary/${appAttemptId} is created first and used as the temporary output directory for this run, where ${output.dir} is the table's root storage path, e.g. /user/hive/warehouse/test.db/tableA;

  • When one of the job's tasks starts running, it creates a further temporary directory ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}, which serves as that task's temporary output directory;

  • When a task finishes, Spark checks whether that task attempt needs to be committed (with speculative execution enabled, some task attempts are not committed). If it does, the task's output files are moved from ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/${fileName} to the committed task directory ${output.dir}/_temporary/${appAttemptId}/${taskId}/${fileName};

  • After all tasks have finished, the job itself is committed: all output files under ${output.dir}/_temporary/${appAttemptId}/ are moved to the final directory ${output.dir};

  • After the job commit completes, the temporary directory is cleaned up, i.e. ${output.dir}/_temporary is deleted, unless spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true has been explicitly configured;

  • When inserting into a partitioned table in dynamic-partition mode, an additional staging directory is used as the temporary directory, namely ${output.dir}/.spark-staging-${jobId}/_temporary, for example /user/hive/warehouse/test.db/tableA/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary; this staging directory is always deleted after the job is committed;

  • When mapreduce.fileoutputcommitter.algorithm.version=2 is explicitly configured, the task-commit step described above differs slightly: the task's output files are moved directly from ${output.dir}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/${fileName} to ${output.dir};

It is precisely because of these task/job commit details that Spark does not support concurrent updates to the same ORC/Parquet non-partitioned table or to the same partition of an ORC/Parquet partitioned table, nor concurrent updates to different partitions of an ORC/Parquet partitioned table in static-partition mode: all such concurrent jobs share the same ${output.dir}/_temporary/${appAttemptId} directory (typically _temporary/0), so whichever job commits and cleans up first deletes or renames directories that the other, still-running jobs depend on.
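
To make the collision concrete, the following is a simplified sketch (not Spark's actual implementation) of how the v1 committer lays out paths for a given output directory; the case class and method names are illustrative. The key point is that every job writing to the same table shares the same ${output.dir}/_temporary/0 subtree:

// Simplified illustration of the FileOutputCommitter v1 path layout
// (assumption: appAttemptId = 0, which is what the logs in this article show).
case class CommitPaths(outputDir: String, appAttemptId: Int = 0) {
  // Shared by every job that writes to this table concurrently.
  val jobTempDir = s"$outputDir/_temporary/$appAttemptId"

  // Where a running task attempt writes its files.
  def taskAttemptDir(taskAttemptId: String): String =
    s"$jobTempDir/_temporary/attempt_$taskAttemptId"

  // Task commit renames the attempt directory to the committed task directory;
  // job commit then moves everything under jobTempDir to outputDir and deletes
  // outputDir/_temporary (unless cleanup is skipped).
  def committedTaskDir(taskId: String): String = s"$jobTempDir/task_$taskId"
}

val p = CommitPaths("/user/hive/warehouse/test.db/tableA")
println(p.taskAttemptDir("202309080930006805722025796783378_0038_m_000000_158"))
println(p.committedTaskDir("202309080928591897349793317265177_0025_m_000000"))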
5 Technical Background - Related Source Code and Parameters

Related source code:

  • org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
  • org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol
  • org.apache.spark.internal.io.FileCommitProtocol
  • org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
  • org.apache.hadoop.mapreduce.OutputCommitter
  • org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
  • org.apache.spark.sql.execution.datasources.FileFormatWriter
  • org.apache.spark.sql.hive.execution.SaveAsHiveFile

Related parameters:

  • spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version: defaults to 1 on Hadoop 2.x and to 2 on Hadoop 3.x
  • spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped
  • spark.sql.sources.outputCommitterClass
  • spark.sql.sources.commitProtocolClass
  • mapreduce.fileoutputcommitter.algorithm.version
  • mapreduce.fileoutputcommitter.cleanup.skipped
  • mapreduce.fileoutputcommitter.cleanup-failures.ignored
  • mapreduce.fileoutputcommitter.marksuccessfuljobs
  • mapreduce.fileoutputcommitter.task.cleanup.enabled
  • mapred.committer.job.setup.cleanup.needed / mapreduce.job.committer.setup.cleanup.needed
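
If you are unsure which committer settings are actually in effect for a session, a quick check (a sketch; keys that were never set explicitly may come back null because their defaults live in mapred-default.xml) is to read them from the driver's Hadoop configuration:

// Inspect the effective FileOutputCommitter-related settings from the driver.
// Note: spark.hadoop.* values from spark-defaults.conf are folded into this
// Hadoop Configuration without the spark.hadoop. prefix.
val hc = spark.sparkContext.hadoopConfiguration
println("algorithm.version   = " + hc.get("mapreduce.fileoutputcommitter.algorithm.version"))
println("cleanup.skipped     = " + hc.get("mapreduce.fileoutputcommitter.cleanup.skipped"))
println("convertMetastoreOrc = " + spark.conf.get("spark.sql.hive.convertMetastoreOrc", "<unset>"))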
6 Technical Background - Spark Concurrently Inserting into a Non-Partitioned Table

  • While the job/tasks are running, temporary files such as the following are generated: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309080930006805722025796783378_0038_m_000000_158/part-00000-a1e1410f-6ca1-4d8b-92b6-78883c9e9a22-c000.zlib.orc
  • After the task commit, the output lives at: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/_temporary/0/task_202309080928591897349793317265177_0025_m_000000
  • After the job commit, the final file is: /user/hive/warehouse/test_liming.db/table1/part-00000-8448b8b5-01b1-4348-8f91-5d3acd682f81-c000.zlib.orc
  • Screenshots taken during execution: (not included here)
  • The key logs are as follows.

Key logs - a successful task:

23/09/08 09:26:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 09:26:29 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 09:26:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 09:26:30 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 09:26:30 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 09:26:30 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/_temporary/attempt_202309080926277463773081868267263_0002_m_000000_2/part-00000-6c45455c-0201-4ad8-9459-fa8b77f37d0e-c000 with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 09:26:49 INFO FileOutputCommitter: Saved output of task 'attempt_202309080926277463773081868267263_0002_m_000000_2' to hdfs://nameservice1/user/hive/warehouse/test_test_test1/.hive-staging_hive_2023-09-08_09-26-26_158_278404270035841685-3/-ext-10000/_temporary/0/task_202309080926277463773081868267263_0002_m_000000
23/09/08 09:26:49 INFO SparkHadoopMapRedUtil: attempt_202309080926277463773081868267263_0002_m_000000_2: Committed. Elapsed time: 13 ms.
23/09/08 09:26:49 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2541 bytes result sent to driver

Key logs - a failed task:

23/09/08 10:22:02 WARN DataStreamer: DataStreamer Exception
java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1/_temporary/0/_temporary/attempt_202309081021577566806638904497462_0003_m_000000_10/part-00000-211a80a3-2cce-4f25-8c10-bfa5ecbd421f-c000.zlib.orc (inode 21688384) Holder DFSClient_attempt_202309081021525253836824694806862_0001_m_000003_4_991233622_49 does not have any open files.
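
The original experiments ran the conflicting statements as separately submitted jobs. As a rough, hedged sketch, a similar collision can be provoked from a single driver by issuing two overwrites of the same non-partitioned table concurrently (table0/table1 as in the earlier examples); it may not hit exactly the same error, but both write jobs go through table1/_temporary/0:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Rough repro sketch (assumption: table0 and the non-partitioned ORC table
// table1 already exist). Whichever write job commits first cleans up
// directories the other still needs.
val inserts = Seq(
  "INSERT OVERWRITE TABLE table1 SELECT client_id FROM table0",
  "INSERT OVERWRITE TABLE table1 SELECT client_id FROM table0"
).map(stmt => Future { spark.sql(stmt) })

Await.result(Future.sequence(inserts), Duration.Inf)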
7 Technical Background - Spark Concurrently Inserting into Different Partitions of a Partitioned Table in Static-Partition Mode

  • While the job/tasks are running, temporary files such as the following are generated: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc
  • After the task commit, the output lives at: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
  • After the job commit, the final file is: /user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00000-0732dc56-ae0f-4c32-8347-012870ad7ab1.c000.zlib.orc
  • Screenshots taken during execution: (not included here)
  • The key logs are as follows.

Key logs - a successful task:

23/09/08 10:54:48 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 10:54:48 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 10:54:48 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 25.249011 ms
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 14.669298 ms
23/09/08 10:54:48 INFO CodeGenerator: Code generated in 37.39972 ms
23/09/08 10:54:48 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 10:54:48 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 10:54:48 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081054408080671360087873016_0001_m_000000_1/g=20/part-00000-92811aeb-309c-4c23-acdd-b8286feadcd4.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 10:55:03 INFO FileOutputCommitter: Saved output of task 'attempt_202309081054408080671360087873016_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/task_202309081054408080671360087873016_0001_m_000000
23/09/08 10:55:03 INFO SparkHadoopMapRedUtil: attempt_202309081054408080671360087873016_0001_m_000000_1: Committed. Elapsed time: 9 ms.
23/09/08 10:55:03 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 3255 bytes result sent to driver

Key logs - job commit error:

23/09/08 10:55:22 ERROR FileFormatWriter: Aborting job 966601b8-2679-4dc3-86a1-cebc34d9b8c9.
java.io.FileNotFoundException: File hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/_temporary/0 does not exist.

Key logs - task commit error:

23/09/08 10:55:43 WARN DataStreamer: DataStreamer Exception
java.io.FileNotFoundException: File does not exist: /user/hive/warehouse/test_liming.db/table1_pt/_temporary/0/_temporary/attempt_202309081055288611730301255924365_0005_m_000000_20/g=22/part-00000-88afa539-25ba-4b1d-bd6d-df445863dd8d.c000.zlib.orc (inode 21689816) Holder DFSClient_NONMAPREDUCE_2024885185_46 does not have any open files.
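
In SQL terms, the scenario in this section is simply two statements like the following submitted at the same time from different jobs (table1_pt, the partition column g, and table0 are the illustrative names used throughout this article):

// Two jobs like these, submitted concurrently (e.g. from two sessions), still
// collide even though they target DIFFERENT partitions, because in
// static-partition mode both write through table1_pt/_temporary/0.
// Job A:
spark.sql("INSERT OVERWRITE TABLE table1_pt PARTITION (g=21) SELECT client_id FROM table0")
// Job B (run concurrently elsewhere):
spark.sql("INSERT OVERWRITE TABLE table1_pt PARTITION (g=22) SELECT client_id FROM table0")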
8 Technical Background - Spark Inserting into Different Partitions of a Partitioned Table in Dynamic-Partition Mode

  • While the job/tasks are running, temporary files such as the following are generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc
  • After the task commit, the output lives at: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
  • After the job commit, the final file is: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-2587b707-7675-4547-8ffb-63e2114d1c9b.c000.zlib.orc
  • Screenshots taken during execution: (not included here)
  • The key logs are as follows.

Key logs - all tasks and all jobs succeeded:

23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 11:21:45 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 11:21:45 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 11:21:45 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 44.80136 ms
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 16.168217 ms
23/09/08 11:21:45 INFO CodeGenerator: Code generated in 53.060559 ms
23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 11:21:45 INFO HadoopShimsPre2_7: Can't get KeyProvider for ORC encryption from hadoop.security.key.provider.path.
23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121344786622643944446305_0001_m_000000_1/g=21/part-00000-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 11:21:45 INFO WriterImpl: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/_temporary/attempt_202309081121348303587356551291178_0001_m_000002_3/g=23/part-00002-fc1a9f7a-5729-498e-b710-249e90217f66.c000.zlib.orc with stripeSize: 67108864 options: Compress: ZLIB buffer: 262144
23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121344786622643944446305_0001_m_000000_1' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121344786622643944446305_0001_m_000000
23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121344786622643944446305_0001_m_000000_1: Committed. Elapsed time: 18 ms.
23/09/08 11:22:04 INFO FileOutputCommitter: Saved output of task 'attempt_202309081121348303587356551291178_0001_m_000002_3' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-fc1a9f7a-5729-498e-b710-249e90217f66/_temporary/0/task_202309081121348303587356551291178_0001_m_000002
23/09/08 11:22:04 INFO SparkHadoopMapRedUtil: attempt_202309081121348303587356551291178_0001_m_000002_3: Committed. Elapsed time: 9 ms.
23/09/08 11:22:04 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 3470 bytes result sent to driver

9 Technical Background - Multiple Spark Jobs Inserting into Different Partitions of a Partitioned Table, Mixing Dynamic-Partition and Static-Partition Modes

  • According to the author's tests, as long as no more than two of the jobs insert data in static-partition mode (there can be any number of jobs inserting in dynamic-partition mode), no error occurs.
  • Screenshots taken during execution: (not included here)
10 Technical Background - Configuring Spark to Use the Hive SerDe Instead of the Built-in Data Source Writer

  • Configure Spark to use the Hive SerDe instead of Spark's built-in data source writer, i.e. set spark.sql.hive.convertInsertingPartitionedTable=false and spark.sql.hive.convertMetastoreOrc=false (these can be configured in kyuubi-defaults.conf or spark-defaults.conf, and can be set at the user or session level). Tests were then run against a non-partitioned table, a partitioned table in static-partition mode, and a partitioned table in dynamic-partition mode.
  • While the job/tasks are running, temporary files such as the following are generated: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000
  • After the task commit, the output lives at: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
  • After the job commit, the final file is: /user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-6efd7b7b-9a44-410a-b15d-1c5ee49a523f.c000
  • The key logs are as follows.

Key logs - all jobs and tasks succeeded:

23/09/08 17:35:01 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 17:35:01 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 17:35:01 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 17:35:01 INFO PhysicalFsWriter: ORC writer created for path: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-35-01_497_4555303478309834157-59/-ext-10000/_temporary/0/_temporary/attempt_20230908173501912656469073753420_0059_m_000000_59/part-00000-6d83cb93-228e-4717-bf77-83e36c10cbe8-c000 with stripeSize: 67108864 blockSize: 268435456 compression: Compress: ZLIB buffer: 262144
23/09/08 17:35:02 INFO FileOutputCommitter: Saved output of task 'attempt_202309081734587917020092949673358_0058_m_000000_58' to hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1/.hive-staging_hive_2023-09-08_17-34-58_485_4893366407663162793-58/-ext-10000/_temporary/0/task_202309081734587917020092949673358_0058_m_000000
23/09/08 17:35:02 INFO SparkHadoopMapRedUtil: attempt_202309081734587917020092949673358_0058_m_000000_58: Committed. Elapsed time: 4 ms.
23/09/08 17:35:02 INFO Executor: Finished task 0.0 in stage 58.0 (TID 58). 2498 bytes result sent to driver
23/09/08 17:35:42 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-35-42_083_5954793858553566623-61
23/09/08 17:35:42 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
23/09/08 17:35:42 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
23/09/08 17:35:42 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=21 with partSpec {g=21}
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=22 with partSpec {g=22}
23/09/08 17:49:00 INFO Hive: New loading path = hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/.hive-staging_hive_2023-09-08_17-48-20_910_862844915956183505-137/-ext-10000/g=23 with partSpec {g=23}
23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=21/part-00000-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=21
23/09/08 17:49:00 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=23/part-00002-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:00 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=23
23/09/08 17:49:01 INFO TrashPolicyDefault: Moved: 'hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000' to trash at: hdfs://nameservice1/user/hive/.Trash/Current/user/hive/warehouse/test_liming.db/table1_pt/g=22/part-00001-aac3aa0e-5de4-4b8d-ae29-725fb692ed1c.c000
23/09/08 17:49:01 INFO FileUtils: Creating directory if it doesn't exist: hdfs://nameservice1/user/hive/warehouse/test_liming.db/table1_pt/g=22
23/09/08 17:49:01 INFO Hive: Loaded 3 partitions

  • Screenshots taken during execution (non-partitioned table, static-partition mode, dynamic-partition mode): (not included here)

11 Technical Background - Configuring Spark Not to Clean Up Temporary Directories

  • Configure Spark not to clean up the temporary directories used during job execution, i.e. set the Spark parameter spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true. Tests were then run against a non-partitioned table, a partitioned table in static-partition mode, and a partitioned table in dynamic-partition mode.
  • Note that with this setting, _temporary directories are left behind after the job finishes and have to be cleaned up asynchronously by hand (see the cleanup sketch below).
  • Screenshots taken during execution (non-partitioned table, static-partition mode): (not included here)
  • In the dynamic-partition case, /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/_temporary/0 is created during execution, but /user/hive/warehouse/test_liming.db/table1_pt/.spark-staging-7809b23e-e675-42f4-93fd-97e6467ed5e4/ is still cleaned up when the job finishes, so in the end only the following remains: (screenshot not included here)
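
If you adopt the skip-cleanup workaround, the leftover _temporary directories need to be removed once no job is writing to the table anymore. A minimal sketch of such an asynchronous cleanup from the driver, using the Hadoop FileSystem API (the table path and the one-day age threshold are illustrative assumptions, not part of the original article), could look like this:

import org.apache.hadoop.fs.Path

// Delete <table>/_temporary once it is old enough that no running job can
// still be using it. Run this out-of-band, e.g. from a scheduled maintenance job.
val tableDir = new Path("/user/hive/warehouse/test_liming.db/table1")
val tempDir  = new Path(tableDir, "_temporary")
val fs = tempDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

val oneDayMs = 24L * 60 * 60 * 1000
if (fs.exists(tempDir) &&
    fs.getFileStatus(tempDir).getModificationTime < System.currentTimeMillis() - oneDayMs) {
  fs.delete(tempDir, true)  // recursively remove the leftover commit directories
}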


This article is reposted from https://blog.csdn.net/xylf1988/article/details/140467459; copyright belongs to the original author, GXYKJ.
