

Hudi Spark SQL Call Procedures Study Notes (1) (Querying Table File Statistics)

Preface

These are my study notes on Hudi Spark SQL Call Procedures. Call Procedures are referred to as Stored Procedures on the official site. They were contributed in Hudi 0.11.0 by ForwardXu of Tencent. Beyond the few procedures mentioned in the official docs, many other procedure commands are supported. This post covers the ones I find most useful, mainly for querying statistics about the various files under a table's path.

Versions

Hudi master 0.13.0-SNAPSHOT
Spark 3.1.2 (in fact, every Spark version supported by Hudi supports Call Procedures)
Kyuubi 1.5.2 (Kyuubi is used because it displays column names in the results; Spark's built-in spark-sql does not return column names)

Argument forms

Pass arguments by name. Order does not matter, and optional arguments can be omitted:

  CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n)

Pass arguments by position. Order matters, and trailing optional arguments can be omitted:

  CALL system.procedure_name(arg_1, arg_2, ... arg_n)
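The two calling conventions can be illustrated with a toy argument resolver in Python. This is a hypothetical helper for intuition only, not Hudi's actual CALL parser:

```python
def resolve_call_args(signature, args):
    """Toy resolver for CALL-style arguments (not Hudi's parser).

    signature: ordered (name, default) pairs; default=None marks required.
    args: plain values (positional) and/or (name, value) tuples (named).
    """
    resolved = {name: default for name, default in signature}
    order = [name for name, _ in signature]
    pos = 0
    for arg in args:
        if isinstance(arg, tuple):      # named: arg_name => value, any order
            name, value = arg
            resolved[name] = value
        else:                           # positional: bound left to right
            resolved[order[pos]] = arg
            pos += 1
    missing = [n for n, v in resolved.items() if v is None]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return resolved

# Named, out of order, optional limit omitted (falls back to its default):
print(resolve_call_args([("table", None), ("limit", 10)],
                        [("table", "test_hudi_call_cow")]))
```

Either way the arguments resolve to the same name/value mapping, with defaults filled in for omitted optional parameters.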

Supported procedures

All procedures supported by a given version can be found in the class

  HoodieProcedures

The currently supported ones are:

  1. show_fs_path_detail
  2. show_bootstrap_partitions
  3. repair_deduplicate
  4. create_metadata_table
  5. stats_file_sizes
  6. validate_metadata_table_files
  7. show_commit_partitions
  8. show_commit_extra_metadata
  9. show_table_properties
  10. run_clustering
  11. run_bootstrap
  12. show_commit_files
  13. run_clean
  14. show_rollback_detail
  15. rollback_to_savepoint
  16. show_fsview_all
  17. show_compaction
  18. copy_to_temp_view
  19. show_invalid_parquet
  20. delete_savepoint
  21. show_bootstrap_mapping
  22. show_archived_commits
  23. show_fsview_latest
  24. show_metadata_table_files
  25. export_instants
  26. show_commits_metadata
  27. rollback_to_instant
  28. delete_metadata_table
  29. delete_marker
  30. show_metadata_table_stats
  31. sync_validate
  32. copy_to_table
  33. show_savepoints
  34. init_metadata_table
  35. repair_overwrite_hoodie_props
  36. show_metadata_table_partitions
  37. show_logfile_records
  38. downgrade_table
  39. show_clustering
  40. repair_migrate_partition_meta
  41. show_rollbacks
  42. show_logfile_metadata
  43. upgrade_table
  44. repair_add_partition_meta
  45. hive_sync
  46. commits_compare
  47. hdfs_parquet_import
  48. show_commit_write_stats
  49. show_commits
  50. show_archived_commits_metadata
  51. run_compaction
  52. create_savepoint
  53. repair_corrupted_clean_files
  54. stats_wa

Their definitions:

  private def initProcedureBuilders: Map[String, Supplier[ProcedureBuilder]] = {
    Map(
      (RunCompactionProcedure.NAME, RunCompactionProcedure.builder),
      (ShowCompactionProcedure.NAME, ShowCompactionProcedure.builder),
      (CreateSavepointProcedure.NAME, CreateSavepointProcedure.builder),
      (DeleteSavepointProcedure.NAME, DeleteSavepointProcedure.builder),
      (RollbackToSavepointProcedure.NAME, RollbackToSavepointProcedure.builder),
      (RollbackToInstantTimeProcedure.NAME, RollbackToInstantTimeProcedure.builder),
      (RunClusteringProcedure.NAME, RunClusteringProcedure.builder),
      (ShowClusteringProcedure.NAME, ShowClusteringProcedure.builder),
      (ShowCommitsProcedure.NAME, ShowCommitsProcedure.builder),
      (ShowCommitsMetadataProcedure.NAME, ShowCommitsMetadataProcedure.builder),
      (ShowArchivedCommitsProcedure.NAME, ShowArchivedCommitsProcedure.builder),
      (ShowArchivedCommitsMetadataProcedure.NAME, ShowArchivedCommitsMetadataProcedure.builder),
      (ShowCommitFilesProcedure.NAME, ShowCommitFilesProcedure.builder),
      (ShowCommitPartitionsProcedure.NAME, ShowCommitPartitionsProcedure.builder),
      (ShowCommitWriteStatsProcedure.NAME, ShowCommitWriteStatsProcedure.builder),
      (CommitsCompareProcedure.NAME, CommitsCompareProcedure.builder),
      (ShowSavepointsProcedure.NAME, ShowSavepointsProcedure.builder),
      (DeleteMarkerProcedure.NAME, DeleteMarkerProcedure.builder),
      (ShowRollbacksProcedure.NAME, ShowRollbacksProcedure.builder),
      (ShowRollbackDetailProcedure.NAME, ShowRollbackDetailProcedure.builder),
      (ExportInstantsProcedure.NAME, ExportInstantsProcedure.builder),
      (ShowAllFileSystemViewProcedure.NAME, ShowAllFileSystemViewProcedure.builder),
      (ShowLatestFileSystemViewProcedure.NAME, ShowLatestFileSystemViewProcedure.builder),
      (ShowHoodieLogFileMetadataProcedure.NAME, ShowHoodieLogFileMetadataProcedure.builder),
      (ShowHoodieLogFileRecordsProcedure.NAME, ShowHoodieLogFileRecordsProcedure.builder),
      (StatsWriteAmplificationProcedure.NAME, StatsWriteAmplificationProcedure.builder),
      (StatsFileSizeProcedure.NAME, StatsFileSizeProcedure.builder),
      (HdfsParquetImportProcedure.NAME, HdfsParquetImportProcedure.builder),
      (RunBootstrapProcedure.NAME, RunBootstrapProcedure.builder),
      (ShowBootstrapMappingProcedure.NAME, ShowBootstrapMappingProcedure.builder),
      (ShowBootstrapPartitionsProcedure.NAME, ShowBootstrapPartitionsProcedure.builder),
      (UpgradeTableProcedure.NAME, UpgradeTableProcedure.builder),
      (DowngradeTableProcedure.NAME, DowngradeTableProcedure.builder),
      (ShowMetadataTableFilesProcedure.NAME, ShowMetadataTableFilesProcedure.builder),
      (ShowMetadataTablePartitionsProcedure.NAME, ShowMetadataTablePartitionsProcedure.builder),
      (CreateMetadataTableProcedure.NAME, CreateMetadataTableProcedure.builder),
      (DeleteMetadataTableProcedure.NAME, DeleteMetadataTableProcedure.builder),
      (InitMetadataTableProcedure.NAME, InitMetadataTableProcedure.builder),
      (ShowMetadataTableStatsProcedure.NAME, ShowMetadataTableStatsProcedure.builder),
      (ValidateMetadataTableFilesProcedure.NAME, ValidateMetadataTableFilesProcedure.builder),
      (ShowFsPathDetailProcedure.NAME, ShowFsPathDetailProcedure.builder),
      (CopyToTableProcedure.NAME, CopyToTableProcedure.builder),
      (RepairAddpartitionmetaProcedure.NAME, RepairAddpartitionmetaProcedure.builder),
      (RepairCorruptedCleanFilesProcedure.NAME, RepairCorruptedCleanFilesProcedure.builder),
      (RepairDeduplicateProcedure.NAME, RepairDeduplicateProcedure.builder),
      (RepairMigratePartitionMetaProcedure.NAME, RepairMigratePartitionMetaProcedure.builder),
      (RepairOverwriteHoodiePropsProcedure.NAME, RepairOverwriteHoodiePropsProcedure.builder),
      (RunCleanProcedure.NAME, RunCleanProcedure.builder),
      (ValidateHoodieSyncProcedure.NAME, ValidateHoodieSyncProcedure.builder),
      (ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder),
      (HiveSyncProcedure.NAME, HiveSyncProcedure.builder),
      (CopyToTempView.NAME, CopyToTempView.builder),
      (ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder),
      (ShowTablePropertiesProcedure.NAME, ShowTablePropertiesProcedure.builder)
    )
  }

The list above can be printed with:

  initProcedureBuilders.keySet.foreach(println)
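The registry above is a simple name-to-builder map, where each value is a supplier that builds a fresh procedure per CALL. The same pattern can be sketched in Python (illustrative code only, not Hudi's Scala API):

```python
# Toy sketch of the HoodieProcedures name -> builder registry pattern.
class ShowCommitsProcedure:
    NAME = "show_commits"

    @staticmethod
    def builder():
        # In Hudi the supplier yields a fresh ProcedureBuilder;
        # here we simply return a new instance.
        return ShowCommitsProcedure()

PROCEDURE_BUILDERS = {
    ShowCommitsProcedure.NAME: ShowCommitsProcedure.builder,
}

def new_procedure(name):
    """Look up the supplier for a CALL name and build an instance."""
    supplier = PROCEDURE_BUILDERS.get(name)
    if supplier is None:
        raise ValueError(f"unsupported procedure: {name}")
    return supplier()
```

Registering suppliers rather than instances keeps each CALL's state isolated, which is why the Scala map stores `X.builder` rather than a constructed procedure.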

Create a table and insert test data

  create table test_hudi_call_cow (
    id int,
    name string,
    price double,
    ts long,
    dt string
  ) using hudi
  partitioned by (dt)
  options (
    primaryKey = 'id',
    preCombineField = 'ts',
    type = 'cow'
  );
  insert into test_hudi_call_cow values (1, 'hudi', 10, 100, '2021-05-05');
  insert into test_hudi_call_cow values (2, 'hudi', 10, 100, '2021-05-05');
  insert into test_hudi_call_cow values (3, 'hudi', 10, 100, '2021-05-05');
  insert into test_hudi_call_cow values (4, 'hudi', 10, 100, '2021-05-05');

show_table_properties

Shows the table's properties, i.e. the table configuration stored in

  hoodie.properties

returned as key/value pairs.

Parameters

  • table: table name
  • path: table path
  • limit: optional, default 10

At least one of table and path must be provided. table takes precedence over path: if both are specified, table is used and path is ignored.
Output fields: key, value

Examples

  call show_table_properties(table => 'test_hudi_call_cow');
  call show_table_properties(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow');

By default only the first 10 entries are shown:

  +--------------------------------------------------+----------------+
  | key                                              | value          |
  +--------------------------------------------------+----------------+
  | hoodie.table.precombine.field                    | ts             |
  | hoodie.datasource.write.drop.partition.columns   | false          |
  | hoodie.table.partition.fields                    | dt             |
  | hoodie.table.type                                | COPY_ON_WRITE  |
  | hoodie.archivelog.folder                         | archived       |
  | hoodie.timeline.layout.version                   | 1              |
  | hoodie.table.version                             | 5              |
  | hoodie.table.recordkey.fields                    | id             |
  | hoodie.table.metadata.partitions                 | files          |
  | hoodie.datasource.write.partitionpath.urlencode  | false          |
  +--------------------------------------------------+----------------+

Set limit to a larger value to see all of the properties:

  call show_table_properties(table => 'test_hudi_call_cow', limit => 100);

  +--------------------------------------------------+--------------------------------------------+
  | key                                              | value                                      |
  +--------------------------------------------------+--------------------------------------------+
  | hoodie.table.precombine.field                    | ts                                         |
  | hoodie.datasource.write.drop.partition.columns   | false                                      |
  | hoodie.table.partition.fields                    | dt                                         |
  | hoodie.table.type                                | COPY_ON_WRITE                              |
  | hoodie.archivelog.folder                         | archived                                   |
  | hoodie.timeline.layout.version                   | 1                                          |
  | hoodie.table.version                             | 5                                          |
  | hoodie.table.recordkey.fields                    | id                                         |
  | hoodie.table.metadata.partitions                 | files                                      |
  | hoodie.datasource.write.partitionpath.urlencode  | false                                      |
  | hoodie.database.name                             | hudi                                       |
  | hoodie.table.name                                | test_hudi_call_cow                         |
  | hoodie.table.keygenerator.class                  | org.apache.hudi.keygen.SimpleKeyGenerator  |
  | hoodie.datasource.write.hive_style_partitioning  | true                                       |
  | hoodie.table.create.schema                       | {"type":"record","name":"test_hudi_call_cow_record","namespace":"hoodie.test_hudi_call_cow","fields":[{"name":"_hoodie_commit_time","type":["string","null"]},{"name":"_hoodie_commit_seqno","type":["string","null"]},{"name":"_hoodie_record_key","type":["string","null"]},{"name":"_hoodie_partition_path","type":["string","null"]},{"name":"_hoodie_file_name","type":["string","null"]},{"name":"id","type":["int","null"]},{"name":"name","type":["string","null"]},{"name":"price","type":["double","null"]},{"name":"ts","type":["long","null"]},{"name":"dt","type":["string","null"]}]} |
  | hoodie.table.checksum                            | 2721425243                                 |
  | hoodie.allow.operation.metadata.field            | false                                      |
  +--------------------------------------------------+--------------------------------------------+
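Under the hood these rows come from the hoodie.properties file, which uses the Java properties key=value format. A minimal Python sketch of turning such text into (key, value) rows (an illustration of the idea, not Hudi's implementation):

```python
def parse_properties(text, limit=10):
    """Parse Java-properties-style text (one key=value per line) into
    (key, value) rows, similar in shape to show_table_properties output."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        key, _, value = line.partition("=")
        rows.append((key.strip(), value.strip()))
    return rows[:limit]                        # honor the limit parameter

sample = """#Properties saved on ...
hoodie.table.name=test_hudi_call_cow
hoodie.table.type=COPY_ON_WRITE
"""
print(parse_properties(sample))
```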

show_commits

Parameters

  • table: table name, required
  • limit: optional, default 10

Output fields: commit_time, action, total_bytes_written, total_files_added, total_files_updated, total_partitions_written, total_records_written, total_update_records_written, total_errors

Examples

  call show_commits(table => 'test_hudi_call_cow');
  call show_commits(table => 'test_hudi_call_cow', limit => 1);

  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+
  | commit_time        | action | total_bytes_written | total_files_added | total_files_updated | total_partitions_written | total_records_written | total_update_records_written | total_errors |
  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+
  | 20221123205701931  | commit | 435308              | 0                 | 1                   | 1                        | 4                     | 0                            | 0            |
  | 20221123205650038  | commit | 435279              | 0                 | 1                   | 1                        | 3                     | 0                            | 0            |
  | 20221123205636715  | commit | 435246              | 0                 | 1                   | 1                        | 2                     | 0                            | 0            |
  | 20221123205546254  | commit | 435148              | 1                 | 0                   | 1                        | 1                     | 0                            | 0            |
  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+

show_commits_metadata

Much the same as show_commits, but with different output fields. Both are implemented by ShowCommitsProcedure; the difference is that show_commits_metadata runs with includeExtraMetadata = true, while show_commits runs with includeExtraMetadata = false.

Parameters

  • table: table name, required
  • limit: optional, default 10

Output fields: commit_time, action, partition, file_id, previous_commit, num_writes, num_inserts, num_deletes, num_update_writes, total_errors, total_log_blocks, total_corrupt_log_blocks, total_rollback_blocks, total_log_records, total_updated_records_compacted, total_bytes_written

Examples

  call show_commits_metadata(table => 'test_hudi_call_cow');
  call show_commits_metadata(table => 'test_hudi_call_cow', limit => 1);

  | commit_time       | action | partition     | file_id                                | previous_commit   | num_writes | num_inserts | num_deletes | num_update_writes | total_errors | total_log_blocks | total_corrupt_log_blocks | total_rollback_blocks | total_log_records | total_updated_records_compacted | total_bytes_written |
  | 20221123205701931 | commit | dt=2021-05-05 | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | 20221123205650038 | 4          | 1           | 0           | 0                 | 0            | 0                | 0                        | 0                     | 0                 | 0                               | 435308              |
  | 20221123205650038 | commit | dt=2021-05-05 | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | 20221123205636715 | 3          | 1           | 0           | 0                 | 0            | 0                | 0                        | 0                     | 0                 | 0                               | 435279              |
  | 20221123205636715 | commit | dt=2021-05-05 | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | 20221123205546254 | 2          | 1           | 0           | 0                 | 0            | 0                | 0                        | 0                     | 0                 | 0                               | 435246              |
  | 20221123205546254 | commit | dt=2021-05-05 | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | null              | 1          | 1           | 0           | 0                 | 0            | 0                | 0                        | 0                     | 0                 | 0                               | 435148              |

show_commit_files

Returns the file information (for example the fileId) for a given instantTime.

Parameters

  • table: table name, required
  • instant_time: required
  • limit: optional, default 10

Output fields: action, partition_path, file_id, previous_commit, total_records_updated, total_records_written, total_bytes_written, total_errors, file_size

Example

  call show_commit_files(table => 'test_hudi_call_cow', instant_time => '20221123205701931');

Because the test data set is small and there is only one partition, there is only one file:

  | action | partition_path | file_id                                | previous_commit   | total_records_updated | total_records_written | total_bytes_written | total_errors | file_size |
  | commit | dt=2021-05-05  | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | 20221123205650038 | 0                     | 4                     | 435308              | 0            | 435308    |

Run the following SQL so that a single commit touches two files:

  merge into test_hudi_call_cow as t0
  using (
    select 5 as id, 'hudi'   as name, 112 as price, 98  as ts, '2022-11-23' as dt, 'INSERT' as opt_type
    union
    select 2 as id, 'hudi_2' as name, 10  as price, 100 as ts, '2021-05-05' as dt, 'UPDATE' as opt_type
    union
    select 4 as id, 'hudi'   as name, 10  as price, 100 as ts, '2021-05-05' as dt, 'DELETE' as opt_type
  ) as s0
  on t0.id = s0.id
  when matched and opt_type != 'DELETE' then update set *
  when matched and opt_type = 'DELETE' then delete
  when not matched and opt_type != 'DELETE' then insert *;

First use show_commits to confirm that the latest commit_time is 20221123232449644:

  call show_commits(table => 'test_hudi_call_cow', limit => 1);

  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+
  | commit_time        | action | total_bytes_written | total_files_added | total_files_updated | total_partitions_written | total_records_written | total_update_records_written | total_errors |
  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+
  | 20221123232449644  | commit | 870474              | 1                 | 1                   | 2                        | 4                     | 1                            | 0            |
  +--------------------+--------+---------------------+-------------------+---------------------+--------------------------+-----------------------+------------------------------+--------------+

Then use show_commit_files to look at the files for 20221123232449644:

  call show_commit_files(table => 'test_hudi_call_cow', instant_time => '20221123232449644');

  | action | partition_path | file_id                                | previous_commit   | total_records_updated | total_records_written | total_bytes_written | total_errors | file_size |
  | commit | dt=2022-11-23  | 8f2aecfd-198f-405b-ab5d-46e0cc997d97-0 | null              | 0                     | 1                     | 435176              | 0            | 435176    |
  | commit | dt=2021-05-05  | 35b07424-6e63-4b65-9182-7c37cbe756b1-0 | 20221123231230786 | 1                     | 3                     | 435298              | 0            | 435298    |

show_commit_partitions

Returns, for a given instantTime, the file and record information for each partition involved.

Parameters

  • table: table name, required
  • instant_time: required
  • limit: optional, default 10

Output fields: action, partition_path, total_files_added, total_files_updated, total_records_inserted, total_records_updated, total_bytes_written, total_errors

Example

  call show_commit_partitions(table => 'test_hudi_call_cow', instant_time => '20221123232449644');

  | action | partition_path | total_files_added | total_files_updated | total_records_inserted | total_records_updated | total_bytes_written | total_errors |
  | commit | dt=2022-11-23  | 1                 | 0                   | 1                      | 0                     | 435176              | 0            |
  | commit | dt=2021-05-05  | 0                 | 1                   | 0                      | 1                     | 435298              | 0            |
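These per-partition rows can be derived from the partitionToWriteStats section of the commit file. A rough Python sketch of that aggregation (my reading of how the numbers line up, not Hudi's code):

```python
def commit_partition_stats(partition_to_write_stats):
    """Aggregate a commit's partitionToWriteStats per partition,
    roughly what show_commit_partitions reports."""
    rows = {}
    for partition, stats in partition_to_write_stats.items():
        row = {"total_files_added": 0, "total_files_updated": 0,
               "total_records_inserted": 0, "total_records_updated": 0,
               "total_bytes_written": 0}
        for s in stats:
            # A prevCommit of "null" means the base file was newly created
            if s["prevCommit"] == "null":
                row["total_files_added"] += 1
            else:
                row["total_files_updated"] += 1
            row["total_records_inserted"] += s["numInserts"]
            row["total_records_updated"] += s["numUpdateWrites"]
            row["total_bytes_written"] += s["totalWriteBytes"]
        rows[partition] = row
    return rows
```

Feeding in the dt=2022-11-23 entry from the .commit file shown later (one new file, one insert, 435176 bytes) reproduces the first row of the table above.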

show_commit_write_stats

Returns the write stats for a given instantTime.

Parameters

  • table: table name, required
  • instant_time: required
  • limit: optional, default 10

Output fields: action, total_bytes_written, total_records_written, avg_record_size

Example

  call show_commit_write_stats(table => 'test_hudi_call_cow', instant_time => '20221123232449644');

  +--------+---------------------+-----------------------+-----------------+
  | action | total_bytes_written | total_records_written | avg_record_size |
  +--------+---------------------+-----------------------+-----------------+
  | commit | 870474              | 4                     | 217619          |
  +--------+---------------------+-----------------------+-----------------+
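Judging from the output, avg_record_size appears to be total_bytes_written divided by total_records_written, rounded (870474 / 4 = 217618.5, shown as 217619). A sketch under that assumption (the rounding mode is my guess, not taken from Hudi's source):

```python
def write_stats(total_bytes_written, total_records_written):
    """Assumed derivation of show_commit_write_stats' avg_record_size:
    bytes per record, rounded half up."""
    avg = int(total_bytes_written / total_records_written + 0.5)
    return {"total_bytes_written": total_bytes_written,
            "total_records_written": total_records_written,
            "avg_record_size": avg}

print(write_stats(870474, 4))
```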

show_commit_extra_metadata

Returns the extraMetadata from .commit, .deltacommit, and .replacecommit files.

Parameters

  • table: table name, required
  • instant_time: optional
  • limit: optional, default 100
  • metadata_key: optional, e.g. schema

Output fields: instant_time, action, metadata_key, metadata_value

By default it returns the extraMetadata of the last commit file; if instant_time is specified, it returns the extraMetadata of the commit file for that instant.

First look at the contents of the .commit file. It contains an extraMetadata section with a single key, schema, whose value is the corresponding schema:

  hadoop fs -cat hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/.hoodie/20221123232449644.commit
  {"partitionToWriteStats": {
    "dt=2022-11-23": [{
      "fileId": "8f2aecfd-198f-405b-ab5d-46e0cc997d97-0",
      "path": "dt=2022-11-23/8f2aecfd-198f-405b-ab5d-46e0cc997d97-0_1-238-2983_20221123232449644.parquet",
      "cdcStats": null,
      "prevCommit": "null",
      "numWrites": 1,
      "numDeletes": 0,
      "numUpdateWrites": 0,
      "numInserts": 1,
      "totalWriteBytes": 435176,
      "totalWriteErrors": 0,
      "tempPath": null,
      "partitionPath": "dt=2022-11-23",
      "totalLogRecords": 0,
      "totalLogFilesCompacted": 0,
      "totalLogSizeCompacted": 0,
      "totalUpdatedRecordsCompacted": 0,
      "totalLogBlocks": 0,
      "totalCorruptLogBlock": 0,
      "totalRollbackBlocks": 0,
      "fileSizeInBytes": 435176,
      "minEventTime": null,
      "maxEventTime": null,
      "runtimeStats": {"totalScanTime": 0, "totalUpsertTime": 0, "totalCreateTime": 1013}}],
    "dt=2021-05-05": [{
      "fileId": "35b07424-6e63-4b65-9182-7c37cbe756b1-0",
      "path": "dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-238-2982_20221123232449644.parquet",
      "cdcStats": null,
      "prevCommit": "20221123231230786",
      "numWrites": 3,
      "numDeletes": 1,
      "numUpdateWrites": 1,
      "numInserts": 0,
      "totalWriteBytes": 435298,
      "totalWriteErrors": 0,
      "tempPath": null,
      "partitionPath": "dt=2021-05-05",
      "totalLogRecords": 0,
      "totalLogFilesCompacted": 0,
      "totalLogSizeCompacted": 0,
      "totalUpdatedRecordsCompacted": 0,
      "totalLogBlocks": 0,
      "totalCorruptLogBlock": 0,
      "totalRollbackBlocks": 0,
      "fileSizeInBytes": 435298,
      "minEventTime": null,
      "maxEventTime": null,
      "runtimeStats": {"totalScanTime": 0, "totalUpsertTime": 4162, "totalCreateTime": 0}}]},
  "compacted": false,
  "extraMetadata": {"schema": "{\"type\":\"record\",\"name\":\"test_hudi_call_cow_record\",\"namespace\":\"hoodie.test_hudi_call_cow\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"},{\"name\":\"ts\",\"type\":\"long\"},{\"name\":\"dt\",\"type\":\"string\"}]}"},
  "operationType": "UPSERT"}
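The procedure essentially extracts the extraMetadata object from this JSON and filters it by metadata_key. A minimal Python sketch of that extraction (an illustration, not Hudi's code):

```python
import json

def commit_extra_metadata(commit_json, metadata_key=None):
    """Pull (metadata_key, metadata_value) rows out of a .commit file's
    JSON, similar in shape to show_commit_extra_metadata output."""
    extra = json.loads(commit_json).get("extraMetadata", {})
    return [(k, v) for k, v in extra.items()
            if metadata_key is None or k == metadata_key]

doc = '{"extraMetadata": {"schema": "avro-schema-json"}}'
print(commit_extra_metadata(doc))
```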

Examples

  call show_commit_extra_metadata(table => 'test_hudi_call_cow');

  | instant_time      | action | metadata_key | metadata_value |
  | 20221123232449644 | commit | schema       | {"type":"record","name":"test_hudi_call_cow_record","namespace":"hoodie.test_hudi_call_cow","fields":[{"name":"id","type":"int"},{"name":"name","type":"string"},{"name":"price","type":"double"},{"name":"ts","type":"long"},{"name":"dt","type":"string"}]} |

  call show_commit_extra_metadata(table => 'test_hudi_call_cow', instant_time => '20221123205701931', metadata_key => 'schema', limit => 10);

  | instant_time      | action | metadata_key | metadata_value |
  | 20221123205701931 | commit | schema       | {"type":"record","name":"test_hudi_call_cow_record","namespace":"hoodie.test_hudi_call_cow","fields":[{"name":"id","type":"int"},{"name":"name","type":"string"},{"name":"price","type":"double"},{"name":"ts","type":"long"},{"name":"dt","type":"string"}]} |

To my knowledge the only extraMetadata key is schema, and there is only one entry, so a single record is returned whether or not metadata_key is specified. I am not sure whether other extraMetadata keys exist.

show_fs_path_detail

Shows statistics for the files and directories under a given path, sorted by file size by default.

Parameters

  • path: file path, required
  • is_sub: optional, whether to list the first-level subdirectories instead (only one level deep), default false
  • sort: optional, whether to sort by file size, default true

Output fields: path_num, file_num, storage_size, storage_size(unit), storage_path, space_consumed, quota, space_quota
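For intuition, here is a local-filesystem sketch of the same counting and of the storage_size(unit) formatting. Hudi's implementation queries HDFS via the Hadoop FileSystem API; this Python version only mimics the idea:

```python
import os

def human_size(n):
    """Format bytes like the storage_size(unit) column, e.g. 3200109 -> '3.05MB'."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n}{unit}" if unit == "B" else f"{n:.2f}{unit}"
        n /= 1024

def fs_path_detail(path):
    """Count subdirectories, files and total bytes under a local path,
    mimicking path_num / file_num / storage_size (local sketch only)."""
    path_num = file_num = storage_size = 0
    for root, dirs, files in os.walk(path):
        path_num += len(dirs)
        file_num += len(files)
        storage_size += sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return path_num, file_num, storage_size, human_size(storage_size)
```

Note that space_consumed in the real output accounts for HDFS block replication, which a local walk cannot reproduce.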

Example

  call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow');

  | path_num | file_num | storage_size | storage_size(unit) | storage_path                                                                  | space_consumed | quota   | space_quota |
  | 22       | 53       | 3200109      | 3.05MB             | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow  | -1             | 9600327 | -1          |

List the first-level subdirectories:

  call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow', is_sub => true);

  | path_num | file_num | storage_size | storage_size(unit) | storage_path                                                                                | space_consumed | quota   | space_quota |
  | 1        | 7        | 2611728      | 2.49MB             | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05  | -1             | 7835184 | -1          |
  | 1        | 2        | 435272       | 425.07KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2022-11-23  | -1             | 1305816 | -1          |
  | 19       | 44       | 153109       | 149.52KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/.hoodie        | -1             | 459327  | -1          |

List a second-level subdirectory:

  call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05', is_sub => true);

  | path_num | file_num | storage_size | storage_size(unit) | storage_path                                                                                                                                                         | space_consumed | quota   | space_quota |
  | 0        | 1        | 435353       | 425.15KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-192-1549_20221123231230786.parquet | -1             | 1306059 | -1          |
  | 0        | 1        | 435308       | 425.11KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-147-118_20221123205701931.parquet  | -1             | 1305924 | -1          |
  | 0        | 1        | 435298       | 425.10KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-238-2982_20221123232449644.parquet | -1             | 1305894 | -1          |
  | 0        | 1        | 435279       | 425.08KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-105-83_20221123205650038.parquet   | -1             | 1305837 | -1          |
  | 0        | 1        | 435246       | 425.04KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-66-52_20221123205636715.parquet    | -1             | 1305738 | -1          |
  | 0        | 1        | 435148       | 424.95KB           | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-27-21_20221123205546254.parquet    | -1             | 1305444 | -1          |
  | 0        | 1        | 96           | 96B                | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/.hoodie_partition_metadata                                                  | -1             | 288     | -1          |

Related reading

  • Apache Hudi Getting Started Summary
  • Hudi Spark SQL Summary
  • Spark 3.1.2 + Kyuubi 1.5.2 + kyuubi-spark-authz: source build, packaging, deployment, and HA configuration
  • Hudi Spark SQL Source Code Study: Create Table
  • Hudi Spark SQL Source Code Study: CTAS
  • Hudi Spark Source Code Study: df.write.format("hudi").save
  • Hudi Spark Source Code Study: spark.read.format("hudi").load
  • Hudi Spark Source Code Study: spark.read.format("hudi").load (2)
  • Hudi Spark SQL Source Code Study: select (queries)
  • Hudi Source Code | Insert Source Code Analysis Summary (2) (WorkloadProfile)
  • Open Source Experience | How to Go from Beginner to Apache Hudi Contributor
Tags: spark, hudi, data lake

Reposted from: https://blog.csdn.net/dkl12/article/details/128018873
Copyright belongs to the original author, 董可伦 (Dong Kelun). Please contact us for removal in case of infringement.
