Preface
A study summary of Hudi Spark SQL Call Procedures. Call Procedures are referred to as Stored Procedures on the official site; they were contributed in Hudi 0.11.0 by ForwardXu from Tencent. Besides the few procedures mentioned in the official docs, many other procedure commands are supported. This post covers a few of the ones I find most useful, mainly for querying statistics about the various files under a table's path.
Versions
Hudi master 0.13.0-SNAPSHOT
Spark 3.1.2 (in fact, every Spark version that Hudi supports also supports Call Procedures)
Kyuubi 1.5.2 (Kyuubi is used because its results include column names; Spark's built-in spark-sql does not return column names)
Parameter forms
Pass arguments by name, in any order; optional arguments may be omitted:
CALL system.procedure_name(arg_name_2 => arg_2, arg_name_1 => arg_1, ... arg_name_n => arg_n)
Pass arguments by position, in declaration order; optional arguments may be omitted:
CALL system.procedure_name(arg_1, arg_2, ... arg_n)
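To make the two call forms concrete, here is a small Python sketch (purely illustrative, not Hudi's actual argument-binding code) of how a procedure can resolve positional and named arguments against its declared parameter list and fill in defaults for omitted optional parameters:

```python
REQUIRED = object()  # sentinel marking a parameter with no default

def resolve_args(params, args=(), kwargs=None):
    """Bind positional and named arguments against a declared parameter
    list, mimicking how CALL supports both calling conventions.
    params: list of (name, default) pairs; default=REQUIRED means required."""
    kwargs = kwargs or {}
    names = [name for name, _ in params]
    resolved = {}
    # Positional arguments bind in declaration order.
    for name, value in zip(names, args):
        resolved[name] = value
    # Named arguments may appear in any order.
    for name, value in kwargs.items():
        if name not in names:
            raise ValueError(f"unknown argument: {name}")
        resolved[name] = value
    # Fill defaults for omitted optional parameters.
    for name, default in params:
        if name not in resolved:
            if default is REQUIRED:
                raise ValueError(f"missing required argument: {name}")
            resolved[name] = default
    return resolved

# Hypothetical signature mirroring show_commits(table, limit=10).
params = [("table", REQUIRED), ("limit", 10)]
print(resolve_args(params, kwargs={"table": "test_hudi_call_cow"}))
# {'table': 'test_hudi_call_cow', 'limit': 10}
print(resolve_args(params, args=("test_hudi_call_cow", 1)))
# {'table': 'test_hudi_call_cow', 'limit': 1}
```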
Supported procedure commands
All of the procedure commands supported by a given version can be found in the class HoodieProcedures. The currently supported ones are listed below.
The definition:

private def initProcedureBuilders: Map[String, Supplier[ProcedureBuilder]] = {
  Map((RunCompactionProcedure.NAME, RunCompactionProcedure.builder)
    ,(ShowCompactionProcedure.NAME, ShowCompactionProcedure.builder)
    ,(CreateSavepointProcedure.NAME, CreateSavepointProcedure.builder)
    ,(DeleteSavepointProcedure.NAME, DeleteSavepointProcedure.builder)
    ,(RollbackToSavepointProcedure.NAME, RollbackToSavepointProcedure.builder)
    ,(RollbackToInstantTimeProcedure.NAME, RollbackToInstantTimeProcedure.builder)
    ,(RunClusteringProcedure.NAME, RunClusteringProcedure.builder)
    ,(ShowClusteringProcedure.NAME, ShowClusteringProcedure.builder)
    ,(ShowCommitsProcedure.NAME, ShowCommitsProcedure.builder)
    ,(ShowCommitsMetadataProcedure.NAME, ShowCommitsMetadataProcedure.builder)
    ,(ShowArchivedCommitsProcedure.NAME, ShowArchivedCommitsProcedure.builder)
    ,(ShowArchivedCommitsMetadataProcedure.NAME, ShowArchivedCommitsMetadataProcedure.builder)
    ,(ShowCommitFilesProcedure.NAME, ShowCommitFilesProcedure.builder)
    ,(ShowCommitPartitionsProcedure.NAME, ShowCommitPartitionsProcedure.builder)
    ,(ShowCommitWriteStatsProcedure.NAME, ShowCommitWriteStatsProcedure.builder)
    ,(CommitsCompareProcedure.NAME, CommitsCompareProcedure.builder)
    ,(ShowSavepointsProcedure.NAME, ShowSavepointsProcedure.builder)
    ,(DeleteMarkerProcedure.NAME, DeleteMarkerProcedure.builder)
    ,(ShowRollbacksProcedure.NAME, ShowRollbacksProcedure.builder)
    ,(ShowRollbackDetailProcedure.NAME, ShowRollbackDetailProcedure.builder)
    ,(ExportInstantsProcedure.NAME, ExportInstantsProcedure.builder)
    ,(ShowAllFileSystemViewProcedure.NAME, ShowAllFileSystemViewProcedure.builder)
    ,(ShowLatestFileSystemViewProcedure.NAME, ShowLatestFileSystemViewProcedure.builder)
    ,(ShowHoodieLogFileMetadataProcedure.NAME, ShowHoodieLogFileMetadataProcedure.builder)
    ,(ShowHoodieLogFileRecordsProcedure.NAME, ShowHoodieLogFileRecordsProcedure.builder)
    ,(StatsWriteAmplificationProcedure.NAME, StatsWriteAmplificationProcedure.builder)
    ,(StatsFileSizeProcedure.NAME, StatsFileSizeProcedure.builder)
    ,(HdfsParquetImportProcedure.NAME, HdfsParquetImportProcedure.builder)
    ,(RunBootstrapProcedure.NAME, RunBootstrapProcedure.builder)
    ,(ShowBootstrapMappingProcedure.NAME, ShowBootstrapMappingProcedure.builder)
    ,(ShowBootstrapPartitionsProcedure.NAME, ShowBootstrapPartitionsProcedure.builder)
    ,(UpgradeTableProcedure.NAME, UpgradeTableProcedure.builder)
    ,(DowngradeTableProcedure.NAME, DowngradeTableProcedure.builder)
    ,(ShowMetadataTableFilesProcedure.NAME, ShowMetadataTableFilesProcedure.builder)
    ,(ShowMetadataTablePartitionsProcedure.NAME, ShowMetadataTablePartitionsProcedure.builder)
    ,(CreateMetadataTableProcedure.NAME, CreateMetadataTableProcedure.builder)
    ,(DeleteMetadataTableProcedure.NAME, DeleteMetadataTableProcedure.builder)
    ,(InitMetadataTableProcedure.NAME, InitMetadataTableProcedure.builder)
    ,(ShowMetadataTableStatsProcedure.NAME, ShowMetadataTableStatsProcedure.builder)
    ,(ValidateMetadataTableFilesProcedure.NAME, ValidateMetadataTableFilesProcedure.builder)
    ,(ShowFsPathDetailProcedure.NAME, ShowFsPathDetailProcedure.builder)
    ,(CopyToTableProcedure.NAME, CopyToTableProcedure.builder)
    ,(RepairAddpartitionmetaProcedure.NAME, RepairAddpartitionmetaProcedure.builder)
    ,(RepairCorruptedCleanFilesProcedure.NAME, RepairCorruptedCleanFilesProcedure.builder)
    ,(RepairDeduplicateProcedure.NAME, RepairDeduplicateProcedure.builder)
    ,(RepairMigratePartitionMetaProcedure.NAME, RepairMigratePartitionMetaProcedure.builder)
    ,(RepairOverwriteHoodiePropsProcedure.NAME, RepairOverwriteHoodiePropsProcedure.builder)
    ,(RunCleanProcedure.NAME, RunCleanProcedure.builder)
    ,(ValidateHoodieSyncProcedure.NAME, ValidateHoodieSyncProcedure.builder)
    ,(ShowInvalidParquetProcedure.NAME, ShowInvalidParquetProcedure.builder)
    ,(HiveSyncProcedure.NAME, HiveSyncProcedure.builder)
    ,(CopyToTempView.NAME, CopyToTempView.builder)
    ,(ShowCommitExtraMetadataProcedure.NAME, ShowCommitExtraMetadataProcedure.builder)
    ,(ShowTablePropertiesProcedure.NAME, ShowTablePropertiesProcedure.builder)
  )
}

Print the procedure names:
initProcedureBuilders.keySet.foreach(println)
Create a table and generate data
create table test_hudi_call_cow (
show_table_properties
Returns the table configuration stored in hoodie.properties as key/value pairs.
Parameters
- table: table name
- path: table path
- limit: optional, default 10
At least one of table and path is required. table takes precedence over path: if both are specified, table wins and path is ignored.
Output fields:
key,value
Example
call show_table_properties(table => 'test_hudi_call_cow');
The first 10 rows are shown by default:
+--------------------------------------------------+----------------+
|                       key                        |     value      |
+--------------------------------------------------+----------------+
| hoodie.table.precombine.field                    | ts             |
| hoodie.datasource.write.drop.partition.columns   | false          |
| hoodie.table.partition.fields                    | dt             |
| hoodie.table.type                                | COPY_ON_WRITE  |
| hoodie.archivelog.folder                         | archived       |
| hoodie.timeline.layout.version                   | 1              |
| hoodie.table.version                             | 5              |
| hoodie.table.recordkey.fields                    | id             |
| hoodie.table.metadata.partitions                 | files          |
| hoodie.datasource.write.partitionpath.urlencode  | false          |
+--------------------------------------------------+----------------+
Set limit to a larger value to see all of the properties:
call show_table_properties(table => 'test_hudi_call_cow', limit => 100);
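Under the hood, these properties live in the .hoodie/hoodie.properties file at the table path, in plain Java-properties format. A minimal Python sketch of the key/value parsing (the sample keys are taken from the output above):

```python
def parse_properties(text):
    """Parse a Java-properties-style payload (like hoodie.properties)
    into a dict, skipping comments and blank lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """\
#Properties saved during table creation
hoodie.table.name=test_hudi_call_cow
hoodie.table.type=COPY_ON_WRITE
hoodie.table.version=5
hoodie.table.partition.fields=dt
"""
props = parse_properties(sample)
print(props["hoodie.table.type"])  # COPY_ON_WRITE
```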
show_commits
Parameters
- table: table name, required
- limit: optional, default 10
Output fields:
commit_time,action,total_bytes_written,total_files_added,total_files_updated,total_partitions_written,total_records_written,total_update_records_written,total_errors
Example
call show_commits(table => 'test_hudi_call_cow');
show_commits_metadata
Much like show_commits, but with different output fields. Both are implemented by ShowCommitsProcedure; the difference is that show_commits_metadata sets includeExtraMetadata to true while show_commits sets it to false.
Parameters
- table: table name, required
- limit: optional, default 10
Output fields:
commit_time,action,partition,file_id,previous_commit,num_writes,num_inserts,num_deletes,num_update_writes,total_errors,total_log_blocks,total_corrupt_log_blocks,total_rollback_blocks,total_log_records,total_updated_records_compacted,total_bytes_written
Example
call show_commits_metadata(table => 'test_hudi_call_cow');
show_commit_files
Returns the file information (such as the fileId) for the given instantTime.
Parameters
- table: table name, required
- instant_time: required
- limit: optional, default 10
Output fields:
action,partition_path,file_id,previous_commit,total_records_updated,total_records_written,total_bytes_written,total_errors,file_size
Example
call show_commit_files(table => 'test_hudi_call_cow', instant_time => '20221123205701931');
Because the test data set is small and has only one partition, there is only one file.
Run the following SQL so that a single commit touches two files:
merge into test_hudi_call_cow as t0
using (
  select 5 as id, 'hudi' as name, 112 as price, 98 as ts, '2022-11-23' as dt,'INSERT' as opt_type union
  select 2 as id, 'hudi_2' as name, 10 as price, 100 as ts, '2021-05-05' as dt,'UPDATE' as opt_type union
  select 4 as id, 'hudi' as name, 10 as price, 100 as ts, '2021-05-05' as dt ,'DELETE' as opt_type
 ) as s0
on t0.id = s0.id
when matched and opt_type!='DELETE' then update set *
when matched and opt_type='DELETE' then delete
when not matched and opt_type!='DELETE' then insert *;
First use show_commits to check the latest commit_time, which is 20221123232449644:
call show_commits(table => 'test_hudi_call_cow', limit => 1);
+--------------------+---------+----------------------+--------------------+----------------------+---------------------------+------------------------+-------------------------------+---------------+
|    commit_time     | action  | total_bytes_written  | total_files_added  | total_files_updated  | total_partitions_written  | total_records_written  | total_update_records_written  | total_errors  |
+--------------------+---------+----------------------+--------------------+----------------------+---------------------------+------------------------+-------------------------------+---------------+
| 20221123232449644  | commit  | 870474               | 1                  | 1                    | 2                         | 4                      | 1                             | 0             |
+--------------------+---------+----------------------+--------------------+----------------------+---------------------------+------------------------+-------------------------------+---------------+
Then use show_commit_files to look at the files written by 20221123232449644:
call show_commit_files(table => 'test_hudi_call_cow', instant_time => '20221123232449644');
+---------+-----------------+-----------------------------------------+--------------------+------------------------+------------------------+----------------------+---------------+------------+
| action  | partition_path  |                 file_id                 |  previous_commit   | total_records_updated  | total_records_written  | total_bytes_written  | total_errors  | file_size  |
+---------+-----------------+-----------------------------------------+--------------------+------------------------+------------------------+----------------------+---------------+------------+
| commit  | dt=2022-11-23   | 8f2aecfd-198f-405b-ab5d-46e0cc997d97-0  | null               | 0                      | 1                      | 435176               | 0             | 435176     |
| commit  | dt=2021-05-05   | 35b07424-6e63-4b65-9182-7c37cbe756b1-0  | 20221123231230786  | 1                      | 3                      | 435298               | 0             | 435298     |
+---------+-----------------+-----------------------------------------+--------------------+------------------------+------------------------+----------------------+---------------+------------+
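A useful sanity check: the per-commit totals reported by show_commits are aggregations of the per-file stats that show_commit_files lists. Recomputing them in Python from the two rows above reproduces the show_commits row for 20221123232449644:

```python
# Per-file write stats for commit 20221123232449644, copied from the
# show_commit_files output above.
file_stats = [
    {"partition": "dt=2022-11-23", "records_written": 1,
     "records_updated": 0, "bytes_written": 435176},
    {"partition": "dt=2021-05-05", "records_written": 3,
     "records_updated": 1, "bytes_written": 435298},
]

# Aggregate the per-file stats into the per-commit totals.
totals = {
    "total_bytes_written": sum(f["bytes_written"] for f in file_stats),
    "total_records_written": sum(f["records_written"] for f in file_stats),
    "total_update_records_written": sum(f["records_updated"] for f in file_stats),
    "total_partitions_written": len({f["partition"] for f in file_stats}),
}
print(totals["total_bytes_written"])       # 870474, matching show_commits
print(totals["total_partitions_written"])  # 2
```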
show_commit_partitions
Returns, for the given instantTime, the file and record information for each partition involved.
Parameters
- table: table name, required
- instant_time: required
- limit: optional, default 10
Output fields:
action,partition_path,total_files_added,total_files_updated,total_records_inserted,total_records_updated,total_bytes_written,total_errors
Example
call show_commit_partitions(table => 'test_hudi_call_cow', instant_time => '20221123232449644');
show_commit_write_stats
Returns the write stats for the given instantTime.
Parameters
- table: table name, required
- instant_time: required
- limit: optional, default 10
Output fields:
action,total_bytes_written,total_records_written,avg_record_size
Example
call show_commit_write_stats(table => 'test_hudi_call_cow', instant_time => '20221123232449644');
show_commit_extra_metadata
Returns the extraMetadata stored in .commit, .deltacommit, and .replacecommit files.
Parameters
- table: table name, required
- instant_time: optional
- limit: optional, default 100
- metadata_key: optional, e.g. schema
Output fields:
instant_time,action,metadata_key,metadata_value
By default it returns the extraMetadata of the last commit file; if instant_time is specified, it returns the extraMetadata of that commit file instead.
First look at the contents of the .commit file. It contains an extraMetadata entry with a schema key and the corresponding schema value:
hadoop fs -cat hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/.hoodie/20221123232449644.commit
{
  "partitionToWriteStats" : {
    "dt=2022-11-23" : [ {
      "fileId" : "8f2aecfd-198f-405b-ab5d-46e0cc997d97-0",
      "path" : "dt=2022-11-23/8f2aecfd-198f-405b-ab5d-46e0cc997d97-0_1-238-2983_20221123232449644.parquet",
      "cdcStats" : null,
      "prevCommit" : "null",
      "numWrites" : 1,
      "numDeletes" : 0,
      "numUpdateWrites" : 0,
      "numInserts" : 1,
      "totalWriteBytes" : 435176,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "dt=2022-11-23",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 435176,
      "minEventTime" : null,
      "maxEventTime" : null,
      "runtimeStats" : {
        "totalScanTime" : 0,
        "totalUpsertTime" : 0,
        "totalCreateTime" : 1013
      }
    } ],
    "dt=2021-05-05" : [ {
      "fileId" : "35b07424-6e63-4b65-9182-7c37cbe756b1-0",
      "path" : "dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-238-2982_20221123232449644.parquet",
      "cdcStats" : null,
      "prevCommit" : "20221123231230786",
      "numWrites" : 3,
      "numDeletes" : 1,
      "numUpdateWrites" : 1,
      "numInserts" : 0,
      "totalWriteBytes" : 435298,
      "totalWriteErrors" : 0,
      "tempPath" : null,
      "partitionPath" : "dt=2021-05-05",
      "totalLogRecords" : 0,
      "totalLogFilesCompacted" : 0,
      "totalLogSizeCompacted" : 0,
      "totalUpdatedRecordsCompacted" : 0,
      "totalLogBlocks" : 0,
      "totalCorruptLogBlock" : 0,
      "totalRollbackBlocks" : 0,
      "fileSizeInBytes" : 435298,
      "minEventTime" : null,
      "maxEventTime" : null,
      "runtimeStats" : {
        "totalScanTime" : 0,
        "totalUpsertTime" : 4162,
        "totalCreateTime" : 0
      }
    } ]
  },
  "compacted" : false,
  "extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"test_hudi_call_cow_record\",\"namespace\":\"hoodie.test_hudi_call_cow\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"},{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"price\",\"type\":\"double\"},{\"name\":\"ts\",\"type\":\"long\"},{\"name\":\"dt\",\"type\":\"string\"}]}"
  },
  "operationType" : "UPSERT"
}
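show_commit_extra_metadata essentially extracts this extraMetadata map from the commit file's JSON. A small Python sketch of that extraction, using a trimmed-down version of the payload above:

```python
import json

# Trimmed-down commit payload; only the parts relevant here are kept.
commit_json = """{
  "compacted": false,
  "extraMetadata": {
    "schema": "{\\"type\\":\\"record\\",\\"name\\":\\"test_hudi_call_cow_record\\"}"
  },
  "operationType": "UPSERT"
}"""

def extra_metadata(commit_text, metadata_key=None):
    """Return (key, value) rows from the commit's extraMetadata,
    optionally filtered to a single key."""
    meta = json.loads(commit_text).get("extraMetadata", {})
    if metadata_key is not None:
        meta = {k: v for k, v in meta.items() if k == metadata_key}
    return sorted(meta.items())

rows = extra_metadata(commit_json)
print([k for k, _ in rows])  # ['schema']
# With only one key present, filtering by metadata_key changes nothing:
assert rows == extra_metadata(commit_json, metadata_key="schema")
```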
Example
call show_commit_extra_metadata(table => 'test_hudi_call_cow');
call show_commit_extra_metadata(table => 'test_hudi_call_cow', instant_time => '20221123205701931', metadata_key => 'schema', limit => 10);
As far as I know, schema is the only key in extraMetadata, and there is only one entry, so only one record is returned and specifying metadata_key makes no difference. I am not sure whether other extraMetadata keys exist.
show_fs_path_detail
Shows statistics about the files and directories under the given path, sorted by file size by default.
Parameters
- path: file path, required
- is_sub: optional, whether to list subdirectories (first level only), default false
- sort: optional, whether to sort by file size, default true
Output fields:
path_num,file_num,storage_size,storage_size(unit),storage_path,space_consumed,quota,space_quota
Example
call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow');
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| path_num  | file_num  | storage_size  | storage_size(unit)  |                    storage_path                    | space_consumed  |  quota   | space_quota  |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| 22        | 53        | 3200109       | 3.05MB              | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow | -1              | 9600327  | -1           |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
List the first-level subdirectories:
call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow', is_sub => true);
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| path_num  | file_num  | storage_size  | storage_size(unit)  |                    storage_path                    | space_consumed  |  quota   | space_quota  |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| 1         | 7         | 2611728       | 2.49MB              | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05 | -1              | 7835184  | -1           |
| 1         | 2         | 435272        | 425.07KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2022-11-23 | -1              | 1305816  | -1           |
| 19        | 44        | 153109        | 149.52KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/.hoodie | -1              | 459327   | -1           |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
List a second-level subdirectory by passing a first-level subdirectory as the path:
call show_fs_path_detail(path => 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05', is_sub => true);
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| path_num  | file_num  | storage_size  | storage_size(unit)  |                    storage_path                    | space_consumed  |  quota   | space_quota  |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
| 0         | 1         | 435353        | 425.15KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-192-1549_20221123231230786.parquet | -1              | 1306059  | -1           |
| 0         | 1         | 435308        | 425.11KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-147-118_20221123205701931.parquet | -1              | 1305924  | -1           |
| 0         | 1         | 435298        | 425.10KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-238-2982_20221123232449644.parquet | -1              | 1305894  | -1           |
| 0         | 1         | 435279        | 425.08KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-105-83_20221123205650038.parquet | -1              | 1305837  | -1           |
| 0         | 1         | 435246        | 425.04KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-66-52_20221123205636715.parquet | -1              | 1305738  | -1           |
| 0         | 1         | 435148        | 424.95KB            | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/35b07424-6e63-4b65-9182-7c37cbe756b1-0_0-27-21_20221123205546254.parquet | -1              | 1305444  | -1           |
| 0         | 1         | 96            | 96B                 | hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_hudi_call_cow/dt=2021-05-05/.hoodie_partition_metadata | -1              | 288      | -1           |
+-----------+-----------+---------------+---------------------+----------------------------------------------------+-----------------+----------+--------------+
Related reading
- Apache Hudi Getting Started Summary
- Hudi Spark SQL Summary
- Compiling, Packaging, and Deploying Spark 3.1.2 + Kyuubi 1.5.2 + kyuubi-spark-authz with HA
- Hudi Spark SQL Source Code Study Notes: Create Table
- Hudi Spark SQL Source Code Study Notes: CTAS
- Hudi Spark Source Code Study Notes: df.write.format("hudi").save
- Hudi Spark Source Code Study Notes: spark.read.format("hudi").load
- Hudi Spark Source Code Study Notes: spark.read.format("hudi").load (2)
- Hudi Spark SQL Source Code Study Notes: select (queries)
- Hudi Source Code | Insert Source Code Analysis Summary (2) (WorkloadProfile)
- Open Source Experience | How a Beginner Became an Apache Hudi Contributor