1. Environment and Test Data
Environment: CDH 6.3.0, Hive 2.1.1-cdh6.3.0
The base table is the partitioned table test1, with four columns a, b, c, d plus partition column p_day. Create it and insert two rows:
create table test1(a int,b bigint,c float,d string) partitioned by(p_day string) stored as parquet;
insert into test1 partition(p_day) values(1,11,1.1,'str1','2022-11-19'),(2,22,2.2,'str2','2022-11-19');
The table contents and the underlying Parquet file look like this:
0: jdbc:hive2://dev-master1:10000> select * from test1;
+----------+----------+----------+----------+--------------+
| test1.a | test1.b | test1.c | test1.d | test1.p_day |
+----------+----------+----------+----------+--------------+
| 1 | 11 | 1.1 | str1 | 2022-11-19 |
| 2 | 22 | 2.2 | str2 | 2022-11-19 |
+----------+----------+----------+----------+--------------+
[hive@dev-master1 tmp]$ hdfs dfs -get /user/hive/warehouse/debug_test.db/test1/p_day=2022-11-19/000000_0 ./
[hive@dev-master1 tmp]$ parquet-tools schema 000000_0
message hive_schema {
optional int32 a;
optional int64 b;
optional float c;
optional binary d (STRING);
}
[hive@dev-master1 tmp]$ parquet-tools cat 000000_0
a = 1
b = 11
c = 1.1
d = str1
a = 2
b = 22
c = 2.2
d = str2
2. Dropping Columns
2.1 Test table test2
create table test2(b bigint,a int,c float) partitioned by(p_day string) stored as parquet;
Point test2 directly at test1's data file:
[hive@dev-master1 tmp]$ hdfs dfs -mkdir /user/hive/warehouse/debug_test.db/test2/p_day=2022-11-19
[hive@dev-master1 tmp]$ hdfs dfs -cp /user/hive/warehouse/debug_test.db/test1/p_day=2022-11-19/000000_0 /user/hive/warehouse/debug_test.db/test2/p_day=2022-11-19/
Repair the partitions and query the data. Because the Parquet SerDe resolves columns by name rather than by position, test2's reordered schema (b, a, c) still reads the correct values from test1's file:
msck repair table test2;
select * from test2;
+----------+----------+----------+--------------+
| test2.b | test2.a | test2.c | test2.p_day |
+----------+----------+----------+--------------+
| 11 | 1 | 1.1 | 2022-11-19 |
| 22 | 2 | 2.2 | 2022-11-19 |
+----------+----------+----------+--------------+
2.2 Dropping a column via DDL?
To drop column a from test2, the only apparent DDL route is REPLACE COLUMNS, but running it fails. Per the official documentation, REPLACE COLUMNS only works for tables using a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe, and ColumnarSerDe):
0: jdbc:hive2://dev-master1:10000> alter table test2 replace columns(b bigint,c float);
INFO : Compiling command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd): alter table test2 replace columns(b bigint,c float)
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd); Time taken: 1.479 seconds
INFO : Executing command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd): alter table test2 replace columns(b bigint,c float)
INFO : Starting task [Stage-0:DDL] in serial mode
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table debug_test.test2. SerDe may be incompatible
INFO : Completed executing command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd); Time taken: 0.018 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table debug_test.test2. SerDe may be incompatible (state=42000,code=1)
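By contrast, a table backed by one of the native SerDes accepts the statement. A minimal sketch (test2_text is a hypothetical table, not part of the setup above): a table stored as textfile uses LazySimpleSerDe, so REPLACE COLUMNS is allowed to drop a column there:
create table test2_text(b bigint, a int, c float) partitioned by(p_day string) stored as textfile;
alter table test2_text replace columns(b bigint, c float);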
2.3 Dropping a column through the Hive Metastore API
Main Maven dependency:
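A plausible minimal declaration (an assumption inferred from the CDH 6.3.0 environment, not verified against the original project): HiveMetaStoreClient and HiveConf resolve from the hive-metastore artifact at the cluster's version:
<dependency>
    <!-- assumed coordinates; -cdh versions come from the Cloudera Maven repository -->
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-metastore</artifactId>
    <version>2.1.1-cdh6.3.0</version>
</dependency>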
Connecting to the Hive Metastore from code, the modification succeeds:
package com.bigdata.databasetest.hive.metastore;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.thrift.TException;

import java.util.List;
import java.util.stream.Collectors;

/**
 * ClassName: HiveMetaStoreClientTest
 * Description:
 *
 * @author 0x3E6
 * @version 1.0.0
 * @date 2022/11/19 17:12
 */
public class HiveMetaStoreClientTest {

    public static void main(String[] args) throws TException {
        HiveConf hiveConf = new HiveConf();
        System.setProperty("HADOOP_USER_NAME", "hive");
        hiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://dev-master1:9083");
        try (HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf)) {
            Table table = client.getTable("debug_test", "test2");
            List<FieldSchema> cols = table.getSd().getCols();
            // Drop column "a" from the table-level column list
            cols = cols.stream()
                    .filter(fieldSchema -> !"a".equalsIgnoreCase(fieldSchema.getName()))
                    .collect(Collectors.toList());
            table.getSd().setCols(cols);
            // Updates only the table metadata; partition metadata is left untouched
            client.alter_table("debug_test", "test2", table);
        }
    }
}
After this change, however, queries return bad data, whether or not the SQL filters on the partition:
select * from test2;
select * from test2 where p_day='2022-11-19';
Both queries return:
+----------+----------+--------------+
| test2.b | test2.c | test2.p_day |
+----------+----------+--------------+
| 11 | NULL | 2022-11-19 |
| 22 | NULL | 2022-11-19 |
+----------+----------+--------------+
The cause: only the table's metadata was modified; the partition's metadata was not.
0: jdbc:hive2://dev-master1:10000> desc test2;
+--------------------------+-----------------------+-----------------------+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+
| b | bigint | |
| c | float | |
| p_day | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| p_day | string | |
+--------------------------+-----------------------+-----------------------+
0: jdbc:hive2://dev-master1:10000> desc test2 partition(p_day='2022-11-19');
+--------------------------+-----------------------+-----------------------+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+
| b | bigint | |
| a | int | |
| c | float | |
| p_day | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| p_day | string | |
+--------------------------+-----------------------+-----------------------+
The underlying mechanism has not been analyzed here, but the problem is fixable: keep the partition's column order consistent with the table's. Since REPLACE COLUMNS cannot be executed on this table, use a CHANGE COLUMN statement to move column c ahead of column a in the partition:
alter table test2 partition(p_day='2022-11-19') change column c c float after b;
select * from test2 where p_day='2022-11-19';
+----------+----------+--------------+
| test2.b | test2.c | test2.p_day |
+----------+----------+--------------+
| 11 | 1.1 | 2022-11-19 |
| 22 | 2.2 | 2022-11-19 |
+----------+----------+--------------+
2.3.1 Updating table and partition metadata together
An ALTER TABLE statement can carry CASCADE to update the partition metadata along with the table metadata, and the Hive MetaStoreClient API offers a matching variant. The code above called HiveMetaStoreClient's
alter_table(String dbname, String tbl_name, Table new_tbl)
method; call
alter_table(String defaultDatabaseName, String tblName, Table table, boolean cascade)
instead. Create table test3 with the same structure as test2, copy test1's data into it the same way, then drop column a in code:
package com.bigdata.databasetest.hive.metastore;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;
import org.apache.thrift.TException;

import java.util.List;
import java.util.stream.Collectors;

/**
 * ClassName: HiveMetaStoreClientTest
 * Description:
 *
 * @author 0x3E6
 * @version 1.0.0
 * @date 2022/11/19 17:12
 */
public class HiveMetaStoreClientTest {

    public static void main(String[] args) throws TException {
        HiveConf hiveConf = new HiveConf();
        System.setProperty("HADOOP_USER_NAME", "hive");
        hiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://dev-master1:9083");
        try (HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf)) {
            Table table = client.getTable("debug_test", "test3");
            List<FieldSchema> cols = table.getSd().getCols();
            // Drop column "a" from the table-level column list
            cols = cols.stream()
                    .filter(fieldSchema -> !"a".equalsIgnoreCase(fieldSchema.getName()))
                    .collect(Collectors.toList());
            table.getSd().setCols(cols);
            // client.alter_table("debug_test", "test2", table);
            // cascade=true also rewrites every partition's column list
            client.alter_table("debug_test", "test3", table, true);
        }
    }
}
After the update, test3 queries correctly:
0: jdbc:hive2://dev-master1:10000> select * from test3;
+----------+----------+--------------+
| test3.b | test3.c | test3.p_day |
+----------+----------+--------------+
| 11 | 1.1 | 2022-11-19 |
| 22 | 2.2 | 2022-11-19 |
+----------+----------+--------------+
The table and partition metadata now also match:
0: jdbc:hive2://dev-master1:10000> desc test3;
+--------------------------+-----------------------+-----------------------+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+
| b | bigint | |
| c | float | |
| p_day | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| p_day | string | |
+--------------------------+-----------------------+-----------------------+
0: jdbc:hive2://dev-master1:10000> desc test3 partition(p_day='2022-11-19');
+--------------------------+-----------------------+-----------------------+
| col_name | data_type | comment |
+--------------------------+-----------------------+-----------------------+
| b | bigint | |
| c | float | |
| p_day | string | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| | NULL | NULL |
| p_day | string | |
+--------------------------+-----------------------+-----------------------+
3. Adding Columns
Per the Hive documentation, columns can be added with an ADD COLUMNS statement, which appends them after the existing columns but before the partition columns:
ALTER TABLE table_name
[PARTITION partition_spec] -- (Note: Hive 0.14.0 and later)
ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
[CASCADE|RESTRICT] -- (Note: Hive 1.1.0 and later)
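For example, a minimal sketch (column e is hypothetical, not part of the tables above): appending a string column to test3 and cascading the change to existing partitions:
alter table test3 add columns(e string) cascade;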
If the Hive MetaStoreClient API is used carelessly and a column ends up inserted between existing (non-partition) columns, leaving the table's and partitions' column orders inconsistent, a CHANGE COLUMN statement can likewise reorder the table's or a partition's columns to recover, as in the sketch after the syntax below:
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
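Continuing the hypothetical above: if column e had been inserted after b at the table level only, moving it back to the end of the non-partition columns with CASCADE realigns every partition:
alter table test3 change column e e string after c cascade;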
Conclusion
Columns can be added to and dropped from Hive tables through the Hive MetaStoreClient API, but the column order in the table metadata and in every partition's metadata must be kept consistent.
References
- Hive LanguageManual DDL
- Analysis of NULL column values in Hive queries after renaming columns in a Parquet-format table
- Calling the Hive Metastore interface from Java