Guide to Installing Apache Hudi on CDH 6.3.0
1. Compile from Source
First, clone the Hudi source code from the repository at https://github.com/apache/hudi.git to your local machine, then build Hudi with the following command:
mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true -Dhadoop.version=3.0.0
Note: Hudi currently builds against Hadoop 2.7.3 by default, while the CDH 6.3.0 environment runs Hadoop 3.0.0, so the -Dhadoop.version=3.0.0 parameter must be added when packaging.
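After the build finishes, locate the bundle jar that the next section uploads to the cluster. A minimal sketch, assuming Hudi 0.6.0 and the default source layout (the exact artifact name may vary with your build profile):
# Find the Hive/MapReduce bundle jar produced by the build
find packaging -name "hudi-hadoop-mr-bundle-*.jar"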
2. Configuration for Querying Hudi Tables
To query Hudi tables, perform the following configuration.
1. Upload the hudi-hadoop-mr-0.6.0.jar package to /opt/cloudera/parcels/CDH-6.3.0/jars
2. Create a symlink to it under /opt/cloudera/parcels/CDH-6.3.0/lib/hive/lib
3. Run the "Install MapReduce Framework JARs" action in Cloudera Manager
4. Create a Hive auxiliary JARs directory such as /data/hive/jars (name it as you see fit) and configure it in the CDH (Cloudera Manager) UI
5. Upload the following jars to the auxiliary directory (example commands follow this list):
• hudi-hadoop-mr-bundle-0.6.0.jar (if the data is stored on Aliyun OSS, the following three jars must also be placed in the Hive auxiliary path)
• aliyun-sdk-oss-3.8.1.jar
• hadoop-aliyun-3.2.1.jar
• jdom-1.1.jar
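The steps above can be scripted roughly as follows. This is a sketch only: it assumes the jar names listed above, the default CDH 6.3.0 parcel layout, and that the commands are run on each node hosting Hive services; the Cloudera Manager steps (installing the MR framework JARs and pointing the Hive auxiliary JARs directory setting at /data/hive/jars) are still done in the UI.
# Steps 1-2: place the jar under the parcel jars directory and symlink it into Hive's lib
cp hudi-hadoop-mr-0.6.0.jar /opt/cloudera/parcels/CDH-6.3.0/jars/
ln -s /opt/cloudera/parcels/CDH-6.3.0/jars/hudi-hadoop-mr-0.6.0.jar \
      /opt/cloudera/parcels/CDH-6.3.0/lib/hive/lib/hudi-hadoop-mr-0.6.0.jar
# Steps 4-5: create the auxiliary directory and copy the jars into it
mkdir -p /data/hive/jars
cp hudi-hadoop-mr-bundle-0.6.0.jar aliyun-sdk-oss-3.8.1.jar \
   hadoop-aliyun-3.2.1.jar jdom-1.1.jar /data/hive/jars/
Restart the Hive services afterwards so the new jars are picked up.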
3. Permission Configuration
Execute the grant command as the hive user:
GRANT all on uri 'oss://data-lake/xxxxx' to role xxxx;
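If the target role does not exist yet, a minimal Sentry-style sequence might look like the following; the role and group names here are placeholders, not taken from the original setup:
-- Placeholder role/group names; adjust the OSS path to your bucket
CREATE ROLE datalake_writer;
GRANT ROLE datalake_writer TO GROUP etl_users;
GRANT ALL ON URI 'oss://data-lake/xxxxx' TO ROLE datalake_writer;
-- Verify the grant
SHOW GRANT ROLE datalake_writer;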
4. Run the DeltaStreamer Job
Run the Hudi DeltaStreamer job:
spark-submit --name xxxx \
--driver-memory 2G \
--num-executors 4 \
--executor-memory 4G \
--executor-cores 1 \
--deploy-mode cluster \
--conf spark.executor.userClassPathFirst=true \
--jars hdfs://nameservice1/data_lake/jars/hive-jdbc-2.1.1.jar,hdfs://nameservice1/data_lake/jars/hive-service-2.1.1.jar,hdfs://nameservice1/data_lake/jars/jdom-1.1.jar,hdfs://nameservice1/data_lake/jars/hadoop-aliyun-3.2.1.jar,hdfs://nameservice1/data_lake/jars/aliyun-sdk-oss-3.8.1.jar \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer hdfs://nameservice1/data_lake/jars/data_lake_1.jar \
--op INSERT \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--target-table t3_ts_iov_event_push_detail \
--table-type COPY_ON_WRITE \
--source-ordering-field updateTime \
--continuous \
--source-limit 100000 \
--target-base-path oss://data-lake/xxxxxx \
--enable-hive-sync \
--transformer-class org.apache.hudi.utilities.transform.AddStringDateColumnTransform \
--props hdfs://nameservice1/data_lake/xxxxxx/kafka-source.properties
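The --props file referenced on the last line carries the Kafka source and Hudi writer configuration. Below is a minimal illustrative sketch of such a kafka-source.properties for the JsonKafkaSource/FilebasedSchemaProvider combination used above; every value (field names, topic, broker, schema file paths, Hive sync settings) is a placeholder rather than the original configuration:
# Record key and partition path written by Hudi (placeholder field names)
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=dt
# Avro schemas used by FilebasedSchemaProvider (placeholder paths)
hoodie.deltastreamer.schemaprovider.source.schema.file=hdfs://nameservice1/data_lake/xxxxxx/source.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=hdfs://nameservice1/data_lake/xxxxxx/target.avsc
# Kafka source settings (placeholder topic and broker)
hoodie.deltastreamer.source.kafka.topic=your_topic
bootstrap.servers=kafka-broker:9092
auto.offset.reset=latest
# Hive sync settings used with --enable-hive-sync (placeholders)
hoodie.datasource.hive_sync.database=default
hoodie.datasource.hive_sync.table=t3_ts_iov_event_push_detail
hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://hiveserver2-host:10000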
Recommended Reading
Real-time data warehouse transformation: Hudi on Flink in practice at SF Express
Building a real-time cloud data lake with Alibaba Cloud Data Lake Analytics and Apache Hudi