Hudi Setup and Usage

一、Clone the Hudi source

git clone --branch release-0.11.0 https://gitee.com/apache/Hudi.git

or

git clone --branch release-0.11.0 https://github.com/apache/hudi.git

二、Switch to the Aliyun Maven repository

Add the repository below to the <repositories> section of the Hudi root pom.xml (or to your Maven settings.xml) so that dependencies are resolved from the Aliyun mirror:

<repository>
    <id>nexus-aliyun</id>
    <name>nexus-aliyun</name>
    <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
    <releases>
        <enabled>true</enabled>
    </releases>
    <snapshots>
        <enabled>false</enabled>
    </snapshots>
</repository>

三、Build

Build with the Spark 2.4, Scala 2.11, and Flink 1.13 profiles:

mvn clean package -DskipTests -Dspark2.4 -Dscala-2.11 -Dflink1.13

四、Startup and usage

1. Launching spark-shell

spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.5 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --jars /usr/local/Hudi/packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.11.0.jar

1.1 Import the dependencies and set the table name, base path, and data generator used to produce the records in this guide

// spark-shell

import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._

val tableName = "hudi_trips_cow"
val basePath = "file:///tmp/hudi_trips_cow"
val dataGen = new DataGenerator

1.2 Generate some new trips, load them into a DataFrame, and write the DataFrame to the Hudi table, as shown below. mode(Overwrite) overwrites and recreates the table if it already exists.

// spark-shell

val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)

1.3 Load the data files into a DataFrame

// spark-shell

val tripsSnapshotDF = spark.
  read.
  format("hudi").
  load(basePath)
tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

1.4 Query the data

spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

1.5 Time-travel queries (three timestamp formats)

spark.read.
  format("hudi").
  option("as.of.instant", "20210728141108100").
  load(basePath)

spark.read.
  format("hudi").
  option("as.of.instant", "2021-07-28 14:11:08.200").
  load(basePath)

// It is equal to "as.of.instant = 2021-07-28 00:00:00"
spark.read.
  format("hudi").
  option("as.of.instant", "2021-07-28").
  load(basePath)
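
The instants in the examples above are placeholders. As a sketch (not from the original post), one way to obtain a real instant for as.of.instant is to read the _hoodie_commit_time metadata column from the snapshot view registered in step 1.3 and time-travel to the latest commit:

// spark-shell
val instants = spark.sql(
  "select distinct _hoodie_commit_time from hudi_trips_snapshot order by _hoodie_commit_time"
).collect().map(_.getString(0))

spark.read.
  format("hudi").
  option("as.of.instant", instants.last).
  load(basePath).
  count()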

1.6 Update data

// spark-shell

val updates = convertToStringList(dataGen.generateUpdates(10))
val df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
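
To verify the update, reload the table and run the snapshot query again (the same pattern as steps 1.3 and 1.4). This re-check is a sketch added here rather than part of the original post; each write produces a new commit, so _hoodie_commit_time changes for the updated keys.

// spark-shell
spark.read.format("hudi").load(basePath).createOrReplaceTempView("hudi_trips_snapshot")
spark.sql("select _hoodie_commit_time, _hoodie_record_key, fare, ts from hudi_trips_snapshot").show()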

1.7 Incremental query

Hudi also provides the ability to obtain a stream of records that have changed since a given commit timestamp. This is done with Hudi's incremental query by providing a begin time from which the changes need to be streamed. We do not need to specify an endTime if we want all changes after the given commit (which is the common case).

// spark-shell
// reload data

spark.
  read.
  format("hudi").
  load(basePath).
  createOrReplaceTempView("hudi_trips_snapshot")

val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").map(k => k.getString(0)).take(50)
val beginTime = commits(commits.length - 2) // commit time we are interested in

// incrementally query data
val tripsIncrementalDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()

1.8 Point-in-time query

A query as of a specific point in time can be expressed by pointing endTime to a specific commit time and beginTime to "000" (the earliest possible commit time).

// spark-shell

val beginTime = "000" // Represents all commits > this time.
val endTime = commits(commits.length - 2) // commit time we are interested in

//incrementally query data
val tripsPointInTimeDF = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
  option(END_INSTANTTIME_OPT_KEY, endTime).
  load(basePath)
tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()

1.9 Delete data

Delete the records for the HoodieKeys passed in.

// spark-shell

// fetch total records count
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
// fetch two records to be deleted
val ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)

// issue deletes
val deletes = dataGen.generateDeletes(ds.collectAsList())
val df = spark.read.json(spark.sparkContext.parallelize(deletes, 2))

df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION_OPT_KEY,"delete").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)

// run the same read query as above.
val roAfterDeleteViewDF = spark.
  read.
  format("hudi").
  load(basePath)

roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
// fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()

1.10 Insert overwrite

Generate some new trips and overwrite all the partitions that are present in the input. This operation can be faster than an upsert-style batch ETL job that recomputes entire target partitions at once (as opposed to incrementally updating the target tables), because it completely bypasses the indexing, precombining, and other repartitioning steps of the upsert write path.

// spark-shell

spark.
  read.format("hudi").
  load(basePath).
  select("uuid","partitionpath").
  sort("partitionpath","uuid").
  show(100, false)

val inserts = convertToStringList(dataGen.generateInserts(10))
val df = spark.
  read.json(spark.sparkContext.parallelize(inserts, 2)).
  filter("partitionpath = 'americas/united_states/san_francisco'")
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION.key(),"insert_overwrite").
  option(PRECOMBINE_FIELD.key(), "ts").
  option(RECORDKEY_FIELD.key(), "uuid").
  option(PARTITIONPATH_FIELD.key(), "partitionpath").
  option(TBL_NAME.key(), tableName).
  mode(Append).
  save(basePath)

// Should have different keys now for San Francisco alone, from query before.
spark.
  read.format("hudi").
  load(basePath).
  select("uuid","partitionpath").
  sort("partitionpath","uuid").
  show(100, false)

2. Launching the Flink SQL client

2.1 Start the Flink SQL client

./bin/sql-client.sh embedded -j .../hudi-flink-bundle_2.1?-*.*.*.jar shell

2.2 Insert data

Create a Flink Hudi table and insert data into it using SQL VALUES statements.

-- sets up the result mode to tableau to show the results directly in the CLI
set execution.result-mode=tableau;

CREATE TABLE t1(
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = '${path}',
  'table.type' = 'MERGE_ON_READ' -- this creates a MERGE_ON_READ table, by default is COPY_ON_WRITE
);

-- insert data using values
INSERT INTO t1 VALUES
  ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

2.3 Query data

Query from the Hudi table:

select * from t1;

2.4 Update data

This statement updates the record with key 'id1':

insert into t1 values
  ('id1','Danny',27,TIMESTAMP '1970-01-01 00:00:01','par1');

Note that the save mode is now Append. In general, always use append mode unless you are creating the table for the first time. Querying the data again will now show the updated record: each write operation generates a new commit, denoted by its timestamp. Look for changes in the _hoodie_commit_time and age fields for the same _hoodie_record_key compared with the previous commit.

2.5 Streaming query

Hudi Flink also provides the ability to obtain a stream of records that have changed since a given commit timestamp. This is done with Hudi's streaming query by providing a start commit from which the changes need to be streamed; we do not need to specify an endTime if we want all changes after the given commit (which is the common case).
Setting read.streaming.start-commit returns all changes that happened after that commit. What makes this feature special is that it lets you author streaming pipelines on either streaming or batch data sources.

CREATE TABLE t1(
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = '${path}',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',  -- this option enable the streaming read
  'read.start-commit' = '20210316134557', -- specifies the start commit instant time
  'read.streaming.check-interval' = '4' -- specifies the check interval for finding new source commits, default 60s.
);

Then query the table in streaming mode:

select * from t1;
