Maniphest T340463

[Iceberg Migration] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates
Closed, Declined · Public

Description

This POC should evaluate the following solution:

Using an Iceberg table to keep track of the status of the updates to all our datasets.
For instance, when an Airflow job outputs data for 2023-06-01T00->2023-06-01T01 (1 full hour) for dataset A,
the Iceberg table should record that hour 2023-06-01T00 of dataset A has been updated successfully.
In addition to that, an Iceberg sensor should query that Iceberg table to determine whether a given interval of a dataset has been updated successfully.
Note that the update frequency does not necessarily match the dataset's partition scheme. In Iceberg, for instance, a dataset with daily partitioning can be updated every hour.
Some sort of benchmark should be executed, since it is possible that our current Airflow setup launches a couple hundred sensors within the same 2 minutes.
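The coverage check the sensor would perform can be sketched in plain Python. This is a minimal sketch, assuming a hypothetical status table that stores one row per successfully updated hour per dataset; the sensor succeeds only when every hour of the requested interval is present:

```python
from datetime import datetime, timedelta

def interval_fully_updated(updated_hours, start, end):
    """Return True if every hour in [start, end) appears in updated_hours.

    updated_hours: set of hour-granularity datetimes that the (hypothetical)
    status table reports as successfully updated for one dataset.
    """
    hour = start
    while hour < end:
        if hour not in updated_hours:
            return False  # at least one hour of the interval is missing
        hour += timedelta(hours=1)
    return True
```

Because the check is on hours rather than partitions, it naturally handles the case above where a daily-partitioned dataset is updated every hour: a daily sensor would simply wait until all 24 hourly rows exist.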

Current implementation ideas:

  • Use a Presto SQL server in the sensor (instead of SparkSQL) to avoid having to spin up the JVM+Spark for each sensor poke.
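A Presto-backed poke could look roughly like the sketch below. The table and column names (`dataset`, `hour`) are assumptions for illustration, and the cursor stands in for any DB-API cursor obtained from a Presto client; the point is that each poke is a single lightweight SQL round-trip with no JVM startup cost:

```python
from datetime import datetime

def build_status_query(table, dataset, start, end):
    # Hypothetical status-table layout: (dataset VARCHAR, hour TIMESTAMP).
    return (
        f"SELECT count(DISTINCT hour) FROM {table} "
        f"WHERE dataset = '{dataset}' "
        f"AND hour >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND hour < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'"
    )

def poke(cursor, table, dataset, start, end):
    """One sensor poke: True once the whole interval is reported updated."""
    cursor.execute(build_status_query(table, dataset, start, end))
    (updated_hours,) = cursor.fetchone()
    expected_hours = int((end - start).total_seconds() // 3600)
    return updated_hours >= expected_hours
```

Wrapped in an Airflow sensor's poke method, this keeps the per-poke cost down to one small query, which matters for the benchmark scenario above where a couple hundred sensors fire within the same 2 minutes.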

Event Timeline

Ahoelzl renamed this task from "[Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates" to "[Iceberg Migration] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates". Oct 20 2023, 5:06 PM
lbowmaker subscribed.

Declining. We will not implement this approach and will likely favor a combined use of Airflow's ExternalTaskSensor and HttpSensor.

Related task on investigating approach: https://phabricator.wikimedia.org/T360922