Maniphest T340463

[Iceberg Migration] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates
Closed, Declined · Public

Description

This POC should evaluate the following solution:

Using an Iceberg table to keep track of the status of the updates to all our datasets.
For instance, when an Airflow job outputs data for 2023-06-01T00->2023-06-01T01 (1 full hour) for dataset A,
the Iceberg table should record that hour 2023-06-01T00 of dataset A has been updated successfully.
In addition to that, an Iceberg sensor should query that Iceberg table to determine whether a given interval of a dataset has been updated successfully.
Note that the update frequency does not necessarily match the dataset's partition scheme. In Iceberg, for instance, a dataset with daily partitioning can be updated every hour.
Some sort of benchmark should be executed, since it is possible that our current Airflow setup launches a couple hundred sensors within the same 2 minutes.
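The coverage check the sensor would perform can be sketched in plain Python. This is a minimal sketch, assuming a hypothetical status table that stores one row per successfully updated hour per dataset; the sensor succeeds only when every hour of the requested interval is present:

```python
from datetime import datetime, timedelta

def interval_fully_updated(updated_hours, start, end):
    """Return True if every hour in [start, end) appears in updated_hours.

    updated_hours: set of hour-granularity datetimes that the (hypothetical)
    status table reports as successfully updated for one dataset.
    """
    hour = start
    while hour < end:
        if hour not in updated_hours:
            return False  # at least one hour of the interval is missing
        hour += timedelta(hours=1)
    return True
```

Because the check is on hours rather than partitions, it naturally handles the case above where a daily-partitioned dataset is updated every hour: a daily sensor would simply wait until all 24 hourly rows exist.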

Current implementation ideas:

  • Use a Presto SQL server in the sensor (instead of SparkSQL) to avoid having to spin up the JVM+Spark for each sensor poke.
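A Presto-backed poke could look roughly like the sketch below. The table and column names (`dataset`, `hour`) are assumptions for illustration, and the cursor stands in for any DB-API cursor obtained from a Presto client; the point is that each poke is a single lightweight SQL round-trip with no JVM startup cost:

```python
from datetime import datetime

def build_status_query(table, dataset, start, end):
    # Hypothetical status-table layout: (dataset VARCHAR, hour TIMESTAMP).
    return (
        f"SELECT count(DISTINCT hour) FROM {table} "
        f"WHERE dataset = '{dataset}' "
        f"AND hour >= TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND hour < TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}'"
    )

def poke(cursor, table, dataset, start, end):
    """One sensor poke: True once the whole interval is reported updated."""
    cursor.execute(build_status_query(table, dataset, start, end))
    (updated_hours,) = cursor.fetchone()
    expected_hours = int((end - start).total_seconds() // 3600)
    return updated_hours >= expected_hours
```

Wrapped in an Airflow sensor's poke method, this keeps the per-poke cost down to one small query, which matters for the benchmark scenario above where a couple hundred sensors fire within the same 2 minutes.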

Event Timeline

Ahoelzl renamed this task from "[Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates" to "[Iceberg Migration] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates". Oct 20 2023, 5:06 PM
lbowmaker subscribed.

Declining. We will not implement this approach and will likely favor a combined use of Airflow's ExternalTaskSensor and HttpSensor.

Related task on investigating approach: https://phabricator.wikimedia.org/T360922