Abdallah Serghine ac967dfe50
ISSUE-16094: fix s3 storage parquet structureFormat ingestion (#18660)
This fixes the S3 ingestion for Parquet files. The current behaviour is that
the pipeline breaks if it encounters a file in the container that is not valid
Parquet. This is a problem because containers might contain non-Parquet files
on purpose, for example the _SUCCESS files created by Spark.

To address this, do not fail the whole pipeline when a single container fails.
Instead, count it as a failure and move on with the remaining containers. This
is already an improvement, but ideally the ingestion should try a couple more
files under the given prefix before giving up. Additionally, we could allow
users to specify file patterns to be ignored.

Co-authored-by: Abdallah Serghine <abdallah.serghine@olx.pl>
Co-authored-by: Pere Miquel Brull <peremiquelbrull@gmail.com>
2024-12-14 11:40:23 +01:00

This guide will help you set up the Ingestion framework and connectors.

Python version 3.8+

OpenMetadata Ingestion is a simple framework for building connectors and ingesting metadata from various systems through the OpenMetadata APIs. It can be run from an orchestration framework (e.g. Apache Airflow) to ingest metadata.

Prerequisites

  • Python >= 3.8.x

Docs

Please refer to the connector documentation: https://docs.open-metadata.org/connectors

TopologyRunner

All the Ingestion Workflows run through the TopologyRunner.

The flow is depicted in the images below.

TopologyRunner Standard Flow

[image: TopologyRunner Standard Flow diagram]

TopologyRunner Multithread Flow

[image: TopologyRunner Multithread Flow diagram]
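
To give a rough feel for what a topology-driven run can look like, here is a minimal toy sketch. It is not the OpenMetadata implementation: `Node`, `run_node`, the `threads` parameter, and the database/schema/table hierarchy are all hypothetical names chosen for illustration. It assumes each level of the topology has a producer that yields items for a given parent, and that a multithreaded run hands the items of one level to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class Node:
    """Hypothetical topology node: a producer plus child nodes."""

    name: str
    # Given the parent item (None at the root), return the items at this level.
    producer: Callable[[Optional[str]], Iterable[str]]
    children: List["Node"] = field(default_factory=list)


def run_node(node: Node, parent: Optional[str] = None, threads: int = 0) -> List[str]:
    """Walk the topology depth-first and return "level:item" records.

    With threads > 0, the items of this node are handled in a thread pool;
    the descendants of each item are walked sequentially inside its worker.
    """

    def visit(item: str) -> List[str]:
        records = [f"{node.name}:{item}"]
        for child in node.children:
            records.extend(run_node(child, parent=item))
        return records

    items = list(node.producer(parent))
    if threads > 0:
        with ThreadPoolExecutor(max_workers=threads) as pool:
            # pool.map returns results in input order, so output is deterministic.
            batches = list(pool.map(visit, items))
    else:
        batches = [visit(item) for item in items]
    return [record for batch in batches for record in batch]


# Toy hierarchy (an assumption for this sketch): database -> schema -> table.
tables = Node("table", lambda schema: [f"{schema}.t1", f"{schema}.t2"])
schemas = Node("schema", lambda db: [f"{db}.public"], children=[tables])
databases = Node("database", lambda _: ["db1", "db2"], children=[schemas])

standard = run_node(databases)                # sequential walk
multithread = run_node(databases, threads=2)  # one worker per database
```

Both calls visit the same items in the same order; the threaded variant only changes how the work at the top level is scheduled, which is one plausible reading of the difference between the two flows pictured above.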