This connector ingests Google Cloud Storage datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
To specify the group of files that form a dataset, use the `path_specs`
configuration in the ingestion recipe. This source leverages the interoperability of GCS with S3
and uses DataHub's S3 Data Lake integration source under the hood. Refer to the Path Specs section of the S3 connector for more details.
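For orientation, here is a minimal recipe sketch. The bucket path and the exact field names (notably `credential.hmac_access_id` and `credential.hmac_access_secret`) are illustrative assumptions; consult the config details and Path Specs documentation for the authoritative options.

```yaml
source:
  type: gcs
  config:
    path_specs:
      # Map all parquet files under this prefix to datasets; {table} groups
      # files in the same folder into a single dataset.
      - include: "gs://my-gcs-bucket/data/{table}/*.parquet"
    credential:
      # HMAC key pair for the service account (see Prerequisites below).
      hmac_access_id: "<hmac-access-id>"
      hmac_access_secret: "<hmac-access-secret>"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```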
Concept Mapping
This ingestion source maps the following Source System Concepts to DataHub Concepts:
| Source Concept | DataHub Concept | Notes |
| --- | --- | --- |
| "Google Cloud Storage" | Data Platform | |
| GCS object / Folder containing GCS objects | Dataset | |
| GCS bucket | Container | Subtype `GCS bucket` |
| GCS folder | Container | Subtype `Folder` |
Supported file types
Supported file types are as follows:
- CSV
- TSV
- JSONL
- JSON
- Parquet
- Apache Avro
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSONL, JSON) are inferred. For CSV, TSV, and JSONL files, the first 100 rows are considered by default; this can be controlled via the `max_rows` recipe parameter (see below).
JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
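As a hedged illustration of tuning schema inference, the snippet below sets `max_rows` in the source config; the path spec and parameter placement follow the same assumed shape as the recipe sketch above.

```yaml
source:
  type: gcs
  config:
    path_specs:
      - include: "gs://my-gcs-bucket/data/{table}/*.csv"
    # Infer CSV/TSV/JSONL schemas from the first 500 rows instead of the default 100.
    max_rows: 500
```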
Prerequisites
- Create a service account with the "Storage Object Viewer" role - https://cloud.google.com/iam/docs/service-accounts-create
- Make sure you meet the requirements to generate an HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
- Create an HMAC key for the service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create
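The HMAC key created in the last step yields an access ID and a secret. As an assumption about the config shape (matching the recipe sketch above), these are passed to the source via the `credential` block:

```yaml
source:
  type: gcs
  config:
    credential:
      # Access ID and secret from the HMAC key created for the service account.
      hmac_access_id: "<hmac-access-id>"
      hmac_access_secret: "<hmac-access-secret>"
```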