docs(ingest): update s3 and gcs doc with concept mapping (#8575)

Mayuri Nehate 2023-08-11 23:31:15 +05:30 committed by GitHub
parent ef5931fbed
commit 7bb4e7b90d
5 changed files with 145 additions and 89 deletions


@ -0,0 +1,38 @@
This connector ingests Google Cloud Storage datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
To specify the group of files that form a dataset, use the `path_specs` configuration in the ingestion recipe. This source leverages the [Interoperability of GCS with S3](https://cloud.google.com/storage/docs/interoperability)
and uses the DataHub S3 Data Lake integration source under the hood. Refer to the [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) section of the S3 connector for more details.
### Concept Mapping
This ingestion source maps the following Source System Concepts to DataHub Concepts:
| Source Concept | DataHub Concept | Notes |
| ------------------------------------------ | ------------------------------------------------------------------------------------------ | -------------------- |
| `"Google Cloud Storage"` | [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataPlatform/) | |
| GCS object / Folder containing GCS objects | [Dataset](https://datahubproject.io/docs/generated/metamodel/entities/dataset/) | |
| GCS bucket | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/) | Subtype `GCS bucket` |
| GCS folder | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/) | Subtype `Folder` |
### Supported file types
Supported file types are as follows:
- CSV
- TSV
- JSON
- Parquet
- Apache Avro
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)).
JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
### Prerequisites
1. Create a service account with the "Storage Object Viewer" role - https://cloud.google.com/iam/docs/service-accounts-create
2. Make sure you meet the requirements for generating an HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
3. Create an HMAC key for the service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create
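Putting the above together, here is a minimal sketch of an ingestion recipe for this source. The field names under `credential` (carrying the HMAC key) and the bucket path are illustrative assumptions; verify the exact option names against the config details for your DataHub version.

```
source:
  type: gcs
  config:
    path_specs:
      # Hypothetical layout: each folder under the bucket becomes one dataset
      - include: gs://my-gcs-bucket/{table}/*.parquet
    credential:
      hmac_access_id: "<HMAC access id from step 3>"
      hmac_access_secret: "<HMAC secret from step 3>"

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

The HMAC key pair created in the prerequisites is what allows the S3-compatible client used under the hood to authenticate against GCS.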


@ -0,0 +1,44 @@
This connector ingests S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.
To specify the group of files that form a dataset, use the `path_specs` configuration in the ingestion recipe. Refer to the [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) section for more details.
### Concept Mapping
This ingestion source maps the following Source System Concepts to DataHub Concepts:
| Source Concept | DataHub Concept | Notes |
| ---------------------------------------- | ------------------------------------------------------------------------------------------ | ------------------- |
| `"s3"` | [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataPlatform/) | |
| s3 object / Folder containing s3 objects | [Dataset](https://datahubproject.io/docs/generated/metamodel/entities/dataset/) | |
| s3 bucket | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/) | Subtype `S3 bucket` |
| s3 folder | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/) | Subtype `Folder` |
This connector supports both local files and those stored on AWS S3 (which must be identified using the prefix `s3://`).
### Supported file types
Supported file types are as follows:
- CSV (*.csv)
- TSV (*.tsv)
- JSON (*.json)
- Parquet (*.parquet)
- Apache Avro (*.avro)
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)).
JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
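If the first 100 rows are not representative of your CSV or TSV files, the sample size can be raised through `max_rows`. A minimal sketch, with an illustrative include path:

```
source:
  type: s3
  config:
    path_specs:
      - include: s3://test-bucket/*.csv
    max_rows: 500  # infer CSV/TSV schemas from the first 500 rows instead of the default 100
```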
### Profiling
This plugin extracts:
- Row and column counts for each dataset
- For each column, if profiling is enabled:
  - null counts and proportions
  - distinct counts and proportions
  - minimum, maximum, mean, median, standard deviation, some quantile values
  - histograms or frequencies of unique values
Note that because the profiling is run with PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed (see [compatibility](#compatibility) for more details). If profiling, make sure that permissions for **s3a://** access are set because Spark and Hadoop use the s3a:// protocol to interface with AWS (schema inference outside of profiling requires s3:// access).
Enabling profiling will slow down ingestion runs.
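Below is a sketch of a recipe with profiling enabled. The bucket layout and credentials are placeholders, and the `aws_config`/`profiling` option names should be checked against the config details for your DataHub version.

```
source:
  type: s3
  config:
    path_specs:
      - include: s3://test-bucket/{table}/*.parquet   # hypothetical layout
    aws_config:
      aws_region: us-east-1
      aws_access_key_id: "<access key>"
      aws_secret_access_key: "<secret key>"
    profiling:
      enabled: true   # requires a local Spark 3.0.3 / Hadoop 3.2 installation and s3a:// permissions

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```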


@ -1,28 +1,63 @@
### Path Specs
Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) is a formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension will be scanned. `[ext]` can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.
All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.
Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). The exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.
Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your bucket has a more complex dataset representation.
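Before diving into the examples, here is a minimal sketch of where `path_specs` sits inside a recipe's source config. The source type, bucket name, and surrounding options are illustrative only:

```
source:
  type: s3   # path_specs work the same way for the gcs source
  config:
    path_specs:
      - include: s3://test-bucket/{table}/*.parquet
        exclude:
          - "**/tmp/**"
    # platform credentials and other source options go here
```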
**Additional points to note**
- Folder names should not contain `{`, `}`, `*`, or `/` in their names.
- The named variable `{folder}` is reserved for internal use; please do not use it in named variables.
### Path Specs - Examples
#### Example 1 - Individual file as Dataset
Bucket structure:
```
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
  - include: s3://test-bucket/*.csv
```
This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.
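For instance, to pick up all three files (including `departments.json`) as datasets:

```
path_specs:
  - include: s3://test-bucket/*.*
```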
#### Example 2 - Folder of files as Dataset (without Partitions)
Bucket structure:
```
test-bucket
└── offers
    ├── 1.avro
    └── 2.avro
```
Path specs config to ingest the folder `offers` as a dataset:
```
path_specs:
  - include: s3://test-bucket/{table}/*.avro
```
`{table}` represents the folder for which the dataset will be created.
#### Example 3 - Folder of files as Dataset (with Partitions)
Bucket structure:
```
test-bucket
├── orders
│   └── year=2022
│       └── month=2
@ -35,17 +70,19 @@ test-s3-bucket
```
Path specs config to ingest the folders `orders` and `returns` as datasets:
```
path_specs:
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
One can also use `include: s3://test-bucket/{table}/*/*/*.parquet` here; however, the above format is preferred, as it allows declaring partitions explicitly.
#### Example 4 - Folder of files as Dataset (with Partitions), and Exclude Filter
Bucket structure:
```
test-bucket
├── orders
│   └── year=2022
│       └── month=2
@ -59,18 +96,20 @@ test-s3-bucket
```
Path specs config to ingest the folder `orders` as a dataset, but not the folder `tmp_orders`:
```
path_specs:
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    exclude:
      - **/tmp_orders/**
```
#### Example 5 - Advanced - Either Individual file OR Folder of files as Dataset
Bucket structure:
```
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
@ -91,13 +130,20 @@ test-s3-bucket
Path specs config:
```
path_specs:
  - include: s3://test-bucket/*.csv
    exclude:
      - **/tmp_10101000.csv
  - include: s3://test-bucket/{table}/*.json
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
The above config has 3 path_specs and will ingest the following datasets:
- `employees.csv` - Single File as Dataset
- `food_items.csv` - Single File as Dataset
- `customers` - Folder as Dataset
- `orders` - Folder as Dataset
It will ignore the file `tmp_10101000.csv`.
**Valid path_specs.include**
```python
@ -120,20 +166,6 @@ s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # ta
- **/tests/*.csv
- s3://my-bucket/foo/*/my_table/**
If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.


@ -84,33 +84,6 @@ class GCSSourceReport(DataLakeSourceReport):
@capability(SourceCapability.SCHEMA_METADATA, "Enabled by default")
@capability(SourceCapability.DATA_PROFILING, "Not supported", supported=False)
class GCSSource(StatefulIngestionSourceBase):
"""
This connector extracting datasets located on Google Cloud Storage. Supported file types are as follows:
- CSV
- TSV
- JSON
- Parquet
- Apache Avro
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
This source leverages [Interoperability of GCS with S3](https://cloud.google.com/storage/docs/interoperability)
and uses DataHub S3 Data Lake integration source under the hood.
### Prerequisites
1. Create a service account with "Storage Object Viewer" Role - https://cloud.google.com/iam/docs/service-accounts-create
2. Make sure you meet following requirements to generate HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
3. Create an HMAC key for service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create .
To ingest datasets from your data lake, you need to provide the dataset path format specifications using `path_specs` configuration in ingestion recipe.
Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/gcs/#path-specs) for examples.
"""
def __init__(self, config: GCSSourceConfig, ctx: PipelineContext):
super().__init__(config, ctx)
self.config = config


@ -219,37 +219,6 @@ class TableData:
supported=True,
)
class S3Source(StatefulIngestionSourceBase):
"""
This plugin extracts:
- Row and column counts for each table
- For each column, if profiling is enabled:
- null counts and proportions
- distinct counts and proportions
- minimum, maximum, mean, median, standard deviation, some quantile values
- histograms or frequencies of unique values
This connector supports both local files as well as those stored on AWS S3 (which must be identified using the prefix `s3://`). Supported file types are as follows:
- CSV
- TSV
- JSON
- Parquet
- Apache Avro
Schemas for Parquet and Avro files are extracted as provided.
Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.
To ingest datasets from your data lake, you need to provide the dataset path format specifications using `path_specs` configuration in ingestion recipe.
Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for examples.
Note that because the profiling is run with PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed (see [compatibility](#compatibility) for more details). If profiling, make sure that permissions for **s3a://** access are set because Spark and Hadoop use the s3a:// protocol to interface with AWS (schema inference outside of profiling requires s3:// access).
Enabling profiling will slow down ingestion runs.
"""
source_config: DataLakeSourceConfig
report: DataLakeSourceReport
profiling_times_taken: List[float]