docs(ingest): update s3 and gcs doc with concept mapping (#8575)
parent ef5931fbed
commit 7bb4e7b90d

metadata-ingestion/docs/sources/gcs/README.md (new file, 38 lines)
@@ -0,0 +1,38 @@

This connector ingests Google Cloud Storage datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.

To specify the group of files that form a dataset, use the `path_specs` configuration in your ingestion recipe. This source leverages the [Interoperability of GCS with S3](https://cloud.google.com/storage/docs/interoperability)
and uses the DataHub S3 Data Lake integration source under the hood. Refer to the [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) section of the S3 connector for more details.

### Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept                             | DataHub Concept                                                                             | Notes                |
| ------------------------------------------ | ------------------------------------------------------------------------------------------- | -------------------- |
| `"Google Cloud Storage"`                   | [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataPlatform/) |                      |
| GCS object / Folder containing GCS objects | [Dataset](https://datahubproject.io/docs/generated/metamodel/entities/dataset/)            |                      |
| GCS bucket                                 | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/)        | Subtype `GCS bucket` |
| GCS folder                                 | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/)        | Subtype `Folder`     |

### Supported file types
Supported file types are as follows:

- CSV
- TSV
- JSON
- Parquet
- Apache Avro


Schemas for Parquet and Avro files are extracted as provided.

Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)).
JSON file schemas are inferred on the basis of the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.

### Prerequisites

1. Create a service account with the "Storage Object Viewer" role - https://cloud.google.com/iam/docs/service-accounts-create
2. Make sure you meet the requirements for generating an HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
3. Create an HMAC key for the service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create
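
With the service account and HMAC key in place, a minimal ingestion recipe might look like the sketch below. The `credential` field names (`hmac_access_id`, `hmac_access_secret`), the bucket, and the paths are illustrative assumptions; verify them against the generated GCS source configuration reference.

```yml
# Sketch of a GCS ingestion recipe; field names under `credential` and the paths
# are illustrative assumptions, check the generated config reference.
source:
  type: gcs
  config:
    path_specs:
      - include: "gs://my-gcs-bucket/data/{table}/*.parquet"  # placeholder bucket and path
    credential:
      hmac_access_id: "${GCS_HMAC_ACCESS_ID}"          # HMAC key id from the prerequisites
      hmac_access_secret: "${GCS_HMAC_ACCESS_SECRET}"  # HMAC secret from the prerequisites

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # placeholder DataHub endpoint
```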
metadata-ingestion/docs/sources/s3/README.md (new file, 44 lines)
@@ -0,0 +1,44 @@

This connector ingests S3 datasets into DataHub. It allows mapping an individual file or a folder of files to a dataset in DataHub.

To specify the group of files that form a dataset, use the `path_specs` configuration in your ingestion recipe. Refer to the [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) section for more details.

### Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept                           | DataHub Concept                                                                             | Notes               |
| ---------------------------------------- | ------------------------------------------------------------------------------------------- | ------------------- |
| `"s3"`                                   | [Data Platform](https://datahubproject.io/docs/generated/metamodel/entities/dataPlatform/) |                     |
| s3 object / Folder containing s3 objects | [Dataset](https://datahubproject.io/docs/generated/metamodel/entities/dataset/)            |                     |
| s3 bucket                                | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/)        | Subtype `S3 bucket` |
| s3 folder                                | [Container](https://datahubproject.io/docs/generated/metamodel/entities/container/)        | Subtype `Folder`    |

This connector supports both local files and files stored on AWS S3 (which must be identified using the `s3://` prefix).

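For orientation, a minimal recipe that maps each top-level folder of a bucket to a dataset might look like the sketch below. The bucket name, region, and paths are placeholders, and the `aws_config` field names should be verified against the generated S3 source configuration reference.

```yml
# Sketch of an S3 ingestion recipe; bucket, region, and credential wiring are
# illustrative assumptions, check the generated config reference.
source:
  type: s3
  config:
    path_specs:
      - include: "s3://my-example-bucket/data/{table}/*.parquet"  # placeholder path
    aws_config:
      aws_region: "us-east-1"
      # aws_access_key_id / aws_secret_access_key can be set here, or the
      # standard AWS credential chain can be relied upon instead.

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # placeholder DataHub endpoint
```
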
### Supported file types
Supported file types are as follows:

- CSV (*.csv)
- TSV (*.tsv)
- JSON (*.json)
- Parquet (*.parquet)
- Apache Avro (*.avro)


Schemas for Parquet and Avro files are extracted as provided.

Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details)).
JSON file schemas are inferred on the basis of the entire file (given the difficulty of extracting only the first few objects of the file), which may impact performance.
We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.

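As a sketch, the CSV/TSV sampling window can be widened through the `max_rows` parameter mentioned above; only the source portion of the recipe is shown, and the path is a placeholder.

```yml
# Fragment of the source config: raise the CSV/TSV sampling window for schema inference.
source:
  type: s3
  config:
    path_specs:
      - include: "s3://my-example-bucket/data/*.csv"  # placeholder path
    max_rows: 1000  # default is 100; larger values slow inference but sample more rows
```
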
### Profiling

This plugin extracts:

- Row and column counts for each dataset
- For each column, if profiling is enabled:
  - null counts and proportions
  - distinct counts and proportions
  - minimum, maximum, mean, median, standard deviation, and some quantile values
  - histograms or frequencies of unique values

Note that because profiling is run with PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed (see [compatibility](#compatibility) for more details). If profiling, make sure that permissions for **s3a://** access are set, because Spark and Hadoop use the s3a:// protocol to interface with AWS (schema inference outside of profiling requires s3:// access).
Enabling profiling will slow down ingestion runs.
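
As a minimal sketch, profiling is typically switched on through the source's `profiling` block; the flag name is assumed from the general data-lake source configuration, so verify it against the generated config reference.

```yml
# Sketch: enable profiling for the S3 source. Requires Spark 3.0.3 + Hadoop 3.2 locally,
# plus s3a:// permissions as noted above. Field names are illustrative assumptions.
source:
  type: s3
  config:
    path_specs:
      - include: "s3://my-example-bucket/data/{table}/*.parquet"  # placeholder path
    profiling:
      enabled: true
```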

@@ -1,28 +1,63 @@
### Path Specs

**Example - Dataset per file**

Path Specs (`path_specs`) is a list of Path Spec (`path_spec`) objects, where each individual `path_spec` represents one or more datasets. The include path (`path_spec.include`) represents the formatted path to the dataset. This path must end with `*.*` or `*.[ext]` to represent the leaf level. If `*.[ext]` is provided, then only files with the specified extension type will be scanned. `[ext]` can be any of the [supported file types](#supported-file-types). Refer to [example 1](#example-1---individual-file-as-dataset) below for more details.

All folder levels need to be specified in the include path. You can use `/*/` to represent a folder level and avoid specifying the exact folder name. To map a folder as a dataset, use the `{table}` placeholder to represent the folder level for which the dataset is to be created. For a partitioned dataset, you can use the placeholder `{partition_key[i]}` to represent the name of the `i`th partition and `{partition[i]}` to represent the value of the `i`th partition. During ingestion, `i` will be used to match the partition_key to the partition. Refer to [examples 2 and 3](#example-2---folder-of-files-as-dataset-without-partitions) below for more details.

Exclude paths (`path_spec.exclude`) can be used to ignore paths that are not relevant to the current `path_spec`. This path cannot have named variables (`{}`). An exclude path can have `**` to represent multiple folder levels. Refer to [example 4](#example-4---folder-of-files-as-dataset-with-partitions-and-exclude-filter) below for more details.

Refer to [example 5](#example-5---advanced---either-individual-file-or-folder-of-files-as-dataset) if your bucket has a more complex dataset representation.

**Additional points to note**
- Folder names should not contain {, }, *, or /.
- The named variable {folder} is reserved for internal use; please do not use it as a named variable.

### Path Specs - Examples

#### Example 1 - Individual file as Dataset

Bucket structure:

```
test-s3-bucket
test-bucket
├── employees.csv
├── departments.json
└── food_items.csv
```

Path specs config
Path specs config to ingest `employees.csv` and `food_items.csv` as datasets:
```
path_specs:
  - include: s3://test-s3-bucket/*.csv
  - include: s3://test-bucket/*.csv
```

This will automatically ignore the `departments.json` file. To include it, use `*.*` instead of `*.csv`.

**Example - Datasets with partitions**

#### Example 2 - Folder of files as Dataset (without Partitions)

Bucket structure:
```
test-s3-bucket
test-bucket
└── offers
    ├── 1.avro
    └── 2.avro
```

Path specs config to ingest folder `offers` as dataset:
```
path_specs:
  - include: s3://test-bucket/{table}/*.avro
```

`{table}` represents the folder for which the dataset will be created.

#### Example 3 - Folder of files as Dataset (with Partitions)

Bucket structure:
```
test-bucket
├── orders
│   └── year=2022
│       └── month=2
@@ -35,17 +70,19 @@ test-s3-bucket
```

Path specs config:
Path specs config to ingest folders `orders` and `returns` as datasets:
```
path_specs:
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

**Example - Datasets with partition and exclude**

One can also use `include: s3://test-bucket/{table}/*/*/*.parquet` here; however, the above format is preferred because it allows declaring partitions explicitly.

#### Example 4 - Folder of files as Dataset (with Partitions), and Exclude Filter

Bucket structure:
```
test-s3-bucket
test-bucket
├── orders
│   └── year=2022
│       └── month=2
@@ -59,18 +96,20 @@ test-s3-bucket
```

Path specs config:
Path specs config to ingest folder `orders` as dataset but not folder `tmp_orders`:
```
path_specs:
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    exclude:
      - **/tmp_orders/**
```

**Example - Datasets of mixed nature**

#### Example 5 - Advanced - Either Individual file OR Folder of files as Dataset

Bucket structure:
```
test-s3-bucket
test-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
@@ -91,13 +130,20 @@ test-s3-bucket
Path specs config:
```
path_specs:
  - include: s3://test-s3-bucket/*.csv
  - include: s3://test-bucket/*.csv
    exclude:
      - **/tmp_10101000.csv
  - include: s3://test-s3-bucket/{table}/*.json
  - include: s3://test-s3-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
  - include: s3://test-bucket/{table}/*.json
  - include: s3://test-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

The above config has 3 path_specs and will ingest the following datasets:
- `employees.csv` - Single File as Dataset
- `food_items.csv` - Single File as Dataset
- `customers` - Folder as Dataset
- `orders` - Folder as Dataset

and will ignore the file `tmp_10101000.csv`.

**Valid path_specs.include**

```python
@@ -120,20 +166,6 @@ s3://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # ta
- **/tests/*.csv
- s3://my-bucket/foo/*/my_table/**

**Notes**

- {table} represents folder for which dataset will be created.
- include path must end with (*.* or *.[ext]) to represent leaf level.
- if *.[ext] is provided then only files with specified type will be scanned.
- /*/ represents single folder.
- {partition[i]} represents value of partition.
- {partition_key[i]} represents name of the partition.
- While extracting, “i” will be used to match partition_key to partition.
- all folder levels need to be specified in include. Only exclude path can have ** like matching.
- exclude path cannot have named variables ( {} ).
- Folder names should not contain {, }, *, / in their names.
- {folder} is reserved for internal working. please do not use in named variables.

If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.

@@ -84,33 +84,6 @@ class GCSSourceReport(DataLakeSourceReport):
@capability(SourceCapability.SCHEMA_METADATA, "Enabled by default")
@capability(SourceCapability.DATA_PROFILING, "Not supported", supported=False)
class GCSSource(StatefulIngestionSourceBase):
    """
    This connector extracting datasets located on Google Cloud Storage. Supported file types are as follows:

    - CSV
    - TSV
    - JSON
    - Parquet
    - Apache Avro

    Schemas for Parquet and Avro files are extracted as provided.

    Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
    JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.

    This source leverages [Interoperability of GCS with S3](https://cloud.google.com/storage/docs/interoperability)
    and uses DataHub S3 Data Lake integration source under the hood.

    ### Prerequisites
    1. Create a service account with "Storage Object Viewer" Role - https://cloud.google.com/iam/docs/service-accounts-create
    2. Make sure you meet following requirements to generate HMAC key - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#before-you-begin
    3. Create an HMAC key for service account created above - https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create .

    To ingest datasets from your data lake, you need to provide the dataset path format specifications using `path_specs` configuration in ingestion recipe.
    Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/gcs/#path-specs) for examples.
    """

    def __init__(self, config: GCSSourceConfig, ctx: PipelineContext):
        super().__init__(config, ctx)
        self.config = config

@@ -219,37 +219,6 @@ class TableData:
    supported=True,
)
class S3Source(StatefulIngestionSourceBase):
    """
    This plugin extracts:

    - Row and column counts for each table
    - For each column, if profiling is enabled:
      - null counts and proportions
      - distinct counts and proportions
      - minimum, maximum, mean, median, standard deviation, some quantile values
      - histograms or frequencies of unique values

    This connector supports both local files as well as those stored on AWS S3 (which must be identified using the prefix `s3://`). Supported file types are as follows:

    - CSV
    - TSV
    - JSON
    - Parquet
    - Apache Avro

    Schemas for Parquet and Avro files are extracted as provided.

    Schemas for schemaless formats (CSV, TSV, JSON) are inferred. For CSV and TSV files, we consider the first 100 rows by default, which can be controlled via the `max_rows` recipe parameter (see [below](#config-details))
    JSON file schemas are inferred on the basis of the entire file (given the difficulty in extracting only the first few objects of the file), which may impact performance.
    We are working on using iterator-based JSON parsers to avoid reading in the entire JSON object.

    To ingest datasets from your data lake, you need to provide the dataset path format specifications using `path_specs` configuration in ingestion recipe.
    Refer section [Path Specs](https://datahubproject.io/docs/generated/ingestion/sources/s3/#path-specs) for examples.

    Note that because the profiling is run with PySpark, we require Spark 3.0.3 with Hadoop 3.2 to be installed (see [compatibility](#compatibility) for more details). If profiling, make sure that permissions for **s3a://** access are set because Spark and Hadoop use the s3a:// protocol to interface with AWS (schema inference outside of profiling requires s3:// access).
    Enabling profiling will slow down ingestion runs.
    """

    source_config: DataLakeSourceConfig
    report: DataLakeSourceReport
    profiling_times_taken: List[float]