mirror of
https://github.com/datahub-project/datahub.git
synced 2025-07-04 15:50:14 +00:00
158 lines
4.4 KiB
Markdown
158 lines
4.4 KiB
Markdown
### Path Specs
|
||
|
||
**Example - Dataset per file**
|
||
|
||
Bucket structure:
|
||
|
||
```
|
||
test-gs-bucket
|
||
├── employees.csv
|
||
└── food_items.csv
|
||
```
|
||
|
||
Path specs config
|
||
|
||
```
|
||
path_specs:
|
||
- include: gs://test-gs-bucket/*.csv
|
||
|
||
```
|
||
|
||
**Example - Datasets with partitions**
|
||
|
||
Bucket structure:
|
||
|
||
```
|
||
test-gs-bucket
|
||
├── orders
|
||
│ └── year=2022
|
||
│ └── month=2
|
||
│ ├── 1.parquet
|
||
│ └── 2.parquet
|
||
└── returns
|
||
└── year=2021
|
||
└── month=2
|
||
└── 1.parquet
|
||
|
||
```
|
||
|
||
Path specs config:
|
||
|
||
```
|
||
path_specs:
|
||
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
|
||
```
|
||
|
||
**Example - Datasets with partition and exclude**
|
||
|
||
Bucket structure:
|
||
|
||
```
|
||
test-gs-bucket
|
||
├── orders
|
||
│ └── year=2022
|
||
│ └── month=2
|
||
│ ├── 1.parquet
|
||
│ └── 2.parquet
|
||
└── tmp_orders
|
||
└── year=2021
|
||
└── month=2
|
||
└── 1.parquet
|
||
|
||
|
||
```
|
||
|
||
Path specs config:
|
||
|
||
```
|
||
path_specs:
|
||
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
|
||
exclude:
|
||
- **/tmp_orders/**
|
||
```
|
||
|
||
**Example - Datasets of mixed nature**
|
||
|
||
Bucket structure:
|
||
|
||
```
|
||
test-gs-bucket
|
||
├── customers
|
||
│ ├── part1.json
|
||
│ ├── part2.json
|
||
│ ├── part3.json
|
||
│ └── part4.json
|
||
├── employees.csv
|
||
├── food_items.csv
|
||
├── tmp_10101000.csv
|
||
└── orders
|
||
└── year=2022
|
||
└── month=2
|
||
├── 1.parquet
|
||
├── 2.parquet
|
||
└── 3.parquet
|
||
|
||
```
|
||
|
||
Path specs config:
|
||
|
||
```
|
||
path_specs:
|
||
- include: gs://test-gs-bucket/*.csv
|
||
exclude:
|
||
- **/tmp_10101000.csv
|
||
- include: gs://test-gs-bucket/{table}/*.json
|
||
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
|
||
```
|
||
|
||
**Valid path_specs.include**
|
||
|
||
```python
|
||
gs://my-bucket/foo/tests/bar.avro # single file table
|
||
gs://my-bucket/foo/tests/*.* # mulitple file level tables
|
||
gs://my-bucket/foo/tests/{table}/*.avro #table without partition
|
||
gs://my-bucket/foo/tests/{table}/*/*.avro #table where partitions are not specified
|
||
gs://my-bucket/foo/tests/{table}/*.* # table where no partitions as well as data type specified
|
||
gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
|
||
gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
|
||
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
|
||
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
|
||
gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in bucket
|
||
gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in bucket
|
||
```
|
||
|
||
**Valid path_specs.exclude**
|
||
|
||
- \*\*/tests/\*\*
|
||
- gs://my-bucket/hr/\*\*
|
||
- \*_/tests/_.csv
|
||
- gs://my-bucket/foo/\*/my_table/\*\*
|
||
|
||
**Notes**
|
||
|
||
- {table} represents folder for which dataset will be created.
|
||
- include path must end with (_._ or \*.[ext]) to represent leaf level.
|
||
- if \*.[ext] is provided then only files with specified type will be scanned.
|
||
- /\*/ represents single folder.
|
||
- {partition[i]} represents value of partition.
|
||
- {partition_key[i]} represents name of the partition.
|
||
- While extracting, “i” will be used to match partition_key to partition.
|
||
- all folder levels need to be specified in include. Only exclude path can have \*\* like matching.
|
||
- exclude path cannot have named variables ( {} ).
|
||
- Folder names should not contain {, }, \*, / in their names.
|
||
- {folder} is reserved for internal working. please do not use in named variables.
|
||
|
||
If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.
|
||
|
||
:::caution
|
||
|
||
Specify as long fixed prefix ( with out /\*/ ) as possible in `path_specs.include`. This will reduce the scanning time and cost, specifically on Google Cloud Storage.
|
||
|
||
:::
|
||
|
||
:::caution
|
||
|
||
If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs.
|
||
|
||
:::
|