### Path Specs
**Example - Dataset per file**
Bucket structure:
```
test-gs-bucket
├── employees.csv
└── food_items.csv
```
Path specs config:
```
path_specs:
  - include: gs://test-gs-bucket/*.csv
```
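To make the "dataset per file" intent concrete, here is a minimal Python sketch (not the connector's implementation) of how a flat `*.csv` include maps each matching object to its own dataset, named after the file:
```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical listing of the bucket root; in practice this comes from a GCS client.
objects = ["employees.csv", "food_items.csv"]

# The part of the include pattern after the bucket name.
include = "*.csv"

# One dataset per matching file, named after the file (illustrative only).
datasets = [PurePosixPath(obj).stem for obj in objects if fnmatch(obj, include)]
print(datasets)  # ['employees', 'food_items']
```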
**Example - Datasets with partitions**
Bucket structure:
```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
```
Path specs config:
```
path_specs:
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
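One way to picture how the named variables are resolved: the include path behaves like a template in which `{table}`, `{partition_key[i]}`, and `{partition[i]}` each capture one path segment. A rough Python sketch of that matching idea (not the connector's actual code):
```python
import re

# Regex equivalent of:
# gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
pattern = re.compile(
    r"gs://test-gs-bucket/(?P<table>[^/]+)/"
    r"(?P<partition_key_0>[^/=]+)=(?P<partition_0>[^/]+)/"
    r"(?P<partition_key_1>[^/=]+)=(?P<partition_1>[^/]+)/"
    r"[^/]+\.parquet$"
)

match = pattern.match("gs://test-gs-bucket/orders/year=2022/month=2/1.parquet")
print(match.groupdict())
# {'table': 'orders', 'partition_key_0': 'year', 'partition_0': '2022',
#  'partition_key_1': 'month', 'partition_1': '2'}
```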
**Example - Datasets with partition and exclude**
Bucket structure:
```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet
```
Path specs config:
```
path_specs:
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    exclude:
      - "**/tmp_orders/**"
```
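The exclude patterns are applied on top of the include match, so objects under `tmp_orders` are dropped even though they fit the include template. A hedged sketch of that filtering using Python's `fnmatch` (only an approximation of `**` semantics, not the connector's real matcher):
```python
from fnmatch import fnmatch

# Hypothetical object paths; in practice these come from the bucket listing.
paths = [
    "gs://test-gs-bucket/orders/year=2022/month=2/1.parquet",
    "gs://test-gs-bucket/tmp_orders/year=2021/month=2/1.parquet",
]

excludes = ["**/tmp_orders/**"]

def is_excluded(path: str) -> bool:
    # fnmatch's '*' also matches '/', which is close enough here to stand in
    # for '**'-style matching.
    return any(fnmatch(path, pat.replace("**", "*")) for pat in excludes)

kept = [p for p in paths if not is_excluded(p)]
print(kept)  # only the orders/... path survives
```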
**Example - Datasets of mixed nature**
Bucket structure:
```
test-gs-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
```
Path specs config:
```
path_specs:
  - include: gs://test-gs-bucket/*.csv
    exclude:
      - "**/tmp_10101000.csv"
  - include: gs://test-gs-bucket/{table}/*.json
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
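When several path_specs are listed, each object only needs to satisfy one of them. For the layout above, the expected result is per-file datasets `employees` and `food_items`, a `customers` dataset covering the JSON parts, an `orders` dataset for the partitioned parquet files, and nothing for `tmp_10101000.csv`. A simplified sketch of that grouping (the dataset naming and first-match assignment are illustrative assumptions, not the connector's documented behavior):
```python
from fnmatch import fnmatch

# Hypothetical listing of the bucket above (bucket prefix stripped).
objects = [
    "customers/part1.json", "customers/part2.json",
    "employees.csv", "food_items.csv", "tmp_10101000.csv",
    "orders/year=2022/month=2/1.parquet",
]

# Simplified view of the three include patterns, with {table} reduced to '*'.
specs = [
    ("*.csv", ["tmp_10101000.csv"]),       # per-file datasets, tmp file excluded
    ("*/*.json", []),                      # {table}/*.json
    ("*/*=*/*=*/*.parquet", []),           # {table}/year=.../month=.../*.parquet
]

datasets = set()
for obj in objects:
    for include, excludes in specs:
        if fnmatch(obj, include) and not any(fnmatch(obj, e) for e in excludes):
            # Per-file specs name the dataset after the file, {table} specs after
            # the top-level folder (illustrative assumption).
            datasets.add(obj.split("/")[0].split(".")[0])
            break

print(sorted(datasets))  # ['customers', 'employees', 'food_items', 'orders']
```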
**Valid path_specs.include**
```python
gs://my-bucket/foo/tests/bar.avro # single file table
gs://my-bucket/foo/tests/*.* # multiple file-level tables
gs://my-bucket/foo/tests/{table}/*.avro # table without partition
gs://my-bucket/foo/tests/{table}/*/*.avro # table where partitions are not specified
gs://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor the data type are specified
gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in the display name
gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 2 levels down in the bucket
gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 3 levels down in the bucket
```
**Valid path_specs.exclude**
- \*\*/tests/\*\*
- gs://my-bucket/hr/\*\*
- \*\*/tests/\*.csv
- gs://my-bucket/foo/\*/my_table/\*\*
**Notes**
- `{table}` represents the folder for which a dataset will be created.
- The include path must end with `*.*` or `*.[ext]` to represent the leaf level.
- If `*.[ext]` is provided, then only files of the specified type will be scanned.
- `/*/` represents a single folder.
- `{partition[i]}` represents the value of a partition.
- `{partition_key[i]}` represents the name of a partition.
- While extracting, `i` is used to match a `partition_key` to its `partition` value.
- All folder levels need to be specified in the include path. Only the exclude path can use `**`-style matching.
- The exclude path cannot contain named variables (`{}`).
- Folder names should not contain `{`, `}`, `*`, or `/`.
- `{folder}` is reserved for internal use; please do not use it as a named variable.
If you would like to write a more complicated function for resolving file names, then a `{transformer}` would be a good fit.
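Taken together, these constraints can be expressed as a small validation sketch (a hypothetical helper, not the connector's own validation logic):
```python
import re

def validate_path_spec(include: str, excludes: list[str]) -> list[str]:
    """Illustrative checks mirroring the notes above (not the connector's validation)."""
    problems = []
    if not re.search(r"\*\.(\*|[A-Za-z0-9]+)$", include):
        problems.append("include must end with *.* or *.[ext]")
    if "**" in include:
        problems.append("include cannot use '**'; every folder level must be spelled out")
    for exc in excludes:
        if "{" in exc or "}" in exc:
            problems.append(f"exclude cannot contain named variables: {exc}")
    return problems

# An empty list means the spec passes these illustrative checks.
print(validate_path_spec("gs://test-gs-bucket/{table}/*.csv", ["**/tmp_orders/**"]))  # []
```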
:::caution
Specify as long a fixed prefix (without `/*/`) as possible in `path_specs.include`. This reduces scanning time and cost, specifically on Google Cloud Storage.
:::
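The reason the fixed prefix matters is that GCS object listing is prefix-based: everything before the first wildcard or named variable can be pushed down to the listing call, so only that part of the bucket is enumerated. A rough sketch using the `google-cloud-storage` client (bucket name and path are placeholders):
```python
import re
from google.cloud import storage

include = "gs://my-bucket/foo/tests/{table}/*.avro"

# Everything before the first wildcard or named variable is a fixed prefix.
bucket_name, _, key_pattern = include.removeprefix("gs://").partition("/")
fixed_prefix = re.split(r"[*{]", key_pattern, maxsplit=1)[0]  # "foo/tests/"

client = storage.Client()
# Only objects under the fixed prefix are listed, instead of the whole bucket.
for blob in client.list_blobs(bucket_name, prefix=fixed_prefix):
    print(blob.name)
```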
:::caution
If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs.
:::