2023-04-27 22:33:41 +05:30
### Path Specs
**Example - Dataset per file**
Bucket structure:
```
test-gs-bucket
├── employees.csv
└── food_items.csv
```
Path specs config:
```
path_specs:
- include: gs://test-gs-bucket/*.csv
```
**Example - Datasets with partitions**
Bucket structure:
```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
```
Path specs config:
```
path_specs:
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
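To make the partition placeholders concrete, here is a minimal Python sketch of how an object path from the bucket above could be split into a table name and partition key/value pairs. The function name `split_partitioned_path` is hypothetical and this is not the connector's actual implementation; it only assumes the `key=value` folder layout shown in this example.

```python
from typing import Dict, Tuple


def split_partitioned_path(
    path: str, bucket: str = "test-gs-bucket"
) -> Tuple[str, Dict[str, str]]:
    """Split gs://<bucket>/<table>/<key>=<value>/.../<file> into (table, partitions).

    Hypothetical helper for illustration only.
    """
    prefix = f"gs://{bucket}/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not under {prefix!r}")
    # The first segment is the table folder, the last segment is the file name,
    # and everything in between is treated as a partition folder.
    table, *folders, _file_name = path[len(prefix):].split("/")
    partitions: Dict[str, str] = {}
    for folder in folders:
        key, sep, value = folder.partition("=")
        if not sep:
            raise ValueError(f"folder {folder!r} is not in key=value form")
        partitions[key] = value
    return table, partitions


table, partitions = split_partitioned_path(
    "gs://test-gs-bucket/orders/year=2022/month=2/1.parquet"
)
print(table, partitions)
```

Here `{table}` corresponds to `orders`, `{partition_key[0]}` to `year`, and `{partition[0]}` to `2022`.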
**Example - Datasets with partition and exclude**
Bucket structure:
```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet
```
Path specs config:
```
path_specs:
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
  exclude:
  - "**/tmp_orders/**"
```
**Example - Datasets of mixed nature**
Bucket structure:
```
test-gs-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
```
Path specs config:
```
path_specs:
- include: gs://test-gs-bucket/*.csv
  exclude:
  - "**/tmp_10101000.csv"
- include: gs://test-gs-bucket/{table}/*.json
- include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```
**Valid path_specs.include**
```python
gs://my-bucket/foo/tests/bar.avro # single file table
gs://my-bucket/foo/tests/*.* # multiple file level tables
gs://my-bucket/foo/tests/{table}/*.avro # table without partition
gs://my-bucket/foo/tests/{table}/*/*.avro # table where partitions are not specified
gs://my-bucket/foo/tests/{table}/*.* # table where neither partitions nor data type are specified
gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 2 levels down in bucket
gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present at 3 levels down in bucket
```
**Valid path_specs.exclude**
- \*\*/tests/\*\*
- gs://my-bucket/hr/\*\*
- \*\*/tests/\*.csv
- gs://my-bucket/foo/\*/my_table/\*\*
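As a rough illustration of how glob-style exclude patterns like the ones above select paths, the following Python sketch uses the standard library's `fnmatch` module. The helper name `is_excluded` is hypothetical, and the connector's real matching logic may differ. Note that in `fnmatch`, `*` also matches across `/`, so `*` and `**` behave the same in this simplified version.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_excluded(path: str, exclude_patterns: Iterable[str]) -> bool:
    """Return True if the path matches any of the exclude globs.

    Hypothetical helper: fnmatchcase is case-sensitive and lets "*"
    span folder separators, which is enough to illustrate "**" matching.
    """
    return any(fnmatchcase(path, pattern) for pattern in exclude_patterns)


patterns = ["**/tests/**", "gs://my-bucket/hr/**"]
for candidate in (
    "gs://my-bucket/foo/tests/a.csv",
    "gs://my-bucket/hr/payroll/2022.avro",
    "gs://my-bucket/foo/data/a.csv",
):
    print(candidate, is_excluded(candidate, patterns))
```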
**Notes**
- {table} represents the folder for which a dataset will be created.
- The include path must end with \*.\* or \*.[ext] to represent the leaf level.
- If \*.[ext] is provided, only files of the specified type are scanned.
- /\*/ represents a single folder.
- {partition[i]} represents the value of a partition.
- {partition_key[i]} represents the name of a partition.
- During extraction, the index “i” is used to match each partition_key to its partition.
- All folder levels must be specified in the include path. Only the exclude path can use \*\*-style matching.
- The exclude path cannot contain named variables ( {} ).
- Folder names should not contain {, }, \*, or /.
- {folder} is reserved for internal use. Do not use it as a named variable.
If you would like to write a more complicated function for resolving file names, then a {transformer} would be a good fit.
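The placeholder rules in the notes above can be sketched in Python by translating an include path into a regular expression. This is only an illustration under stated assumptions: the helper names `include_to_regex` and `match_include` are hypothetical, named variables are mapped to regex groups by replacing brackets with underscores, and the connector's actual resolution logic may work differently.

```python
import re
from typing import Dict, Optional

# Matches either a named variable such as {table} or {partition[0]}, or a "*".
_TOKEN = re.compile(r"\{[^{}]+\}|\*")


def include_to_regex(include: str) -> "re.Pattern":
    """Translate an include path into a regex (hypothetical sketch)."""
    parts = []
    pos = 0
    for match in _TOKEN.finditer(include):
        # Literal text between tokens is matched verbatim.
        parts.append(re.escape(include[pos:match.start()]))
        token = match.group()
        if token == "*":
            # "*" stays within a single folder or file name.
            parts.append(r"[^/]*")
        else:
            # {partition_key[0]} becomes the group name partition_key_0.
            name = re.sub(r"\W", "_", token[1:-1]).strip("_")
            parts.append(f"(?P<{name}>[^/]+)")
        pos = match.end()
    parts.append(re.escape(include[pos:]))
    return re.compile("".join(parts))


def match_include(include: str, path: str) -> Optional[Dict[str, str]]:
    """Return the named variables extracted from path, or None if it does not match."""
    found = include_to_regex(include).fullmatch(path)
    return found.groupdict() if found else None


spec = (
    "gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}"
    "/{partition_key[1]}={partition[1]}/*.parquet"
)
print(match_include(spec, "gs://test-gs-bucket/orders/year=2022/month=2/1.parquet"))
print(match_include(spec, "gs://test-gs-bucket/employees.csv"))
```

Because every folder level is translated one-to-one, a path with more or fewer levels than the include path does not match, which mirrors the requirement that all folder levels be specified.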
:::caution
Specify as long a fixed prefix (without /\*/) as possible in `path_specs.include`. This reduces scanning time and cost, specifically on Google Cloud Storage.
:::
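To see what "fixed prefix" means for a given include path, the following Python sketch returns the literal part of the path that precedes the first wildcard or named variable. The function name `fixed_prefix` is hypothetical, and how the connector actually derives its listing prefix is an assumption here; the point is only that a longer literal prefix narrows the set of objects that has to be listed.

```python
def fixed_prefix(include: str) -> str:
    """Return the literal folder prefix before the first "*" or "{" (hypothetical helper)."""
    cut = len(include)
    for marker in ("*", "{"):
        index = include.find(marker)
        if index != -1:
            cut = min(cut, index)
    # Trim back to the last complete folder so the prefix ends on a "/" boundary.
    return include[: include.rfind("/", 0, cut) + 1]


print(fixed_prefix("gs://my-bucket/foo/tests/{table}/*.avro"))
print(fixed_prefix("gs://my-bucket/*/{table}/{partition[0]}/*.*"))
```

The first include path keeps `gs://my-bucket/foo/tests/` as its fixed prefix, while the second only keeps `gs://my-bucket/`, so it would require scanning much more of the bucket.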
:::caution
If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs.
:::