### Path Specs

**Example - Dataset per file**

Bucket structure:

```
test-gs-bucket
├── employees.csv
└── food_items.csv
```

Path specs config:

```
path_specs:
  - include: gs://test-gs-bucket/*.csv
```

**Example - Datasets with partitions**

Bucket structure:

```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── returns
    └── year=2021
        └── month=2
            └── 1.parquet
```

Path specs config:

```
path_specs:
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

**Example - Datasets with partition and exclude**

Bucket structure:

```
test-gs-bucket
├── orders
│   └── year=2022
│       └── month=2
│           ├── 1.parquet
│           └── 2.parquet
└── tmp_orders
    └── year=2021
        └── month=2
            └── 1.parquet
```

Path specs config:

```
path_specs:
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
    exclude:
      - "**/tmp_orders/**"
```

**Example - Datasets of mixed nature**

Bucket structure:

```
test-gs-bucket
├── customers
│   ├── part1.json
│   ├── part2.json
│   ├── part3.json
│   └── part4.json
├── employees.csv
├── food_items.csv
├── tmp_10101000.csv
└── orders
    └── year=2022
        └── month=2
            ├── 1.parquet
            ├── 2.parquet
            └── 3.parquet
```

Path specs config:

```
path_specs:
  - include: gs://test-gs-bucket/*.csv
    exclude:
      - "**/tmp_10101000.csv"
  - include: gs://test-gs-bucket/{table}/*.json
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet
```

**Valid path_specs.include**

```python
gs://my-bucket/foo/tests/bar.avro # single file table
gs://my-bucket/foo/tests/*.* # multiple file-level tables
gs://my-bucket/foo/tests/{table}/*.avro # table without partition
gs://my-bucket/foo/tests/{table}/*/*.avro # table where partitions are not specified
gs://my-bucket/foo/tests/{table}/*.* # table where neither partition nor data type is specified
gs://my-bucket/{dept}/tests/{table}/*.avro # specifying keywords to be used in display name
gs://my-bucket/{dept}/tests/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.avro # specify partition key and value format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.avro # specify partition value only format
gs://my-bucket/{dept}/tests/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # for all extensions
gs://my-bucket/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 2 levels down in the bucket
gs://my-bucket/*/*/{table}/{partition[0]}/{partition[1]}/{partition[2]}/*.* # table is present 3 levels down in the bucket
```

**Valid path_specs.exclude**

- \*\*/tests/\*\*
- gs://my-bucket/hr/\*\*
- \*\*/tests/\*.csv
- gs://my-bucket/foo/\*/my_table/\*\*

**Notes**

- {table} represents a folder for which a dataset will be created.
- The include path must end with \*.\* or \*.[ext] to represent the leaf level.
- If \*.[ext] is provided, only files of the specified type will be scanned.
- /\*/ represents a single folder.
- {partition[i]} represents the value of a partition.
- {partition_key[i]} represents the name of a partition.
- While extracting, "i" is used to match each partition_key to its partition.
- All folder levels need to be specified in the include path. Only the exclude path can use \*\*-style matching.
- The exclude path cannot contain named variables ({}).
- Folder names should not contain {, }, \*, or /.
- {folder} is reserved for internal use; please do not use it as a named variable.
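As a worked illustration of the placeholders above (a sketch reusing the hypothetical `orders` layout from the partition example), the include below matches the object `gs://test-gs-bucket/orders/year=2022/month=2/1.parquet` and resolves the named variables as shown in the comments:

```
path_specs:
  - include: gs://test-gs-bucket/{table}/{partition_key[0]}={partition[0]}/{partition_key[1]}={partition[1]}/*.parquet

# For the object gs://test-gs-bucket/orders/year=2022/month=2/1.parquet:
#   {table}            -> orders   (a dataset named "orders" is created)
#   {partition_key[0]} -> year     {partition[0]} -> 2022
#   {partition_key[1]} -> month    {partition[1]} -> 2
```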
If you would like to write a more complicated function for resolving file names, then a transformer would be a good fit.

:::caution
Specify as long a fixed prefix (without /\*/) as possible in `path_specs.include`. This reduces scanning time and cost, specifically on Google Cloud Storage.
:::

:::caution
If you are ingesting datasets from Google Cloud Storage, we recommend running the ingestion on a server in the same region to avoid high egress costs.
:::
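To illustrate the fixed-prefix caution above, here is a minimal sketch; the bucket and folder names are taken from the examples in this section and are only illustrative. If your tables all live under a known prefix such as `foo/tests`, prefer the second form: it restricts object listing to that prefix instead of scanning the whole bucket.

```
path_specs:
  # Wildcarded prefix: the connector must list every folder two levels deep in the bucket.
  # - include: gs://my-bucket/*/*/{table}/*.avro

  # Fixed prefix: listing is limited to objects under gs://my-bucket/foo/tests/.
  - include: gs://my-bucket/foo/tests/{table}/*.avro
```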