unstructured/examples/ingest/airtable/ingest.sh

#!/usr/bin/env bash

# Processes all the documents in all bases (in all workspaces) within an Airtable org,
# using the `unstructured` library.

# Structured outputs are stored in airtable-ingest-output

# Run from the repository root so that the relative paths below resolve.
SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/../../.. || exit 1

# Required arguments:
# --personal-access-token
#   --> Personal access token used to authenticate with Airtable.
#       Check https://support.airtable.com/docs/creating-and-using-api-keys-and-access-tokens for more info.

# Optional arguments that you can use:
# --list-of-paths
#   --> A list of paths that specify the locations to ingest data from within Airtable.
#       If this argument is not set, the connector ingests all tables within each and every base.
#   --list-of-paths: path1 path2 path3 ….
#   path: base_id/table_id(optional)/view_id(optional)/

#     To obtain (base, table, view) ids in bulk, check:
#     https://airtable.com/developers/web/api/list-bases (base ids)
#     https://airtable.com/developers/web/api/get-base-schema (table and view ids)
#     https://pyairtable.readthedocs.io/en/latest/metadata.html (base, table and view ids)

#     To obtain specific ids from Airtable UI, go to your workspace, and copy any
#     relevant id from the URL structure:
#     https://airtable.com/appAbcDeF1ghijKlm/tblABcdEfG1HIJkLm/viwABCDEfg6hijKLM
#     appAbcDeF1ghijKlm -> base_id
#     tblABcdEfG1HIJkLm -> table_id
#     viwABCDEfg6hijKLM -> view_id

#     You can also check: https://support.airtable.com/docs/finding-airtable-ids

#     Examples of valid --list-of-paths entries:
#         base1/              → gets the entirety of all tables inside base1
#         base1/table1        → gets all rows and columns within table1 in base1
#         base1/table1/view1  → gets the rows and columns that are visible in view1 for table1 in base1

#     Examples of invalid paths:
#         table1              → must include the base to be valid
#         base1/view1         → must include the table to be valid
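
# The path rules above can be sketched as a small pre-check for --list-of-paths entries.
# Assumptions: is_valid_airtable_path is a hypothetical helper, not part of the
# connector, and ids are assumed to carry the app/tbl/viw prefixes seen in the
# example URL above. The connector's own validation may differ.
is_valid_airtable_path() {
  # Drop one optional trailing slash, then require base[/table[/view]].
  printf '%s\n' "${1%/}" |
    grep -Eq '^app[A-Za-z0-9]+(/tbl[A-Za-z0-9]+(/viw[A-Za-z0-9]+)?)?$'
}
# Usage: is_valid_airtable_path "appAbcDeF1ghijKlm/tblABcdEfG1HIJkLm" || echo "invalid path" >&2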

PYTHONPATH=. ./unstructured/ingest/main.py \
  airtable \
  --metadata-exclude filename,file_directory,metadata.data_source.date_processed \
  --personal-access-token "$AIRTABLE_PERSONAL_ACCESS_TOKEN" \
  --output-dir airtable-ingest-output \
  --num-processes 2 \
  --reprocess