unstructured/examples/ingest/biomed/ingest-with-path.sh

#!/usr/bin/env bash

# Processes a biomedical PDF from the NCBI PMC FTP directory
# through Unstructured's library in 2 processes.

# Structured outputs are stored in biomed-ingest-output-path/

# Biomedical documents can be fetched in one of two ways; this script uses the FTP directory approach.

# The supported FTP directory is:
# https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf

# Providing a path downloads the documents found under it.
# For example, to download the documents under https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/07/,
# the path to pass is oa_pdf/07/
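
# A minimal sketch of the URL-to-path mapping described above, assuming the
# full URL always starts with https://ftp.ncbi.nlm.nih.gov/pub/pmc/.
# The helper name biomed_path_from_url is hypothetical: it is not part of
# unstructured, and this script defines it for illustration without calling it.
biomed_path_from_url() {
  local url="$1"
  local prefix="https://ftp.ncbi.nlm.nih.gov/pub/pmc/"
  # Strip the FTP host prefix, leaving the value to pass to --path.
  printf '%s\n' "${url#"$prefix"}"
}
# Usage (prints the --path value for the directory in the example above):
#   biomed_path_from_url "https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/07/"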

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)
cd "$SCRIPT_DIR"/../../.. || exit 1

# The example below will ingest the PDF from the "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf" path.

# You can ingest all the documents in the "oa_pdf/07/07" path by passing "oa_pdf/07/07" instead.
# WARNING: There are many documents in that path.

PYTHONPATH=. ./unstructured/ingest/main.py \
  biomed \
  --path "oa_pdf/07/07/sbaa031.073.PMC7234218.pdf" \
  --output-dir biomed-ingest-output-path \
  --num-processes 2 \
  --verbose \
  --preserve-downloads

# Alternatively, you can call it using:
# unstructured-ingest --biomed-path ...