unstructured/test_unstructured_ingest/src/jira.sh

#!/usr/bin/env bash
set -e

# Description: This test ingests issues from a Jira Cloud test instance and checks that the processed content matches the expected outputs
SRC_PATH=$(dirname "$(realpath "$0")")
SCRIPT_DIR=$(dirname "$SRC_PATH")
cd "$SCRIPT_DIR"/.. || exit 1

OUTPUT_FOLDER_NAME=jira-diff
OUTPUT_ROOT=${OUTPUT_ROOT:-$SCRIPT_DIR}
OUTPUT_DIR=$OUTPUT_ROOT/structured-output/$OUTPUT_FOLDER_NAME
WORK_DIR=$OUTPUT_ROOT/workdir/$OUTPUT_FOLDER_NAME
DOWNLOAD_DIR=$OUTPUT_ROOT/download/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
CI=${CI:-"false"}

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
  cleanup_dir "$OUTPUT_DIR"
  cleanup_dir "$WORK_DIR"
  if [ "$CI" == "true" ]; then
    cleanup_dir "$DOWNLOAD_DIR"
  fi
}
trap cleanup EXIT

if [ -z "$JIRA_INGEST_USER_EMAIL" ] || [ -z "$JIRA_INGEST_API_TOKEN" ]; then
  echo "Skipping Jira ingest test because the JIRA_INGEST_USER_EMAIL or JIRA_INGEST_API_TOKEN env var is not set."
  exit 8
fi

# Required arguments:
# --url
#   --> Atlassian (Jira) domain URL
# --api-token
#   --> API token to authenticate into Atlassian (Jira).
#       Check https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/ for more info.
# --user-email
#   --> User email for the domain, such as xyz@unstructured.io

# Optional arguments:
# --projects
#   --> Comma-separated project ids or keys
# --boards
#   --> Comma-separated board ids or keys
# --issues
#   --> Comma-separated issue ids or keys

# Note: When any of the optional arguments are provided, the connector will ingest only those components, and nothing else.
#       When none of the optional arguments are provided, all issues in all projects will be ingested.

RUN_SCRIPT=${RUN_SCRIPT:-./unstructured/ingest/main.py}
PYTHONPATH=${PYTHONPATH:-.} "$RUN_SCRIPT" \
  jira \
  --download-dir "$DOWNLOAD_DIR" \
  --metadata-exclude filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
  --num-processes "$max_processes" \
  --preserve-downloads \
  --reprocess \
  --output-dir "$OUTPUT_DIR" \
  --verbose \
  --url https://unstructured-jira-connector-test.atlassian.net \
  --user-email "$JIRA_INGEST_USER_EMAIL" \
  --api-token "$JIRA_INGEST_API_TOKEN" \
  --projects "JCTP3" \
  --boards "1" \
  --issues "JCTP2-4,JCTP2-7,JCTP2-8,10012,JCTP2-11" \
  --work-dir "$WORK_DIR"

"$SCRIPT_DIR"/check-diff-expected-output.sh "$OUTPUT_FOLDER_NAME"