unstructured/test_unstructured_ingest/test-ingest-jira.sh

#!/usr/bin/env bash
set -e

# Description: This test checks if all the processed content is the same as the expected outputs
SCRIPT_DIR=$(dirname "$(realpath "$0")")
cd "$SCRIPT_DIR"/.. || exit 1

OUTPUT_FOLDER_NAME=jira-diff
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
DOWNLOAD_DIR=$SCRIPT_DIR/download/$OUTPUT_FOLDER_NAME
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
CI=${CI:-"false"}

# shellcheck disable=SC1091
source "$SCRIPT_DIR"/cleanup.sh
function cleanup() {
  cleanup_dir "$OUTPUT_DIR"
  if [ "$CI" == "true" ]; then
    cleanup_dir "$DOWNLOAD_DIR"
  fi
}
trap cleanup EXIT

if [ -z "$JIRA_INGEST_USER_EMAIL" ] || [ -z "$JIRA_INGEST_API_TOKEN" ]; then
  echo "Skipping Jira ingest test because the JIRA_INGEST_USER_EMAIL or JIRA_INGEST_API_TOKEN env var is not set."
  exit 0
fi

# Required arguments:
# --url
#   --> Atlassian (Jira) domain URL
# --api-token
#   --> API token to authenticate into Atlassian (Jira).
#       Check https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/ for more info.
# --user-email
#   --> User email for the domain, such as xyz@unstructured.io

# Optional arguments:
# --projects
#     --> Comma-separated project ids or keys
# --boards
#     --> Comma-separated board ids or keys
# --issues
#     --> Comma-separated issue ids or keys

# Note: When any of the optional arguments are provided, the connector ingests only those components and nothing else.
#       When none of the optional arguments are provided, all issues in all projects are ingested.
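#
# Example (an illustrative sketch only, not executed by this test): to restrict ingestion to
# two specific issues, pass only the issues option and omit the project and board options.
# The issue keys and flags are copied from the invocation below; the remaining flags of that
# invocation are assumed to stay unchanged.
#   PYTHONPATH=. ./unstructured/ingest/main.py \
#     jira \
#     --url https://unstructured-jira-connector-test.atlassian.net \
#     --user-email "$JIRA_INGEST_USER_EMAIL" \
#     --api-token "$JIRA_INGEST_API_TOKEN" \
#     --output-dir "$OUTPUT_DIR" \
#     --issues "JCTP2-4,JCTP2-7"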

PYTHONPATH=. ./unstructured/ingest/main.py \
        jira \
        --download-dir "$DOWNLOAD_DIR" \
        --metadata-exclude filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
        --num-processes "$max_processes" \
        --preserve-downloads \
        --reprocess \
        --output-dir "$OUTPUT_DIR" \
        --verbose \
        --url https://unstructured-jira-connector-test.atlassian.net \
        --user-email "$JIRA_INGEST_USER_EMAIL" \
        --api-token "$JIRA_INGEST_API_TOKEN" \
        --projects "JCTP3" \
        --boards "1" \
        --issues "JCTP2-4,JCTP2-7,JCTP2-8,10012,JCTP2-11"


"$SCRIPT_DIR"/check-diff-expected-output.sh "$OUTPUT_FOLDER_NAME"