feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
#!/usr/bin/env bash
|
|
|
|
set -e
|
|
|
|
|
|
|
|
# Description: This test checks if all the processed content is the same as the expected outputs
|
|
|
|
SCRIPT_DIR=$(dirname "$(realpath "$0")")
|
|
|
|
cd "$SCRIPT_DIR"/.. || exit 1
|
|
|
|
|
|
|
|
OUTPUT_FOLDER_NAME=jira-diff
|
|
|
|
OUTPUT_DIR=$SCRIPT_DIR/structured-output/$OUTPUT_FOLDER_NAME
|
|
|
|
DOWNLOAD_DIR=$SCRIPT_DIR/download/$OUTPUT_FOLDER_NAME
|
2023-09-28 21:48:19 -07:00
|
|
|
max_processes=${MAX_PROCESSES:=$(python3 -c "import os; print(os.cpu_count())")}
|
2023-10-02 16:47:24 -04:00
|
|
|
CI=${CI:-"false"}
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
|
2023-09-21 14:51:08 -04:00
|
|
|
# shellcheck disable=SC1091
|
|
|
|
source "$SCRIPT_DIR"/cleanup.sh
|
2023-10-02 16:47:24 -04:00
|
|
|
function cleanup() {
|
|
|
|
cleanup_dir "$OUTPUT_DIR"
|
|
|
|
if [ "$CI" == "true" ]; then
|
|
|
|
cleanup_dir "$DOWNLOAD_DIR"
|
|
|
|
fi
|
|
|
|
}
|
|
|
|
trap cleanup EXIT
|
2023-09-21 14:51:08 -04:00
|
|
|
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
if [ -z "$JIRA_INGEST_USER_EMAIL" ] || [ -z "$JIRA_INGEST_API_TOKEN" ]; then
|
|
|
|
echo "Skipping Jira ingest test because the JIRA_INGEST_USER_EMAIL or JIRA_INGEST_API_TOKEN env var is not set."
|
|
|
|
exit 0
|
|
|
|
fi
|
|
|
|
|
|
|
|
# Required arguments:
|
|
|
|
# --url
|
|
|
|
# --> Atlassian (Jira) domain URL
|
|
|
|
# --api-token
|
|
|
|
# --> Api token to authenticate into Atlassian (Jira).
|
|
|
|
# Check https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/ for more info.
|
|
|
|
# --user-email
|
|
|
|
# --> User email for the domain, such as xyz@unstructured.io
|
|
|
|
|
|
|
|
# Optional arguments:
|
|
|
|
# --list-of-projects
|
2023-09-11 11:40:56 -04:00
|
|
|
# --> Comma separated project ids or keys
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
# --list-of-boards
|
2023-09-11 11:40:56 -04:00
|
|
|
# --> Comma separated board ids or keys
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
# --list-of-issues
|
2023-09-11 11:40:56 -04:00
|
|
|
# --> Comma separated issue ids or keys
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
|
|
|
|
# Note: When any of the optional arguments are provided, connector will ingest only those components, and nothing else.
|
|
|
|
# When none of the optional arguments are provided, all issues in all projects will be ingested.
|
|
|
|
|
|
|
|
PYTHONPATH=. ./unstructured/ingest/main.py \
|
|
|
|
jira \
|
|
|
|
--download-dir "$DOWNLOAD_DIR" \
|
Feat: Create a naive hierarchy for elements (#1268)
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.
### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset
### How it works
Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:
1. Title, "This is the Title"
2. Text, "this is the text"
And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.
### Schema Changes
```
@dataclass
class ElementMetadata:
coordinates: Optional[CoordinatesMetadata] = None
data_source: Optional[DataSourceMetadata] = None
filename: Optional[str] = None
file_directory: Optional[str] = None
last_modified: Optional[str] = None
filetype: Optional[str] = None
attached_to_filename: Optional[str] = None
+ parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+ category_depth: Optional[int] = None
...
```
### Testing
```
from unstructured.partition.auto import partition
from typing import List
elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")
for element in elements:
print(
f"Category: {getattr(element, 'category', '')}\n"\
f"Text: {getattr(element, 'text', '')}\n"
f"ID: {element.id}\n" \
f"Parent ID: {element.metadata.parent_id}\n"\
f"Depth: {element.metadata.category_depth}\n" \
)
```
### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.
I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.
---------
Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
|
|
|
--metadata-exclude filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth \
|
2023-09-26 17:06:53 -04:00
|
|
|
--num-processes "$max_processes" \
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
--preserve-downloads \
|
|
|
|
--reprocess \
|
2023-09-11 11:40:56 -04:00
|
|
|
--output-dir "$OUTPUT_DIR" \
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
--verbose \
|
|
|
|
--url https://unstructured-jira-connector-test.atlassian.net \
|
|
|
|
--user-email "$JIRA_INGEST_USER_EMAIL" \
|
|
|
|
--api-token "$JIRA_INGEST_API_TOKEN" \
|
2023-09-11 11:40:56 -04:00
|
|
|
--projects "JCTP3" \
|
|
|
|
--boards "1" \
|
|
|
|
--issues "JCTP2-4,JCTP2-7,JCTP2-8,10012,JCTP2-11"
|
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
- either all issues in all projects in a Jira Cloud Organization
- or
- issues in user specified projects, boards
- user specified issues
- processes this kind of data:
- text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
- notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263
To test the changes, make the necessary setups and run the relevant
ingest test scripts.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 13:10:48 +03:00
|
|
|
|
|
|
|
|
|
|
|
|
2023-09-21 14:51:08 -04:00
|
|
|
"$SCRIPT_DIR"/check-diff-expected-output.sh $OUTPUT_FOLDER_NAME
|