11 Commits

Author SHA1 Message Date
cragwolfe
44f5605ef3
build(image): call python3 not python for image compat (#1574)
Fixes 

docker exec unstructured-smoke-test /bin/bash -c
/home/notebook-user/test_unstructured_ingest/test-ingest-wikipedia.sh
/home/notebook-user/test_unstructured_ingest/test-ingest-wikipedia.sh:
line 10: python: command not found

in
https://github.com/Unstructured-IO/unstructured/blob/6ad4971/scripts/docker-smoke-test.sh#L43
that was preventing docker images from being built.
2023-09-28 21:48:19 -07:00
Roman Isecke
81af879038
roman/increase ingest tests num processes (#1500)
### Description
In an effort to speed up the ingest tests, bumping the num if processes
to the max on the system for each
2023-09-26 16:06:53 -05:00
Roman Isecke
e88f7d9eab
chore: ingest test file cleanup (#1366) 2023-09-21 11:51:08 -07:00
Newel H
cd704e873b
Feat: Create a naive hierarchy for elements (#1268)
## **Summary**
By adding hierarchy to unstructured elements, users will have more
information for implementing vector db/LLM chunking strategies. For
example, text elements could be queried by their preceding title
element. The hierarchy is implemented by a parent_id tag in the
element's metadata.

### Features
- Introduces a parent_id to ElementMetadata (The id of the parent
element, not a pointer)
- Creates a rule set for assigning hierarchies. Sensible default is
assigned, with an optional override parameter
- Sets element parent ids if there isn't an existing parent id or
matches the ruleset

### How it works

Hierarchies are assigned via a parent id field in element metadata.
Elements are read sequentially and evaluated against a ruleset. For
example take the following elements:

1. Title, "This is the Title"
2. Text, "this is the text"

And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of
2 will be the id of 1. The algorithm for determining this is more
complex and resolves several edge cases, so please read the code for
further details.

### Schema Changes

```
@dataclass
class ElementMetadata:
      coordinates: Optional[CoordinatesMetadata] = None
      data_source: Optional[DataSourceMetadata] = None
      filename: Optional[str] = None
      file_directory: Optional[str] = None
      last_modified: Optional[str] = None
      filetype: Optional[str] = None
      attached_to_filename: Optional[str] = None
+     parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None
+     category_depth: Optional[int] = None

...
```

### Testing
```
from unstructured.partition.auto import partition
from typing import List

elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto")

for element in elements:
    print(
        f"Category:  {getattr(element, 'category', '')}\n"\
        f"Text:      {getattr(element, 'text', '')}\n"
        f"ID:        {element.id}\n" \
        f"Parent ID: {element.metadata.parent_id}\n"\
        f"Depth:     {element.metadata.category_depth}\n" \
    )
```

### Additional Notes
Implementing this feature revealed a possibly undesired side-effect in
how element metadata are processed. In
`unstructured/partition/common.py` the `_add_element_metadata` is
invoked as part of the `add_metadata_with_filetype` decorator for
filetype partitioning. This method is intended to add additional
information to the metadata generated with the element including
filename and filetype, however the existing metadata is merged into a
newly created metadata object rather than the other way around. Because
of the way it's structured, new metadata fields can easily be forgotten
and pose debugging challenges to developers. This likely warrants a new
issue.

I'm guessing that the implementation is done this way to avoid issues
with deserializing elements, but could be wrong.

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-09-14 11:23:16 -04:00
Roman Isecke
59e850bbd9
Roman/downstream connector cli subcommand (#1302)
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.

Closes #1313 and #1311
2023-09-11 11:40:56 -04:00
pravin-unstructured
8641fe39dc
Add Model Probabilities to Hi-Res strategy MetaData for Images + PDFs. (#1323)
If a layout model is used from unstructured-inference, you get back
class probabilities in the element metadata from partition.
extra-pdf-image-in in requirements already has the newest version of
unstructured-inference in there without a pinned version. Is there any
place else that the unstructured-inference version needs to be updated
to the required release version, 0.5.22?
2023-09-07 22:56:43 -04:00
Yuming Long
b4fe40e484
Chore[ingest]: adding parameter --partition-pdf-infer-table-structure (#1056)
* add param

* expected test

* add option (to do doc nit)

* test with api for now

* typo

* test with api key

* use local only

* encoding -> partition-encoding

* changelog and version

* Update ingest test fixtures (#1055)

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>

* ignore coordinates

* no witespace lol

* Update ingest test fixtures (#1061)

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-08-08 18:11:06 -04:00
cragwolfe
13d3559fa4
chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
2023-08-01 02:55:43 +00:00
Roman Isecke
28214a6cc3
Roman/ingest refactor (#978)
* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Seperate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-31 13:20:10 -04:00
Matt Robinson
d9aed66b65
feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv patition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name onlyif possible

* feat: add csv tests

* fix: renaming

* feat: add filed metadata_date  as date of last mod

* feat: add tests for partition_docx

* feat: add filed metadata_date  to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date  to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coorrdinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst patrition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-07-26 15:10:14 -04:00
Jason Scheirer
196efa09b1
chore: Add encoding param to ingest (#955)
* Add encoding param to ingest
2023-07-24 10:06:13 -07:00