929 Commits

Author SHA1 Message Date
fran-unstructured
26da51c765
docs: Add source code links to bricks' docs (#923)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-07-13 17:27:47 +00:00
Matt Robinson
9b830693bd
fix: adds to list of extensions to check if a file has a plain text MIME type (#916)
* added .txt, .text, and .tab to text file list

* changelog and version
2023-07-12 20:07:43 +00:00
fran-unstructured
f7b3c0f741
docs: adds connectors' documentation (#917)
* Add connectors documentation

* Add connectors documentation with corrections and index.rst update

* Add connectors documentation - add API information

---------

Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-12 14:56:09 -04:00
dependabot[bot]
f490b82d5b
build(deps): bump praw from 7.7.0 to 7.7.1 in /requirements (#922)
Bumps [praw](https://github.com/praw-dev/praw) from 7.7.0 to 7.7.1.
- [Release notes](https://github.com/praw-dev/praw/releases)
- [Changelog](https://github.com/praw-dev/praw/blob/master/CHANGES.rst)
- [Commits](https://github.com/praw-dev/praw/compare/v7.7.0...v7.7.1)

---
updated-dependencies:
- dependency-name: praw
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:19 -04:00
dependabot[bot]
a7d6edc528
build(deps): bump google-api-python-client in /requirements (#921)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.92.0 to 2.93.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.92.0...v2.93.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:03 -04:00
Matt Robinson
a583d47b84
docs: update table and API documentation (#919)
* more detailed api docs

* add table docs

* remove rtf/epubs comment

* remove confusing request_kwargs verbiage

* add missing a
2023-07-12 12:59:59 -04:00
dependabot[bot]
1fa944ec87
build(deps): bump black from 23.3.0 to 23.7.0 in /requirements (#920)
Bumps [black](https://github.com/psf/black) from 23.3.0 to 23.7.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/23.3.0...23.7.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 15:30:57 +00:00
dependabot[bot]
80bdd60b32
build(deps): bump protobuf from 3.20.3 to 4.23.4 in /requirements (#910)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.3 to 4.23.4.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v3.20.3...v4.23.4)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-12 10:41:02 -04:00
Roman Isecke
8b233b4f62
Set the version to 0.8.1 (#914) 2023-07-11 10:27:54 -04:00
Emily Chen
2635b0be07
Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913) 2023-07-10 21:47:41 -07:00
Matt Robinson
b3936893b8
build: add python 3.11 to CI (#908)
* remove argilla; bump reqs

* enable py 3.11

* add 3.11 to setup.py

* make pip-compile

* ignore cli mypy errors

* install argilla

* fix constraints

* install argilla

* changelog and version

* skip argilla in docker

* dont import argilla in docker

* skip all of argilla if in container

* only import argilla if outside docker

* more docker skips

* remove weird pypi settings
2023-07-10 18:52:25 +00:00
Trevor Bossert
66f2d4b280
Add both arm and amd builds to manifests (#899) 2023-07-10 10:15:15 -07:00
John
6173362620
fix: detect list items in MS Word documents (#909)
* fix merge conflict

* update changelog and version
2023-07-10 15:29:08 +00:00
qued
79f734d3f9
fix: better extractable check (#900)
The auto strategy was choosing the fast strategy when the PDF contents were just a flat image, resulting in no output. This PR changes auto so that it first extracts whatever elements fast can extract, then makes a cursory check for elements that contain text. If any are found, those elements are used as the output; otherwise, fallback strategies come into play.
2023-07-07 23:41:37 -05:00
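Note on #900: the fallback logic described in that message can be sketched roughly as below. This is only an illustration under stated assumptions, not the library's code: `has_text`, `choose_elements`, `extract_fast`, and `run_fallback` are hypothetical names, and elements are modeled as plain strings.

```python
from typing import Callable, List


def has_text(elements: List[str]) -> bool:
    """Cursory check: does any extracted element contain non-whitespace text?"""
    return any(element.strip() for element in elements)


def choose_elements(
    extract_fast: Callable[[], List[str]],
    run_fallback: Callable[[], List[str]],
) -> List[str]:
    """Use the fast extraction when it yields text, otherwise fall back."""
    elements = extract_fast()
    if has_text(elements):
        return elements
    # Nothing usable was extracted (e.g. the page is a flat image).
    return run_fallback()
```

Passing the two strategies in as callables keeps the sketch independent of any particular PDF backend.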
Matt Robinson
f51ae45050
fix: grab all metadata fields in convert_to_dataframe (#893)
* add all fieldnames to dataframe

* drop empty columns in convert_to_dataframe

* test for maintaining metadata

* version and changelog
2023-07-07 20:04:35 +00:00
dependabot[bot]
c8e6f0e141
build(deps): bump elasticsearch from 8.8.0 to 8.8.2 in /requirements (#898)
Bumps [elasticsearch](https://github.com/elastic/elasticsearch-py) from 8.8.0 to 8.8.2.
- [Release notes](https://github.com/elastic/elasticsearch-py/releases)
- [Commits](https://github.com/elastic/elasticsearch-py/compare/v8.8.0...v8.8.2)

---
updated-dependencies:
- dependency-name: elasticsearch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 19:19:45 +00:00
dependabot[bot]
05d51cfb4f
build(deps): bump ruff from 0.0.275 to 0.0.277 in /requirements (#897)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.0.275 to 0.0.277.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/astral-sh/ruff/compare/v0.0.275...v0.0.277)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:52 -04:00
dependabot[bot]
7f9532f8b3
build(deps): bump lxml from 4.9.2 to 4.9.3 in /requirements (#896)
Bumps [lxml](https://github.com/lxml/lxml) from 4.9.2 to 4.9.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.2...lxml-4.9.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:05 -04:00
dependabot[bot]
1619a0b6a2
build(deps): bump google-api-python-client in /requirements (#895)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.91.0 to 2.92.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.91.0...v2.92.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 13:50:25 -04:00
Roman Isecke
5e1150184c
Add optional param for model name when partitioning pdfs (#890)
* Add optional param for model name when partitioning pdfs

* Pull in latest inference changes

* Fix linting
0.8.0
2023-07-07 11:16:55 -04:00
Christine Straub
47bc4009a8
fix: adjust threshold for encoding detection (#894)
* chore: add example doc

* fix: adjust encoding recognition threshold value in `detect_file_encoding`

* test: add test cases for German characters

* chore: update changelog & version
2023-07-07 09:25:03 -04:00
Matt Robinson
52aced8677
fix: validate encodings from email headers (#881)
* add validate encoding function

* remove extraneous file

* added test case for malformed encoding

* version and changelog
2023-07-06 13:49:27 +00:00
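Note on #881: one plausible way to validate an encoding name taken from an email header is to ask Python's codec registry whether it knows the name. The helper below is a hypothetical sketch; the commit does not show how the actual validation function is written.

```python
import codecs


def validate_encoding(encoding: str) -> bool:
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Unknown or malformed encoding name.
        return False
    return True
```

A caller could then fall back to a default encoding, or to detection, whenever the header value fails this check.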
cragwolfe
209054f0db
build(image): revert docker build tweak for arm64 (#887)
arm64 images (and amd64 ones) are now building again in CI 😐.
2023-07-06 06:46:40 +00:00
Ahmet Melek
4b827f0793
fix: local connector output filename when a single file is being processed (#879)
* fix string processing error for _output_filename

* Add docstring and type hint, update CHANGELOG, update version

* update test fixture

* simple code change commit to retrigger ci checks

* update test fixture - after brew install tesseract-lang

* Update ingest test fixtures (#882)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* correct CHANGELOG

* correct CHANGELOG

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-05 14:37:40 -07:00
Nathan Chappell
24dad24f87
chore: changed type IO to IO[bytes] (#878)
Co-authored-by: Nathan Chappell <nchappell@mono.software>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-05 16:37:31 -04:00
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
Austin Walker
8d2e7c0746
fix: Fix KeyError in isd_to_elements (#876) 2023-07-05 19:09:18 +00:00
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827) 2023-07-05 11:25:11 -07:00
Johnny Lim
6ec177e7c6
Add a missing space in a warning message in filetype.py (#873)
Adds a missing space in a warning message in the filetype.py file.
2023-07-01 20:54:39 +00:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00
cragwolfe
cb2866b159
build(image): docker build tweak for arm64 (#871)
Fixes issue where arm64 docker builds were failing and preventing images from being published.
2023-06-30 20:49:31 -07:00
Trevor Bossert
6249e1553e
New base image with security patches (#869)
* New base image with security patches

* Bump version

* remove line from changelog

not code related
0.7.12
2023-06-30 19:14:06 -07:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844) 2023-06-30 17:08:27 -07:00
John
e9fdbb0943
feat: add include_metadata across all partition functions (#853)
* add include_metadata kwarg and tests to parsers

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse

* WIP add include_metadata to partition_pdf

* add include_metadata tests to partition_pdf

* make tidy/check

* update changelog and version

* change test asserts and move docstring logic to process_metadata

* make tidy

* fix tests asserts

* linting, linting, linting

* sync versions

* skip api call test not on main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-30 10:44:46 -04:00
qued
350bb1dad5
enhancement: clean pdf elements (bump unstructured-inference) (#790)
More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructured.io>
0.7.11
2023-06-29 18:35:06 -07:00
ryannikolaidis
642562beb5
fix: skip test with api call when run outside CI (#862) 2023-06-30 00:47:51 +00:00
ryannikolaidis
62e20442df
chore: refactor ingest tests (#814)
- Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth
- Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory
- Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository
- Constructs paths as reusable variables declared at the top of scripts
- Sorts the order of flags for ingest calls across all tests (this makes them easier to parse at a glance)
- OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind
- Bonus: don't check/exit on the expected number of outputs when OVERWRITE_FIXTURES is true
- Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)
2023-06-29 23:13:41 +00:00
Matt Robinson
c581a33c8a
feat: attachment processing for emails (#855)
* process attachments for email

* add attachment processing to msg

* fix up metadata for attachments

* add test for processing email attachments

* added test for processing msg attachments

* update docs

* tests for error conditions

* version and changelog
2023-06-29 18:01:12 -04:00
ryannikolaidis
92e55eb89e
fix: add api key to image publishing workflow tests (#854) 2023-06-29 20:46:07 +00:00
ryannikolaidis
4f891fff63
fix: ingest download skipping (#847) 2023-06-29 20:04:37 +00:00
ryannikolaidis
8ea5f6939e
fix: parameterized ingest test overwriting (#838)
* sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh
* fixes incorrect expected results
* update expected results to properly parse Korean text
* bonus: installs language pack for Korean in CI and ingest fixture workflows
2023-06-29 18:37:09 +00:00
ryannikolaidis
60fe231f08
fix: use api key where needed in tests (#843)
* passes api key for unstructured-api to unit and ingest tests as needed.
* adds check for env var CI to otherwise skip tests that require an api key
2023-06-29 17:31:01 +00:00
Roman Isecke
9882c2b83f
Avoid setting metadata in constructor signature for elements (#837)
Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification).

Bonus refactor for PageBreak to have text values of "".

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-29 03:14:05 +00:00
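Note on #837: the "unexpected object reuse" mentioned there is the usual Python pitfall of mutable default arguments, which are evaluated once at function definition time. The sketch below uses two hypothetical classes and a plain dict in place of the library's metadata object, purely to illustrate the effect.

```python
from typing import Dict, Optional


class SharedDefault:
    # The default dict is created once, when the function is defined,
    # so every instance that relies on it shares the same object.
    def __init__(self, metadata: Dict[str, str] = {}):
        self.metadata = metadata


class FreshDefault:
    # Using None as the sentinel creates a new dict per instance.
    def __init__(self, metadata: Optional[Dict[str, str]] = None):
        self.metadata = {} if metadata is None else metadata


a, b = SharedDefault(), SharedDefault()
a.metadata["filename"] = "a.pdf"
shared_leak = b.metadata  # b sees the key written through a

c, d = FreshDefault(), FreshDefault()
c.metadata["filename"] = "c.pdf"
fresh_ok = d.metadata  # d keeps its own empty dict
```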
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
0.7.10
2023-06-28 15:26:01 -04:00
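Note on #818: a size limit of this kind could work along the following lines. The sketch assumes `max_partition` is a character count and that splitting prefers whitespace boundaries; `split_text` is a hypothetical helper, not the `split_by_paragraph` function mentioned in the message.

```python
from typing import List, Optional


def split_text(text: str, max_partition: Optional[int] = None) -> List[str]:
    """Split text into chunks of at most max_partition characters.

    Breaks on whitespace where possible; a single word longer than the
    limit is cut hard. None disables splitting.
    """
    if max_partition is None or len(text) <= max_partition:
        return [text]
    if max_partition < 1:
        raise ValueError("max_partition must be a positive integer")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-cut words that cannot fit in a chunk on their own.
        while len(word) > max_partition:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_partition])
            word = word[max_partition:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Note that this sketch normalizes runs of whitespace to single spaces when text is split; a real implementation may want to preserve the original spacing.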
Matt Robinson
38457777fa
fix: ignore escaped commas in CSV checks (#832)
* fix file content checking bug

* skip counting commas in quotes for csv detection

* add test for comma count

* change file content grab to -1

* version and changelog

* add csv to extension check

* add file to tests

* ingest-test-fixtures-update

* Update ingest test fixtures (#833)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix typo

* fix changelog wording

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 17:22:23 +00:00
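Note on #832: skipping commas inside quotes when counting delimiters can be sketched as below. `count_unquoted_commas` is a hypothetical helper written for illustration; the commit does not show how the actual check is implemented.

```python
def count_unquoted_commas(line: str) -> int:
    """Count commas that are not inside double-quoted fields."""
    in_quotes = False
    count = 0
    for char in line:
        if char == '"':
            # A doubled quote ("") toggles twice and leaves the state unchanged.
            in_quotes = not in_quotes
        elif char == "," and not in_quotes:
            count += 1
    return count
```

With this approach, a line such as `a,"b,c",d` counts as having two delimiters rather than three.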
Matt Robinson
06077b09ee
fix: don't detect line breaks as list items (#831)
* add negative lookahead to bullet pattern

* version and changelog

* update paragraph pattern

* add list item assert
2023-06-28 12:49:12 -04:00
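Note on #831: a negative lookahead lets a bullet pattern reject text where the bullet character is followed by something that rules out a list item. The pattern below is hypothetical and assumes the false positives are runs of bullet-like characters such as a `-----` separator; the actual pattern changed in the commit is not shown here.

```python
import re

# Hypothetical pattern: a bullet character at the start of the text,
# but only when it is NOT immediately followed by another "-" or "*".
BULLET = re.compile(r"^\s*[-*\u2022](?![-*])")


def looks_like_list_item(text: str) -> bool:
    """Return True when the text starts with a single bullet character."""
    return BULLET.match(text) is not None
```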
qued
773d9a4f37
feat: choose model (#824)
Added the ability to select the hi_res model via the environment variable UNSTRUCTURED_HI_RES_MODEL_NAME. The variable must be a string that matches a model name defined in unstructured_inference.

Also removed code related to the old unstructured_inference API, which has been removed from the currently pinned version of unstructured-inference and is no longer running as a service.
2023-06-28 04:06:08 +00:00
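Note on #824: reading the model name from the environment might look like the sketch below. Only the variable name UNSTRUCTURED_HI_RES_MODEL_NAME comes from the commit message; `resolve_model_name` and the rule that an explicit argument takes precedence over the environment are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "UNSTRUCTURED_HI_RES_MODEL_NAME"


def resolve_model_name(
    explicit: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the explicit name if given, else the env var value, else None."""
    if explicit:
        return explicit
    # An unset or empty variable means "use the library default".
    return environ.get(ENV_VAR) or None
```

Accepting the environment as a parameter makes the helper easy to exercise in tests without touching the real process environment.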
shreyanid
433d6af1bc
fix: format Arabic and Hebrew annotated encodings (#823)
* add modified arabic and hebrew encodings

* added calls to format_encoding_str so encoding is checked before use

* added formatting to detect_filetype()

* explicitly provided default value for null encoding parameter

* fixed format of annotated encodings list

* adding hebrew base64 test file

* small lint fixes

* update changelog

* bump version to -dev2
2023-06-27 18:15:02 -07:00
kravetsmic
58e988e110
feature(html partition): parse pre tag (#642)
* feature(html partition): parse pre tag

* chore: update CHANGELOG.md

* style: black format xml.py

* Added tests for html with pre tag

* remove skip test, update parse pre tag

* fix style

* chore: spell check

* chore: update changelog & version

* chore: update ingest test fixtures

* chore: add exception handling if `element.text` is `None` in `_read_xml`

* test: add more sanity testing on the `.text` content of the element(s)

* refactor: move the conditional logic for <pre> outside of the `try/except` block

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-06-27 18:52:39 +00:00
ryannikolaidis
078e2aa116
ci: fix arm build issue with docker driver (#810) 2023-06-26 22:39:00 +00:00