1109 Commits

Author SHA1 Message Date
Christine Straub
47bc4009a8
fix: adjust threshold for encoding detection (#894)
* chore: add example doc

* fix: adjust encoding recognition threshold value in `detect_file_encoding`

* test: add test cases for German characters

* chore: update changelog & version
2023-07-07 09:25:03 -04:00
Matt Robinson
52aced8677
fix: validate encodings from email headers (#881)
* add validate encoding function

* remove extraneous file

* added test case for malformed encoding

* version and changelog
2023-07-06 13:49:27 +00:00
cragwolfe
209054f0db
build(image): revert docker build tweak for arm64 (#887)
arm64 Images (and amd64 ones) now building again in CI 😐 .
2023-07-06 06:46:40 +00:00
Ahmet Melek
4b827f0793
fix: local connector output filename when a single file is being processed (#879)
* fix string processing error for _output_filename

* Add docstring and type hint, update CHANGELOG, update version

* update test fixture

* simple code change commit to retrigger ci checks

* update test fixture - after brew install tesseract-lang

* Update ingest test fixtures (#882)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* correct CHANGELOG

* correct CHANGELOG

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-05 14:37:40 -07:00
Nathan Chappell
24dad24f87
chore: changed type IO to IO[bytes] (#878)
Co-authored-by: Nathan Chappell <nchappell@mono.software>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-05 16:37:31 -04:00
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
Austin Walker
8d2e7c0746
fix: Fix KeyError in isd_to_elements (#876) 2023-07-05 19:09:18 +00:00
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827) 2023-07-05 11:25:11 -07:00
Johnny Lim
6ec177e7c6
Add a missing space in a warning message in filetype.py (#873)
Adds a missing space in a warning message in the filetype.py file.
2023-07-01 20:54:39 +00:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00
cragwolfe
cb2866b159
build(image): docker build tweak for arm64 (#871)
Fixes issue where arm64 docker builds were failing and preventing images from being published.
2023-06-30 20:49:31 -07:00
Trevor Bossert
6249e1553e
New base image with security patches (#869)
* New base image with security patches

* Bump version

* remove line from changelog

not code related
0.7.12
2023-06-30 19:14:06 -07:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844) 2023-06-30 17:08:27 -07:00
John
e9fdbb0943
feat: add include_metadata across all partition functions (#853)
* add include_metadata kwarg and tests to parsers

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse

* WIP add include_metadata to partition_pdf

* add include_metadata tests to partition_pdf

* make tidy/check

* update changelog and version

* change test asserts and move docstring logic to process_metadata

* make tidy

* fix tests asserts

* linting, linting, linting

* sync versions

* skip api call test not on main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-30 10:44:46 -04:00
qued
350bb1dad5
enhancement: clean pdf elements (bump unstructured-inference) (#790)
More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructured.io>
0.7.11
2023-06-29 18:35:06 -07:00
ryannikolaidis
642562beb5
fix: skip test with api call when run outside CI (#862) 2023-06-30 00:47:51 +00:00
ryannikolaidis
62e20442df
chore: refactor ingest tests (#814)
- Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth
- Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory
- Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository
- Construct paths as reusable variables declared at top of scripts
- Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance)
- OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind
- Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true
- Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)
2023-06-29 23:13:41 +00:00
Matt Robinson
c581a33c8a
feat: attachment processing for emails (#855)
* process attachments for email

* add attachment processing to msg

* fix up metadata for attachments

* add test for processing email attachments

* added test for processing msg attachments

* update docs

* tests for error conditions

* version and changelog
2023-06-29 18:01:12 -04:00
ryannikolaidis
92e55eb89e
fix: add api key to image publishing workflow tests (#854) 2023-06-29 20:46:07 +00:00
ryannikolaidis
4f891fff63
fix: ingest download skipping (#847) 2023-06-29 20:04:37 +00:00
ryannikolaidis
8ea5f6939e
fix: parameterized ingest test overwriting (#838)
* sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh
* fixes incorrect expected results
* update expected results to properly parse Korean text
* bonus: installs language pack for Korean in CI and ingest fixture workflows
2023-06-29 18:37:09 +00:00
ryannikolaidis
60fe231f08
fix: use api key where needed in tests (#843)
* passes api key for unstructured-api to unit and ingest tests as needed.
* adds check for env var CI to otherwise skip tests that require an api key
2023-06-29 17:31:01 +00:00
Roman Isecke
9882c2b83f
Avoid setting metadata in constructor signature for elements (#837)
Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification).

Bonus refactor for PageBreak to have text values of "".

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-29 03:14:05 +00:00
Matt Robinson
44411ecc59
enhancement: max_partition kwarg for limiting element size (#818)
* add max partition size logic

* work splitting logic into split_by_paragraph

* pass through max_partition to other functions

* added test for splitting long document

* add type hint

* add documentation

* version and changelog

* ingest-test-fixtures-update

* Update ingest test fixtures (#819)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* retrigger ci

* ingest-test-fixtures-update

* ingest-test-fixtures-update

* Update ingest test fixtures (#821)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* update default for partition_xml

* update version for release

* update msg doc string

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
0.7.10
2023-06-28 15:26:01 -04:00
Matt Robinson
38457777fa
fix: ignore escaped commas in CSV checks (#832)
* fix file content checking bug

* skip counting commas in quotes for csv detection

* add test for comma count

* change file content grab to -1

* version and changelog

* add csv to extension check

* add file to tests

* ingest-test-fixtures-update

* Update ingest test fixtures (#833)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix typo

* fix changelog wording

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 17:22:23 +00:00
Matt Robinson
06077b09ee
fix: don't detect line breaks as list items (#831)
* add negative lookahead to bullet pattern

* version and changelog

* update paragraph pattern

* add list item assert
2023-06-28 12:49:12 -04:00
qued
773d9a4f37
feat: choose model (#824)
Added the ability to select the hi_res model via the environment variable UNSTRUCTURED_HI_RES_MODEL_NAME. Variable must be a string that matches up with a model name defined in unstructured_inference.

Also removed code related to old unstructured_inference API which has been removed from currently pinned version of unstructured-inference and is no longer running as a service.
2023-06-28 04:06:08 +00:00
shreyanid
433d6af1bc
fix: format Arabic and Hebrew annotated encodings (#823)
* add modified arabic and hebrew encodings

* added calls to format_encoding_str so encoding is checked before use

* added formatting to detect_filetype()

* explicitly provided default value for null encoding parameter

* fixed format of annotated encodings list

* adding hebrew base64 test file

* small lint fixes

* update changelog

* bump version to -dev2
2023-06-27 18:15:02 -07:00
kravetsmic
58e988e110
feature(html partition): parse pre tag (#642)
* feature(html partition): parse pre tag

* chore: update CHANGELOG.md

* style: black format xml.py

* Added tests dor html with pre tag

* remove skip test, update parse pre tag

* fix style

* chore: spell check

* chore: update changelog & version

* chore: update ingest test fixtures

* chore: add exception handling if `element.text` is `None` in `_read_xml`

* test: add more sanity testing on the `.text` content of the element(s)

* refactor: move the conditional logic for <pre> outside of the `try/except` block

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-06-27 18:52:39 +00:00
ryannikolaidis
078e2aa116
ci: fix arm build issue with docker driver (#810) 2023-06-26 22:39:00 +00:00
ryannikolaidis
a5c7e5b41e
chore: DRY ingest connectors (#769) 2023-06-26 20:12:05 +00:00
Amanda Cameron
95f02f290d
chore: update readme for api keys (#792)
* api announcement

* updating copy

* version bump
0.7.9
2023-06-26 11:56:01 -07:00
ryannikolaidis
7f0f5fab04
ci: fix amd build issue (#804) 2023-06-24 23:46:32 +00:00
MalteHB
030c56fcba
enhancement: better leaf element string check in XML parsing (#734)
* Enhance leaf element string check in XML parsing

* fix is_string check

* changelog and version

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-23 20:44:50 +00:00
Emily Chen
a8a19ceba0
chore: Add --ocr-languages parameter to unstructured ingest (#793) 2023-06-23 12:38:33 -07:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
little_huang
5320aa681f
docs: fix indentation (#802)
Co-authored-by: 黄宝成 <huangbc@publink.cn>
2023-06-23 09:50:31 -05:00
Christine Straub
5f5da65e0b
Fix/handle-spooled-temp-file-eml (#800)
This PR is for the unstructured-api smoke tests pass.
0.7.8
2023-06-22 19:21:28 -07:00
Matt Robinson
901ef16835
fix: allow partition_email to process emails with no content (#797)
* version and changelog

* ingest-test-fixtures-update
2023-06-22 12:52:27 -04:00
Matt Robinson
8683e2695c
fix: enable partition_pdf to recursively grab text with fast strategy (#796)
* initial pass on text in figures

* refactor text extraction

* update tests

* fix title test

* add test for docs that require recursive text grab

* version and changelog

* ingest-test-fixtures-update

* there are 8 pdf files now
2023-06-22 11:19:54 -04:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
shreyanid
21c346dab8
broken file link in quick start sample code (#789) 2023-06-21 13:39:10 -07:00
Roman Isecke
61ea00a06f
Update Dockerfile to use multistage build and cache layers (#785)
* Update Dockerfile to use multistage build and cache layers

* Fix Dockerfile
2023-06-21 13:12:45 -04:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779) 2023-06-20 16:00:55 -07:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally dont assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
0.7.7
2023-06-20 17:07:30 +00:00
Matt Robinson
feaf1cb4df
fix: check for xml attribute when identifying pagebreaks (#778) 2023-06-20 12:44:00 -04:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
Christine Straub
743482b6d3
Bug/635 unicode decode error eml (#739)
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
   common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
2023-06-17 00:52:13 +00:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
cragwolfe
68f04159bc
chore: rm old detectron2 install from makefile (#767)
* chore: remove vestigal Makefile target and tensorboard
2023-06-16 10:05:36 -07:00