485 Commits

Author SHA1 Message Date
Matt Robinson
38457777fa
fix: ignore escaped commas in CSV checks (#832)
* fix file content checking bug

* skip counting commas in quotes for csv detection

* add test for comma count

* change file content grab to -1

* version and changelog

* add csv to extension check

* add file to tests

* ingest-test-fixtures-update

* Update ingest test fixtures (#833)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix typo

* fix changelog wording

---------

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-28 17:22:23 +00:00
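Note on "skip counting commas in quotes for csv detection": the idea can be sketched roughly as below. This is a minimal illustration, not the code from the commit; the function names `count_unquoted_commas` and `looks_like_csv`, and the rule that every non-empty line must have the same nonzero comma count, are assumptions made for the example.

```python
from typing import List


def count_unquoted_commas(line: str, quotechar: str = '"') -> int:
    """Count commas that sit outside quoted fields (hypothetical helper)."""
    in_quotes = False
    count = 0
    for ch in line:
        if ch == quotechar:
            # Entering or leaving a quoted field; a doubled quote toggles twice.
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            count += 1
    return count


def looks_like_csv(text: str) -> bool:
    """Assumed heuristic: all non-empty lines share one nonzero comma count."""
    lines: List[str] = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return False
    counts = {count_unquoted_commas(ln) for ln in lines}
    return len(counts) == 1 and counts.pop() > 0
```

With this heuristic, a line such as `"x,y",z` counts one delimiter rather than two, so prose containing quoted commas is less likely to skew the check.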
Matt Robinson
06077b09ee
fix: don't detect line breaks as list items (#831)
* add negative lookahead to bullet pattern

* version and changelog

* update paragraph pattern

* add list item assert
2023-06-28 12:49:12 -04:00
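Note on "add negative lookahead to bullet pattern": the sketch below shows how a negative lookahead can keep a run of dashes from matching as a list item. The pattern and the name `is_list_item` are hypothetical and are not the library's actual bullet pattern.

```python
import re

# Hypothetical pattern: a bullet character that is NOT immediately followed
# by another bullet character, then optional whitespace and some content.
BULLET_RE = re.compile(r"^\s*[-*•](?![-*•])\s*\S")


def is_list_item(line: str) -> bool:
    """Return True if the line starts like a bulleted list item."""
    return BULLET_RE.match(line) is not None
```

Under this assumed pattern, `- first item` matches, while a separator such as `-----` does not, because the lookahead rejects a bullet character followed by another one.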
qued
773d9a4f37
feat: choose model (#824)
Added the ability to select the hi_res model via the environment variable UNSTRUCTURED_HI_RES_MODEL_NAME. Variable must be a string that matches up with a model name defined in unstructured_inference.

Also removed code related to the old unstructured_inference API, which has been removed from the currently pinned version of unstructured-inference and is no longer running as a service.
2023-06-28 04:06:08 +00:00
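Note on UNSTRUCTURED_HI_RES_MODEL_NAME: reading the variable could look roughly like the sketch below. The helper name `hi_res_model_name` and the fallback behaviour are assumptions for illustration; only the environment variable name comes from the commit message.

```python
import os
from typing import Optional


def hi_res_model_name(default: Optional[str] = None) -> Optional[str]:
    """Return the model name from the environment, or `default` if unset."""
    return os.environ.get("UNSTRUCTURED_HI_RES_MODEL_NAME", default)
```

The value would then need to match a model name defined in unstructured_inference, as the commit message states.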
shreyanid
433d6af1bc
fix: format Arabic and Hebrew annotated encodings (#823)
* add modified arabic and hebrew encodings

* added calls to format_encoding_str so encoding is checked before use

* added formatting to detect_filetype()

* explicitly provided default value for null encoding parameter

* fixed format of annotated encodings list

* adding hebrew base64 test file

* small lint fixes

* update changelog

* bump version to -dev2
2023-06-27 18:15:02 -07:00
kravetsmic
58e988e110
feature(html partition): parse pre tag (#642)
* feature(html partition): parse pre tag

* chore: update CHANGELOG.md

* style: black format xml.py

* Added tests for html with pre tag

* remove skip test, update parse pre tag

* fix style

* chore: spell check

* chore: update changelog & version

* chore: update ingest test fixtures

* chore: add exception handling if `element.text` is `None` in `_read_xml`

* test: add more sanity testing on the `.text` content of the element(s)

* refactor: move the conditional logic for <pre> outside of the `try/except` block

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
2023-06-27 18:52:39 +00:00
ryannikolaidis
078e2aa116
ci: fix arm build issue with docker driver (#810) 2023-06-26 22:39:00 +00:00
ryannikolaidis
a5c7e5b41e
chore: DRY ingest connectors (#769) 2023-06-26 20:12:05 +00:00
Amanda Cameron
95f02f290d
chore: update readme for api keys (#792)
* api announcement

* updating copy

* version bump
0.7.9
2023-06-26 11:56:01 -07:00
ryannikolaidis
7f0f5fab04
ci: fix amd build issue (#804) 2023-06-24 23:46:32 +00:00
MalteHB
030c56fcba
enhancement: better leaf element string check in XML parsing (#734)
* Enhance leaf element string check in XML parsing

* fix is_string check

* changelog and version

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-23 20:44:50 +00:00
Emily Chen
a8a19ceba0
chore: Add --ocr-languages parameter to unstructured ingest (#793) 2023-06-23 12:38:33 -07:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
little_huang
5320aa681f
docs: fix indentation (#802)
Co-authored-by: 黄宝成 <huangbc@publink.cn>
2023-06-23 09:50:31 -05:00
Christine Straub
5f5da65e0b
Fix/handle-spooled-temp-file-eml (#800)
This PR makes the unstructured-api smoke tests pass.
0.7.8
2023-06-22 19:21:28 -07:00
Matt Robinson
901ef16835
fix: allow partition_email to process emails with no content (#797)
* version and changelog

* ingest-test-fixtures-update
2023-06-22 12:52:27 -04:00
Matt Robinson
8683e2695c
fix: enable partition_pdf to recursively grab text with fast strategy (#796)
* initial pass on text in figures

* refactor text extraction

* update tests

* fix title test

* add test for docs that require recursive text grab

* version and changelog

* ingest-test-fixtures-update

* there are 8 pdf files now
2023-06-22 11:19:54 -04:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
shreyanid
21c346dab8
broken file link in quick start sample code (#789) 2023-06-21 13:39:10 -07:00
Roman Isecke
61ea00a06f
Update Dockerfile to use multistage build and cache layers (#785)
* Update Dockerfile to use multistage build and cache layers

* Fix Dockerfile
2023-06-21 13:12:45 -04:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779) 2023-06-20 16:00:55 -07:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally don't assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
0.7.7
2023-06-20 17:07:30 +00:00
Matt Robinson
feaf1cb4df
fix: check for xml attribute when identifying pagebreaks (#778) 2023-06-20 12:44:00 -04:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
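Note on coordinate systems: the sketch below illustrates the general idea of tracking the system coordinates live in and converting between systems. `SimpleCoordinateSystem` and its fields are hypothetical and do not reproduce the CoordinateSystem class added in the commit.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimpleCoordinateSystem:
    """Hypothetical stand-in: a rectangle with a size and a y-axis direction."""

    width: float
    height: float
    y_up: bool  # True if y grows upward (origin at the bottom-left)

    def convert_to(
        self, other: "SimpleCoordinateSystem", x: float, y: float
    ) -> Tuple[float, float]:
        """Map a point from this system into `other`."""
        # Normalize to the unit square, flip y if the axes disagree, rescale.
        x_rel = x / self.width
        y_rel = y / self.height
        if self.y_up != other.y_up:
            y_rel = 1.0 - y_rel
        return x_rel * other.width, y_rel * other.height
```

For example, a point at the top-left of a 612 x 792 page with y pointing up maps to (0, 0) in a pixel grid whose y axis points down.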
Christine Straub
743482b6d3
Bug/635 unicode decode error eml (#739)
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
2023-06-17 00:52:13 +00:00
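Note on charset handling for eml files: the two ideas in this commit (read the declared charset, fall back to a common encoding on a decode error) can be sketched with the standard library as below. The function names and the list of fallback encodings are assumptions, not the commit's implementation.

```python
from email import message_from_bytes
from typing import List, Sequence


def charsets_in_eml(raw: bytes) -> List[str]:
    """Collect the charsets declared in the Content-Type headers of an email."""
    msg = message_from_bytes(raw)
    found: List[str] = []
    for part in msg.walk():
        charset = part.get_content_charset()  # lowercased, or None if absent
        if charset and charset not in found:
            found.append(charset)
    return found


def decode_with_fallback(
    data: bytes, declared: str, fallbacks: Sequence[str] = ("utf-8", "latin-1")
) -> str:
    """Try the declared encoding first, then a few common ones (assumed list)."""
    for encoding in (declared, *fallbacks):
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not recognize.
            continue
    return data.decode("utf-8", errors="replace")
```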
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed to quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
cragwolfe
68f04159bc
chore: rm old detectron2 install from makefile (#767)
* chore: remove vestigial Makefile target and tensorboard
2023-06-16 10:05:36 -07:00
ryannikolaidis
4faa27ffe7
test: add google drive ingest test (#764) 2023-06-16 16:28:24 +00:00
Yuming Long
a611532e3c
Chore: convert fast strategy to ocr_only for images (#735)
* fall back to ocr only

* more note

* add test case

* maybe remove skipping dockertest for kor ocr?

* bump again

* clean up flag

* empty commit
0.7.6
2023-06-16 10:59:13 -04:00
Matt Robinson
4ea716837d
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata

* fix typing for regex metadata

* add dataclass back in

* add decorators

* fix tests

* update docs

* add tests for regex metadata

* add process metadata to tsv

* changelog and version

* docs typos

* consolidate to using a single kwarg

* fix test
2023-06-16 10:10:56 -04:00
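Note on regex metadata: one way to picture the feature is a mapping from metadata names to patterns, with every match recorded alongside its position. The function name `extract_regex_metadata` and the shape of the result are assumptions for this sketch and may differ from what the commit added.

```python
import re
from typing import Dict, List, Union

Match = Dict[str, Union[str, int]]


def extract_regex_metadata(
    text: str, patterns: Dict[str, str]
) -> Dict[str, List[Match]]:
    """For each named pattern, record the matched text and its span."""
    results: Dict[str, List[Match]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
    return results
```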
Angus Sinclair
ec403e245c
fix malformed pptx issue (#761)
* fix malformed pptx issue

Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions.

* update changelog

* fix lint issue SIM102 nested ifs

* fix black linting
2023-06-15 19:52:44 +00:00
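Note on shapes without positions: the handling described above can be sketched as below. `Shape`, `is_on_slide`, and `order_shapes` are hypothetical stand-ins, and treating a negative `top` or `left` as "off the slide" is an assumption for illustration, not the criterion used in `partition_pptx`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; positions may be missing."""

    name: str
    top: Optional[int] = None
    left: Optional[int] = None


def is_on_slide(shape: Shape) -> bool:
    """Only judge shapes that have both positions; keep the rest."""
    if shape.top is None or shape.left is None:
        return True
    return shape.top >= 0 and shape.left >= 0  # assumed criterion


def order_shapes(shapes: List[Shape]) -> List[Shape]:
    """Sort positioned shapes top-to-bottom, left-to-right; append the rest."""
    positioned = [s for s in shapes if s.top is not None and s.left is not None]
    unpositioned = [s for s in shapes if s.top is None or s.left is None]
    return sorted(positioned, key=lambda s: (s.top, s.left)) + unpositioned
```

Separating the unpositioned shapes avoids comparing `None` with integers during the sort, which would raise a `TypeError`.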
Yuming Long
5bf78c077d
Fix: remove fake api key in test (#762)
* no fake api key

* changelog and version

* remove kwarg since we have default
2023-06-15 19:18:22 +00:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
075bf0bdba
fix test that requires api key 2023-06-15 14:34:57 -04:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
kravetsmic
7fd7d7afae
feat(biomed connector): added additional params (#468) (#623)
Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay.

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-15 01:57:45 -07:00
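Note on retries with backoff: a generic retry loop with exponential backoff is sketched below. The function name, parameter names, and default values are assumptions for illustration and are not the biomed connector's actual options.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on any exception with growing, capped delays."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # out of retries: surface the last error
            # Delay grows as base_delay * factor**attempt, capped at max_delay.
            sleep(min(base_delay * factor**attempt, max_delay))
            attempt += 1
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.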
Matt Robinson
e0c477de68
docs: update slack invite link (#749) 2023-06-14 10:06:45 -04:00
Matt Robinson
053a6c6e5c
enhancement: extract headers and footers in partition_docx (#742)
* added tests for headers and footers

* add docs on headers and footers; tweak to metadata

* version and changelog
2023-06-14 09:42:59 -04:00
cragwolfe
3fe7e1b6ca
fix: pdf2image library is core requirement (#745)
0.7.5
2023-06-13 23:04:41 -07:00
kravetsmic
8258dbb25f
feat: add --api-key parameter to unstructured-ingest (#644) 2023-06-14 05:05:18 +00:00
ryannikolaidis
9443bd40e2
ci: add set up python to test_unit job (#743) 2023-06-14 01:50:37 +00:00
ryannikolaidis
9d3f7183fd
ci: add cache version to ingest-test-fixtures-update-pr workflow (#737) 2023-06-13 18:15:35 -07:00
ryannikolaidis
a753370dc7
ci: update ingest fixtures from gh workflow (#702) 2023-06-13 10:27:32 -07:00
fran-unstructured
a313c02f69
docs: sort functions in bricks.rst in alphabetical order v2 (#728)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-06-12 18:22:23 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Yuming Long
2fbb1ccd30
Chore(ingest): add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Process files with the hi_res strategy and preserve the downloads in the repo under files-ingest-download/<ingest_test_name>
* Reprocess all PDFs from the local files-ingest-download directory with the fast strategy; the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce the tests with ./scripts/ingest-test-fixtures-update.sh; no updates are expected. No secret tokens are needed, since the relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
Matt Robinson
3f80301964
fix: handling for emails without datetimes (#724)
* add empty filetype

* add empty handling to partition

* changelog and version

* handling for when there is no datetime

* changelog and version
2023-06-12 17:11:04 +00:00
Yuming Long
b354e8eec6
Chore: Allow passing kwargs to request data field (#716)
* bump again :(

* update to kwarg

* add test case

* rename to request_kwargs

* remove install detectron2

* pip compile

* add changelog for remove detectron2 install

* resolve weaviate import issue on python 3.9
0.7.4
2023-06-12 12:39:58 -04:00
John
fc53277826
fix: Enable MIME type detection if libmagic is not available (#714)
* fix: Add filetype check if libmagic unavailable

* make tidy

* make check

* fix: change mime_type error to warning

* Update changelog and __version__

* fix: Add filetype to requirements
2023-06-09 17:06:21 -04:00
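Note on MIME type detection without libmagic: the fallback pattern can be sketched as below. The commit adds the `filetype` package for this; the sketch swaps in the standard-library `mimetypes` module (which guesses from the file extension only) to stay self-contained, and the name `guess_mime_type` is hypothetical.

```python
import mimetypes
from typing import Optional

try:
    import magic  # python-magic; the import fails if libmagic is missing

    LIBMAGIC_AVAILABLE = True
except ImportError:
    LIBMAGIC_AVAILABLE = False


def guess_mime_type(
    filename: str, use_magic: bool = LIBMAGIC_AVAILABLE
) -> Optional[str]:
    """Use libmagic when available, else fall back to an extension lookup."""
    if use_magic:
        return magic.from_file(filename, mime=True)
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type
```

An extension lookup is weaker than content sniffing, which is consistent with the commit downgrading the missing-MIME-type error to a warning.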
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition (#710)
* add empty filetype

* add empty handling to partition

* changelog and version
2023-06-09 16:07:50 -04:00
Yuming Long
80f0b4a132
Fix: Pass strategy parameter down from partition for partition_image (#708)
* changelog and version

* passing param down

* test should be auto

* doc nit

* lint

* update image output
2023-06-09 13:54:18 -04:00