929 Commits

Author SHA1 Message Date
ryannikolaidis
a5c7e5b41e
chore: DRY ingest connectors (#769) 2023-06-26 20:12:05 +00:00
Amanda Cameron
95f02f290d
chore: update readme for api keys (#792)
* api announcement

* updating copy

* version bump
0.7.9
2023-06-26 11:56:01 -07:00
ryannikolaidis
7f0f5fab04
ci: fix amd build issue (#804) 2023-06-24 23:46:32 +00:00
MalteHB
030c56fcba
enhancement: better leaf element string check in XML parsing (#734)
* Enhance leaf element string check in XML parsing

* fix is_string check

* changelog and version

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-23 20:44:50 +00:00
Emily Chen
a8a19ceba0
chore: Add --ocr-languages parameter to unstructured ingest (#793) 2023-06-23 12:38:33 -07:00
Martin Mauch
752e78e803
feat: partition_org for Org Mode documents (#780)
* feat: partition_org for Org Mode documents

* update version
2023-06-23 18:45:31 +00:00
little_huang
5320aa681f
docs: fix indentation (#802)
Co-authored-by: 黄宝成 <huangbc@publink.cn>
2023-06-23 09:50:31 -05:00
Christine Straub
5f5da65e0b
Fix/handle-spooled-temp-file-eml (#800)
This PR is for the unstructured-api smoke tests pass.
0.7.8
2023-06-22 19:21:28 -07:00
Matt Robinson
901ef16835
fix: allow partition_email to process emails with no content (#797)
* version and changelog

* ingest-test-fixtures-update
2023-06-22 12:52:27 -04:00
Matt Robinson
8683e2695c
fix: enable partition_pdf to recursively grab text with fast strategy (#796)
* initial pass on text in figures

* refactor text extraction

* update tests

* fix title test

* add test for docs that require recursive text grab

* version and changelog

* ingest-test-fixtures-update

* there are 8 pdf files now
2023-06-22 11:19:54 -04:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
shreyanid
21c346dab8
broken file link in quick start sample code (#789) 2023-06-21 13:39:10 -07:00
Roman Isecke
61ea00a06f
Update Dockerfile to use multistage build and cache layers (#785)
* Update Dockerfile to use multistage build and cache layers

* Fix Dockerfile
2023-06-21 13:12:45 -04:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779) 2023-06-20 16:00:55 -07:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally dont assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
0.7.7
2023-06-20 17:07:30 +00:00
Matt Robinson
feaf1cb4df
fix: check for xml attribute when identifying pagebreaks (#778) 2023-06-20 12:44:00 -04:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
Christine Straub
743482b6d3
Bug/635 unicode decode error eml (#739)
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
   common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
2023-06-17 00:52:13 +00:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed quay.io will now have python 3.8.17 rather than python 3.8.15.
2023-06-16 11:17:03 -07:00
cragwolfe
68f04159bc
chore: rm old detectron2 install from makefile (#767)
* chore: remove vestigal Makefile target and tensorboard
2023-06-16 10:05:36 -07:00
ryannikolaidis
4faa27ffe7
test: add google drive ingest test (#764) 2023-06-16 16:28:24 +00:00
Yuming Long
a611532e3c
Chore: convert fast strategy to ocr_only for images (#735)
* fall back to ocr only

* more note

* add test case

* maybe remove skipping dockertest for kor ocr?

* bump again

* clean up flag

* empty commit
0.7.6
2023-06-16 10:59:13 -04:00
Matt Robinson
4ea716837d
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata

* fix typing for regex metadata

* add dataclass back in

* add decorators

* fix tests

* update docs

* add tests for regex metadata

* add process metadata to tsv

* changelog and version

* docs typos

* consolidate to using a single kwarg

* fix test
2023-06-16 10:10:56 -04:00
Angus Sinclair
ec403e245c
fix malformed pptx issue (#761)
* fix malformed pptx issue

Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified `_order_shapes` function to handle cases where shapes do not have top or left positions.

* update changelog

* fix lint issue SIM102 nested ifs

* fix black linting
2023-06-15 19:52:44 +00:00
Yuming Long
5bf78c077d
Fix: remove fake api key in test (#762)
* no fake api key

* changlog and version

* remove kwarg since we have default
2023-06-15 19:18:22 +00:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix\ docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
075bf0bdba fix test that requires api key 2023-06-15 14:34:57 -04:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
kravetsmic
7fd7d7afae
feat(biomed connector): added additional params (#468) (#623)
Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay.

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-15 01:57:45 -07:00
Matt Robinson
e0c477de68
docs: update slack invite link (#749) 2023-06-14 10:06:45 -04:00
Matt Robinson
053a6c6e5c
enhancement: extract headers and footers in partition_docx (#742)
* added tests for headers and footers

* add docs on headers and footers; tweak to metadata

* version and changelog
2023-06-14 09:42:59 -04:00
cragwolfe
3fe7e1b6ca
fix: pdf2image library is core requirement (#745) 0.7.5 2023-06-13 23:04:41 -07:00
kravetsmic
8258dbb25f
feat: add --api-key parameter to unstructured-ingest (#644) 2023-06-14 05:05:18 +00:00
ryannikolaidis
9443bd40e2
ci: add set up python to test_unit job (#743) 2023-06-14 01:50:37 +00:00
ryannikolaidis
9d3f7183fd
ci: add cache version to ingest-test-fixtures-update-pr workflow (#737) 2023-06-13 18:15:35 -07:00
ryannikolaidis
a753370dc7
ci: update ingest fixtures from gh workflow (#702) 2023-06-13 10:27:32 -07:00
fran-unstructured
a313c02f69
docs: sort functions in bricks.rst in alphabetical order v2 (#728)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-06-12 18:22:23 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDFs ingest tests with fast strategy with new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Processing files with hi_res strategy, and preserve downloads to repo files-ingest-download/<ingest_test_name>
* Reprocessing all PDFs with fast strategy from local file files-ingest-download, the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce tests with ./scripts/ingest-test-fixtures-update.sh , should expect no update. Also don't need any secret tokens since relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
Matt Robinson
3f80301964
fix: handling for emails without datetimes (#724)
* add empty filetype

* add empty handling to partition

* changelog and version

* handling for when there is no datetime

* changelog and version
2023-06-12 17:11:04 +00:00
Yuming Long
b354e8eec6
Chore: Allow passing kwargs to request data field (#716)
* bump again :(

* update to kwarg

* add test case

* rename to request_kwargs

* remove install detectron2

* pip compile

* add changelog for remove detectron2 install

* resolve weaviate import issue on python 3.9
0.7.4
2023-06-12 12:39:58 -04:00
John
fc53277826
fix: Enable MIME type detection if libmagic is not available (#714)
* fix: Add filetype check if libmagic unavailable

* make tidy

* make check

* fix: change mime_type error to warning

* Update changelog and __version__

* fix: Add filetype to requirements
2023-06-09 17:06:21 -04:00
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition (#710)
* add empty filetype

* add empty handling to partition

* changelog and version
2023-06-09 16:07:50 -04:00
Yuming Long
80f0b4a132
Fix: Pass strategy parameter down from partition for partition_image (#708)
* changelog and version

* passing param down

* test should be auto

* doc nit

* lint

* update image output
2023-06-09 13:54:18 -04:00
Matt Robinson
0289ca3ea7
fix: handle encoding for text file checks (#707)
* fixed encoding issue for _is_text_file_a_json

* changelog and version
2023-06-09 11:08:16 -04:00
John
b2b92ea79d
fix: filetype detection if a CSV has a text/plain MIME type (#691)
* fix:  Filetype detection if a CSV has a text/plain MIME type #621

* bug: fix csv detection and create _read_file_start_for_type_check func

* fix: Make call to _is_text_file_a_csv from detect_filetype
2023-06-08 16:21:07 -04:00
Matt Robinson
c1ba090c34
fix: suppress file conversion warnings in convert_office_doc (#703)
* test that output is suppressed

* add test for error output

* changelog and version
2023-06-08 12:33:06 -04:00
dependabot[bot]
559a5578ba
build(deps): bump label-studio-sdk in /requirements (#701)
Bumps [label-studio-sdk](https://github.com/heartexlabs/label-studio-sdk) from 0.0.27 to 0.0.28.
- [Commits](https://github.com/heartexlabs/label-studio-sdk/commits)

---
updated-dependencies:
- dependency-name: label-studio-sdk
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 11:41:40 -04:00
dependabot[bot]
8aa87fc3b7
build(deps): bump ruff from 0.0.270 to 0.0.272 in /requirements (#699)
Bumps [ruff](https://github.com/charliermarsh/ruff) from 0.0.270 to 0.0.272.
- [Release notes](https://github.com/charliermarsh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/charliermarsh/ruff/compare/v0.0.270...v0.0.272)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 09:40:17 -04:00
dependabot[bot]
f681de8d74
build(deps): bump sphinx-rtd-theme from 1.2.1 to 1.2.2 in /requirements (#698)
Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.1 to 1.2.2.
- [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst)
- [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.1...1.2.2)

---
updated-dependencies:
- dependency-name: sphinx-rtd-theme
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-08 09:39:35 -04:00