619 Commits

Author SHA1 Message Date
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
shreyanid
21c346dab8
broken file link in quick start sample code (#789) 2023-06-21 13:39:10 -07:00
Roman Isecke
61ea00a06f
Update Dockerfile to use multistage build and cache layers (#785)
* Update Dockerfile to use multistage build and cache layers

* Fix Dockerfile
2023-06-21 13:12:45 -04:00
ryannikolaidis
e08936b6fb
chore: update all bash scripts to use shebang: /usr/bin/env bash (#779) 2023-06-20 16:00:55 -07:00
Matt Robinson
c53ce117bc
fix: enable partition_html to grab content outside of <article> tags (#772)
* optionally don't assemble articles

* add test for content outside of articles

* pass kwargs in partition

* changelog and version

* update default to False

* bump version for release

* back to dev version to get another fix in the release
0.7.7
2023-06-20 17:07:30 +00:00
Matt Robinson
feaf1cb4df
fix: check for xml attribute when identifying pagebreaks (#778) 2023-06-20 12:44:00 -04:00
qued
db4c5dfdf7
feat: coordinate systems (#774)
Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.
2023-06-20 11:19:55 -05:00
Christine Straub
743482b6d3
Bug/635 unicode decode error eml (#739)
* Adds functionality to extract charset info from eml files
* Adds missed file-like object handling in detect_file_encoding
* Adds functionality to replace the MIME encodings for eml files with one of the
   common encodings if a unicode error occurs
* Organize the eml example files in the example-docs/eml directory
2023-06-17 00:52:13 +00:00
cragwolfe
2989f53358
chore: bump to python 3.8.17 (#766)
The images pushed to quay.io will now have Python 3.8.17 rather than Python 3.8.15.
2023-06-16 11:17:03 -07:00
cragwolfe
68f04159bc
chore: rm old detectron2 install from makefile (#767)
* chore: remove vestigial Makefile target and tensorboard
2023-06-16 10:05:36 -07:00
ryannikolaidis
4faa27ffe7
test: add google drive ingest test (#764) 2023-06-16 16:28:24 +00:00
Yuming Long
a611532e3c
Chore: convert fast strategy to ocr_only for images (#735)
* fall back to ocr only

* more note

* add test case

* maybe remove skipping dockertest for kor ocr?

* bump again

* clean up flag

* empty commit
0.7.6
2023-06-16 10:59:13 -04:00
Matt Robinson
4ea716837d
feat: add ability to extract extra metadata with regex (#763)
* first pass on regex metadata

* fix typing for regex metadata

* add dataclass back in

* add decorators

* fix tests

* update docs

* add tests for regex metadata

* add process metadata to tsv

* changelog and version

* docs typos

* consolidate to using a single kwarg

* fix test
2023-06-16 10:10:56 -04:00
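The regex metadata feature in #763 can be pictured with a small standard-library sketch. This is not the library's implementation: the function name `extract_regex_metadata`, the shape of the returned dictionaries, and the sample patterns are all assumptions made for illustration.

```python
import re
from typing import Dict, List


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[dict]]:
    """Return every match of each named pattern, with its position in the text.

    Hypothetical helper: the name and the output shape are illustrative only.
    """
    results: Dict[str, List[dict]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
    return results


# Sample input and patterns are made up for the example.
matches = extract_regex_metadata(
    "Contact: jane@example.com, invoice INV-1042",
    {"email": r"[\w.+-]+@[\w-]+\.[\w.]+", "invoice": r"INV-\d+"},
)
```

Passing all patterns in one mapping mirrors the "single kwarg" idea from the commit bullets, though the real keyword name is not given in this log.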
Angus Sinclair
ec403e245c
fix malformed pptx issue (#761)
* fix malformed pptx issue

Added a new test to check for the ability to partition a malformed PowerPoint file. Modified the `partition_pptx` function to skip processing shapes that are not on the actual slide, but only if they have top and left positions. Also modified the `_order_shapes` function to handle cases where shapes do not have top or left positions.

* update changelog

* fix lint issue SIM102 nested ifs

* fix black linting
2023-06-15 19:52:44 +00:00
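The handling described in #761 might look roughly like the sketch below. It is only an illustration under stated assumptions: the `Shape` dataclass is a stand-in for a real slide shape, "not on the slide" is assumed to mean a negative top or left offset, and placing unpositioned shapes last is a guess rather than the behaviour of `_order_shapes`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; positions may be missing."""
    name: str
    top: Optional[int] = None
    left: Optional[int] = None


def is_on_slide(shape: Shape) -> bool:
    """Treat a shape as off-slide only when it has a position and it is negative."""
    if shape.top is None or shape.left is None:
        return True  # no position to judge by, so keep the shape
    return shape.top >= 0 and shape.left >= 0


def order_shapes(shapes: List[Shape]) -> List[Shape]:
    """Sort positioned shapes top-to-bottom, left-to-right; unpositioned ones go last."""
    positioned = [s for s in shapes if s.top is not None and s.left is not None]
    unpositioned = [s for s in shapes if s.top is None or s.left is None]
    return sorted(positioned, key=lambda s: (s.top, s.left)) + unpositioned


shapes = [Shape("body", 10, 5), Shape("no-position"), Shape("title", 0, 0), Shape("stray", -50, 0)]
ordered = order_shapes([s for s in shapes if is_on_slide(s)])
```

Splitting the list before sorting avoids comparing `None` with integers, which would raise a `TypeError` in Python 3.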
Yuming Long
5bf78c077d
Fix: remove fake api key in test (#762)
* no fake api key

* changelog and version

* remove kwarg since we have default
2023-06-15 19:18:22 +00:00
John
a9b9b873b1
feat: partition_tsv for tab separated value files (#758)
* first pass at partition_tsv

* working tests

* create constants for tests and debug `make test` failure

* make check and tidy

* undo changes for testing locally

* update changelog and version

* fix bricks.rst

* refactor if statements

* make tidy

* fix README and change try/except to if/else

* update changelog and version

* fix docstring
2023-06-15 18:50:53 +00:00
Matt Robinson
075bf0bdba
fix test that requires api key 2023-06-15 14:34:57 -04:00
Matt Robinson
a800967478
enhancements: add page numbers for word docs when available (#750)
* add support for page numbers in docx when present

* version and changelog

* add comment on page numbers

* add header and footer to doc elements list

* update integrations docs

* include_page_breaks kwarg for doc and docx

* merge element metadata for pagebreaks

* fix typo

* fix changelog typo

* change page number default to None

* add initial_page_number kwarg

* make page number tests in pdf more explicit

* revert test file

* update ingest tests

* update test fixture outputs

* updates to IRS forms fixtures

* ingest-test-fixtures-update

* Update ingest test fixtures (#759)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-06-15 12:21:17 -04:00
kravetsmic
7fd7d7afae
feat(biomed connector): added additional params (#468) (#623)
Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay.

---------

Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-06-15 01:57:45 -07:00
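Retrying with backoff, as mentioned in #623, is commonly written along the lines below. This is a generic sketch, not the connector's code: the function name `retry_with_backoff`, its parameters, and the default values are assumptions, and the injectable `sleep` argument exists only so the example can run without waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponentially growing delays."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delay)
            delay *= factor
    # Only reached when max_retries is negative and the loop never runs.
    raise ValueError("max_retries must be non-negative")


# Demo: a call that fails twice before succeeding; delays are recorded, not slept.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


recorded_delays: List[float] = []
result = retry_with_backoff(flaky, sleep=recorded_delays.append)
```

A production version would usually catch only network-related exceptions and cap the total elapsed time, which is what the "max request time" wording in the commit suggests.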
Matt Robinson
e0c477de68
docs: update slack invite link (#749) 2023-06-14 10:06:45 -04:00
Matt Robinson
053a6c6e5c
enhancement: extract headers and footers in partition_docx (#742)
* added tests for headers and footers

* add docs on headers and footers; tweak to metadata

* version and changelog
2023-06-14 09:42:59 -04:00
cragwolfe
3fe7e1b6ca
fix: pdf2image library is core requirement (#745) 0.7.5 2023-06-13 23:04:41 -07:00
kravetsmic
8258dbb25f
feat: add --api-key parameter to unstructured-ingest (#644) 2023-06-14 05:05:18 +00:00
ryannikolaidis
9443bd40e2
ci: add set up python to test_unit job (#743) 2023-06-14 01:50:37 +00:00
ryannikolaidis
9d3f7183fd
ci: add cache version to ingest-test-fixtures-update-pr workflow (#737) 2023-06-13 18:15:35 -07:00
ryannikolaidis
a753370dc7
ci: update ingest fixtures from gh workflow (#702) 2023-06-13 10:27:32 -07:00
fran-unstructured
a313c02f69
docs: sort functions in bricks.rst in alphabetical order v2 (#728)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-06-12 18:22:23 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents (#725)
* add example rst file

* filetype detection for rst files

* add partition_rst function

* add partition_rst to auto

* update readme

* update docs

* changelog and version

* pandocs -> pandoc

* fix typo
2023-06-12 19:31:10 +00:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy (#614)
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of elements extracted
* Added PDF ingest tests with the fast strategy, using the new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh

Updated ingest tests procedure:

* Process files with the hi_res strategy and preserve the downloads in the repo at files-ingest-download/<ingest_test_name>
* Reprocess all PDFs with the fast strategy from the local files-ingest-download directory; the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce the tests with ./scripts/ingest-test-fixtures-update.sh; no fixture updates are expected. No secret tokens are needed, since the relevant tests won't produce PDFs.
2023-06-12 19:02:48 +00:00
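Ordering extracted elements by their X,Y coordinates, as #614 describes, gives a result that does not depend on the order in which a parser happens to yield them. The sketch below assumes a hypothetical `Element` tuple with a top-left position and a Y axis that grows downward; the actual sort key used by the library is not stated in this log.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    """Hypothetical element carrying the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


def order_by_position(elements: List[Element]) -> List[Element]:
    """Order elements top-to-bottom, then left-to-right (assumes Y grows downward)."""
    return sorted(elements, key=lambda e: (e.y, e.x))


page = [
    Element("right column", 300.0, 50.0),
    Element("footer", 20.0, 700.0),
    Element("left column", 20.0, 50.0),
]
reading_order = [e.text for e in order_by_position(page)]
```

Because `sorted` is deterministic for a given key, the same input yields the same output on every Python version.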
Matt Robinson
3f80301964
fix: handling for emails without datetimes (#724)
* add empty filetype

* add empty handling to partition

* changelog and version

* handling for when there is no datetime

* changelog and version
2023-06-12 17:11:04 +00:00
Yuming Long
b354e8eec6
Chore: Allow passing kwargs to request data field (#716)
* bump again :(

* update to kwarg

* add test case

* rename to request_kwargs

* remove install detectron2

* pip compile

* add changelog for remove detectron2 install

* resolve weaviate import issue on python 3.9
0.7.4
2023-06-12 12:39:58 -04:00
John
fc53277826
fix: Enable MIME type detection if libmagic is not available (#714)
* fix: Add filetype check if libmagic unavailable

* make tidy

* make check

* fix: change mime_type error to warning

* Update changelog and __version__

* fix: Add filetype to requirements
2023-06-09 17:06:21 -04:00
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition (#710)
* add empty filetype

* add empty handling to partition

* changelog and version
2023-06-09 16:07:50 -04:00
Yuming Long
80f0b4a132
Fix: Pass strategy parameter down from partition for partition_image (#708)
* changelog and version

* passing param down

* test should be auto

* doc nit

* lint

* update image output
2023-06-09 13:54:18 -04:00
Matt Robinson
0289ca3ea7
fix: handle encoding for text file checks (#707)
* fixed encoding issue for _is_text_file_a_json

* changelog and version
2023-06-09 11:08:16 -04:00
John
b2b92ea79d
fix: filetype detection if a CSV has a text/plain MIME type (#691)
* fix: Filetype detection if a CSV has a text/plain MIME type #621

* bug: fix csv detection and create _read_file_start_for_type_check func

* fix: Make call to _is_text_file_a_csv from detect_filetype
2023-06-08 16:21:07 -04:00
Matt Robinson
c1ba090c34
fix: suppress file conversion warnings in convert_office_doc (#703)
* test that output is suppressed

* add test for error output

* changelog and version
2023-06-08 12:33:06 -04:00
dependabot[bot]
559a5578ba
build(deps): bump label-studio-sdk in /requirements (#701)
Bumps [label-studio-sdk](https://github.com/heartexlabs/label-studio-sdk) from 0.0.27 to 0.0.28.
- [Commits](https://github.com/heartexlabs/label-studio-sdk/commits)

---
updated-dependencies:
- dependency-name: label-studio-sdk
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 11:41:40 -04:00
dependabot[bot]
8aa87fc3b7
build(deps): bump ruff from 0.0.270 to 0.0.272 in /requirements (#699)
Bumps [ruff](https://github.com/charliermarsh/ruff) from 0.0.270 to 0.0.272.
- [Release notes](https://github.com/charliermarsh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/charliermarsh/ruff/compare/v0.0.270...v0.0.272)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 09:40:17 -04:00
dependabot[bot]
f681de8d74
build(deps): bump sphinx-rtd-theme from 1.2.1 to 1.2.2 in /requirements (#698)
Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.1 to 1.2.2.
- [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst)
- [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.1...1.2.2)

---
updated-dependencies:
- dependency-name: sphinx-rtd-theme
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-08 09:39:35 -04:00
Matt Robinson
aa4d4329db
fix: partition_via_api reflects actual filetype in metadata (#696)
* fix: `partition_via_api` reflects actual filetype in metadata

* added in list length check

* changelog typo
2023-06-08 13:24:16 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars (#697) 2023-06-08 04:48:49 +00:00
ryannikolaidis
2094b976cf
feat: adds data_source metadata to ElementMetadata (#690) 2023-06-07 21:22:18 -07:00
Matt Robinson
6bc116887f
enhancement: add encoding to elements_to_json and elements_from_json (#694)
* add encoding to elements_to_json and elements_from_json

* version and changelog

* add new test

* fix version

* revert test file

* blank line to test

* no blank line
0.7.2
2023-06-07 13:20:06 -04:00
Matt Robinson
c6dc466e79
docs: update capabilities table; fix mistake in para grouping docs (#683)
* docs: update capabilities table with rtf/md/epub tables

* fix regex in docs

* revert bricks update

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-06-06 18:29:56 +00:00
Yuming Long
533689196b
Chore: bump base image to update tesseract version (#680)
* dockerfile

* changelog version

* version bump
2023-06-06 17:01:16 +00:00
kravetsmic
7df31ead75
feat: if no params show help (#649)
* feat: if no params show help

* Remove comments

* feat: update checking params

* updated main script and changelog

* version bump

---------

Co-authored-by: yuming <305248291@qq.com>
2023-06-06 16:25:44 +00:00
ryannikolaidis
29f0deda63
test: revive ingest unit tests (#688) 2023-06-06 09:03:13 -07:00
Sebastian Laverde Alfonso
508ce48d54
Feat: notebook for Elasticsearch integration (#681)
* feat: nb elasticsearch unstructured sentiment

* chore: refactor readme for elasticsearch nb

* fix: update es-credentials.ini

* chore: update es-credentials.ini

* fix: typo in nb load-into-es.ipynb

exist --> exists

* fix: typo 2 in nb load-into-es.ipynb

obtaing --> obtain
2023-06-05 19:05:08 +00:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
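The fallback behaviour described in #660 (try other common encodings only when the user has not specified one) can be sketched as follows. The function name `decode_with_fallback` and the candidate list are assumptions; the encodings the library actually tries are not listed in this log.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; latin-1 comes last because it accepts any byte sequence.
COMMON_ENCODINGS = ("utf-8", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    candidates: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Decode bytes, returning the text and the encoding that worked."""
    if encoding is not None:
        # An explicit encoding is respected; decoding errors propagate to the caller.
        return data.decode(encoding), encoding
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeError:
            continue  # try the next candidate
    raise UnicodeError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback("caf\u00e9".encode("latin-1"))
```

A `None` default for `encoding` is what lets the function tell "not specified" apart from an explicit `"utf-8"`, which matches the change to auto.py noted in the commit.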