929 Commits

Author SHA1 Message Date
Matt Robinson
b8037118c4
feat: add partition_xlsx for MSFT Excel files (#594)
* first pass on partition_xlsx

* add support for files

* add test for xlsx from filename

* added filetype metadata

* add xlsx to auto

* remove fake excel from unsupported

* version and changelog

* update docs

* update readme

* fix removed file reference

* fix some more tests

* pass in metadata filename

* add include_metadata flag
2023-05-16 19:40:40 +00:00
Trevor Bossert
830d67f653
Feat: Discord connector (#515)
* Initial commit of discord connector

based off of initial work by @tnachen with modifications

https://github.com/tnachen/unstructured/tree/tnachen/discord_connector

* Add test file

change format of imports

* working version of the connector

More work to be done to tidy it up and add any additional options

* add to test fixtures update

* fix spacing

* tests working, switching to bot testing channel

* add additional channel

add reprocess to tests

* add try clause to allow for exit on error

Update changelog and bump version

* add updated expected output files

* add logic to check if --discord-period is an integer

Add more to option description

* fix lint error

* Update discord reqs

* PR feedback

* add newline

* another newline

---------

Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>
2023-05-16 11:46:30 -07:00
Nicolas
c62bee48ad
Update installing.rst (#590) 2023-05-16 02:08:01 +00:00
Matt Robinson
bd6a8a3a40
enhancement: add file_directory to element metadata (#585)
* enhancement: add `file_directory` to element metadata

* update msg test

* exclude file_directory

* update slack output

* added file directory tests on partition_x paths
2023-05-15 18:25:39 -04:00
Yuming Long
33cc3f8637
Fix: support hml filetype in partition as a variation of html (#586)
* quick fix to add hml filetype

* changelog and version
2023-05-15 16:35:53 -04:00
Yuming Long
5b6f11bb88
Chore(ingest): Add --partition-strategy parameter in CLI (#582)
* change strategy arg default to auto in partition

* passing --partition-strategy down

* add strategy="hi_res" to test (default changed)

* made an error on param name, added note
2023-05-15 19:26:53 +00:00
qued
55272eeceb
enhancement: filetype in metadata (#583)
Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.

Tests are added to make sure:

* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning give the desired filetype metadata

Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference.
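The layered behavior described in this commit message can be illustrated with a minimal sketch. It is not the library's actual decorator: `Element`, `add_filetype`, `partition_html_sketch`, and `partition_md_sketch` are hypothetical names invented for illustration. The point it shows is that a wrapper stamps metadata after the wrapped call returns, so the function that first touches the file writes last and its filetype wins.

```python
from dataclasses import dataclass, field
from functools import wraps
from typing import Callable, Dict, List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def add_filetype(filetype: str) -> Callable:
    """Decorator factory: stamp `filetype` on every element a function returns."""

    def decorator(func: Callable[..., List[Element]]) -> Callable[..., List[Element]]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> List[Element]:
            elements = func(*args, **kwargs)
            # This runs after the wrapped call returns, so an outer
            # wrapper overwrites whatever an inner wrapper set.
            for element in elements:
                element.metadata["filetype"] = filetype
            return elements

        return wrapper

    return decorator


@add_filetype("text/html")
def partition_html_sketch(text: str) -> List[Element]:
    # Toy partitioner: one element per non-empty line.
    return [Element(text=line) for line in text.splitlines() if line.strip()]


@add_filetype("text/markdown")
def partition_md_sketch(text: str) -> List[Element]:
    # Assumption for illustration: markdown is handled by delegating
    # to the HTML partitioner after a (skipped) conversion step.
    return partition_html_sketch(text)
```

Calling `partition_md_sketch` yields elements tagged `text/markdown` even though the HTML function handled them internally, while calling `partition_html_sketch` directly yields `text/html`.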
2023-05-15 13:23:19 -05:00
Matt Robinson
99aa346186
fix: make pytesseract a function level import (#581)
* make pytesseract a function level import

* version and changelog

* small docs formatting fix
2023-05-12 17:18:51 -05:00
Matt Robinson
e052c2a9b2
docs: example of how to use unstructured with pgvector (#571)
* pgvector requirements

* first pass on pgvector notebook and sql alchemy file

* created code for loading vectors into db

* added query for embedding distance

* updates to pgvector notebook

* update function with time decay

* update pgvector notebook to use example code

* remove old create table script

* add readme for pgvector

* update example to use get_date()
2023-05-12 13:54:38 -04:00
Matt Robinson
727d366a94
enhancement: auto strategy for PDFs and images (#578)
* added functions for determining auto strategy

* change default strategy to auto

* tests for auto strategy

* update docs

* changelog and version

* bump version

* remove ingest file in wrong location

* update jpg output

* typo fix
0.6.6
2023-05-12 17:45:08 +00:00
Matt Robinson
210e735f6f Revert "bump version for release"
This reverts commit 296959b91e425ad6b99c85c240bdd86ec098ae17.
2023-05-12 11:50:25 -04:00
Matt Robinson
296959b91e bump version for release 2023-05-12 11:33:47 -04:00
Matt Robinson
8da1ddc6ec
enhancement: add method for getting datetime; cleanup filename attribute (#575)
* added method for extracting datetime

* change filename metadata to the base filename

* fix filename metadata for msg

* changelog and bump version

* fix expected structured output

* newline back in file

* reset output file

* update filename output

* update test fixtures

* update fixture
2023-05-12 11:33:01 -04:00
Kevin Pan
7c07b3f690
feat: Read docx tables (#572)
* add table parsing

* import paragraph

* update changelog

* add example docx

* revert changelog formatting

* update function name for consistency

* add both text and html metadata for table

* update with metadata in docx table note

---------

Co-authored-by: kevin pan <kevin.pan@strivr.com>
2023-05-11 18:31:38 +00:00
Matt Robinson
38f7b652de
fix: add handling for non-standard rfc-2822 formats (#564)
* fix: add handling for non-standard rfc-2822 formats

* version and changelog

* linting, linting, linting
2023-05-11 14:36:25 +00:00
Yida Liu
f46eb06e2d
fix: check json and eml decode ignore error (#574) 2023-05-10 22:00:11 -07:00
John
328863375e
fix: include all metadata fields when converting to dataframe or CSV (#568)
* fix: include all metadata fields when converting to dataframe or CSV (#555)

* bump version after merge from main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-10 13:03:33 -04:00
Yuming Long
0f91a9bfa8
Chore: Add a trace logger for NLP output (#561)
* add and config trace logger

* chore: update loggers in partition

* doc: changelog and version

* doc: update changelog

* doc: remove placeholder

* chore: bypass mypy
2023-05-10 16:16:15 +00:00
ryannikolaidis
b52638f8e3
chore: add support for SpooledTemporaryFiles (#569) 0.6.5 2023-05-09 21:39:07 -07:00
Matt Robinson
19beb24e03
docs: unstructured -> MySQL example (#557)
* added requirements for mysql

* first bit of mysql notebook

* update requirements file

* wrap with mysql example

* update readme with install instructions
2023-05-09 13:26:49 +00:00
cragwolfe
aaea6358f6
build(deps): bump pip (#558) 2023-05-08 23:08:10 -07:00
ryannikolaidis
2fc4d37454
chore: pin inference version, bump deps, and update openssl (#551) 2023-05-08 17:02:55 -07:00
Matt Robinson
3d3f3df3ec
enhancement: add "ocr_only" strategy for PDFs (#553)
* add tests for validating strategy

* refactor into determine_pdf_strategy function

* refactor pdf strategies into strategies

* remove commented out code

* remove unreachable code

* add in handling for image types

* a little more refactoring

* import ocr partitioning for images

* catch warnings, partition type for valid strategies

* fallback to ocr_only from fast

* fallback logic for hi_res

* test for fallback to ocr only

* fallback logic for ocr_only

* more tests for fallback logic

* update doc strings

* version and changelog

* linting, linting, linting

* update docs to include notes about strategy

* fix typos

* change back patched filename
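The validation and fallback steps listed above could look roughly like the following sketch. It is not the code from this commit: `choose_strategy`, its boolean parameters, and the fallback order in `FALLBACK_ORDER` are assumptions made for illustration (only the fast to ocr_only fallback is stated in the bullets). The strategy names `hi_res`, `fast`, and `ocr_only` come from the log itself.

```python
from typing import Dict, Tuple

VALID_STRATEGIES: Tuple[str, ...] = ("hi_res", "fast", "ocr_only")

# Assumed preference order when the requested strategy cannot be used.
FALLBACK_ORDER: Dict[str, Tuple[str, ...]] = {
    "fast": ("ocr_only", "hi_res"),
    "hi_res": ("fast", "ocr_only"),
    "ocr_only": ("fast", "hi_res"),
}


def choose_strategy(
    requested: str,
    text_extractable: bool,
    hi_res_available: bool,
    ocr_available: bool,
) -> str:
    """Return the requested strategy, or the first usable fallback."""
    if requested not in VALID_STRATEGIES:
        raise ValueError(
            f"{requested!r} is not a valid strategy; choose from {VALID_STRATEGIES}"
        )

    # Each strategy has one precondition in this simplified model.
    usable = {
        "fast": text_extractable,      # needs embedded, extractable text
        "hi_res": hi_res_available,    # needs the layout-model dependencies
        "ocr_only": ocr_available,     # needs an OCR engine installed
    }
    if usable[requested]:
        return requested

    for candidate in FALLBACK_ORDER[requested]:
        if usable[candidate]:
            return candidate
    raise ValueError("no usable strategy for this document")
```

For example, requesting `fast` for a PDF with no extractable text falls back to `ocr_only` when an OCR engine is available.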
0.6.4
2023-05-08 17:21:24 +00:00
Trevor Bossert
1ac72c6ee8
Fixes issue where detectron2 would not install on OSX (#552)
* Fixes issue where detectron2 would not install on OSX

Tested on an Apple silicon MacBook Pro. This installs tensorboard, which detectron2 requires on OSX and ARM-based CPUs.

* Improve Arch detection for tensorboard

* remove makefile from commands in readme

pin tensorboard version
2023-05-05 17:16:28 -07:00
Matt Robinson
0fc0571c02
fix(ci): don't skip deploy for tags (#549) 2023-05-05 09:51:41 -04:00
Matt Robinson
392cccdbf7
enhancement: add ocr_only strategy for partition_image (#540)
* spike for ocr-only strategy for images

* fix for file processing

* extra space

* add korean to ci

* added test for ocr_only strategy

* added docs for ocr_only

* changelog and version

* added test for bad strategy

* skip korean test if in docker

* bump version

* version bump

* document valid strategies

* bump version for release

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
0.6.3
2023-05-04 20:23:51 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Matt Robinson
981805e435
feat: stage_for_baseplate function (#546)
* added a staging brick for baseplate

* added a test for baseplate

* update documentation

* version and changelog
2023-05-04 11:05:38 -04:00
Matt Robinson
aa01cdfc7a
fix: group together text from the same bounding box in partition_pdf with fast strategy (#542)
* switch to using PDF objects

* linting, linting, linting

* couple more tweaks

* added test for chevron-page

* version and changelog

* linting, linting, linting

* now processing 4 files
2023-05-03 18:33:24 -04:00
Matt Robinson
7e43a25f07
feat: add partition_multiple_via_api function (#539)
* added function for multiple files via api

* make multiple work with files

* updated docs strings

* changelog and version

* docs and contextlib for open files

* tests for partition multiple

* add tests for error conditions

* add output example
2023-05-03 15:06:06 -04:00
Matt Robinson
3c3c59a726
build(deps): add pdfminer.six to dependencies (#537) 2023-05-02 15:36:12 +00:00
Matt Robinson
19488bf15f
ci: only build docs on tags (#538)
* ci: only build docs on tags

* add branch for docs builds
2023-05-02 15:15:23 +00:00
dependabot[bot]
61209b34bd
build(deps): bump yarl from 1.8.2 to 1.9.2 in /requirements (#530)
Bumps [yarl](https://github.com/aio-libs/yarl) from 1.8.2 to 1.9.2.
- [Release notes](https://github.com/aio-libs/yarl/releases)
- [Changelog](https://github.com/aio-libs/yarl/blob/master/CHANGES.rst)
- [Commits](https://github.com/aio-libs/yarl/compare/v1.8.2...v1.9.2)

---
updated-dependencies:
- dependency-name: yarl
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 18:18:45 -04:00
Matt Robinson
e805ed465d
docs: add slack and github links back into docs page (#535)
* stars and github link to top of page

* wording updates

* remove unnecessary font weight change

* remove next arrows

* buttons to bottom on sidebar
2023-05-01 18:17:52 -04:00
Matt Robinson
22ebfa6714
docs: add download badges to README (#536)
* downloads badge

* total downloads
2023-05-01 18:17:31 -04:00
dependabot[bot]
7f9ec8108d
build(deps): bump importlib-metadata in /requirements (#531)
Bumps [importlib-metadata](https://github.com/python/importlib_metadata) from 6.5.0 to 6.6.0.
- [Release notes](https://github.com/python/importlib_metadata/releases)
- [Changelog](https://github.com/python/importlib_metadata/blob/main/CHANGES.rst)
- [Commits](https://github.com/python/importlib_metadata/compare/v6.5.0...v6.6.0)

---
updated-dependencies:
- dependency-name: importlib-metadata
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-01 21:49:26 +00:00
dependabot[bot]
8ed1627928
build(deps): bump huggingface-hub from 0.13.4 to 0.14.1 in /requirements (#528)
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.13.4 to 0.14.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases)
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.13.4...v0.14.1)

---
updated-dependencies:
- dependency-name: huggingface-hub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-05-01 17:22:21 -04:00
Matt Robinson
1b8e9a353a version bump for release 0.6.2 2023-04-26 16:29:16 -04:00
Matt Robinson
9fdc310358
fix: update detect_filetype for JSONs with text/plain MIME type (#520)
* check to see if text file is a json

* add json check into filetype detection

* added test for updated file detection logic

* bytes/strings handling

* changelog and version bump
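A check of the kind these bullets describe might be sketched as follows. The names `is_json_text` and `refine_filetype` are hypothetical, and the choice to treat only objects and arrays as JSON documents is an assumption of this sketch, not something the commit states.

```python
import json
from typing import Union


def is_json_text(content: Union[str, bytes]) -> bool:
    """Return True when the content parses as a JSON object or array."""
    if isinstance(content, bytes):
        try:
            content = content.decode("utf-8")
        except UnicodeDecodeError:
            return False
    try:
        parsed = json.loads(content)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Assumption: a bare scalar such as "42" is valid JSON but is
    # treated as plain text here.
    return isinstance(parsed, (dict, list))


def refine_filetype(mime_type: str, content: Union[str, bytes]) -> str:
    """Upgrade a text/plain detection to application/json when warranted."""
    if mime_type == "text/plain" and is_json_text(content):
        return "application/json"
    return mime_type
```

Only a `text/plain` result is second-guessed; any other detected MIME type passes through unchanged.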
2023-04-26 13:52:47 -04:00
Matt Robinson
4156cb12e0
feat: partition_via_api helper function (#518)
* added function for partitioning via api

* added tests for api function

* changelog and version

* add docs for partition_via_api
2023-04-26 09:05:35 -04:00
JaeyongLee
be8e6da884
fix: correct return types in exceeds_caps_ratio (#489)
* fix: fix text_type.py exceeds_cap_ratio() returns

There are cases when function is_possible_narrative_text receives an incorrect return from function exceeds_cap_ratio and does an incorrect classification, so some of the return values of exceeds_cap_ratio are corrected

* Update text_type.py exceeds_cap_ratio()

..

* Update text_type.py

..

* Update CHANGELOG.md

..

* linting, linting, linting ...

* update tests

* more test fixes

* Update text_type.py

..

* bump version and changelog

* add punctuation check

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-04-24 10:45:09 -04:00
Matt Robinson
894a190001
enhancement: check for copy protection on PDFs and fallback to hi res when necessary (#514)
* function to check if pdf is extractable

* add fallback logic for unextractable pdfs

* tests for docs with copy protection

* add test for unprocessable pdf

* update docs

* changelog and version

* update logic for images; reset file before proceeding

* 3 files for api tests

* docs update
2023-04-21 21:35:43 +00:00
qued
5b6640a55a
chore: change table param name (#513)
Updated the name of the parameter that controls whether we try to infer table structure.
0.6.1
2023-04-21 13:48:19 -05:00
Sebastian Laverde Alfonso
ba59ad6b3a
chore: add copy-protected pdf to sample-docs (#512) 2023-04-21 18:02:38 +00:00
Matt Robinson
a7a9ccd3a4
ci: separate job for ingest tests (#511)
* separate job for ingest tests

* remove lint from description
2023-04-21 13:31:36 -04:00
qued
dc4147d7df
feat: extract tables (#503)
Exposes table extraction through partition and partition_pdf.
0.6.0
2023-04-21 17:01:29 +00:00
Mallori Harrell
5d1e61cb3f
feat: add msg attachment support (#510)
* add msg function and fix bug in eml attachment function
2023-04-21 11:14:46 -05:00
Matt Robinson
6874df91ef
feat: allow users to pass OCR language into partition (#509)
* pip-compile new reqs

* bump inference version

* add language to pdf and image calls

* tests for passing in language

* version bump and changelog

* update docs

* pass ocr_languages in auto

* updated test fixtures

* typo in doc string
2023-04-21 13:41:26 +00:00
natygyoon
db2f70dbc4
sync version-sync.sh with other repos (#508) 2023-04-21 05:48:38 +09:00
Matt Robinson
bd1e540af9
feat: parameter to turn off SSL verification (#506)
* add kwarg for ssl verification

* update docs

* update version and changelog

* add verify kwarg to test
2023-04-20 11:13:56 -04:00