929 Commits

Author SHA1 Message Date
Matt Robinson
aa4d4329db
fix: partition_via_api reflects actual filetype in metadata (#696)
* fix: `partition_via_api` reflects actual filetype in metadata

* added in list length check

* changelog typo
2023-06-08 13:24:16 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars (#697) 2023-06-08 04:48:49 +00:00
ryannikolaidis
2094b976cf
feat: adds data_source metadata to ElementMetadata (#690) 2023-06-07 21:22:18 -07:00
Matt Robinson
6bc116887f
enhancement: add encoding to elements_to_json and elements_from_json (#694)
* add encoding to elements_to_json and elements_from_json

* version and changelog

* add new test

* fix version

* revert test file

* blank line to test

* no blank line
0.7.2
2023-06-07 13:20:06 -04:00
Matt Robinson
c6dc466e79
docs: update capabilities table; fix mistake in para grouping docs (#683)
* docs: update capabilities table with rtf/md/epub tables

* fix regex in docs

* revert bricks update

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-06-06 18:29:56 +00:00
Yuming Long
533689196b
Chore: bump base image to update tesseract version (#680)
* dockerfile

* changelog version

* version bump
2023-06-06 17:01:16 +00:00
kravetsmic
7df31ead75
feat: if no params show help (#649)
* feat: if no params show help

* Remove comments

* feat: update checking params

* updated main script and changelog

* version bump

---------

Co-authored-by: yuming <305248291@qq.com>
2023-06-06 16:25:44 +00:00
ryannikolaidis
29f0deda63
test: revive ingest unit tests (#688) 2023-06-06 09:03:13 -07:00
Sebastian Laverde Alfonso
508ce48d54
Feat: notebook for Elasticsearch integration (#681)
* feat: nb elasticsearch unstructured sentiment

* chore: refactor readme for elasticsearch nb

* fix: update es-credentials.ini

* chore: update es-credentials.ini

* fix: typo in nb load-into-es.ipynb

exist --> exists

* fix: typo 2 in nb load-into-es.ipynb

obtaing --> obtain
2023-06-05 19:05:08 +00:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto (#660)
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.

Change auto.py to have a None default for encoding

Remove the unused parameter encoding from partition_pdf

Add functionality to the read_txt_file utility function to handle file-like object from URL
2023-06-05 11:27:12 -07:00
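The encoding fallback described in the commit above (#660) can be sketched roughly as follows. This is a minimal illustration and not the library's code: the function name `decode_with_fallback` and the `COMMON_ENCODINGS` list are assumptions, since the commit message does not say which encodings are tried or in what order.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical fallback list: the commit message does not name the encodings tried.
# latin-1 maps every byte to a character, so it acts as a last resort here.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes and return (text, encoding_used).

    If the caller names an encoding, only that encoding is used and any
    decoding error propagates. Otherwise each common encoding is tried in turn.
    """
    if encoding is not None:
        # The user specified an encoding, so do not second-guess it.
        return data.decode(encoding), encoding

    last_error: Optional[Exception] = None
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeError as err:
            last_error = err  # remember the failure and try the next candidate

    # Only reachable if the candidate list has no catch-all such as latin-1.
    raise ValueError("could not decode data with any common encoding") from last_error


if __name__ == "__main__":
    text, used = decode_with_fallback("café".encode("cp1252"))
    print(text, used)
```

Keeping the default at `None` is what lets the function tell "the user asked for utf-8" apart from "nobody specified anything", which matches the change of the `auto.py` default to `None` noted in the commit.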
ryannikolaidis
7d157c1ede
test: add benchmark script (#638) 2023-06-05 09:14:43 -07:00
John
18aefc854a
chore: Re-enable test_upload_label_studio_data_with_sdk (#674) 2023-06-02 23:38:43 +00:00
Matt Robinson
cf0ff91e37
fix: recognize code files with auto (#677)
* add check for code mime type

* add file extensions

* add new tests

* version and changelog
2023-06-02 20:09:43 +00:00
Matt Robinson
6c10d8f022
docs: update detectron2 instructions in readme (#678) 2023-06-02 19:44:41 +00:00
Meir
74a61e33d8
fix: metadata.page_number of pptx files (#675)
* fix: metadata.page_number of pptx files

* update changelog
2023-06-02 13:22:43 +00:00
qued
01f76888e0
build(deps): add tabulate dependency (#673)
tabulate is used by functions that extract tables from Microsoft documents, but there is nothing explicitly requiring the library. This was not caught by tests, because for some reason, tabulate is in base.txt.

This PR adds the dependency to base.in (which also puts it in setup.py), and recompiles the dependencies.
2023-06-01 16:56:24 -05:00
ryannikolaidis
bdef4fd398
test: adds profiling script (#661) 2023-06-01 21:26:05 +00:00
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
0.7.1
2023-06-01 20:48:54 +00:00
Trevor Bossert
cf70c86574
Build from rocky base image (#665)
* build from Rocky linux unstructured base image

* add qemu for arm

* comment out push while testing

* remove quotes

* Add arch

* bump login action

* add ARCH env var to the push step

* run only subset of tests on arm image

Tests on emulated arm are extremely slow. Likelihood of something breaking only in the arm image is minimal. I say that knowing I likely just jinxed us.

* re-enable push from main

* add a dnf cleanup

* version bump

* move from dev to minor version bump
2023-06-01 12:16:04 -07:00
dependabot[bot]
cd9fd9b395
build(deps): bump pygithub from 1.57.0 to 1.58.2 in /requirements (#669)
Bumps [pygithub](https://github.com/pygithub/pygithub) from 1.57.0 to 1.58.2.
- [Release notes](https://github.com/pygithub/pygithub/releases)
- [Changelog](https://github.com/PyGithub/PyGithub/blob/master/doc/changes.rst)
- [Commits](https://github.com/pygithub/pygithub/compare/v1.57...v1.58.2)

---
updated-dependencies:
- dependency-name: pygithub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-01 18:45:47 +00:00
dependabot[bot]
1152fe4383
build(deps): bump sphinx-rtd-theme in /requirements (#670)
Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.0rc3 to 1.2.1.
- [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst)
- [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.0rc3...1.2.1)

---
updated-dependencies:
- dependency-name: sphinx-rtd-theme
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-01 14:28:27 -04:00
Matt Robinson
be04e1b7c4 docs: tables supported for ppt now 2023-05-31 16:15:04 -04:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
0.7.0
2023-05-31 13:50:15 -05:00
cshaddox
d23e0d6420
feat: table extraction for power points (#664)
* Handling tables

* updating changelog

* Adding accidentally removed code

* remove newline

* reuse table extraction function; add test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 18:26:32 +00:00
Matt Robinson
52e5a5ca8d
fix: raise ValueError in partition_via_api if filename not present (#663)
* raise value error if filename not specified for api

* version and changelog
2023-05-31 18:09:58 +00:00
kravetsmic
795a9a0b4c
feat: add jupyter make commands (#651)
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 14:01:23 +00:00
John
c78c5b6adf
fix: page_number appears in partition_html metadata if include_metadata=False (#658)
* fix: page_number appears in partition_html metadata if include_metadata=False

* Update common.py

* Update CHANGELOG

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-30 20:47:55 +00:00
Matt Robinson
f7cde5539a
fix: page_number should not always be 1 in the metadata (#657)
* fix page number issue

* add tests

* changelog and version

* update changelog
2023-05-30 15:10:14 -04:00
wesleysanjose
b8dcf437ee
fix: add .log to list of TXT filetypes 2023-05-30 14:13:58 -04:00
Christine Straub
5b5fb3e13b
Issue/encoding error eml (#639)
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the
encoding is raised and the user has not specified an encoding.
2023-05-30 10:24:02 -07:00
Matt Robinson
3e983efce3
docs: add feature table to README (#655)
* remove announcement

* add table with filetypes

* remove filetype specific examples

* remove line break

* remove easy gif

* fix extra whitespace
2023-05-30 15:56:25 +00:00
Yuming Long
66058e76bf
changelog and version (#645) 0.6.11 2023-05-26 22:21:16 -04:00
Yuming Long
fc59a043b7
Chore: Support epub tests in docker image (#630)
* docker works

* more epub tests

* changelog version

* support epub + odt + rtf

* update dockerfile

* revert..

* install pandoc on ci env

* pandoc docker grab based on arch

* move arch into image

* move back to base image
2023-05-26 15:38:48 -04:00
cragwolfe
c5d9469001
feat: add xls support (#632)
Add support for older .XLS files from the partition function in unstructured.partition.auto.

Note, this should also work on the centos7 unstructured image (with the requirements/*txt updates in this PR).
0.6.10
2023-05-26 01:55:32 -07:00
ryannikolaidis
b767f6b0ec
fix(ci): prevent gha caching conflicts (#643) 2023-05-25 17:20:28 -07:00
qued
c82bad1061
build(deps): avoid version conflicts (#636)
Addresses #631.

* Uses constraints to keep dependency versions more consistent.
* Moves all dependencies to .in files which are then ingested by setup.py.
* Adds script to check consistency of all extras.
* Adds consistency check to CI.

I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras), it is possible to get a conflict between two of the extras files. There are ways of trying to avoid that (like constraining each file by all the files that have already been processed before it, in the order given in the make pip-compile target), but the ones I could think of seemed a little overwrought and come with problems of their own. If a conflict arises, it should be flagged by CI or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt.

Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt which can cause conflicts or keep dependencies from being updated properly.
0.6.9
2023-05-24 22:29:35 +00:00
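The ordering rule in the commit above (#636), that fileB.in must be compiled before any fileA.in constrained by fileB.txt, amounts to a topological sort of the constraint graph. The sketch below only illustrates that idea with Python's standard `graphlib` (Python 3.9+). The `constrained_by` mapping and the `compile_order` helper are hypothetical and are not part of the repository's Makefile or scripts.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical constraint graph: each .in file maps to the .in files whose
# compiled .txt output constrains it (so those must be compiled first).
constrained_by: Dict[str, Set[str]] = {
    "base.in": set(),
    "fileB.in": {"base.in"},
    "fileA.in": {"fileB.in"},
}


def compile_order(graph: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every file comes after the files that constrain it."""
    # TopologicalSorter treats the dict values as predecessors of each key
    # and raises graphlib.CycleError if the constraints are circular.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for name in compile_order(constrained_by):
        print(name)
```

Compiling in this order means each file sees freshly compiled constraints rather than a stale fileB.txt, which is the failure mode the commit message warns about.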
Christine Straub
a1fed6d4c6
Issue/unicode error (#608)
This PR adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
2023-05-23 13:35:38 -07:00
Trevor Bossert
a78719666a
Build using base image (#625)
This should speed up the builds a lot
2023-05-22 11:13:24 -07:00
qued
55e5d8ea2f
enhancement: include coords in fast (#626)
Makes the bounding box coordinates available when using fast strategy.

* Refactored partition_text to make the workflow of categorizing an element purely from the text available without running the entirety of partition_text.
* Transformed the coordinates from pdf space into pixel space to be consistent with hi_res. We will probably want to revisit the coordinate system soon.
2023-05-20 16:26:55 -05:00
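As a rough picture of what a transform from pdf space into pixel space involves (commit #626 above), the sketch below assumes PDF coordinates in points (72 per inch) with the origin at the bottom-left, and pixel coordinates with the origin at the top-left. The function name, the y-axis flip, and the default DPI are all assumptions made for illustration. The commit does not state the conventions the library uses.

```python
from typing import Tuple

PDF_POINTS_PER_INCH = 72  # default unit of PDF user space


def pdf_point_to_pixel(
    x: float,
    y: float,
    page_height_pts: float,
    dpi: int = 200,  # assumed rendering resolution, not taken from the commit
) -> Tuple[float, float]:
    """Map a point from PDF space (origin bottom-left, y up, units of points)
    to pixel space (origin top-left, y down, units of pixels)."""
    scale = dpi / PDF_POINTS_PER_INCH
    # Flip the y axis around the page height, then scale both axes.
    return (x * scale, (page_height_pts - y) * scale)


if __name__ == "__main__":
    # One inch from the left and one inch below the top of a US Letter page.
    print(pdf_point_to_pixel(72, 720, page_height_pts=792, dpi=144))
```

A bounding box would be converted by applying the same mapping to each of its corners.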
Matt Robinson
fda51d6ead
fix: add more mime types for csv (#620) 2023-05-19 16:40:26 -05:00
Matt Robinson
21c821d651
feat: add partition_csv function (#619)
* add csv into filetype detection

* first pass on csv

* add tests for csv

* add csv to auto

* version bump

* update readme and docs

* fix doc strings
0.6.8
2023-05-19 15:57:42 -04:00
Matt Robinson
046af734d7
release: bump version for 0.6.7 release (#617) 0.6.7 2023-05-19 13:30:17 -04:00
Yuming Long
ab5f92dd79
Fix(ingest): Deprecate --s3-url in favor of --remote-url (#616)
* deprecation s3-url

* changelog and version

* download dir not now
2023-05-19 12:11:40 -04:00
ryannikolaidis
7942bc9d5b
chore: refactor for ingest standard_config options (#599) 2023-05-18 16:49:30 -07:00
Matt Robinson
23ff32cc42
feat: add partition_xml for XML files (#596)
* first pass on partition_xml

* add option to keep xml tags

* added tests for xml

* fix filename

* update filenames

* remove outdated readme

* add xml to auto

* version and changelog

* update readme and docs

* pass through include_metadata

* update include_metadata description

* add README back in

* linting, linting, linting

* more linting

* spooled to bytes doesn't need to be a tuple

* Add tests for newly supported filetypes

* Correct metadata filetype

* doc typo

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* typo fix

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* keep_xml_tags -> xml_keep_tags

---------

Co-authored-by: Alan Bertl <alan@unstructured.io>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-05-18 15:40:12 +00:00
Matt Robinson
b6bfbf9108
fix: track filename in metadata for docx tables (#597)
* fix: track filename in metadata for docx tables

* bump version

* remove accidental commit
2023-05-18 10:20:38 -04:00
Meir
301cef27a4
feat: add page_name to metadata for Excel documents (#609)
* Add page_name to metadata for Excel documents

* Update changelog and version number

* fix lint
2023-05-18 13:53:23 +00:00
Mallori Harrell
34d563c1fc
feat: Create spacy notebook example (#593)
* add new notebook for spacy
2023-05-17 15:42:15 -05:00
Eu Jin Marcus Yatim
7eac1f8ca7
refactor: update detect_filetype() to use hashmap for mime type return (#591)
* Update detect_filetype() to use hashmap for mime type return

* fix: text mime type and linting

* fix: declare docx and xlsx mime types locally and also fix linting

* Update CHANGELOG.md

* tweaks for failing tests

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-05-17 13:48:52 +00:00
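The refactor above (#591) replaces branching on MIME type with a hashmap lookup. A minimal sketch of that pattern follows. The `FileType` members, the `MIME_TO_FILETYPE` table, and the `filetype_from_mime` helper are illustrative stand-ins and not the library's actual definitions, although the MIME strings themselves are the standard registered ones.

```python
from enum import Enum
from typing import Dict


class FileType(Enum):
    # Illustrative subset only; the real enum in the library may differ.
    UNK = "unknown"
    PDF = "pdf"
    CSV = "csv"
    DOCX = "docx"
    XLSX = "xlsx"


# One dictionary lookup replaces a chain of if/elif comparisons.
MIME_TO_FILETYPE: Dict[str, FileType] = {
    "application/pdf": FileType.PDF,
    "text/csv": FileType.CSV,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": FileType.DOCX,
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": FileType.XLSX,
}


def filetype_from_mime(mime_type: str) -> FileType:
    """Return the file type for a MIME type, or FileType.UNK if it is not mapped."""
    return MIME_TO_FILETYPE.get(mime_type, FileType.UNK)


if __name__ == "__main__":
    print(filetype_from_mime("text/csv"))
    print(filetype_from_mime("application/x-made-up"))
```

With a table like this, supporting a new MIME type (as the later "add more mime types for csv" commit does) becomes a one-line change to the data and leaves the control flow untouched.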
Trevor Bossert
f4f40f58e3
Add discord token so tests run (#598)
* Add discord token so tests run

* install discord deps

* Update expected results for discord test
2023-05-16 16:46:20 -07:00