kravetsmic
8258dbb25f
feat: add --api-key parameter to unstructured-ingest ( #644 )
2023-06-14 05:05:18 +00:00
ryannikolaidis
9443bd40e2
ci: add set up python to test_unit job ( #743 )
2023-06-14 01:50:37 +00:00
ryannikolaidis
9d3f7183fd
ci: add cache version to ingest-test-fixtures-update-pr workflow ( #737 )
2023-06-13 18:15:35 -07:00
ryannikolaidis
a753370dc7
ci: update ingest fixtures from gh workflow ( #702 )
2023-06-13 10:27:32 -07:00
fran-unstructured
a313c02f69
docs: sort functions in bricks.rst in alphabetical order v2 ( #728 )
...
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-06-12 18:22:23 -04:00
Matt Robinson
c82fdb6a89
feat: partition_rst for ReStructured Text documents ( #725 )
...
* add example rst file
* filetype detection for rst files
* add partition_rst function
* add partition_rst to auto
* update readme
* update docs
* changelog and version
* pandocs -> pandoc
* fix typo
2023-06-12 19:31:10 +00:00
Yuming Long
2fbb1ccd30
Chore(ingest) : add tests on PDFs with fast strategy ( #614 )
...
Summary
* Updates "fast" PDF output element ordering to be consistent across Python versions by using the X,Y coordinates of the extracted elements
* Adds PDF ingest tests for the fast strategy via the new script ./test_unstructured_ingest/test-ingest-pdf-fast-reprocess.sh
Updated ingest tests procedure:
* Process files with the hi_res strategy and preserve the downloads in the repo at files-ingest-download/<ingest_test_name>
* Reprocess all PDFs with the fast strategy from the local files-ingest-download directory; the partition outputs are stored at expected-structured-output/pdf-fast-reprocess/<ingest_test_name>
Test
* Reproduce the tests with ./scripts/ingest-test-fixtures-update.sh; expect no updates. No secret tokens are needed, since the relevant tests won't produce PDFs.
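A minimal sketch of the ordering idea from the summary above, assuming each extracted element carries an X,Y position. The `Box` type, its field names, and the top-left origin are hypothetical and are not taken from the library; the point is only that sorting on an explicit coordinate key gives the same order on every Python version.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for an extracted PDF element."""
    text: str
    x: float  # horizontal position (assumed: grows to the right)
    y: float  # vertical position (assumed: grows downward from the top)


def order_by_position(boxes: List[Box]) -> List[Box]:
    # Sort top-to-bottom first, then left-to-right. The explicit key makes
    # the result independent of the order in which elements were extracted.
    return sorted(boxes, key=lambda b: (b.y, b.x))


boxes = [
    Box("right", 300.0, 50.0),
    Box("below", 40.0, 120.0),
    Box("left", 40.0, 50.0),
]
ordered_text = [b.text for b in order_by_position(boxes)]
```

Real PDF coordinates often have their origin at the bottom-left, in which case the Y component of the key would need to be negated.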
2023-06-12 19:02:48 +00:00
Matt Robinson
3f80301964
fix: handling for emails without datetimes ( #724 )
...
* add empty filetype
* add empty handling to partition
* changelog and version
* handling for when there is no datetime
* changelog and version
2023-06-12 17:11:04 +00:00
Yuming Long
b354e8eec6
Chore: Allow passing kwargs to request data field ( #716 )
...
* bump again :(
* update to kwarg
* add test case
* rename to request_kwargs
* remove install detectron2
* pip compile
* add changelog for remove detectron2 install
* resolve weaviate import issue on python 3.9
0.7.4
2023-06-12 12:39:58 -04:00
John
fc53277826
fix: Enable MIME type detection if libmagic is not available ( #714 )
...
* fix: Add filetype check if libmagic unavailable
* make tidy
* make check
* fix: change mime_type error to warning
* Update changelog and __version__
* fix: Add filetype to requirements
2023-06-09 17:06:21 -04:00
Matt Robinson
19ab6d960f
enhancement: handling for empty files in detect_filetype and partition ( #710 )
...
* add empty filetype
* add empty handling to partition
* changelog and version
2023-06-09 16:07:50 -04:00
Yuming Long
80f0b4a132
Fix: Pass strategy parameter down from partition for partition_image ( #708 )
...
* changelog and version
* passing param down
* test should be auto
* doc nit
* lint
* update image output
2023-06-09 13:54:18 -04:00
Matt Robinson
0289ca3ea7
fix: handle encoding for text file checks ( #707 )
...
* fixed encoding issue for _is_text_file_a_json
* changelog and version
2023-06-09 11:08:16 -04:00
John
b2b92ea79d
fix: filetype detection if a CSV has a text/plain MIME type ( #691 )
...
* fix: Filetype detection if a CSV has a text/plain MIME type #621
* bug: fix csv detection and create _read_file_start_for_type_check func
* fix: Make call to _is_text_file_a_csv from detect_filetype
2023-06-08 16:21:07 -04:00
Matt Robinson
c1ba090c34
fix: suppress file conversion warnings in convert_office_doc ( #703 )
...
* test that output is suppressed
* add test for error output
* changelog and version
2023-06-08 12:33:06 -04:00
dependabot[bot]
559a5578ba
build(deps): bump label-studio-sdk in /requirements ( #701 )
...
Bumps [label-studio-sdk](https://github.com/heartexlabs/label-studio-sdk) from 0.0.27 to 0.0.28.
- [Commits](https://github.com/heartexlabs/label-studio-sdk/commits)
---
updated-dependencies:
- dependency-name: label-studio-sdk
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 11:41:40 -04:00
dependabot[bot]
8aa87fc3b7
build(deps): bump ruff from 0.0.270 to 0.0.272 in /requirements ( #699 )
...
Bumps [ruff](https://github.com/charliermarsh/ruff) from 0.0.270 to 0.0.272.
- [Release notes](https://github.com/charliermarsh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/charliermarsh/ruff/compare/v0.0.270...v0.0.272)
---
updated-dependencies:
- dependency-name: ruff
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-08 09:40:17 -04:00
dependabot[bot]
f681de8d74
build(deps): bump sphinx-rtd-theme from 1.2.1 to 1.2.2 in /requirements ( #698 )
...
Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.1 to 1.2.2.
- [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst)
- [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.1...1.2.2)
---
updated-dependencies:
- dependency-name: sphinx-rtd-theme
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-08 09:39:35 -04:00
Matt Robinson
aa4d4329db
fix: partition_via_api reflects actual filetype in metadata ( #696 )
...
* fix: `partition_via_api` reflects actual filetype in metadata
* added in list length check
* changelog typo
2023-06-08 13:24:16 +00:00
ryannikolaidis
dabda67c8f
fix: ingest-test-fixtures-update script to pass env vars ( #697 )
2023-06-08 04:48:49 +00:00
ryannikolaidis
2094b976cf
feat: adds data_source metadata to ElementMetadata ( #690 )
2023-06-07 21:22:18 -07:00
Matt Robinson
6bc116887f
enhancement: add encoding to elements_to_json and elements_from_json ( #694 )
...
* add encoding to elements_to_json and elements_from_json
* version and changelog
* add new test
* fix version
* revert test file
* blank line to test
* no blank line
0.7.2
2023-06-07 13:20:06 -04:00
Matt Robinson
c6dc466e79
docs: update capabilities table; fix mistake in para grouping docs ( #683 )
...
* docs: update capabilities table with rtf/md/epub tables
* fix regex in docs
* revert bricks update
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-06-06 18:29:56 +00:00
Yuming Long
533689196b
Chore: bump base image to update tesseract version ( #680 )
...
* dockerfile
* changelog version
* version bump
2023-06-06 17:01:16 +00:00
kravetsmic
7df31ead75
feat: if no params show help ( #649 )
...
* feat: if no params show help
* Remove comments
* feat: update checking params
* updated main script and changelog
* version bump
---------
Co-authored-by: yuming <305248291@qq.com>
2023-06-06 16:25:44 +00:00
ryannikolaidis
29f0deda63
test: revive ingest unit tests ( #688 )
2023-06-06 09:03:13 -07:00
Sebastian Laverde Alfonso
508ce48d54
Feat: notebook for Elasticsearch integration ( #681 )
...
* feat: nb elasticsearch unstructured sentiment
* chore: refactor readme for elasticsearch nb
* fix: update es-credentials.ini
* chore: update es-credentials.ini
* fix: typo in nb load-into-es.ipynb
exist --> exists
* fix: typo 2 in nb load-into-es.ipynb
obtaing --> obtain
2023-06-05 19:05:08 +00:00
Christine Straub
547bb38d86
fix: encoding/decoding error with default utf-8 encoding for html, xml, and auto ( #660 )
...
Add functionality to try other common encodings for html, xml files if an error related to the encoding is raised and the user has not specified an encoding.
Change auto.py to have a None default for encoding
Remove the unused parameter encoding from partition_pdf
Add functionality to the read_txt_file utility function to handle file-like object from URL
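The fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `decode_with_fallback` and the candidate list `COMMON_ENCODINGS` are hypothetical, and the actual set of encodings tried by the commit is not given in this log.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the encodings the real code tries are not
# stated in the commit message.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    fallbacks: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Return (text, encoding_used)."""
    # If the caller specified an encoding, respect it and let any
    # UnicodeDecodeError propagate instead of silently guessing.
    if encoding is not None:
        return data.decode(encoding), encoding

    # Otherwise try each common encoding in turn.
    last_error: Optional[UnicodeDecodeError] = None
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError as error:
            last_error = error
    raise ValueError("could not decode data with any fallback encoding") from last_error


text, used = decode_with_fallback(b"caf\xe9")
```

Fallbacks apply only when no encoding was passed, which matches the "user has not specified an encoding" condition and is consistent with the None default in auto.py.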
2023-06-05 11:27:12 -07:00
ryannikolaidis
7d157c1ede
test: add benchmark script ( #638 )
2023-06-05 09:14:43 -07:00
John
18aefc854a
chore: Re-enable test_upload_label_studio_data_with_sdk ( #674 )
2023-06-02 23:38:43 +00:00
Matt Robinson
cf0ff91e37
fix: recognize code files with auto ( #677 )
...
* add check for code mime type
* add file extensions
* add new tests
* version and changelog
2023-06-02 20:09:43 +00:00
Matt Robinson
6c10d8f022
docs: update detectron2 instructions in readme ( #678 )
2023-06-02 19:44:41 +00:00
Meir
74a61e33d8
fix: metadata.page_number of pptx files ( #675 )
...
* fix: metadata.page_number of pptx files
* update changelog
2023-06-02 13:22:43 +00:00
qued
01f76888e0
build(deps): add tabulate dependency ( #673 )
...
tabulate is used by functions that extract tables from Microsoft documents, but there is nothing explicitly requiring the library. This was not caught by tests, because for some reason, tabulate is in base.txt.
This PR adds the dependency to base.in (which also puts it in setup.py), and recompiles the dependencies.
2023-06-01 16:56:24 -05:00
ryannikolaidis
bdef4fd398
test: adds profiling script ( #661 )
2023-06-01 21:26:05 +00:00
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function ( #672 )
...
* add weaviate docker compose
* added staging brick and tests for weaviate
* initial notebook and requirements file
* add commentary to weaviate notebook
* weaviate readme
* update docs
* version and change log
* install weaviate client
* install weaviate; skip for docker
* linting, linting, linting
* install weaviate client with deps
* comments on weaviate client
* fix module not found error for docker container
* skipped wrong test in docker
* fix typos
* add in local-inference
0.7.1
2023-06-01 20:48:54 +00:00
Trevor Bossert
cf70c86574
Build from rocky base image ( #665 )
...
* build from Rocky linux unstructured base image
* add qemu for arm
* comment out push while testing
* remove quotes
* Add arch
* bump login action
* add ARCH env var to the push step
* run only subset of tests on arm image
Tests on emulated arm are extremely slow. The likelihood of something breaking only in the arm image is minimal. I say that knowing I likely just jinxed us.
* re-enable push from main
* add a dnf cleanup
* version bump
* move from dev to minor version bump
2023-06-01 12:16:04 -07:00
dependabot[bot]
cd9fd9b395
build(deps): bump pygithub from 1.57.0 to 1.58.2 in /requirements ( #669 )
...
Bumps [pygithub](https://github.com/pygithub/pygithub) from 1.57.0 to 1.58.2.
- [Release notes](https://github.com/pygithub/pygithub/releases)
- [Changelog](https://github.com/PyGithub/PyGithub/blob/master/doc/changes.rst)
- [Commits](https://github.com/pygithub/pygithub/compare/v1.57...v1.58.2)
---
updated-dependencies:
- dependency-name: pygithub
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-06-01 18:45:47 +00:00
dependabot[bot]
1152fe4383
build(deps): bump sphinx-rtd-theme in /requirements ( #670 )
...
Bumps [sphinx-rtd-theme](https://github.com/readthedocs/sphinx_rtd_theme) from 1.2.0rc3 to 1.2.1.
- [Changelog](https://github.com/readthedocs/sphinx_rtd_theme/blob/master/docs/changelog.rst)
- [Commits](https://github.com/readthedocs/sphinx_rtd_theme/compare/1.2.0rc3...1.2.1)
---
updated-dependencies:
- dependency-name: sphinx-rtd-theme
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-01 14:28:27 -04:00
Matt Robinson
be04e1b7c4
docs: tables supported for ppt now
2023-05-31 16:15:04 -04:00
qued
d3600dd5da
build(deps): update inference version ( #662 )
...
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
0.7.0
2023-05-31 13:50:15 -05:00
cshaddox
d23e0d6420
feat: table extraction for power points ( #664 )
...
* Handling tables
* updating changelog
* Adding accidentally removed code
* remove newline
* reuse table extraction function; add test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 18:26:32 +00:00
Matt Robinson
52e5a5ca8d
fix: raise ValueError in partition_via_api if filename not present ( #663 )
...
* raise value error if filename not specified for api
* version and changelog
2023-05-31 18:09:58 +00:00
kravetsmic
795a9a0b4c
feat: add jupyter make commands ( #651 )
...
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 14:01:23 +00:00
John
c78c5b6adf
fix: page_number appears in partition_html metadata if include_metadata=False ( #658 )
...
* fix: page_number appears in partition_html metadata if include_metadata=False
* Update common.py
* Update CHANGELOG
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-30 20:47:55 +00:00
Matt Robinson
f7cde5539a
fix: page_number should not always be 1 in the metadata ( #657 )
...
* fix page number issue
* add tests
* changelog and version
* update changelog
2023-05-30 15:10:14 -04:00
wesleysanjose
b8dcf437ee
fix: add .log to list of TXT filetypes
2023-05-30 14:13:58 -04:00
Christine Straub
5b5fb3e13b
Issue/encoding error eml ( #639 )
...
This PR adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
2023-05-30 10:24:02 -07:00
Matt Robinson
3e983efce3
docs: add feature table to README ( #655 )
...
* remove announcement
* add table with filetypes
* remove filetype specific examples
* remove line break
* remove easy gif
* fix extra whitespace
2023-05-30 15:56:25 +00:00
Yuming Long
66058e76bf
changelog and version ( #645 )
0.6.11
2023-05-26 22:21:16 -04:00