1109 Commits

Author SHA1 Message Date
Matt Robinson
15618e8346
fix: handling for empty tables in word docs and powerpoints (#982)
* fix table index error

* changelog and version
2023-07-27 11:07:27 -04:00
Yuming Long
df1ba39905
Chore: add uns api repo unittests (#954)
* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-07-26 20:55:35 +00:00
Matt Robinson
d9aed66b65
feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv partition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name only if possible

* feat: add csv tests

* fix: renaming

* feat: add field metadata_date as date of last mod

* feat: add tests for partition_docx

* feat: add field metadata_date to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coordinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string for rst partition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2023-07-26 15:10:14 -04:00
cragwolfe
1e2d531bb9
build(release): cut 0.8.4 release (#979)
0.8.4
2023-07-26 18:01:31 +00:00
shreyanid
71a24b2887
Update partition_via_api to not post a strategy value if not user specified (#967)
* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements
2023-07-26 09:56:39 -07:00
Matt Robinson
08fc41cde2
chore: cleanup changelog for 0.8.2 (#976)
2023-07-26 15:33:30 +00:00
Roman Isecke
b39e0d7354
Roman/expose dpi param (#966)
* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well
0.8.2
2023-07-26 09:26:06 -04:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
2023-07-26 04:09:26 +00:00
Matt Robinson
d694cd53bf
refactor: simplifies JSON detection and add tests (#975)
* refactor json detection

* version and changelog

* fix mock in test
2023-07-25 19:59:45 +00:00
John
f282a10715
enhancement: improve json detection by detect_filetype (#971)
* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version
2023-07-25 12:47:39 -04:00
Christine Straub
f7def03d55
Fix/521 pdf2image memory error hi res (#948)
This PR is to reflect changes in the unstructured-inference PR #152

* Update functionality to retrieve image metadata from a page for document_to_element_list
2023-07-24 19:22:56 +00:00
Matt Robinson
6e852cbe70
feat: track links from anchor tags in partition_html (#959)
* track tags in html

* pass through links as metadata

* add test for grabbing links

* one more link

* changelog and version

* update docs

* fix tests

* update empty link assertion

* ingest-test-fixtures-update

* Update ingest test fixtures (#961)
2023-07-24 18:28:56 +00:00
Jason Scheirer
196efa09b1
chore: Add encoding param to ingest (#955)
* Add encoding param to ingest
2023-07-24 10:06:13 -07:00
John
676c50a6ec
feat: add min_partition kwarg that combines elements below a specified threshold (#926)
* add min_partition

* functioning _split_content_to_fit_min_max

* create test and make tidy/check

* fix rebase issues

* fix type hinting, remove unused code, add tests

* various changes and refactoring of methods

* add test, refactor, change var names for debugging purposes

* update test

* make tidy/check

* give more descriptive var names and add comments

* update xml partition via partition_text and create test

* fix <pre> bug for test_partition_html_with_pre_tag

* make tidy

* refactor and fix tests

* make tidy/check

* ingest-test-fixtures-update

* change list comprehension to for loop

* fix error check
2023-07-24 15:57:24 +00:00
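The idea behind `min_partition` in #926 (combining pieces that fall below a size threshold) can be illustrated roughly as follows. This is a minimal sketch, not the code from the PR: the helper name `combine_short_chunks`, the use of character length as the size measure, and the choice to fold a short trailing remainder into the previous chunk are all assumptions made for illustration.

```python
from typing import List


def combine_short_chunks(chunks: List[str], min_partition: int) -> List[str]:
    """Hypothetical helper: merge consecutive chunks until each has at least
    `min_partition` characters."""
    combined: List[str] = []
    buffer = ""
    for chunk in chunks:
        # Accumulate text, separating merged chunks with a single space.
        buffer = f"{buffer} {chunk}" if buffer else chunk
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # Assumption: a short leftover is attached to the previous chunk
        # rather than emitted as an undersized chunk of its own.
        if combined:
            combined[-1] = f"{combined[-1]} {buffer}"
        else:
            combined.append(buffer)
    return combined


print(combine_short_chunks(["ab", "cd", "efghij", "k"], min_partition=5))
```

With a threshold of 5, the two short leading chunks are merged and the trailing single character is attached to the chunk before it.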
qued
d0329126ef
chore: remove outdated error message (#935)
There's an issue in unstructured-inference about these blocks trapping unrelated import errors. The fix for that would be to narrow the scope of the traps, but I think this is made redundant by the requires_dependencies decorator, so I removed it completely.
2023-07-22 05:10:26 +00:00
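The message for #935 contrasts broad import traps with a dependency-checking decorator. One way such a decorator could avoid swallowing unrelated import errors is to test only whether the module can be located, without importing it. The sketch below is an illustration under that assumption and is not the library's `requires_dependencies` implementation; the name `requires_modules` is hypothetical.

```python
import importlib.util
from functools import wraps
from typing import Any, Callable


def requires_modules(*names: str) -> Callable:
    """Hypothetical decorator: fail early if a top-level module is not installed."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # find_spec locates a top-level module without importing it, so an
            # error raised inside the module itself is never caught here.
            missing = [n for n in names if importlib.util.find_spec(n) is None]
            if missing:
                raise ImportError(f"Missing dependencies: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_modules("json")
def dump_ok() -> str:
    import json

    return json.dumps({"ok": True})


print(dump_ok())
```

Because the check happens when the function is called rather than at import time, the package can still be imported when optional extras are absent.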
Emily Chen
050cfafb70
Add subsection for docs; prioritize getting started with container (#962)
2023-07-21 17:29:58 -07:00
Amanda Cameron
35e529f2d4
updating api key link (#960)
2023-07-21 13:05:40 -07:00
Jack Retterer
708714dab5
docs: fixed typo in Installation guide (#945)
2023-07-21 13:33:44 +00:00
Yuming Long
208148abe7
Chore: update require api key in readme (#952)
2023-07-20 16:10:03 +00:00
Ronny H
31511793cb
Update README and API doc for Chipper announcement (#940)
Update README and API doc for Chipper model beta version announcement
2023-07-19 13:00:37 -07:00
Emily Chen
4b1e5a8057
Publicly document OneDrive connector (#949)
2023-07-18 16:37:44 -07:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
Chris Watts
bf47dc10ae
feat: add slide notes to pptx (#942)
* Add slide notes to pptx

* Add include_slide_notes flag for pptx

* Update CHANGELOG.md 0.8.2-dev2 - Add slide notes to pptx

* Fix lint error

* Fix pptx.py lint
2023-07-17 17:52:34 -04:00
ryannikolaidis
3b33331082
docs: fix readme word docs typo (#946)
2023-07-17 20:04:50 +00:00
Matt Robinson
0d332743eb
fix: enable passing filters to partition_doc for libreoffice conversion (#934)
* add optional filter to docx conversion

* add filters to tests

* changelog and version

* update filter for power point
2023-07-17 13:54:44 -04:00
Yuming Long
067eb5701f
Fix: docker build with missing dependency (#931)
* pip-compile

* test trigger

* Revert "test trigger"

This reverts commit 69d4c8cd9f285f6ef4bf445f5fb27b5c62e1391c.

* version conflict and pip compile
2023-07-14 22:20:11 +00:00
Matt Robinson
685e33f890
build: remove docs-build branch (#933)
2023-07-14 16:23:47 -04:00
Christine Straub
5b7ae29876
fix: 521 pdf2image memory error (#924)
Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy.

* Add functionality to convert a PDF in small chunks of pages at a time
* Add functionality to write images to computer storage temporarily instead of keeping them in memory
* Set the file's current position to the beginning after reading the file in convert_to_bytes
2023-07-14 15:08:33 -05:00
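Two of the steps listed for #924, converting a PDF a few pages at a time and rewinding the file after reading it, can be sketched with the standard library alone. This is an illustrative sketch, not the PR's code: `page_chunks` and `read_and_rewind` are hypothetical names, and the ranges are 1-indexed and inclusive on the assumption that they would be handed to a converter such as pdf2image's `convert_from_path` through its `first_page` and `last_page` parameters.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def page_chunks(num_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges."""
    for first in range(1, num_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, num_pages)


def read_and_rewind(file: BinaryIO) -> bytes:
    """Read the whole file, then reset the position so later readers start at 0."""
    data = file.read()
    file.seek(0)
    return data


for first, last in page_chunks(num_pages=25, chunk_size=10):
    # Each range would be converted separately, so only one chunk of page
    # images needs to be held at a time.
    print(first, last)

pdf_like = io.BytesIO(b"%PDF-1.7 example bytes")
payload = read_and_rewind(pdf_like)
```

Without the `seek(0)`, a second `read()` on the same file object would return empty bytes, which is the kind of failure the third bullet guards against.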
fran-unstructured
dd4bb752e2
docs: Add Unstructured API documentation (#928)
* fran-unstructured/Add unstructured API documentation

* fran-unstructured/add api docs to index.rst

* fran-unstructured/add api docs with changes requested
0.8.1-docs-rebuild
2023-07-14 18:28:57 +00:00
rvztz
ce20c3f2bc
feat: add OneDrive connector (#834)
2023-07-13 20:57:54 +00:00
fran-unstructured
26da51c765
docs: Add source code links to bricks' docs (#923)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-07-13 17:27:47 +00:00
Matt Robinson
9b830693bd
fix: adds to list of extensions to check if a file has a plain text MIME type (#916)
* added .txt, .text, and .tab to text file list

* changelog and version
2023-07-12 20:07:43 +00:00
fran-unstructured
f7b3c0f741
docs: adds connectors' documentation (#917)
* Add connectors documentation

* Add connectors documentation with corrections and index.rst update

* Add connectors documentation - add API information

---------

Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-12 14:56:09 -04:00
dependabot[bot]
f490b82d5b
build(deps): bump praw from 7.7.0 to 7.7.1 in /requirements (#922)
Bumps [praw](https://github.com/praw-dev/praw) from 7.7.0 to 7.7.1.
- [Release notes](https://github.com/praw-dev/praw/releases)
- [Changelog](https://github.com/praw-dev/praw/blob/master/CHANGES.rst)
- [Commits](https://github.com/praw-dev/praw/compare/v7.7.0...v7.7.1)

---
updated-dependencies:
- dependency-name: praw
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:19 -04:00
dependabot[bot]
a7d6edc528
build(deps): bump google-api-python-client in /requirements (#921)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.92.0 to 2.93.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.92.0...v2.93.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:03 -04:00
Matt Robinson
a583d47b84
docs: update table and API documentation (#919)
* more detailed api docs

* add table docs

* remove rtf/epubs comment

* remove confusing request_kwargs verbiage

* add missing a
2023-07-12 12:59:59 -04:00
dependabot[bot]
1fa944ec87
build(deps): bump black from 23.3.0 to 23.7.0 in /requirements (#920)
Bumps [black](https://github.com/psf/black) from 23.3.0 to 23.7.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/23.3.0...23.7.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 15:30:57 +00:00
dependabot[bot]
80bdd60b32
build(deps): bump protobuf from 3.20.3 to 4.23.4 in /requirements (#910)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.3 to 4.23.4.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v3.20.3...v4.23.4)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-12 10:41:02 -04:00
Roman Isecke
8b233b4f62
Set the version to 0.8.1 (#914)
2023-07-11 10:27:54 -04:00
Emily Chen
2635b0be07
Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913)
2023-07-10 21:47:41 -07:00
Matt Robinson
b3936893b8
build: add python 3.11 to CI (#908)
* remove argilla; bump reqs

* enable py 3.11

* add 3.11 to setup.py

* make pip-compile

* ignore cli mypy errors

* install argilla

* fix constraints

* install argilla

* changelog and version

* skip argilla in docker

* don't import argilla in docker

* skip all of argilla if in container

* only import argilla if outside docker

* more docker skips

* remove weird pypi settings
2023-07-10 18:52:25 +00:00
Trevor Bossert
66f2d4b280
Add both arm and amd builds to manifests (#899)
2023-07-10 10:15:15 -07:00
John
6173362620
fix: detect list items in MS Word documents (#909)
* fix merge conflict

* update changelog and version
2023-07-10 15:29:08 +00:00
qued
79f734d3f9
fix: better extractable check (#900)
The auto strategy was choosing the fast strategy in cases where the PDF contents were just a flat image, resulting in no output. This PR changes auto so that it first extracts whatever fast can extract, then makes a cursory check for elements that contain text. If any are present, those elements are used as the output; otherwise fallback strategies come into play.
2023-07-07 23:41:37 -05:00
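The fallback behavior described in #900 can be sketched as below. This is a minimal illustration, not the code from the PR: `partition_auto_sketch` is a hypothetical name, elements are modeled as plain strings, and the two strategies are passed in as callables so the example stays self-contained.

```python
from typing import Callable, List


def partition_auto_sketch(
    fast: Callable[[], List[str]],
    fallback: Callable[[], List[str]],
) -> List[str]:
    """Hypothetical helper: keep the fast result only if it contains text."""
    elements = fast()
    # Cursory check: does any extracted element hold non-whitespace text?
    if any(text.strip() for text in elements):
        return elements
    # Nothing usable was extracted (for example, a flat scanned image),
    # so defer to a fallback strategy.
    return fallback()


scanned = partition_auto_sketch(lambda: ["", "  "], lambda: ["text from fallback"])
digital = partition_auto_sketch(lambda: ["Hello"], lambda: ["unused"])
print(scanned, digital)
```

The fallback callable is only invoked when the fast result is empty or whitespace-only, so the cheaper path is still taken whenever it produces text.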
Matt Robinson
f51ae45050
fix: grab all metadata fields in convert_to_dataframe (#893)
* add all fieldnames to dataframe

* drop empty columns in convert_to_dataframe

* test for maintaining metadata

* version and changelog
2023-07-07 20:04:35 +00:00
dependabot[bot]
c8e6f0e141
build(deps): bump elasticsearch from 8.8.0 to 8.8.2 in /requirements (#898)
Bumps [elasticsearch](https://github.com/elastic/elasticsearch-py) from 8.8.0 to 8.8.2.
- [Release notes](https://github.com/elastic/elasticsearch-py/releases)
- [Commits](https://github.com/elastic/elasticsearch-py/compare/v8.8.0...v8.8.2)

---
updated-dependencies:
- dependency-name: elasticsearch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 19:19:45 +00:00
dependabot[bot]
05d51cfb4f
build(deps): bump ruff from 0.0.275 to 0.0.277 in /requirements (#897)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.0.275 to 0.0.277.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/astral-sh/ruff/compare/v0.0.275...v0.0.277)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:52 -04:00
dependabot[bot]
7f9532f8b3
build(deps): bump lxml from 4.9.2 to 4.9.3 in /requirements (#896)
Bumps [lxml](https://github.com/lxml/lxml) from 4.9.2 to 4.9.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.2...lxml-4.9.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:05 -04:00
dependabot[bot]
1619a0b6a2
build(deps): bump google-api-python-client in /requirements (#895)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.91.0 to 2.92.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.91.0...v2.92.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 13:50:25 -04:00
Roman Isecke
5e1150184c
Add optional param for model name when partitioning pdfs (#890)
* Add optional param for model name when partitioning pdfs

* Pull in latest inference changes

* Fix linting
0.8.0
2023-07-07 11:16:55 -04:00