1393 Commits

Author SHA1 Message Date
Amanda Cameron
35e529f2d4
updating api key link (#960) 2023-07-21 13:05:40 -07:00
Jack Retterer
708714dab5
docs: fixed typo in Installation guide (#945) 2023-07-21 13:33:44 +00:00
Yuming Long
208148abe7
Chore: update require api key in readme (#952) 2023-07-20 16:10:03 +00:00
Ronny H
31511793cb
Update README and API doc for Chipper announcement (#940)
Update README and API doc for Chipper model beta version announcement
2023-07-19 13:00:37 -07:00
Emily Chen
4b1e5a8057
Publicly document OneDrive connector (#949) 2023-07-18 16:37:44 -07:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
Chris Watts
bf47dc10ae
feat: add slide notes to pptx (#942)
* Add slide notes to pptx

* Add include_slide_notes flag for pptx

* Update CHANGELOG.md 0.8.2-dev2 - Add slide notes to pptx

* Fix lint error

* Fix pptx.py lint
2023-07-17 17:52:34 -04:00
ryannikolaidis
3b33331082
docs: fix readme word docs typo (#946) 2023-07-17 20:04:50 +00:00
Matt Robinson
0d332743eb
fix: enable passing filters to partition_doc for libreoffice conversion (#934)
* add optional filter to docx conversion

* add filters to tests

* changelog and version

* update filter for power point
2023-07-17 13:54:44 -04:00
Yuming Long
067eb5701f
Fix: docker build with missing dependency (#931)
* pip-compile

* test trigger

* Revert "test trigger"

This reverts commit 69d4c8cd9f285f6ef4bf445f5fb27b5c62e1391c.

* version conflict and pip compile
2023-07-14 22:20:11 +00:00
Matt Robinson
685e33f890
build: remove docs-build branch (#933) 2023-07-14 16:23:47 -04:00
Christine Straub
5b7ae29876
fix: 521 pdf2image memory error (#924)
Closes issue #521. Implements the same logic as unstructured-inference/PR #136 for the ocr_only strategy.

* Add functionality to convert a PDF in small chunks of pages at a time
* Add functionality to write images to computer storage temporarily instead of keeping them in memory
* Set the file's current position to the beginning after reading the file in convert_to_bytes
2023-07-14 15:08:33 -05:00
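A minimal sketch of the chunked conversion described in #924, not the code from that PR. The helper names `page_chunks` and `iter_page_image_paths` and the default chunk size are assumptions made for illustration. The `pdf2image` calls (`pdfinfo_from_path`, and `convert_from_path` with `first_page`, `last_page`, `output_folder`, and `paths_only`) are imported lazily so the chunking helper can be used on its own.

```python
import tempfile
from typing import Iterator, Tuple


def page_chunks(num_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield 1-indexed, inclusive (first_page, last_page) ranges.

    Hypothetical helper: the name and the default chunk size are
    illustrative and are not taken from the PR.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    for first in range(1, num_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, num_pages)


def iter_page_image_paths(pdf_path: str, chunk_size: int = 10) -> Iterator[str]:
    """Convert a PDF a few pages at a time, writing images to temporary storage.

    Each yielded path is only valid while the generator is still being
    consumed, because the temporary directory is removed when it finishes.
    """
    # Imported lazily so page_chunks works without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    num_pages = pdfinfo_from_path(pdf_path)["Pages"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        for first, last in page_chunks(num_pages, chunk_size):
            # paths_only=True returns file paths instead of in-memory images.
            paths = convert_from_path(
                pdf_path,
                first_page=first,
                last_page=last,
                output_folder=tmp_dir,
                paths_only=True,
            )
            yield from paths
```

Only one chunk of pages is rendered per call, and the images live on disk rather than in a Python list, so memory use is bounded by the chunk size rather than by the length of the document.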
fran-unstructured
dd4bb752e2
docs: Add Unstructured API documentation (#928)
* fran-unstructured/Add unstructured API documentation

* fran-unstructured/add api docs to index.rst

* fran-unstructured/add api docs with changes requested
0.8.1-docs-rebuild
2023-07-14 18:28:57 +00:00
rvztz
ce20c3f2bc
feat: add OneDrive connector (#834) 2023-07-13 20:57:54 +00:00
fran-unstructured
26da51c765
docs: Add source code links to bricks' docs (#923)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
2023-07-13 17:27:47 +00:00
Matt Robinson
9b830693bd
fix: adds to list of extensions to check if a file has a plain text MIME type (#916)
* added .txt, .text, and .tab to text file list

* changelog and version
2023-07-12 20:07:43 +00:00
fran-unstructured
f7b3c0f741
docs: adds connectors' documentation (#917)
* Add connectors documentation

* Add connectors documentation with corrections and index.rst update

* Add connectors documentation - add API information

---------

Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-12 14:56:09 -04:00
dependabot[bot]
f490b82d5b
build(deps): bump praw from 7.7.0 to 7.7.1 in /requirements (#922)
Bumps [praw](https://github.com/praw-dev/praw) from 7.7.0 to 7.7.1.
- [Release notes](https://github.com/praw-dev/praw/releases)
- [Changelog](https://github.com/praw-dev/praw/blob/master/CHANGES.rst)
- [Commits](https://github.com/praw-dev/praw/compare/v7.7.0...v7.7.1)

---
updated-dependencies:
- dependency-name: praw
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:19 -04:00
dependabot[bot]
a7d6edc528
build(deps): bump google-api-python-client in /requirements (#921)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.92.0 to 2.93.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.92.0...v2.93.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 14:55:03 -04:00
Matt Robinson
a583d47b84
docs: update table and API documentation (#919)
* more detailed api docs

* add table docs

* remove rtf/epubs comment

* remove confusing request_kwargs verbiage

* add missing a
2023-07-12 12:59:59 -04:00
dependabot[bot]
1fa944ec87
build(deps): bump black from 23.3.0 to 23.7.0 in /requirements (#920)
Bumps [black](https://github.com/psf/black) from 23.3.0 to 23.7.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/23.3.0...23.7.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-12 15:30:57 +00:00
dependabot[bot]
80bdd60b32
build(deps): bump protobuf from 3.20.3 to 4.23.4 in /requirements (#910)
Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 3.20.3 to 4.23.4.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v3.20.3...v4.23.4)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-12 10:41:02 -04:00
Roman Isecke
8b233b4f62
Set the version to 0.8.1 (#914) 2023-07-11 10:27:54 -04:00
Emily Chen
2635b0be07
Don't instantiate an element with a coordinate system when there isn't a way to get its location (#913) 2023-07-10 21:47:41 -07:00
Matt Robinson
b3936893b8
build: add python 3.11 to CI (#908)
* remove argilla; bump reqs

* enable py 3.11

* add 3.11 to setup.py

* make pip-compile

* ignore cli mypy errors

* install argilla

* fix constraints

* install argilla

* changelog and version

* skip argilla in docker

* don't import argilla in docker

* skip all of argilla if in container

* only import argilla if outside docker

* more docker skips

* remove weird pypi settings
2023-07-10 18:52:25 +00:00
Trevor Bossert
66f2d4b280
Add both arm and amd builds to manifests (#899) 2023-07-10 10:15:15 -07:00
John
6173362620
fix: detect list items in MS Word documents (#909)
* fix merge conflict

* update changelog and version
2023-07-10 15:29:08 +00:00
qued
79f734d3f9
fix: better extractable check (#900)
The auto strategy was choosing the fast strategy when the PDF contents were just a flat image, which resulted in no output. This PR changes auto so that it first extracts whatever elements fast can extract, then makes a cursory check for elements that contain text. If any are found, those elements are used as the output. Otherwise fallback strategies come into play.
2023-07-07 23:41:37 -05:00
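The decision flow described in #900 can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: elements are modeled as plain strings, `has_text` and `partition_auto` are hypothetical names, and trying the fallbacks in order until one yields text is an assumption about what "fallback strategies come into play" means.

```python
from typing import Callable, List, Sequence


def has_text(elements: Sequence[str]) -> bool:
    """Cursory check: is there at least one element with non-blank text?"""
    return any(element.strip() for element in elements)


def partition_auto(
    extract_fast: Callable[[], List[str]],
    fallbacks: Sequence[Callable[[], List[str]]],
) -> List[str]:
    """Use the fast extraction if it produced text, otherwise try fallbacks.

    Hypothetical sketch: the callables stand in for real partitioning
    strategies such as OCR-based ones.
    """
    elements = extract_fast()
    if has_text(elements):
        return elements
    # Assumption: fallbacks are tried in order and the first one that
    # produces text wins.
    for fallback in fallbacks:
        elements = fallback()
        if has_text(elements):
            return elements
    return []
```

For a flat-image PDF, `extract_fast` would return nothing useful, so the result comes from the first fallback that finds text, for example `partition_auto(lambda: [], [lambda: ["text found by OCR"]])`.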
Matt Robinson
f51ae45050
fix: grab all metadata fields in convert_to_dataframe (#893)
* add all fieldnames to dataframe

* drop empty columns in convert_to_dataframe

* test for maintaining metadata

* version and changelog
2023-07-07 20:04:35 +00:00
dependabot[bot]
c8e6f0e141
build(deps): bump elasticsearch from 8.8.0 to 8.8.2 in /requirements (#898)
Bumps [elasticsearch](https://github.com/elastic/elasticsearch-py) from 8.8.0 to 8.8.2.
- [Release notes](https://github.com/elastic/elasticsearch-py/releases)
- [Commits](https://github.com/elastic/elasticsearch-py/compare/v8.8.0...v8.8.2)

---
updated-dependencies:
- dependency-name: elasticsearch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 19:19:45 +00:00
dependabot[bot]
05d51cfb4f
build(deps): bump ruff from 0.0.275 to 0.0.277 in /requirements (#897)
Bumps [ruff](https://github.com/astral-sh/ruff) from 0.0.275 to 0.0.277.
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/BREAKING_CHANGES.md)
- [Commits](https://github.com/astral-sh/ruff/compare/v0.0.275...v0.0.277)

---
updated-dependencies:
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:52 -04:00
dependabot[bot]
7f9532f8b3
build(deps): bump lxml from 4.9.2 to 4.9.3 in /requirements (#896)
Bumps [lxml](https://github.com/lxml/lxml) from 4.9.2 to 4.9.3.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.2...lxml-4.9.3)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-07 13:51:05 -04:00
dependabot[bot]
1619a0b6a2
build(deps): bump google-api-python-client in /requirements (#895)
Bumps [google-api-python-client](https://github.com/googleapis/google-api-python-client) from 2.91.0 to 2.92.0.
- [Release notes](https://github.com/googleapis/google-api-python-client/releases)
- [Changelog](https://github.com/googleapis/google-api-python-client/blob/main/CHANGELOG.md)
- [Commits](https://github.com/googleapis/google-api-python-client/compare/v2.91.0...v2.92.0)

---
updated-dependencies:
- dependency-name: google-api-python-client
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-07 13:50:25 -04:00
Roman Isecke
5e1150184c
Add optional param for model name when partitioning pdfs (#890)
* Add optional param for model name when partitioning pdfs

* Pull in latest inference changes

* Fix linting
0.8.0
2023-07-07 11:16:55 -04:00
Christine Straub
47bc4009a8
fix: adjust threshold for encoding detection (#894)
* chore: add example doc

* fix: adjust encoding recognition threshold value in `detect_file_encoding`

* test: add test cases for German characters

* chore: update changelog & version
2023-07-07 09:25:03 -04:00
Matt Robinson
52aced8677
fix: validate encodings from email headers (#881)
* add validate encoding function

* remove extraneous file

* added test case for malformed encoding

* version and changelog
2023-07-06 13:49:27 +00:00
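One way to validate an encoding name taken from an email header, as in #881, is to ask Python's codec registry whether it recognizes the name. The sketch below uses the standard-library `codecs.lookup`. The function name `validate_encoding` and its return convention are assumptions for illustration and may differ from the function added in the PR.

```python
import codecs
from typing import Optional


def validate_encoding(encoding: Optional[str]) -> bool:
    """Return True if Python has a codec registered under this name.

    Hypothetical helper: the name and signature are illustrative.
    """
    if not encoding:
        return False
    try:
        codecs.lookup(encoding)
    except (LookupError, ValueError):
        # LookupError: unknown codec name. ValueError: malformed name,
        # for example one containing an embedded null character.
        return False
    return True
```

A caller could then fall back to a default such as UTF-8, or to encoding detection, whenever the header value is malformed.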
cragwolfe
209054f0db
build(image): revert docker build tweak for arm64 (#887)
arm64 images (and amd64 ones) are now building again in CI 😐.
2023-07-06 06:46:40 +00:00
Ahmet Melek
4b827f0793
fix: local connector output filename when a single file is being processed (#879)
* fix string processing error for _output_filename

* Add docstring and type hint, update CHANGELOG, update version

* update test fixture

* simple code change commit to retrigger ci checks

* update test fixture - after brew install tesseract-lang

* Update ingest test fixtures (#882)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* correct CHANGELOG

* correct CHANGELOG

---------

Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-05 14:37:40 -07:00
Nathan Chappell
24dad24f87
chore: changed type IO to IO[bytes] (#878)
Co-authored-by: Nathan Chappell <nchappell@mono.software>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-07-05 16:37:31 -04:00
John
dc6d7d7268
feat: add metadata_filename parameter across all partition functions (#811)
* fix conflicts

* add tests and clean metadata_filename in partitions

* fix test_email and remove comments

* make tidy/check

* update changelog and version

* fix tests

* make tidy again
2023-07-05 16:02:22 -04:00
Austin Walker
8d2e7c0746
fix: Fix KeyError in isd_to_elements (#876) 2023-07-05 19:09:18 +00:00
Emily Chen
24ebd0fa4e
chore: Move coordinate details from Element model to a metadata model (#827) 2023-07-05 11:25:11 -07:00
Johnny Lim
6ec177e7c6
Add a missing space in a warning message in filetype.py (#873)
Adds a missing space in a warning message in the filetype.py file.
2023-07-01 20:54:39 +00:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00
cragwolfe
cb2866b159
build(image): docker build tweak for arm64 (#871)
Fixes issue where arm64 docker builds were failing and preventing images from being published.
2023-06-30 20:49:31 -07:00
Trevor Bossert
6249e1553e
New base image with security patches (#869)
* New base image with security patches

* Bump version

* remove line from changelog

not code related
0.7.12
2023-06-30 19:14:06 -07:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844) 2023-06-30 17:08:27 -07:00
John
e9fdbb0943
feat: add include_metadata across all partition functions (#853)
* add include_metadata kwarg and tests to parsers

add exclude_metadata to docx

add test for doc to exclude metadata

add include_metadata kwarg to email

add include_metadata kwarg to epub

add include_metadata kwarg to json

add exclude_metadata tests to md

add include_metadata kwarg and tests for msg parse

add include_metadata kwarg and tests for odt parse

add include_metadata kwarg and tests for org parse

add include_metadata kwarg and tests for ppt and pptx parse

add include_metadata kwarg and tests for rst parse

add include_metadata kwarg and tests for rtf parse

add include_metadata tests for text parse

add include_metadata tests for tsv parse

add include_metadata tests for xlsx parse

add include_metadata tests for xml parse

* WIP add include_metadata to partition_pdf

* add include_metadata tests to partition_pdf

* make tidy/check

* update changelog and version

* change test asserts and move docstring logic to process_metadata

* make tidy

* fix tests asserts

* linting, linting, linting

* sync versions

* skip api call test not on main

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-06-30 10:44:46 -04:00
qued
350bb1dad5
enhancement: clean pdf elements (bump unstructured-inference) (#790)
More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)

---------

Co-authored-by: Roman Isecke <roman@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructured.io>
0.7.11
2023-06-29 18:35:06 -07:00
ryannikolaidis
642562beb5
fix: skip test with api call when run outside CI (#862) 2023-06-30 00:47:51 +00:00