929 Commits

Author SHA1 Message Date
Francisco Kurucz
d2a41f462d
doc: fix typo on partition_md function in bricks documentation (#1147)
2023-08-17 20:54:11 -07:00
ryannikolaidis
668d0f1b01
feat: per-process ingest connections (#1058)
* adds per process connections for Google Drive connector
2023-08-17 17:34:08 +00:00
cragwolfe
dd0f582585
build(deps): bump unstructured-inference==0.5.13 (#1141)
Bump to unstructured-inference==0.5.13, which includes:

Fix extracted image elements being included in layout merge, addresses the issue
where an entire-page image in a PDF was not passed to the layout model when using hi_res.
0.10.2
2023-08-17 06:25:00 +00:00
John
9f7bd6127b
enhancement: Add include_header kwarg for xlsx, default True (#1125)
Closes GitHub issue #1121

Adds include_header kwarg to partition_xlsx and changes the default behavior to True.
0.10.1
2023-08-17 04:16:23 +00:00
cragwolfe
22c12ef806
bump unstructured-inference (#1140)
Pulls in fix from unstructured-inference==0.5.12:

When a pdf page doesn't have much data, it may get buffered in the write to a tempfile. If this happens, we'll hit an error reading the file back. Open to suggestions for a way to unit test this - I was creating some test files with pypdf but I couldn't trigger the error.
2023-08-16 22:29:37 +00:00
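The buffering hazard described in #1140 can be illustrated in general terms. This is a minimal sketch of the failure mode and the usual remedy (flushing before reading the file back), not the actual fix in unstructured-inference; the function name `roundtrip_small_payload` and the file name `page.pdf` are made up for illustration.

```python
import os
import tempfile


def roundtrip_small_payload(data: bytes) -> bytes:
    """Write a small payload to a temp file and read it back while the writer is still open."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "page.pdf")  # hypothetical file name
        writer = open(path, "wb")
        try:
            writer.write(data)
            # Small writes may sit in the userspace buffer; without this flush
            # a second reader can see an empty or truncated file.
            writer.flush()
            with open(path, "rb") as reader:
                return reader.read()
        finally:
            writer.close()
```

Removing the `flush()` call is the simplest way to see the problem with a payload smaller than the buffer size.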
cragwolfe
6f1b8d5f28
build(deps): bump unstructured-inference to 0.5.11 (#1138)
* Bump unstructured-inference==0.5.11:
  - better defaults for DPI for hi_res and Chipper
2023-08-16 20:52:40 +00:00
Christine Straub
0a23139720
enhancement: implement full-page OCR (#1133)
* implements full-page OCR as supported in unstructured-inference==0.5.11.
2023-08-16 19:16:35 +00:00
Newel H
be093d2e66
chore: Update dead links to correct pages (#1127)
Summary
Closes #1124

Updates dead links in repository README
- Quick Start > Install for local development
- Learn more > Batch Processing

Updates document dependencies to include tesseract-lang for additional language support (requirement for tests to pass)

Testing
All tests pass
2023-08-16 10:43:37 -04:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes GitHub issue #1060.

* update the metadata field links
* update the metadata field emphasized_texts
0.10.0
2023-08-16 04:33:06 +00:00
Sebastian Laverde Alfonso
fe5048a834
feat: chipper local inference notebook (#1116)
Download chipper model for local use and demonstrate how to partition a .pdf document
through the unstructured and unstructured_inference libraries.
2023-08-15 20:43:23 -07:00
cragwolfe
d19183f442
build(lint): don't check version in main against self (#1123)
If on the main branch already, it does not make sense to check if the latest commit is the same non-dev version.

This fixes an annoyance where the CI Lint job would fail on release main commits, but besides that was not causing any other issues.
2023-08-15 17:57:59 +00:00
John
6e5d27c6c3
fix pdf partition of list items being detected as titles in OCR only mode (#1119)
Closes GitHub issue #1010

adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines
2023-08-15 09:35:54 -07:00
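The grouping idea in #1119 (joining bullet items that OCR split across several lines) might look roughly like the sketch below. It is an illustration under assumptions, not the library's `group_bullet_paragraph`: the helper name `group_bullet_lines` and the bullet pattern are hypothetical.

```python
import re

# Assumed bullet markers: a bullet glyph, middle dot, "*", "-", or "1." / "1)".
BULLET_RE = re.compile(r"^\s*(?:[\u2022\u00b7*\-]|\d+[.)])\s+")


def group_bullet_lines(text: str) -> list[str]:
    """Merge continuation lines into the bullet item that precedes them."""
    items: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if BULLET_RE.match(line) or not items:
            items.append(stripped)  # a new bullet (or the first line) starts an item
        else:
            items[-1] = f"{items[-1]} {stripped}"  # continuation of the previous item
    return items
```

Keeping wrapped lines attached to their bullet is what prevents a short trailing fragment from being classified on its own.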
qued
cb923b96a2
build(deps): dependency cleanup (#1102)
Cleans up some pins that were prone to conflicts. All pins belong in constraints.in.
0.9.3
2023-08-15 05:15:44 +00:00
cragwolfe
d835fb1086
chore: bump pip version in published image (#1111)
for consistency with the development environment, i.e. the Makefile.
2023-08-14 21:59:31 +00:00
Mike Lay
79a1eb8683
Handle inline and lacking filename (#1109)
Handle Content-Disposition: inline and attachment without filename

* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
2023-08-14 18:38:53 +00:00
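For comparison, the same two cases from #1109 (`Content-Disposition: inline`, and an attachment with no filename) can be handled with Python's standard `email` package. This is a sketch using the stdlib parser rather than the string splitting the commit describes; `list_attachment_names` and the `unnamed_part_N` fallback are hypothetical.

```python
from email import message_from_string


def list_attachment_names(raw_email: str) -> list[str]:
    """Return a filename for every inline or attachment part, inventing one if missing."""
    msg = message_from_string(raw_email)
    names: list[str] = []
    for index, part in enumerate(msg.walk()):
        # get_content_disposition() returns "inline", "attachment", or None.
        if part.get_content_disposition() not in ("inline", "attachment"):
            continue
        # get_filename() returns None when no filename parameter is present.
        names.append(part.get_filename() or f"unnamed_part_{index}")
    return names
```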
Christine Straub
80266460fd
fix: GH issue 1057 etree parser error (csv) (#1112)
Addresses #1057 for CSV. Related to PR #1077.

* update partition_csv to always use soupparser_fromstring to parse html text
2023-08-14 17:48:57 +00:00
Mark Risher
612f9da6e8
Update news-of-the-day.ipynb - typo (#1113)
Fixed typo
2023-08-14 16:48:49 +00:00
Mike Lay
2e0ab86c6a
Fix attachments with = in filename (#1110)
Fix attachments with = in filename

* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
2023-08-13 20:35:18 -07:00
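Limiting the split to the first `=` is a one-argument change in Python. The sketch below shows the idea on a single header parameter; `parse_disposition_param` is a hypothetical helper, not the function touched by #1110.

```python
def parse_disposition_param(param: str) -> tuple[str, str]:
    """Split 'key=value' on the first '=' only, so '=' inside the value survives."""
    key, value = param.split("=", 1)  # maxsplit=1 guarantees exactly two parts
    return key.strip(), value.strip().strip('"')
```

An unbounded `param.split("=")` on `filename="report=final.pdf"` yields three parts, which breaks two-name unpacking.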
Christine Straub
fc2699ff06
Fix/1057 etree parser error tsv (#1106)
* feat: always use `soupparser_fromstring` to parse `html text` which gracefully handles emoji
* chore: update changelog & version
2023-08-14 01:22:36 +00:00
cragwolfe
b4b8ac4d8a
chore: run make pip-compile on mac (#1107)
so cuda deps removed.
2023-08-13 20:42:12 +00:00
Christine Straub
4a3176885f
Fix/1057 etree parser error xlsx (#1094)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version

* feat: add functionality to switch html text parser based on whether the html text contains emoji

* chore: update changelog & version

* fix lint errors

* test: revert the `EXPECTED_XLS_TEXT_LEN` value back

* feat: always use `soupparser_fromstring` to parse `html text`

* fix lint error
2023-08-13 12:20:33 -07:00
cragwolfe
02af625b93
chore: fix fickle test to not be so time sensitive (#1105)
2023-08-13 10:58:46 -07:00
Noah Greer
fa0a5afb71
docs: correct spelling of partition in docs (#1104)
Fixes a typo in several places where the word `partition` is misspelled
as `partiton`
2023-08-12 14:57:27 -07:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes GitHub issue #459
2023-08-12 21:02:06 +00:00
Ronny H
0d5b5a0e79
Revamp README & Bricks documentation (#1103)
Reorganize README.md
2023-08-12 19:58:51 +00:00
Roman Isecke
9d29f5dc2e
Add init file to make notion module discoverable (#1100)
One of the added modules was missing an __init__.py file which made it undiscoverable in the path when running as a cli command via console script rather than the PYTHONPATH=. python ... approach.
2023-08-12 12:21:07 -07:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
2023-08-11 11:02:37 +00:00
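The trade-off behind a `unique_element_ids` option can be sketched as follows: a content hash gives the same ID for the same text, while a UUID is unique per element even when the text repeats. The function `make_element_id`, the choice of SHA-256, and the 32-character truncation are assumptions for illustration, not the library's implementation.

```python
import hashlib
import uuid


def make_element_id(text: str, unique: bool = False) -> str:
    """Return a deterministic hash ID by default, or a random UUID when unique=True."""
    if unique:
        return str(uuid.uuid4())  # differs on every call, even for identical text
    # Assumed scheme: truncated SHA-256 of the element text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]
```

Deterministic IDs are convenient for deduplication; UUIDs are the safer choice when two elements with identical text must remain distinguishable.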
Christine Straub
d26ab1deac
fix: etree parser error (#1077)
* feat: add functionality to check if a string contains any emoji characters

* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji

* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use

* chore: update changelog & version

* chore: update changelog & version

* chore: update dependencies

* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`

* chore: update changelog & version
2023-08-10 23:28:57 +00:00
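The "does this string contain emoji" check mentioned in #1077 relies on the third-party `emoji` package. A rough stdlib-only approximation is sketched below; the code point ranges are a deliberately incomplete assumption and `contains_emoji` is a hypothetical name.

```python
# Approximate emoji blocks; not exhaustive (assumption for illustration).
EMOJI_RANGES = (
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, supplemental symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0x1F1E6, 0x1F1FF),  # regional indicator symbols (flags)
)


def contains_emoji(text: str) -> bool:
    """Return True if any character falls inside one of the approximate emoji ranges."""
    return any(low <= ord(ch) <= high for ch in text for low, high in EMOJI_RANGES)
```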
Ronny H
b31c62fa84
replace Weaviate nearText with BM25 query algorithm (#1078)
2023-08-10 22:15:27 +00:00
cragwolfe
6779918406
build(release): bump unstructured-inference (#1074)
* build(release): bump unstructured-inference

Related to downstream issue:
Unstructured-IO/unstructured-api#182

And upstream PR:
Unstructured-IO/unstructured-inference#165

---------

Co-authored-by: Shreya Nidadavolu <shreyanid9@gmail.com>
0.9.2
2023-08-10 20:57:46 +00:00
Ahmet Melek
64a1930c46
chore[ingest]: fix confluence ingest diff tests (#1082)
* trigger CI

* trigger CI

* trigger CI

* do not ingest personal spaces in the diff test

* fix argument

* Update ingest test fixtures (#1083)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-10 17:45:17 +00:00
rvztz
dee9b405cd
feat: Sharepoint connector (#918)
2023-08-10 09:37:58 -07:00
Chris Pappalardo
ef5091f276
feat: added UUID option for element_id arg in element constructor (#1076)
* added UUID option for element_id arg in element constructor and updated unit tests

* updated CHANGELOG and bumped to dev2
2023-08-09 18:32:20 -04:00
Yuming Long
112347aa0d
doc: update API doc to sync with new parameter in prod API (#1049)
* doc doc

* changelog and version

* sample docs -> example docs

* nit on compute cost doc

* pass empty dict not none

* note note

* cutting release
2023-08-09 11:09:37 -04:00
Roman Isecke
df15ba2f07
Support extracting Notion content to html (#1063)
* Add notion connector and supporting code

* minor fixes

* Don't ignore types that aren't recognized when mapping json

* Add support for recursively getting docs

* Add recursive search for databases

* fix logging

* fix linting

* Support extracting page content to html

* Support extracting database content to html

* update CHANGELOG

* fix linting

* fix linting
2023-08-09 09:56:59 -04:00
ryannikolaidis
2a9fb057c1
fix: unstructured-ingest entrypoint (#1068)
0.9.1
2023-08-09 05:49:30 +00:00
Roman Isecke
50389f15a8
roman/Add notion connector (#1033)
* Add notion connector and supporting code

* minor fixes

* Add notion deps to extras

* Use the same return type for both helper methods

* Don't ignore types that aren't recognized when mapping json

* Add support for recursively getting docs

* Add recursive search for databases

* fix logging

* fix linting

* remove debugging code
2023-08-08 22:01:25 -04:00
Yuming Long
b4fe40e484
Chore[ingest]: adding parameter --partition-pdf-infer-table-structure (#1056)
* add param

* expected test

* add option (to do doc nit)

* test with api for now

* typo

* test with api key

* use local only

* encoding -> partition-encoding

* changelog and version

* Update ingest test fixtures (#1055)

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>

* ignore coordinates

* no whitespace lol

* Update ingest test fixtures (#1061)

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-08-08 18:11:06 -04:00
Matt Robinson
ac7efa19e7
docs: news of the day (chroma + langchain) (#1054)
* news of the day notebook

* readme and requirements

* change to single mode instead of elements
2023-08-08 11:36:04 -04:00
shreyanid
463c498c78
Update make check-version script to fail if release version is unchanged (#1039)
* TEMP adding git current release check

* working, checks version file against current release

* clean up comments

* shellcheck
2023-08-07 21:21:11 -07:00
Klaijan
ad386af8b5
Klaijan/auto paragraph grouper (#994)
* add auto_paragraph_grouper. add line break pattern.

* combine group_broken_paragraph and blank_line_grouper function

* fix make check errors

* fix make check errors

* fix make check errors

* fix make check errors

* run make tidy to fix errors

* tidy core.py and text.py

* fix blank-line breaker to extend the result and replace new line with space

* fix function name typo

* call group_broken_paragraphs for blank_line_grouper

* edit function name from one_line_grouper to new_line_grouper for consistency

* edit threshold from 0.5 to 0.1

* edit threshold from 0.5 to 0.1

* Revert "call group_broken_paragraphs for blank_line_grouper"

This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.

* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1

* edit test_text assertion. remove all BULLETS_PATTERN.

* Update ingest test fixtures (#1052)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* edit test case in test_xml_partition

* update assertion on test_auto

---------

Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-07 18:37:18 -04:00
ryannikolaidis
fac2da6117
fix: ingest test check num files in smoke test (#1051)
2023-08-06 22:58:50 +00:00
ryannikolaidis
cd1df5e8e6
fix: remove default encoding for ingest (#1036)
2023-08-05 16:57:45 +00:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
Christine Straub
b76d2ee745
feat: track emphasized text msword (#1048)
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph

* chore: add docstring

* chore: fix lint errors

* feat: ignore spaces when extracting emphasized texts from a paragraph

* feat: add functionality to track emphasized text (`bold/italic` formatting) from table

* test: add test case for grabbing emphasized texts from element metadata

* chore: fix lint errors

* chore: update changelog & version

* Update ingest test fixtures (#1047)
2023-08-04 17:04:12 -04:00
kravetsmic
2888c20a46
chore: remove unused _partition_via_api function (#999)
* don't push

* fix: clean up code

* fix: remove unused _partition_via_api

* feat: update changelog

* clean up

* changelog and version

* remove print

* remove print

* revert test file

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-08-04 19:07:15 +00:00
kravetsmic
bef93aef6e
fix: email addresses shouldn't be flagged as titles (#957)
* feat: add func for checking on EmailAddress type

* feat: add EmailAddress type

* feat: add check for email type

* feat: add test for checking EmailAddress type

* feat: update existing example files with email

* feat: add new example files with email in the text

* fix: apply linter

* feat: update changelog file

* feat: add test for is_email_address function

* don't push

* fix: clean up code

* apply linter

* fix: clean up

* fix: remove file changes

* fix: remove unused files for email address test

* fix: remove not necessary tests

* clean up

* fix: apply linter

* fix: update CHANGELOG

* fix: change version

* fix: fix msg test

* fix: apply linter for tests

* fix: remove spaces

* fix: apply linter with longer line

* feat: update documentation

* fix: remove duplicates

* Update getting_started.rst

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 11:28:36 -04:00
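A check of the kind #957 describes (deciding that a line is an email address so it is not classified as a title) can be sketched with a simple pattern. The regex below is an assumed, intentionally loose approximation, and `looks_like_email_address` is a hypothetical stand-in for the `is_email_address` function named in the commit.

```python
import re

# Loose pattern: local part, "@", then a dotted domain. Not RFC 5322 complete.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def looks_like_email_address(text: str) -> bool:
    """Return True when the whole (stripped) string is a single email address."""
    return EMAIL_RE.fullmatch(text.strip()) is not None
```

Running this before the title heuristic lets a short, capitalized-looking line such as an address be routed to its own element type.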
Hynek Kydlíček
47b20119c3
fix: extract emojis with partition_xlsx (#1009)
* 🐛 fixed emoji xlsx bug

* update version and changelog

* check if beautifulsoup exists

* update docs

* fix html parser call

* fix failing attachment test

* added emoji test, added requirement, fixed dependency

* 🐛 dependency

* 🐛 correct dependency

* linting, linting, linting

* check for bs4

* skip auto xls filename test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 10:14:08 -04:00
Matt Robinson
a1ef6248bf
fix: simplify min_partition logic for partition_text (#1032)
* min simplify first pass

* update tests

* better max partition default

* version and changelog
2023-08-04 13:32:42 +00:00