459 Commits

Author SHA1 Message Date
sparkbrains
2b88890210
docs: customize sphinx doc theme (#192)
* feature: adding a feature for customizing color theme of sphinx docs

* fix: adding changelog and comments

* Adding css for changing colors of sidebar

* fix: removing changelog description
2023-02-06 17:30:55 +00:00
Matt Robinson
782b4352ec
build(deps): weekly dependency update; reduce dependabot frequency (#194)
* deps: pip-compile to update dependencies

* bump version

* linting, linting, linting

* typo
2023-02-06 16:39:29 +00:00
Matt Robinson
014585e872
fix: preserve the order of shapes in partition_pptx output (#193)
* order the shapes top to bottom and left to right

* added tests for ordering

* update change log and bump version

* more tests

* don't need enumerate

* n -> on
0.4.6
2023-02-03 22:12:33 +00:00
Matt Robinson
a7ca58e0bc
fix: more english words; split on punctuation (#191)
* add a bigger list of english words

* update thresholds and add tests

* update docs; bump version

* fix version

* add additional english words back in

* linting, linting, linting

* add slashes

* work -> word
2023-02-02 17:25:47 +00:00
Matt Robinson
0589344ff7
fix: require a minimum prop of alpha characters for titles and narrative text (#190)
* added alpha ratio check

* added tests for alpha ratio

* bump changelog and update docs

* update changelog/version; update docs

* ofr -> or
2023-02-02 14:59:04 +00:00
Matt Robinson
1230a163fd
feat: set a user controlled max word length for titles (#189)
* update the docs

* add option for title max word length

* bump version; update changelog

* change max length to 12

* docs updates

* to -> too
2023-02-01 19:32:16 +00:00
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word (#188)
* added check for english words

* update docs

* at least one word needs to have multiple characters

* bump change log
2023-02-01 09:10:48 -05:00
Matt Robinson
d0bf8904fa
docs: example notebooks from community repo (#187) 2023-01-31 10:37:32 -05:00
sparkbrains
243bf7ed5e
test: Increase coverage (#181) 2023-01-30 22:47:09 -08:00
Matt Robinson
f36e514c6d
build(deps): weekly dependency bump (#183) 2023-01-30 11:05:48 -05:00
Matt Robinson
e6cfde5c4a
fix: no UserWarning when partition_pdf is called (#179) 2023-01-27 12:08:18 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests (#177)
* add env var for cap threshold; raise default threshold

* update docs and tests

* added check for ending in a comma

* update docs

* no caps check for all upper text

* capture Text in html and text

* check category in Text equality check

* lower case all caps before checking for verbs

* added check for us city/state/zip

* added address type

* add address to html

* add address to text

* fix for text tests; escape for large text segments

* refactor regex for readability

* update comment

* additional test for text with linebreaks

* update docs

* update changelog

* update elements docs

* remove old comment

* case -> cast

* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
1ce8447ba7
build(deps): bump unstructured inference; compile from setup.py (#176)
* bump unstructured inference; compile from setup.py

* bump version

* compile the local-inference extra

* linting, linting, linting
0.4.4
2023-01-25 16:32:57 +00:00
Matt Robinson
26a5546152
fix: handle xml filetype detection on amazon linux (#173)
* fix: handle xml filetype detection on amazon linux

* option for html or xml

* fix typo

* back to dev tag
2023-01-25 11:20:01 -05:00
Matt Robinson
3b6546515d
docs: add links to linkedin and slack (#175) 2023-01-24 13:51:10 -08:00
qued
d2909ac688
chore: update all deps (#172) 2023-01-23 13:03:02 -06:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx (#166)
* parition pptx and tests

* add parition_pptx to auto

* update doc types in readme

* add pptx docs

* bump version

* remove extra whitespace

* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
8d3e616846
feat: add ability to parse LayoutElement lists (#165)
* added ability to split list items

* changelog and version bump

* retrigger ci
2023-01-20 08:55:11 -05:00
Matt Robinson
c1822911a5
chore: return Element objects in partition_pdf and partition_image (#164)
* helper function to convert to element

* test for element types

* fix for healthcheck url

* version bump

* note on coordinates

* mention FigureCaption

* test_shared -> test_common

* add check boxes for checkbox template

* update changelog
2023-01-19 14:29:28 +00:00
Matt Robinson
59f972d739
build(deps): add requests as a base dependency (#162)
* build(deps): add `requests` as a base dependency

* linting, linting, linting

* changelog typo
0.4.3
2023-01-18 16:36:23 +00:00
Matt Robinson
74ce2ae6e5
fix: update detect_filetype to properly handle older office files (#161) 2023-01-18 11:18:20 -05:00
Mallori Harrell
08ccee0acb
chore: Fix parse received data (#143)
* fix parse_received data
2023-01-17 16:36:44 -06:00
Matt Robinson
749f9c6be8
fix: avoid divide by zero in exceeds_cap_ratio (#160) 2023-01-17 15:22:12 -05:00
gokullan
5d9183dc99
chore: graceful exit if sed is an old version (#157) 2023-01-17 18:11:14 +00:00
Matt Robinson
9c3c14e94d
fix: resolves UnicodeDecodeError in partition_email for emails with attachments (#158)
* split emails by \n=

* added test for equivalence betweent html and plain text

* changelog and bump version

* add check for content disposition
0.4.2
2023-01-17 11:33:45 -05:00
dependabot[bot]
7ed5f71e30
build(deps): Bump packaging from 22.0 to 23.0 in /requirements (#156)
Bumps [packaging](https://github.com/pypa/packaging) from 22.0 to 23.0.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/packaging/compare/22.0...23.0)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 11:03:03 -05:00
dependabot[bot]
04c1813c7f
build(deps): Bump filelock from 3.8.2 to 3.9.0 in /requirements (#152)
Bumps [filelock](https://github.com/tox-dev/py-filelock) from 3.8.2 to 3.9.0.
- [Release notes](https://github.com/tox-dev/py-filelock/releases)
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.2...3.9.0)

---
updated-dependencies:
- dependency-name: filelock
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-17 15:40:26 +00:00
dependabot[bot]
49392a2955
build(deps): Bump requests from 2.28.1 to 2.28.2 in /requirements (#154)
Bumps [requests](https://github.com/psf/requests) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.28.1...v2.28.2)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 10:28:23 -05:00
qued
8abf1f119d
feat: partition image (#144)
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
419c0867d3
build(deps): bump unstructured_inference version range (#151)
* bump unstructured-inference to 0.2.3

* bump version
0.4.1
2023-01-13 22:21:36 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files (#142)
* added python-pptx to requirements

* added filetype detection for powerpoint

* add more filetypes to detect

* more tests

* added tests for filetype

* reorder document types

* tests for get_directory_file_info

* added docs for get_directory_file_info

* bump version

* Word -> Office

* added test for filetype

* add group by filetype example
0.4.0
2023-01-11 12:40:50 -05:00
qued
7e3af6c609
chore: remove extra requirements.txt (#140) 2023-01-10 22:12:10 -06:00
Mallori Harrell
e0feba83f6
feat: Add Image element and find_embedded_image function (#130)
* add find_embedded_image
2023-01-09 19:49:19 -06:00
Matt Robinson
7b3b594ee5
fix: correct make install-ci target (#138)
* fix install-ci make target

* add note to readme about libmagic

* remove mydoc.docx

* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for emal

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00
dependabot[bot]
7fb8713527
build(deps): Bump black from 22.10.0 to 22.12.0 in /requirements (#137)
Bumps [black](https://github.com/psf/black) from 22.10.0 to 22.12.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/22.10.0...22.12.0)

---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:54:26 +00:00
dependabot[bot]
5129809f00
build(deps): Bump numpy from 1.23.5 to 1.24.1 in /requirements (#136)
Bumps [numpy](https://github.com/numpy/numpy) from 1.23.5 to 1.24.1.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst)
- [Commits](https://github.com/numpy/numpy/compare/v1.23.5...v1.24.1)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:44:09 +00:00
dependabot[bot]
f06189b292
build(deps-dev): Bump jupyter-core from 5.1.0 to 5.1.3 in /requirements (#134)
Bumps [jupyter-core](https://github.com/jupyter/jupyter_core) from 5.1.0 to 5.1.3.
- [Release notes](https://github.com/jupyter/jupyter_core/releases)
- [Changelog](https://github.com/jupyter/jupyter_core/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jupyter/jupyter_core/compare/v5.1.0...v5.1.3)

---
updated-dependencies:
- dependency-name: jupyter-core
  dependency-type: direct:development
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-09 16:32:31 +00:00
dependabot[bot]
ca8e1ee9f3
build(deps): Bump pydantic from 1.10.2 to 1.10.4 in /requirements (#133)
Bumps [pydantic](https://github.com/pydantic/pydantic) from 1.10.2 to 1.10.4.
- [Release notes](https://github.com/pydantic/pydantic/releases)
- [Changelog](https://github.com/pydantic/pydantic/blob/v1.10.4/HISTORY.md)
- [Commits](https://github.com/pydantic/pydantic/compare/v1.10.2...v1.10.4)

---
updated-dependencies:
- dependency-name: pydantic
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 11:22:14 -05:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents (#131)
* first pass on docx parsing

* linting, linting, linting

* test docx with filename

* added documentation

* more tests; version bump

* typo

* another typo

* another typo!

* it -> its

* save -> saved

* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda (#129)
* add environment.yml

* instructions on how to install base package and detectron2

* added instructions on paddleocr

* remove covers

* install -> to install

* specified the shell

* updated example snippets

* update environment.yml

* updated the repo reference

* no more ands!
2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets (#128)
* feat: new cleaning brick for ordered bullets

* test: add test for cleaning ordered bullets

* feat: new brick for extracting ordered bullets

* test: add test for extracting ordered bullets

* docs: update CHANGELOG and bump new dev version

* chore: change extract ordered bullets return type to tuple

* chore: made tidy

* chore: regex to split on pattern instead of built-in

* chore: catch ValueError, made tidy and fix incompatible type

* chore: assertion statements in one line of code

* docs: add documentation for new clean and extract bricks to bricks.rst

* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets

* docs: update CHANGELOG 0.3.6-dev0 changes and bump version

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
qued
a75499d465
feat: local inference (#125)
Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.
0.3.5
2023-01-04 16:19:05 -06:00
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick (#127)
* add pandas to deps; pip-compile

* staging brick to convert elements to dataframe

* bump version

* add convert_to_dataframe docs

* bump wheel version

* typo fix

* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers (#124)
* added pattern for finding phone numbers

* added cleaning brick for extracting phone numbers

* add docs

* changelog and bump version

* switch to us phone numbers

* bump dev version
2023-01-03 13:31:05 -05:00
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info (#112)
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
dependabot[bot]
456735735c
build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements (#120)
Bumps [pillow](https://github.com/python-pillow/Pillow) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases)
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst)
- [Commits](https://github.com/python-pillow/Pillow/compare/9.3.0...9.4.0)

---
updated-dependencies:
- dependency-name: pillow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:52:53 +00:00
dependabot[bot]
f80d05c7b0
build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements (#122)
Bumps [pytz](https://github.com/stub42/pytz) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases)
- [Commits](https://github.com/stub42/pytz/compare/release_2022.6...release_2022.7)

---
updated-dependencies:
- dependency-name: pytz
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:42:49 +00:00