Matt Robinson
e6cfde5c4a
fix: no UserWarning when partition_pdf is called ( #179 )
2023-01-27 12:08:18 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests ( #177 )
...
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
1ce8447ba7
build(deps): bump unstructured inference; compile from setup.py ( #176 )
...
* bump unstructured inference; compile from setup.py
* bump version
* compile the local-inference extra
* linting, linting, linting
0.4.4
2023-01-25 16:32:57 +00:00
Matt Robinson
26a5546152
fix: handle xml filetype detection on amazon linux ( #173 )
...
* fix: handle xml filetype detection on amazon linux
* option for html or xml
* fix typo
* back to dev tag
2023-01-25 11:20:01 -05:00
Matt Robinson
3b6546515d
docs: add links to linkedin and slack ( #175 )
2023-01-24 13:51:10 -08:00
qued
d2909ac688
chore: update all deps ( #172 )
2023-01-23 13:03:02 -06:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx ( #166 )
...
* partition pptx and tests
* add partition_pptx to auto
* update doc types in readme
* add pptx docs
* bump version
* remove extra whitespace
* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
8d3e616846
feat: add ability to parse LayoutElement lists ( #165 )
...
* added ability to split list items
* changelog and version bump
* retrigger ci
2023-01-20 08:55:11 -05:00
Matt Robinson
c1822911a5
chore: return Element objects in partition_pdf and partition_image ( #164 )
...
* helper function to convert to element
* test for element types
* fix for healthcheck url
* version bump
* note on coordinates
* mention FigureCaption
* test_shared -> test_common
* add check boxes for checkbox template
* update changelog
2023-01-19 14:29:28 +00:00
Matt Robinson
59f972d739
build(deps): add requests as a base dependency ( #162 )
...
* build(deps): add `requests` as a base dependency
* linting, linting, linting
* changelog typo
0.4.3
2023-01-18 16:36:23 +00:00
Matt Robinson
74ce2ae6e5
fix: update detect_filetype to properly handle older office files ( #161 )
2023-01-18 11:18:20 -05:00
Mallori Harrell
08ccee0acb
chore: Fix parse received data ( #143 )
...
* fix parse_received data
2023-01-17 16:36:44 -06:00
Matt Robinson
749f9c6be8
fix: avoid divide by zero in exceeds_cap_ratio ( #160 )
2023-01-17 15:22:12 -05:00
gokullan
5d9183dc99
chore: graceful exit if sed is an old version ( #157 )
2023-01-17 18:11:14 +00:00
Matt Robinson
9c3c14e94d
fix: resolves UnicodeDecodeError in partition_email for emails with attachments ( #158 )
...
* split emails by \n=
* added test for equivalence between html and plain text
* changelog and bump version
* add check for content disposition
0.4.2
2023-01-17 11:33:45 -05:00
dependabot[bot]
7ed5f71e30
build(deps): Bump packaging from 22.0 to 23.0 in /requirements ( #156 )
...
Bumps [packaging](https://github.com/pypa/packaging ) from 22.0 to 23.0.
- [Release notes](https://github.com/pypa/packaging/releases )
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pypa/packaging/compare/22.0...23.0 )
---
updated-dependencies:
- dependency-name: packaging
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 11:03:03 -05:00
dependabot[bot]
04c1813c7f
build(deps): Bump filelock from 3.8.2 to 3.9.0 in /requirements ( #152 )
...
Bumps [filelock](https://github.com/tox-dev/py-filelock ) from 3.8.2 to 3.9.0.
- [Release notes](https://github.com/tox-dev/py-filelock/releases )
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst )
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.2...3.9.0 )
---
updated-dependencies:
- dependency-name: filelock
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-17 15:40:26 +00:00
dependabot[bot]
49392a2955
build(deps): Bump requests from 2.28.1 to 2.28.2 in /requirements ( #154 )
...
Bumps [requests](https://github.com/psf/requests ) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/psf/requests/releases )
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md )
- [Commits](https://github.com/psf/requests/compare/v2.28.1...v2.28.2 )
---
updated-dependencies:
- dependency-name: requests
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 10:28:23 -05:00
qued
8abf1f119d
feat: partition image ( #144 )
...
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
419c0867d3
build(deps): bump unstructured_inference version range ( #151 )
...
* bump unstructured-inference to 0.2.3
* bump version
0.4.1
2023-01-13 22:21:36 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition ( #150 )
...
* added partition_text for auto
* rename partition_text tests
* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files ( #142 )
...
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
0.4.0
2023-01-11 12:40:50 -05:00
qued
7e3af6c609
chore: remove extra requirements.txt ( #140 )
2023-01-10 22:12:10 -06:00
Mallori Harrell
e0feba83f6
feat: Add Image element and find_embedded_image function ( #130 )
...
* add find_embedded_image
2023-01-09 19:49:19 -06:00
Matt Robinson
7b3b594ee5
fix: correct make install-ci target ( #138 )
...
* fix install-ci make target
* add note to readme about libmagic
* remove mydoc.docx
* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection ( #132 )
...
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for email
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails ( #111 )
...
* partition_text function
2023-01-09 17:08:08 +00:00
dependabot[bot]
7fb8713527
build(deps): Bump black from 22.10.0 to 22.12.0 in /requirements ( #137 )
...
Bumps [black](https://github.com/psf/black ) from 22.10.0 to 22.12.0.
- [Release notes](https://github.com/psf/black/releases )
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md )
- [Commits](https://github.com/psf/black/compare/22.10.0...22.12.0 )
---
updated-dependencies:
- dependency-name: black
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:54:26 +00:00
dependabot[bot]
5129809f00
build(deps): Bump numpy from 1.23.5 to 1.24.1 in /requirements ( #136 )
...
Bumps [numpy](https://github.com/numpy/numpy ) from 1.23.5 to 1.24.1.
- [Release notes](https://github.com/numpy/numpy/releases )
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst )
- [Commits](https://github.com/numpy/numpy/compare/v1.23.5...v1.24.1 )
---
updated-dependencies:
- dependency-name: numpy
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:44:09 +00:00
dependabot[bot]
f06189b292
build(deps-dev): Bump jupyter-core from 5.1.0 to 5.1.3 in /requirements ( #134 )
...
Bumps [jupyter-core](https://github.com/jupyter/jupyter_core ) from 5.1.0 to 5.1.3.
- [Release notes](https://github.com/jupyter/jupyter_core/releases )
- [Changelog](https://github.com/jupyter/jupyter_core/blob/main/CHANGELOG.md )
- [Commits](https://github.com/jupyter/jupyter_core/compare/v5.1.0...v5.1.3 )
---
updated-dependencies:
- dependency-name: jupyter-core
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-09 16:32:31 +00:00
dependabot[bot]
ca8e1ee9f3
build(deps): Bump pydantic from 1.10.2 to 1.10.4 in /requirements ( #133 )
...
Bumps [pydantic](https://github.com/pydantic/pydantic ) from 1.10.2 to 1.10.4.
- [Release notes](https://github.com/pydantic/pydantic/releases )
- [Changelog](https://github.com/pydantic/pydantic/blob/v1.10.4/HISTORY.md )
- [Commits](https://github.com/pydantic/pydantic/compare/v1.10.2...v1.10.4 )
---
updated-dependencies:
- dependency-name: pydantic
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 11:22:14 -05:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents ( #131 )
...
* first pass on docx parsing
* linting, linting, linting
* test docx with filename
* added documentation
* more tests; version bump
* typo
* another typo
* another typo!
* it -> its
* save -> saved
* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda ( #129 )
...
* add environment.yml
* instructions on how to install base package and detectron2
* added instructions on paddleocr
* remove covers
* install -> to install
* specified the shell
* updated example snippets
* update environment.yml
* updated the repo reference
* no more ands!
2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets ( #128 )
...
* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
qued
a75499d465
feat: local inference ( #125 )
...
Splits partition_pdf into two paths: one for local inference when url is None, and another for inference via the API when url is a string.
0.3.5
2023-01-04 16:19:05 -06:00
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick ( #127 )
...
* add pandas to deps; pip-compile
* staging brick to convert elements to dataframe
* bump version
* add convert_to_dataframe docs
* bump wheel version
* typo fix
* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers ( #124 )
...
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
2023-01-03 13:31:05 -05:00
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info ( #112 )
...
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
dependabot[bot]
456735735c
build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements ( #120 )
...
Bumps [pillow](https://github.com/python-pillow/Pillow ) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases )
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst )
- [Commits](https://github.com/python-pillow/Pillow/compare/9.3.0...9.4.0 )
---
updated-dependencies:
- dependency-name: pillow
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:52:53 +00:00
dependabot[bot]
f80d05c7b0
build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements ( #122 )
...
Bumps [pytz](https://github.com/stub42/pytz ) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases )
- [Commits](https://github.com/stub42/pytz/compare/release_2022.6...release_2022.7 )
---
updated-dependencies:
- dependency-name: pytz
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:42:49 +00:00
dependabot[bot]
401124ccbf
build(deps): Bump packaging from 21.3 to 22.0 in /requirements ( #119 )
...
Bumps [packaging](https://github.com/pypa/packaging ) from 21.3 to 22.0.
- [Release notes](https://github.com/pypa/packaging/releases )
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pypa/packaging/compare/21.3...22.0 )
---
updated-dependencies:
- dependency-name: packaging
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-02 16:33:32 +00:00
dependabot[bot]
4f8519345c
build(deps): Bump mypy from 0.990 to 0.991 in /requirements ( #123 )
...
Bumps [mypy](https://github.com/python/mypy ) from 0.990 to 0.991.
- [Release notes](https://github.com/python/mypy/releases )
- [Commits](https://github.com/python/mypy/compare/v0.990...v0.991 )
---
updated-dependencies:
- dependency-name: mypy
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 11:24:01 -05:00
dependabot[bot]
6291d12f15
build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements ( #118 )
...
Bumps [lxml](https://github.com/lxml/lxml ) from 4.9.1 to 4.9.2.
- [Release notes](https://github.com/lxml/lxml/releases )
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt )
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.1...lxml-4.9.2 )
---
updated-dependencies:
- dependency-name: lxml
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 12:39:53 -05:00
dependabot[bot]
6ef08d22db
build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements ( #117 )
...
Bumps [torch](https://github.com/pytorch/pytorch ) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases )
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md )
- [Commits](https://github.com/pytorch/pytorch/compare/v1.13.0...v1.13.1 )
---
updated-dependencies:
- dependency-name: torch
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:16:13 +00:00
dependabot[bot]
2b8e54f254
build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements ( #115 )
...
Bumps [pip-tools](https://github.com/jazzband/pip-tools ) from 6.10.0 to 6.12.1.
- [Release notes](https://github.com/jazzband/pip-tools/releases )
- [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md )
- [Commits](https://github.com/jazzband/pip-tools/compare/6.10.0...6.12.1 )
---
updated-dependencies:
- dependency-name: pip-tools
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:07:34 +00:00
dependabot[bot]
401af74724
build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements ( #114 )
...
Bumps [transformers](https://github.com/huggingface/transformers ) from 4.23.1 to 4.25.1.
- [Release notes](https://github.com/huggingface/transformers/releases )
- [Commits](https://github.com/huggingface/transformers/compare/v4.23.1...v4.25.1 )
---
updated-dependencies:
- dependency-name: transformers
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 11:58:24 -05:00
Matt Robinson
b14f6ac9bd
feat: extract metadata from .docx, .xlsx, and .jpg ( #113 )
...
* add python-docx dependency
* added function for extracting metadata from word documents
* add openpyxl
* added get_jpg_metadata; fixed typing
* bump changelog
* added pillow to dependencies
2022-12-26 09:34:36 -05:00
Mallori Harrell
e0a76effff
feat: Added EmailElement for email documents ( #103 )
...
* new EmailElement data structure
2022-12-21 16:03:44 -06:00
Matt Robinson
4f6fc29b54
fix: partition_html should process container divs that include text ( #110 )
...
* check for containers with text
* added tests for containers with text
* changelog and version bump
2022-12-21 21:51:04 +00:00
Mallori Harrell
6f4d9ad06c
chore: add new pattern for dash bullet ( #109 )
...
* add new pattern for dash bullet
2022-12-21 10:23:51 -06:00