sparkbrains
2b88890210
docs: customize sphinx doc theme ( #192 )
...
* feature: adding a feature for customizing color theme of sphinx docs
* fix: adding changelog and comments
* Adding css for changing colors of sidebar
* fix: removing changelog description
2023-02-06 17:30:55 +00:00
Matt Robinson
782b4352ec
build(deps): weekly dependency update; reduce dependabot frequency ( #194 )
...
* deps: pip-compile to update dependencies
* bump version
* linting, linting, linting
* typo
2023-02-06 16:39:29 +00:00
Matt Robinson
014585e872
fix: preserve the order of shapes in `partition_pptx` output ( #193 )
...
* order the shapes top to bottom and left to right
* added tests for ordering
* update change log and bump version
* more tests
* don't need enumerate
* n -> on
0.4.6
2023-02-03 22:12:33 +00:00
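
The ordering described in the commit above (top to bottom, then left to right) can be sketched as a sort on each shape's position. This is only an illustration, not the code from #193: the `Shape` dataclass and the `order_shapes` helper are hypothetical stand-ins, and the real python-pptx shape objects are not used here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; not the python-pptx class."""

    text: str
    top: int   # distance from the top edge of the slide
    left: int  # distance from the left edge of the slide


def order_shapes(shapes: List[Shape]) -> List[Shape]:
    """Sort shapes top to bottom, breaking ties left to right."""
    return sorted(shapes, key=lambda shape: (shape.top, shape.left))


shapes = [
    Shape("footer", top=900, left=0),
    Shape("right column", top=100, left=500),
    Shape("left column", top=100, left=50),
    Shape("title", top=10, left=50),
]
ordered = [shape.text for shape in order_shapes(shapes)]
print(ordered)
```

Sorting on a `(top, left)` tuple keeps the comparison in one place; shapes that share a `top` value fall back to their `left` offset.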
Matt Robinson
a7ca58e0bc
fix: more english words; split on punctuation ( #191 )
...
* add a bigger list of english words
* update thresholds and add tests
* update docs; bump version
* fix version
* add additional english words back in
* linting, linting, linting
* add slashes
* work -> word
2023-02-02 17:25:47 +00:00
Matt Robinson
0589344ff7
fix: require a minimum prop of alpha characters for titles and narrative text ( #190 )
...
* added alpha ratio check
* added tests for alpha ratio
* bump changelog and update docs
* update changelog/version; update docs
* ofr -> or
2023-02-02 14:59:04 +00:00
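
A minimal sketch of what an alpha ratio check like the one in the commit above might look like. The function names and the 0.5 default threshold are assumptions made for illustration; they are not taken from #190.

```python
def alpha_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are alphabetic."""
    chars = [char for char in text if not char.isspace()]
    if not chars:
        # No characters to inspect, so treat the ratio as zero.
        return 0.0
    return sum(char.isalpha() for char in chars) / len(chars)


def has_enough_alpha(text: str, threshold: float = 0.5) -> bool:
    """True when the alpha ratio meets the (assumed) threshold."""
    return alpha_ratio(text) >= threshold


print(has_enough_alpha("Quarterly results were strong"))
print(has_enough_alpha("1234 - 5678 - 90"))
```

Text that is mostly digits or punctuation, such as a table row or a separator line, falls below the threshold and would not be treated as a title or narrative text.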
Matt Robinson
1230a163fd
feat: set a user controlled max word length for titles ( #189 )
...
* update the docs
* add option for title max word length
* bump version; update changelog
* change max length to 12
* docs updates
* to -> too
2023-02-01 19:32:16 +00:00
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word ( #188 )
...
* added check for english words
* update docs
* at least one word needs to have multiple characters
* bump change log
2023-02-01 09:10:48 -05:00
Matt Robinson
d0bf8904fa
docs: example notebooks from community repo ( #187 )
2023-01-31 10:37:32 -05:00
sparkbrains
243bf7ed5e
test: Increase coverage ( #181 )
2023-01-30 22:47:09 -08:00
Matt Robinson
f36e514c6d
build(deps): weekly dependency bump ( #183 )
2023-01-30 11:05:48 -05:00
Matt Robinson
e6cfde5c4a
fix: no `UserWarning` when `partition_pdf` is called ( #179 )
2023-01-27 12:08:18 -05:00
Matt Robinson
339c133326
fix: cleanup from live `.docx` tests ( #177 )
...
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
1ce8447ba7
build(deps): bump unstructured inference; compile from setup.py ( #176 )
...
* bump unstructured inference; compile from setup.py
* bump version
* compile the local-inference extra
* linting, linting, linting
0.4.4
2023-01-25 16:32:57 +00:00
Matt Robinson
26a5546152
fix: handle xml filetype detection on amazon linux ( #173 )
...
* fix: handle xml filetype detection on amazon linux
* option for html or xml
* fix typo
* back to dev tag
2023-01-25 11:20:01 -05:00
Matt Robinson
3b6546515d
docs: add links to linkedin and slack ( #175 )
2023-01-24 13:51:10 -08:00
qued
d2909ac688
chore: update all deps ( #172 )
2023-01-23 13:03:02 -06:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in `partition_pptx` ( #166 )
...
* partition pptx and tests
* add partition_pptx to auto
* update doc types in readme
* add pptx docs
* bump version
* remove extra whitespace
* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
8d3e616846
feat: add ability to parse `LayoutElement` lists ( #165 )
...
* added ability to split list items
* changelog and version bump
* retrigger ci
2023-01-20 08:55:11 -05:00
Matt Robinson
c1822911a5
chore: return `Element` objects in `partition_pdf` and `partition_image` ( #164 )
...
* helper function to convert to element
* test for element types
* fix for healthcheck url
* version bump
* note on coordinates
* mention FigureCaption
* test_shared -> test_common
* add check boxes for checkbox template
* update changelog
2023-01-19 14:29:28 +00:00
Matt Robinson
59f972d739
build(deps): add `requests` as a base dependency ( #162 )
...
* build(deps): add `requests` as a base dependency
* linting, linting, linting
* changelog typo
0.4.3
2023-01-18 16:36:23 +00:00
Matt Robinson
74ce2ae6e5
fix: update `detect_filetype` to properly handle older office files ( #161 )
2023-01-18 11:18:20 -05:00
Mallori Harrell
08ccee0acb
chore: Fix parse received data ( #143 )
...
* fix parse_received data
2023-01-17 16:36:44 -06:00
Matt Robinson
749f9c6be8
fix: avoid divide by zero in `exceeds_cap_ratio` ( #160 )
2023-01-17 15:22:12 -05:00
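
The kind of guard this fix implies can be sketched as follows. This is not the library's `exceeds_cap_ratio`: the `_sketch` function, its tokenization, and the 0.5 default threshold are assumptions made for illustration.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Check whether the share of capitalized words exceeds a threshold."""
    # Only purely alphabetic tokens count toward the ratio.
    tokens = [token for token in text.split() if token.isalpha()]
    if not tokens:
        # Guard: with no countable tokens the ratio is undefined,
        # so return early instead of dividing by zero.
        return False
    capitalized = sum(token[0].isupper() for token in tokens)
    return capitalized / len(tokens) > threshold


print(exceeds_cap_ratio_sketch("1234 5678"))
print(exceeds_cap_ratio_sketch("Annual Report Of The Board"))
```

Inputs such as an empty string or a line of digits produce no countable tokens, which is the case where an unguarded division would fail.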
gokullan
5d9183dc99
chore: graceful exit if sed is an old version ( #157 )
2023-01-17 18:11:14 +00:00
Matt Robinson
9c3c14e94d
fix: resolves `UnicodeDecodeError` in `partition_email` for emails with attachments ( #158 )
...
* split emails by \n=
* added test for equivalence between html and plain text
* changelog and bump version
* add check for content disposition
0.4.2
2023-01-17 11:33:45 -05:00
dependabot[bot]
7ed5f71e30
build(deps): Bump packaging from 22.0 to 23.0 in /requirements ( #156 )
...
Bumps [packaging](https://github.com/pypa/packaging ) from 22.0 to 23.0.
- [Release notes](https://github.com/pypa/packaging/releases )
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pypa/packaging/compare/22.0...23.0 )
---
updated-dependencies:
- dependency-name: packaging
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 11:03:03 -05:00
dependabot[bot]
04c1813c7f
build(deps): Bump filelock from 3.8.2 to 3.9.0 in /requirements ( #152 )
...
Bumps [filelock](https://github.com/tox-dev/py-filelock ) from 3.8.2 to 3.9.0.
- [Release notes](https://github.com/tox-dev/py-filelock/releases )
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst )
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.2...3.9.0 )
---
updated-dependencies:
- dependency-name: filelock
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-17 15:40:26 +00:00
dependabot[bot]
49392a2955
build(deps): Bump requests from 2.28.1 to 2.28.2 in /requirements ( #154 )
...
Bumps [requests](https://github.com/psf/requests ) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/psf/requests/releases )
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md )
- [Commits](https://github.com/psf/requests/compare/v2.28.1...v2.28.2 )
---
updated-dependencies:
- dependency-name: requests
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 10:28:23 -05:00
qued
8abf1f119d
feat: partition image ( #144 )
...
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
419c0867d3
build(deps): bump `unstructured_inference` version range ( #151 )
...
* bump unstructured-inference to 0.2.3
* bump version
0.4.1
2023-01-13 22:21:36 +00:00
Matt Robinson
f12240c5e7
feat: add support for `.txt` files in `partition` ( #150 )
...
* added partition_text for auto
* rename partition_text tests
* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: `get_directory_file_info` for exploring a directory of files ( #142 )
...
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
0.4.0
2023-01-11 12:40:50 -05:00
qued
7e3af6c609
chore: remove extra requirements.txt ( #140 )
2023-01-10 22:12:10 -06:00
Mallori Harrell
e0feba83f6
feat: Add Image element and `find_embedded_image` function ( #130 )
...
* add find_embedded_image
2023-01-09 19:49:19 -06:00
Matt Robinson
7b3b594ee5
fix: correct `make install-ci` target ( #138 )
...
* fix install-ci make target
* add note to readme about libmagic
* remove mydoc.docx
* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic `partition` brick with filetype detection ( #132 )
...
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for email
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails ( #111 )
...
* partition_text function
2023-01-09 17:08:08 +00:00
dependabot[bot]
7fb8713527
build(deps): Bump black from 22.10.0 to 22.12.0 in /requirements ( #137 )
...
Bumps [black](https://github.com/psf/black ) from 22.10.0 to 22.12.0.
- [Release notes](https://github.com/psf/black/releases )
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md )
- [Commits](https://github.com/psf/black/compare/22.10.0...22.12.0 )
---
updated-dependencies:
- dependency-name: black
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:54:26 +00:00
dependabot[bot]
5129809f00
build(deps): Bump numpy from 1.23.5 to 1.24.1 in /requirements ( #136 )
...
Bumps [numpy](https://github.com/numpy/numpy ) from 1.23.5 to 1.24.1.
- [Release notes](https://github.com/numpy/numpy/releases )
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst )
- [Commits](https://github.com/numpy/numpy/compare/v1.23.5...v1.24.1 )
---
updated-dependencies:
- dependency-name: numpy
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:44:09 +00:00
dependabot[bot]
f06189b292
build(deps-dev): Bump jupyter-core from 5.1.0 to 5.1.3 in /requirements ( #134 )
...
Bumps [jupyter-core](https://github.com/jupyter/jupyter_core ) from 5.1.0 to 5.1.3.
- [Release notes](https://github.com/jupyter/jupyter_core/releases )
- [Changelog](https://github.com/jupyter/jupyter_core/blob/main/CHANGELOG.md )
- [Commits](https://github.com/jupyter/jupyter_core/compare/v5.1.0...v5.1.3 )
---
updated-dependencies:
- dependency-name: jupyter-core
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-09 16:32:31 +00:00
dependabot[bot]
ca8e1ee9f3
build(deps): Bump pydantic from 1.10.2 to 1.10.4 in /requirements ( #133 )
...
Bumps [pydantic](https://github.com/pydantic/pydantic ) from 1.10.2 to 1.10.4.
- [Release notes](https://github.com/pydantic/pydantic/releases )
- [Changelog](https://github.com/pydantic/pydantic/blob/v1.10.4/HISTORY.md )
- [Commits](https://github.com/pydantic/pydantic/compare/v1.10.2...v1.10.4 )
---
updated-dependencies:
- dependency-name: pydantic
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 11:22:14 -05:00
Matt Robinson
fee95b643c
feat: add `partition_docx` for Word documents ( #131 )
...
* first pass on docx parsing
* linting, linting, linting
* test docx with filename
* added documentation
* more tests; version bump
* typo
* another typo
* another typo!
* it -> its
* save -> saved
* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda ( #129 )
...
* add environment.yml
* instructions on how to install base package and detectron2
* added instructions on paddleocr
* remove covers
* install -> to install
* specified the shell
* updated example snippets
* update environment.yml
* updated the repo reference
* no more ands!
2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets ( #128 )
...
* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
qued
a75499d465
feat: local inference ( #125 )
...
Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.
0.3.5
2023-01-04 16:19:05 -06:00
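
The two paths described in the commit above can be sketched as a dispatch on `url`. This is an illustration under stated assumptions, not the code from #125: `partition_pdf_sketch`, `_partition_locally`, and `_partition_via_api` are hypothetical names, and the helpers return placeholder strings instead of running a model or calling an API.

```python
from typing import List, Optional


def _partition_locally(filename: str) -> List[str]:
    """Hypothetical placeholder for the local inference path."""
    return [f"local:{filename}"]


def _partition_via_api(filename: str, url: str) -> List[str]:
    """Hypothetical placeholder for the API inference path."""
    return [f"api:{url}:{filename}"]


def partition_pdf_sketch(filename: str, url: Optional[str] = None) -> List[str]:
    """Use local inference when url is None, otherwise go through the API."""
    if url is None:
        return _partition_locally(filename)
    return _partition_via_api(filename, url)


print(partition_pdf_sketch("report.pdf"))
print(partition_pdf_sketch("report.pdf", url="https://example.invalid/parse"))
```

Defaulting `url` to `None` makes the local path the default and the API path opt-in, which matches the behavior the commit message describes.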
Matt Robinson
17045aed80
feat: add `convert_to_dataframe` staging brick ( #127 )
...
* add pandas to deps; pip-compile
* staging brick to convert elements to dataframe
* bump version
* add convert_to_dataframe docs
* bump wheel version
* typo fix
* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers ( #124 )
...
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
2023-01-03 13:31:05 -05:00
Mallori Harrell
509ad4951c
feat: Add `extract_attachment_info` ( #112 )
...
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
dependabot[bot]
456735735c
build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements ( #120 )
...
Bumps [pillow](https://github.com/python-pillow/Pillow ) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases )
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst )
- [Commits](https://github.com/python-pillow/Pillow/compare/9.3.0...9.4.0 )
---
updated-dependencies:
- dependency-name: pillow
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:52:53 +00:00
dependabot[bot]
f80d05c7b0
build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements ( #122 )
...
Bumps [pytz](https://github.com/stub42/pytz ) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases )
- [Commits](https://github.com/stub42/pytz/compare/release_2022.6...release_2022.7 )
---
updated-dependencies:
- dependency-name: pytz
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:42:49 +00:00