Matt Robinson
f12240c5e7
feat: add support for .txt files in partition ( #150 )
...
* added partition_text for auto
* rename partition_text tests
* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files ( #142 )
...
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
0.4.0
2023-01-11 12:40:50 -05:00
qued
7e3af6c609
chore: remove extra requirements.txt ( #140 )
2023-01-10 22:12:10 -06:00
Mallori Harrell
e0feba83f6
feat: Add Image element and find_embedded_image function ( #130 )
...
* add find_embedded_image
2023-01-09 19:49:19 -06:00
Matt Robinson
7b3b594ee5
fix: correct make install-ci target ( #138 )
...
* fix install-ci make target
* add note to readme about libmagic
* remove mydoc.docx
* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection ( #132 )
...
* add python-magic
* first pass on filetype detection
* tests for filetype detection
* more tests for file detection
* added tests for error conditions
* install libmagic dev in github
* libmagic install instructions
* pattern for checking email files
* support reading .eml in rb mode
* add auto partition function
* auto tests for email
* auto tests for docx
* added tests for html
* add pdf and html tests
* linting, linting, linting
* added docs for auto partitioning
* update readme with generic partition brick
* bumped version
* added test for bad type
* detect .docx files from application/octet-stream
* linting, linting, linting
* identify xlsx from octet stream
* install poppler in ci
* fix mocks; test for unknown type
* install poppler utils
* install in one line
* only poppler-utils
* file extension logic from application/octet-stream
* install local inference for ci
* install detectron2
* removing unused dockerfile
2023-01-09 16:15:14 -05:00
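Note on #132: the octet-stream bullets describe a fallback in which a file whose sniffed MIME type is only application/octet-stream is classified by its extension instead. A minimal sketch of that idea, using only the standard library. The names `detect_filetype_sketch`, `MIME_TO_TYPE`, `EXTENSION_TO_TYPE`, and the `sniffed_mime` parameter are hypothetical, and the sketch does not call libmagic; it assumes the MIME type has already been obtained elsewhere.

```python
import os
from typing import Optional

# Hypothetical lookup tables, for illustration only.
MIME_TO_TYPE = {
    "application/pdf": "pdf",
    "text/html": "html",
    "message/rfc822": "eml",
}
EXTENSION_TO_TYPE = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx"}


def detect_filetype_sketch(filename: str, sniffed_mime: str) -> Optional[str]:
    """Return a short filetype label, or None when the type is unknown."""
    if sniffed_mime in MIME_TO_TYPE:
        return MIME_TO_TYPE[sniffed_mime]
    if sniffed_mime == "application/octet-stream":
        # Content sniffing was inconclusive, so fall back to the extension.
        _, extension = os.path.splitext(filename)
        return EXTENSION_TO_TYPE.get(extension.lower())
    return None
```

Returning None for an unrecognized type leaves the caller free to decide whether to raise, which matches the "test for unknown type" bullet only loosely; the actual error handling is not shown in the log.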
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails ( #111 )
...
* partition_text function
2023-01-09 17:08:08 +00:00
dependabot[bot]
7fb8713527
build(deps): Bump black from 22.10.0 to 22.12.0 in /requirements ( #137 )
...
Bumps [black](https://github.com/psf/black ) from 22.10.0 to 22.12.0.
- [Release notes](https://github.com/psf/black/releases )
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md )
- [Commits](https://github.com/psf/black/compare/22.10.0...22.12.0 )
---
updated-dependencies:
- dependency-name: black
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:54:26 +00:00
dependabot[bot]
5129809f00
build(deps): Bump numpy from 1.23.5 to 1.24.1 in /requirements ( #136 )
...
Bumps [numpy](https://github.com/numpy/numpy ) from 1.23.5 to 1.24.1.
- [Release notes](https://github.com/numpy/numpy/releases )
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst )
- [Commits](https://github.com/numpy/numpy/compare/v1.23.5...v1.24.1 )
---
updated-dependencies:
- dependency-name: numpy
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 16:44:09 +00:00
dependabot[bot]
f06189b292
build(deps-dev): Bump jupyter-core from 5.1.0 to 5.1.3 in /requirements ( #134 )
...
Bumps [jupyter-core](https://github.com/jupyter/jupyter_core ) from 5.1.0 to 5.1.3.
- [Release notes](https://github.com/jupyter/jupyter_core/releases )
- [Changelog](https://github.com/jupyter/jupyter_core/blob/main/CHANGELOG.md )
- [Commits](https://github.com/jupyter/jupyter_core/compare/v5.1.0...v5.1.3 )
---
updated-dependencies:
- dependency-name: jupyter-core
dependency-type: direct:development
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-09 16:32:31 +00:00
dependabot[bot]
ca8e1ee9f3
build(deps): Bump pydantic from 1.10.2 to 1.10.4 in /requirements ( #133 )
...
Bumps [pydantic](https://github.com/pydantic/pydantic ) from 1.10.2 to 1.10.4.
- [Release notes](https://github.com/pydantic/pydantic/releases )
- [Changelog](https://github.com/pydantic/pydantic/blob/v1.10.4/HISTORY.md )
- [Commits](https://github.com/pydantic/pydantic/compare/v1.10.2...v1.10.4 )
---
updated-dependencies:
- dependency-name: pydantic
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-09 11:22:14 -05:00
Matt Robinson
fee95b643c
feat: add partition_docx for Word documents ( #131 )
...
* first pass on docx parsing
* linting, linting, linting
* test docx with filename
* added documentation
* more tests; version bump
* typo
* another typo
* another typo!
* it -> its
* save -> saved
* remove None since it's the default argument
2023-01-05 20:13:39 +00:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda ( #129 )
...
* add environment.yml
* instructions on how to install base package and detectron2
* added instructions on paddleocr
* remove covers
* install -> to install
* specified the shell
* updated example snippets
* update environment.yml
* updated the repo reference
* no more ands!
2023-01-05 16:21:44 +00:00
Sebastian Laverde Alfonso
5a47eb06e9
feat: new bricks for removing and extracting ordered bullets ( #128 )
...
* feat: new cleaning brick for ordered bullets
* test: add test for cleaning ordered bullets
* feat: new brick for extracting ordered bullets
* test: add test for extracting ordered bullets
* docs: update CHANGELOG and bump new dev version
* chore: change extract ordered bullets return type to tuple
* chore: made tidy
* chore: regex to split on pattern instead of built-in
* chore: catch ValueError, made tidy and fix incompatible type
* chore: assertion statements in one line of code
* docs: add documentation for new clean and extract bricks to bricks.rst
* docs: refactor CHANGELOG 0.3.5.dev5 to dev6 with new bullets
* docs: update CHANGELOG 0.3.6-dev0 changes and bump version
Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2023-01-05 17:06:26 +01:00
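Note on #128: the bullets mention an extraction brick that splits on a regex pattern and returns a tuple. A minimal sketch of what such a helper might look like, assuming purely numeric bullets with at most three levels (for example "1.2.3"). The name `extract_ordered_bullet_sketch` and the three-slot return shape are assumptions for illustration, not the library's confirmed interface.

```python
import re
from typing import Optional, Tuple

BulletParts = Tuple[Optional[str], Optional[str], Optional[str]]


def extract_ordered_bullet_sketch(text: str) -> BulletParts:
    """Split a leading numeric bullet such as "1.2.3" into up to three parts."""
    tokens = text.split(maxsplit=1)
    # Only the first whitespace-delimited token can be a bullet.
    if not tokens or "." not in tokens[0]:
        return (None, None, None)
    parts = [part for part in re.split(r"\.", tokens[0]) if part]
    if not 1 <= len(parts) <= 3 or not all(part.isdigit() for part in parts):
        return (None, None, None)
    padded: list = parts + [None] * (3 - len(parts))
    return (padded[0], padded[1], padded[2])
```

A fixed-length tuple keeps unpacking predictable for callers, at the cost of capping the nesting depth at three.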
qued
a75499d465
feat: local inference ( #125 )
...
Splits partition_pdf into two paths, one used for local inference when url is None, another for inference via api when url is a string.
0.3.5
2023-01-04 16:19:05 -06:00
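Note on #125: the commit body describes a two-way split on `url`. A minimal sketch of that dispatch, with the two back ends passed in as callables so the example stays self-contained. The names `partition_pdf_sketch`, `local_fn`, and `api_fn` are hypothetical; the real function's signature is not shown in the log.

```python
from typing import Callable, List, Optional


def partition_pdf_sketch(
    filename: str,
    url: Optional[str],
    local_fn: Callable[[str], List[str]],
    api_fn: Callable[[str, str], List[str]],
) -> List[str]:
    """Route to local inference when url is None, otherwise to the API."""
    if url is None:
        return local_fn(filename)
    return api_fn(filename, url)
```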
Matt Robinson
17045aed80
feat: add convert_to_dataframe staging brick ( #127 )
...
* add pandas to deps; pip-compile
* staging brick to convert elements to dataframe
* bump version
* add convert_to_dataframe docs
* bump wheel version
* typo fix
* typo fix 2!
2023-01-04 12:04:59 -05:00
Matt Robinson
445533745c
feat: helper functions to identify and extract phone numbers ( #124 )
...
* added pattern for finding phone numbers
* added cleaning brick for extracting phone numbers
* add docs
* changelog and bump version
* switch to us phone numbers
* bump dev version
2023-01-03 13:31:05 -05:00
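Note on #124: the bullets describe a pattern for US phone numbers plus an extraction helper. A minimal sketch under stated assumptions: the regex below is an illustrative pattern written for this note, not the one in the commit, and `extract_us_phone_number_sketch` is a hypothetical name.

```python
import re

# Illustrative pattern: optional +1 prefix, then 3-3-4 digits with
# optional parentheses and space, dot, or dash separators.
US_PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"
)


def extract_us_phone_number_sketch(text: str) -> str:
    """Return the first US-style phone number in text, or an empty string."""
    match = US_PHONE_RE.search(text)
    return match.group(0) if match else ""
```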
Mallori Harrell
509ad4951c
feat: Add extract_attachment_info ( #112 )
...
* Adds function to extract attachments and their metadata from eml files
2023-01-03 11:41:54 -06:00
dependabot[bot]
456735735c
build(deps): Bump pillow from 9.3.0 to 9.4.0 in /requirements ( #120 )
...
Bumps [pillow](https://github.com/python-pillow/Pillow ) from 9.3.0 to 9.4.0.
- [Release notes](https://github.com/python-pillow/Pillow/releases )
- [Changelog](https://github.com/python-pillow/Pillow/blob/main/CHANGES.rst )
- [Commits](https://github.com/python-pillow/Pillow/compare/9.3.0...9.4.0 )
---
updated-dependencies:
- dependency-name: pillow
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:52:53 +00:00
dependabot[bot]
f80d05c7b0
build(deps): Bump pytz from 2022.6 to 2022.7 in /requirements ( #122 )
...
Bumps [pytz](https://github.com/stub42/pytz ) from 2022.6 to 2022.7.
- [Release notes](https://github.com/stub42/pytz/releases )
- [Commits](https://github.com/stub42/pytz/compare/release_2022.6...release_2022.7 )
---
updated-dependencies:
- dependency-name: pytz
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 16:42:49 +00:00
dependabot[bot]
401124ccbf
build(deps): Bump packaging from 21.3 to 22.0 in /requirements ( #119 )
...
Bumps [packaging](https://github.com/pypa/packaging ) from 21.3 to 22.0.
- [Release notes](https://github.com/pypa/packaging/releases )
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pypa/packaging/compare/21.3...22.0 )
---
updated-dependencies:
- dependency-name: packaging
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-02 16:33:32 +00:00
dependabot[bot]
4f8519345c
build(deps): Bump mypy from 0.990 to 0.991 in /requirements ( #123 )
...
Bumps [mypy](https://github.com/python/mypy ) from 0.990 to 0.991.
- [Release notes](https://github.com/python/mypy/releases )
- [Commits](https://github.com/python/mypy/compare/v0.990...v0.991 )
---
updated-dependencies:
- dependency-name: mypy
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 11:24:01 -05:00
dependabot[bot]
6291d12f15
build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements ( #118 )
...
Bumps [lxml](https://github.com/lxml/lxml ) from 4.9.1 to 4.9.2.
- [Release notes](https://github.com/lxml/lxml/releases )
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt )
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.1...lxml-4.9.2 )
---
updated-dependencies:
- dependency-name: lxml
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 12:39:53 -05:00
dependabot[bot]
6ef08d22db
build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements ( #117 )
...
Bumps [torch](https://github.com/pytorch/pytorch ) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases )
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md )
- [Commits](https://github.com/pytorch/pytorch/compare/v1.13.0...v1.13.1 )
---
updated-dependencies:
- dependency-name: torch
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:16:13 +00:00
dependabot[bot]
2b8e54f254
build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements ( #115 )
...
Bumps [pip-tools](https://github.com/jazzband/pip-tools ) from 6.10.0 to 6.12.1.
- [Release notes](https://github.com/jazzband/pip-tools/releases )
- [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md )
- [Commits](https://github.com/jazzband/pip-tools/compare/6.10.0...6.12.1 )
---
updated-dependencies:
- dependency-name: pip-tools
dependency-type: direct:development
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:07:34 +00:00
dependabot[bot]
401af74724
build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements ( #114 )
...
Bumps [transformers](https://github.com/huggingface/transformers ) from 4.23.1 to 4.25.1.
- [Release notes](https://github.com/huggingface/transformers/releases )
- [Commits](https://github.com/huggingface/transformers/compare/v4.23.1...v4.25.1 )
---
updated-dependencies:
- dependency-name: transformers
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 11:58:24 -05:00
Matt Robinson
b14f6ac9bd
feat: extract metadata from .docx, .xlsx, and .jpg ( #113 )
...
* add python-docx dependency
* added function for extracting metadata from word documents
* add openpyxl
* added get_jpg_metadata; fixed typing
* bump changelog
* added pillow to dependencies
2022-12-26 09:34:36 -05:00
Mallori Harrell
e0a76effff
feat: Added EmailElement for email documents ( #103 )
...
* new EmailElement data structure
2022-12-21 16:03:44 -06:00
Matt Robinson
4f6fc29b54
fix: partition_html should process container divs that include text ( #110 )
...
* check for containers with text
* added tests for containers with text
* changelog and version bump
2022-12-21 21:51:04 +00:00
Mallori Harrell
6f4d9ad06c
chore: add new pattern for dash bullet ( #109 )
...
* add new pattern for dash bullet
2022-12-21 10:23:51 -06:00
cragwolfe
962c9dccca
fix: Python 3.7 typing imports ( #108 )
...
Per issue #57 , fix typing imports in Python 3.7.
0.3.4
2022-12-21 10:28:20 -05:00
Yuming Long
de4d0d42b1
doc: bump to new version ( #107 )
0.3.3
2022-12-20 15:02:16 -05:00
Yuming Long
4803281861
chore: logger should not be setting up a BasicConfig ( #106 )
...
* feat: simple logger
* doc: changelog and version
2022-12-20 10:39:02 -05:00
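Note on #106: the title says the library's logger should not call basicConfig. The usual pattern for a library is to create a named logger, attach a NullHandler, and leave configuration to the application. A minimal sketch, with the logger name "unstructured_sketch" chosen here for illustration only:

```python
import logging

# A library creates its own named logger and never configures the root logger.
logger = logging.getLogger("unstructured_sketch")

# NullHandler swallows records until the application installs real handlers.
logger.addHandler(logging.NullHandler())


def do_work() -> str:
    """Emit a log record without assuming any logging configuration exists."""
    logger.info("doing work")
    return "done"
```

An application that wants to see these messages can call `logging.basicConfig(level=logging.INFO)` itself; the library stays silent otherwise.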
Matt Robinson
407f700b20
build(deps): bump certifi to incorporate security patches ( #105 )
...
* pin certifi in base and huggingface
* pinning for build and docs
2022-12-19 14:47:15 -05:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick ( #104 )
...
* fix for processing deeply embedded list elements
* fix types in mime encodings cleaner
* first pass on partition_email
* tests for email
* test for mime encodings
* changelog bump
* added note about \n=
* linting, linting, linting
* added email docs
* add partition_email to the readme
* add one more test
2022-12-19 18:02:44 +00:00
Matt Robinson
1d68bb2482
feat: apply method to apply cleaning bricks to elements ( #102 )
...
* add apply method to apply cleaners to elements
* bump version
* add check for string output
* documentations for the apply method
* change interface to *cleaners
0.3.2
2022-12-15 22:19:02 +00:00
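Note on #102: the bullets describe an apply method that takes `*cleaners` and checks that the output is a string. A minimal sketch of that interface on a stand-in class. `TextElementSketch` is hypothetical, and raising ValueError on non-string output is an assumption; the log does not say which exception is used.

```python
from typing import Callable


class TextElementSketch:
    """Stand-in for a text element that holds a single string."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner in order and store the result on the element."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
            # Guard against cleaners that return something other than text.
            if not isinstance(cleaned, str):
                raise ValueError("Cleaner produced a non-string output.")
        self.text = cleaned
```

Usage: `element.apply(str.strip, str.lower)` runs the cleaners left to right and updates `element.text` only if every step returns a string.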
Matt Robinson
b1cce16c16
feat: translate_text cleaning brick ( #101 )
...
* initial implementation for translate brick
* more input validation
* tests for translate brick
* added docs
* bumped version
* chinese and arabic tests
* re-run pip-compile
* add torch to dependencies
* cleanup doc string
* fix long string
* fix typo in docs
* take out empty string check
* return string if string is empty
* added huggingface into make install
2022-12-15 15:35:15 -05:00
Matt Robinson
1700d4d527
fix: add __init__.py to the partition module ( #100 )
0.3.1
2022-12-14 12:59:34 -05:00
Matt Robinson
151732c74c
release: bump to version 0.3.0 ( #99 )
0.3.0
2022-12-14 11:37:53 -05:00
dependabot[bot]
6b15d706fd
build(deps): Bump huggingface-hub from 0.10.1 to 0.11.1 in /requirements ( #94 )
...
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub ) from 0.10.1 to 0.11.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases )
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.10.1...v0.11.1 )
---
updated-dependencies:
- dependency-name: huggingface-hub
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2022-12-12 11:32:50 -06:00
Matt Robinson
cf041ea637
fix: remove accident ' file ( #98 )
2022-12-12 12:08:22 -05:00
dependabot[bot]
87a77abe45
build(deps): Bump argilla from 1.1.0 to 1.1.1 in /requirements ( #93 )
...
Bumps [argilla](https://github.com/argilla-io/argilla ) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/argilla-io/argilla/releases )
- [Changelog](https://github.com/argilla-io/argilla/blob/develop/release.Dockerfile )
- [Commits](https://github.com/argilla-io/argilla/compare/v1.1.0...v1.1.1 )
---
updated-dependencies:
- dependency-name: argilla
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-12 16:56:51 +00:00
dependabot[bot]
a111c3bef3
build(deps): Bump urllib3 from 1.26.12 to 1.26.13 in /requirements ( #95 )
...
Bumps [urllib3](https://github.com/urllib3/urllib3 ) from 1.26.12 to 1.26.13.
- [Release notes](https://github.com/urllib3/urllib3/releases )
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst )
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.12...1.26.13 )
---
updated-dependencies:
- dependency-name: urllib3
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-12 16:43:47 +00:00
dependabot[bot]
adf48f2530
build(deps): Bump filelock from 3.8.0 to 3.8.2 in /requirements ( #96 )
...
Bumps [filelock](https://github.com/tox-dev/py-filelock ) from 3.8.0 to 3.8.2.
- [Release notes](https://github.com/tox-dev/py-filelock/releases )
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst )
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.0...3.8.2 )
---
updated-dependencies:
- dependency-name: filelock
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-12 11:35:27 -05:00
Sebastian Laverde Alfonso
efd7d38ce5
docs: update readme with gif ( #90 )
...
relative path might have to be changed once the branch is merged
2022-12-12 16:05:46 +01:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick ( #91 )
...
* update readme
* updated sphinx docs
* bump version; changelog
* clear cache; retrigger ci
* rename test file
* switch default parameters to None
* typo in the changelog
* add in text output
2022-12-12 14:22:10 +00:00
Matt Robinson
6cb3ac81d8
build(deps): Bump certifi to address security scans ( #92 )
2022-12-12 09:03:02 -05:00
qued
0974ef39a5
chore: add type hint to requests.post arg ( #89 )
2022-12-02 18:44:22 -06:00
Matt Robinson
0658744c38
test: mock model api calls; full coverage for partition_pdf ( #88 )
...
* test: mock model api calls; full coverage for partition_pdf
* bump version
2022-11-30 16:34:24 -05:00
Matt Robinson
77cd5cc01f
feat: text2text and token classification for argilla ( #87 )
...
* add support for text2text
* add support for token classification datasets
* bump versions
* updated docs
* remove extra comment
* fix wording in docs
* fix some more wording
2022-11-30 20:07:42 +00:00