459 Commits

Author SHA1 Message Date
dependabot[bot]
401124ccbf
build(deps): Bump packaging from 21.3 to 22.0 in /requirements (#119)
Bumps [packaging](https://github.com/pypa/packaging) from 21.3 to 22.0.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/packaging/compare/21.3...22.0)

---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-02 16:33:32 +00:00
dependabot[bot]
4f8519345c
build(deps): Bump mypy from 0.990 to 0.991 in /requirements (#123)
Bumps [mypy](https://github.com/python/mypy) from 0.990 to 0.991.
- [Release notes](https://github.com/python/mypy/releases)
- [Commits](https://github.com/python/mypy/compare/v0.990...v0.991)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-02 11:24:01 -05:00
dependabot[bot]
6291d12f15
build(deps): Bump lxml from 4.9.1 to 4.9.2 in /requirements (#118)
Bumps [lxml](https://github.com/lxml/lxml) from 4.9.1 to 4.9.2.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-4.9.1...lxml-4.9.2)

---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 12:39:53 -05:00
dependabot[bot]
6ef08d22db
build(deps): Bump torch from 1.13.0 to 1.13.1 in /requirements (#117)
Bumps [torch](https://github.com/pytorch/pytorch) from 1.13.0 to 1.13.1.
- [Release notes](https://github.com/pytorch/pytorch/releases)
- [Changelog](https://github.com/pytorch/pytorch/blob/master/RELEASE.md)
- [Commits](https://github.com/pytorch/pytorch/compare/v1.13.0...v1.13.1)

---
updated-dependencies:
- dependency-name: torch
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:16:13 +00:00
dependabot[bot]
2b8e54f254
build(deps-dev): Bump pip-tools from 6.10.0 to 6.12.1 in /requirements (#115)
Bumps [pip-tools](https://github.com/jazzband/pip-tools) from 6.10.0 to 6.12.1.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/main/CHANGELOG.md)
- [Commits](https://github.com/jazzband/pip-tools/compare/6.10.0...6.12.1)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-26 17:07:34 +00:00
dependabot[bot]
401af74724
build(deps): Bump transformers from 4.23.1 to 4.25.1 in /requirements (#114)
Bumps [transformers](https://github.com/huggingface/transformers) from 4.23.1 to 4.25.1.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.23.1...v4.25.1)

---
updated-dependencies:
- dependency-name: transformers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-26 11:58:24 -05:00
Matt Robinson
b14f6ac9bd
feat: extract metadata from .docx, .xlsx, and .jpg (#113)
* add python-docx dependency

* added function for extracting metadata from word documents

* add openpyxl

* added get_jpg_metadata; fixed typing

* bump changelog

* added pillow to dependencies
2022-12-26 09:34:36 -05:00
Mallori Harrell
e0a76effff
feat: Added EmailElement for email documents (#103)
* new EmailElement data structure
2022-12-21 16:03:44 -06:00
Matt Robinson
4f6fc29b54
fix: partition_html should process container divs that include text (#110)
* check for containers with text

* added tests for containers with text

* changelog and version bump
2022-12-21 21:51:04 +00:00
Mallori Harrell
6f4d9ad06c
chore: add new pattern for dash bullet (#109)
* add new pattern for dash bullet
2022-12-21 10:23:51 -06:00
cragwolfe
962c9dccca
fix: Python 3.7 typing imports (#108)
Per issue #57, fix typing imports in Python 3.7.
0.3.4
2022-12-21 10:28:20 -05:00
Yuming Long
de4d0d42b1
doc: bump to new version (#107)
0.3.3
2022-12-20 15:02:16 -05:00
Yuming Long
4803281861
chore: logger should not be setting up a BasicConfig (#106)
* feat: simple logger

* doc: changelog and version
2022-12-20 10:39:02 -05:00
Matt Robinson
407f700b20
build(deps): bump certifi to incorporate security patches (#105)
* pin certifi in base and huggingface

* pinning for build and docs
2022-12-19 14:47:15 -05:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements

* fix types in mime encodings cleaner

* first pass on partition_email

* tests for email

* test for mime encodings

* changelog bump

* added note about \n=

* linting, linting, linting

* added email docs

* add partition_email to the readme

* add one more test
2022-12-19 18:02:44 +00:00
Matt Robinson
1d68bb2482
feat: apply method to apply cleaning bricks to elements (#102)
* add apply method to apply cleaners to elements

* bump version

* add check for string output

* documentations for the apply method

* change interface to *cleaners
0.3.2
2022-12-15 22:19:02 +00:00
Matt Robinson
b1cce16c16
feat: translate_text cleaning brick (#101)
* initial implementation for translate brick

* more input validation

* tests for translate brick

* added docs

* bumped version

* Chinese and Arabic tests

* re-run pip-compile

* add torch to dependencies

* cleanup doc string

* fix long string

* fix typo in docs

* take out empty string check

* return string if string is empty

* added huggingface into make install
2022-12-15 15:35:15 -05:00
Matt Robinson
1700d4d527
fix: add __init__.py to the partition module (#100)
0.3.1
2022-12-14 12:59:34 -05:00
Matt Robinson
151732c74c
release: bump to version 0.3.0 (#99)
0.3.0
2022-12-14 11:37:53 -05:00
dependabot[bot]
6b15d706fd
build(deps): Bump huggingface-hub from 0.10.1 to 0.11.1 in /requirements (#94)
Bumps [huggingface-hub](https://github.com/huggingface/huggingface_hub) from 0.10.1 to 0.11.1.
- [Release notes](https://github.com/huggingface/huggingface_hub/releases)
- [Commits](https://github.com/huggingface/huggingface_hub/compare/v0.10.1...v0.11.1)

---
updated-dependencies:
- dependency-name: huggingface-hub
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2022-12-12 11:32:50 -06:00
Matt Robinson
cf041ea637
fix: remove accidental ' file (#98)
2022-12-12 12:08:22 -05:00
dependabot[bot]
87a77abe45
build(deps): Bump argilla from 1.1.0 to 1.1.1 in /requirements (#93)
Bumps [argilla](https://github.com/argilla-io/argilla) from 1.1.0 to 1.1.1.
- [Release notes](https://github.com/argilla-io/argilla/releases)
- [Changelog](https://github.com/argilla-io/argilla/blob/develop/release.Dockerfile)
- [Commits](https://github.com/argilla-io/argilla/compare/v1.1.0...v1.1.1)

---
updated-dependencies:
- dependency-name: argilla
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-12 16:56:51 +00:00
dependabot[bot]
a111c3bef3
build(deps): Bump urllib3 from 1.26.12 to 1.26.13 in /requirements (#95)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 1.26.12 to 1.26.13.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/1.26.12...1.26.13)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-12-12 16:43:47 +00:00
dependabot[bot]
adf48f2530
build(deps): Bump filelock from 3.8.0 to 3.8.2 in /requirements (#96)
Bumps [filelock](https://github.com/tox-dev/py-filelock) from 3.8.0 to 3.8.2.
- [Release notes](https://github.com/tox-dev/py-filelock/releases)
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.0...3.8.2)

---
updated-dependencies:
- dependency-name: filelock
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-12 11:35:27 -05:00
Sebastian Laverde Alfonso
efd7d38ce5
docs: update readme with gif (#90)
relative path might have to be changed once the branch is merged
2022-12-12 16:05:46 +01:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00
Matt Robinson
6cb3ac81d8
build(deps): Bump certifi to address security scans (#92)
2022-12-12 09:03:02 -05:00
qued
0974ef39a5
chore: add type hint to requests.post arg (#89)
2022-12-02 18:44:22 -06:00
Matt Robinson
0658744c38
test: mock model api calls; full coverage for partition_pdf (#88)
* test: mock model api calls; full coverage for partition_pdf

* bump version
2022-11-30 16:34:24 -05:00
Matt Robinson
77cd5cc01f
feat: text2text and token classification for argilla (#87)
* add support for text2text

* add support for token classification datasets

* bump versions

* updated docs

* remove extra comment

* fix wording in docs

* fix some more wording
2022-11-30 20:07:42 +00:00
Sebastian Laverde Alfonso
2a56fa741b
fix: partition pdf bad responses (#86)
* fix: bad responses in partition_pdf raise ValueError

* docs: update changelog with fix of partition_pdf responses

* refactor: remove unused import Text

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
2022-11-30 19:27:24 +01:00
Matt Robinson
5c4428413a
build(deps): Bump jupyter-core library (#85)
2022-11-30 10:04:56 -05:00
Sebastian Laverde Alfonso
def873f8b5
Sebastian/refine repo and readme (#82)
* docs: add badges and emojis to README

* docs: add covenant badge and refactor style

* docs: add links to badges

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2022-11-30 10:02:58 +01:00
Matt Robinson
c62f18c0d0
feat: Add html escape quotes to cleaning brick (#84)
* feat: Add html escape quotes to cleaning brick

* bump changelog
2022-11-29 10:58:31 -05:00
Sebastian Laverde Alfonso
8bb4b02053
docs: edit and add all issue templates (#83)
* docs: add title and labels

* docs: create custom.md issue template

* docs: create feature issue template
2022-11-29 15:28:31 +01:00
Sebastian Laverde Alfonso
2ab2f23703
docs: update issue templates (#80)
* docs: update issue templates

Add customs for Bug report and Feature request.

* docs: correct bug report template, no UI no mobile
2022-11-28 17:38:40 +00:00
asymness
2170a2aae2
feat: Implement Argilla staging brick (#81)
* Add argilla to dependencies and run pip-compile

* Implement Argilla staging brick and add unit tests

* Update version and changelog

* Update docs with description and usage for Argilla staging brick

* Remove unused fixtures and fix typo in Argilla tests

* add missing quote in docs

* changelog tweak

* doc tweaks

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-28 14:41:48 +00:00
Sebastian Laverde Alfonso
d6623883dc
docs: create CODE_OF_CONDUCT.md (#79)
This is in order to complete the community standards in Insights.
Add support@unstructured.io as contact for reporting.
2022-11-28 15:10:43 +01:00
cragwolfe
88373a1559
chore: Python 3.8.15 is the most recent 3.8 (#78)
2022-11-23 12:09:13 -05:00
Matt Robinson
b041b0197d
feat: Add entities kwarg to datasaur bricks (#77)
* added entities to datasaur

* add tests for datasaur with entities

* update docs

* fix missing imports

* bump version

* remove accidental file
2022-11-22 19:50:19 +00:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partitioning bricks into partition directory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Mallori Harrell
53fcf4e912
chore: Remove PDF parsing code and dependencies (#75)
Remove PDF parsing code and dependencies.
2022-11-21 11:47:29 -06:00
Sebastian Laverde Alfonso
baa15d0098
feat: new partitioning brick that calls the document image analysis API (#68)
* docs: add new feature to the CHANGELOG.md, bump the version, update __version__.py

* feat: new partition to call the document image analysis API

* fix: remove duplicated dependency on partition.py

* fix: linting error due to line-length > 100

* test: add test to call partition_pdf brick

* chore: new short example-doc pdf for speed up in test X8

* fix: add missing return statement to _read to pass check

* feat: new partitioning brick to call doc parse API

* docs: version update fix in CHANGELOG

* refactor: no nested ifs

* docs: documentation for new brick partition_pdf

* refactor: made tidy

* docs: minor doc refactor

Co-authored-by: Sebastian Laverde <sebastian@unstructured.io>
0.2.6
2022-11-16 17:48:30 +01:00
Yuming Long
83e7f9d347
docs: add Colab notebook line in README (#66)
2022-11-14 14:45:22 -05:00
qued
9906dd23a1
fix: move _read out of base Document class
Changed where _read sits in the inheritance structure since PDFDocument doesn't really need lazy document processing
2022-11-14 13:34:42 -06:00
dependabot[bot]
2ad2a8fa72
build(deps): Bump pandas from 1.5.0 to 1.5.1 in /requirements (#69)
Bumps [pandas](https://github.com/pandas-dev/pandas) from 1.5.0 to 1.5.1.
- [Release notes](https://github.com/pandas-dev/pandas/releases)
- [Changelog](https://github.com/pandas-dev/pandas/blob/main/RELEASE.md)
- [Commits](https://github.com/pandas-dev/pandas/compare/v1.5.0...v1.5.1)

---
updated-dependencies:
- dependency-name: pandas
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-14 18:07:44 +00:00
dependabot[bot]
8936ab21a7
build(deps): Bump mypy from 0.982 to 0.990 in /requirements (#73)
* build(deps): Bump mypy from 0.982 to 0.990 in /requirements

Bumps [mypy](https://github.com/python/mypy) from 0.982 to 0.990.
- [Release notes](https://github.com/python/mypy/releases)
- [Commits](https://github.com/python/mypy/compare/v0.982...v0.990)

---
updated-dependencies:
- dependency-name: mypy
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* fix typing issues

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-14 17:57:05 +00:00
dependabot[bot]
8b35c00e0b
build(deps-dev): Bump pip-tools from 6.9.0 to 6.10.0 in /requirements (#71)
Bumps [pip-tools](https://github.com/jazzband/pip-tools) from 6.9.0 to 6.10.0.
- [Release notes](https://github.com/jazzband/pip-tools/releases)
- [Changelog](https://github.com/jazzband/pip-tools/blob/master/CHANGELOG.md)
- [Commits](https://github.com/jazzband/pip-tools/compare/6.9.0...6.10.0)

---
updated-dependencies:
- dependency-name: pip-tools
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-14 17:38:34 +00:00
dependabot[bot]
0da8fed32d
build(deps): Bump contourpy from 1.0.5 to 1.0.6 in /requirements (#70)
Bumps [contourpy](https://github.com/contourpy/contourpy) from 1.0.5 to 1.0.6.
- [Release notes](https://github.com/contourpy/contourpy/releases)
- [Changelog](https://github.com/contourpy/contourpy/blob/main/docs/changelog.rst)
- [Commits](https://github.com/contourpy/contourpy/compare/v1.0.5...v1.0.6)

---
updated-dependencies:
- dependency-name: contourpy
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2022-11-14 12:32:41 -05:00
dependabot[bot]
b28ba58a98
build(deps): Bump pycocotools from 2.0.5 to 2.0.6 in /requirements (#72)
Bumps [pycocotools](https://github.com/ppwwyyxx/cocoapi) from 2.0.5 to 2.0.6.
- [Release notes](https://github.com/ppwwyyxx/cocoapi/releases)
- [Commits](https://github.com/ppwwyyxx/cocoapi/commits)

---
updated-dependencies:
- dependency-name: pycocotools
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-11-14 11:02:07 -06:00