Matt Robinson
558ee63e90
feat: ability to skip English language specific checks with env var ( #224 )
...
* add language env var
* update docs
* version and bump change log
2023-02-15 09:15:47 -05:00
Matt Robinson
a68dc35940
chore: default to local inference for partition_pdf and partition_image ( #222 )
...
* chore: default the url to None for pdf and images
* bump changelog and version
2023-02-14 16:16:33 -05:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector ( #218 )
2023-02-14 12:27:45 -08:00
qued
6d1d50d218
docs: update make targets ( #217 )
2023-02-14 06:08:29 +00:00
qued
5d0743ff8b
docs: add info about os dependencies ( #216 )
2023-02-14 05:31:52 +00:00
natygyoon
a920e55405
fix: remove comments when parsing XML or HTML ( #210 )
...
* Update xml.py
remove comments while parsing
* change logged in CHANGELOG and edited version
* make tidy
* edited version
* new version 0.4.8-dev1
* edited version
* Update CHANGELOG.md
Co-authored-by: cragwolfe <crag@unstructuredai.io>
---------
Co-authored-by: cragwolfe <crag@unstructuredai.io>
0.4.8
2023-02-11 02:52:13 +09:00
Matt Robinson
962de78def
fix: remove response text when the HTTP status code is an error ( #213 )
...
* release: version 0.4.7
* remove response text from url error
0.4.7
2023-02-10 11:39:56 -05:00
Matt Robinson
f890972139
docs: add bricks training notebook ( #211 )
...
* added bricks notebook
* more unicode quotes; isd dataframe column fix
* fix remove_punctuation docs
* typo fixes
* put staging bricks in code
2023-02-10 14:39:14 +00:00
Matt Robinson
d0c6d50962
note on local inference
2023-02-09 15:16:14 -05:00
Matt Robinson
7f9aefc549
update partition_pdf section; added partition_image
2023-02-09 15:13:26 -05:00
Matt Robinson
24c90a03dc
docs: switch theme and style refresh ( #209 )
...
* add furo theme
* switch theme to furo
* css for custom sidebar
* remove unnecessary images
* removed unnecessary fonts
* fix logo background
* hide package name
* add favico, tweak colors
* copyright 2023
* update copyright years
* update hover colors
* fix title tab
2023-02-09 10:40:28 -05:00
Matt Robinson
7fb3797165
docs: core concepts training notebook ( #207 )
...
* added to_dict to elements
* first training notebook
* bump changelog, rerun notebook
* remove coordinates and id
* rerun notebook
* has -> have
* partitioning -> partition
* various and sundry typos
* switch to using convert_to_isd
2023-02-09 14:34:34 +00:00
Matt Robinson
47ab808e0f
feat: file info dataframe from filenames and file content ( #204 )
...
* added function for exploring a list of files
* file info from file contents
* added tests for file info from contents
* bump version and add tests
* add dev to version
2023-02-08 20:48:39 +00:00
djacobs7
15b0dffdb0
docs: correct kwarg in bricks.rst ( #206 )
...
Changed whitespace to extra_whitespace in documentation, to match options text.
2023-02-08 18:21:58 +00:00
Matt Robinson
e73cf09977
feat: optional page breaks for .pptx, .pdf, .html and images ( #205 )
...
* page breaks for pptx
* added page breaks for image/pdf
* tests for images with page breaks
* page breaks for html documents
* linting, linting, linting
* changelog and bump version
* update docs
* fix typo
* refactor reusable code to common.py
* add type back in
2023-02-08 15:11:15 +00:00
Sebastian Laverde Alfonso
46b023f454
docs: update colab notebook link ( #203 )
2023-02-07 18:50:03 +01:00
Matt Robinson
ee9f15483f
feat: partition_html directly from a url ( #202 )
...
* added tests for html from url
* bump version
* added types-requests
* and -> an
2023-02-07 14:09:34 +00:00
sparkbrains
2b88890210
docs: customize sphinx doc theme ( #192 )
...
* feature: adding a feature for customizing color theme of sphinx docs
* fix: adding changelog and comments
* Adding css for changing colors of sidebar
* fix: removing changelog description
2023-02-06 17:30:55 +00:00
Matt Robinson
782b4352ec
build(deps): weekly dependency update; reduce dependabot frequency ( #194 )
...
* deps: pip-compile to update dependencies
* bump version
* linting, linting, linting
* typo
2023-02-06 16:39:29 +00:00
Matt Robinson
014585e872
fix: preserve the order of shapes in partition_pptx output ( #193 )
...
* order the shapes top to bottom and left to right
* added tests for ordering
* update change log and bump version
* more tests
* don't need enumerate
* n -> on
0.4.6
2023-02-03 22:12:33 +00:00
Matt Robinson
a7ca58e0bc
fix: more english words; split on punctuation ( #191 )
...
* add a bigger list of english words
* update thresholds and add tests
* update docs; bump version
* fix version
* add additional english words back in
* linting, linting, linting
* add slashes
* work -> word
2023-02-02 17:25:47 +00:00
Matt Robinson
0589344ff7
fix: require a minimum prop of alpha characters for titles and narrative text ( #190 )
...
* added alpha ratio check
* added tests for alpha ratio
* bump changelog and update docs
* update changelog/version; update docs
* ofr -> or
2023-02-02 14:59:04 +00:00
Matt Robinson
1230a163fd
feat: set a user controlled max word length for titles ( #189 )
...
* update the docs
* add option for title max word length
* bump version; update changelog
* change max length to 12
* docs updates
* to -> too
2023-02-01 19:32:16 +00:00
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word ( #188 )
...
* added check for english words
* update docs
* at least one word needs to have multiple characters
* bump change log
2023-02-01 09:10:48 -05:00
Matt Robinson
d0bf8904fa
docs: example notebooks from community repo ( #187 )
2023-01-31 10:37:32 -05:00
sparkbrains
243bf7ed5e
test: Increase coverage ( #181 )
2023-01-30 22:47:09 -08:00
Matt Robinson
f36e514c6d
build(deps): weekly dependency bump ( #183 )
2023-01-30 11:05:48 -05:00
Matt Robinson
e6cfde5c4a
fix: no UserWarning when partition_pdf is called ( #179 )
2023-01-27 12:08:18 -05:00
Matt Robinson
339c133326
fix: cleanup from live .docx tests ( #177 )
...
* add env var for cap threshold; raise default threshold
* update docs and tests
* added check for ending in a comma
* update docs
* no caps check for all upper text
* capture Text in html and text
* check category in Text equality check
* lower case all caps before checking for verbs
* added check for us city/state/zip
* added address type
* add address to html
* add address to text
* fix for text tests; escape for large text segments
* refactor regex for readability
* update comment
* additional test for text with linebreaks
* update docs
* update changelog
* update elements docs
* remove old comment
* case -> cast
* type fix
2023-01-26 15:52:25 +00:00
Matt Robinson
1ce8447ba7
build(deps): bump unstructured inference; compile from setup.py ( #176 )
...
* bump unstructured inference; compile from setup.py
* bump version
* compile the local-inference extra
* linting, linting, linting
0.4.4
2023-01-25 16:32:57 +00:00
Matt Robinson
26a5546152
fix: handle xml filetype detection on amazon linux ( #173 )
...
* fix: handle xml filetype detection on amazon linux
* option for html or xml
* fix typo
* back to dev tag
2023-01-25 11:20:01 -05:00
Matt Robinson
3b6546515d
docs: add links to linkedin and slack ( #175 )
2023-01-24 13:51:10 -08:00
qued
d2909ac688
chore: update all deps ( #172 )
2023-01-23 13:03:02 -06:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx ( #166 )
...
* partition pptx and tests
* add partition_pptx to auto
* update doc types in readme
* add pptx docs
* bump version
* remove extra whitespace
* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
8d3e616846
feat: add ability to parse LayoutElement lists ( #165 )
...
* added ability to split list items
* changelog and version bump
* retrigger ci
2023-01-20 08:55:11 -05:00
Matt Robinson
c1822911a5
chore: return Element objects in partition_pdf and partition_image ( #164 )
...
* helper function to convert to element
* test for element types
* fix for healthcheck url
* version bump
* note on coordinates
* mention FigureCaption
* test_shared -> test_common
* add check boxes for checkbox template
* update changelog
2023-01-19 14:29:28 +00:00
Matt Robinson
59f972d739
build(deps): add requests as a base dependency ( #162 )
...
* build(deps): add `requests` as a base dependency
* linting, linting, linting
* changelog typo
0.4.3
2023-01-18 16:36:23 +00:00
Matt Robinson
74ce2ae6e5
fix: update detect_filetype to properly handle older office files ( #161 )
2023-01-18 11:18:20 -05:00
Mallori Harrell
08ccee0acb
chore: Fix parse received data ( #143 )
...
* fix parse_received data
2023-01-17 16:36:44 -06:00
Matt Robinson
749f9c6be8
fix: avoid divide by zero in exceeds_cap_ratio ( #160 )
2023-01-17 15:22:12 -05:00
gokullan
5d9183dc99
chore: graceful exit if sed is an old version ( #157 )
2023-01-17 18:11:14 +00:00
Matt Robinson
9c3c14e94d
fix: resolves UnicodeDecodeError in partition_email for emails with attachments ( #158 )
...
* split emails by \n=
* added test for equivalence between html and plain text
* changelog and bump version
* add check for content disposition
0.4.2
2023-01-17 11:33:45 -05:00
dependabot[bot]
7ed5f71e30
build(deps): Bump packaging from 22.0 to 23.0 in /requirements ( #156 )
...
Bumps [packaging](https://github.com/pypa/packaging) from 22.0 to 23.0.
- [Release notes](https://github.com/pypa/packaging/releases)
- [Changelog](https://github.com/pypa/packaging/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pypa/packaging/compare/22.0...23.0)
---
updated-dependencies:
- dependency-name: packaging
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 11:03:03 -05:00
dependabot[bot]
04c1813c7f
build(deps): Bump filelock from 3.8.2 to 3.9.0 in /requirements ( #152 )
...
Bumps [filelock](https://github.com/tox-dev/py-filelock) from 3.8.2 to 3.9.0.
- [Release notes](https://github.com/tox-dev/py-filelock/releases)
- [Changelog](https://github.com/tox-dev/py-filelock/blob/main/docs/changelog.rst)
- [Commits](https://github.com/tox-dev/py-filelock/compare/3.8.2...3.9.0)
---
updated-dependencies:
- dependency-name: filelock
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-01-17 15:40:26 +00:00
dependabot[bot]
49392a2955
build(deps): Bump requests from 2.28.1 to 2.28.2 in /requirements ( #154 )
...
Bumps [requests](https://github.com/psf/requests) from 2.28.1 to 2.28.2.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.28.1...v2.28.2)
---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-01-17 10:28:23 -05:00
qued
8abf1f119d
feat: partition image ( #144 )
...
Adds partition_image to partition image file types, which is integrated into the partition brick. This relies on the 0.2.2 version of unstructured-inference.
2023-01-13 22:24:13 -06:00
Matt Robinson
419c0867d3
build(deps): bump unstructured_inference version range ( #151 )
...
* bump unstructured-inference to 0.2.3
* bump version
0.4.1
2023-01-13 22:21:36 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition ( #150 )
...
* added partition_text for auto
* rename partition_text tests
* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files ( #142 )
...
* added python-pptx to requirements
* added filetype detection for powerpoint
* add more filetypes to detect
* more tests
* added tests for filetype
* reorder document types
* tests for get_directory_file_info
* added docs for get_directory_file_info
* bump version
* Word -> Office
* added test for filetype
* add group by filetype example
0.4.0
2023-01-11 12:40:50 -05:00
qued
7e3af6c609
chore: remove extra requirements.txt ( #140 )
2023-01-10 22:12:10 -06:00