249 Commits

Author SHA1 Message Date
Tom Aarsen
e61ce2cc00
Skip posix_path test on Windows (#283) 2023-02-25 08:31:34 +00:00
qued
a79b365ab4
feat: add ubuntu setup script (#279) 2023-02-24 20:05:26 -06:00
Tom Aarsen
9062d25d0d
Resolve numerous typos (#280)
* Resolve numerous typos

* Resolve typo in mime type
2023-02-24 17:48:23 -08:00
grungyfeline998
956f04d770
feat: detect filetype with extension if libmagic is unavailable (#268)
* included the previous PR changes and verified black

* resolved the issues mentioned

* make tidy and add tests

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-02-24 15:23:29 +00:00
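The fallback this commit describes can be sketched roughly as below. This is an illustration under stated assumptions, not the project's actual code: `guess_mime_type` is a hypothetical helper name, the optional `python-magic` package is assumed to expose `magic.from_file(..., mime=True)` when libmagic is installed, and the standard library's `mimetypes` module stands in for the extension lookup.

```python
import mimetypes
import os
from typing import Optional

try:
    import magic  # python-magic; the import fails if libmagic is missing
except ImportError:
    magic = None


def guess_mime_type(filename: str) -> Optional[str]:
    """Hypothetical helper: prefer libmagic, fall back to the file extension."""
    if magic is not None and os.path.isfile(filename):
        # Content-based detection, only possible when libmagic is available.
        return magic.from_file(filename, mime=True)
    # Extension-based detection; returns None for unknown extensions.
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

Extension-based detection is less reliable than inspecting file contents, so a sketch like this only uses it as a fallback.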
cragwolfe
e419ba1d33
doc: Announce the competition! (#274) 2023-02-23 16:52:34 -08:00
Matt Robinson
0d229f0a5e
fix: preserve all elements when serialized; feat: helper functions for serialization (#273)
* added type to text element map

* add element_id and coordinates

* added test for serialization

* added serialization for check boxes

* add dict_to_elements and convert_to_dict aliases

* helpers for serializing and deserializing elements

* bump version; changelog

* add Text to tests

* aliases for isd functions

* remove test elements json

* changelog updates

* make indent a kwarg

* update expected structured output

* docs update

* use new function in ingest code

* pop coordinates due to floating point differences

* pop coordinates
0.4.15
2023-02-23 21:58:59 +00:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
0.4.14
2023-02-23 17:19:13 +00:00
cragwolfe
83f04545df
fix: Adds missing __init__.py (#259) 0.4.13 2023-02-22 21:31:34 -08:00
cragwolfe
80c0fab215
build: new release (#249)
Cut a release that has the unstructured-ingest command line included in the unstructured package.

Bonus tweak to the Ingest checklist.
0.4.12
2023-02-23 03:44:05 +00:00
Viktor Zhemchuzhnikov
60abac2c4b
feat: allow custom parsers in partition_html (#251)
This will allow partition_html to use a custom XMLParser or HTMLParser.
It can be useful if one needs to specify additional arguments to these parsers (beyond the built-in remove_comments=True).
---------

Co-authored-by: Viktor Zhemchuzhnikov <v.zhemchuzhnikov@xsolla.com>
2023-02-23 01:57:42 +00:00
cragwolfe
1b8bf318b8
refactor: move processing logic to IngestDoc (#248)
Moves the logic to partition a raw document to the IngestDoc level to
allow for easier overrides for subclasses of IngestDoc.
2023-02-22 01:02:05 +00:00
cragwolfe
69acb083bd
refactor: break up logic from one line to 2 (#247)
Separate elements out into a separate variable to allow for conditional logic based on the instance type of the doc (or other properties).
2023-02-21 17:44:58 -06:00
cragwolfe
87fd0d01dc
feat: Ingest refactors, doc updates (#243)
- Creates ABCs for ingest connectors
- Updates the s3_connector classes to inherit from ABCs
- Moves s3 test script to its own file to establish pattern for additional connectors
- Rewrites the Ingest.md doc, including instructions on how to add a connector
- Updates the example s3 ingest script to use the new location for main.py

Note that there were no logic changes, this is essentially a refactoring PR.

Test instructions:

Run ./test_unstructured_ingest/test-ingest.sh and ./examples/ingest/s3-small-batch/ingest.sh.
2023-02-21 10:15:33 -08:00
Matt Robinson
314924137f
docs: add quotes to local-inference install instructions (#245) 2023-02-21 09:58:26 -06:00
noahdemoes
f205e6f3ae
build: add Python 3.9 and Python 3.10 to the CI test job (#235)
* add python 3.9 3.10

* run on branch

* run on branch

* run on branch

* run on branch

* revert

* update all jobs

* update all jobs

* update all jobs
2023-02-20 14:08:46 -08:00
Matt Robinson
7472e1bb21
docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
2023-02-17 22:13:28 +00:00
Matt Robinson
601f250edc
feat: add partition_ppt for older PowerPoint docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
0.4.11
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Matt Robinson
9bbd4a1d56
docs: file exploration training notebook (#221) 2023-02-16 20:33:02 +00:00
Matt Robinson
f5ff140d7c
fix: ElementMetadata serializes when the filename is a Path object (#233) 0.4.10 2023-02-16 17:20:51 +00:00
cragwolfe
3c1b089071
feat: Ingest CLI flags and test fixture updates (#227)
* Many command line options added. The sample ingest project is now an easy-to-use CLI (no code editing
   necessary), capable of processing large numbers of files from S3 in a re-entrant manner. See Ingest.md.
* Fixes issue where test fixtures had been truncated
  * Adds a check to make sure this doesn't happen again
* Moves fixture outputs for the existing connector one subdir lower,
  to make room for future connector outputs.
2023-02-16 16:45:50 +00:00
Matt Robinson
74e6b84b41
feat: add metadata tracking to document elements (#225)
* add metadata field to elements

* metadata tracking for pdf/image

* metadata for html

* update expected outputs

* metadata for the rest of the document types

* take out file metadata for now

* add url to tables

* added metadata to test_auto

* bump version

* added coordinates to __init__

* fix coordinates in tests
0.4.9
2023-02-15 18:26:20 +00:00
Ethan Steininger
b8dce6109b
doc: update README with local-inference instructions
2023-02-15 14:49:40 +00:00
Matt Robinson
558ee63e90
feat: ability to skip English language specific checks with env var (#224)
* add language env var

* update docs

* version and bump change log
2023-02-15 09:15:47 -05:00
Matt Robinson
a68dc35940
chore: default to local inference for partition_pdf and partition_image (#222)
* chore: default the url to None for pdf and images

* bump changelog and version
2023-02-14 16:16:33 -05:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector (#218) 2023-02-14 12:27:45 -08:00
qued
6d1d50d218
docs: update make targets (#217) 2023-02-14 06:08:29 +00:00
qued
5d0743ff8b
docs: add info about os dependencies (#216) 2023-02-14 05:31:52 +00:00
natygyoon
a920e55405
fix: remove comments when parsing XML or HTML (#210)
* Update xml.py

remove comments while parsing

* change logged in CHANGELOG and edited version

* make tidy

* edited version

* new version 0.4.8-dev1

* edited version

* Update CHANGELOG.md

Co-authored-by: cragwolfe <crag@unstructuredai.io>

---------

Co-authored-by: cragwolfe <crag@unstructuredai.io>
0.4.8
2023-02-11 02:52:13 +09:00
Matt Robinson
962de78def
fix: remove response text when the HTTP status code is an error (#213)
* release: version 0.4.7

* remove response text from url error
0.4.7
2023-02-10 11:39:56 -05:00
Matt Robinson
f890972139
docs: add bricks training notebook (#211)
* added bricks notebook

* more unicode quotes; isd dataframe column fix

* fix remove_punctuation docs

* typo fixes

* put staging bricks in code
2023-02-10 14:39:14 +00:00
Matt Robinson
d0c6d50962 note on local inference 2023-02-09 15:16:14 -05:00
Matt Robinson
7f9aefc549 update partition_pdf section; added partition_image 2023-02-09 15:13:26 -05:00
Matt Robinson
24c90a03dc
docs: switch theme and style refresh (#209)
* add furo theme

* switch theme to furo

* css for custom sidebar

* remove unnecessary images

* removed unnecessary fonts

* fix logo background

* hide package name

* add favico, tweak colors

* copyright 2023

* update copyright years

* update hover colors

* fix title tab
2023-02-09 10:40:28 -05:00
Matt Robinson
7fb3797165
docs: core concepts training notebook (#207)
* added to_dict to elements

* first training notebook

* bump changelog, rerun notebook

* remove coordinates and id

* rerun notebook

* has -> have

* partitioning -> partition

* various and sundry typos

* switch to using convert_to_isd
2023-02-09 14:34:34 +00:00
Matt Robinson
47ab808e0f
feat: file info dataframe from filenames and file content (#204)
* added function for exploring a list of files

* file info from file contents

* added tests for file info from contents

* bump version and add tests

* add dev to version
2023-02-08 20:48:39 +00:00
djacobs7
15b0dffdb0
docs: correct kwarg in bricks.rst (#206)
Changed whitespace to extra_whitespace in documentation, to match options text.
2023-02-08 18:21:58 +00:00
Matt Robinson
e73cf09977
feat: optional page breaks for .pptx, .pdf, .html and images (#205)
* page breaks for pptx

* added page breaks for image/pdf

* tests for images with page breaks

* page breaks for html documents

* linting, linting, linting

* changelog and bump version

* update docs

* fix typo

* refactor reusable code to common.py

* add type back in
2023-02-08 15:11:15 +00:00
Sebastian Laverde Alfonso
46b023f454
docs: update colab notebook link (#203) 2023-02-07 18:50:03 +01:00
Matt Robinson
ee9f15483f
feat: partition_html directly from a url (#202)
* added tests for html from url

* bump version

* added types-requests

* and -> an
2023-02-07 14:09:34 +00:00
sparkbrains
2b88890210
docs: customize sphinx doc theme (#192)
* feature: adding a feature for customizing color theme of sphinx docs

* fix: adding changelog and comments

* Adding css for changing colors of sidebar

* fix: removing changelog description
2023-02-06 17:30:55 +00:00
Matt Robinson
782b4352ec
build(deps): weekly dependency update; reduce dependabot frequency (#194)
* deps: pip-compile to update dependencies

* bump version

* linting, linting, linting

* typo
2023-02-06 16:39:29 +00:00
Matt Robinson
014585e872
fix: preserve the order of shapes in partition_pptx output (#193)
* order the shapes top to bottom and left to right

* added tests for ordering

* update change log and bump version

* more tests

* don't need enumerate

* n -> on
0.4.6
2023-02-03 22:12:33 +00:00
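The ordering named in this commit (top to bottom, then left to right) can be sketched as a sort on a two-part key. This is a minimal illustration, not the project's code: `Shape` is a hypothetical stand-in for a slide shape with `top` and `left` offsets, and `order_shapes` is an illustrative name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shape:
    """Hypothetical stand-in for a slide shape; not the python-pptx class."""

    text: str
    top: int   # vertical offset from the top edge of the slide
    left: int  # horizontal offset from the left edge of the slide


def order_shapes(shapes: List[Shape]) -> List[Shape]:
    """Return shapes sorted top to bottom, with ties broken left to right."""
    return sorted(shapes, key=lambda shape: (shape.top, shape.left))
```

Because `sorted` is stable, shapes that share both offsets keep their original relative order.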
Matt Robinson
a7ca58e0bc
fix: more english words; split on punctuation (#191)
* add a bigger list of english words

* update thresholds and add tests

* update docs; bump version

* fix version

* add additional english words back in

* linting, linting, linting

* add slashes

* work -> word
2023-02-02 17:25:47 +00:00
Matt Robinson
0589344ff7
fix: require a minimum proportion of alpha characters for titles and narrative text (#190)
* added alpha ratio check

* added tests for alpha ratio

* bump changelog and update docs

* update changelog/version; update docs

* ofr -> or
2023-02-02 14:59:04 +00:00
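An alpha ratio check of the kind this commit mentions might look like the sketch below. The function name `meets_alpha_ratio` and the default threshold of 0.5 are assumptions made for illustration; the log does not state the value or the exact rule the project uses.

```python
def meets_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical check: is the share of alphabetic characters high enough?

    Whitespace is ignored when computing the ratio. The 0.5 default is an
    assumed value for illustration only.
    """
    non_space = [ch for ch in text if not ch.isspace()]
    if not non_space:
        # Empty or whitespace-only text cannot be a title or narrative text.
        return False
    alpha_count = sum(ch.isalpha() for ch in non_space)
    return alpha_count / len(non_space) >= threshold
```

A filter like this screens out strings made mostly of digits or punctuation, such as table fragments or separator lines, before they are labeled as titles or narrative text.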
Matt Robinson
1230a163fd
feat: set a user controlled max word length for titles (#189)
* update the docs

* add option for title max word length

* bump version; update changelog

* change max length to 12

* docs updates

* to -> too
2023-02-01 19:32:16 +00:00
Matt Robinson
2d08fcbf83
fix: titles and narrative text need at least one english word (#188)
* added check for english words

* update docs

* at least one word needs to have multiple characters

* bump change log
2023-02-01 09:10:48 -05:00
Matt Robinson
d0bf8904fa
docs: example notebooks from community repo (#187) 2023-01-31 10:37:32 -05:00
sparkbrains
243bf7ed5e
test: Increase coverage (#181) 2023-01-30 22:47:09 -08:00
Matt Robinson
f36e514c6d
build(deps): weekly dependency bump (#183) 2023-01-30 11:05:48 -05:00