33 Commits

Author SHA1 Message Date
cragwolfe
7b9475ef26
chore: rm competition announcement from the README (#361) 2023-03-13 09:34:26 -07:00
Alvaro Bartolome
2979e17aa4
feat: add .pre-commit-config.yaml to let users enable pre-commit hooks (#320)
Per the README, provides an optional `pre-commit` configuration
file to ensure code matches the formatting and linting standards used in `unstructured`.
2023-03-05 20:23:39 +00:00
Matt Robinson
1cd1bd8eba
docs: more detailed bricks writeup; reorganize docs (#304)
* add print statement in readme

* elements before bricks

* new preamble to bricks section

* add preamble to bricks section

* add preamble to cleaning section

* descriptions of each documentation page

* non-brick helper functions to the bottom

* fix codeblock

* includes some optional kwargs

* code blocks

* typo fix
2023-02-27 23:11:49 +00:00
Tom Aarsen
9062d25d0d
Resolve numerous typos (#280)
* Resolve numerous typos

* Resolve typo in mime type
2023-02-24 17:48:23 -08:00
cragwolfe
e419ba1d33
doc: Announce the competition! (#274) 2023-02-23 16:52:34 -08:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
Matt Robinson
314924137f
docs: add quotes to local-inference install instructions (#245) 2023-02-21 09:58:26 -06:00
Matt Robinson
7472e1bb21
docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
2023-02-17 22:13:28 +00:00
Matt Robinson
601f250edc
feat: add partition_ppt for older PowerPoint docs (#238)
* added partition_ppt function and tests

* add ppt support to auto

* version bump

* update docs

* doc fixes

* update changelog

* `.docx` -> `.pptx`

* its -> their

* remove whitespace
2023-02-17 16:57:08 +00:00
Matt Robinson
6036af33e7
feat: add partition_doc for .doc files (#236)
* first pass on doc partitioning

* add libreoffice to deps

* update docs and readme

* add .doc to auto

* changelog bump

* value error with missing doc

* doc updates
2023-02-17 09:30:23 -05:00
Ethan Steininger
b8dce6109b
doc: update README with local-inference instructions
2023-02-15 14:49:40 +00:00
cragwolfe
ab542ca3c6
feat: Sample ingest project with S3 connector (#218) 2023-02-14 12:27:45 -08:00
qued
6d1d50d218
docs: update make targets (#217) 2023-02-14 06:08:29 +00:00
qued
5d0743ff8b
docs: add info about os dependencies (#216) 2023-02-14 05:31:52 +00:00
Sebastian Laverde Alfonso
46b023f454
docs: update colab notebook link (#203) 2023-02-07 18:50:03 +01:00
Matt Robinson
3b6546515d
docs: add links to linkedin and slack (#175) 2023-01-24 13:51:10 -08:00
Matt Robinson
8b6c5fac9d
feat: basic PowerPoint parsing in partition_pptx (#166)
* partition pptx and tests

* add partition_pptx to auto

* update doc types in readme

* add pptx docs

* bump version

* remove extra whitespace

* partition -> partitioning
2023-01-23 17:03:09 +00:00
Matt Robinson
f12240c5e7
feat: add support for .txt files in partition (#150)
* added partition_text for auto

* rename partition_text tests

* bump version and update docs
2023-01-13 16:39:53 -05:00
Matt Robinson
7b3b594ee5
fix: correct make install-ci target (#138)
* fix install-ci make target

* add note to readme about libmagic

* remove mydoc.docx

* remove local-inference
2023-01-09 17:03:09 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for email

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Mallori Harrell
d7a00046a9
feat: Add new functionality to parse text and header of emails (#111)
* partition_text function
2023-01-09 17:08:08 +00:00
Matt Robinson
7a74cdda86
feat: add partition_email cleaning brick (#104)
* fix for processing deeply embedded list elements

* fix types in mime encodings cleaner

* first pass on partition_email

* tests for email

* test for mime encodings

* changelog bump

* added note about \n=

* linting, linting, linting

* added email docs

* add partition_email to the readme

* add one more test
2022-12-19 18:02:44 +00:00
Sebastian Laverde Alfonso
efd7d38ce5
docs: update readme with gif (#90)
relative path might have to be changed once the branch is merged
2022-12-12 16:05:46 +01:00
Matt Robinson
3c19c7cd8a
feat: Add partition_html brick (#91)
* update readme

* updated sphinx docs

* bump version; changelog

* clear cache; retrigger ci

* rename test file

* switch default parameters to None

* typo in the changelog

* add in text output
2022-12-12 14:22:10 +00:00
Sebastian Laverde Alfonso
def873f8b5
Sebastian/refine repo and readme (#82)
* docs: add badges and emojis to README

* docs: add covenant badge and refactor style

* docs: add links to badges

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2022-11-30 10:02:58 +01:00
cragwolfe
88373a1559
chore: Python 3.8.15 is the most recent 3.8 (#78) 2022-11-23 12:09:13 -05:00
Matt Robinson
08e091c5a9
chore: Reorganize partition bricks under partition directory (#76)
* move partition_pdf to partition folder

* move partition.py

* refactor partitioning bricks into partition directory

* import to nlp for backward compatibility

* update docs

* update version and bump changelog

* fix typo in changelog

* update readme reference
2022-11-21 22:27:23 +00:00
Mallori Harrell
53fcf4e912
chore: Remove PDF parsing code and dependencies (#75)
Remove PDF parsing code and dependencies.
2022-11-21 11:47:29 -06:00
Yuming Long
83e7f9d347
docs: add Colab notebook line in README (#66) 2022-11-14 14:45:22 -05:00
Matt Robinson
64f2d3aa49
docs: Add developer quick start section (#61) 2022-11-09 09:52:29 -06:00
Matt Robinson
4aa3d51b03
fix: Change the image src to an absolute link (#60) 2022-11-08 22:05:30 +00:00
Matt Robinson
e290f085af
docs: Link to security policy in the README 2022-09-27 10:32:55 -04:00
Matt Robinson
5f40c78f25
Initial Release 2022-09-26 14:55:20 -07:00