17 Commits

Author SHA1 Message Date
Jack Retterer
a35ff890e0
Update docs jack (#1157)
Documentation Overhaul

- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
2023-08-21 10:27:32 -07:00
Matt Robinson
331c7faf38
build(deps): split up dependencies by document type (#986)
* split dependencies by document type

* make pip-compile with new requirements

* add extra requirements to setup.py

* add in all docs; re pip-compile

* extra for all docs

* add pandas to xlsx

* dependency requires for tsv and csv

* handling for doc, docx and odt

* dependency check for pypandoc

* required dependencies for pandoc files

* xml and html

* markdown

* msg

* add in pdf

* add in pptx

* add in excel

* add lxml as base req

* extra all docs for local inference

* local inference installs all

* pin pillow version

* fixes for plain text tests

* fixes for doc

* update make commands

* changelog and version

* add xlrd

* update pip-compile

* pin numpy for python 3.8 support

* more constraints

* contraint on scipy

* update install docs

* constrain ipython

* add outlook to pip-compile

* more ipython constraints

* add extras to dockerfile

* pin office365 client

* few doc tweaks

* types as strings

* last pip-compile

* re pip-comple

* make tidy

* make tidy
2023-08-01 11:31:13 -04:00
Jack Retterer
708714dab5
docs: fixed typo in Installation guide (#945) 2023-07-21 13:33:44 +00:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
Nicolas
c62bee48ad
Update installing.rst (#590) 2023-05-16 02:08:01 +00:00
Matt Robinson
fae5f8fdde
feat: add partition_odt for open office docs (#548)
* added filetype detection for odt

* add function for partition odt documents

* add odt files to auto

* changelog and version

* docs and readme

* update installation docs

* skip tests if not supported or in docker

* import pytest

* fix docs typos
2023-05-04 19:28:08 +00:00
Shukri
8d4308af43
doc: typo (#495)
XML/HTML Depenedencies -> XML/HTML Dependencies
2023-04-17 20:26:50 -07:00
qued
4211dda360
build: sync detectron version (#440)
* Update detectron2 version in Dockerfile
* Update detectron2 version in docs
2023-04-03 18:47:43 -05:00
Matt Robinson
e43cb0e6e0
feat: add partition_epub function (#364)
* add pypandoc dependency

* added epub partitioner and file conversion

* test for partition_epub

* tests for file conversion

* add epub to filetype detection

* added epub to auto partition

* update bricks docs

* updated installing docs

* changelot and version

* add pandoc to dependencies

* add pandoc to debian dependencies

* linting, linting, linting

* typo fix

* typo fix

* file conversion type hints

* more type hints

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
2023-03-14 15:52:21 +00:00
Matt Robinson
354eff1e2b
build(deps): automatically download nltk models when required (#246)
* code for downloading nltk packages

* don't run nltk make command in ci

* test for model downloads

* remove nltk install from docs

* update changelog and bump version
2023-02-23 17:19:13 +00:00
Matt Robinson
314924137f
docs: add quotes to local-inference install instructions (#245) 2023-02-21 09:58:26 -06:00
Matt Robinson
7472e1bb21
docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme

* added quick start to docs

* parenthetical on extra deps

* typo

* fix typo

* fixed mixed tabs/spaces
2023-02-17 22:13:28 +00:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for emal

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Matt Robinson
33b983fbf0
docs: instructions on how to install on Windows + conda (#129)
* add environment.yml

* instructions on how to install base package and detectron2

* added instructions on paddleocr

* remove covers

* install -> to install

* specified the shell

* updated example snippets

* update environment.yml

* updated the repo reference

* no more ands!
2023-01-05 16:21:44 +00:00
Mallori Harrell
53fcf4e912
chore: Remove PDF parsing code and dependencies (#75)
Remove PDF parsing code and dependencies.
2022-11-21 11:47:29 -06:00
Matt Robinson
fb16847946
feat: Staging brick for attention window chunking (#34)
* add huggingface dependencies and re pip-compile

* first pass on chunk by attention window

* test for chunking function

* completed tests for chunk_by_attention_window

* change default buffer size to 2

* wrapper function for staging

* added docs for transformers

* fix wording and typos

* updated change log and bumped the version

* added docs on huggingface dependencies

* fix typo

* re pip-compile
2022-10-13 11:18:27 -04:00
Matt Robinson
5f40c78f25 Initial Release 2022-09-26 14:55:20 -07:00