91 Commits

Author SHA1 Message Date
John
670687bb67
update .pre-commit-config to match linting used by CI (#1906)
Closes #1905 
.pre-commit-config.yaml does not match pyproject.toml, which causes
unnecessary/undesirable formatting changes. These changes are not
required by CI, so they should not have to be made.

**To Reproduce**
Install pre-commit configuration as described
[here](https://github.com/Unstructured-IO/unstructured#installation-instructions-for-local-development).
Make a commit and something like the following will be logged:
```
check for added large files..............................................Passed
check toml...........................................(no files to check)Skipped
check yaml...........................................(no files to check)Skipped
check json...........................................(no files to check)Skipped
check xml............................................(no files to check)Skipped
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
mixed line ending........................................................Passed
black....................................................................Passed
ruff.....................................................................Failed
- hook id: ruff
- files were modified by this hook
```

---------

Co-authored-by: Yao You <theyaoyou@gmail.com>
2023-10-27 13:24:55 -05:00
cragwolfe
8aceda97dd
test: print slowest unittests (#1911)
Show which tests are slowing things down when running `make test`:

E.g., from the CI run in this PR:

```
2023-10-27T05:51:05.6256039Z 105.12s setup    test_unstructured/partition/pdf_image/test_pdf.py::test_chipper_has_hierarchy
2023-10-27T05:51:05.6257784Z 93.47s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[entire_page]
2023-10-27T05:51:05.6259866Z 93.09s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6261818Z 31.70s call     test_unstructured/partition/epub/test_epub.py::test_add_chunking_strategy_on_partition_epub_non_default
2023-10-27T05:51:05.6263774Z 17.22s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-filename]
2023-10-27T05:51:05.6265658Z 17.13s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-spool]
2023-10-27T05:51:05.6273195Z 16.95s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image_hi_res
2023-10-27T05:51:05.6275118Z 16.77s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf[hi_res-expected1-pdf-rb]
2023-10-27T05:51:05.6276759Z 14.64s call     test_unstructured/partition/test_text.py::test_partition_text_detects_more_than_3_languages
2023-10-27T05:51:05.6278381Z 13.86s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_with_multipage_tiff
2023-10-27T05:51:05.6280137Z 13.51s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_filename[False-None]
2023-10-27T05:51:05.6281995Z 13.41s call     test_unstructured/partition/test_html_partition.py::test_add_chunking_strategy_on_partition_html
2023-10-27T05:51:05.6283640Z 12.80s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_copy_protection
2023-10-27T05:51:05.6285305Z 12.46s call     test_unstructured/partition/pdf_image/test_image.py::test_add_chunking_strategy_on_partition_image
2023-10-27T05:51:05.6287250Z 12.39s call     test_unstructured/partition/pdf_image/test_image.py::test_partition_image_hi_res_ocr_mode_with_table_extraction[individual_blocks]
2023-10-27T05:51:05.6289347Z 12.14s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6291329Z 12.12s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_custom_metadata_date
2023-10-27T05:51:05.6293388Z 12.12s call     test_unstructured/partition/test_auto.py::test_auto_partition_pdf_from_file[True-application/pdf]
2023-10-27T05:51:05.6294869Z 12.08s call     test_unstructured/partition/test_auto.py::test_auto_with_page_breaks
2023-10-27T05:51:05.6296396Z 12.02s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_with_hi_res_strategy_metadata_date
2023-10-27T05:51:05.6298278Z 11.99s call     test_unstructured/partition/pdf_image/test_pdf.py::test_partition_pdf_from_file_with_hi_res_strategy_metadata_date
```
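
The excerpt above resembles the report pytest prints when run with `--durations`. As a rough sketch of how such a log could be summarised per test file, assuming each entry has the shape `<seconds>s <phase> <test id>` (the regular expression and the helper names `parse_durations` and `total_by_file` are illustrative, not part of the repo):

```python
import re

# Matches entries such as "93.47s call     path/to/test.py::test_name[param]",
# optionally preceded by a CI timestamp. The pattern is an assumption derived
# from the log excerpt, not an official pytest output grammar.
DURATION_LINE = re.compile(
    r"(?P<seconds>\d+(?:\.\d+)?)s\s+(?P<phase>setup|call|teardown)\s+(?P<test_id>\S+)"
)


def parse_durations(log_text):
    """Return (seconds, phase, test_id) tuples found in the log text."""
    rows = []
    for line in log_text.splitlines():
        match = DURATION_LINE.search(line)
        if match:
            rows.append((float(match["seconds"]), match["phase"], match["test_id"]))
    return rows


def total_by_file(rows):
    """Sum the reported seconds per test file (the part before '::')."""
    totals = {}
    for seconds, _phase, test_id in rows:
        file_path = test_id.split("::", 1)[0]
        totals[file_path] = totals.get(file_path, 0.0) + seconds
    return totals
```

Grouping by file makes it easier to see which module dominates the run than scanning individual test ids.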
2023-10-27 11:40:55 -05:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
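
The caching idea in the list above can be sketched in Python. This is only an illustration of keying a cache on file contents; the CI itself presumably uses its own hashing mechanism, and the function name `cache_key`, the `*.txt` pattern, and the key format are assumptions:

```python
import hashlib
from pathlib import Path


def cache_key(prefix, directory, pattern="*.txt"):
    """Build a cache id from the files directly inside `directory`.

    The glob is not recursive, so files in a subfolder such as
    `requirements/ingest/` do not change the key of the parent folder.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            # Include the file name so renames also invalidate the cache.
            digest.update(path.name.encode("utf-8"))
            digest.update(path.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

With this layout, a key for `requirements/` stays stable when only ingest dependencies change, while a second key for `requirements/ingest/` tracks the ingest layer.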
2023-10-24 14:54:00 +00:00
Roman Isecke
63861f537e
Add check for duplicate click options (#1775)
### Description
Given that many of the options associated with the `Click` based cli
ingest commands are added dynamically from a number of configs, a check
was incorporated to make sure there were no duplicate entries to prevent
new configs from overwriting already added options.

### Issues that were found and fixes:
* duplicate api-key option set on Notion command conflicts with api key
used for unstructured api. Added notion prefix.
* retry logic configs had duplicates in biomed. Removed since this is
not handled by the pipeline.
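
A minimal sketch of the duplicate check described above, written against plain lists of option names rather than the actual `Click` objects. The function names and the shape of `option_groups` are hypothetical:

```python
from collections import Counter


def find_duplicate_options(option_groups):
    """Return option names that appear more than once across all groups.

    `option_groups` maps a (hypothetical) config name to the option
    names that config would add to a command.
    """
    counts = Counter(name for names in option_groups.values() for name in names)
    return sorted(name for name, count in counts.items() if count > 1)


def check_no_duplicates(option_groups):
    """Raise ValueError if any option name would be registered twice."""
    duplicates = find_duplicate_options(option_groups)
    if duplicates:
        raise ValueError(f"duplicate options: {', '.join(duplicates)}")
```

Failing loudly at registration time surfaces a clash immediately, instead of letting a later config silently overwrite an earlier option.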
2023-10-20 14:00:19 +00:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFaceEmbeddings.
2023-10-19 11:51:36 -05:00
Roman Isecke
b265d8874b
refactoring linting (#1739)
### Description
Currently linting only takes place over the base unstructured directory
but we support python files throughout the repo. It makes sense for all
those files to also abide by the same linting rules so the entire repo
was set to be inspected when the linters are run. Along with that
autoflake was added as a linter which has a lot of added benefits such
as removing unused imports for you that would currently break flake and
require manual intervention.

The only real relevant changes in this PR are in the `Makefile`,
`setup.cfg`, and `requirements/test.in`. The rest is the result of
running the linters.
2023-10-17 12:45:12 +00:00
ryannikolaidis
d9a0bd741a
fix: build test failures (#1748)
* Fix missing HF_TOKEN when running containerized test for the build
process
* Fix pytest args when running specific test

## Testing
Example run of the HF_TOKEN assigned for the containerized test in the
build process:
https://github.com/Unstructured-IO/unstructured/actions/runs/6504556437/job/17666669155

Example run of the pytest args working for the arm test (ran in a new
workflow for testing on push):
https://github.com/Unstructured-IO/unstructured/actions/runs/6504213010
2023-10-13 01:08:27 -07:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
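
The round-trip idea can be illustrated with a small stand-in. The `Element` class below is hypothetical and much simpler than the library's; it only mimics the shallow equality mentioned above:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(eq=False)
class Element:
    """Hypothetical stand-in for a partitioned element; not the library's class."""

    text: str
    metadata: dict = field(default_factory=dict)

    def __eq__(self, other):
        # Shallow equality: metadata is deliberately ignored.
        return isinstance(other, Element) and self.text == other.text


def to_json(elements):
    """Serialize elements to a JSON string with a stable key order."""
    return json.dumps([asdict(e) for e in elements], sort_keys=True)


def from_json(payload):
    """Rebuild elements from the JSON produced by to_json()."""
    return [Element(**item) for item in json.loads(payload)]


def round_trips(elements):
    """True when elements -> JSON -> elements -> JSON yields identical JSON."""
    first = to_json(elements)
    second = to_json(from_json(first))
    return first == second
```

Comparing the two JSON strings catches lost metadata that `==` on the elements would miss, because the shallow `__eq__` treats elements with the same text as equal.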
2023-10-12 19:47:55 +00:00
Trevor Bossert
6acd06987b
Remove extra index url from docs (#1711)
It’s no longer required to specify the extra index url as we utilize a
different method of gathering anonymous install analytics.
2023-10-11 19:34:49 +00:00
Trevor Bossert
ce206f1f85
add extra-index-url for scarf anonymous tracking (#1668)
This adds extra-index-url to our docs to allow for anonymous install
analytics to help us understand and improve our product.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:16:38 +00:00
Benjamin Torres
e0201e9a11
feat/add sources from unstructured inference (#1538)
This PR adds support for the `source` property from
`unstructured_inference`, allowing the user to see the origin of the
data under the `detection_origin` field when the environment variable
`UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true` is set.

In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image

yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')

sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
This prints 'yolox' as the source for every element.
2023-10-05 20:26:47 +00:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Writes each json element from the created
json files via partition and writes that content to an index.

**Bonus bug fix:** Due to a recent change where the default version of
python used in the repo was bumped to `3.10` from `3.8`, this means
running `pip-compile` now runs it against that version rather than the
lowest we support which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updated to run as a
script that checks the version of python being used first, which helps
guarantee that all dependencies meet the minimum python version
requirement.
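
As a sketch of the version guard described above, expressed in Python rather than whatever the actual script uses. The names `MINIMUM_PYTHON`, `is_minimum_python`, and `ensure_minimum_python` are hypothetical, and requiring an exact major.minor match is an assumption:

```python
import sys

# Assumption for illustration: 3.8 is the lowest supported interpreter,
# as stated in the description above.
MINIMUM_PYTHON = (3, 8)


def is_minimum_python(version_info=None, minimum=MINIMUM_PYTHON):
    """True when the interpreter's major.minor equals the minimum supported version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(minimum)


def ensure_minimum_python(version_info=None):
    """Raise before pins are compiled with an interpreter other than the minimum one."""
    if not is_minimum_python(version_info):
        wanted = ".".join(str(part) for part in MINIMUM_PYTHON)
        raise RuntimeError(f"run pip-compile with Python {wanted}")
```

Resolving pins on the lowest supported interpreter means every selected version is installable there, which is the failure mode the bug fix addresses.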

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
Yuming Long
f962a1e57d
fix: fix ingest paddle hanging issue (#1441)
## Summary

Ingest tests are hitting a paddle OOM issue that causes the tests to hang
forever. The fix here is to remove paddle from CI and set both OCR env
variables, `TABLE_OCR` and `ENTIRE_PAGE_OCR`, to `tesseract`. (A follow-up
PR will investigate why this is failing.)

## Test
please check ingest tests in CI
2023-09-19 17:20:23 +00:00
Yao You
b534b2a6cd
Chore: bump inference package version to 0.5.28 and new release (#1355)
This bump removes the preprocessing before table structure extraction
and improves the OCR results for tables.

---------

Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
2023-09-15 18:26:15 -07:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like shown in ticket

Run the same process on your local machine (not inside docker) with the same
test to verify that the non-aarch64 paddlepaddle was installed correctly

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00
Yao You
a5ca628f22
[CORE-1741] use forked pytesseract to reduce calls to tesseract (#1298)
This PR resolves
[CORE-1741](https://unstructured-ai.atlassian.net/browse/CORE-1741) by
using a new function `pytesseract.run_and_get_multiple_output`, see
forked repo for more details:
https://github.com/Unstructured-IO/unstructured.pytesseract/releases/tag/0.3.11-dev1

This halves the number of calls to `tesseract` per page of PDF/image
during partition, reducing the runtime by roughly 48%.

The new function is in forked `unstructured.pytesseract`. A PR has been
made to the upstream repo and once that is merged we should switch to
the upstream version. For now we add a new dependency:
`unstructured.pytesseract`.

## testing

Existing unit tests should serve as tests to the new function. 

To demonstrate the changes in performance:
- checkout main
- run `./scripts/performance/profile.sh` and select `ocr_only` strategy,
using the 10th document (16 page layout paper in pdf format)
- examine the speedscope profile or time profile in flamegraph -> should
see two dominant time spenders are `pytesseract.image_to_text` and
`pytesseract.image_to_boxes`, with both about the same total time (see
attached first image)
- checkout this branch
- run the same `profile.sh` with the same options
- examine the profile again and this time should notice 1) total runtime
is reduced by more than 40%; 2) only
`unstructured_pytesseract.run_and_get_multiple_output` is the top time
spender and its total time is about the same as either the
`pytesseract.image_to_text` or `pytesseract.image_to_boxes` time (see
second image below)

![Screenshot 2023-09-06 at 9 45 10
AM](https://github.com/Unstructured-IO/unstructured/assets/647930/fed6118b-a0dc-493d-bef8-85d73027c968)

![Screenshot 2023-09-06 at 9 46 37
AM](https://github.com/Unstructured-IO/unstructured/assets/647930/dd1d6369-cfba-43d4-b1c6-87a8a98b2e16)

[CORE-1741]:
https://unstructured-ai.atlassian.net/browse/CORE-1741?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
2023-09-14 23:27:18 +00:00
Yao You
12d7628b10
update constraints to pin weaviate during ci (#1408)
This PR ensures the version for `weaviate` is consistent in CI testing.
The latest version (3.24.1) is not compatible with our test needs, and the
last version that ran successfully in CI is 3.23.2.
2023-09-13 23:19:20 +00:00
Roman Isecke
59e850bbd9
Roman/downstream connector cli subcommand (#1302)
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.

Closes #1313 and #1311
2023-09-11 11:40:56 -04:00
Ahmet Melek
09cc4bfa5f
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud Organization
  - or 
    - issues in user specified projects, boards
    - user specified issues
- processes this kind of data: 
  - text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
  - notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263

To test the changes, make the necessary setups and run the relevant
ingest test scripts.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 10:10:48 +00:00
David Potter
b710bafa89
feat: add salesforce connector (#1168) 2023-09-02 08:50:31 -07:00
qued
fc9d251e4e
build(deps): Remove pillow pin (#1274)
Removed pin for `PIL` as `detectron2` repo has been updated, and so has
`unstructured-inference`.
2023-09-01 19:47:50 +00:00
Roman Isecke
ed7f991ab9
Add s3 writer (#1223)
### Description
Convert s3 cli code to also support writing to s3. Writers are added as
optional subcommands to the parent command with their own arguments.
Custom `click.Group` introduced to add some custom formatting and text
in help messages.

To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
2023-08-31 22:19:53 +00:00
Yao You
b504a48e06
dev: add py-spy profiling (#1251)
This PR adds a new developer tool for profiling performance: `py-spy`.
Additionally it adds a new make command to start a docker with your
local `unstructured` repo mounted for quick testing code in a Rocky
Linux environment (see usage below for intent).

### py-spy

It is a sampling profiler https://github.com/benfred/py-spy and in
practice usually provides more readily usable information than the commonly
used `cProfile`. It also supports output to `speedscope` format,
[which](https://github.com/jlfwong/speedscope#usage) provides a rich
view of the profiling result.

### usage

The new tool is added to the existing `profile.sh` script and is readily
discoverable in the interactive interface. When you select the new
speedscope format profile, it opens in your local browser, provided you
followed the readme to install speedscope locally via `npm install -g
speedscope`.

On macOS the profiling tool needs superuser privilege. If you are not
comfortable with that feel free to run the profiling inside a Linux
container if your local dev env is macOS.
2023-08-31 19:26:29 +00:00
Trevor Bossert
e4535d29ca
Set user for container to same as api image. (#1239)
This is a security best practice; a user can override this with their own
Dockerfile if required.
2023-08-30 01:01:44 +00:00
qued
4a5a3022a3
fix: remove duplicate target in makefile (#1235)
Removed a duplicate make target in the `Makefile`.
2023-08-29 06:49:18 +00:00
ryannikolaidis
835378aba6
ci: fix documentation build flow (#1181) 2023-08-24 00:24:03 -05:00
Austin Walker
e7d189fcc8
chore: Bump inference and set default ocr_mode to entire_page (#1172)
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
2023-08-22 16:05:02 -07:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, it would not work without them.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00
qued
cb923b96a2
build(deps): dependency cleanup (#1102)
Cleans up some pins that were prone to conflicts. All pins belong in constraints.in.
2023-08-15 05:15:44 +00:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Matt Robinson
331c7faf38
build(deps): split up dependencies by document type (#986)
* split dependencies by document type

* make pip-compile with new requirements

* add extra requirements to setup.py

* add in all docs; re pip-compile

* extra for all docs

* add pandas to xlsx

* dependency requires for tsv and csv

* handling for doc, docx and odt

* dependency check for pypandoc

* required dependencies for pandoc files

* xml and html

* markdown

* msg

* add in pdf

* add in pptx

* add in excel

* add lxml as base req

* extra all docs for local inference

* local inference installs all

* pin pillow version

* fixes for plain text tests

* fixes for doc

* update make commands

* changelog and version

* add xlrd

* update pip-compile

* pin numpy for python 3.8 support

* more constraints

* constraint on scipy

* update install docs

* constrain ipython

* add outlook to pip-compile

* more ipython constraints

* add extras to dockerfile

* pin office365 client

* few doc tweaks

* types as strings

* last pip-compile

* re pip-compile

* make tidy

* make tidy
2023-08-01 11:31:13 -04:00
David Potter
1542607892
feat: adds Box connector (#996) 2023-08-01 01:10:10 +00:00
shreyanid
c3e92057f2
Update pip in makefile (#981)
* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements
2023-07-27 21:38:51 +00:00
Yuming Long
df1ba39905
Chore: add uns api repo unittests (#954)
* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-07-26 20:55:35 +00:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
2023-07-26 04:09:26 +00:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
Yuming Long
067eb5701f
Fix: docker build with missing dependency (#931)
* pip-compile

* test trigger

* Revert "test trigger"

This reverts commit 69d4c8cd9f285f6ef4bf445f5fb27b5c62e1391c.

* version conflict and pip compile
2023-07-14 22:20:11 +00:00
rvztz
ce20c3f2bc
feat: add OneDrive connector (#834) 2023-07-13 20:57:54 +00:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844) 2023-06-30 17:08:27 -07:00
ryannikolaidis
60fe231f08
fix: use api key where needed in tests (#843)
* passes api key for unstructured-api to unit and ingest tests as needed.
* adds check for env var CI to otherwise skip tests that require an api key
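
A rough sketch of gating API-key tests on the `CI` environment variable, using the standard library's `unittest` for illustration. The helper `running_in_ci`, the accepted values, and the test class are assumptions, not the repo's actual test code:

```python
import os
import unittest


def running_in_ci(environ=None):
    """True when the CI environment variable is set to a truthy-looking value."""
    environ = os.environ if environ is None else environ
    return environ.get("CI", "").strip().lower() in {"1", "true", "yes"}


class ApiTests(unittest.TestCase):
    @unittest.skipUnless(running_in_ci(), "requires an API key that is only available in CI")
    def test_partition_via_api(self):
        # Placeholder body: a real test would call the API with the key here.
        self.assertTrue(running_in_ci())
```

Outside CI the test is reported as skipped rather than failed, so local runs do not need the key.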
2023-06-29 17:31:01 +00:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
cragwolfe
68f04159bc
chore: rm old detectron2 install from makefile (#767)
* chore: remove vestigial Makefile target and tensorboard
2023-06-16 10:05:36 -07:00
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
2023-06-01 20:48:54 +00:00
qued
d3600dd5da
build(deps): update inference version (#662)
Updated to the latest version of unstructured-inference. detectron2 now gets implemented with onnxruntime, yay!

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 13:50:15 -05:00
kravetsmic
795a9a0b4c
feat: add jupyter make commands (#651)
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-05-31 14:01:23 +00:00