61 Commits

Author SHA1 Message Date
Christine Straub
b30d6a601e
Fix/1209 tweak xycut ordering output (#1630)
Closes GH Issue #1209.

### Summary
- add swapped `xycut` sorting
- update `xycut` sorting evaluation script

PDFs:
-
[sbaa031.073.pdf](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7234218/pdf/sbaa031.073.pdf)
-
[multi-column-2p.pdf](https://github.com/Unstructured-IO/unstructured/files/12796147/multi-column-2p.pdf)
-
[11723901.pdf](https://github.com/Unstructured-IO/unstructured-inference/files/12360085/11723901.pdf)
### Testing
```
elements = partition_pdf("sbaa031.073.pdf", strategy="hi_res")
print("\n\n".join([str(el) for el in elements]))
```
### Evaluation
```
PYTHONPATH=. python examples/custom-layout-order/evaluate_xy_cut_sorting.py sbaa031.073.pdf hi_res xycut_only
```
2023-10-05 07:41:38 +00:00
Roman Isecke
9d81971fcb
update ingest python doc (#1446)
### Description
Updating the python version of the example docs to show how to run the
same code that the CLI runs, but using python. Rather than copying the
same command that would be run via the terminal and using the subprocess
library to run it, this updates it to use the supported code exposed in
the inference directory.

For now only the wikipedia one has been updated to get some opinions on
this before updating all other connector docs.

Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
2023-10-03 10:01:41 -04:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Writes each json element from the created
json files via partition and writes that content to an index.

**Bonus bug fix:** Due to a recent change where the default version of
python used in the repo was bumped to `3.10` from `3.8`, this means
running `pip-compile` now runs it against that version rather than the
lowest we support which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updates to run as a
script that checks the version of python being used first, which helps
guarantee that all dependencies meet the minimum python version
requirement.

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Trevor Bossert
3e04110bab
Chore: Pin unstructured-inference in extra-pdf-image (#1474)
This is so users are able to upgrade it when unstructured library is
updated.
2023-09-22 09:41:53 -07:00
Ahmet Melek
9e88929a8c
feat: document embeddings (#1368)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements EmbeddingEncoder classes which track embedding related data
- implements embed_documents method which receives a list of Elements,
obtains embeddings for the text within Elements, updates the Elements
with an attribute named embeddings , and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.

To test the changes, run `examples/embed/example.py`
2023-09-20 19:55:30 +00:00
Ryan Nikolaidis
8c1d03e5cf update slack invite 2023-09-20 00:02:03 -07:00
Roman Isecke
333558494e
roman/delta lake dest connector (#1385)
### Description
Add delta table downstream destination connector

Closes https://github.com/Unstructured-IO/unstructured/issues/1415
2023-09-15 22:13:39 +00:00
Roman Isecke
59e850bbd9
Roman/downstream connector cli subcommand (#1302)
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.

Closes #1313 and #1311
2023-09-11 11:40:56 -04:00
Ahmet Melek
09cc4bfa5f
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud Organization
  - or 
    - issues in user specified projects, boards
    - user specified issues
- processes this kind of data: 
  - text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
  - notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263

To test the changes, make the necessary setups and run the relevant
ingest test scripts.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 10:10:48 +00:00
cragwolfe
a475b447e8
doc: add colab link for xycut sorting (#1288) 2023-09-03 20:19:40 +00:00
David Potter
b710bafa89
feat: add salesforce connector (#1168) 2023-09-02 08:50:31 -07:00
Roman Isecke
ed7f991ab9
Add s3 writer (#1223)
### Description
Convert s3 cli code to also support writing to s3. Writers are added as
optional subcommands to the parent command with their own arguments.
Custom `click.Group` introduced to add some custom formatting and text
in help messages.

To limit the scope of this PR, most existing files were not touched but
instead new files were added for the new flow. This allowed _only_ the
s3 connector to be updated without breaking any other ones.
2023-08-31 22:19:53 +00:00
Christine Straub
483b09b3c9
Feat/1136 elements ordering for pdf (#1161)
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach

### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
2023-08-24 17:46:19 -07:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Sebastian Laverde Alfonso
fe5048a834
feat: chipper local inference notebook (#1116)
Download chipper model for local use and demonstrate how to partition a .pdf document
through the unstructured and unstructured_inference libraries.
2023-08-15 20:43:23 -07:00
Mark Risher
612f9da6e8
Update news-of-the-day.ipynb - typo (#1113)
Fixed typo
2023-08-14 16:48:49 +00:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of reseources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Ronny H
b31c62fa84
replace Weaviate nearText with BM25 query algorithm (#1078) 2023-08-10 22:15:27 +00:00
rvztz
dee9b405cd
feat: Sharepoint connector (#918) 2023-08-10 09:37:58 -07:00
Roman Isecke
50389f15a8
roman/Add notion connector (#1033)
* Add notion connector and supporting code

* minor fixes

* Add notion deps to extras

* Use the same return type for both helper methods

* Don't ignore types that aren't recognized when mapping json

* Add support for recursively getting docs

* Add recursive search for databases

* fix logging

* fix linting

* remove debugging code
2023-08-08 22:01:25 -04:00
Matt Robinson
ac7efa19e7
docs: news of the day (chroma + langchain) (#1054)
* news of the day notebook

* readme and requirements

* change to single mode instead of elements
2023-08-08 11:36:04 -04:00
Sebastian Laverde Alfonso
084ead173a
chore: custom layout order example notebook (#1024)
* chore: CFR double column sample

Federal Regulations document for example notebook in `examples/custom-layout-order`

* chore: custom-layout-order example dir

* feat: helper methods to plot and reorder layouts

Helper methods: `plot_image_with_bounding_boxes_coloured` and `reorder_elements_in_double_columns`

* chore: delete __init__.py

---------

Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>
2023-08-02 18:29:04 -06:00
David Potter
1542607892
feat: adds Box connector (#996) 2023-08-01 01:10:10 +00:00
Roman Isecke
28214a6cc3
Roman/ingest refactor (#978)
* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Seperate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-31 13:20:10 -04:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
2023-07-26 04:09:26 +00:00
Ahmet Melek
b7674fb97e
feat: confluence connector (cloud) (#906)
* Add confluence connector and an example script

* add test script, add dependency installations

* add authentication secret variables for ci tests and actions

* add dependency installation commands for workflows

* add dependency installation commands for workflows

* Update ingest test fixtures (#907)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add add ingest test fixtures update workflow for python 3.10, update example script with dummy values

* change workflow name to avoid confusion

* change workflow name to avoid confusion

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update

* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions

* Update ingest test fixtures (#911)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* revert back the test python version matrix

* recompile dependencies

* modifications for shellcheck

* update changelog and version

* changelog and version

* remove comments

* Update ingest test fixtures (#915)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* add the option to state the number of spaces to be fetched

* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users

* add help message

* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test

* change test names

* rename connector arg

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* change arg name for connector

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* add comment to example

* change arg names

* add new tests to ingest test

* shellcheck remove redundant statement

* Update ingest test fixtures (#932)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* Update ingest test fixtures (#936)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* linting

* change file extensions to parse as html

* Update ingest test fixtures (#943)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* remove old fixtures

* update version to 0.8.2-dev3

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

* change file to trigger CI

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-07-18 19:29:41 +01:00
rvztz
ce20c3f2bc
feat: add OneDrive connector (#834) 2023-07-13 20:57:54 +00:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00
David Potter
bec733cdf8
feat: add Dropbox connector (#844) 2023-06-30 17:08:27 -07:00
David Potter
3b472cb7df
feat: add google cloud storage connector (#746) 2023-06-21 15:14:50 -07:00
Sebastian Laverde Alfonso
508ce48d54
Feat: notebook for Elasticsearch integration (#681)
* feat: nb elasticsearch unstructured sentiment

* chore: refactor readme for elasticsearch nb

* fix: update es-credentials.ini

* chore: update es-credentials.ini

* fix: type in nb load-into-es.ipynb

exist --> exists

* fix: typo 2 in nb load-into-es.ipynb

obtaing --> obtain
2023-06-05 19:05:08 +00:00
Matt Robinson
c35fff2972
feat: Add stage_for_weaviate and schema creation function (#672)
* add weaviate docker compose

* added staging brick and tests for weaviate

* initial notebook and requirements file

* add commentary to weaviate notebook

* weaviate readme

* update docs

* version and change log

* install weaviate client

* install weaviate; skip for docker

* linting, linting, linting

* install weaviate client with deps

* comments on weaviate client

* fix module not found error for docker container

* skipped wrong test in docker

* fix typos

* add in local-inference
2023-06-01 20:48:54 +00:00
Yuming Long
ab5f92dd79
Fix(ingest): Deprecate --s3-url in favor of --remote-url (#616)
* deprecation s3-url

* changelopg and versioin

* download dir not now
2023-05-19 12:11:40 -04:00
Mallori Harrell
34d563c1fc
feat: Create spacy notebook example (#593)
* add new notebook for spacy
2023-05-17 15:42:15 -05:00
Trevor Bossert
830d67f653
Feat: Discord connector (#515)
* Initial commit of discord connector

based off of initial work by @tnachen with modifications

https://github.com/tnachen/unstructured/tree/tnachen/discord_connector

* Add test file

change format of imports

* working version of the connector

More work to be done to tidy it up and add any additional options

* add to test fixtures update

* fix spacing

* tests working, switching to bot testing channel

* add additional channel

add reprocess to tests

* add try clause to allow for exit on error

Update changelog and bump version

* add updated expected output filtes

* add logic to check if —discord-period is an integer

Add more to option description

* fix lint error

* Update discord reqs

* PR feedback

* add newline

* another newline

---------

Co-authored-by: Justin Bossert <packerbacker21@hotmail.com>
2023-05-16 11:46:30 -07:00
Matt Robinson
e052c2a9b2
docs: example of how to use unstructured with pgvector (#571)
* pgvector requirements

* first pass on pgvector notebook and sql alchemy file

* created code for loading vectors into db

* added query for embedding distance

* updates to pgvector notebook

* update function with time decay

* update pgvector notebook to use example code

* remove old create table script

* add readme for pgvector

* update example to use get_date()
2023-05-12 13:54:38 -04:00
Matt Robinson
19beb24e03
docs: unstructured -> MySQL example (#557)
* added requirements for mysql

* first bit of mysql notebook

* update requirements file

* wrap with mysql example

* update readme with install instructions
2023-05-09 13:26:49 +00:00
pravin-unstructured
4020da56ad
Went through this demo notebook with Matt. Decision was made to add it to our collection of examples for use later. (#484) 2023-04-17 11:53:25 -04:00
Trevor Bossert
cff7f4fd5a
Slack connector (#462)
This connector takes a slack channel id, token and other options to
pull conversation history for a channel and store it as a text file that
is then processed by unstructured into expected output.
2023-04-16 19:34:43 +00:00
natygyoon
7f6e094c1f
feat: add local file system connector for unstructured-ingest (#399)
* added local connector to unstructured-ingest
2023-03-29 15:53:23 -07:00
cragwolfe
ce9fc26009
feat: add ability to pass headers in partition_html (#397)
Also adds pytest-mock requirement, those fixtures are nice to have!

Implements issue/feature #396 .
2023-03-23 20:14:57 -07:00
Habeeb Shopeju
2ca843782c
Connector for Biomedical Literature (#345)
The implementation involves the introduction of SimpleBiomedConfig, BiomedIngestDoc and BiomedConnector which ingests documents from the PDF Download.
2023-03-11 01:09:54 +00:00
Alvaro Bartolome
5291a96616
Add AzureBlobStorageConnector (#353)
* Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting
from `FsspecConnector`
* Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in
  favor of `--remote-url`.

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
2023-03-10 15:43:40 -08:00
Tom Aarsen
1580c1bf8e
feat: Add GitLab ingest connector (#349)
Add GitLab data connector for ingest.

Involves more general Git functionality that is shared between the GitHub and GitLab data connectors.

Prevent code duplication for functionality between GitHub and GitLab ingest connectors.

Renamed github-access-token, github-branch and github-file-glob to git-access-token, git-branch and git-file-glob, respectively.

These work for GitHub and GitLab.
2023-03-08 00:15:21 -08:00
Habeeb Shopeju
4117f57e14
Connector for Google Drive (#294)
Implements issue #244
2023-03-07 06:01:02 +00:00
Ikko Eltociear Ashimine
213077e2ab
docs: update sec-sentiment-analysis.ipynb (#342)
Huggingface -> Hugging Face
2023-03-06 15:16:14 +00:00
Tom Aarsen
54a6db1c2c
feat: Add Wikipedia ingest connector (#299)
The connector can process a Wikipedia page
and output the HTML,
the plain text contents,
and the summary.
No API key required
Also add test case verifying that 3 files are indeed created (one for HTML, one for text, one for the summary).
2023-02-28 08:25:11 +00:00
Tom Aarsen
ded60afda9
feat: Add GitHub data connector; add Markdown partitioner (#284) 2023-02-27 14:36:44 -08:00
Matt Robinson
9b0dbc7026
build(deps): bump dependencies; resolve security issues in example dependencies (#300)
* bump cryptography version

* re pip-compile for latest versions

* update argilla example requirements

* dependency updates

* bump versions

* pin unstructured-inference due to multithreading issue

* linting, linting, linting

* dependency on one line
2023-02-27 12:45:28 -05:00