103 Commits

Author SHA1 Message Date
cragwolfe
918a3d0deb
fix: allow users to install package with python3.13 or higher (#3893)
Although, python3.13 is not officially supported or tested in CI just
yet.
2025-01-30 14:52:24 +00:00
Roman Isecke
9049e4e2be
feat/remove ingest code, use new dep for tests (#3595)
### Description
Alternative to https://github.com/Unstructured-IO/unstructured/pull/3572
but maintaining all ingest tests, running them by pulling in the latest
version of unstructured-ingest.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-10-15 10:01:34 -05:00
Roman Isecke
ebf16055d8
feat/add deprecation warning to all embed code (#3614)
### Description
Related PR to move the code over:
https://github.com/Unstructured-IO/unstructured-ingest/pull/92

Also removed the console script that exposes ingest.
2024-09-10 23:48:39 +00:00
Matt Robinson
6ba8135bf9
fix: check ole storage content to differentiate filetypes (#3581)
### Summary

Updates the file detection logic for OLE files to check the storage
content of the file to more reliable differentiate between DOC, PPT, XLS
and MSG files. This corrects a bug that caused file type detection to be
incorrect in cases where the `filetype` library guessed and incorrect
MIME type, such as `'application/vnd.ms-excel'` for a `.msg` file.

As part of this work, the `"msg"` extra was removed because the
`python-oxmsg` package is now a base dependency.

### Testing

Using a test `.msg` file that returns `'application/vnd.ms-excel'` from
`filetype.guess_mime`.

```python
from unstructured.file_utils.filetype import detect_filetype

filename = "test-file.msg"
detect_filetype(filename=filename) # result should be FileType.MSG
```
2024-08-30 15:12:46 -04:00
David Potter
ddba928344
Potter/mixedbread embedder (#3513)
Thanks to @huangrpablo and @juliuslipp we now have a mixedbread.ai
embedder!
2024-08-27 14:52:13 +00:00
David Potter
59ec64235b
chore: rename astra to astradb (#3458)
DataStax wanted all references to be astradb instead of astra. As per
@erichare

We'll also have to do the same in unstructured-ingest :)
2024-08-05 20:41:02 +00:00
Roman Isecke
f1a28600d9
feat/singlestore dest connector (#3320)
### Description
Adds [SingleStore](https://www.singlestore.com/) database destination
connector with associated ingest test.
2024-07-03 15:15:39 +00:00
David Potter
8610bd3ab9
feat: Kafka source and destination connector (#3176)
Thanks to @tullytim we have a new Kafka source and destination
connector. It also works with hosted Kafka via Confluent.

Documentation will be added to the Docs repo.
2024-06-22 23:26:23 +00:00
Matt Robinson
23e570fc8a
docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to
https://docs.unstructured.io and cleans up a few sections of the README.
Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is
only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python
3.12.
2024-05-30 16:22:54 +00:00
Matt Robinson
6b400b46fe
feat: add VoyageAI embeddings (#3069) (#3099)
Original PR was #3069. Merged in to a feature branch to fix dependency
and linting issues. Application code changes from the original PR were
already reviewed and approved.

------------
Original PR description:
Adding VoyageAI embeddings 
Voyage AI’s embedding models and rerankers are state-of-the-art in
retrieval accuracy.

---------

Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com>
Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>
2024-05-24 21:48:35 +00:00
Na'aman Hirschfeld
8802535e95
chore: add py.typed (#3043)
This PR adds `py.typed`, which will fix issues of the following type:

![Uploading Screenshot 2024-05-17 at 12.13.33.png…]()

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-05-20 08:52:04 -04:00
Matt Robinson
d7608014c0
improve: add Python 3.12 support (#3033) (#3047)
### Summary

Closes #2959. Updates the dependency and CI to add support for Python
3.12.

The MongoDB ingest tests were disabled due to jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
failing due to issues with the `bson` package. `bson` is a dependency
for the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. Issue seems unrelated to Python 3.12, though
unsure why this didn't surface previously.

Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back in once the PR
references below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and you install `argilla` yourself.
- https://github.com/argilla-io/argilla/pull/4837

---------

Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
2024-05-19 23:03:15 +00:00
cragwolfe
1621a70755
fix: Brings back missing word list files (#2857)
Fixes https://github.com/Unstructured-IO/unstructured/issues/2855
2024-04-04 23:38:15 -07:00
Roman Isecke
d6f2841ff4
feat: update dependencies and remove constraint on pydantic (#2841)
### Description
* The `consistent-deps.sh` was fixed to take into account the ingest
dependencies, causing some errors to show up. New constriants were added
to make that script pass.
* Update all requirements without constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict but there's a fix on their `main`
branch, just need for the next release to be published. Opened up a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now
added `lxml<5` to the constraints.

A couple optimizations: 
* `constraints.in` renamed to `constraints.txt` since the whole point is
all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory as this
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new location of
`base.in` and `constraints.txt`
* make file updated since it was originally written to avoid the
`base.in` and `constraints.in` file
2024-04-04 19:58:23 +00:00
Ahmet Melek
d46792214a
feat: add vertexai embeddings (#2693)
This PR:
- Adds VertexAI embeddings as an embedding provider

Testing
- Tested with pinecone destination connector on
[this](https://github.com/Unstructured-IO/unstructured/actions/runs/8429035114/job/23082700074?pr=2693)
job run.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-03-28 21:15:36 +00:00
David Potter
9177aa20a8
feature CORE-3985: add Clarifai destination connector (#2633)
Thanks to @mogith-pn from Clarifai we have a new destination connector!

This PR intends to add Clarifai as a ingest destination connector.

Access via CLI and programmatic.
Documentation and Examples.
Integration test script.
2024-03-21 16:36:21 +00:00
Roman Isecke
c02cfb89d3
bug/unstructured-ingest produces ModuleNotFoundError: No module named 'unstructured.txtgest (#2661)
Quick fix for issue
https://github.com/Unstructured-IO/unstructured/issues/2658
2024-03-18 19:08:29 +00:00
Roman Isecke
9c1c41f493
BUGFIX: fix dependencies in setup.py (#2605)
### Description
Currently the requirements associated with an extra in the `setup.py` is
being dynamically generated using the `load_requirements()` method in
the same file. This is being passed in all the `.in` files which then
get read line by line to generate the requirements associated with an
extra. Unless the `.in` file itself has a version pin, this will never
respect the `.txt` files being generated by `pip-compile`. This fix
updates all the inputs to `load_requirements()` to use the `.txt` files
themselves.
2024-03-06 18:59:08 +00:00
David Potter
e8ec09c8b9
feat: astra dest connector (#2571)
Thanks to Eric Hare @erichare at DataStax we have a new destination
connector.

This Pull Request implements an integration with [Astra
DB](https://datastax.com) which allows for the Astra DB Vector Database
to be compatible with Unstructured's set of integrations.

To create your Astra account and authenticate with your
`ASTRA_DB_APPLICATION_TOKEN`, and `ASTRA_DB_API_ENDPOINT`, follow these
steps:

1. Create an account at https://astra.datastax.com
2. Login and create a new database
3. From the database page, in the right hand panel, you will find your
API Endpoint
4. Beneath that, you can create a Token to be used

Some notes about Astra DB:

- Astra DB is a Vector Database which allows for high-performance
database transactions, and enables modern GenAI apps [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html)
- It supports similarity search via a number of methods [See
here](https://docs.datastax.com/en/astra/astra-db-vector/get-started/concepts.html#metrics)
- It also supports non-vector tables / collections
2024-02-23 20:50:50 +00:00
David Potter
0f0b58dfe7
bug: remove vectara requirements (#2491)
I accidentally added Vectara to setup and makefile. But there are no
dependencies for Vectara

This removes Vectara from those files.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 20:41:53 +00:00
David Potter
c100ce28a7
feat: add Vectara destination connector (#2357)
Thanks to Ofer at Vectara, we now have a Vectara destination connector.

- There are no dependencies since it is all REST calls to API
-

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-02-01 14:38:34 +00:00
qued
007fc45739
chore: new black changes (#2473)
Update `black` and apply changes to affected files. I separated this PR
so we can have a look at the changes and decide whether we want to:
1. Go forward with the new formatting
2. Change the black config to make the old formatting valid
3. Get rid of black entirely and just use `ruff`
4. Do something I haven't thought of
2024-01-30 17:12:35 +00:00
ryannikolaidis
44cb2ce645
fix: path to databricks-volumes requirements (#2443)
setup.py is currently pointing to the wrong location for the
databricks-volumes extra requirements. This PR updates to point to the
correct location.

## Testing
Tested by installing from local source with `pip install .`
2024-01-23 03:28:36 +00:00
Roman Isecke
a8de52e94f
feat: databricks volumes dest added (#2391)
### Description
This adds in a destination connector to write content to the Databricks
Unity Catalog Volumes service. Currently there is an internal account
that can be used for testing manually but there is not dedicated account
to use for testing so this is not being added to the automated ingest
tests that get run in the CI.

To test locally: 
```shell
#!/usr/bin/env bash

path="testpath/$(uuidgen)"

PYTHONPATH=. python ./unstructured/ingest/main.py local \
    --num-processes 4 \
    --output-dir azure-test \
    --strategy fast \
    --verbose \
    --input-path example-docs/fake-memo.pdf \
    --recursive \
    databricks-volumes \
    --catalog "utic-dev-tech-fixtures" \
    --volume "small-pdf-set" \
    --volume-path "$path" \
    --username "$DATABRICKS_USERNAME" \
    --password "$DATABRICKS_PASSWORD" \
    --host "$DATABRICKS_HOST"
```
2024-01-23 01:25:51 +00:00
David Potter
4a34765fdf
chore: add postgres extra (#2422)
postgres missing in setup.py

Co-authored-by: potter-potter <david.potter@gmail.com>
2024-01-18 16:32:53 +00:00
ryannikolaidis
d25e6081d8
chore: add opensearch extra (#2419) 2024-01-18 05:21:37 +00:00
Roman Isecke
b37b4689bc
drop python3.8 (#2372)
### Description
Remove all uses of python3.8

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-01-09 23:37:30 +00:00
ryannikolaidis
dd1443ab6f
feat: add Qdrant ingest destination connector (#2338)
This PR intends to add [Qdrant](https://qdrant.tech/) as a supported
ingestion destination.

- Implements CLI and programmatic usage.
- Documentation update
- Integration test script

---
Clone of #2315 to run with CI secrets

---------

Co-authored-by: Anush008 <anushshetty90@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2024-01-02 22:08:20 +00:00
David Potter
4b8352e0f5
feat: add chroma destination connector (#2240)
Adds Chroma (also known as ChromaDB) as a vector destination.

Currently Chroma is an in-memory single-process oriented library with
plans of a hosted and/or more production ready solution
-https://docs.trychroma.com/deployment

Though they now claim to support multiple Clients hitting the database
at once, I found that it was inconsistent. Sometimes multiprocessing
worked (maybe 1 out of 3 times) But the other times I would get
different errors. So I kept it single process.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-19 16:58:23 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
rvztz
ce905dd098
feat: Weaviate destination connector (#1963)
Closes #1781.
- Adds a Weaviate destination connector
- The connector receives a host for the weaviate instance and a weaviate
class name.
- Defines a weaviate schema for json elements.
- Defines the pre-processing to conform unstructured's schema to the
proposed weaviate schema.
2023-12-01 22:27:41 +00:00
Ahmet Melek
ed08773de7
feat: add pinecone destination connector (#1774)
Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039 

This PR:
- Uses Pinecone python cli to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node  in ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- For the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes; this PR
tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
with its ingest test
- Solves a bug on ingest chunking params with [bugfix on chunking params
and implementing related
test](69e1949a6f)

---------

Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
2023-11-29 22:37:32 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Roman Isecke
b8af2f18bb
add mongo db destination connector (#2068)
### Description
This adds the basic implementation of pushing the generated json output
of partition to mongodb. None of this code provisions the mondo db
instance so things like adding a search index around the embedding
content must be done by the user. Any sort of schema validation would
also have to take place via user-specific configuration on the database.
This update makes no assumptions about the configuration of the database
itself.
2023-11-16 22:40:22 +00:00
Ahmet Melek
ca78dc737a
feat: extend ingest options to support multiple embedding modules, add deterministic ingest test for embeddings (#1918)
Closes #1782 

This PR:
- Extends ingest pipeline so that it is possible to select an embedding
provider from a range of providers
- Modifies the ingest embedding test to be a diff test, since the
embedding vectors are reproducible after supporting multiple providers

Additional info on the chosen provider for the test:
- Found `langchain.embeddings.HuggingFaceEmbeddings` to be deterministic
even when there's no seed set
- Took 6.84s to pass a unit test with the provider (without cache,
including model download)
- `langchain.embeddings.HuggingFaceEmbeddings` runs in local, making it
zero cost

For all these reasons, testing embedding modules with the Huggingface
model seems to be making sense

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-11-06 12:26:12 +00:00
Roman Isecke
4802332de0
Roman/optimize ingest ci (#1799)
### Description
Currently the CI caches the CI dependencies but uses the hash of all
files in `requirements/`. This isn't completely accurate since the
ingest dependencies are installed in a later step and don't affect the
cached environment. As part of this PR:
* ingest dependencies were isolated into their own folder in
`requirements/ingest/`
* A new cache setup was introduced in the CI to restore the base cache
-> install ingest dependencies -> cache it with a new id
* new make target created to install all ingest dependencies via `pip
install -r ...`
* updates to Dockerfile to use `find ...` to install all dependencies,
avoiding the need to update this when new deps are added.
* update to pip-compile script to run over all `*.in` files in
`requirements/`
2023-10-24 14:54:00 +00:00
Mallori Harrell
00635744ed
feat: Adds local embedding model (#1619)
This PR adds a local embedding model option as an alternative to using
our OpenAI embedding brick. This brick uses LangChain's
HuggingFacEmbeddings.
2023-10-19 11:51:36 -05:00
Jack Retterer
b8f24ba67e
Added AWS Bedrock embeddings (#1738)
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.

Test

- find your aws secret access key and key id; make sure the account has
access to bedrock's tian embed model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)

---------

Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
2023-10-18 19:36:51 -05:00
cragwolfe
5b994f37ae
build(release): actually make the release 0.10.18 (#1576)
Workaround a publication issue to pypi. No code changes, 0.10.18 is the
new 0.10.17.
2023-09-28 23:09:18 -07:00
Trevor Bossert
792232dcc5
Chore: move scarf to setup.py (#1569)
This also follows what I have seen as the recommend way to define a file
package like this.

Also bumps minor versions from pip compile

Testing:
`pip install -e .`
Everything should build as normal

`❯ pip install -e .
Obtaining file:///Users/trevor/dev/unstructured
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting scarf@ https://packages.unstructured.io/scarf.tgz (from
unstructured==0.10.17.dev16)
  Using cached https://packages.unstructured.io/scarf.tgz (1.1 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done`

When new release goes out, I will test just plain pip install to verify
that functionality still works
2023-09-28 16:18:14 -07:00
Roman Isecke
5c7b4f586b
Roman/azure cognitive embeddings (#1524)
### Description
This PR is two-fold:  

**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.

**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type and a test script has been added to
test the new content being produced from the Sharepoint connector to
push the embedding content.

Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output if that was added. While
the code originally added the embedding content, when `to_dict` was
called to save the content as json, this was lost.
2023-09-26 23:24:21 +00:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Writes each json element from the created
json files via partition and writes that content to an index.

**Bonus bug fix:** Due to a recent change where the default version of
python used in the repo was bumped to `3.10` from `3.8`, this means
running `pip-compile` now runs it against that version rather than the
lowest we support which is still `3.8`. This breaks the setup for those
lower versions because some of the versions pulled in by `pip-compile`
exist for `3.10` but not `3.8`. `pip-compile` was updates to run as a
script that checks the version of python being used first, which helps
guarantee that all dependencies meet the minimum python version
requirement.

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Trevor Bossert
09a0958f90
Feat: CORE-1269 - Install paddlepaddle wheel dependent on arch, supporting aarch64 (#1350)
Testing instructions

on Apple silicon

```
make docker-build
docker run -it unstructured:dev bash
python3
```
Then run the test in this PR
https://unstructured-ai.atlassian.net/browse/CORE-1269

You should get output like shown in ticket

Run the same process on your local machine (not inside docker) with same
test to verify the non aarch64 paddlepaddle got installed correctly

---------

Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
2023-09-15 17:05:48 -07:00
Ahmet Melek
09cc4bfa5f
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud Organization
  - or 
    - issues in user specified projects, boards
    - user specified issues
- processes this kind of data: 
  - text fields such as issue summary, description, and comments
- dropdown fields such as issue type, status, priority, assignee,
reporter, labels, and components
- other data such as issue id, issue key, project id, information on
subtasks
  - notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263

To test the changes, make the necessary setups and run the relevant
ingest test scripts.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 10:10:48 +00:00
David Potter
b710bafa89
feat: add salesforce connector (#1168) 2023-09-02 08:50:31 -07:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Matt Robinson
ad595d32f6
enhancement: tell users to install missing extras (#1167)
### Summary

Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.

### Testing

First `pip uninstall ebooklib`. Then run

```python
from unstructured.partition.auto import partition

partition(filename="example-docs/winter-sports.epub")
```

The error should look like

```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
2023-08-22 03:00:21 +00:00
Roman Isecke
db8af4f5de
Roman/notion tests (#1072)
### Description
* Add ingest test for Notion docs
* Update default cache dir for connectors to include connector name.
Makes debugging the cached content easier.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-21 15:16:50 -04:00
qued
cb923b96a2
build(deps): dependency cleanup (#1102)
Cleans up some pins that were prone to conflicts. All pins belong in constraints.in.
2023-08-15 05:15:44 +00:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00