172 Commits

Author SHA1 Message Date
Amanda Cameron
d0c84d605c
chore: updating table docs with file extensions (#1702)
gh issue: https://github.com/Unstructured-IO/unstructured/issues/1691

Adding filetype extensions from this
[list](f98d5e65ca/unstructured/file_utils/filetype.py (L154-L200))
where applicable.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Crag Wolfe <crag@unstructuredai.io>
2023-10-14 14:14:52 -07:00
Ahmet Melek
94836cfad4
feat: add file-based access permissions for SharePoint ingest (#1628)
This PR:

- defines rbac_data as a SourceMetadata field,
- manages connections to an external api for obtaining rbac data with
ConnectorRBAC class,
- serializes rbac data and saves it to the disk,
- matches the rbac_data in the disk to each IngestDoc, using a common
field,
- forwards rbac data to Elements, via the partition() function

To test the changes, run `examples/ingest/sharepoint/ingest.sh` with the
relevant rbac & connector credentials

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-10-13 00:38:08 +00:00
Dev Khant
f09b87da23
Doc : replace link upstream connectors with source connectors (#1683)
Fixes #1502

Here I have replaced `stream_connectors.html` with
`source_connectors.html`.
2023-10-09 21:37:51 -07:00
Amanda Cameron
f98d5e65ca
chore: adding max_characters to other element type chunking (#1673)
This PR adds the `max_characters` (hard max) param to non-table element
chunking. Additionally updates the `num_characters` metadata to
`max_characters` to make it clearer which param we're referencing.

To test:

```python
from unstructured.partition.html import partition_html

filename = "example-docs/example-10k-1p.html"
chunk_elements = partition_html(
    filename,
    chunking_strategy="by_title",
    combine_text_under_n_chars=0,
    new_after_n_chars=50,
    max_characters=100,
)

for chunk in chunk_elements:
    print(len(chunk.text))

# Previously, only the "soft max" (default of 500) was respected for elements other than tables.
# Now all the elements should have text fields under 100 chars.
```
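
To make the "hard max" idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper `split_to_hard_max` is a hypothetical name, and it only illustrates the guarantee that no emitted piece is longer than `max_characters`.

```python
from typing import List


def split_to_hard_max(text: str, max_characters: int) -> List[str]:
    """Cut text into consecutive pieces of at most max_characters each."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    return [text[i : i + max_characters] for i in range(0, len(text), max_characters)]


# Hypothetical input: 250 characters with a hard max of 100.
pieces = split_to_hard_max("x" * 250, max_characters=100)
print([len(piece) for piece in pieces])  # [100, 100, 50]
```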

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-09 19:42:36 +00:00
Jack Retterer
7e310ecac2
Update Getting Started Guide in Documentation (#1667)
- Fixed typo that stated "infer_table_structured" instead of
"infer_table_structure"

Co-authored-by: cragwolfe <crag@unstructured.io>
2023-10-07 01:12:52 +00:00
Ronny H
8564d920ac
Update Metadata and Installation Documentation (#1646)
* Updated Metadata page: add common and additional metadata fields by
document types and connectors
* Updated specific installation extra by document types and connectors
* Added embedding brick page in Sphinx TOC
* Fixed Sphinx warnings in new pages
2023-10-05 01:25:41 +00:00
Manirevuri
13453d6358
Fix: Documentation for Unstructured API's (#1624)
Fixed "files=file_data" param for all python files

---------

Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-03 20:42:32 +00:00
Roman Isecke
9d81971fcb
update ingest python doc (#1446)
### Description
Updating the python version of the example docs to show how to run the
same code that the CLI runs, but using python. Rather than copying the
same command that would be run via the terminal and using the subprocess
library to run it, this updates it to use the supported code exposed in
the inference directory.

For now only the wikipedia one has been updated to get some opinions on
this before updating all other connector docs.

Would close out
https://github.com/Unstructured-IO/unstructured/issues/1445
2023-10-03 10:01:41 -04:00
Roman Isecke
5c7b4f586b
Roman/azure cognitive embeddings (#1524)
### Description
This PR is two-fold:  

**Embeddings:**
* Embeddings incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key provided.

**Writing vector content (embeddings) to Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type and a test script has been added to
test the new content being produced from the Sharepoint connector to
push the embedding content.

Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements to add `embeddings` to the dict output if that was added. While
the code originally added the embedding content, when `to_dict` was
called to save the content as json, this was lost.
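
The `to_dict` patch described above can be sketched roughly as follows. This is an illustration only: the `Element` class and the `add_embeddings_to_dict` helper are hypothetical stand-ins, not the code from this PR.

```python
from typing import Any, Callable, Dict, List


class Element:
    """Hypothetical stand-in for an element type; not the library's class."""

    def __init__(self, text: str) -> None:
        self.text = text

    def to_dict(self) -> Dict[str, Any]:
        return {"text": self.text}


def add_embeddings_to_dict(element: Element, embeddings: List[float]) -> Element:
    """Store embeddings on the element and wrap to_dict so they are serialized."""
    element.embeddings = embeddings
    original_to_dict: Callable[[], Dict[str, Any]] = element.to_dict

    def patched_to_dict() -> Dict[str, Any]:
        data = original_to_dict()
        data["embeddings"] = element.embeddings
        return data

    # Patch this instance only; other elements keep the original method.
    element.to_dict = patched_to_dict
    return element


element = add_embeddings_to_dict(Element("hello"), [0.1, 0.2])
print(element.to_dict())  # {'text': 'hello', 'embeddings': [0.1, 0.2]}
```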
2023-09-26 23:24:21 +00:00
Ronny H
868cac5bd5
Fixed Sphinx warning errors (#1438)
Fixed issue #1437 - resolved the Warning errors when building sphinx
with `make html`.

test:
1. `cd` into the `docs` folder and run `rm -rf build`
2. `pip install -r requirements.txt`
3. run `make html`
2023-09-26 04:20:16 +00:00
Trevor Bossert
2a24c81852
Update docker download url to use scarf gateway (#1523)
This updates the docker image download url to pass through the scarf
gateway, which allows anonymous tracking of downloads.

Related to:
https://github.com/Unstructured-IO/unstructured#chart_with_upwards_trend-analytics

Testing:
docker pull
downloads.unstructured.io/unstructured-io/unstructured:latest

Result:
Image should download
2023-09-25 14:52:39 -07:00
Roman Isecke
bd49cfbab7
feat: adds Azure Cognitive Search (full text) destination connector (#1459)
### Description
New [Azure Cognitive
Search](https://azure.microsoft.com/en-us/products/ai-services/cognitive-search)
destination connector added. Takes each JSON element from the JSON files
created via partition and writes that content to an index.

**Bonus bug fix:** A recent change bumped the default version of
python used in the repo from `3.8` to `3.10`, so running `pip-compile`
now runs against that version rather than the lowest we support, which
is still `3.8`. This breaks the setup for those lower versions because
some of the versions pulled in by `pip-compile` exist for `3.10` but not
`3.8`. `pip-compile` was updated to run as a script that checks the
version of python being used first, which helps guarantee that all
dependencies meet the minimum python version requirement.
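
A version guard of this kind might look like the sketch below. It assumes the check requires the interpreter to match the lowest supported version exactly; the names `MINIMUM_PYTHON` and `is_minimum_python` are hypothetical, and the actual script in the repo may be written differently.

```python
import sys
from typing import Tuple

# Lowest supported version, per the description above.
MINIMUM_PYTHON: Tuple[int, int] = (3, 8)


def is_minimum_python(
    version: Tuple[int, int], minimum: Tuple[int, int] = MINIMUM_PYTHON
) -> bool:
    """Return True only when the major.minor version equals the minimum."""
    return tuple(version[:2]) == minimum


current = (sys.version_info.major, sys.version_info.minor)
if is_minimum_python(current):
    print("Python version OK; safe to run pip-compile")
else:
    print(
        f"Expected Python {MINIMUM_PYTHON[0]}.{MINIMUM_PYTHON[1]}, "
        f"found {current[0]}.{current[1]}; skipping pip-compile"
    )
```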

Closes out https://github.com/Unstructured-IO/unstructured/issues/1466
2023-09-25 10:27:42 -04:00
Ahmet Melek
9e88929a8c
feat: document embeddings (#1368)
Closes https://github.com/Unstructured-IO/unstructured/issues/1319,
closes https://github.com/Unstructured-IO/unstructured/issues/1372

This module:

- implements EmbeddingEncoder classes which track embedding related data
- implements embed_documents method which receives a list of Elements,
obtains embeddings for the text within Elements, updates the Elements
with an attribute named embeddings, and returns the updated Elements
- the module uses langchain to obtain the embeddings
-----
- The PR additionally fixes a JSON de-serialization issue on the
metadata fields.

To test the changes, run `examples/embed/example.py`
2023-09-20 19:55:30 +00:00
Ryan Nikolaidis
8c1d03e5cf
update slack invite
2023-09-20 00:02:03 -07:00
Steve Canny
b54994ae95
rfctr: docx partitioning (#1422)
Reviewers: I recommend reviewing commit-by-commit or just looking at the
final version of `partition/docx.py` as View File.

This refactor solves a few problems but mostly lays the groundwork to
allow us to refine further aspects such as page-break detection,
list-item detection, and moving python-docx internals upstream to that
library so our work doesn't depend on that domain-knowledge.
2023-09-19 15:32:46 -07:00
John
6187dc0976
update links in integrations.rst (#1418)
A number of the links in integrations.rst don't seem to lead to the
intended section in the unstructured documentation.

For example:
```See the `stage_for_weaviate <https://unstructured-io.github.io/unstructured/bricks.html#stage-for-weaviate>`_ docs for details```

It seems this link should direct to here instead: https://unstructured-io.github.io/unstructured/bricks/staging.html#stage-for-weaviate
2023-09-15 16:50:55 -07:00
Roman Isecke
333558494e
roman/delta lake dest connector (#1385)
### Description
Add delta table downstream destination connector

Closes https://github.com/Unstructured-IO/unstructured/issues/1415
2023-09-15 22:13:39 +00:00
Ronny H
f1364594ad
Docs models (#1412)
This PR adds documentation of models supported by the `Unstructured`
tool. The changes reflect the tool's capabilities, usage examples, and
the process for integrating custom models.

Sections:
- Detailed the basic usage of the `Unstructured` partition with the
model name.
- Provided a list of available models in the `Unstructured` partition.
- Added instructions on using non-default models via three distinct
methods.
- Explained leveraging models from the LayoutParser's model zoo with
`UnstructuredDetectronModel`.
- Guided users in integrating their custom object detection models using
the `UnstructuredObjectDetectionModel` class.

Tested the docs build with:
> cd docs
> pip install -r requirements.txt
> make html
2023-09-13 23:37:31 -07:00
Amanda Cameron
7fd81dc7df
Table processing test for RTF (#1388)
This PR does two things:
1. Adds test case (and alters sample doc) for rtf and epub files with
table
2. Adds `xls/x` file extension to `skip_infer_table_types` default list

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
2023-09-12 18:27:05 -07:00
Roman Isecke
59e850bbd9
Roman/downstream connector cli subcommand (#1302)
### Description
Update all other connectors to use the new downstream architecture that
was recently introduced for the s3 connector.

Closes #1313 and #1311
2023-09-11 11:40:56 -04:00
Ronny H
edc45013dc
Add strategy documentation (#1353)
2023-09-09 18:54:01 -07:00
Ahmet Melek
09cc4bfa5f
feat: jira connector (cloud) (#1238)
This connector:
- takes a Jira Cloud URL, user email and api token; to authenticate into
Jira Cloud
- ingests:
  - either all issues in all projects in a Jira Cloud Organization
  - or 
    - issues in user specified projects, boards
    - user specified issues
- processes this kind of data:
  - text fields such as issue summary, description, and comments
  - dropdown fields such as issue type, status, priority, assignee,
    reporter, labels, and components
  - other data such as issue id, issue key, project id, information on
    subtasks
  - notes down attachment URLs, however does not process attachments
- stores each downloaded issue in a txt file, in a predefined template
form (consisting of the data above)
- then processes each downloaded issue document into elements using
unstructured library
- related to: https://github.com/Unstructured-IO/unstructured/issues/263

To test the changes, make the necessary setups and run the relevant
ingest test scripts.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-09-06 10:10:48 +00:00
Jack Retterer
95b6295307
Jack/update documentation (#1190)
Updated:
- Added back support document types for partitioning
- Added more tabs for python code in the API page
- Added a RAG section in Key Concepts
- Added a Common Use case section in overview
2023-09-04 16:15:50 +00:00
David Potter
b710bafa89
feat: add salesforce connector (#1168)
2023-09-02 08:50:31 -07:00
Matt Robinson
c49df62967
feat: partition_xml infers element type on each leaf node (#1249)
### Summary

Closes #1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_text`. If
`xml_keep_tags=True`, the file is still treated like a text file and
partitioning is still delegated to `partition_text`.

Also adds the option to pass `text` as an input to `partition_xml`.
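
As a rough illustration of what "each leaf node" means, the sketch below walks an XML string with the standard library and collects the text of every node that has no children. It is not the `partition_xml` implementation; `leaf_texts` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def leaf_texts(xml_text: str) -> List[str]:
    """Return the stripped text of every leaf node (a node with no children)."""
    root = ET.fromstring(xml_text)
    texts = []
    for node in root.iter():
        if len(node) == 0 and node.text and node.text.strip():
            texts.append(node.text.strip())
    return texts


sample = (
    "<parrots><parrot><name>Conure</name>"
    "<description>A conure is a very friendly bird.</description>"
    "</parrot></parrots>"
)
print(leaf_texts(sample))  # ['Conure', 'A conure is a very friendly bird.']
```

Each collected string could then be classified on its own, which is why `<name>` and `<description>` no longer get merged.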

### Testing

Create a `parrots.xml` file that looks like:

```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.

Conures are feathery and like to dance.</description></parrot></xml>
```

Run:

```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict

elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```

On `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.

```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure A conure is a very friendly bird.',
  'type': 'NarrativeText'},
 {'element_id': '859ecb332da6961acd2fb6a0185d1549',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```

On the feature branch, the output is the following, and the tags are
correctly separated.

```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'Conure',
  'type': 'Title'},
 {'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
  'metadata': {'file_directory': '/home/matt/tmp',
               'filename': 'parrots.xml',
               'filetype': 'application/xml',
               'last_modified': '2023-08-30T14:21:38'},
  'text': 'A conure is a very friendly bird.\n'
          '\n'
          'Conures are feathery and like to dance.',
  'type': 'NarrativeText'}]

```
2023-08-30 17:07:10 -04:00
Matt Robinson
f6a745a74f
feat: chunk elements based on titles (#1222)
### Summary

An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:

- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element is over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Sections under `combine_under_n_chars` characters are combined. The
default is `500`.
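
The sectioning rules above can be sketched in plain Python. This is a simplified illustration, not `chunk_by_title` itself: elements are modeled as hypothetical `(type, text)` tuples, the metadata-change rule is left out, and `sections_by_title` is an assumed name.

```python
from typing import List, Tuple

Element = Tuple[str, str]  # (element type, text); a simplified stand-in


def sections_by_title(
    elements: List[Element],
    new_after_n_chars: int = 1500,
    combine_under_n_chars: int = 500,
) -> List[List[Element]]:
    """Group elements into sections, starting a new one at each Title."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for kind, text in elements:
        current_len = sum(len(t) for _, t in current)
        # A Title opens a new section unless the current one is still short.
        starts_section = kind == "Title" and current_len >= combine_under_n_chars
        # Elements are never split, so a section is closed only after it grows too long.
        too_long = current_len > new_after_n_chars
        if current and (starts_section or too_long):
            sections.append(current)
            current = []
        current.append((kind, text))
    if current:
        sections.append(current)
    return sections


elements = [
    ("Title", "Intro"),
    ("NarrativeText", "a" * 20),
    ("Title", "Next"),
    ("NarrativeText", "b" * 20),
]
split = sections_by_title(elements, new_after_n_chars=50, combine_under_n_chars=10)
print([len(section) for section in split])  # [2, 2]
```

Raising `combine_under_n_chars` to `100` in the same call keeps all four elements in one section, because the first section is still under the combine threshold when the second `Title` arrives.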

### Testing

```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)

for chunk in chunks:
    print(chunk)
    print("\n\n" + "-"*80)
    input()
```
2023-08-29 16:04:57 +00:00
omahs
64b4287308
fix: typos (#1215)
fix: typos
2023-08-28 12:05:48 +00:00
Matt Robinson
07f76275f1
feat: detect PGP encrypted content in partition_email and partition_msg (#1205)
### Summary

Closes #1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP encrypted content is detected, a
warning is emitted and an empty list of elements is returned.
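
For readers unfamiliar with RFC 2015, the marker it defines is a `multipart/encrypted` content type whose `protocol` parameter is `application/pgp-encrypted`. The sketch below checks for that marker with the standard library `email` package. It is an illustration under that assumption, not the detection code from this PR, and `has_pgp_encrypted_content` is a hypothetical name.

```python
from email import message_from_string
from email.message import Message


def has_pgp_encrypted_content(msg: Message) -> bool:
    """Detect the RFC 2015 marker: multipart/encrypted with a PGP protocol parameter."""
    for part in msg.walk():
        if part.get_content_type() != "multipart/encrypted":
            continue
        protocol = part.get_param("protocol")
        if isinstance(protocol, str) and protocol.lower() == "application/pgp-encrypted":
            return True
    return False


# Hypothetical message modeled loosely on the example in the spec.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Mime-Version: 1.0\n"
    'Content-Type: multipart/encrypted; boundary="foo";\n'
    '   protocol="application/pgp-encrypted"\n'
    "\n"
    "--foo\n"
    "Content-Type: application/pgp-encrypted\n"
    "\n"
    "Version: 1\n"
    "\n"
    "--foo\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "-----BEGIN PGP MESSAGE-----\n"
    "-----END PGP MESSAGE-----\n"
    "\n"
    "--foo--\n"
)
print(has_pgp_encrypted_content(message_from_string(raw)))  # True
```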

### Testing

```python
from unstructured.partition.email import partition_email

filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```

```python
from unstructured.partition.msg import partition_msg

filename = "example-docs/fake-encrypted.msg"
partition_msg(filename=filename)
```
2023-08-25 17:09:25 -07:00
Matt Robinson
cdae53cc29
chore: deprecation warning for file_filename (#1191)
### Summary

Closes #1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
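
A kwarg deprecation of this shape is commonly handled as in the sketch below. The helper name `resolve_metadata_filename`, the warning class, and the messages are assumptions for illustration; the PR's actual code may differ.

```python
import warnings
from typing import Optional


def resolve_metadata_filename(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,
) -> Optional[str]:
    """Prefer metadata_filename; accept file_filename with a deprecation warning."""
    if metadata_filename is not None and file_filename is not None:
        raise ValueError(
            "Only one of metadata_filename and file_filename can be specified."
        )
    if file_filename is not None:
        warnings.warn(
            "The file_filename kwarg is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return file_filename
    return metadata_filename


# Record warnings so the deprecated path can be observed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    name = resolve_metadata_filename(file_filename="test.epub")
print(name, len(caught))  # test.epub 1
```

The three cases line up with the test script in this commit: the new kwarg is silent, the old kwarg warns, and passing both raises an error.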

### Testing

```python
from unstructured.partition.auto import partition

filename = "example-docs/winter-sports.epub"

# Should not emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should emit a warning
with open(filename, "rb") as f:
    elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename

# Should raise an error
with open(filename, "rb") as f:
    elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
2023-08-24 07:02:47 +00:00
Jack Retterer
05e311651a
doc: add delta tables connector reference (#1177)
Added delta tables to connectors page for users to discover
2023-08-22 12:50:27 -07:00
ryannikolaidis
ac2313a3fa
doc: fix get-api-key link (#1175)
2023-08-22 19:31:07 +00:00
ryannikolaidis
ab7fafcb41
doc: add pdf extra note (#1165)
2023-08-22 18:20:26 +00:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, it would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00
Jack Retterer
f639d04695
Fixed some typos (#1162)
The Wikipedia data connector was labeled as Airtable.
2023-08-21 18:03:15 -07:00
Jack Retterer
a35ff890e0
Update docs jack (#1157)
Documentation Overhaul

- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
2023-08-21 10:27:32 -07:00
Francisco Kurucz
d2a41f462d
doc: fix typo on partition_md function in bricks documentation (#1147)
2023-08-17 20:54:11 -07:00
Noah Greer
fa0a5afb71
docs: correct spelling of partition in docs (#1104)
Fixes a typo in several places where the word `partition` is misspelled
as `partiton`
2023-08-12 14:57:27 -07:00
John
f63a66dbef
Capture section and chapter in the metadata for epubs under epub_section (#1005)
Capture section and chapter in the metadata for epubs under epub_section.
Closes Github issue #459
2023-08-12 21:02:06 +00:00
Ronny H
0d5b5a0e79
Revamp README & Bricks documentation (#1103)
Reorganize README.md
2023-08-12 19:58:51 +00:00
Ahmet Melek
627f78c16f
feat: airtable connector (#1012)
* add the first version of airtable connector

* change imports as inline to fail gracefully in case of lacking dependency

* parse tables as csv rather than plain text

* add relevant logic to be able to use --airtable-list-of-paths

* add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings

* fix ingest test names

* add scripts for the large table test

* remove large table test from diff test

* make base and table ids explicit

* add and remove comments

* use -ne instead of !=

* update code based on the recent ingest refactor, update changelog and version

* shellcheck fix

* update comments

* update check-num-rows-and-columns-output error message

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* update help comments

* update help comments

* update help comments

* update workflows to set auth tokens and to run make install

* add comments on create_scale_test_components

* separate component ids from the test script, add comments to document test component creation

* add LARGE_BASE test, implement LARGE_BASE component creation, replace component id

* shellcheck fixes

* shellcheck fixes

* update docs

* update comment

* bump version

* add wrongly deleted file

* sort columns before saving to process

* Update ingest test fixtures (#1098)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-11 12:02:51 -07:00
Matt Robinson
fa5a3dbd81
feat: unique_element_ids kwarg for UUID elements (#1085)
* added kwarg for unique elements

* test for unique ids

* update docs

* changelog and version
2023-08-11 11:02:37 +00:00
Yuming Long
112347aa0d
doc: update API doc to sync with new parameter in prod API (#1049)
* doc doc

* changelog and version

* sample docs -> example docs

* nit on compute cost doc

* pass empty dict not none

* note note

* cutting release
2023-08-09 11:09:37 -04:00
kravetsmic
25ca5744cf
feat: optionally ignore header and footer tags in partition html (#1013)
---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 21:56:33 +00:00
kravetsmic
bef93aef6e
fix: email addresses shouldn't be flagged as titles (#957)
* feat: add func for checking on EmailAddress type

* feat: add EmailAddress type

* feat: add check for email type

* feat: add test for checking EmailAddress type

* feat: update existing example files with email

* feat: add new example files with email in the text

* fix: apply linter

* feat: update changelog file

* feat: add test for is_email_address function

* don't push

* fix: clean up code

* apply linter

* fix: clean up

* fix: remove file changes

* fix: remove not used files for email address test

* fix: remove not necessary tests

* clean up

* fix: apply linter

* fix: update CHANGELOG

* fix: change version

* fix: fix msg test

* fix: apply linter for tests

* fix: remove spaces

* fix: apply linter with longer line

* feat: update documentation

* fix: remove duplicates

* Update getting_started.rst

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 11:28:36 -04:00
Hynek Kydlíček
47b20119c3
fix: extract emojis with partition_xlsx (#1009)
* 🐛 fixed emoji xlsx bug

* update version and changelog

* check if beautifulsoup exists

* update docs

* fix html parser call

* fix failing attachment test

* added emoji test, added requirement, fixed dependency

* 🐛 dependency

* 🐛 correct dependency

* linting, linting, linting

* check for bs4

* skip auto xls filename test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2023-08-04 10:14:08 -04:00
kravetsmic
73eeae852e
feat: add filter element types as post processing function (#1014)
* don't push

* enhancement: improve json detection by detect_filetype (#971)

* update regex pattern

* improve json regex pattern checks and add test file

* update file name

* update tests and formatting

* update changelog and version

* refactor: simplifies JSON detection and add tests (#975)

* refactor json detection

* version and changelog

* fix mock in test

* feat: adds Outlook connector (#939)

* bonus: fixes issue with email partitioning where From field was being assigned the To field value.

* Roman/expose dpi param (#966)

* Bump inference version

* Pass through the dpi param if available

* Update CHANGELOG

* Check dpi param passed in via unit test

* Bump inference version

* Fix unit test around file info to work on mac as well

* chore: cleanup changelog for 0.8.2 (#976)

* Update `partition_via_api` to not post a strategy value if not user specified (#967)

* remove default strategy

* working on test

* fixed test, coordinates param needed to be included

* nits

* update changelog

* lint

* update requirements

* build(release): cut 0.8.4 release (#979)

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv partition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name only if possible

* feat: add csv tests

* fix: renaming

* feat: add field metadata_date as date of last mod

* feat: add tests for partition_docx

* feat: add field metadata_date to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coordinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string rst partition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or ` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Chore: add uns api repo unittests (#954)

* stage

* git clone

* ci ignore markdown file

* make install

* use env instead

* remove md

* add script

* wrong env value

* add note

* maybe don't rm

* no cd../

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

* fix: handling for empty tables in word docs and powerpoints (#982)

* fix table index error

* changelog and version

* fix: only download nltk packages if necessary (#985)

* fix: only download nltk if necessary

* changelog and version

* Chore: Pass table support  param to partition image (#973)

* add param and test in image table extraction

* version and changelog

* need to publish this one for api repo

* add new param skip_infer_table_types

* use warning

* clean up with mapping

* add test for tsv

* fix test fail

* weird change from merge

* doc nit

* don't use mapping

* correct conflict

* Update pip in makefile (#981)

* update pip in makefile

* merge and update requirements

* update version

* update outlook requirements

* chore: remove debug printing (#988)

* fix: correct nltk download arg order (#991)

* fix: correct download order to nltk args

* add smoke test for tokenizers

* Chore: put back function `split_by_paragraph` (#992)

* put back function

* not really fixes

* don't push

* fix: clean up code

* fix: clean up

* fix: clean up

* Roman/ingest refactor (#978)

* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Separate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>

* feat: adds Box connector (#996)

* chore: rename Element's "date" field to "last_modified" (#997)

Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.

* don't push

* feat: add document date for remaining file types (#930) (#969)

* feat: add document date for remaining file types (#930)

* feat: add functions for getting modification date

* feat: add date field to metadata from csv file

* feat: add tests for csv partition

* feat: add date field to metadata from html file

* feat: add tests for html partition

* fix: return file name only if possible

* feat: add csv tests

* fix: renaming

* feat: add field metadata_date as date of last mod

* feat: add tests for partition_docx

* feat: add field metadata_date to .doc file

* feat: add tests for partition_doc

* feat: add metadata_date to .epub file

* feat: add tests for partition_epub

* fix: fix test mocking

* feat: add metadata_date for image partition

* feat: add test for image partition

* feat: add coordinate system argument

* feat: add date to element metadata

* feat: add metadata_date for JSON partition

* feat: add test for JSON partition

* fix: rename variable

* feat: add metadata_date for md partition

* feat: add test for md partition

* feat: update doc string

* feat: add metadata_date for .odt partition

* feat: update .odt string

* feat: add metadata_date for .org partition

* feat: add tests for .org partition

* feat: add metadata_date for .pdf partition

* feat: add tests for .pdf partition

* feat: add metadata_date for .pptx partition

* feat: add metadata_date for .ppt partition

* feat: add tests for .ppt partition

* feat: add tests for .pptx partition

* feat: add metadata_date for .rst partition

* feat: add tests for .rst partition

* fix: get modification date after file checking

* feat: add tests for .rtf partition

* feat: add tests for .rtf partition

* feat: add metadata_date for .txt partition

* fix: rename argument

* feat: add tests for .txt partition

* feat: update doc string for rst partition function

* feat: add metadata_date for .tsv partition

* feat: add tests for .tsv partition

* feat: add metadata_date for .xlsx partition

* feat: add tests for .xlsx partition

* fix: clean up

* feat: add tests for .xml partition

* feat: add tests for .xml partition

* fix: use `or` instead of `if`

* fix: fix epub tests

* fix: remove not used code

* fix: add try block for getting file name

* fix: applying linter changes

* fix: fix test_partition_file

* feat: add metadata_date for email

* feat: add test for email partition

* feat: add metadata_date for msg

* feat: add tests for msg partition

* feat: update CHANGELOG file

* fix: update partitions doc string

* don't push

* fix: clean up code

* linting, linting, linting

* remove unnecessary example doc

* update version and changelog

* ingest-test-fixtures-update

* set metadata date in test

---------

Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>

* ingest-test-fixtures-update

* Update ingest test fixtures (#970)

Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* Revert "Update ingest test fixtures (#970)"

This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.

* remove date from metadata in outputs

* update docstring ordering

* remove print

* remove print

* remove print

* linting, linting, linting

* fix version and test

* fix changelog

* fix changelog

* update version

---------

Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>

* fix: remove prints

* remove unused file

* fix: apply linter

* feat: add post processing filter_element_types

* feat: add tests for filter_element_types

* feat: update changelog

* feat: add doc string for filter_element_types

* fix: change the version

* feat: update documentation

* bump dev version number

* cleanup changelog

* linting, linting, linting

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-08-03 10:50:35 -04:00
Yuming Long
5af717f16f
doc: update API doc to sync with new parameter in prod API (#1020)
* sync readme
2023-08-01 19:47:32 -04:00
Matt Robinson
331c7faf38
build(deps): split up dependencies by document type (#986)
* split dependencies by document type

* make pip-compile with new requirements

* add extra requirements to setup.py

* add in all docs; re pip-compile

* extra for all docs

* add pandas to xlsx

* dependency requires for tsv and csv

* handling for doc, docx and odt

* dependency check for pypandoc

* required dependencies for pandoc files

* xml and html

* markdown

* msg

* add in pdf

* add in pptx

* add in excel

* add lxml as base req

* extra all docs for local inference

* local inference installs all

* pin pillow version

* fixes for plain text tests

* fixes for doc

* update make commands

* changelog and version

* add xlrd

* update pip-compile

* pin numpy for python 3.8 support

* more constraints

* constraint on scipy

* update install docs

* constrain ipython

* add outlook to pip-compile

* more ipython constraints

* add extras to dockerfile

* pin office365 client

* few doc tweaks

* types as strings

* last pip-compile

* re pip-compile

* make tidy

* make tidy
2023-08-01 11:31:13 -04:00
David Potter
1542607892
feat: adds Box connector (#996)
2023-08-01 01:10:10 +00:00
Roman Isecke
28214a6cc3
Roman/ingest refactor (#978)
* Pull out s3 code as subcommand

* Pull out dropbox code as subcommand

* Pull out azure code as subcommand

* Pull out fsspec code as subcommand

* Pull out github code as subcommand

* Pull out gitlab code as subcommand

* Pull out reddit code as subcommand

* Pull out slack code as subcommand

* Pull out discord code as subcommand

* Pull out wikipedia code as subcommand

* Pull out gdrive code as subcommand

* Pull out biomed code as subcommand

* rename parameters

* Pull out onedrive code as subcommand

* Pull out outlook code as subcommand

* Pull out local code as subcommand

* Pull out elasticsearch code as subcommand

* Pull out confluence code as subcommand

* Drop previous main file

* update changelog

* Add back in mp.Pool

* Fix mypy issues with click

* Make sure all tests run with verbose flag

* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data

* Pull out some more shared options

* Support running code via python as well as cli

* update ingest readme and move it to the ingest folder

* update usage in connector docs

* move local command arg in test

* Separate out cli code from logic running unstructured

* Make some cli fields required rather than optional

* rename process -> processor

* Improve logger to avoid duplicate handlers

---------

Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
2023-07-31 13:20:10 -04:00