- revert the layout parser fast pdf file to original with just two pages
- add a new file that has one empty page and one page says "this page is
intentionally left blank" for tests
This PR resolves#1247 by using the matching elements and bbox for
coordinate computation.
This PR also updates the example doc
`example-docs/layout-parser-paper-fast.pdf` so that it includes a true
blank page and a page with text "this page is intentionally left blank".
This change helps us testing:
- differences between fast and hi_res
- code handling empty pages in between pages with contents (which
triggers the bug found in #1247 )
Lastly, this PR updates the names of the variables inside
`_partition_pdf_or_image_with_ocr` so that matching inputs all starts
with `_` like `_elements`, `_text`, and `_bboxes` to improve
readability.
This change also improves partition performance for multi-page pdfs as
it reduces the amount of iterations inside
`add_pytesseract_bbox_to_elements`. Testing locally on m2 mac + Rocky
docker shows it reduces partition time for DA-619p.pdf file from around
1min to around 23s.
### Summary
Closes#1229. Updates `partition_xml` so that the element type is
inferred on each leaf node when `xml_keep_tags=False` instead of
delegating splitting and partitioning to `partition_xml`. If
`xml_keep_tags=True`, the file is treated like a text file still and
partitioning is still delegated to `partition_text`.
Also adds the option to pass `text` as an input to `partition_xml`.
### Testing
Create a `parrots.xml` file that looks like:
```xml
<xml><parrot><name>Conure</name><description>A conure is a very friendly bird.
Conures are feathery and like to dance.</description></parrot></xml>
```
Run:
```python
from unstructured.partition.xml import partition_xml
from unstructured.staging.base import convert_to_dict
elements = partition_xml(filename="parrots.xml")
convert_to_dict(elements)
```
One `main`, the output is the following. Notice how the `<name>` tag
incorrectly gets merged into `<description>` in the first element.
```python
[{'element_id': '7ae4074435df8dfcefcf24a4e6c52026',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure A conure is a very friendly bird.',
'type': 'NarrativeText'},
{'element_id': '859ecb332da6961acd2fb6a0185d1549',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
One the feature branch, the output is the following, and the tags are
correctly separated.
```python
[{'element_id': '5512218914e4eeacf71a9cd42c373710',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'Conure',
'type': 'Title'},
{'element_id': '113bf8d250c2b1a77c9c2caa4b812f85',
'metadata': {'file_directory': '/home/matt/tmp',
'filename': 'parrots.xml',
'filetype': 'application/xml',
'last_modified': '2023-08-30T14:21:38'},
'text': 'A conure is a very friendly bird.\n'
'\n'
'Conures are feathery and like to dance.',
'type': 'NarrativeText'}]
```
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
### Summary
An initial pass on smart chunking for RAG applications. Breaks a
document into sections based on the presence of `Title` elements. Also
starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is `1500`. The chunking function does not split individual
elements, so it's possible for a section to exceed that threshold if an
individual element if over `new_after_n_chars` characters, which could
occur with a long `NarrativeText` element.
- Section under `combine_under_n_chars` characters are combined. The
default is `500`.
### Testing
```python
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
elements = partition_html(url=url)
chunks = chunk_by_title(elements)
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
```
### Summary
Closes#1018. Enables `partition_email` and `partition_msg` to detect if
an email has PGP encrypted content. Based on the specification in [RFC
2015](https://www.ietf.org/rfc/rfc2015.txt). The test emails are based
on the example email in the spec. If PGP detected content is detected, a
warning is emitted and an empty set of lists is returned.
### Testing
```python
from unstructured.partition_email import partition_email
filename = "example-docs/eml/fake-encrypted.eml"
partition_email(filename=filename)
```
```python
from unstructured.partition_msg import partition_msg
filename = "example-docs/fake-encrypted.msg"
partition_msgl(filename=filename)
```
### Summary
Closes#1184. Updates `partition_html` to respect the ordering of
`<pre>` tags in HTML documents.
### Testing
The elements in the following example should be in the correct order.
```python
from unstructured.partition.html import partition_html
html_text = """
<pre>The Big Brown Bear</pre>
<div>The big brown bear is growling.</div>
<pre>The big brown bear is sleeping.</pre>
<div>The Big Blue Bear</div>
"""
elements = partition_html(text=html_text)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
Address
[#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for
`hi_res` and `fast` strategies. The `ocr_only` strategy does not include
coordinates.
- add functionality to switch sort mode between the current `basic`
sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies
- add the script to evaluate the `xy-cut` sorting approach
- add jupyter notebook to provide evaluation and visualization for the
`xy-cut` sorting approach
### Evaluation
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy>
```
Here, the file should be under the project root directory. For example,
```
export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast
```
Add test case test_partition_image_with_multipage_tiff that reads multipage TIFF file and
- confirms that the function reads all the pages in the TIFF.
- page number is added to the metadata
This PR is branched from and developed on top of 6d6be99 commit.
### Summary
Closes#1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/winter-sports.epub"
# Should not emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should raise an error
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
- fixes#1079 where partitioning is happening twice in the case of
`strategy="ocr_only"`
- only calls `extractable_elements` if we can predetermine that
`ocr_only` is not a possible strategy even if it was the intended
strategy.
- Adds additional assertion test that `_partition_pdf_or_image_with_ocr`
is not called when falling back to `fast` from `ocr_only`
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `enitre_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a makefile command to `pip-compile` a missing ingest file.
### Summary
Updates `partition` to let users know to installs the appropriate extras
if they're missing. Prior to this PR, users would get an exception
stating `partition_pdf` (or whichever function that requires extras)
does not exist.
### Testing
First `pip uninstall ebooklib`. Then run
```python
from unstructured.partition.auto import partition
partition(filename="example-docs/winter-sports.epub")
```
The error should look like
```python
ImportError: partition_epub is not available. Install the epub dependencies with pip install "unstructured[epub]"
```
**Summary**
Closes#747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
### Summary
Closes#1027
The msg test in question was no longer failing after removing the
quick-fix and comment explaining the issue. However, the test was not
functioning as intended. Test was refactored to appropriately test
`metadata_last_modified` of attachments.
`partition_msg` was then updated to pass `metadata_last_modified` to
`attachment_partitioner`.
The same was done for email partitioning.
### Testing
```
from unstructured.partition.text import partition_text
from unstructured.partition.msg import partition_msg
from unstructured.partition.email import partition_email
filename="example-docs/fake-email-attachment.msg"
elements = partition_msg(filename=filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
# previously, these were different values because last_modified wasn't being updated in attachments
elements[1].metadata.last_modified
elements[-1].text
elements[-1].metadata.last_modified
email_filename="example-docs/eml/fake-email-attachment.eml"
email_elements = partition_email(filename=email_filename, attachment_partitioner=partition_text, process_attachments=True, metadata_last_modified="0000-00-00")
email_elements[1].metadata.last_modified
email_elements[-1].text
email_elements[-1].metadata.last_modified
```
Set to individual_blocks for now to work around [this
bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179).
I verified by printing the current ocr_mode in inference. The
`entire_page` default is overridden.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: awalker4 <awalker4@users.noreply.github.com>
The reason this test is failing is the API is returning "fast" results
when "hi_res" is requested, which is being tracked in this ticket:
https://github.com/Unstructured-IO/unstructured-api/issues/188 .
This failure was only showing up on the `main` branch, per the commented
out `pytest` skips.
Handle Content-Disposition: inline and attachment without filename
* Add new email test example and test with Content-Disposition: inline.
* Move attachment_info above for loop so it is always defined
* Check if item is inline as well as attachment as these both lack an = character to split on
* Create filename if filename is not specified and write file.
* Update list_attachments with new filename
Fix attachments with = in filename
* Limit split to first match of = to prevent creating a list of more than two parts
* Add example email with attachment name and test for issue
* feat: add functionality to check if a string contains any emoji characters
* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji
* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use
* chore: update changelog & version
* chore: update changelog & version
* chore: update dependencies
* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`
* chore: update changelog & version
* feat: add functionality to switch html text parser based on whether the html text contains emoji
* chore: update changelog & version
* fix lint errors
* test: revert the `EXPECTED_XLS_TEXT_LEN` value back
* feat: always use `soupparser_fromstring` to parse `html text`
* fix lint error
* feat: add functionality to check if a string contains any emoji characters
* feat: add functionality to switch `html` text parser based on whether the `html` text contains emoji
* chore: add `beautifulsoup4` and `emoji` packages to `requirements/base.in` for general use
* chore: update changelog & version
* chore: update changelog & version
* chore: update dependencies
* test: update `EXPECTED_XLS_TEXT_LEN` for `test_auto_partition_xls_from_filename`
* chore: update changelog & version
* add auto_paragraph_grouper. add line break pattern.
* combine group_broken_paragraph and blank_line_grouper function
* fix make check errors
* fix make check errors
* fix make check errors
* fix make check errors
* run make tidy to fix errors
* tidy core.py and text.py
* fix blank-line breaker to extends the result and replace new line with space
* fix function name typo
* call group_broken_paragraphs for blank_line_grouper
* edit function name from one_line_grouper to new_line_grouper for consistency
* edit threshold from 0.5 to 0.1
* edit threshold from 0.5 to 0.1
* Revert "call group_broken_paragraphs for blank_line_grouper"
This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.
* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1
* edit test_text assertion. remove all BULLETS_PATTERN.
* Update ingest test fixtures (#1052)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* edit test case in test_xml_partition
* update assertion on test_auto
---------
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph
* chore: add docstring
* chore: fix lint errors
* feat: ignore spaces when extracting emphasized texts from a paragraph
* feat: add functionality to track emphasized text (`bold/italic` formatting) from table
* test: add test case for grabbing emphasized texts from element metadata
* chore: fix lint errors
* chore: update changelog & version
* Update ingest test fixtures (#1047)
* feat: add func for checking on EmailAddress type
* feat: add EmailAddress type
* feat: add check for email type
* feat: add test for cheking EmailAdress type
* feat: update existing example files with email
* feat: add new exampe fileds with email in the text
* fix: apply linter
* feat: update changelog file
* feat: add test for is_email_address function
* don't push
* fix: clean up code
* apply linter
* fix: clean up
* fix: remove file chaanges
* fix: remove not used files for email address test
* fix: remove not necessary tests
* clean up
* fix: apply linter
* fix: update CHANGELOG
* fix: change version
* fix: fix msg test
* fix: apply linter for tests
* fix: remove spaces
* fix: apply linter with longer line
* feat: update documentation
* fix: remove duplicates
* Update getting_started.rst
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* don't push
* enhancement: improve json detection by detect_filetype (#971)
* update regex pattern
* improve json regex pattern checks and add test file
* update file name
* update tests and formatting
* update changelog and version
* refactor: simplifies JSON detection and add tests (#975)
* refactor json detection
* version and changelog
* fix mock in test
* feat: adds Outlook connector (#939)
* bonus: fixes issue with email partitioning where From field was being assigned the To field value.
* Roman/expose dpi param (#966)
* Bump inference version
* Pass through the dpi param if available
* Update CHANGELOG
* Check dpi param passed in via unit test
* Bump inference version
* Fix unit test around file info to work on mac as well
* chore: cleanup changelog for 0.8.2 (#976)
* Update `partition_via_api` to not post a strategy value if not user specified (#967)
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* build(release): cut 0.8.4 release (#979)
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv patition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name onlyif possible
* feat: add csv tests
* fix: renaming
* feat: add filed metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add filed metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coorrdinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst patrition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or ` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Chore: add uns api repo unittests (#954)
* stage
* git clone
* ci ignore markdown file
* make install
* use env instead
* remove md
* add script
* wrong env value
* add note
* maybe don't rm
* no cd../
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
* fix: handling for empty tables in word docs and powerpoints (#982)
* fix table index error
* changelog and version
* fix: only download nltk packages if necessary (#985)
* fix: only download nltk if necessary
* changelog and version
* Chore: Pass table support param to partition image (#973)
* add param and test in image table extraction
* version and changelog
* need to publish this one for api repo
* add new param skip_infer_table_types
* use warning
* clean up with mapping
* add test for tsv
* fix test fail
* weird change from merge
* doc nit
* don't use mapping
* correct conflict
* Update pip in makefile (#981)
* update pip in makefile
* merge and update requirements
* update version
* update outlook requirements
* chore: remove debug printing (#988)
* fix: correct nltk download arg order (#991)
* fix: correct download order to nltk args
* add smoke test for tokenizers
* Chore: put back function `split_by_paragraph` (#992)
* put back function
* not really fixes
* don't push
* fix: clean up code
* fix: clean up
* fix: clean up
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv patition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name onlyif possible
* feat: add csv tests
* fix: renaming
* feat: add filed metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add filed metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coorrdinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst patrition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or ` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Roman/ingest refactor (#978)
* Pull out s3 code as subcommand
* Pull out dropbox code as subcommand
* Pull out azure code as subcommand
* Pull out fsspec code as subcommand
* Pull out github code as subcommand
* Pull out gitlab code as subcommand
* Pull out reddit code as subcommand
* Pull out slack code as subcommand
* Pull out discord code as subcommand
* Pull out wikipedia code as subcommand
* Pull out gdrive code as subcommand
* Pull out biomed code as subcommand
* rename parameters
* Pull out onedrive code as subcommand
* Pull out outlook code as subcommand
* Pull out local code as subcommand
* Pull out elasticsearch code as subcommand
* Pull out confluence code as subcommand
* Drop previous main file
* update changelog
* Add back in mp.Pool
* Fix mypy issues with click
* Make sure all tests run with verbose flag
* refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data
* Pull out some more shared options
* Support running code via python as well as cli
* update ingest readme and move it to the ingest folder
* update usage in connector docs
* move local command arg in test
* Seperate out cli code from logic running unstructured
* Make some cli fields required rather than optional
* rename process -> processor
* Improve logger to avoid duplicate handlers
---------
Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* feat: adds Box connector (#996)
* chore: rename Element's "date" field to "last_modified" (#997)
Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.
* don't push
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv patition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name onlyif possible
* feat: add csv tests
* fix: renaming
* feat: add filed metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add filed metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coorrdinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst patrition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or ` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* feat: add document date for remaining file types (#930) (#969)
* feat: add document date for remaining file types (#930)
* feat: add functions for getting modification date
* feat: add date field to metadata from csv file
* feat: add tests for csv patition
* feat: add date field to metadata from html file
* feat: add tests for html partition
* fix: return file name onlyif possible
* feat: add csv tests
* fix: renaming
* feat: add filed metadata_date as date of last mod
* feat: add tests for partition_docx
* feat: add filed metadata_date to .doc file
* feat: add tests for partition_doc
* feat: add metadata_date to .epub file
* feat: add tests for partition_epub
* fix: fix test mocking
* feat: add metadata_date for image partition
* feat: add test for image partition
* feat: add coorrdinate system argument
* feat: add date to element metadata
* feat: add metadata_date for JSON partition
* feat: add test for JSON partition
* fix: rename variable
* feat: add metadata_date for md partition
* feat: add test for md partition
* feat: update doc string
* feat: add metadata_date for .odt partition
* feat: update .odt string
* feat: add metadata_date for .org partition
* feat: add tests for .org partition
* feat: add metadata_date for .pdf partition
* feat: add tests for .pdf partition
* feat: add metadata_date for .pptx partition
* feat: add metadata_date for .ppt partition
* feat: add tests for .ppt partition
* feat: add tests for .pptx partition
* feat: add metadata_date for .rst partition
* feat: add tests for .rst partition
* fix: get modification date after file checking
* feat: add tests for .rtf partition
* feat: add tests for .rtf partition
* feat: add metadata_date for .txt partition
* fix: rename argument
* feat: add tests for .txt partition
* feat: update doc string rst patrition function
* feat: add metadata_date for .tsv partition
* feat: add tests for .tsv partition
* feat: add metadata_date for .xlsx partition
* feat: add tests for .xlsx partition
* fix: clean up
* feat: add tests for .xml partition
* feat: add tests for .xml partition
* fix: use `or ` instead of `if`
* fix: fix epub tests
* fix: remove not used code
* fix: add try block for getting file name
* fix: applying linter changes
* fix: fix test_partition_file
* feat: add metadata_date for email
* feat: add test for email partition
* feat: add metadata_date for msg
* feat: add tests for msg partition
* feat: update CHANGELOG file
* fix: update partitions doc string
* don't push
* fix: clean up code
* linting, linting, linting
* remove unnecessary example doc
* update version and changelog
* ingest-test-fixtures-update
* set metadata date in test
---------
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#970)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* Revert "Update ingest test fixtures (#970)"
This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2.
* remove date from metadata in outputs
* update docstring ordering
* remove print
* remove print
* remove print
* linting, linting, linting
* fix version and test
* fix changelog
* fix changelog
* update version
---------
Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
* fix: removie prints
* remove unused file
* fix: apply linter
* feat: add post processing filter_element_types
* feat: add tests for filter_element_types
* feat: update changelog
* feat: add doc string for filter_element_types
* fix: change the version
* feat: update documentation
* bump dev version number
* cleanup changelog
* linting, linting, linting
---------
Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: David Potter <potterdavidm@gmail.com>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* contraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-comple
* make tidy
* make tidy
* add param and test in image table extraction
* version and changelog
* need to publish this one for api repo
* add new param skip_infer_table_types
* use warning
* clean up with mapping
* add test for tsv
* fix test fail
* weird change from merge
* doc nit
* don't use mapping
* correct conflict