**Summary**
Remove `unstructured.partition.html.convert_and_partition_html()`. Move
file-type conversion (to HTML) responsibility to each brokering
partitioner that uses that strategy and let them call `partition_html()`
for themselves with the result.
**Additional Context**
Rationale:
- `partition_html()` does not want or need to know which partitioners
might broker partitioning to it.
- Different brokering partitioners have their own methods to convert
their format to HTML and quirks that may be involved for their format.
Avoid coupling them so they can evolve independently.
- The core of the conversion work is already encapsulated in
`unstructured.partition.common.convert_file_to_html_text_using_pandoc()`.
- `convert_and_partition_html()` represents an additional brokering
layer with the entailed complexities of an additional site for default
parameter values to be (mis-)applied and/or dropped and is an additional
location for new parameters to be added.
**Summary**
Remedy the persistent type errors when importing `unstructured`. Give
the partitioner type annotations a general scrubbing while we're at it.
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.
---------
Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
**Summary**
This is the final step in adding pluggable chunking-strategies. It
introduces the `chunk()` function to replace calls to strategy-specific
chunkers in the `@add_chunking_strategy` decorator. The `chunk()`
function then uses a mapping of chunking-strategy names (e.g.
"by_title", "basic") to chunking functions (chunkers) to dispatch the
chunking call. This allows other chunkers to be added at runtime rather
than requiring a code change, which is what "pluggable" chunkers is.
**Additional Information**
- Move the `@add_chunking_strategy` to the new `chunking.dispatch`
module since it coheres strongly with that operation, but publish it
from `chunking(.__init__)` (as it was before) so users don't couple to
the way we organize the chunking sub-package. Also remove the third
level of nesting as it's unrequired in this case.
- Add unit tests for the `@add_chunking_strategy` decorator which was
previously uncovered by any direct test.
The new "basic" chunking strategy and overlap options need to be
available from the ingest CLI. An ingest test of those features is also
welcome, both to verify the ingest feature and to defend against
regressions in the chunking code.
Add a local ingest test exercising both the "basic" chunking strategy
and intra-chunk overlap. Since there is no new source connector
involved, use the local ingest source and destination. Update
documentation to suit, filling in some details that hadn't made it into
the docs yet.
The chunking subpackage `unstructured.chunking` currently contains only
the `title` module and the `@add_chunking_strategy()` decorator is
located in that module even though it has no special relationship to the
`by_title` chunking strategy.
Move it to the `__init__.py` module such that it is exported from
`unstructured.chunking`. Adjust all references, pretty much one per
partitioner, to import it from there.
This prepares the way for further separation of the chunking package
into modules, including a new `character` module for the `by_character`
chunking strategy.
* **Removed `ebooklib` as a dependency** `ebooklib` is licensed under
AGPL3, which is incompatible with the Apache 2.0 license. Thus it is
being removed.
### Summary
Closes#1534 and #1535
Detects document language using `langdetect` package.
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
This PR adds support for `source` property from
`unstructured_inference`, allowing the user to be able to see the origin
of the data under `detection_origin`field environment variable
UNSTRUCTURED_INCLUDE_DEBUG_METADATA=true
In order to try this feature you can use this code:
```
from unstructured.partition.pdf import partition_pdf_or_image
yolox_elements = partition_pdf_or_image(filename='example-docs/loremipsum-flat.pdf', strategy='hi_res', model_name='yolox')
sources = [e.detection_origin for e in yolox_elements]
print(sources)
```
And will print 'yolox' as source for all the elements
This PR adds an arg to the html partition flow called `source_format` if
anything other than "html" we will return non-HTML elements to conform
with the file type we received.
addresses: https://github.com/Unstructured-IO/unstructured/issues/726
### Summary
Partial solution to #1185.
Related to #1222.
Creates decorator from `chunk_by_title` cleaning brick.
Breaks a document into sections based on the presence of Title elements.
Also starts a new section under the following conditions:
- If metadata changes, indicating a change in section or page or a
switch to processing attachments. If `multipage_sections=True`, sections
can span pages. `multipage_sections` defaults to True.
- If the length of the section exceeds `new_after_n_chars` characters.
The default is 1500. The **chunking function does not split individual
elements**, so it's possible for a section to exceed that threshold if
an individual element if over `new_after_n_chars characters`, which
could occur with a long NarrativeText element.
Combines sections under these conditions
- Sections under `combine_under_n_chars` characters are combined. The
default is 500.
### Testing
from unstructured.partition.html import partition_html
url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
chunks = partition_html(url=url, chunking_strategy="by_title")
for chunk in chunks:
print(chunk)
print("\n\n" + "-"*80)
input()
### Summary
Closes#1007. Adds a deprecation warning for the `file_filename` kwarg
to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
Also catches a warning in `ebooklib` that we do not want to emit in
`unstructured`.
### Testing
```python
from unstructured.partition.auto import partition
filename = "example-docs/winter-sports.epub"
# Should not emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should emit a warning
with open(filename, "rb") as f:
elements = partition(file=f, file_filename="test.epub")
# Should be test.epub
elements[0].metadata.filename
# Should raise an error
with open(filename, "rb") as f:
elements = partition(file=f, metadata_filename="test.epub", file_filename="test.epub")
```
* fix conflicts
* add tests and clean metadata_filename in partitions
* fix test_email and remove comments
* make tidy/check
* update changelog and version
* fix tests
* make tidy again
* add include_metadata kwarg and tests to parsers
add exclude_metadata to docx
add test for doc to exclude metadata
add include_metadata kwarg to email
add include_metadata kwarg to epub
add include_metadata kwarg to json
add exclude_metadata tests to md
add include_metadata kwarg and tests for msg parse
add include_metadata kwarg and tests for odt parse
add include_metadata kwarg and tests for org parse
add include_metadata kwarg and tests for ppt and pptx parse
add include_metadata kwarg and tests for rst parse
add include_metadata kwarg and tests for rtf parse
add include_metadata tests for text parse
add include_metadata tests for tsv parse
add include_metadata tests for xlsx parse
add include_metadata tests for xml parse
* WIP add include_metadata to partition_pdf
* add include_metadata tests to partition_pdf
* make tidy/check
* update changelog and version
* change test asserts and move docstring logic to process_metadata
* make tidy
* fix tests asserts
* linting, linting, linting
* sync versions
* skip api call test not on main
---------
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
* first pass on regex metadata
* fix typing for regex metadata
* add dataclass back in
* add decorators
* fix tests
* update docs
* add tests for regex metadata
* add process metadata to tsv
* changelog and version
* docs typos
* consolidate to using a single kwarg
* fix test
Adds filetype to metadata. I've created a decorator that adds metadata to a list of elements. This replaces some existing boilerplate, but also adds a nice layered approach to determining the filetype. Since in some cases several partition_ functions handle a file in various formats, the partition function that first touches a file will be the last one to alter its metadata, resulting in the correct filetype metadata.
Tests are added to make sure:
* When partition is used, any content type or auto file type detection will override file-specific partition function metadata
* Both auto and file-specific partitioning gives the desired filetype metadata
Won't work with image files currently... the plumbing is there to use the image format inferred by PIL, but we need to pull in the fix from this PR to unstructured-inference .
* refactor epub; add rtf
* added test for rtf files
* filetype detection for rtf files
* add rtf to auto
* update docs for group_broken_paragraphs
* add rtf to docs
* update file list in readme
* update stage_for_transformers docs
* changelog and version bump
* skip rtf if in docker
* skip test if rtf not supported
* docs tweaks