mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-09 18:15:55 +00:00

########
Chunking
########

Chunking functions in ``unstructured`` use metadata and document elements
detected with ``partition`` functions to split a document into subsections
for use cases such as Retrieval Augmented Generation (RAG).

``chunk_by_title``
------------------

The ``chunk_by_title`` function combines elements into sections by looking
for the presence of titles. When a title is detected, a new section is created.
Tables and non-text elements (such as page breaks or images) are always their
own section.

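The title rule above can be sketched in plain Python. This is a simplified
model and not the library's implementation: the ``(category, text)`` tuples
and the ``split_on_titles`` helper are hypothetical stand-ins for real
elements and for ``chunk_by_title``.

```python
from typing import List, Tuple

# Hypothetical stand-in for a library element: (category, text).
Element = Tuple[str, str]

# Categories that are always placed in a section of their own.
STANDALONE = ("Table", "PageBreak", "Image")


def split_on_titles(elements: List[Element]) -> List[List[Element]]:
    """Group elements into sections, opening a new section at each title."""
    sections: List[List[Element]] = []
    current: List[Element] = []
    for element in elements:
        category = element[0]
        if category in STANDALONE:
            # Close the open section, then emit the element by itself.
            if current:
                sections.append(current)
                current = []
            sections.append([element])
        elif category == "Title":
            # A title closes the open section and starts the next one.
            if current:
                sections.append(current)
            current = [element]
        else:
            current.append(element)
    if current:
        sections.append(current)
    return sections


elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "Some opening text."),
    ("Table", "a | b"),
    ("NarrativeText", "Text after the table."),
    ("Title", "Methods"),
    ("NarrativeText", "How it was done."),
]
sections = split_on_titles(elements)
```

In this sketch the table ends up alone in its section, and the text that
follows it starts a fresh section until the next title appears.
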
New sections are also created if changes in metadata occur. Examples of when
this occurs include when the section of the document or the page number changes
or when an element comes from an attachment instead of from the main document.
If you set ``multipage_sections=True``, ``chunk_by_title`` will allow for sections
that span multiple pages. This kwarg is ``True`` by default.

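As a rough illustration of the metadata rule, the check below decides whether
two neighboring elements may share a section. It is a sketch under stated
assumptions: the ``Meta`` dataclass, its field names, and the
``metadata_differs`` helper are hypothetical and only mirror the cases listed
above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Meta:
    """Hypothetical subset of element metadata used for the comparison."""

    section: Optional[str] = None
    page_number: Optional[int] = None
    attached_to_filename: Optional[str] = None


def metadata_differs(prev: Meta, cur: Meta, multipage_sections: bool = True) -> bool:
    """Return True if ``cur`` should start a new section after ``prev``."""
    if prev.section != cur.section:
        return True
    if prev.attached_to_filename != cur.attached_to_filename:
        return True
    # A page change only matters when sections may not span pages.
    if not multipage_sections and prev.page_number != cur.page_number:
        return True
    return False


page_one = Meta(page_number=1)
page_two = Meta(page_number=2)
attachment = Meta(page_number=1, attached_to_filename="email.eml")

spans_pages = metadata_differs(page_one, page_two)
breaks_on_page = metadata_differs(page_one, page_two, multipage_sections=False)
breaks_on_attachment = metadata_differs(page_one, attachment)
```

With the default ``multipage_sections=True`` the page change alone does not
force a break in this model, while the attachment always does.
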
``chunk_by_title`` will start a new section if the length of a section exceeds
``new_after_n_chars``. The default value is ``1500``. Because ``chunk_by_title``
does not split elements, it is possible for a section to exceed that length, for
example if a ``NarrativeText`` element exceeds ``1500`` characters on its own.

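The length rule can be modeled with a small grouping loop. This is a minimal
sketch, assuming that a section is closed once its accumulated text passes the
limit; the ``group_by_length`` helper is hypothetical, and the exact point at
which the library checks the limit may differ.

```python
from typing import List


def group_by_length(texts: List[str], new_after_n_chars: int = 1500) -> List[List[str]]:
    """Group element texts, closing a section once it exceeds the limit."""
    sections: List[List[str]] = []
    current: List[str] = []
    length = 0
    for text in texts:
        # Elements are never split, so the whole text joins the section.
        current.append(text)
        length += len(text)
        if length > new_after_n_chars:
            sections.append(current)
            current = []
            length = 0
    if current:
        sections.append(current)
    return sections


texts = ["a" * 600, "b" * 600, "c" * 600, "d" * 2000, "e" * 100]
sections = group_by_length(texts)
section_lengths = [sum(len(text) for text in section) for section in sections]
```

Here the 2000-character element is kept whole, so its section is longer than
``new_after_n_chars`` even though the limit was applied.
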
Similarly, consecutive sections whose length is under
``combine_text_under_n_chars`` are combined into a single section. The threshold
defaults to ``500``. This will combine a series of ``Title`` elements that occur
one after another, which sometimes happens in lists that are not detected as
``ListItem`` elements. Set ``combine_text_under_n_chars=0`` to turn off this
behavior.

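The combining step can be pictured as a second pass over the sections. The
sketch below is an assumption about the general idea, not the library's
algorithm: the hypothetical ``combine_small_sections`` helper keeps merging
sections until the merged text reaches the threshold.

```python
from typing import List


def combine_small_sections(
    sections: List[List[str]], combine_text_under_n_chars: int = 500
) -> List[List[str]]:
    """Merge consecutive sections until the merged text reaches the threshold."""
    combined: List[List[str]] = []
    pending: List[str] = []
    pending_length = 0
    for section in sections:
        pending.extend(section)
        pending_length += sum(len(text) for text in section)
        # With a threshold of 0 every section is emitted immediately,
        # which leaves the input sections unchanged.
        if pending_length >= combine_text_under_n_chars:
            combined.append(pending)
            pending = []
            pending_length = 0
    if pending:
        combined.append(pending)
    return combined


# Three short title-only sections followed by two long ones.
sections = [["Apples"], ["Bananas"], ["Cherries"], ["x" * 600], ["y" * 700]]
merged = combine_small_sections(sections)
unmerged = combine_small_sections(sections, combine_text_under_n_chars=0)
```

In this model the three short titles are folded into the section that follows
them, and passing ``0`` leaves every section as it was.
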
The following shows an example of how to use ``chunk_by_title``. You will
see the document chunked into sections instead of elements.


.. code:: python

  from unstructured.partition.html import partition_html
  from unstructured.chunking.title import chunk_by_title

  url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
  # Partition the page into elements, then group the elements into sections.
  elements = partition_html(url=url)
  chunks = chunk_by_title(elements)

  for chunk in chunks:
      print(chunk)
      print("\n\n" + "-"*80)
      input()  # Wait for Enter before printing the next chunk.