########
Chunking
########
Chunking functions in ``unstructured`` use metadata and document elements
detected with ``partition`` functions to split a document into subsections
for use cases such as Retrieval Augmented Generation (RAG).

``chunk_by_title``
------------------
The ``chunk_by_title`` function combines elements into sections by looking
for the presence of titles. When a title is detected, a new section is created.
Tables and non-text elements (such as page breaks or images) are always their
own section.
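
The title rule can be pictured with a toy model. The sketch below is not the
library's implementation: it stands in plain ``(category, text)`` tuples for
``unstructured`` element objects, and the helper name ``split_on_titles`` is
hypothetical.

.. code:: python

  def split_on_titles(elements):
      """Group (category, text) pairs into sections, opening a new one at each title."""
      sections = []
      current = []
      for category, text in elements:
          if category == "Table":
              # A table always forms a section of its own.
              if current:
                  sections.append(current)
                  current = []
              sections.append([text])
          elif category == "Title":
              # A title closes the open section and starts a new one.
              if current:
                  sections.append(current)
              current = [text]
          else:
              current.append(text)
      if current:
          sections.append(current)
      return sections


  elements = [
      ("Title", "Introduction"),
      ("NarrativeText", "Some opening remarks."),
      ("Table", "a | b"),
      ("Title", "Methods"),
      ("NarrativeText", "How the work was done."),
  ]
  sections = split_on_titles(elements)

Here ``sections`` holds three groups: the introduction with its text, the
table by itself, and the methods title with its text.
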

New sections are also created when metadata changes. Examples include a change
in the section of the document or in the page number, or an element that comes
from an attachment instead of from the main document.

If you set ``multipage_sections=True``, ``chunk_by_title`` allows sections
to span multiple pages. This kwarg is ``True`` by default.
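
As a rough illustration of the page-number case, the following toy model
treats each element as a ``(page_number, text)`` pair. The helper
``split_on_page_change`` is hypothetical and only mimics the effect of the
``multipage_sections`` flag; the other metadata checks and the real library
code are not modeled.

.. code:: python

  def split_on_page_change(elements, multipage_sections=True):
      """Group (page_number, text) pairs, optionally breaking at page changes."""
      sections = []
      current = []
      last_page = None
      for page_number, text in elements:
          page_changed = last_page is not None and page_number != last_page
          if page_changed and not multipage_sections and current:
              # Page boundaries end the open section only when the flag is off.
              sections.append(current)
              current = []
          current.append(text)
          last_page = page_number
      if current:
          sections.append(current)
      return sections


  elements = [
      (1, "First paragraph."),
      (1, "Second paragraph."),
      (2, "Third paragraph."),
  ]
  spanning = split_on_page_change(elements, multipage_sections=True)
  per_page = split_on_page_change(elements, multipage_sections=False)

With the flag on, all three paragraphs stay in one section; with it off, the
paragraph on page 2 starts a section of its own.
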

``chunk_by_title`` will start a new section if the length of a section exceeds
``new_after_n_chars``. The default value is ``1500``. Because ``chunk_by_title``
does not split elements, it is possible for a section to exceed that length, for
example if a ``NarrativeText`` element exceeds ``1500`` characters on its own.
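
The soft nature of this limit can be sketched with plain strings. This is a
simplified model under stated assumptions, not the library's code: the helper
``split_by_length`` is hypothetical, section length is taken to be the sum of
the text lengths, and a section is closed only once its length already exceeds
the limit.

.. code:: python

  def split_by_length(texts, new_after_n_chars=1500):
      """Close a section once its length exceeds the limit; never split a text."""
      sections = []
      current = []
      current_length = 0
      for text in texts:
          if current and current_length > new_after_n_chars:
              # The open section is already over the limit, so start a new one.
              sections.append(current)
              current = []
              current_length = 0
          # Elements are kept whole, even when one alone is over the limit.
          current.append(text)
          current_length += len(text)
      if current:
          sections.append(current)
      return sections


  texts = ["a" * 8, "b" * 8, "c" * 30, "d" * 4]
  sections = split_by_length(texts, new_after_n_chars=10)
  lengths = [sum(len(text) for text in section) for section in sections]

In this model the 30-character text ends up in a section by itself and is
never cut, so that section is longer than the limit of 10.
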

Similarly, sections shorter than ``combine_text_under_n_chars`` are combined if
they do not exceed the specified threshold, which defaults to ``500``. This
combines a series of ``Title`` elements that occur one after another, which
sometimes happens in lists that are not detected as ``ListItem`` elements. Set
``combine_text_under_n_chars=0`` to turn off this behavior.
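
One way to picture the combining step is the toy model below. It is an
assumption-laden sketch rather than the library's algorithm: the helper
``combine_small_sections`` is hypothetical, and it merges a section into the
previous one whenever their total length stays under the threshold.

.. code:: python

  def combine_small_sections(sections, combine_text_under_n_chars=500):
      """Merge consecutive sections while the merged text stays under the threshold."""
      combined = []
      for section in sections:
          section_length = sum(len(text) for text in section)
          if combined:
              previous_length = sum(len(text) for text in combined[-1])
              if previous_length + section_length < combine_text_under_n_chars:
                  # Small enough: fold this section into the previous one.
                  combined[-1] = combined[-1] + list(section)
                  continue
          combined.append(list(section))
      return combined


  # Three short title-only sections followed by one longer section.
  sections = [["Apples"], ["Pears"], ["Plums"], ["x" * 40]]
  merged = combine_small_sections(sections, combine_text_under_n_chars=20)
  unmerged = combine_small_sections(sections, combine_text_under_n_chars=0)

The three short titles collapse into a single section, while a threshold of
``0`` leaves every section as it was.
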

The following shows an example of how to use ``chunk_by_title``. You will
see the document chunked into sections instead of elements.

.. code:: python

  from unstructured.partition.html import partition_html
  from unstructured.chunking.title import chunk_by_title

  # Partition the page into elements, then group the elements into sections.
  url = "https://understandingwar.org/backgrounder/russian-offensive-campaign-assessment-august-27-2023-0"
  elements = partition_html(url=url)
  chunks = chunk_by_title(elements)

  for chunk in chunks:
      print(chunk)
      print("\n\n" + "-"*80)
      # Wait for Enter before showing the next chunk.
      input()