mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-24 09:26:08 +00:00

Closes https://github.com/Unstructured-IO/unstructured/issues/1414
Closes #2039
This PR:
- Uses Pinecone python cli to implement a destination connector for
Pinecone and provides the ingest readme requirements
[(here)](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest#the-checklist)
for the connector
- Updates documentation for the s3 destination connector
- Alphabetically sorts setup.py contents
- Updates logs for the chunking node in ingest pipeline
- Adds a baseline session handle implementation for destination
connectors, to be able to parallelize their operations
- For the
[bug](https://github.com/Unstructured-IO/unstructured/issues/1892)
related to persisting element data to ingest embedding nodes; this PR
tests the
[solution](https://github.com/Unstructured-IO/unstructured/pull/1893)
with its ingest test
- Solves a bug on ingest chunking params with [bugfix on chunking params
and implementing related test](69e1949a6f)
---------
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
21 lines
2.1 KiB
ReStructuredText
Chunking Configuration
=========================

A shared chunking configuration is a key part of the data processing pipeline, particularly
when creating embeddings and populating vector databases with the results. It defines the
parameters that govern how text is segmented into chunks, whether at the document, paragraph,
or sentence level, and so determines the size and structure of those chunks. Tuning these
parameters lets users match chunk granularity to the requirements of downstream tasks such as
embedding generation and vector database population. Well-sized chunks are more cohesive and
carry more context, which improves both the quality of the embeddings and the relevance of
results retrieved from a vector database.

Configs
---------------------

* ``chunk_elements (default False)``: Boolean flag indicating whether to run chunking as part of the ingest process.
* ``multipage_sections (default True)``: If True, sections can span multiple pages.
* ``combine_text_under_n_chars (default 500)``: Combines elements (for example, a series of titles) until a section reaches a length of n characters. Setting it to ``max_characters`` combines chunks whenever space allows. Specifying 0 suppresses combining of small chunks. This value is capped at the ``new_after_n_chars`` value, since a higher value would not change the parameter's effect.
* ``new_after_n_chars (default 1500)``: Cuts off new sections once they reach a length of n characters (soft max). Setting it equal to ``max_characters``, as the defaults do, effectively disables any soft window. Specifying 0 causes each element to appear in a chunk by itself (although an element with text longer than ``max_characters`` will still be split into two or more chunks).
* ``max_characters (default 1500)``: Splits element ``text`` and ``text_as_html`` (if present) into chunks of at most n characters (hard max).
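
To make the interaction between the soft and hard limits more concrete, the following is a
minimal, self-contained sketch. It is not the library's implementation: ``ChunkingSettings``
and ``sketch_chunk`` are hypothetical names introduced only for illustration, and the greedy
loop is a simplification that models ``new_after_n_chars`` and ``max_characters`` but ignores
section boundaries, ``multipage_sections``, and the combining step itself. The only behavior it
takes from the descriptions above is the listed defaults, the cap on
``combine_text_under_n_chars``, and the meaning of the soft and hard maxima.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkingSettings:
    """Hypothetical stand-in mirroring the options listed above (not the library's class)."""

    chunk_elements: bool = False
    multipage_sections: bool = True
    combine_text_under_n_chars: int = 500
    new_after_n_chars: int = 1500
    max_characters: int = 1500

    def effective_combine_limit(self) -> int:
        # The combine threshold is capped at the soft max, as described above.
        return min(self.combine_text_under_n_chars, self.new_after_n_chars)


def sketch_chunk(texts: List[str], cfg: ChunkingSettings) -> List[str]:
    """Simplified greedy illustration of the soft and hard limits (an assumption, not the real algorithm)."""
    chunks: List[str] = []
    current = ""
    for text in texts:
        if not text:
            continue
        # Hard max: text longer than max_characters is split into several pieces.
        pieces = [
            text[i : i + cfg.max_characters]
            for i in range(0, len(text), cfg.max_characters)
        ]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            soft_limit_reached = bool(current) and len(current) >= cfg.new_after_n_chars
            hard_limit_exceeded = len(candidate) > cfg.max_characters
            if current and (soft_limit_reached or hard_limit_exceeded):
                # Close the current chunk and start a new one with this piece.
                chunks.append(current)
                current = piece
            else:
                current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # new_after_n_chars=0: every element ends up in its own chunk,
    # and the 25-character element is still split by the hard max of 10.
    isolated = ChunkingSettings(new_after_n_chars=0, max_characters=10)
    print(sketch_chunk(["abc", "de", "x" * 25], isolated))

    # A soft max of 8 closes a chunk once it holds at least 8 characters.
    soft = ChunkingSettings(new_after_n_chars=8, max_characters=10)
    print(sketch_chunk(["abc", "de", "fgh", "ij"], soft))
```

In this sketch, no chunk ever exceeds ``max_characters``, while ``new_after_n_chars`` only decides
when a chunk stops accepting further text.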