
Key Concepts
============
Natural Language Processing (NLP) encompasses a broad spectrum of tasks and methodologies. This section introduces some fundamental concepts crucial for most NLP projects that involve Unstructured's products.
Data Ingestion
--------------
Unstructured's ``upstream connectors`` make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you'd like to read more about our upstream connectors, details are available in the `source connectors documentation <https://unstructured-io.github.io/unstructured/ingest/source_connectors.html>`__.
Data Preprocessing
^^^^^^^^^^^^^^^^^^
Before the core analysis, raw data often requires significant preprocessing:
- **Partitioning**: Splitting raw documents into smaller, structured elements (for example, titles, narrative text, and tables).
- **Cleaning**: Removing anomalies, filling in missing values, and eliminating irrelevant or erroneous content.
Preprocessing ensures data integrity and can significantly influence the outcomes of subsequent tasks.
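As a minimal sketch of this step (the file path is a placeholder, and the cleaner flags shown are just a sample of what's available), partitioning and cleaning with the ``unstructured`` library might look like this:

.. code-block:: python

    from unstructured.cleaners.core import clean
    from unstructured.partition.auto import partition

    # Partition a raw document into structured elements (titles,
    # narrative text, tables, and so on).
    elements = partition(filename="example-docs/sample.pdf")  # placeholder path

    # Apply a cleaner to each element's text to strip extra whitespace,
    # dashes, and leftover bullet characters.
    for element in elements:
        element.apply(
            lambda text: clean(text, extra_whitespace=True, dashes=True, bullets=True)
        )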
Chunking Text for Vector Databases
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Vector databases often require data to be in smaller, consistent chunks for efficient storage and retrieval. Chunking involves dividing lengthy texts into smaller segments or chunks, ensuring that each piece retains enough context to be meaningful.
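As one concrete option (assuming a document has already been partitioned as above; ``max_characters`` is just an illustrative cap), Unstructured's ``chunk_by_title`` groups elements into chunks that respect section boundaries:

.. code-block:: python

    from unstructured.chunking.title import chunk_by_title
    from unstructured.partition.auto import partition

    elements = partition(filename="example-docs/sample.pdf")  # placeholder path

    # Combine elements into chunks, starting a new chunk at each detected
    # section title and capping each chunk's length.
    chunks = chunk_by_title(elements, max_characters=1000)
    for chunk in chunks:
        print(chunk.text[:80])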
Embeddings
^^^^^^^^^^
Embeddings convert textual data into fixed-size vectors, preserving semantic context. These vector representations can then be used for a myriad of tasks, including similarity searches, clustering, and classification. Different embeddings might prioritize different aspects of the text, from semantic meaning to sentence structure.
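For illustration, here is a small sketch using the ``sentence-transformers`` package (one embedding option among many; the model name is a common lightweight default):

.. code-block:: python

    from sentence_transformers import SentenceTransformer

    # Load a small, general-purpose embedding model.
    model = SentenceTransformer("all-MiniLM-L6-v2")

    texts = [
        "Unstructured parses documents into elements.",
        "Vector databases store embeddings for fast retrieval.",
    ]

    # encode() maps each string to a fixed-size vector.
    vectors = model.encode(texts)
    print(vectors.shape)  # (2, 384) for this particular model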
Vector Databases
^^^^^^^^^^^^^^^^
Vector databases store embeddings in a manner optimized for high-speed similarity searches. Given a query embedding, these databases can quickly retrieve the most similar vectors, facilitating tasks like recommendation, anomaly detection, and clustering.
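As a minimal sketch using ChromaDB (any comparable vector database would work; the documents and IDs are placeholders):

.. code-block:: python

    import chromadb

    client = chromadb.Client()  # in-memory instance, fine for experimenting
    collection = client.create_collection(name="docs")

    # Chroma embeds these documents with its default embedding function.
    collection.add(
        documents=[
            "Unstructured parses documents into elements.",
            "Vector databases store embeddings for fast retrieval.",
        ],
        ids=["doc-1", "doc-2"],
    )

    # Retrieve the stored document most similar to the query.
    results = collection.query(query_texts=["How are embeddings stored?"], n_results=1)
    print(results["documents"][0])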
Tokens
^^^^^^
Tokenization decomposes texts into smaller units, called tokens. A token might represent a word, part of a word, or even a single character. This process helps in analyzing and processing the text, making it digestible for models and algorithms.
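As a quick illustration with the ``tiktoken`` package:

.. code-block:: python

    import tiktoken

    # Load a byte-pair encoding; "cl100k_base" is used by several OpenAI models.
    encoding = tiktoken.get_encoding("cl100k_base")

    tokens = encoding.encode("Tokenization decomposes texts into smaller units.")
    print(len(tokens))              # number of tokens
    print(encoding.decode(tokens))  # round-trips back to the original string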
Large Language Models (LLMs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
LLMs, like GPT, are trained on vast amounts of data and have the capacity to comprehend and generate human-like text. They have achieved state-of-the-art results across a multitude of NLP tasks and can be fine-tuned to cater to specific domains or requirements.
Retrieval Augmented Generation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Large Language Models (LLMs) like OpenAI's ChatGPT and Anthropic's Claude have revolutionized the AI landscape, but they come with significant drawbacks. One major issue is their static nature: they are "frozen in time".
For instance, ChatGPT's knowledge is limited to September 2021, leaving it blind to any developments after that date. Worse, LLMs will often answer queries about newer information with unwarranted confidence, a failure mode known as "hallucination".
Such errors can be highly detrimental when these models serve critical real-world applications.
Retrieval Augmented Generation (RAG) is a technique designed to counteract these limitations. By pairing an LLM with a RAG pipeline, users can inspect the underlying data sources the model draws on. This transparency means an LLM's claims can be verified for accuracy, which builds trust with users.
Moreover, RAG is cost-effective. Instead of bearing the extensive computational and financial burden of training a custom model or fine-tuning an existing one, RAG can, in many situations, serve as a sufficient alternative. This reduction in resource consumption is particularly beneficial for organizations that lack the means to develop and deploy foundational models from scratch.
A RAG workflow can be broken down into the following steps:
1. **Data ingestion**: The first step is acquiring data from your relevant sources. Unstructured makes this easy with its `data connectors <https://unstructured-io.github.io/unstructured/source_connectors.html>`__.
2. **Data preprocessing and cleaning**: Once you've identified and collected your data sources, it's good practice to remove unnecessary artifacts from the dataset. Unstructured offers a variety of tools for removing unnecessary elements, documented `here <https://unstructured-io.github.io/unstructured/functions.html>`__.
3. **Chunking**: Next, break your text into digestible pieces that your LLM can consume. LangChain, LlamaIndex, and Haystack all offer chunking functionality.
4. **Embedding**: After chunking, convert the text into a numerical representation (a vector embedding) that an LLM can work with. OpenAI, Cohere, and Hugging Face all offer embedding models.
5. **Vector Database**: Next, choose a location for storing your chunked embeddings. There are many vector databases to choose from (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate, and more).
6. **User Prompt**: Embed the user's prompt and retrieve the most relevant chunks from the vector database via similarity search.
7. **LLM Generation**: Pass the prompt plus the retrieved context to the LLM so it can generate a more accurate, grounded response (see the sketch after this list).
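Putting steps 6 and 7 together, a minimal retrieval-and-generation sketch might look like the following (this assumes a ChromaDB collection named ``docs`` already populated with your chunked documents, and the model name is a placeholder):

.. code-block:: python

    import chromadb
    from openai import OpenAI

    # Assumes a collection already populated with chunked documents
    # (see the vector database example above).
    chroma = chromadb.Client()
    collection = chroma.get_or_create_collection(name="docs")

    def answer(question: str, n_results: int = 3) -> str:
        # Step 6: retrieve the most relevant chunks via similarity search.
        hits = collection.query(query_texts=[question], n_results=n_results)
        context = "\n\n".join(hits["documents"][0])

        # Step 7: pass the prompt plus the retrieved context to the LLM.
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; use any chat model you have access to
            messages=[
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return response.choices[0].message.content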
For a full guide on how to implement RAG, check out this `blog post <https://medium.com/unstructured-io/effortless-document-extraction-a-guide-to-using-unstructured-api-and-data-connectors-6c2659eda4af>`__.
These foundational concepts provide the groundwork for more advanced NLP methodologies and pipelines. A proper understanding and implementation of them can vastly improve the outcomes of NLP projects.