mirror of https://github.com/Unstructured-IO/unstructured.git
commit ad561b7939 (parent f048695a55)
Fixed broken links and improved readability in key concepts page (#2533)
To test: cd docs && make html

@@ -1,17 +1,17 @@
-## 0.12.5-dev5
+## 0.12.5-dev6

### Enhancements

### Features

* **Add OctoAI embedder** Adds support for embeddings via OctoAI.

### Fixes

-* **Add OctoAI embedder** Adds support for embeddings via OctoAI.
* **Fix `check_connection` in opensearch, databricks, postgres, azure connectors **
* **Fix don't treat plain text files with double quotes as JSON ** If a file can be deserialized as JSON but it deserializes as a string, treat it as plain text even though it's valid JSON.
* **Fix `check_connection` in opensearch, databricks, postgres, azure connectors **
* **Fix cluster of bugs in `partition_xlsx()` that dropped content.** Algorithm for detecting "subtables" within a worksheet dropped table elements for certain patterns of populated cells such as when a trailing single-cell row appeared in a contiguous block of populated cells.
+* **Improved documentation**. Fixed broken links and improved readability on `Key Concepts` page.
* **Rename `OpenAiEmbeddingConfig` to `OpenAIEmbeddingConfig`.
## 0.12.4
@@ -1,12 +1,12 @@
Key Concepts
============

-Natural Language Processing (NLP) encompasses a broad spectrum of tasks and methodologies. This section introduces some fundamental concepts crucial for most NLP projects that involve Unstructured's products.
+Natural Language Processing (NLP) encompasses various tasks and methodologies. This section introduces fundamental concepts crucial for most NLP projects involving Unstructured products.

Data Ingestion
**************

-Unstructured's ``Source Connectors`` make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you'd like to read more on our upstream connectors, you can find details `here <https://unstructured-io.github.io/unstructured/ingest/source_connectors.html>`__.
+Unstructured ``Source Connectors`` make data ingestion easy. They ensure that your data is accessible, up to date, and usable for any downstream task. If you'd like to read more on our source connectors, you can find details `here <https://unstructured-io.github.io/unstructured/ingest/source_connectors.html>`__.
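
As a rough sketch, a small ingestion run might look like the following, which shells out to the ``unstructured-ingest`` CLI from Python. The bucket URL is a placeholder, and flag names vary between releases, so check ``unstructured-ingest s3 --help`` for the options your version supports.

.. code:: python

    import subprocess

    # Pull raw documents from an S3 bucket (placeholder URL) into a local
    # directory, where they can be partitioned and cleaned downstream.
    subprocess.run(
        [
            "unstructured-ingest",
            "s3",
            "--remote-url", "s3://my-bucket/raw-docs/",
            "--output-dir", "local-ingest-output",
            "--anonymous",
        ],
        check=True,
    )
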
Data Preprocessing
******************

@@ -15,7 +15,7 @@ Before the core analysis, raw data often requires significant preprocessing:

- **Partitioning**: Segregating data into smaller, manageable segments or partitions.
-- **Cleaning**: Removing anomalies, filling missing values, and eliminating any irrelevant or erroneous information.
+- **Cleaning**: Removing anomalies, filling in missing values, and eliminating irrelevant or erroneous information.

Preprocessing ensures data integrity and can significantly influence the outcomes of subsequent tasks.
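
In code, partitioning and cleaning are typically a few lines. The sketch below uses ``partition`` and the ``clean`` helper from this library; the document path is a placeholder.

.. code:: python

    from unstructured.cleaners.core import clean
    from unstructured.partition.auto import partition

    # Partition a document (placeholder path) into typed elements such as
    # Title and NarrativeText.
    elements = partition(filename="example-docs/quarterly-report.pdf")

    # Clean each element's text: strip bullets and collapse extra whitespace.
    for element in elements:
        element.apply(lambda text: clean(text, bullets=True, extra_whitespace=True))
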
@@ -27,7 +27,7 @@ Vector databases often require data to be in smaller, consistent chunks for effi

Embeddings
**********

-Embeddings convert textual data into fixed-size vectors, preserving semantic context. These vector representations can then be used for a myriad of tasks, including similarity searches, clustering, and classification. Different embeddings might prioritize different aspects of the text, from semantic meaning to sentence structure.
+Embeddings convert textual data into fixed-size vectors, preserving semantic context. These vector representations can be used for many tasks, including similarity searches, clustering, and classification. Different embeddings prioritize different aspects of the text, from semantic meaning to sentence structure.
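
With this library, producing embeddings for partitioned elements looks roughly like the sketch below. The document path and API key are placeholders, and other encoders follow the same pattern.

.. code:: python

    from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder
    from unstructured.partition.auto import partition

    elements = partition(filename="example-docs/quarterly-report.pdf")  # placeholder path

    # Attach a fixed-size vector to each element; embed_query vectorizes a
    # search string the same way, so the two can be compared directly.
    encoder = OpenAIEmbeddingEncoder(config=OpenAIEmbeddingConfig(api_key="YOUR_API_KEY"))
    elements = encoder.embed_documents(elements=elements)
    query_vector = encoder.embed_query(query="quarterly revenue")
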
Vector Databases
****************

@@ -39,40 +39,37 @@ These foundational concepts provide the groundwork for more advanced NLP methodo

Tokens
******

-Tokenization decomposes texts into smaller units, called tokens. A token might represent a word, part of a word, or even a single character. This process helps in analyzing and processing the text, making it digestible for models and algorithms.
+Tokenization decomposes texts into smaller units called tokens. A token might represent a word, part of a word, or even a single character. This process helps analyze and process the text, making it digestible for models and algorithms.
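
To see tokenization concretely, here is a small sketch using the third-party ``tiktoken`` library (one tokenizer among many, chosen purely for illustration):

.. code:: python

    import tiktoken

    # cl100k_base is the encoding used by several OpenAI chat models.
    encoding = tiktoken.get_encoding("cl100k_base")
    token_ids = encoding.encode("Unstructured makes your data LLM-ready.")

    # Decode each id individually to see the pieces the text was split into.
    print([encoding.decode([token_id]) for token_id in token_ids])
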
Large Language Models (LLMs)
****************************

-LLMs, like GPT, are trained on vast amounts of data and have the capacity to comprehend and generate human-like text. They have achieved state-of-the-art results across a multitude of NLP tasks and can be fine-tuned to cater to specific domains or requirements.
+LLMs, like GPT, are trained on vast amounts of data and can comprehend and generate human-like text. They have achieved state-of-the-art results across many NLP tasks and can be fine-tuned to cater to specific domains or requirements.

Retrieval Augmented Generation (RAG)
************************************

-Large Language Models (LLMs) like OpenAI's ChatGPT and Anthropic's Claude have revolutionized the AI landscape with their prowess. However, they inherently suffer from significant drawbacks. One major issue is their static nature, which means they're "frozen in time".
-For instance, ChatGPT's knowledge is limited up to September 2021, leaving it blind to any developments or information post that period. Despite this, LLMs might often respond to newer queries with unwarranted confidence, a phenomenon known as "hallucination".
-Such errors can be highly detrimental, especially when these models serve critical real-world applications.
+Large Language Models (LLMs) like OpenAI's ChatGPT and Anthropic's Claude have revolutionized the AI landscape with their prowess. However, they inherently suffer from significant drawbacks. One major issue is their static nature, which means they're "frozen in time." Despite this, LLMs might often respond to newer queries with unwarranted confidence, a phenomenon known as "hallucination."
+Such errors can be highly detrimental, especially when these models serve critical real-world applications.

-Retrieval Augmented Generation (RAG) is a groundbreaking technique designed to counteract the limitations of foundational LLMs. By pairing an LLM with a RAG pipeline, we can enable users to access the underlying data sources that the model uses. This transparent approach not
-only ensures that an LLM's claims can be verified for accuracy but also builds a trust factor among users.
+Retrieval Augmented Generation (RAG) is a groundbreaking technique designed to counteract the limitations of foundational LLMs. By pairing an LLM with a RAG pipeline, we can enable users to access the underlying data sources that the model uses. This transparent approach ensures that an LLM's claims can be verified for accuracy and builds a trust factor among users.

-Moreover, RAG offers a cost-effective solution. Instead of bearing the extensive computational and financial burdens of training custom models or finetuning existing ones, RAG can, in many situations, serve as a sufficient alternative. This reduction in resource consumption
-is particularly beneficial for organizations that lack the means to develop and deploy foundational models from scratch.
+Moreover, RAG offers a cost-effective solution. Instead of bearing the extensive computational and financial burdens of training custom models or fine-tuning existing ones, RAG can, in many situations, serve as a sufficient alternative. This reduction in resource consumption is particularly beneficial for organizations that lack the means to develop and deploy foundational models from scratch.

A RAG workflow can be broken down into the following steps:

-1. **Data ingestion**: The first step is acquiring data from your relevant sources. At Unstructured we make this super easy with our `data connectors <https://unstructured-io.github.io/unstructured/source_connectors.html>`__.
+1. **Data ingestion**: The first step is acquiring data from your relevant sources. We make this easy with our `source connectors <https://unstructured-io.github.io/unstructured/ingest/source_connectors.html>`__.

-2. **Data preprocessing and cleaning**: Once you've identified and collected your data sources a good practice is to remove any unnecessary artifacts within the dataset. At Unstructured we have a variety of different tools to remove unneccesary elements. Found `here <https://unstructured-io.github.io/unstructured/functions.html>`_
+2. **Data preprocessing and cleaning**: Once you've identified and collected your data sources, removing any unnecessary artifacts within the dataset is a good practice. At Unstructured, we have various tools for data processing in our `core functionalities <https://unstructured-io.github.io/unstructured/core.html>`__.

-3. **Chunking**: The next step is to break your text down into digestable pieces for your LLM to be able to consume. LangChain, Llama Index and Haystack offer chunking funcionalities.
+3. **Chunking**: The next step is to break your text into digestible pieces for your LLM to consume. We provide basic and context-aware chunking strategies; please refer to the documentation `here <https://unstructured-io.github.io/unstructured/core/chunking.html>`__.

-4. **Embedding**: After chunking, you will need to convert the text into a numerical representation (vector embedding) that a LLM can understand. OpenAI, Cohere, and Hugging Face all offer embedding models.
+4. **Embedding**: After chunking, you must convert the text into a numerical representation (vector embedding) that an LLM can understand. To use the various embedding models with Unstructured tools, please refer to `this page <https://unstructured-io.github.io/unstructured/core/embedding.html>`__.

-5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are lots of options to choose from for your vector database (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate and more).
+5. **Vector Database**: The next step is to choose a location for storing your chunked embeddings. There are many options for your vector database (ChromaDB, Milvus, Pinecone, Qdrant, Weaviate, and more). For a complete list of Unstructured ``Destination Connectors``, please visit `this page <https://unstructured-io.github.io/unstructured/ingest/destination_connectors.html>`__.

6. **User Prompt**: Take the user prompt and grab the most relevant chunks of information in the vector database via similarity search.

-7. **LLM Generation**: Once you've retrieved your relevant chunks you pass the prompt + the context to the LLM for the LLM to generate a more accurate response.
+7. **LLM Generation**: Once you've retrieved your relevant chunks, you pass the prompt and the retrieved context to the LLM so it can generate a more accurate response.
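
Putting the seven steps together, a minimal end-to-end sketch might look like the following. The document path, API key, and model name are placeholders, and a simple in-memory cosine-similarity ranking stands in for a real vector database:

.. code:: python

    import numpy as np
    from openai import OpenAI
    from unstructured.chunking.title import chunk_by_title
    from unstructured.embed.openai import OpenAIEmbeddingConfig, OpenAIEmbeddingEncoder
    from unstructured.partition.auto import partition

    # Steps 1-2: ingest and partition a document (placeholder path).
    elements = partition(filename="example-docs/employee-handbook.pdf")

    # Step 3: chunk the elements so each piece fits in the LLM's context window.
    chunks = chunk_by_title(elements, max_characters=1000)

    # Step 4: embed every chunk, plus the user's question.
    encoder = OpenAIEmbeddingEncoder(config=OpenAIEmbeddingConfig(api_key="YOUR_API_KEY"))
    chunks = encoder.embed_documents(elements=chunks)
    question = "What is the vacation policy?"
    query_vec = np.array(encoder.embed_query(query=question))

    # Steps 5-6: a stand-in for a vector database: rank chunks by cosine
    # similarity to the query and keep the top three.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(chunks, key=lambda c: cosine(np.array(c.embeddings), query_vec), reverse=True)
    context = "\n\n".join(chunk.text for chunk in ranked[:3])

    # Step 7: pass the prompt plus retrieved context to the LLM.
    client = OpenAI(api_key="YOUR_API_KEY")
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Answer using only this context:\n\n{context}\n\nQuestion: {question}",
        }],
    )
    print(response.choices[0].message.content)
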
-For a full guide on how to implement RAG check out this `blog post <https://medium.com/unstructured-io/effortless-document-extraction-a-guide-to-using-unstructured-api-and-data-connectors-6c2659eda4af>`__
+For a complete guide on how to implement RAG, check out this `blog post <https://medium.com/unstructured-io/effortless-document-extraction-a-guide-to-using-unstructured-api-and-data-connectors-6c2659eda4af>`__

@@ -1 +1 @@
-__version__ = "0.12.5-dev5" # pragma: no cover
+__version__ = "0.12.5-dev6" # pragma: no cover