mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-12-27 15:13:35 +00:00
parent
ba70828f4a
commit
64b4287308
10
README.md
@@ -33,7 +33,7 @@
 <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
 </h2>

-The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/bricks.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and is efficient in transforming unstructured data into structured outputs.
+The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/bricks.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

 <h3 align="center">
 <p>API Announcement!</p>
@@ -205,14 +205,14 @@ Weining Li 5
 Abstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural
 networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation.
 However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy
-reuse of im- portant innovations by a wide audience. Though there have been on-going efforts to improve reusability and
+reuse of important innovations by a wide audience. Though there have been ongoing efforts to improve reusability and
 simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none
 of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA
 is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper
-introduces LayoutParser , an open-source library for streamlining the usage of DL in DIA research and applica- tions.
+introduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications.
 The core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models
-for layout de- tection, character recognition, and many other document processing tasks. To promote extensibility,
-LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digiti- zation
+for layout detection, character recognition, and many other document processing tasks. To promote extensibility,
+LayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization
 pipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in
 real-word use cases. The library is publicly available at https://layout-parser.github.io

@@ -2,7 +2,7 @@ Unstructured Core Library
 =========================

 The ``unstructured`` library is designed to help preprocess structure unstructured text documents
-for use in downstream machine learning tasks. Examples of documents that can be processes
+for use in downstream machine learning tasks. Examples of documents that can be processed
 using the ``unstructured`` library include PDFs, XML and HTML documents.

 Library Documentation
@@ -18,7 +18,7 @@ Library Documentation
 Learn more about partitioning, cleaning, and staging bricks, including advanced usage patterns.

 :doc:`upstream_connectors`
-Connect to your favortite data storage platforms for an efortless batch processing of your files.
+Connect to your favorite data storage platforms for an effortless batch processing of your files.

 :doc:`metadata`
 Learn more about how metadata is tracked in the ``unstructured`` library.
@@ -60,13 +60,13 @@ Our integration with `LangChain <https://github.com/hwchase17/langchain>`_ makes
 loader.load()

 Checkout the `LangChain docs <https://python.langchain.com/en/latest/modules/indexes/document_loaders.html>`_ for more
-examples about how to use Unstructured data loders.
+examples about how to use Unstructured data loaders.


 ``Integration with LlamaIndex``
 --------------------------------

-To use ``Unstructured.io File Loader`` you will need to have `LlamaIndex <https://github.com/jerryjliu/llama_index>`_ 🦙 (GPT Index) installed in your environment. Just ``pip install llama-index`` and then pass in a ``Path`` to a local file. Optionally, you may specify split_documents if you want each element generated by ``unstructured`` to be placed in a separate document. Here is a simple example on how to use it:
+To use ``Unstructured.io File Loader`` you will need to have `LlamaIndex <https://github.com/jerryjliu/llama_index>`_ 🦙 (GPT Index) installed in your environment. Just ``pip install llama-index`` and then pass in a ``Path`` to a local file. Optionally, you may specify split_documents if you want each element generated by ``unstructured`` to be placed in a separate document. Here is a simple example of how to use it:

 .. code:: python
