mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-11-02 11:03:38 +00:00
docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to https://docs.unstructured.io and cleans up a few sections of the README. Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python 3.12.
parent 293901e144
commit 23e570fc8a
50
README.md
@@ -37,21 +37,7 @@
 <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
 </h2>

-The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/core.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.
-
-<h3 align="center">
-<p>API Announcement!</p>
-</h3>
-
-We are thrilled to announce our newly launched [Unstructured API](https://unstructured-io.github.io/unstructured/api.html), providing the Unstructured capabilities from `unstructured` as an API. Check out the [`unstructured-api` GitHub repository](https://github.com/Unstructured-IO/unstructured-api) to start making API calls. You’ll also find instructions about how to host your own API version.
-
-While access to the hosted Unstructured API will remain free, API Keys are required to make requests. To prevent disruption, get yours [here](https://unstructured.io/api-key) and start using it today! Check out the [`unstructured-api` README](https://github.com/Unstructured-IO/unstructured-api#--) to start making API calls.</p>
-
-#### :rocket: Beta Feature: Chipper Model
-
-We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, you can utilize the `hi_res_model_name=chipper` parameter. Please refer to the documentation [here](https://unstructured-io.github.io/unstructured/api.html#beta-version-hi-res-strategy-with-chipper-model).
-
-As the Chipper model is in beta version, we welcome feedback and suggestions. For those interested in testing the Chipper model, we encourage you to connect with us on [Slack community](https://short.unstructured.io/pzw05l7).
+The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://docs.unstructured.io/open-source/core-functionality/partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

 ## :eight_pointed_black_star: Quick Start

@@ -182,29 +168,23 @@ This starts a docker container with your local repo mounted to `/mnt/local_unstr
 ## :clap: Quick Tour

 ### Documentation
-This README overviews how to install, use and develop the library. For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .
+For more comprehensive documentation, visit https://docs.unstructured.io . You can also learn
+more about our other products on the documentation page, including our SaaS API.

 ### Concepts Guide
+Here are a few pages from the [Open Source documentation page](https://docs.unstructured.io/open-source/introduction/overview)
+that are helpful for new users to review:

-The `unstructured` library includes core functionality for partitioning, chunking, cleaning, and
-staging raw documents for NLP tasks.
-You can see a complete list of available functions and how to use them from the [Core Functionality documentation](https://unstructured-io.github.io/unstructured/core.html).
+- [Quick Start](https://docs.unstructured.io/open-source/introduction/quick-start)
+- [Using the `unstructured` open source package](https://docs.unstructured.io/open-source/core-functionality/overview)
+- [Connectors](https://docs.unstructured.io/open-source/ingest/overview)
+- [Concepts](https://docs.unstructured.io/open-source/concepts/document-elements)
+- [Integrations](https://docs.unstructured.io/open-source/integrations)

-In general, these functions fall into several categories:
-- *Partitioning* functions break raw documents into standard, structured elements.
-- *Cleaning* functions remove unwanted text from documents, such as boilerplate and sentence fragments.
-- *Staging* functions format data for downstream tasks, such as ML inference and data labeling.
-- *Chunking* functions split documents into smaller sections for use in RAG apps and similarity
-search.
-- *Embedding* encoder classes provide an interfaces for easily converting preprocessed text to
-vectors.
-
-The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/ingest/index.html)
-
 ### PDF Document Parsing Example
-The following examples show how to get started with the `unstructured` library. You can parse over a dozen document types with one line of code! Use this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the example below.
-
-The easiest way to parse a document in unstructured is to use the `partition` function. If you use `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional parameters via `pip install unstructured[local-inference]`. Ensure you first install `libmagic` using the instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection) `partition` will always apply the default arguments. If you need advanced features, use a document-specific partitioning function.
+The following examples show how to get started with the `unstructured` library. The easiest way to parse a document in unstructured is to use the `partition` function. If you use `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional dependencies per doc type.
+For example, to install docx dependencies you need to run `pip install "unstructured[docx]"`.
+See our [installation guide](https://docs.unstructured.io/open-source/installation/full-installation) for more details.

 ```python
 from unstructured.partition.auto import partition
@@ -245,7 +225,7 @@ Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
 including document image classification [11,
 ```

-See the [partitioning](https://unstructured-io.github.io/unstructured/core.html#partitioning)
+See the [partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning)
 section in our documentation for a full list of options and instructions on how to use
 file-specific partitioning functions.

@@ -263,7 +243,7 @@ Encountered a bug? Please create a new [GitHub issue](https://github.com/Unstruc
 | Section | Description |
 |-|-|
 | [Company Website](https://unstructured.io) | Unstructured.io product and company info |
-| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
+| [Documentation](https://docs.unstructured.io/) | Full API documentation |
 | [Batch Processing](unstructured/ingest/README.md) | Ingesting batches of documents through Unstructured |

 ## :chart_with_upwards_trend: Analytics
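The updated README text says that `partition` detects the file type and routes the file to a file-specific partitioning function. As a rough illustration of that routing idea only, here is a minimal Python sketch. It is not the library's implementation: the handler functions and the `route_by_type` name are hypothetical, and guessing the type from the file extension with the standard library's `mimetypes` module is an assumption made to keep the example self-contained.

```python
import mimetypes
from typing import Callable, Dict


# Hypothetical stand-ins for file-specific partitioners. The real functions
# in `unstructured` return document elements, not strings.
def _handle_pdf(path: str) -> str:
    return f"pdf handler <- {path}"


def _handle_html(path: str) -> str:
    return f"html handler <- {path}"


def _handle_text(path: str) -> str:
    return f"text handler <- {path}"


# Dispatch table: detected MIME type -> handler for that type.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "application/pdf": _handle_pdf,
    "text/html": _handle_html,
    "text/plain": _handle_text,
}


def route_by_type(path: str) -> str:
    """Guess the file type from the extension and call the matching handler."""
    mime, _encoding = mimetypes.guess_type(path)
    handler = HANDLERS.get(mime)
    if handler is None:
        raise ValueError(f"no handler registered for {path!r} (detected: {mime})")
    return handler(path)


if __name__ == "__main__":
    for name in ("report.pdf", "index.html", "notes.txt"):
        print(route_by_type(name))
```

A dispatch table keeps the entry point small: supporting another document type means registering one more handler, which loosely mirrors the README's note that extra dependencies are installed per document type.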
1
setup.py
@@ -96,6 +96,7 @@ setup(
 "Programming Language :: Python :: 3.9",
 "Programming Language :: Python :: 3.10",
 "Programming Language :: Python :: 3.11",
+"Programming Language :: Python :: 3.12",
 "Topic :: Scientific/Engineering :: Artificial Intelligence",
 ],
 author="Unstructured Technologies",
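The `setup.py` hunk above adds one trove classifier string for Python 3.12. Classifiers are descriptive package metadata; on their own they do not restrict which interpreters can install the package. As a small sketch of how the declared versions could be read back from such a list, the snippet below copies the classifier strings shown in the hunk; the `declared_python_versions` helper is a hypothetical name introduced for illustration and is not part of the repository.

```python
import re
from typing import Iterable, List, Tuple

# Classifier strings copied from the setup.py hunk in this commit.
CLASSIFIERS = [
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]

# Matches only fully specified "major.minor" Python version classifiers.
_VERSION_RE = re.compile(r"^Programming Language :: Python :: (\d+)\.(\d+)$")


def declared_python_versions(classifiers: Iterable[str]) -> List[Tuple[int, int]]:
    """Return the (major, minor) versions named by the classifiers, sorted."""
    versions = []
    for classifier in classifiers:
        match = _VERSION_RE.match(classifier)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    # Sort numerically so that 3.10 comes after 3.9.
    return sorted(versions)


if __name__ == "__main__":
    print(declared_python_versions(CLASSIFIERS))
```

Parsing the versions into integer tuples avoids the string-ordering pitfall where "3.10" would sort before "3.9".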