mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2025-11-03 19:43:24 +00:00
	docs: cleanup readme; add python 3.12 (#3120)
### Summary

Updates documentation references in the README to point to https://docs.unstructured.io and cleans up a few sections of the README. Specifically:

- Removes an old API announcement
- Removes the section mentioning Chipper as a beta feature. Chipper is only available through the SaaS API.

Also adds a Python 3.12 tag to `setup.py` since we now support Python 3.12.
parent 293901e144
commit 23e570fc8a

README.md (50 lines changed)
@@ -37,21 +37,7 @@
  <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h2>

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://unstructured-io.github.io/unstructured/core.html#partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

<h3 align="center">
  <p>API Announcement!</p>
</h3>

We are thrilled to announce our newly launched [Unstructured API](https://unstructured-io.github.io/unstructured/api.html), providing the Unstructured capabilities from `unstructured` as an API. Check out the [`unstructured-api` GitHub repository](https://github.com/Unstructured-IO/unstructured-api) to start making API calls. You’ll also find instructions about how to host your own API version.

While access to the hosted Unstructured API will remain free, API Keys are required to make requests. To prevent disruption, get yours [here](https://unstructured.io/api-key) and start using it today! Check out the [`unstructured-api` README](https://github.com/Unstructured-IO/unstructured-api#--) to start making API calls.</p>

#### :rocket: Beta Feature: Chipper Model

We are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, you can utilize the `hi_res_model_name=chipper` parameter. Please refer to the documentation [here](https://unstructured-io.github.io/unstructured/api.html#beta-version-hi-res-strategy-with-chipper-model).

As the Chipper model is in beta version, we welcome feedback and suggestions. For those interested in testing the Chipper model, we encourage you to connect with us on [Slack community](https://short.unstructured.io/pzw05l7).
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and [many more](https://docs.unstructured.io/open-source/core-functionality/partitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

## :eight_pointed_black_star: Quick Start

@@ -182,29 +168,23 @@ This starts a docker container with your local repo mounted to `/mnt/local_unstr
## :clap: Quick Tour

### Documentation
This README overviews how to install, use and develop the library. For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .
For more comprehensive documentation, visit https://docs.unstructured.io . You can also learn
more about our other products on the documentation page, including our SaaS API.

### Concepts Guide
Here are a few pages from the [Open Source documentation page](https://docs.unstructured.io/open-source/introduction/overview)
that are helpful for new users to review:

The `unstructured` library includes core functionality for partitioning, chunking, cleaning, and
staging raw documents for NLP tasks.
You can see a complete list of available functions and how to use them from the [Core Functionality documentation](https://unstructured-io.github.io/unstructured/core.html).
- [Quick Start](https://docs.unstructured.io/open-source/introduction/quick-start)
- [Using the `unstructured` open source package](https://docs.unstructured.io/open-source/core-functionality/overview)
- [Connectors](https://docs.unstructured.io/open-source/ingest/overview)
- [Concepts](https://docs.unstructured.io/open-source/concepts/document-elements)
- [Integrations](https://docs.unstructured.io/open-source/integrations)

In general, these functions fall into several categories:
- *Partitioning* functions break raw documents into standard, structured elements.
- *Cleaning* functions remove unwanted text from documents, such as boilerplate and sentence fragments.
- *Staging* functions format data for downstream tasks, such as ML inference and data labeling.
- *Chunking* functions split documents into smaller sections for use in RAG apps and similarity
  search.
- *Embedding* encoder classes provide an interfaces for easily converting preprocessed text to
  vectors.

The **Connectors** 🔗 in `unstructured` serve as vital links between the pre-processing pipeline and various data storage platforms. They allow for the batch processing of documents across various sources, including cloud services, repositories, and local directories. Each connector is tailored to a specific platform, such as Azure, Google Drive, or Github, and comes with unique commands and dependencies. To see the list of Connectors available in `unstructured` library, please check out the [Connectors GitHub folder](https://github.com/Unstructured-IO/unstructured/tree/main/unstructured/ingest/connector) and [documentation](https://unstructured-io.github.io/unstructured/ingest/index.html)

### PDF Document Parsing Example
The following examples show how to get started with the `unstructured` library. You can parse over a dozen document types with one line of code! Use this [Colab notebook](https://colab.research.google.com/drive/1U8VCjY2-x8c6y5TYMbSFtQGlQVFHCVIW) to run the example below.

The easiest way to parse a document in unstructured is to use the `partition` function. If you use `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional parameters via `pip install unstructured[local-inference]`. Ensure you first install `libmagic` using the instructions outlined [here](https://unstructured-io.github.io/unstructured/installing.html#filetype-detection) `partition` will always apply the default arguments. If you need advanced features, use a document-specific partitioning function.
The following examples show how to get started with the `unstructured` library. The easiest way to parse a document in unstructured is to use the `partition` function. If you use `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. If you are using the `partition` function, you may need to install additional dependencies per doc type.
For example, to install docx dependencies you need to run `pip install "unstructured[docx]"`.
See our  [installation guide](https://docs.unstructured.io/open-source/installation/full-installation) for more details.

```python
from unstructured.partition.auto import partition
@@ -245,7 +225,7 @@ Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of
including document image classification [11,
```

See the [partitioning](https://unstructured-io.github.io/unstructured/core.html#partitioning)
See the [partitioning](https://docs.unstructured.io/open-source/core-functionality/partitioning)
section in our documentation for a full list of options and instructions on how to use
file-specific partitioning functions.

@@ -263,7 +243,7 @@ Encountered a bug? Please create a new [GitHub issue](https://github.com/Unstruc
| Section | Description |
|-|-|
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
| [Documentation](https://docs.unstructured.io/) | Full API documentation |
| [Batch Processing](unstructured/ingest/README.md) | Ingesting batches of documents through Unstructured |

## :chart_with_upwards_trend: Analytics
							
								
								
									
setup.py (1 line changed)
@@ -96,6 +96,7 @@ setup(
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
    ],
    author="Unstructured Technologies",
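As a reading aid for the `setup.py` hunk, here is a minimal sketch of how the declared Python versions could be read back out of trove classifier strings like the ones shown. The `CLASSIFIERS` list below holds only the lines visible in the hunk, not the project's full list, and `declared_python_versions` is a hypothetical helper, not part of `unstructured` or `setuptools`.

```python
import re

# Only the classifier lines visible in the hunk; the real list in setup.py
# is longer and is not reproduced here.
CLASSIFIERS = [
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]

# Matches classifiers that name a specific major.minor Python version.
_VERSION_RE = re.compile(r"^Programming Language :: Python :: (\d+)\.(\d+)$")


def declared_python_versions(classifiers):
    """Return sorted (major, minor) tuples named by Python version classifiers."""
    versions = []
    for classifier in classifiers:
        match = _VERSION_RE.match(classifier)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    # Sort numerically so that 3.10 comes after 3.9.
    return sorted(versions)


if __name__ == "__main__":
    print(declared_python_versions(CLASSIFIERS))
```

Classifiers are descriptive metadata for package indexes. They do not restrict which interpreters can install the package; that is the job of `python_requires`, which this hunk does not show.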