## Loading Data into Weaviate with `unstructured`

This notebook walks through a basic workflow for uploading document elements into Weaviate using the `unstructured` library. Before running it, install the dependencies with `pip install -r requirements.txt` and start the Weaviate Docker container with `docker-compose up`.

In [1]:
import json

import tqdm
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate
import weaviate
from weaviate.util import generate_uuid5

The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function.

In [2]:
filename = "../../example-docs/layout-parser-paper-fast.pdf"
elements = partition_pdf(filename=filename, strategy="fast")

Next, we'll create a schema for our Weaviate database using the `create_unstructured_weaviate_class` helper function from the `unstructured` library. The helper generates a schema that includes all of the fields in the `ElementMetadata` object from `unstructured`, such as the filename and the page number of the document element. After specifying the schema, we connect to the database with the Weaviate client library and create the schema. You can change the name of the class by updating the `unstructured_class_name` variable. Note that `client.schema.delete_all()` removes every existing class from the Weaviate instance, so run it only against a database you can safely reset.

In [3]:
unstructured_class_name = "UnstructuredDocument"

In [4]:
unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)
schema = {"classes": [unstructured_class]} 

In [5]:
client = weaviate.Client("http://localhost:8080")

In [6]:
client.schema.delete_all()
client.schema.create(schema)

Next, we stage the elements with the `stage_for_weaviate` function, which returns a list of dictionaries that conform to the schema we created earlier. Once the data is staged, we use the Weaviate client library to batch upload it.

In [7]:
data_objects = stage_for_weaviate(elements)
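
Before uploading, it can be useful to confirm that each staged object only uses properties the class defines. The sketch below is a minimal, standard-library-only check; `unknown_keys`, `example_class`, and the sample objects are hypothetical names and data made up for illustration, and the only assumption about the schema is that a class is a dictionary with a `properties` list whose entries have a `name`.

```python
def unknown_keys(data_object: dict, weaviate_class: dict) -> set:
    """Return the keys of data_object that the class does not define."""
    # Collect the property names declared in the class definition.
    allowed = {prop["name"] for prop in weaviate_class.get("properties", [])}
    return set(data_object) - allowed


# Hypothetical, hand-written class definition (not the real helper output).
example_class = {
    "class": "UnstructuredDocument",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "category", "dataType": ["text"]},
        {"name": "filename", "dataType": ["text"]},
    ],
}

# Hypothetical staged objects: one valid, one with an undeclared key.
good_object = {"text": "Hello", "filename": "a.pdf"}
bad_object = {"text": "Hello", "author": "unknown"}

print(unknown_keys(good_object, example_class))
print(unknown_keys(bad_object, example_class))
```

In the notebook, the same check could be run as `unknown_keys(data_object, unstructured_class)` for each entry in `data_objects`.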

In [8]:
with client.batch(batch_size=10) as batch:
    for data_object in tqdm.tqdm(data_objects):
        batch.add_data_object(
            data_object,
            unstructured_class_name,
            uuid=generate_uuid5(data_object),
        )

100%|██████████| 28/28 [00:00<00:00, 69.56it/s]
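
The upload passes `uuid=generate_uuid5(data_object)`, which derives each object's ID from its content instead of using a random ID. The sketch below illustrates that idea with the standard library only. It is not Weaviate's implementation: `content_uuid` is a hypothetical helper, and the JSON canonicalization is an assumption chosen for this example.

```python
import json
import uuid


def content_uuid(data_object: dict) -> str:
    """Derive a deterministic UUID from the content of a dictionary."""
    # Sort keys so that key order does not change the resulting ID.
    canonical = json.dumps(data_object, sort_keys=True)
    # uuid5 hashes the name within a namespace, so equal input gives equal output.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


# Hypothetical sample objects for illustration.
first = content_uuid({"text": "LayoutParser", "filename": "paper.pdf"})
second = content_uuid({"filename": "paper.pdf", "text": "LayoutParser"})
other = content_uuid({"text": "Something else", "filename": "paper.pdf"})

print(first == second)  # same content, different key order
print(first == other)   # different content
```

Because identical content always maps to the same ID, re-running the upload cell on the same document should target the same object IDs rather than minting new ones on every run.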


Now that the documents are in Weaviate, we're able to run queries against Weaviate!

In [9]:
response = (
    client.query
    .get("UnstructuredDocument", ["text", "_additional {score}"])
    .with_bm25(
        query="document understanding"
    )
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

{
    "data": {
        "Get": {
            "UnstructuredDocument": [
                {
                    "_additional": {
                        "score": "0.23643185"
                    },
                    "text": "Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classi\ufb01cation [11,"
                },
                {
                    "_additional": {
                        "score": "0.22914983"
                    },
                    "text": "LayoutParser: A Uni\ufb01ed Toolkit for Deep Learning Based Document Image Analysis"
                }
            ]
        }
    }
}
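
The response is a nested dictionary, and the BM25 score comes back as a string. The sketch below shows one way to flatten it into `(score, text)` pairs. `extract_hits` is a hypothetical helper, and `example_response` is a hand-copied, abbreviated version of the output above, so the snippet runs without a Weaviate instance.

```python
def extract_hits(response: dict, class_name: str) -> list:
    """Flatten a Get response into a list of (score, text) tuples."""
    # Fall back to an empty list if any level of the response is missing.
    hits = response.get("data", {}).get("Get", {}).get(class_name) or []
    # Scores arrive as strings, so convert them to floats.
    return [(float(hit["_additional"]["score"]), hit["text"]) for hit in hits]


# Abbreviated copy of the query output shown above.
example_response = {
    "data": {
        "Get": {
            "UnstructuredDocument": [
                {
                    "_additional": {"score": "0.23643185"},
                    "text": "Deep Learning(DL)-based approaches are the state-of-the-art",
                },
                {
                    "_additional": {"score": "0.22914983"},
                    "text": "LayoutParser: A Uni\ufb01ed Toolkit for Deep Learning Based Document Image Analysis",
                },
            ]
        }
    }
}

results = extract_hits(example_response, "UnstructuredDocument")
for score, text in results:
    print(f"{score:.4f}  {text}")
```

In the notebook, the same helper could be applied as `extract_hits(response, unstructured_class_name)`.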
