mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2025-07-07 17:12:48 +00:00

Update Unstructured example for Weaviate, now using latest python v4 client. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
361 lines
12 KiB
Plaintext
361 lines
12 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "a3ce962e",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Loading Data into Weaviate with `unstructured`\n",
|
|
"\n",
|
|
"This notebook shows a basic workflow for uploading document elements into Weaviate using the `unstructured` library. To get started with this notebook, first install the dependencies with `pip install -r requirements.txt` and start the Weaviate docker container with `docker-compose up`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "5d9ffc17",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:54:56.713106Z",
|
|
"start_time": "2023-08-09T22:54:55.721284Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import json\n",
|
|
"\n",
|
|
"import tqdm\n",
|
|
"from unstructured.partition.pdf import partition_pdf\n",
|
|
"from unstructured.staging.weaviate import create_unstructured_weaviate_class, stage_for_weaviate\n",
|
|
"import weaviate\n",
|
|
"from weaviate.util import generate_uuid5"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "673715e9",
|
|
"metadata": {},
|
|
"source": [
|
|
"The first step is to partition the document using the `unstructured` library. In the following example, we partition a PDF with `partition_pdf`. You can also partition over a dozen document types with the `partition` function."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f9fc0cf9",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:54:58.584857Z",
|
|
"start_time": "2023-08-09T22:54:58.300351Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"filename = \"../../example-docs/layout-parser-paper-fast.pdf\"\n",
|
|
"elements = partition_pdf(filename=filename, strategy=\"fast\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "3ae76364",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next, we'll create a schema for our Weaviate database using the `create_unstructured_weaviate_class` helper function from the `unstructured` library. The helper function generates a schema that includes all of the elements in the `ElementMetadata` object from `unstructured`. This includes information such as the filename and the page number of the document element. After specifying the schema, we create a connection to the database with the Weaviate client library and create the schema. You can change the name of the class by updating the `unstructured_class_name` variable."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "91057cb1",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:54:59.298547Z",
|
|
"start_time": "2023-08-09T22:54:59.296005Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"unstructured_class_name = \"UnstructuredDocument\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "78e804bb",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:54:59.727082Z",
|
|
"start_time": "2023-08-09T22:54:59.722593Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# not used, we are creating the schema from the provided data\n",
|
|
"# unstructured_class = create_unstructured_weaviate_class(unstructured_class_name)\n",
|
|
"# schema = {\"classes\": [unstructured_class]}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "3e317a2d",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:55:01.606118Z",
|
|
"start_time": "2023-08-09T22:55:00.684623Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Connecting to Weaviate\n",
|
|
"# https://weaviate.io/developers/weaviate/starter-guides/connect\n",
|
|
"client = weaviate.connect_to_local()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "0c508784",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:55:17.579418Z",
|
|
"start_time": "2023-08-09T22:55:17.039304Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"client.collections.delete(unstructured_class_name)\n",
|
|
"collection = client.collections.create(\n",
|
|
" name=unstructured_class_name\n",
|
|
")\n",
|
|
"# we can get our collection at any time:\n",
|
|
"collection = client.collections.get(unstructured_class_name)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "024ae133",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next, we stage the elements for Weaviate using the `stage_for_weaviate` function and batch upload the results to Weaviate. `stage_for_weaviate` outputs a dictionary that conforms to the schema we created earlier. Once that data is stage, we can use the Weaviate client library to batch upload the results to Weaviate."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "a7018bb1",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:55:21.595936Z",
|
|
"start_time": "2023-08-09T22:55:21.591105Z"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"data_objects = stage_for_weaviate(elements)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "1a077829",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"{'file_directory': '../../example-docs',\n",
|
|
" 'filename': 'layout-parser-paper-fast.pdf',\n",
|
|
" 'languages': ['eng'],\n",
|
|
" 'last_modified': '2024-06-04T17:26:18',\n",
|
|
" 'page_number': 1,\n",
|
|
" 'filetype': 'application/pdf',\n",
|
|
" 'text': '1 2 0 2',\n",
|
|
" 'category': 'UncategorizedText'}"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# this one of our objects\n",
|
|
"data_objects[0]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "af712d8e",
|
|
"metadata": {
|
|
"ExecuteTime": {
|
|
"end_time": "2023-08-09T22:55:23.590915Z",
|
|
"start_time": "2023-08-09T22:55:23.036903Z"
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"100%|██████████| 25/25 [00:00<00:00, 26620.36it/s]\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"FAILED: []\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"with collection.batch.dynamic() as batch:\n",
|
|
" for data_object in tqdm.tqdm(data_objects):\n",
|
|
" batch.add_object(\n",
|
|
" properties=data_object\n",
|
|
" )\n",
|
|
" failed_objs_a = client.batch.failed_objects # check if we have failed objects\n",
|
|
" print(\"FAILED: \", failed_objs_a)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "dac10bf5",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now that the documents are in Weaviate, we're able to run queries against Weaviate!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "25d5bebc",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Object(uuid=_WeaviateUUIDInt('117e4b2d-1222-4d2e-9a40-2e761ecdafe8'), metadata=MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None), properties={'text': '2', 'languages': ['eng'], 'page_number': 2.0, 'category': 'UncategorizedText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'filename': 'layout-parser-paper-fast.pdf', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'file_directory': '../../example-docs'}, references=None, vector={}, collection='UnstructuredDocument')\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# lets just get a single object\n",
|
|
"object = collection.query.fetch_objects(limit=1).objects[0]\n",
|
|
"print(object)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "6477a112",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We leveraged Weaviate AUTO SCHEMA to generate our collection\n",
|
|
"# you can get the collection schema dict like this\n",
|
|
"# collection.config.get().to_dict()\n",
|
|
"# we can use this same dict to create the collection\n",
|
|
"# new_collection = client.collections.create_from_dict(collection.config.get().to_dict())"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "82f67c21",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"0.36298108100891113 {'text': 'Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11,', 'languages': ['eng'], 'page_number': 1.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}\n",
|
|
"0.3443584442138672 {'text': 'LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis', 'languages': ['eng'], 'page_number': 1.0, 'category': 'Title', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': None, 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"results = collection.query.bm25(\n",
|
|
" query=\"document understanding\",\n",
|
|
" limit=2,\n",
|
|
" return_metadata=weaviate.classes.query.MetadataQuery(score=True)\n",
|
|
")\n",
|
|
"for object in results.objects:\n",
|
|
" print(object.metadata.score, object.properties)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"id": "905e02ca",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"{'text': 'Deep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks including document image classification [11,', 'languages': ['eng'], 'page_number': 1.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}\n",
|
|
"{'text': 'Z. Shen et al.', 'languages': ['eng'], 'page_number': 2.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}\n",
|
|
"{'text': 'The library implements simple and intuitive Python APIs without sacrificing generalizability and versatility, and can be easily installed via pip. Its convenient functions for handling document image data can be seamlessly integrated with existing DIA pipelines. With detailed documentations and carefully curated tutorials, we hope this tool will benefit a variety of end-users, and will lead to advances in applications in both industry and academic research.', 'languages': ['eng'], 'page_number': 2.0, 'category': 'NarrativeText', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'filename': 'layout-parser-paper-fast.pdf', 'parent_id': UUID('47f9bb4b-20e0-5b9f-1ac6-bbb60cd9c2f9'), 'file_directory': '../../example-docs'}\n",
|
|
"{'text': 'Introduction', 'languages': ['eng'], 'page_number': 1.0, 'category': 'Title', 'filetype': 'application/pdf', 'last_modified': '2024-06-04T17:26:18', 'parent_id': None, 'filename': 'layout-parser-paper-fast.pdf', 'file_directory': '../../example-docs'}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# We can also perform similarity search\n",
|
|
"results = collection.query.near_text(\n",
|
|
" query=\"document understanding\",\n",
|
|
" limit=4\n",
|
|
")\n",
|
|
"for object in results.objects:\n",
|
|
" print(object.properties)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ec2993a3fa4c1bed",
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"jupyter": {
|
|
"outputs_hidden": false
|
|
}
|
|
},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"id": "20bca8f9",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"client.close()"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.12.3"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|