unstructured/docs/source/examples/databricks.rst
Ronny H d80abf0714
Reorganized the Examples section in Documentation & add Databricks example (#1855)
To test:
> cd docs && make html

Change logs:
* Examples are reorganized to have their own page
* Removed two old examples, i.e. "file-utils" & "sentiment analysis".
* Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" &
"Multi-Files Processing with S3 Connector and API"
* Reorganized and added detailed API documentation: (i) usage, (ii)
SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and
validation errors
2023-11-30 01:24:43 +00:00


Databricks Delta Table and Connector
====================================

.. contents::
   :class: this-will-duplicate-information-and-it-is-still-useful-here
   :depth: 2

Objectives
----------

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
2. Process and store this data in a Databricks Delta Table.
3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

Prerequisites
-------------

- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage

Extracting PDF Using Unstructured Python SDK
--------------------------------------------

1. Install Unstructured Python SDK

.. code-block:: bash

    pip install unstructured-client

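
2. Code Example

.. code-block:: python

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared
    from unstructured_client.models.errors import SDKError

    UNSTRUCTURED_API_KEY = "YOUR_API_KEY"  # replace with your own API key
    filename = "sample.pdf"  # placeholder: path to the PDF file to process

    s = UnstructuredClient(
        security=shared.Security(
            api_key_auth=UNSTRUCTURED_API_KEY,
        ),
    )

    # Read the PDF and build the partition request
    with open(filename, "rb") as file:
        req = shared.PartitionParameters(
            # Note that this currently only supports a single file
            files=shared.PartitionParametersFiles(
                content=file.read(),
                files=filename,
            ),
            # Other partition params
            strategy="hi_res",
            pdf_infer_table_structure=True,
            chunking_strategy="by_title",
        )

    # Send the request; res.elements holds the extracted elements
    try:
        res = s.general.partition(req)
    except SDKError as e:
        print(e)
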
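
Before loading the extracted elements into Spark, it can help to check what the API returned. The following is a minimal sketch, assuming each element is a dictionary with a ``type`` key, as in the JSON returned by the Unstructured API. The helper name ``count_element_types`` and the sample data are hypothetical and only illustrate the idea.

.. code-block:: python

    from collections import Counter


    def count_element_types(elements):
        """Return how many elements of each type were extracted."""
        return Counter(element.get("type", "unknown") for element in elements)


    # Hypothetical sample standing in for res.elements
    sample_elements = [
        {"type": "Title", "text": "Introduction"},
        {"type": "NarrativeText", "text": "First paragraph."},
        {"type": "NarrativeText", "text": "Second paragraph."},
        {"type": "Table", "text": "a b c"},
    ]

    type_counts = count_element_types(sample_elements)
    for element_type, count in type_counts.most_common():
        print(f"{element_type}: {count}")

With a real response, pass ``res.elements`` instead of the sample list.
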

Processing and Storing into Databricks Delta Table
--------------------------------------------------

3. Initialize PySpark

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('sparkdf').getOrCreate()

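
4. Convert JSON output into DataFrame

.. code-block:: python

    import pyspark

    # res is the response returned by the partition request above;
    # res.elements is its list of extracted elements
    dataframe = spark.createDataFrame(res.elements)
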
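
Element metadata is nested and its keys can differ from one element to the next, which may make schema inference unreliable. One possible workaround, sketched here as an assumption rather than as part of the original example, is to flatten each element into a row of plain values and serialize the metadata to a JSON string first. The helper name ``elements_to_rows``, the chosen columns, and the sample data are hypothetical.

.. code-block:: python

    import json


    def elements_to_rows(elements):
        """Flatten element dictionaries into rows with simple column types."""
        rows = []
        for element in elements:
            rows.append(
                {
                    "element_id": element.get("element_id"),
                    "type": element.get("type"),
                    "text": element.get("text"),
                    # Serialize nested metadata so the column has one stable type
                    "metadata": json.dumps(element.get("metadata", {}), sort_keys=True),
                }
            )
        return rows


    # Hypothetical sample standing in for res.elements
    sample_elements = [
        {
            "element_id": "a1",
            "type": "Title",
            "text": "Intro",
            "metadata": {"page_number": 1},
        },
        {
            "element_id": "b2",
            "type": "NarrativeText",
            "text": "Body text.",
            "metadata": {"page_number": 1, "filename": "sample.pdf"},
        },
    ]

    rows = elements_to_rows(sample_elements)
    print(rows[0]["metadata"])

The resulting rows could then be passed to ``spark.createDataFrame(rows)`` in place of the raw element list.
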

5. Store DataFrame as Delta Table

.. code-block:: python

    dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")


Extracting Delta Table Using Unstructured Connector
---------------------------------------------------

6. Install Unstructured Connector Dependency

.. code-block:: bash

    pip install "unstructured[delta-table]"


7. Command Line Execution

.. code-block:: bash

    unstructured-ingest \
        delta-table \
        --table-uri <<REPLACE WITH S3 URI>> \
        --output-dir delta-table-example \
        --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
        --verbose

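
Once the command finishes, the results can be read back from the output directory. The sketch below assumes the connector writes JSON files under ``delta-table-example`` and that each file holds a list of element dictionaries. The helper name ``load_ingest_output`` is hypothetical, and the layout should be checked against the actual output.

.. code-block:: python

    import json
    from pathlib import Path


    def load_ingest_output(output_dir):
        """Collect elements from every JSON file found under output_dir."""
        elements = []
        for path in sorted(Path(output_dir).rglob("*.json")):
            with open(path, encoding="utf-8") as f:
                data = json.load(f)
            # Assumption: each output file holds a JSON list of element dictionaries
            if isinstance(data, list):
                elements.extend(data)
        return elements


    # Only read the directory if the ingest command has already produced it
    if Path("delta-table-example").is_dir():
        loaded = load_ingest_output("delta-table-example")
        print(f"Loaded {len(loaded)} elements")

Files that do not contain a JSON list are skipped rather than raising an error, which keeps the sketch tolerant of other files in the directory.
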

Conclusion
----------

This example covers the essential steps for converting an unstructured PDF into structured elements with the Unstructured Python SDK and storing them in a Databricks Delta Table. It also shows how to read that data back out of the Delta Table with the Unstructured Delta Table Connector for further use.