Databricks Delta Table and Connector
====================================

.. contents::
   :class: this-will-duplicate-information-and-it-is-still-useful-here
   :depth: 2

Objectives
----------

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
2. Process and store this data in a Databricks Delta Table.
3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

Prerequisites
-------------

- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage

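The later steps rely on an Unstructured API key and AWS credentials. As a minimal sketch, assuming these are exported as the environment variables ``UNSTRUCTURED_API_KEY``, ``AWS_ACCESS_KEY_ID``, and ``AWS_SECRET_ACCESS_KEY`` (the first name is an assumption; the helper ``missing_env_vars`` is hypothetical), a quick check could look like this:

.. code-block:: python

   import os

   # Assumed variable names; adjust them to match your own setup.
   REQUIRED_VARS = ("UNSTRUCTURED_API_KEY", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


   def missing_env_vars(names, env=None):
       """Return the names that are unset or empty in the given environment."""
       env = os.environ if env is None else env
       return [name for name in names if not env.get(name)]


   missing = missing_env_vars(REQUIRED_VARS)
   if missing:
       print("Missing environment variables:", ", ".join(missing))
   else:
       print("All required environment variables are set.")
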
Extracting PDF Using Unstructured Python SDK
--------------------------------------------

1. Install Unstructured Python SDK

.. code-block:: bash

   pip install unstructured-client

2. Code Example

.. code-block:: python

   from unstructured_client import UnstructuredClient
   from unstructured_client.models import shared
   from unstructured_client.models.errors import SDKError

   s = UnstructuredClient(
       security=shared.Security(
           api_key_auth=UNSTRUCTURED_API_KEY,  # replace with your own API key
       ),
   )

   filename = "example.pdf"  # placeholder: path to the PDF you want to process

   with open(filename, "rb") as file:
       req = shared.PartitionParameters(
           # Note that this currently only supports a single file
           files=shared.PartitionParametersFiles(
               content=file.read(),
               files=filename,
           ),
           # Other partition params
           strategy="hi_res",
           pdf_infer_table_structure=True,
           chunking_strategy="by_title",
       )

   try:
       # res.elements holds the extracted elements used in the following steps
       res = s.general.partition(req)
   except SDKError as e:
       print(e)

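Before loading the response into Spark, it can help to check what the API returned. The following is a minimal sketch, assuming each element is a dictionary with ``type``, ``text``, and ``metadata`` keys; ``sample_elements`` is hypothetical stand-in data used in place of ``res.elements``, and ``summarize_elements`` is an illustrative helper, not part of the SDK.

.. code-block:: python

   from collections import Counter

   # Hypothetical stand-in for res.elements; with chunking enabled the API
   # typically returns CompositeElement and Table entries.
   sample_elements = [
       {"type": "CompositeElement", "text": "Quarterly Report", "metadata": {"page_number": 1}},
       {"type": "CompositeElement", "text": "Revenue grew in Q2.", "metadata": {"page_number": 1}},
       {"type": "Table", "text": "Region Revenue", "metadata": {"page_number": 2}},
       {"type": "CompositeElement", "text": "Costs were flat.", "metadata": {"page_number": 2}},
   ]


   def summarize_elements(elements):
       """Count elements by their ``type`` field."""
       return Counter(element.get("type", "Unknown") for element in elements)


   summary = summarize_elements(sample_elements)
   for element_type, count in summary.most_common():
       print(f"{element_type}: {count}")
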
Processing and Storing into Databricks Delta Table
--------------------------------------------------

3. Initialize PySpark

.. code-block:: python

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName('sparkdf').getOrCreate()

4. Convert JSON output into DataFrame

.. code-block:: python

   # res is the response returned by the SDK's partition request;
   # res.elements is a list of dictionaries, one per extracted element.
   dataframe = spark.createDataFrame(res.elements)

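Schema inference in ``spark.createDataFrame`` can be awkward when rows contain nested ``metadata`` dictionaries whose keys differ from element to element. One possible workaround, sketched here rather than taken from the Unstructured documentation, is to flatten each element into a single-level dictionary first. The helper ``flatten_element`` and the sample row are hypothetical.

.. code-block:: python

   def flatten_element(element, sep="_"):
       """Return a copy of ``element`` with nested dictionaries flattened to one level."""
       flat = {}

       def _walk(prefix, value):
           if isinstance(value, dict):
               for key, inner in value.items():
                   _walk(f"{prefix}{sep}{key}" if prefix else key, inner)
           else:
               flat[prefix] = value

       _walk("", element)
       return flat


   # Hypothetical element shaped like one entry of res.elements.
   sample = {
       "type": "Table",
       "text": "Region Revenue",
       "metadata": {"page_number": 2, "filename": "report.pdf"},
   }
   flat_row = flatten_element(sample)
   print(flat_row)

The flattened rows could then be passed to Spark in place of the raw elements, for example ``spark.createDataFrame([flatten_element(e) for e in res.elements])``.
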
5. Store DataFrame as Delta Table

.. code-block:: python

   dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")

Extracting Delta Table Using Unstructured Connector
---------------------------------------------------

6. Install Unstructured Connector Dependency

.. code-block:: bash

   pip install "unstructured[delta-table]"

7. Command Line Execution

.. code-block:: bash

   unstructured-ingest \
     delta-table \
     --table-uri <<REPLACE WITH S3 URI>> \
     --output-dir delta-table-example \
     --storage_options "AWS_REGION=us-east-2, \
     AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID, \
     AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
     --verbose

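Once the command finishes, the results can be collected from the output directory. The sketch below assumes the connector writes JSON files under ``delta-table-example`` and that each file contains a list of element dictionaries; the helper ``load_ingest_output`` is hypothetical and not part of the Unstructured library.

.. code-block:: python

   import json
   from pathlib import Path


   def load_ingest_output(output_dir):
       """Collect elements from every JSON file found under ``output_dir``."""
       root = Path(output_dir)
       if not root.is_dir():
           return []

       elements = []
       for path in sorted(root.rglob("*.json")):
           with path.open(encoding="utf-8") as handle:
               data = json.load(handle)
           # Assumption: each file holds a JSON list of element dictionaries.
           if isinstance(data, list):
               elements.extend(data)
       return elements


   elements = load_ingest_output("delta-table-example")
   print(f"Loaded {len(elements)} elements")
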
Conclusion
----------

This guide covers the essential steps for converting an unstructured PDF into structured data, storing it in a Databricks Delta Table, and retrieving it again with the Unstructured Delta Table Connector for further use.