unstructured/docs/source/examples/databricks.rst

Databricks Delta Table and Connector
====================================

.. contents::
   :class: this-will-duplicate-information-and-it-is-still-useful-here
   :depth: 2

Objectives
----------

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
2. Process and store this data in a Databricks Delta Table.
3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

Prerequisites
-------------

- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage


Extracting PDF Using Unstructured Python SDK
--------------------------------------------

1. Install Unstructured Python SDK

.. code-block:: bash

   pip install unstructuredio-sdk

2. Code Example

.. code-block:: python

   from unstructured_client import UnstructuredClient
   from unstructured_client.models import shared
   from unstructured_client.models.errors import SDKError

   s = UnstructuredClient(
       security=shared.Security(
           api_key_auth=UNSTRUCTURED_API_KEY, # replace with your own API key
       ),
   )

   req = shared.PartitionParameters(
       # Note that this currently only supports a single file
       files=shared.PartitionParametersFiles(
           content=file.read(),
           files=filename,
       ),
       # Other partition params
       strategy="hi_res",
       pdf_infer_table_structure=True,
       chunking_strategy="by_title",
   )

Processing and Storing into Databricks Delta Table
--------------------------------------------------

3. Initialize PySpark

.. code-block:: python

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName('sparkdf').getOrCreate()

4. Convert JSON output into Dataframe

.. code-block:: python

   import pyspark

   dataframe = spark.createDataFrame(res.elements)

5. Store DataFrame as Delta Table

.. code-block:: python

   dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")


Extracting Delta Table Using Unstructured Connector
---------------------------------------------------

6. Install Unstructured Connector Dependency

.. code-block:: bash

   pip install "unstructured[delta-table]"

7. Command Line Execution

.. code-block:: bash

   unstructured-ingest \
       delta-table \
       --table-uri <<REPLACE WITH S3 URI>> \
       --output-dir delta-table-example \
       --storage_options "AWS_REGION=us-east-2, \
                          AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID, \
                          AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
       --verbose


Conclusion
------------

This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.
Reorganized the Examples section in Documentation & add Databricks example (#1855) To test: > cd docs && make html Change logs: * Examples are reorganized to have its own page * Removed two old examples, ie. "file-utils" & "sentiment analysis". * Added two examples: "RAG with Unstructured, LangChain, and ChromaDB" & "Multi-Files Processing with S3 Connector and API" * Reorganized and added detailed API documentation: (i) usage, (ii) SDKs, (iii) Azure Marketplace, (iv) AWS Marketplace, (v) parameters and validation errors 2023-11-29 17:24:43 -08:00			`Databricks Delta Table and Connector`
			`====================================`

			`.. contents::`
			`:class: this-will-duplicate-information-and-it-is-still-useful-here`
			`:depth: 2`

			`Objectives`
			`----------`

			`1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.`
			`2. Process and store this data in a Databricks Delta Table.`
			`3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.`

			`Prerequisites`
			`-------------`

			`- Unstructured Python SDK`
			`- Databricks account and workspace`
			`- AWS S3 for Delta Table storage`


			`Extracting PDF Using Unstructured Python SDK`
			`--------------------------------------------`

			`1. Install Unstructured Python SDK`

			`.. code-block:: bash`

			`pip install unstructuredio-sdk`

			`2. Code Example`

			`.. code-block:: python`

			`from unstructured_client import UnstructuredClient`
			`from unstructured_client.models import shared`
			`from unstructured_client.models.errors import SDKError`

			`s = UnstructuredClient(`
			`security=shared.Security(`
			`api_key_auth=UNSTRUCTURED_API_KEY, # replace with your own API key`
			`),`
			`)`

			`req = shared.PartitionParameters(`
			`# Note that this currently only supports a single file`
			`files=shared.PartitionParametersFiles(`
			`content=file.read(),`
			`files=filename,`
			`),`
			`# Other partition params`
			`strategy="hi_res",`
			`pdf_infer_table_structure=True,`
			`chunking_strategy="by_title",`
			`)`

			`Processing and Storing into Databricks Delta Table`
			`--------------------------------------------------`

			`3. Initialize PySpark`

			`.. code-block:: python`

			`from pyspark.sql import SparkSession`

			`spark = SparkSession.builder.appName('sparkdf').getOrCreate()`

			`4. Convert JSON output into Dataframe`

			`.. code-block:: python`

			`import pyspark`

			`dataframe = spark.createDataFrame(res.elements)`

			`5. Store DataFrame as Delta Table`

			`.. code-block:: python`

			`dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")`


			`Extracting Delta Table Using Unstructured Connector`
			`---------------------------------------------------`

			`6. Install Unstructured Connector Dependency`

			`.. code-block:: bash`

			`pip install "unstructured[delta-table]"`

			`7. Command Line Execution`

			`.. code-block:: bash`

			`unstructured-ingest \`
			`delta-table \`
			`--table-uri <<REPLACE WITH S3 URI>> \`
			`--output-dir delta-table-example \`
			`--storage_options "AWS_REGION=us-east-2, \`
			`AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID, \`
			`AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \`
			`--verbose`


			`Conclusion`
			`------------`

			`This documentation covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also outlines how to extract this data for further use.`