Delta Table Source Connector
============================
Objectives
----------
1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
2. Process and store this data in a Databricks Delta Table.
3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.
Prerequisites
-------------
- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage
Extracting PDF Using Unstructured Python SDK
--------------------------------------------
1. Install Unstructured Python SDK

.. code-block:: bash

   pip install unstructured-client
2. Code Example

.. code-block:: python

   from unstructured_client import UnstructuredClient
   from unstructured_client.models import shared
   from unstructured_client.models.errors import SDKError

   s = UnstructuredClient(
       security=shared.Security(
           api_key_auth=UNSTRUCTURED_API_KEY,  # replace with your own API key
       ),
   )

   filename = "example.pdf"  # placeholder: path to the PDF to partition

   with open(filename, "rb") as file:
       req = shared.PartitionParameters(
           # Note that this currently only supports a single file
           files=shared.PartitionParametersFiles(
               content=file.read(),
               files=filename,
           ),
           # Other partition params
           strategy="hi_res",
           chunking_strategy="by_title",
       )

   try:
       # res.elements holds the partitioned elements used in the next steps
       res = s.general.partition(req)
   except SDKError as e:
       print(e)
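Before loading the result into Spark, it can help to check what the partition call returned. The following is a minimal sketch, assuming each element is a dictionary with ``type`` and ``text`` keys. The ``summarize_elements`` helper and the ``sample_elements`` list are hypothetical stand-ins for ``res.elements`` and are not part of the SDK.

.. code-block:: python

   from collections import Counter


   def summarize_elements(elements):
       """Count elements per type; elements without a "type" key count as "Unknown"."""
       return Counter(el.get("type", "Unknown") for el in elements)


   # Hypothetical sample standing in for res.elements
   sample_elements = [
       {"type": "Title", "text": "Quarterly Report"},
       {"type": "NarrativeText", "text": "Revenue grew in Q3."},
       {"type": "NarrativeText", "text": "Costs were flat."},
   ]

   summary = summarize_elements(sample_elements)
   for element_type, count in summary.most_common():
       print(f"{element_type}: {count}")

An unexpectedly low count for a given type may be a hint to revisit the ``strategy`` or ``chunking_strategy`` parameters before storing the data.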
Processing and Storing into Databricks Delta Table
--------------------------------------------------
3. Initialize PySpark

.. code-block:: python

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName('sparkdf').getOrCreate()

4. Convert JSON output into DataFrame

.. code-block:: python

   # res is the response returned by the partition call
   dataframe = spark.createDataFrame(res.elements)

5. Store DataFrame as Delta Table

.. code-block:: python

   dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")
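Spark infers the schema in step 4 from the element dictionaries, and a nested ``metadata`` dictionary can make that inference fragile when elements carry different metadata keys. One possible workaround, sketched below under the assumption that each element is a dictionary with ``element_id``, ``type``, ``text``, and ``metadata`` keys, is to serialize ``metadata`` to a JSON string first. The ``flatten_element`` helper and the sample data are hypothetical.

.. code-block:: python

   import json


   def flatten_element(element):
       """Return a flat row: scalar fields kept, metadata stored as a JSON string."""
       return {
           "element_id": element.get("element_id"),
           "type": element.get("type"),
           "text": element.get("text"),
           # sort_keys keeps the serialized form stable across runs
           "metadata": json.dumps(element.get("metadata", {}), sort_keys=True),
       }


   # Hypothetical sample standing in for res.elements
   sample_elements = [
       {
           "element_id": "abc123",
           "type": "Title",
           "text": "Quarterly Report",
           "metadata": {"page_number": 1, "filename": "report.pdf"},
       },
   ]

   rows = [flatten_element(el) for el in sample_elements]
   print(rows[0]["metadata"])

The resulting ``rows`` list could then be passed to ``spark.createDataFrame`` in place of ``res.elements``, giving every row the same four string columns.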
Extracting Delta Table Using Unstructured Connector
---------------------------------------------------
6. Install Unstructured Connector Dependency

.. code-block:: bash

   pip install "unstructured[delta-table]"

7. Command Line Execution

.. code-block:: bash

   unstructured-ingest \
     delta-table \
     --table-uri <<REPLACE WITH S3 URI>> \
     --output-dir delta-table-example \
     --storage_options "AWS_REGION=us-east-2, \
       AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID, \
       AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
     --verbose
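The same command can also be assembled from Python, which keeps the ``--storage_options`` value on a single line. This is a minimal sketch: ``build_storage_options`` and ``build_ingest_command`` are hypothetical helpers, the table URI and credentials are placeholders, and it assumes the connector accepts the options without spaces after the commas. Only the flags shown in the command above are used.

.. code-block:: python

   import os
   import shlex


   def build_storage_options(region, env=os.environ):
       """Join the AWS settings into a comma-separated KEY=VALUE string."""
       options = {
           "AWS_REGION": region,
           "AWS_ACCESS_KEY_ID": env.get("AWS_ACCESS_KEY_ID", ""),
           "AWS_SECRET_ACCESS_KEY": env.get("AWS_SECRET_ACCESS_KEY", ""),
       }
       return ",".join(f"{key}={value}" for key, value in options.items())


   def build_ingest_command(table_uri, output_dir, storage_options):
       """Return the argument list for the unstructured-ingest delta-table command."""
       return [
           "unstructured-ingest",
           "delta-table",
           "--table-uri", table_uri,
           "--output-dir", output_dir,
           "--storage_options", storage_options,
           "--verbose",
       ]


   # Placeholder credentials and URI, for illustration only
   fake_env = {"AWS_ACCESS_KEY_ID": "KEY", "AWS_SECRET_ACCESS_KEY": "SECRET"}
   command = build_ingest_command(
       "s3://example-bucket/delta-table/",
       "delta-table-example",
       build_storage_options("us-east-2", env=fake_env),
   )
   print(shlex.join(command))

Once the connector is installed, the list could be passed to ``subprocess.run(command, check=True)``; passing arguments as a list avoids a second round of shell quoting.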
Conclusion
----------
This guide covers the essential steps for converting unstructured PDF data into structured data and storing it in a Databricks Delta Table. It also shows how to read that data back from the Delta Table with the Unstructured Delta Table connector for further use.