Delta Table Source Connector
============================

Objectives
----------

1. Extract text and metadata from a PDF file using the Unstructured.io Python SDK.
2. Process and store this data in a Databricks Delta Table.
3. Retrieve data from the Delta Table using the Unstructured.io Delta Table Connector.

Prerequisites
-------------

- Unstructured Python SDK
- Databricks account and workspace
- AWS S3 for Delta Table storage

Extracting PDF Using Unstructured Python SDK
--------------------------------------------

1. Install Unstructured Python SDK

.. code-block:: bash

    pip install unstructured-client

2. Code Example

.. code-block:: python

    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared
    from unstructured_client.models.errors import SDKError

    UNSTRUCTURED_API_KEY = "YOUR_API_KEY"  # replace with your own API key
    filename = "example.pdf"  # placeholder: path to the PDF to partition

    s = UnstructuredClient(
        security=shared.Security(
            api_key_auth=UNSTRUCTURED_API_KEY,
        ),
    )

    with open(filename, "rb") as file:
        req = shared.PartitionParameters(
            # Note that this currently only supports a single file
            files=shared.PartitionParametersFiles(
                content=file.read(),
                files=filename,
            ),
            # Other partition params
            strategy="hi_res",
            chunking_strategy="by_title",
        )

    try:
        # res.elements holds the partitioned elements used in the next steps
        res = s.general.partition(req)
    except SDKError as e:
        print(f"Partition request failed: {e}")
        raise

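Before storing anything, it can help to check what the partition step returned. The sketch below is not part of the SDK: ``count_element_types`` is a hypothetical helper, and it assumes each element is a dictionary with a ``type`` key. The sample data and its type labels are made up for illustration.

.. code-block:: python

    from collections import Counter
    from typing import Any, Dict, Iterable


    def count_element_types(elements: Iterable[Dict[str, Any]]) -> Counter:
        """Count how many elements of each type are present."""
        # Assumption: every element dictionary carries a "type" key
        return Counter(element.get("type", "Unknown") for element in elements)


    # Hypothetical sample standing in for res.elements
    sample_elements = [
        {"type": "CompositeElement", "text": "Annual Report. Revenue grew."},
        {"type": "CompositeElement", "text": "Costs were flat."},
        {"type": "Table", "text": "Year Revenue 2023 10"},
    ]
    type_counts = count_element_types(sample_elements)
    print(type_counts.most_common())

With a real response, the same helper could be called as ``count_element_types(res.elements)``.
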
Processing and Storing into Databricks Delta Table
--------------------------------------------------

3. Initialize PySpark

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('sparkdf').getOrCreate()

4. Convert JSON output into DataFrame

.. code-block:: python

    # res.elements is the list of element dictionaries returned by the partition request
    dataframe = spark.createDataFrame(res.elements)

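Element dictionaries usually contain a nested ``metadata`` object whose keys can differ from element to element, which may make schema inference less predictable. One possible approach, sketched below, is to flatten each element into a row of plain strings first. ``flatten_elements`` is a hypothetical helper, and the keys ``element_id``, ``type``, ``text``, and ``metadata`` are assumptions about the response shape.

.. code-block:: python

    import json
    from typing import Any, Dict, List


    def flatten_elements(elements: List[Dict[str, Any]]) -> List[Dict[str, str]]:
        """Turn element dictionaries into rows with identical string columns."""
        rows = []
        for element in elements:
            rows.append(
                {
                    "element_id": str(element.get("element_id", "")),
                    "type": str(element.get("type", "")),
                    "text": str(element.get("text", "")),
                    # Serialize nested metadata into a single JSON string column
                    "metadata": json.dumps(element.get("metadata", {}), sort_keys=True),
                }
            )
        return rows


    # Hypothetical sample standing in for res.elements
    sample = [
        {"element_id": "a1", "type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
        {"element_id": "b2", "type": "NarrativeText", "text": "Body text.", "metadata": {}},
    ]
    rows = flatten_elements(sample)
    print(rows[0]["metadata"])

The flattened rows could then be passed to ``spark.createDataFrame(flatten_elements(res.elements))``, giving every row the same four string columns.
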
5. Store DataFrame as Delta Table

.. code-block:: python

    dataframe.write.mode("overwrite").format("delta").saveAsTable("delta_table")

Extracting Delta Table Using Unstructured Connector
---------------------------------------------------

6. Install Unstructured Connector Dependency

.. code-block:: bash

    pip install "unstructured[delta-table]"

7. Command Line Execution

.. code-block:: bash

    unstructured-ingest \
      delta-table \
      --table-uri <<REPLACE WITH S3 URI>> \
      --output-dir delta-table-example \
      --storage_options "AWS_REGION=us-east-2,AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID,AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY" \
      --verbose

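After the command finishes, the results land in the directory passed to ``--output-dir``. The sketch below shows one way to load them back into Python. ``load_ingest_output`` is a hypothetical helper, and it assumes the connector writes ``.json`` files that each contain a list of element dictionaries; adjust it if your output looks different.

.. code-block:: python

    import json
    from pathlib import Path
    from typing import Any, Dict, List


    def load_ingest_output(output_dir: str) -> List[Dict[str, Any]]:
        """Collect elements from every JSON file found under output_dir."""
        root = Path(output_dir)
        if not root.is_dir():
            return []
        elements: List[Dict[str, Any]] = []
        for json_path in sorted(root.rglob("*.json")):
            with json_path.open(encoding="utf-8") as handle:
                data = json.load(handle)
            # Assumption: each file holds a JSON list of element dictionaries
            if isinstance(data, list):
                elements.extend(data)
        return elements


    if __name__ == "__main__":
        elements = load_ingest_output("delta-table-example")
        print(f"Loaded {len(elements)} elements")

Files that do not contain a JSON list are skipped rather than raising, so the helper stays usable if the output directory holds other JSON artifacts.
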
Conclusion
----------

This guide covers the essential steps for converting unstructured PDF data into structured elements and storing them in a Databricks Delta Table. It also shows how to read that table back with the Unstructured Delta Table connector for further use.