fix: occasional SIGABRT with deltalake writer on Linux (#1567)

- resolves an issue where occasionally deltalake writer results in
SIGABRT event though the writer finished writing table properly on linux
- this is first observed in ingest test
- Putting the writer into a process mitigates this problem by forcing
python to finish the deltalake rust backend to finish its tasks

## test

To test this it is best to setup an instance on a Linux system since the
problem has only been observed on Linux so far. Run

```bash
PYTHONPATH=. ./unstructured/ingest/main.py delta-table --num-processes 2 --metadata-exclude coordinates,filename,file_directory,metadata.data_source.date_processed,metadata.last_modified,metadata.date_created,metadata.detection_class_prob,metadata.parent_id,metadata.category_depth --table-uri ../tables/delta/ --preserve-downloads --verbose delta-table --write-column json_data --mode overwrite --table-uri file:///tmp/delta
```

Without this fix occasionally we'd encounter `SIGABTR`.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
This commit is contained in:
Yao You 2023-09-28 21:41:18 -05:00 committed by GitHub
parent 4e84e32ed0
commit cd8c6a2e09
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 16 additions and 5 deletions

View File

@ -30,6 +30,7 @@ allowing the document to be loaded. Fix: Change parent class for Formula to Text
* **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings. * **Fixes Sphinx errors.** Fixes errors when running Sphinx `make html` and installs library to suppress warnings.
* **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields. * **Fixes a metadata backwards compatibility error** Problem: When calling `partition_via_api`, the hosted api may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: Crucial to avoid breaking changes when adding fields.
* **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel. * **Fixes issue with Discord connector when a channel returns `None`** Problem: Getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: property `jump_url` is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
* **Fixes occasionally SIGABTR when writing table with `deltalake` on Linux** Problem: occasionally on Linux ingest can throw a `SIGABTR` when writing `deltalake` table even though the table was written correctly. Fix: put the writing function into a `Process` to ensure its execution to the fullest extent before returning to the main process. Importance: Improves stability of connectors using `deltalake`
## 0.10.16 ## 0.10.16

View File

@ -56,7 +56,7 @@ trap print_last_run EXIT
for script in "${scripts[@]}"; do for script in "${scripts[@]}"; do
CURRENT_SCRIPT=$script CURRENT_SCRIPT=$script
if [[ "$CURRENT_SCRIPT" == "test-ingest-notion.sh" ]] || [[ "$CURRENT_SCRIPT" == "test-ingest-delta-table.sh" ]]; then if [[ "$CURRENT_SCRIPT" == "test-ingest-notion.sh" ]]; then
echo "--------- RUNNING SCRIPT $script --- IGNORING FAILURES" echo "--------- RUNNING SCRIPT $script --- IGNORING FAILURES"
set +e set +e
echo "Running ./test_unstructured_ingest/$script" echo "Running ./test_unstructured_ingest/$script"

View File

@ -3,6 +3,7 @@ import os
import typing as t import typing as t
from dataclasses import dataclass from dataclasses import dataclass
from datetime import datetime as dt from datetime import datetime as dt
from multiprocessing import Process
from pathlib import Path from pathlib import Path
import pandas as pd import pandas as pd
@ -182,8 +183,17 @@ class DeltaTableDestinationConnector(BaseDestinationConnector):
f"writing {len(json_list)} rows to destination " f"writing {len(json_list)} rows to destination "
f"table at {self.connector_config.table_uri}", f"table at {self.connector_config.table_uri}",
) )
write_deltalake( # NOTE: deltalake writer on Linux sometimes can finish but still trigger a SIGABRT and cause
table_or_uri=self.connector_config.table_uri, # ingest to fail, even though all tasks are completed normally. Putting the writer into a
data=pd.DataFrame(data={self.write_config.write_column: json_list}), # process mitigates this issue by ensuring python interpreter waits properly for deltalake's
mode=self.write_config.mode, # rust backend to finish
writer = Process(
target=write_deltalake,
kwargs={
"table_or_uri": self.connector_config.table_uri,
"data": pd.DataFrame(data={self.write_config.write_column: json_list}),
"mode": self.write_config.mode,
},
) )
writer.start()
writer.join()