1533 Commits

Author SHA1 Message Date
Matt Robinson
adbf29d144
docs: remove examples directory in favor of docs page (#3391)
### Summary

Removes the examples directory in the repo in favor of the example code
page in our documentation, which is more actively maintained.

- https://docs.unstructured.io/examplecode/notebooks
2024-07-15 14:35:51 +00:00
Matt Robinson
2baa7905bc
build(deps): version bumps for 2024-07-15 (#3397)
### Summary

Version bumps for 2024-07-15.
2024-07-15 13:57:48 +00:00
Christine Straub
3e1a30d338
build(deps): bump unstructured.paddleocr 2.8.0.1 (#3388)
### Summary
- Bump unstructured.paddleocr to `2.8.0.1` which removed `lmdb`
dependency due to license issue.

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
2024-07-14 03:43:44 +00:00
Christine Straub
69cddf5f89
ci: disable sharepoint ingest test (#3393)
Disable sharepoint ingest test to unblock development. We need to
re-enable this test when the sharepoint credentials are updated.
2024-07-14 01:58:08 +00:00
Matt Robinson
ee2b247297
build: check dependency licenses in CI (#3349)
### Summary

Adds a CI check to ensure that packages added as dependencies are
appropriately licensed. All of the `.txt` files in the `requirements`
directory are checked with the exception of:

- `constraints.txt`, since those are not installed and are instead
conditions on the other dependency files
- `dev.txt`, since those are for local development and not shipped as
part of the `unstructured` package
- `extra-pdf-image.txt` - the `extra-pdf-image.in` since checking
`extra-pdf-image.txt` pulls in NVIDIA GPU related packages with an
`Other/Proprietary` license type, and there's not a good way to exclude
those without adding `Other/Proprietary` to the allowed licenses list.

### Testing

The new `check-licenses` job should pass in CI.
2024-07-11 22:36:01 +00:00
Steve Canny
3d6e30a1f7
rfctr(auto): improve expression in tests (#3384)
**Summary**
In preparation for further work on auto-partitioning, improve the
expression in the test-suite.
2024-07-11 19:57:28 +00:00
Steve Canny
c27e0d0062
rfctr(html): replace html parser (#3218)
**Summary**
Replace legacy HTML parser with recursive version that captures all
content and provides flexibility to add new metadata. It's also
substantially faster although that's just a happy side-effect.

**Additional Context**
The prior HTML parsing algorithm that makes up the core of HTML
partitioning was buggy and very difficult to reason about because it did
not conform to the inherently recursive structure of HTML. The new
version retains `lxml` as the performant and reliable base library but
uses `lxml`'s custom element classes to efficiently classify HTML
elements by their behaviors (block-item and inline (phrasing) primarily)
and give those elements the desired partitioning behaviors.

This solves a host of existing problems with content being skipped and
elements (paragraphs) being divided improperly, but also provides a
clear domain model for reasoning about its behavior and reliably
adjusting it to suit our existing and future purposes.

The parser's operation is recursive, closely modeling the recursive
structure of HTML itself. It's behaviors are based on the HTML Standard
and reliably produce proper and explainable results even for novel
cases.

Fixes #2325 
Fixes #2562
Fixes #2675
Fixes #3168
Fixes #3227
Fixes #3228 
Fixes #3230 
Fixes #3237 
Fixes #3245 
Fixes #3247 
Fixes #3255
Fixes #3309 

### BEHAVIOR DIFFERENCES

#### `emphasized_text_tags` encoding is changed:
- `<strong>` is encoded as `"b"` rather than `"strong"`.
- `<em>` is encoded as `"i"` rather than `"em"`.
- `<span>` is no longer recorded in `emphasized_text_tags` (because
without the CSS we can't tell whether it's used for emphasis or if so
what kind).
- nested emphasis (e.g. bold+italic) is encoded as multiple characters
("bi").
- `emphasized_text_contents` is broken on emphasis-change boundaries,
like:
  ```html
   `<p>foo <b>bar <i>baz</i> bada</b> bing</p>`
  ```
  produces:
  ```json
  {
    "emphasized_text_contents": ["bar", "baz", "bada"],
    "emphasized_text_tags": ["b", "bi", "b"]
  }
  ```
   whereas previously it would have produced:
  ```json
  {
    "emphasized_text_contents": ["bar baz bada", "baz"],
    "emphasized_text_tags": ["b", "i"]
  }
  ```

#### `<pre>` text is preserved as it appears in the html
Except that a leading newline is removed if present (has to be in
position 0 of text). Also, a trailing newline is stripped but only if it
appears in the very last position ([-1]) of the `<pre>` text. Old parser
stripped all leading and trailing whitespace.

Result is that:
```html
<pre>
foo
bar
baz
</pre>
```
parses to `"foo\nbar\nbaz"` which is the same result produced for:
```html
<pre>foo
bar
baz</pre>
```
This equivalence is the same behavior exhibited by a browser, which is
why we did the extra work to make it this way.

#### Whitespace normalization
Leading and trailing whitespace are removed from element text, just as
it is removed in the browser. Runs of whitespace within the element text
are reduced to a single space character (like in the browser). Note this
means that `\t`, `\n`, and `&nbsp;` are replaced with a regular space
character. All text derived from elements is whitespace normalized
except the text within a `<pre>` tag. Any leading or trailing newline is
trimmed from `<pre>` element text; all other whitespace is preserved
just as it appeared in the HTML source.

#### `link_start_indexes` metadata is no longer captured. Rationale:
- It was frequently wrong, often `-1`.
- It was deprecated but then added back in a community PR.
- Maintaining it across any possible downstream transformations (e.g.
chunking) would be expensive and almost certainly lead to wrong values
as distant code evolves.
- It is complex to compute and recompute when whitespace is normalized,
adding substantial complexity to the code and reducing readability and
maintainability

#### `<br/>` element is replaced with a single newline (`"\n"`)
but that is usually replaced with a space in `Element.text` when it is
normalized. The newline is preserved within a `<pre>` element.
  - Related: _No paragraph-break on `<br/><br/>`_

#### Empty `h1..h6` elements are dropped.
HTML heading elements (`<h1..h6>`) are "skipped" (do not generate a
`Title` element) when they contain no text or contain only whitespace.

---------

Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-07-11 00:14:28 +00:00
Roman Isecke
76cccb3a5e
feat/persist metadata for fsspec connectors (#3371)
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-07-10 22:08:05 +00:00
David Potter
6c78677ebb
feat: add Astra source connector (#3304)
Thanks to @erichare we now have an AstraDB source connector.

updating constant names to be more aligned with AstraDB
2024-07-10 20:29:22 +00:00
Steve Canny
0c562d8050
rfctr(auto): fix auto-partition test xfails and skips (#3367)
**Summary**
Improve expression in auto-partition tests and fix xfails and skips. Add
issues for the two hard-fails where xfail needed to stay.
2024-07-10 05:29:07 +00:00
ryannikolaidis
543057317f
rfctr [P6M-398]: salesforce connector v2 <- Ingest test fixtures update (#3377)
This pull request includes updated ingest test fixtures.
Please review and merge if appropriate.

Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>
2024-07-09 22:30:50 +00:00
Christine Straub
512583ed91
build(deps): bump unstructured.paddleocr 2.8.0 (#3374)
### Summary
Bump unstructured.paddleocr to `2.8.0`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
2024-07-09 22:19:08 +00:00
David Potter
db1e6993a8
rfctr [P6M-398]: salesforce connector v2 (#3344)
Updates salesforce source connector to v2.
2024-07-09 16:46:58 +00:00
Marianna
176875bf26
OD metrics for CI (#3269)
OD metrics for CI

---------

Co-authored-by: Paweł Kmiecik <pawel.kmiecik@deepsense.ai>
Co-authored-by: Michał Martyniak <64484917+micmarty-deepsense@users.noreply.github.com>
2024-07-09 12:41:20 +00:00
Ahmet Melek
3f96a5ae5c
rfctr: Implement Azure Cognitive Search V2 Destination Connector (#3311)
This PR 
- adds the V2 version of Azure Cognitive Search connector
- extends the ingest test to check for chunking and embedding capability
2024-07-09 12:19:15 +00:00
Roman Isecke
b556d6d575
rfctr: Implement Sharepoint V2 Source Connector (#3314)
### Description
Migrate over the sharepoint connector to v2 and in the process refactor
the majority of the connector. It now pulls in much more content from
the SDK on index time, including permissions data is the parameters are
passed in. HTML content generated from the SitePage is isolated to the
html content in the `CanvasContent1` and `LayoutWebpartsContent`
returned by the SDK.

Some TODOs were left in there for future iterations. Currently only
document and site page content is being pulled in from sharepoint, but
sharepoint has more types of content than just that, such as lists. Note
left in there to support other sharepoint types.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: vangheem <vangheem@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
2024-07-09 09:52:59 +00:00
Steve Canny
00e1d5c05b
rfctr(html): refine HTML parser (#3351)
**Note**
This refines the new HTML parser but _does not install it_. This is why
no changes to ingest test expectations or other unit-tests are required
here. Installing the new parser will happen in the next PR #3218.

**Summary**
The initial version of the parser (purposely) raised on a block element
nested inside a phrasing element. While such nesting is not valid
according to the HTML Standard, it is accepted by the browser and does
happen in the wild.

The refinements here handle this situation similarly to how the browser
does, breaking phrasing at the block element boundaries and starting it
up again after the block element.

Unfortunately this adds complexity to the parser, but it makes the
parser robust against pretty much any HTML we're likely to encounter and
partitions it consistent with how it would be rendered in the browser.
2024-07-09 01:10:03 +00:00
Matt Robinson
7b25dfc337
fix(CVE-2024-39705): remove nltk download (#3361)
### Summary

Addresses
[CVE-2024-39705](https://nvd.nist.gov/vuln/detail/CVE-2024-39705), which
highlights the risk of remote code execution when running
`nltk.download` . Removes `nltk.download` in favor of a `.tgz` file with
the appropriate NLTK data files and checking the SHA256 hash to validate
the download. An error now raises if `nltk.download` is invoked.

The logic for determining the NLTK download directory is borrowed from
`nltk`, so users can still set `NLTK_DATA` as they did previously.

### Testing

1. Create a directory called `~/tmp/nltk_test`. Set
`NLTK_DATA=${HOME}/tmp/nltk_test`.
2. From a python interactive session, run:
```python
from unstructured.nlp.tokenize import download_nltk_packages

download_nltk_packages()
```
3. Run `ls /tmp/nltk_test/nltk_data`. You should see the downloaded
data.

---------

Co-authored-by: Steve Canny <stcanny@gmail.com>
0.14.10
2024-07-08 22:55:36 +00:00
Steve Canny
d48fa3b163
rfctr(auto): improve typing and organize auto tests (#3355)
**Summary**
In preparation for further work on auto-partitioning (`partition()`),
improve typing and organize `test_auto.py` by introducing categories.
2024-07-08 21:25:17 +00:00
Pluto
609a08a95f
remove unused _with_spans metric (#3342)
The table metrics considering spans is not used and it messes with the
output thus I have cleaned the code from it. Though, I have left
table_as_cells in the source code - it still may be useful for the users
2024-07-08 16:59:53 +00:00
Pluto
caea73c8e3
Tables detection f1 (#3341)
This pull request add table detection metrics.

One case that was considered by me:

Case: Two tables are predicted and matched with one table in ground
truth
Question: Is this matching correct in both cases or just for on table

There are two subcases:
- table was predicted by OD as two sub tables (so half in two, there are
two non overlapping subtables) -> in my opinion both are correct
- it is false positive from tables matching script in
get_table_level_alignment -> 1 good, 1 wrong

As we don't have bounding boxes I followed the notebook calculation
script and assumed pessimistic, second subcase version
2024-07-08 13:29:52 +00:00
Matt Robinson
3fc2342f6d
build(deps): version bumps for 2024-07-08 (#3359)
## Summary

Version bumps for 2024-07-08.
2024-07-08 12:38:04 +00:00
Nathan Van Gheem
1ce01c3254
rfctr: Implement SQL V2 Dest Connector (#3323) 2024-07-05 16:05:45 +00:00
Nathan Van Gheem
6e4d9ccd5b
refactor: implement databricks volumes v2 dest connector (#3334) 2024-07-03 19:01:16 +00:00
Christine Straub
493bfccddd
fix: exception handling for OCRAgent.get_agent() (#3335)
The purpose of this PR is to help investigate
https://github.com/Unstructured-IO/unstructured/issues/3202.
2024-07-03 17:58:04 +00:00
Roman Isecke
d86d15c3ab
feat/custom ingest stager (#3340)
### Description
Allow used to pass in a reference to a custom defined stager via the
CLI. Checks are run on the instance passed in to be a subclass of the
UploadStager interface.
2024-07-03 16:59:51 +00:00
Roman Isecke
f1a28600d9
feat/singlestore dest connector (#3320)
### Description
Adds [SingleStore](https://www.singlestore.com/) database destination
connector with associated ingest test.
2024-07-03 15:15:39 +00:00
John
0046f58a4f
revert unstructured-client pin and make pip-compile (#3298)
Change unstructured-client pin to setting minimum version instead of max
version and `make pip-compile`.

Integration tests that were dependent on the old version of the client
are removed. These tests should be replicated in/moved to the SDK
repo(s).
2024-07-02 16:42:03 +00:00
Nathan Van Gheem
da29242dbd
rfctr: implement mongodb v2 destination connector (#3313)
This PR provides support for V2 mongodb destination connector.
2024-07-02 16:40:51 +00:00
Roman Isecke
c28deffbc4
bugfix/isolate ingest v2 dependencies (#3327)
### Description
This PR handles two things:
* Exposing all the connectors via the connector registries by simply
importing the connector module. This should be safe assuming all
connector specific dependencies themselves are imported in the methods
where they are used and wrapped in `@requires_dependencies` decorator
* Remove any import that pulls from the v2 ingest.cli package
2024-07-02 14:07:54 +00:00
Pluto
5d89b41b1a
Fix not counting false negatives and false positives in table metrics (#3300)
This pull request fixes counting tables metric for three cases:
- False Negatives: when table exist in ground truth but any of the
predicted tables doesn't match the table, the table should count as 0
and the file should not be completely skipped (before it was np.NaN).
- False Positives: When there is a predicted table that didn't match any
ground truth table it should be counted as 0, right now it is skipped in
processing (matched_indices==-1)
- The file should be completely skipped only if there is no tables in
ground truth and in prediction

In short we can say that previous metric calculation didn't consider OD
mistakes
2024-07-02 10:07:24 +00:00
Ahmet Melek
72f28d7a11
feat: add v2 pinecone destination connector (#3286)
This PR adds a V2 version of the Pinecone destination connector
2024-07-01 23:22:06 +00:00
David Potter
a18b21c06e
rfctr [P6M-397]: opensearch source connector v2 (#3302)
Updates opensearch source connector to v2. Leverages elasticsearch v2
heavily.

Expected tests renamed because thats how Elasticsearch names them.
2024-07-01 20:35:26 +00:00
Matt Robinson
db8617872b
build: image and dependency updates; fix tesseract files locations (#3310)
### Summary

Updates to the latest version of the `wolfi-base` image. Changes
include:
- Version bumps to address CVEs
- `libreoffice` is now included in the `arm64`. `.doc` files are now
supported for `arm64`. `.ppt` do not work with the `libreoffice` package
currently available on `wolfi-os`. We have follow on work to look into
that.
- Updates the location of the `tesseract` `tessdata` files on the
`arm64` build. Closes #3290.
- Closes #3319 and addes `psutil` to the base dependencies.

### Testing

- `test_dockerfile` should continue to pass with the updates.
2024-07-01 19:39:32 +00:00
David Potter
9eb4c96b94
fix: update slack test to point to new channel (#3328)
When we switched community Slack from Paid to Free we lost the CI test
bot. Also if messages delete after 90 days then our expected test data
will disappear.

- created a new bot in our paid company slack
(test_unstructured_ingest_bot)
- added a new private channel (test-ingest)
- invited the bot to the channel
- adjusted the end datetime of the test to cover the first few messages
in the channel

Still to do:
- update the CI secrets with the new bot token
- update the LastPass with the new bot token (I don't have write
access.. :(.
2024-07-01 18:11:21 +00:00
Matt Robinson
116200559b
docs: add link to serverless api in readme (#3322)
### Summary

Adds links to the serverless api. README updates look like the
following:

<img width="904" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/1635179/fcb2b0c5-0dff-4612-8f18-62836ca6de8b">
2024-07-01 07:39:12 -04:00
David Potter
15f80c4ad6
rfct [P6M]-392: OpenSearch V2 Destination Connector (#3293)
Migrates OpenSearch destination connector to V2. Relies a lot on the
Elasticsearch connector where possible. (this is expected)
2024-06-28 20:51:23 +00:00
Matt Robinson
4a71bbb44c
release: version 0.14.9 (#3312)
### Summary

Release for `0.14.9`.
0.14.9
2024-06-27 20:51:24 +00:00
Roman Isecke
137b149be8
Bugfix/ingest pipeline check (#3303)
### Description
Using a `isinstance` on the destination registry mapping breaks when
inheritance is used for the associated uploader types. This adds a
connector type field to all uploaders so that the entry can be
deterministically fetched when running check for associated stager in
pipeline.
2024-06-27 16:35:37 +00:00
Steve Canny
087adb218f
feat(docx): differentiate no-file from not-ZIP (#3306)
**Summary**
The `python-docx` error `docx.opc.exceptions.PackageNotFoundError`
arises both when no file exists at the given path and when the file
exists but is not a ZIP archive (and so is not a DOCX file).

This ambiguity is unwelcome when diagnosing the error as the two
possible conditions generally indicate a different course of action to
resolve the error.

Add detailed validation to `DocxPartitionerOptions` to distinguish these
two and provide more precise exception messages.

**Additional Context**
- `python-pptx` shares the same OPC-Package (file) loading code used by
`python-docx`, so the same ambiguity will be present in `python-pptx`.
- It would be preferable for this distinguished exception behavior to be
upstream in `python-docx` and `python-pptx`. If we're willing to take
the version bump it might be worth considering doing that instead.
2024-06-27 00:18:56 +00:00
Roman Isecke
54ec311c55
feat/migrate onedrive src (#3295)
### Description
Migrate the onedrive source connector to v2, adding in more rich content
pulled from the response of the SDK to add further metadata to the
FIleData produced by the indexer.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-26 23:59:51 +00:00
Matt Robinson
6939bff49e
build(deps): bump langchain-community version (#3305)
### Summary

Bumps to the latest `langchain-community` version to resolve
[CVE-2024-2965](https://nvd.nist.gov/vuln/detail/CVE-2024-2965).

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
2024-06-26 22:42:32 +00:00
Roman Isecke
5179b739fa
feat/more conservative ingest logging (#3301)
### Description
Isolate all log statements that happen per record and make them debug
level to avoid bloating the console output.
2024-06-26 16:54:08 +00:00
Pawel Kmiecik
575957b2d2
feat: enhance analysis options with od model dump and better vis (#3234)
This PR adds new capabilities for drawing bboxes for each layout
(extracted, inferred, ocr and final) + OD model output dump as a json
file for better analysis.

---------

Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: Michal Martyniak <michal.martyniak@deepsense.ai>
2024-06-26 13:14:55 +00:00
Steve Canny
f2fee0c32f
fix(auto): partition() passes strategy to DOC,ODT (#3278)
**Summary**
Remedy gap where `strategy` argument passed to `partition()` was not
forwarded to `partition_doc()` or `partition_odt()` and so was not
making its way to `partition_docx()`.
2024-06-26 00:29:47 +00:00
qued
0665e94b96
build: move numpy pin to packaging (#3296)
Moved numpy pin to `base.in` where it will be picked up by packaging.

Side note:
`constraints.txt` (formerly `constraints.in`) is a really useful
pattern: you put a constraint there, add that file as a `-c` requirement
in other files, and the constraint will be applied when pip-compiling
*only when needed* because the library is required by something else.
Neat! However, unfortunately, in my searches I've never found a similar
pattern for packaging, so any pins we want to propagate to user installs
need to be explicitly placed in the `.in` files.

So what is `constraints.txt` really doing for us? Well in the past I
think there have been instances where something is temporarily broken in
an upstream dependency but we expect it to be patched soon, but in the
meantime we want things to work in our CI builds and development
installs, so it's not worth pinning everywhere it's used. Having said
that, I'm coming to the conclusion that `constraints.txt` causes more
harm than good in the confusion it causes WRT packaging -- maybe we
should remove that pattern at some point.
2024-06-25 21:08:25 +00:00
Yao You
c32aeaac44
fix: wait to run soffice until there is no other soffice process running (#3287)
## Summary

This PR addresses an issue where the code could attempt to run `soffice`
in multiple processes and closes #3284
The fix is to add a wait mechanism when there is another `soffice`
process running in already.

## Diagnosis of issue

- `soffice` can only have one process running when using the command
`soffice` as is.
- on main branch the function `partition.common.convert_office_doc`
simply spawns a subprocess to run `soffice` command to convert a `doc`
or `ppt` file into `docx` or `pptx` format.
- if there are multiple partition calls to process `doc` or `ppt` files
and they all want to spawn `soffice` subprocesses only one will succeed
while other processes will simply fail and return 1 from the subprocess
- in downstream this will lead to errors like `PackageNotFoundError:
Package not found at '/tmp/tmpac6lcu4w/document.docx'`

## solution

While there are
[ways](https://www.reddit.com/r/libreoffice/comments/agk3os/how_to_open_more_than_one_calc_instance_under/)
to circumvent the limit of `soffice` by setting a tmp file as user
installation env, these kind of solutions rely on the internals of
`soffice` and adds maintenance cost to track its changes.

This PR solves this problem by adding a wait mechanism: 
- we first spawning a subprocess to run `soffice` 
- if the `stdout` is empty and we still have wait time budget left the
function first checks if there is another `soffice` running
  * If yes then the function waits for 0.01s before checking again; 
* if no the functions spawns a subprocess to run `soffice` and return to
beginning of this step
* we need to return the the beginning to check if `stdout` is empty
because we could have another collision right after `soffice` becomes
available.

## test

This PR adds two unit tests.
Additionally this can be tested by running partition of `.doc` files
locally with multiprocessing.
2024-06-25 18:49:27 +00:00
Roman Isecke
a7a53f6fcb
feat/migrate astra db (#3294)
### Description
Move astradb destination connector over to the new v2 ingest framework
2024-06-25 18:00:47 +00:00
Roman Isecke
3f581e6b7d
feat/migrate gdrive source connector (#3239)
### Description
Migrate the google drive source connector over to the new v2 ingest
framework and include a variety of improvements as part of the refactor:
* The ID is no longer limited to a drive id but can also be the id of a
subfolder within a drive or a file directly and each case is handled
appropriately
* More metadata is pulled in from google drive to enrich the partitioned
elements downstream and now the modified date is being set to not
reprocess if the ingest pipeline already has the file cached
* timing information is set on the file created when downloaded based on
the last modified data retrieved from google drive

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2024-06-25 12:55:28 +00:00
Roman Isecke
e0f4374386
Roman/bugfix conflicting event loop ingest (#3264)
### Description
In use cases where an external system (such as code being run in a
jupyter notebook) already has a running event loop, run the async code
in a dedicated thread pool to not conflict with the existing event loop.

This also has a variety of fixes that were found when putting together a
demo leveraging the elasticsearch destination connector
2024-06-24 18:47:37 +00:00