9 Commits

Author SHA1 Message Date
cragwolfe
bd8a74d686
chore: shell scripts default indent of 2 instead of 4 (#2287)
Given the tendency for shell scripts to easily enter into a few levels
of indentation and long line lengths, update the default to 2 spaces.
2023-12-19 07:48:21 +00:00
Roman Isecke
76efcf4dd7
chore: add shfmt (#2246)
### Description
Given all the shell files that now exist in the repo, would be nice to
have linting/formatting around them (in addition to the existing
shellcheck which doesn't do anything to format the shell code). This PR
introduces `shfmt` to both check for changes and apply formatting when
the associated make targets are called.
2023-12-12 01:04:15 +00:00
David Potter
cde11d1eb0
feat: Add sftp source connector (#2163)
Adds source connector for SFTP which uses fsspec and paramiko via
fsspec. Paramiko is the standard sftp package for python used in pysftp
etc...

```
--username foo \
--password bar \
--remote-url sftp://localhost:47474/upload/
```

Will only download a specifically requested file if it has an extension.
(i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will
treat any other remote_url as a folder path. This is intentional.

---------

Co-authored-by: potter-potter <david.potter@gmail.com>
2023-12-07 19:33:19 +00:00
Klaijan
2d450c48e7
fix: skipped file not found error (#2188)
Create the file outside the if-clause.
2023-11-29 18:31:59 +00:00
Roman Isecke
b951d73a9b
feat: add logging to ingest CLI for tests being skipped at the end (#2174)
### Description
Often times there are tests being skipped either due to missing env vars
or explicitly defined in the base script but these get lost in the logs.
This PR updates the scripts to leverage a custom error code if being
skipped due to missing env vars and this custom error code is being
caught by the base script and logs all files being skipped to a file. At
the end of the script, this file gets logged in the CI output.
2023-11-29 13:41:19 +00:00
rvztz
50b1431c9e
rvztz/hubspot ingest connector (#1760)
Closes #1843 

Ingest connector for HubSpot. Supports:
- Calls: Logs from calls related to contacts, companies and tickets
- Communications: Logs from SMS/Whatsapp related to contacts, companies
and tickets
- Notes: Notes related to CRM notes
- Products: CRM products
- Emails: Logs from emails sent to CRM objects.
- Tasks: CRM tasks

From each record, `body/`description`information is grabbed. When a
title property is available, this is registered at the beggining of the
output file. The CLI receives three params:
- `api-token`: [Private
app](https://developers.hubspot.com/docs/api/private-apps) token.
- `object-types: One of the noted supported objects in the form of a
comma separated list: `calls,products,tasks`
- `custom-properties`: Custom properties to grab information from. Must
be in the form
`<object_type>:<custom_property_id>,<object_type>:<custom_property_id>`

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rvztz <rvztz@users.noreply.github.com>
2023-11-28 23:07:57 +00:00
Klaijan
433c3889dc
ci: reorganize eval output folders and add azure to matrix test (#2093)
**Summary**
The CI workflow for evaluation previously saved the metric outputs to
the `metrics/` folder. Currently structured in subfolders e.g.
`metrics/text-extraction` `metrics/element-type` for the folder clean up
purpose.

Additionally, Azure connector is also added to
`full_python_matrix_tests` in this PR.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>
2023-11-21 20:04:30 +00:00
Roman Isecke
6e67c48fd8
feat: update all ingest tests to use huggingface for embeddings (#2071)
### Description
Update any use of OpenAI for generating embeddings in the ingest tests
to use Huggingface

**Bonus Changes:**
* Remove duplicate delta table test
* Delete delta table destination directory at the beginning of the test
to make sure it doesn't exist and prevent the test from breaking.
2023-11-21 18:43:19 +00:00
Roman Isecke
24a419ece0
separate ingest tests (#1951)
### Description
This splits the source ingest tests from the destination ingest tests
since they share a different pattern:
* src tests pull data from a source and compare the partitioned content
to the expected results
* destingation tests leverage the local connector to produce results to
push to a destination and leverages overhead to create temporary
locations at those destinations to write to and delete when done.

Only the src tests create partitioned content that needs to be checked
so the update ingest test CI job only needs to run these.
2023-11-01 19:23:44 +00:00