7 Commits

Author SHA1 Message Date
Roman Isecke
2bb463d006
feat: support both single and batch ingest docs (#2105)
### Description
There are some source ingest connectors that would be more efficient to
read the content in batches rather than use an entire process per
document. For example, reading from ElasticSearch. Given an index with
possible hundreds of documents, reading each one individually is not as
optimal as reading in batches. To try and maintain as much of the ingest
doc paradigm already being supported, a new class `BaseIngestDocBatch`
was added to handle reading in batches. It produces a list of
`BaseSingleIngestDoc` which is what all current implementations were
renamed to. This list is generated after it runs its `get_files` method.
Past the source node, all other steps in the pipeline should not be
affected, this is just an optimization for the read step.

**Additional Changes:**
* Removed use of jq and instead converted this into a fields filter on
the content to let the database handle the filtering and limit the
amount of data being pulled in.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-11-27 19:25:30 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
2023-11-15 00:52:58 -08:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
shreyanid
32bfebccf7
feat: introduce language detection function for text partitioning function (#1453)
### Summary
Uses `langdetect` to detect all languages present in the input document.

### Details
- Converts all language codes (whether user inputted or detected using
`langdetect`) to a standard ISO 639-3 code.
- Adds `languages` field to the metadata
- Will revisit how to nonstandardly represent simplified vs traditional
Chinese scripts internally (separate PR).
- Update ingest test results to add `languages` field to documents. Some
other side effects are changes in order of some elements and changes in
element categorization

### Test
You can test the detect_languages function individually by importing the
function and inputting a text sample and optionally a language:
```
text = "My lubimy mleko i chleb."
doc_langs = detect_languages(text)
print(doc_langs)
```
-> ['ces', 'pol', 'slk']

---------

Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: shreyanid <shreyanid@users.noreply.github.com>
Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2023-09-26 18:09:27 +00:00
rvztz
9a3e24fcbb
Adds data source properties to elasticsearch, wikipedia and google-drive (#1282) 2023-09-19 20:25:38 +00:00
Klaijan
ad386af8b5
Klaijan/auto paragraph grouper (#994)
* add auto_paragraph_grouper. add line break pattern.

* combine group_broken_paragraph and blank_line_grouper function

* fix make check errors

* fix make check errors

* fix make check errors

* fix make check errors

* run make tidy to fix errors

* tidy core.py and text.py

* fix blank-line breaker to extends the result and replace new line with space

* fix function name typo

* call group_broken_paragraphs for blank_line_grouper

* edit function name from one_line_grouper to new_line_grouper for consistency

* edit threshold from 0.5 to 0.1

* edit threshold from 0.5 to 0.1

* Revert "call group_broken_paragraphs for blank_line_grouper"

This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9.

* revert to commit 8fb93b7 and change threshold from 0.5 to 0.1

* edit test_text assertion. remove all BULLETS_PATTERN.

* Update ingest test fixtures (#1052)

Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>

* edit test case in test_xml_partition

* update assertion on test_auto

---------

Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local>
Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com>
Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
2023-08-07 18:37:18 -04:00
Ahmet Melek
5ea216cf07
feat: elasticsearch connector (#817) 2023-07-01 17:45:28 +00:00