John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects the document language using the `langdetect` package.
Adds new kwargs that let the user either set the document language
(`languages`) or detect the language at the element level
(`detect_language_per_element`) instead of the default document level.
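
To illustrate how the two kwargs might interact, here is a minimal sketch. It is not the code from this change: `Element`, `toy_detect`, and `apply_languages` are hypothetical names, `toy_detect` is a stand-in for a real detector such as `langdetect`, and the rule that an explicit `languages` value skips detection is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    text: str
    languages: List[str] = field(default_factory=list)


def toy_detect(text: str) -> str:
    """Toy detector (assumption): a real one would call a language-detection library."""
    spanish_markers = {"el", "la", "es", "una", "gato"}
    words = set(text.lower().split())
    return "spa" if words & spanish_markers else "eng"


def apply_languages(
    elements: List[Element],
    languages: Optional[List[str]] = None,
    detect_language_per_element: bool = False,
    detect: Callable[[str], str] = toy_detect,
) -> List[Element]:
    if languages:
        # Assumed precedence: user-supplied languages are used as-is.
        for el in elements:
            el.languages = list(languages)
    elif detect_language_per_element:
        # Each element is detected on its own text.
        for el in elements:
            el.languages = [detect(el.text)]
    else:
        # Default: detect once on the whole document and share the result.
        doc_language = detect(" ".join(el.text for el in elements))
        for el in elements:
            el.languages = [doc_language]
    return elements
```

With per-element detection, a mixed-language document can carry a different language on each element; with the default, every element shares the single document-level result.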
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
rvztz
424852ab39
feat: adds data source properties to Sharepoint and Outlook (#1278)
2023-09-20 09:13:35 +00:00
Christine Straub
0e887cc36b
Feat/1060 update metadata fields (#1099)
Closes GitHub issue #1060.
* update the metadata field `links`
* update the metadata field `emphasized_texts`
2023-08-16 04:33:06 +00:00
Matt Robinson
f4ddf53590
feat: track emphasized text in partition_html (#1034)
* Feat/965 track emphasized text html (#1021)
* feat: add functionality to track emphasized text (`<strong>`, `<em>`, `<span>`, `<b>`, `<i>` tags) in HTML
* feat: add `include_tail_text` parameter to `_construct_text`
* test: add test case for `_get_emphasized_texts_from_tag`
* test: add `emphasized_texts` to metadata
* chore: update changelog & version
* fix tests
* fix lint errors
* chore: update changelog
* chore: small comment updates
* feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag
* chore: update changelog
* Update ingest test fixtures (#1026)
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
* ingest-test-fixtures-update
* Update ingest test fixtures (#1035)
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
---------
Co-authored-by: Christine Straub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>
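
The emphasized-text tracking described in this commit can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code from the change: `EmphasisCollector` is a hypothetical name, and the `{"text": ..., "tag": ...}` record shape is assumed.

```python
from html.parser import HTMLParser

# Tags named in the commit message as carrying emphasis.
EMPHASIS_TAGS = {"strong", "em", "span", "b", "i"}


class EmphasisCollector(HTMLParser):
    """Hypothetical collector: records text found inside emphasis tags."""

    def __init__(self):
        super().__init__()
        self._open_tags = []   # stack of currently open emphasis tags
        self.emphasized = []   # collected {"text", "tag"} records

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Close the innermost open occurrence of this tag.
            idx = len(self._open_tags) - 1 - self._open_tags[::-1].index(tag)
            del self._open_tags[idx]

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            # Attribute the text to the innermost open emphasis tag.
            self.emphasized.append({"text": text, "tag": self._open_tags[-1]})


collector = EmphasisCollector()
collector.feed("<p>Plain <b>bold</b> and <i>italic</i> text</p>")
collector.close()
print(collector.emphasized)
```

Text outside any emphasis tag ("Plain", "and", "text") is skipped, so only the bold and italic runs are recorded.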
2023-08-03 16:24:25 +00:00
cragwolfe
13d3559fa4
chore: rename Element's "date" field to "last_modified" (#997)
Rename the Element's `date` field to the more specific `last_modified`, leaving less room for confusion about what the field represents.
2023-08-01 02:55:43 +00:00
David Potter
f7e46af22f
feat: adds Outlook connector (#939)
* bonus: fixes an issue in email partitioning where the From field was being assigned the To field's value.
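
For context, the From/To mix-up is the kind of bug where one header is read twice. A minimal sketch of reading each header into its own field, using only the standard library; `sender_and_recipient` and the `sent_from`/`sent_to` keys are hypothetical names chosen for illustration, not the partitioner's actual code:

```python
from email import message_from_string


def sender_and_recipient(raw_email: str) -> dict:
    """Hypothetical helper: read From and To from their own headers."""
    msg = message_from_string(raw_email)
    return {
        "sent_from": msg.get("From"),  # sender comes from the From header
        "sent_to": msg.get("To"),      # recipient comes from the To header
    }


raw = "From: alice@example.com\nTo: bob@example.com\nSubject: hi\n\nbody\n"
print(sender_and_recipient(raw))
```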
2023-07-26 04:09:26 +00:00