unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-18 21:10:01 +00:00

Author	SHA1	Message	Date
Michał Martyniak	2d1923ac7e	Better element IDs - deterministic and document-unique hashes (#2673 ) Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842 Main changes compared to part one: * hash computation includes element's sequence number on page, page number, document filename and its text * there are more test for deterministic behavior of IDs returned by partitioning functions + their uniqueness (guaranteed at the document level, and high probability across multiple documents) This PR addresses the following issue: https://github.com/Unstructured-IO/unstructured/issues/2461	2024-04-24 00:05:20 -07:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes Onedrive bug the same way Ryan fixed the Sharepoint error. (both are microsoft products) https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing occurrences of inconsistency in the timestamps returned by Onedrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes This adds logic to guarantee Onedrive dates will be properly formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others) Adds unit tests for isofomat. json_to_dict already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds small change for AstraDB to allow them to see what source called their api	2024-03-15 14:01:05 +00:00
Steve Canny	b8a8de33f4	fix(ingest): canonicalize ingest JSON (#2080 ) Canonicalize JSON produced for ingest tests such that incidental changes is _form_ of the JSON objects (keys moving around) that does not change the _content_ of that JSON object does not trigger an ingest-test failure.	2023-11-15 00:52:58 -08:00
shreyanid	32bfebccf7	feat: introduce language detection function for text partitioning function (#1453 ) ### Summary Uses `langdetect` to detect all languages present in the input document. ### Details - Converts all language codes (whether user inputted or detected using `langdetect`) to a standard ISO 639-3 code. - Adds `languages` field to the metadata - Will revisit how to nonstandardly represent simplified vs traditional Chinese scripts internally (separate PR). - Update ingest test results to add `languages` field to documents. Some other side effects are changes in order of some elements and changes in element categorization ### Test You can test the detect_languages function individually by importing the function and inputting a text sample and optionally a language: ``` text = "My lubimy mleko i chleb." doc_langs = detect_languages(text) print(doc_langs) ``` -> ['ces', 'pol', 'slk'] --------- Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-26 18:09:27 +00:00
ryannikolaidis	955efac935	fix: SharePoint connector fails if any document has an unsupported filetype (#1493 )	2023-09-22 18:47:28 +00:00
rvztz	2f52df180f	Adds data source properties to onedrive, reddit and slack (#1281 )	2023-09-20 04:26:36 +00:00
rvztz	ce20c3f2bc	feat: add OneDrive connector (#834 )	2023-07-13 20:57:54 +00:00

7 Commits