9 Commits

Author SHA1 Message Date
Steve Canny
208c7edc52
rfctr(csv): minify HTML and table text is cct (#3733)
**Summary**
Eliminate historical "idiosyncracies" of `table.metadata.text_as_html`
HTML introduced by `partition_csv()`. Produce minified `.text_as_html`
consistent with that formed by chunking.

**Additional Context**
- CSV `.metadata.text_as_html` is minified (no extra whitespace or
thead, tbody, tfoot elements).
- `table.text` is clean-concatenated-text (CCT) of table.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
2024-10-19 06:49:09 +00:00
Michał Martyniak
2d1923ac7e
Better element IDs - deterministic and document-unique hashes (#2673)
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842

Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)

This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
2024-04-24 00:05:20 -07:00
Roman Isecke
cc05e948ff
chore: sensitive info connector audit (#2227)
### Description
All other connectors that were not included in
https://github.com/Unstructured-IO/unstructured/pull/2194 are now
updated to follow the new pattern and mark any variables as sensitive
where it makes sense.
Core changes:
* All connectors now support an `AccessConfig` to mark data that's
needed for auth (i.e. username, password) and those that are sensitive
are designated appropriately using the new enhanced field.
* All cli configs on the cli definition now inherit from the base config
in the connector file to reuse the variables set on that dataclass
* The base writer class was updated to better generalize the new
approach given better use of dataclasses
* The base cli classes were refactored to also take into account the
need for a connector and write config when creating the respective
runner/writer classes.
* Any mismatch between the cli field name and the dataclass field name
were updated on the dataclass side to not impact the user but maintain
consistency
* Add custom redaction logic for mongodb URIs since the password is
expected to be a part of it. Now this:
`"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"`
->
`"mongodb+srv://ingest-test-user:***REDACTED***@ingest-test.hgaig.mongodb.net/"`
in the logs
* Bundle all fsspec based files into their own packages. 
* Refactor custom `_decode_dataclass` used for enhanced json mixin by
using a monkey-patch approach. The original approach was breaking on
optional nested dataclasses when serializing since the other methods in
`dataclasses_json_core` weren't using the new method. By monkey-patching
the original method with a new one, all other methods in that library
would use the new one.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-12-11 17:37:49 +00:00
Steve Canny
b8a8de33f4
fix(ingest): canonicalize ingest JSON (#2080)
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
2023-11-15 00:52:58 -08:00
Roman Isecke
135aa65906
update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814)
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-10-25 22:04:27 +00:00
Léa
89fa88f076
fix: stop csv and tsv dropping the first line of the file (#1530)
The current code assumes the first line of csv and tsv files are a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.

Here is a snippet of code that demonstrates the current behavior and the
proposed fix

```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
    Stanley Cups,,
    Team,Location,Stanley Cups
    Blues,STL,1
    Flyers,PHI,2
    Maple Leafs,TOR,13
    """

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
  
print("Suggested Improvement Keep First Line") 
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Looses First Line") 
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-16 17:59:35 -05:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
rvztz
2e01c49d90
feat: adds data source properties to delta table connector. (#1464) 2023-09-27 17:46:01 -07:00
Roman Isecke
106ee965a6
Roman/delta table connector (#1132)
### Description
Add delta table connector and test against a delta table generated via
delta.io and uploaded to s3. Shows an example of how to use the
connection options to leverage s3.

I was able to get this to work with s3 if I pass in the access and
secret keys as storage options. Even though the s3 bucket being used is
public, would not work without those.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
2023-08-22 10:19:46 -04:00