unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-10-03 04:14:15 +00:00

Author	SHA1	Message	Date
Amanda Cameron	a501d1d18f	Adding table extraction to partition_html (#1324 ) Adding table extraction to HTML partitioning. This PR utilizes 'table' HTML elements to extract and parse HTML tables and return them in partitioning. ``` # checkout this branch, go into ipython shell In [1]: from unstructured.partition.html import partition_html In [2]: path_to_html = "{html sample file with table}" In [3]: elements = partition_html(path_to_html) ``` you should see the table in the elements list!	2023-09-11 11:14:11 -07:00
cragwolfe	d0749d181f	fix: avoid PDF sorting error on negative coords (#1361 ) The default sorting algorithm for PDF's, "xycut," would cause an error when partitioning a document if Y coordinate points were negative. This change checks for that condition (or more broadly, any negative coordinates) and falls back to the "basic" sort if that is the case. This PR does not address the underlying issue of "bad points" which still should be investigated. However, the sorting code should be less brittle to unexpected bounding boxes in the first case. Resolves: https://github.com/Unstructured-IO/unstructured/issues/1296	2023-09-10 19:29:49 -07:00
cragwolfe	87bfe7a1fe	build(deps): PDF images, unstructured-inference==0.5.23 (#1341 ) Bumps unstructured-inference==0.5.23 to pull in @christinestraub's fix: https://github.com/Unstructured-IO/unstructured-inference/pull/198 , so embedded Images in PDF's are now included in partition results ("hi_res"). From the perspective of elements with clean text, this is not a big win as a lot of the images have OCR garbage. However, it is important to preserve image elements for other downstream use cases, so overall this is a step forward.	2023-09-08 05:29:53 +00:00
Matt Robinson	22974f61ce	fix: separate elements by `<br>` tag in `partition_html` (#1314 ) ### Summary Closes #1230. Updates `partition_html` to split on `<br>` tags that appear within text elements. ### Testing The following code previously produced one giant element on `main`. ```python from unstructured.partition.html import partition_html filename = "example-docs/ideas-page.html" elements = partition_html(filename=filename) len(elements) # Should be 4 print("\n\n".join([str(el) for el in elements])) ``` The output should be: ```python January 2023 (Someone fed my essays into GPT to make something that could answer questions based on them, then asked it where good ideas come from. The answer was ok, but not what I would have said. This is what I would have said.) The way to get new ideas is to notice anomalies: what seems strange, or missing, or broken? You can see anomalies in everyday life (much of standup comedy is based on this), but the best place to look for them is at the frontiers of knowledge. Knowledge grows fractally. From a distance its edges look smooth, but when you learn enough to get close to one, you'll notice it's full of gaps. These gaps will seem obvious; it will seem inexplicable that no one has tried x or wondered about y. In the best case, exploring such gaps yields whole new fractal buds. ```	2023-09-07 13:16:31 +00:00
Ahmet Melek	09cc4bfa5f	feat: jira connector (cloud) (#1238 ) This connector: - takes a Jira Cloud URL, user email and api token; to authenticate into Jira Cloud - ingests: - either all issues in all projects in a Jira Cloud Organization - or - issues in user specified projects, boards - user specified issues - processes this kind of data: - text fields such as issue summary, description, and comments - dropdown fields such as issue type, status, priority, assignee, reporter, labels, and components - other data such as issue id, issue key, project id, information on subtasks - notes down attachment URLs, however does not process attachments - stores each downloaded issue in a txt file, in a predefined template form (consisting of the data above) - then processes each downloaded issue document into elements using unstructured library - related to: https://github.com/Unstructured-IO/unstructured/issues/263 To test the changes, make the necessary setups and run the relevant ingest test scripts. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-09-06 10:10:48 +00:00
David Potter	b710bafa89	feat: add salesforce connector (#1168 )	2023-09-02 08:50:31 -07:00
cragwolfe	65344117b1	enhancement: entire page OCR output included with hi_res (#1263 ) Bumps unstructured-inference==0.5.19 to bring in @christinestraub's enhancement https://github.com/Unstructured-IO/unstructured-inference/pull/186 . This is a massive improvement where previously omitted text was not included in `hi_res` output if the layout model had not put a bounding box around it. In addition, the xycut sorting algorithm generally does a good job of ordering the merged OCR-text-not-in-layout-model bboxes with layout-model bboxes into "natural reading order." More details in https://github.com/Unstructured-IO/unstructured-inference/pull/186#issuecomment-1700438645 . Bonus: changelog fix.	2023-09-01 04:27:48 +00:00
ryannikolaidis	076b1e38f4	feat: serialize ingest docs as json (#1178 )	2023-08-31 01:48:41 +00:00
Ahmet Melek	b22e18f7d8	uncomment confluence diff ingest test (#1217 ) Uncomment confluence-diff ingest test to: - see if the test has consistent results - keep testing the confluence connector --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-28 18:05:57 -07:00
cragwolfe	4c13d12dc3	fix: prevent spammy ListItem's from images and PDF's (#1210 ) The issue was that for blocks detected in an image such as: ![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a) , where the full image is: https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png , many ListItem's would be extracted that were not adding much value to the output (assuming the block was determined to be of type List from the layout model). This particular file is also used in ingest tests, and you can see the prior output here: https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280 Test Instructions: 1. run the following snippet: ``` import json import os from datetime import datetime from unstructured.__version__ import __version__ from unstructured.partition.auto import partition from unstructured.staging.base import elements_to_json filename = "/opt/home/tmp/IRS-form-1987.png" output_dir = "/opt/home/tmp/json" base_name_with_ext = os.path.basename(filename) output_filename_part = os.path.join(output_dir, base_name_with_ext) print(f"unstructured version: {__version__}") #for strategy in ("hi_res", "fast", "auto"): for strategy in ("hi_res",): d1 = datetime.now() elements = partition(filename=filename, strategy=strategy) elems_as_dicts = json.loads(elements_to_json(elements, indent=2)) # strip out metadata for the sake of more readable results for element_dict in elems_as_dicts: del element_dict["metadata"] json_filename=f"{output_filename_part}-{strategy}.json" with open(json_filename, "w") as jsonf: jsonf.write(json.dumps(elems_as_dicts, indent=2)) d2 = datetime.now() print(f"num elements for {strategy}: {len(elements)}") print(f"time elapsed {strategy}: {(d2-d1).total_seconds()}") ``` updating the `filename` and `output_dir` paths for your particular local 
environment. 2. Open the json file that was written to your `output_dir`, named IRS-form-1987.png-hi_res.json Witness the new element: ``` { "type": "ListItem", "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f", "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87 -61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissio ner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method f or bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving cre dit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not fil e Form 3115 for these changes." }, ```	2023-08-26 21:01:07 -07:00
Matt Robinson	c578b85699	fix: respect `<pre>` tag order in `partition_html` (#1197 ) ### Summary Closes #1184. Updates `partition_html` to respect the ordering of `<pre>` tags in HTML documents. ### Testing The elements in the following example should be in the correct order. ```python from unstructured.partition.html import partition_html html_text = """ <pre>The Big Brown Bear</pre> <div>The big brown bear is growling.</div> <pre>The big brown bear is sleeping.</pre> <div>The Big Blue Bear</div> """ elements = partition_html(text=html_text) print("\n\n".join([str(el) for el in elements])) ```	2023-08-25 04:14:48 +00:00
Christine Straub	483b09b3c9	Feat/1136 elements ordering for pdf (#1161 ) ### Summary Address [#1136](https://github.com/Unstructured-IO/unstructured/issues/1136) for `hi_res` and `fast` strategies. The `ocr_only` strategy does not include coordinates. - add functionality to switch sort mode between the current `basic` sorting and the new `xy-cut` sorting for `hi_res` and `fast` strategies - add the script to evaluate the `xy-cut` sorting approach - add jupyter notebook to provide evaluation and visualization for the `xy-cut` sorting approach ### Evaluation ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py <file_path> <strategy> ``` Here, the file should be under the project root directory. For example, ``` export PYTHONPATH=.:$PYTHONPATH && python examples/custom-layout-order/evaluate_xy_cut_sorting.py example-docs/multi-column-2p.pdf fast ```	2023-08-24 17:46:19 -07:00
Trevor Bossert	f267cef329	feat: Adds in threaded replies (#1188 ) - Puts threaded replies into the same text field as parent message, allowing for a full thread to be under a single element_id - Output is now XML instead of TXT to allow for easier parsing of new format. https://github.com/Unstructured-IO/unstructured/issues/1186	2023-08-24 12:12:29 -07:00
Austin Walker	e7d189fcc8	chore: Bump inference and set default ocr_mode to entire_page (#1172 ) * pip-compile in order to bump unstructured-inference * Set the default `ocr_mode` back to `entire_page` now that [this error](https://github.com/Unstructured-IO/unstructured-inference/pull/183) is addressed * Explicitly add `sphinx-tabs` to `build.in`. This file provides `docs/requirements.txt`. * Remove a pinned `pydantic` version * Fix a makefile command to `pip-compile` a missing ingest file.	2023-08-22 16:05:02 -07:00
Roman Isecke	106ee965a6	Roman/delta table connector (#1132 ) ### Description Add delta table connector and test against a delta table generated via delta.io and uploaded to s3. Shows an example of how to use the connection options to leverage s3. I was able to get this to work with s3 if I pass in the access and secret keys as storage options. Even though the s3 bucket being used is public, would not work without those. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-22 10:19:46 -04:00
Roman Isecke	db8af4f5de	Roman/notion tests (#1072 ) ### Description * Add ingest test for Notion docs * Update default cache dir for connectors to include connector name. Makes debugging the cached content easier. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-08-21 15:16:50 -04:00
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com>	2023-08-18 20:59:08 +00:00
cragwolfe	dd0f582585	build(deps): bump unstructured-inference==0.5.13 (#1141 ) Bump to unstructured-inference==0.5.13, which includes: Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res.	2023-08-17 06:25:00 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True(#1125 ) Closes Github issue #1121 Adds include_header kwarg to partition_xlsx and change default behavior to True.	2023-08-17 04:16:23 +00:00
Christine Straub	0a23139720	enhancement: implement full-page OCR(#1133 ) *implements full-page OCR as supported in unstructured-inference=0.5.11.	2023-08-16 19:16:35 +00:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts	2023-08-16 04:33:06 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119 ) Closes Github issue #1010 adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines	2023-08-15 09:35:54 -07:00
Christine Straub	80266460fd	fix: GH issue 1057 etree parser error (csv) (#1112 ) Addresses #1057 for CSV. Related to PR #1077. * update partition_csv to always use soupparser_fromstring to parse html text	2023-08-14 17:48:57 +00:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of resources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
Ahmet Melek	64a1930c46	chore[ingest]: fix confluence ingest diff tests (#1082 ) * trigger CI * trigger CI * trigger CI * do not ingest personal spaces in the diff test * fix argument * Update ingest test fixtures (#1083) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-10 17:45:17 +00:00
rvztz	dee9b405cd	feat: Sharepoint connector (#918 )	2023-08-10 09:37:58 -07:00
Yuming Long	b4fe40e484	Chore[ingest]: adding parameter --partition-pdf-infer-table-structure (#1056 ) * add param * expected test * add option (to do doc nit) * test with api for now * typo * test with api key * use local only * encoding -> partition-encoding * changelog and version * Update ingest test fixtures (#1055) Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> * ignore coordinates * no whitespace lol * Update ingest test fixtures (#1061) Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-08-08 18:11:06 -04:00
Klaijan	ad386af8b5	Klaijan/auto paragraph grouper (#994 ) * add auto_paragraph_grouper. add line break pattern. * combine group_broken_paragraph and blank_line_grouper function * fix make check errors * fix make check errors * fix make check errors * fix make check errors * run make tidy to fix errors * tidy core.py and text.py * fix blank-line breaker to extends the result and replace new line with space * fix function name typo * call group_broken_paragraphs for blank_line_grouper * edit function name from one_line_grouper to new_line_grouper for consistency * edit threshold from 0.5 to 0.1 * edit threshold from 0.5 to 0.1 * Revert "call group_broken_paragraphs for blank_line_grouper" This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9. * revert to commit 8fb93b7 and change threshold from 0.5 to 0.1 * edit test_text assertion. remove all BULLETS_PATTERN. * Update ingest test fixtures (#1052) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * edit test case in test_xml_partition * update assertion on test_auto --------- Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local> Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com> Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-07 18:37:18 -04:00
ryannikolaidis	cd1df5e8e6	fix: remove default encoding for ingest (#1036 )	2023-08-05 16:57:45 +00:00
Christine Straub	b76d2ee745	feat: track emphasized text msword (#1048 ) * feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph * chore: add docstring * chore: fix lint errors * feat: ignore spaces when extracting emphasized texts from a paragraph * feat: add functionality to track emphasized text (`bold/italic` formatting) from table * test: add test case for grabbing emphasized texts from element metadata * chore: fix lint errors * chore: update changelog & version * Update ingest test fixtures (#1047)	2023-08-04 17:04:12 -04:00
Matt Robinson	f4ddf53590	feat: track emphasized text in `partition_html` (#1034 ) * Feat/965 track emphasized text html (#1021) * feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML * feat: add `include_tail_text` parameter to `_construct_text` * test: add test case for `_get_emphasized_texts_from_tag` * test: add `emphasized_texts` to metadata * chore: update changelog & version * fix tests * fix lint errors * chore: update changelog * chore: small comment updates * feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag * chore: update changelog * Update ingest test fixtures (#1026) Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io> * ingest-test-fixtures-update * Update ingest test fixtures (#1035) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-08-03 16:24:25 +00:00
ryannikolaidis	70365ea42d	chore: add Dropbox secrets to CI environments (#1029 )	2023-08-03 02:18:29 +00:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
David Potter	1542607892	feat: adds Box connector (#996 )	2023-08-01 01:10:10 +00:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
Matt Robinson	6e852cbe70	feat: track links from anchor tags in `partition_html` (#959 ) * track tags in html * pass through links as metadata * add test for grabbing links * one more link * changelog and version * update docs * fix tests * update empty link assertion * ingest-test-fixtures-update * Update ingest test fixtures (#961)	2023-07-24 18:28:56 +00:00
Jason Scheirer	196efa09b1	chore: Add encoding param to ingest (#955 ) * Add encoding param to ingest	2023-07-24 10:06:13 -07:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) 
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
rvztz	ce20c3f2bc	feat: add OneDrive connector (#834 )	2023-07-13 20:57:54 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879 ) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
David Potter	bec733cdf8	feat: add Dropbox connector (#844 )	2023-06-30 17:08:27 -07:00
qued	350bb1dad5	enhancement: clean pdf elements (bump unstructured-inference) (#790 ) More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4) Make large model available (from unstructured-inference bump to 0.5.3) Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2) --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructured.io>	2023-06-29 18:35:06 -07:00
ryannikolaidis	62e20442df	chore: refactor ingest tests (#814 ) - Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth - Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory - Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository - Construct paths as reusable variables declared at top of scripts - Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance) - OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind - Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true - Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)	2023-06-29 23:13:41 +00:00
ryannikolaidis	8ea5f6939e	fix: parameterized ingest test overwriting (#838 ) * sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh * fixes incorrect expected results * update expected results to properly parse Korean text * bonus: installs language pack for Korean in CI and ingest fixture workflows	2023-06-29 18:37:09 +00:00
Roman Isecke	9882c2b83f	Avoid setting metadata in constructor signature for elements (#837 ) Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification). Bonus refactor for PageBreak to have text values of "". --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-29 03:14:05 +00:00
Matt Robinson	44411ecc59	enhancement: `max_partition` kwarg for limiting element size (#818 ) * add max partition size logic * work splitting logic into split_by_paragraph * pass through max_partition to other functions * added test for splitting long document * add type hint * add documentation * version and changelog * ingest-test-fixtures-update * Update ingest test fixtures (#819) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * retrigger ci * ingest-test-fixtures-update * ingest-test-fixtures-update * Update ingest test fixtures (#821) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * update default for partition_xml * update version for release * update msg doc string --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 15:26:01 -04:00
Matt Robinson	38457777fa	fix: ignore escaped commas in CSV checks (#832 ) * fix file content checking bug * skip counting commas in quotes for csv detection * add test for comma count * change file content grab to -1 * version and changelog * add csv to extension check * add file to tests * ingest-test-fixtures-update * Update ingest test fixtures (#833) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * fix typo * fix changelog wording --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 17:22:23 +00:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642 ) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests for html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
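
Commit 9882c2b83f above avoids setting metadata in the constructor signature for elements because that can lead to unexpected object reuse. The sketch below illustrates the general Python pitfall behind that kind of fix, not the library's actual code: `Metadata`, `SharedDefaultElement`, and `SafeElement` are hypothetical stand-ins invented for this example.

```python
class Metadata:
    """Hypothetical stand-in for element metadata (not the library's class)."""

    def __init__(self):
        self.filename = None


class SharedDefaultElement:
    # Pitfall: the default Metadata() is evaluated once, when the function
    # is defined, so every instance created without an explicit metadata
    # argument shares the same object.
    def __init__(self, text, metadata=Metadata()):
        self.text = text
        self.metadata = metadata


class SafeElement:
    # Safer pattern: default to None in the signature and build a fresh
    # Metadata object inside the constructor body.
    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = metadata if metadata is not None else Metadata()


a, b = SharedDefaultElement("a"), SharedDefaultElement("b")
a.metadata.filename = "a.pdf"
shared_leak = b.metadata.filename  # b sees the value written through a

c, d = SafeElement("c"), SafeElement("d")
c.metadata.filename = "c.pdf"
safe_value = d.metadata.filename  # d keeps its own, untouched metadata
```

With the shared default, a change made through one element is visible on every other element that relied on the default; with the `None` default, each element gets its own metadata object.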

84 Commits
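
Commit d0749d181f in the log above describes checking for negative coordinates and falling back from the "xycut" sort to the "basic" sort. A minimal sketch of that kind of guard follows, assuming bounding boxes are plain `(x1, y1, x2, y2)` tuples; the function name `choose_sort_mode` and the `BBox` alias are hypothetical and do not come from the library.

```python
from typing import Iterable, Tuple

# Assumed bounding-box layout for this sketch: (x1, y1, x2, y2).
BBox = Tuple[float, float, float, float]


def choose_sort_mode(bboxes: Iterable[BBox], preferred: str = "xycut") -> str:
    """Return the preferred sort mode unless any coordinate is negative.

    Hypothetical helper: a box with a negative coordinate triggers the
    fallback to the "basic" sort instead of raising an error.
    """
    for bbox in bboxes:
        if any(coord < 0 for coord in bbox):
            return "basic"
    return preferred


clean_page = [(0.0, 0.0, 100.0, 20.0), (0.0, 30.0, 100.0, 50.0)]
bad_page = [(0.0, -5.0, 100.0, 20.0)]

mode_for_clean = choose_sort_mode(clean_page)
mode_for_bad = choose_sort_mode(bad_page)
```

As the commit message notes, a guard like this only makes sorting less brittle; it does not explain why the bad points appear in the first place.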