unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-24 17:41:15 +00:00

Author	SHA1	Message	Date
Austin Walker	dd243b4fd9	chore: pass ocr_mode in partition_pdf_or_image (#1154 ) Set to individual_blocks for now to work around [this bug](https://github.com/Unstructured-IO/unstructured-inference/issues/179). I verified by printing the current ocr_mode in inference. The `entire_page` default is overridden. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: awalker4 <awalker4@users.noreply.github.com>	2023-08-18 20:59:08 +00:00
ryannikolaidis	668d0f1b01	feat: per-process ingest connections (#1058 ) * adds per process connections for Google Drive connector	2023-08-17 17:34:08 +00:00
cragwolfe	dd0f582585	build(deps): bump unstructured-inference==0.5.13 (#1141 ) Bump to unstructured-inference==0.5.13, which includes: Fix extracted image elements being included in layout merge, addresses the issue where an entire-page image in a PDF was not passed to the layout model when using hi_res.	2023-08-17 06:25:00 +00:00
John	9f7bd6127b	enhancement: Add `include_header` kwarg for xlsx, default True(#1125 ) Closes Github issue #1121 Adds include_header kwarg to partition_xlsx and change default behavior to True.	2023-08-17 04:16:23 +00:00
Christine Straub	0a23139720	enhancement: implement full-page OCR(#1133 ) *implements full-page OCR as supported in unstructured-inference=0.5.11.	2023-08-16 19:16:35 +00:00
Christine Straub	0e887cc36b	Feat/1060 update metadata fields (#1099 ) Closes Github Issue #1060. * update the metadata field links * update the metadata field emphasized_texts	2023-08-16 04:33:06 +00:00
John	6e5d27c6c3	fix pdf partition of list items being detected as titles in OCR only mode (#1119 ) Closes Github issue #1010 adds group_bullet_paragraph func to handle grouping of bullet items that are split across multiple lines	2023-08-15 09:35:54 -07:00
Christine Straub	80266460fd	fix: GH issue 1057 etree parser error (csv) (#1112 ) Addresses #1057 for CSV. Related to PR #1077. * update partition_csv to always use soupparser_fromstring to parse html text	2023-08-14 17:48:57 +00:00
Ahmet Melek	627f78c16f	feat: airtable connector (#1012 ) * add the first version of airtable connector * change imports as inline to fail gracefully in case of lacking dependency * parse tables as csv rather than plain text * add relevant logic to be able to use --airtable-list-of-paths * add script for creation of reseources for testing, add test script (large) for testing with a large number of tables to validate scroll functionality, update test script (diff) based on the new settings * fix ingest test names * add scripts for the large table test * remove large table test from diff test * make base and table ids explicit * add and remove comments * use -ne instead of != * update code based on the recent ingest refactor, update changelog and version * shellcheck fix * update comments * update check-num-rows-and-columns-output error message Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * update help comments * update help comments * update help comments * update workflows to set auth tokens and to run make install * add comments on create_scale_test_components * separate component ids from the test script, add comments to document test component creation * add LARGE_BASE test, implement LARGE_BASE component creation, replace component id * shellcheck fixes * shellcheck fixes * update docs * update comment * bump version * add wrongly deleted file * sort columns before saving to process * Update ingest test fixtures (#1098) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-11 12:02:51 -07:00
Ahmet Melek	64a1930c46	chore[ingest]: fix confluence ingest diff tests (#1082 ) * trigger CI * trigger CI * trigger CI * do not ingest personal spaces in the diff test * fix argument * Update ingest test fixtures (#1083) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-10 17:45:17 +00:00
rvztz	dee9b405cd	feat: Sharepoint connector (#918 )	2023-08-10 09:37:58 -07:00
Yuming Long	b4fe40e484	Chore[ingest]: adding parameter --partition-pdf-infer-table-structure (#1056 ) * add param * expected test * add option (to do doc nit) * test with api for now * typo * test with api key * use local only * encoding -> partition-encoding * changelog and version * Update ingest test fixtures (#1055) Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> * ignore coordinates * no witespace lol * Update ingest test fixtures (#1061) Co-authored-by: yuming-long <yuming-long@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>	2023-08-08 18:11:06 -04:00
Klaijan	ad386af8b5	Klaijan/auto paragraph grouper (#994 ) * add auto_paragraph_grouper. add line break pattern. * combine group_broken_paragraph and blank_line_grouper function * fix make check errors * fix make check errors * fix make check errors * fix make check errors * run make tidy to fix errors * tidy core.py and text.py * fix blank-line breaker to extends the result and replace new line with space * fix function name typo * call group_broken_paragraphs for blank_line_grouper * edit function name from one_line_grouper to new_line_grouper for consistency * edit threshold from 0.5 to 0.1 * edit threshold from 0.5 to 0.1 * Revert "call group_broken_paragraphs for blank_line_grouper" This reverts commit 8fb93b7aa7c4d7e0320ac1e09c77da44c9b6c7d9. * revert to commit 8fb93b7 and change threshold from 0.5 to 0.1 * edit test_text assertion. remove all BULLETS_PATTERN. * Update ingest test fixtures (#1052) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * edit test case in test_xml_partition * update assertion on test_auto --------- Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MacBook-Pro.local> Co-authored-by: Klaijan Sinteppadon <klaijan@klaijans-mbp.mynetworksettings.com> Co-authored-by: Klaijan Sinteppadon <klaijan@Klaijans-MBP.fios-router.home> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-08-07 18:37:18 -04:00
ryannikolaidis	fac2da6117	fix: ingest test check num files in smoke test (#1051 )	2023-08-06 22:58:50 +00:00
ryannikolaidis	cd1df5e8e6	fix: remove default encoding for ingest (#1036 )	2023-08-05 16:57:45 +00:00
Christine Straub	b76d2ee745	feat: track emphasized text msword (#1048 ) * feat: add functionality to track emphasized text (`bold/italic` formatting) from paragraph * chore: add docstring * chore: fix lint errors * feat: ignore spaces when extracting emphasized texts from a paragraph * feat: add functionality to track emphasized text (`bold/italic` formatting) from table * test: add test case for grabbing emphasized texts from element metadata * chore: fix lint errors * chore: update changelog & version * Update ingest test fixtures (#1047)	2023-08-04 17:04:12 -04:00
Matt Robinson	f4ddf53590	feat: track emphasized text in `partition_html` (#1034 ) * Feat/965 track emphasized text html (#1021) * feat: add functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML * feat: add `include_tail_text` parameter to `_construct_text` * test: add test case for `_get_emphasized_texts_from_tag` * test: add `emphasized_texts` to metadata * chore: update changelog & version * fix tests * fix lint errors * chore: update changelog * chore: small comment updates * feat: update `XMLDocument._read_xml` to create `<p>` tag element for the text enclosed in the `<pre>` tag * chore: update changelog * Update ingest test fixtures (#1026) Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: Matt Robinson <mrobinson@unstructured.io> * ingest-test-fixtures-update * Update ingest test fixtures (#1035) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-08-03 16:24:25 +00:00
ryannikolaidis	70365ea42d	chore: add Dropbox secrets to CI environments (#1029 )	2023-08-03 02:18:29 +00:00
ryannikolaidis	719e15e7fe	chore: skip ingest test on missing Slack token (#1028 )	2023-08-02 21:16:37 +00:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986 ) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * contraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-comple * make tidy * make tidy	2023-08-01 11:31:13 -04:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
David Potter	1542607892	feat: adds Box connector (#996 )	2023-08-01 01:10:10 +00:00
Roman Isecke	28214a6cc3	Roman/ingest refactor (#978 ) * Pull out s3 code as subcommand * Pull out dropbox code as subcommand * Pull out azure code as subcommand * Pull out fsspec code as subcommand * Pull out github code as subcommand * Pull out gitlab code as subcommand * Pull out reddit code as subcommand * Pull out slack code as subcommand * Pull out discord code as subcommand * Pull out wikipedia code as subcommand * Pull out gdrive code as subcommand * Pull out biomed code as subcommand * rename parameters * Pull out onedrive code as subcommand * Pull out outlook code as subcommand * Pull out local code as subcommand * Pull out elasticsearch code as subcommand * Pull out confluence code as subcommand * Drop previous main file * update changelog * Add back in mp.Pool * Fix mypy issues with click * Make sure all tests run with verbose flag * refactor approach to dynamically add common options to each subcommand, scrub logging of options for sensitive data * Pull out some more shared options * Support running code via python as well as cli * update ingest readme and move it to the ingest folder * update usage in connector docs * move local command arg in test * Seperate out cli code from logic running unstructured * Make some cli fields required rather than optional * rename process -> processor * Improve logger to avoid duplicate handlers --------- Co-authored-by: Ryan Nikolaidis <1208590+ryannikolaidis@users.noreply.github.com>	2023-07-31 13:20:10 -04:00
Matt Robinson	d9aed66b65	feat: add document date for remaining file types (#930 ) (#969 ) * feat: add document date for remaining file types (#930) * feat: add functions for getting modification date * feat: add date field to metadata from csv file * feat: add tests for csv patition * feat: add date field to metadata from html file * feat: add tests for html partition * fix: return file name onlyif possible * feat: add csv tests * fix: renaming * feat: add filed metadata_date as date of last mod * feat: add tests for partition_docx * feat: add filed metadata_date to .doc file * feat: add tests for partition_doc * feat: add metadata_date to .epub file * feat: add tests for partition_epub * fix: fix test mocking * feat: add metadata_date for image partition * feat: add test for image partition * feat: add coorrdinate system argument * feat: add date to element metadata * feat: add metadata_date for JSON partition * feat: add test for JSON partition * fix: rename variable * feat: add metadata_date for md partition * feat: add test for md partition * feat: update doc string * feat: add metadata_date for .odt partition * feat: update .odt string * feat: add metadata_date for .org partition * feat: add tests for .org partition * feat: add metadata_date for .pdf partition * feat: add tests for .pdf partition * feat: add metadata_date for .pptx partition * feat: add metadata_date for .ppt partition * feat: add tests for .ppt partition * feat: add tests for .pptx partition * feat: add metadata_date for .rst partition * feat: add tests for .rst partition * fix: get modification date after file checking * feat: add tests for .rtf partition * feat: add tests for .rtf partition * feat: add metadata_date for .txt partition * fix: rename argument * feat: add tests for .txt partition * feat: update doc string rst patrition function * feat: add metadata_date for .tsv partition * feat: add tests for .tsv partition * feat: add metadata_date for .xlsx partition * feat: add tests for .xlsx partition * fix: clean up * feat: add tests for .xml partition * feat: add tests for .xml partition * fix: use `or ` instead of `if` * fix: fix epub tests * fix: remove not used code * fix: add try block for getting file name * fix: applying linter changes * fix: fix test_partition_file * feat: add metadata_date for email * feat: add test for email partition * feat: add metadata_date for msg * feat: add tests for msg partition * feat: update CHANGELOG file * fix: update partitions doc string * don't push * fix: clean up code * linting, linting, linting * remove unnecessary example doc * update version and changelog * ingest-test-fixtures-update * set metadata date in test --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> * ingest-test-fixtures-update * Update ingest test fixtures (#970) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * Revert "Update ingest test fixtures (#970)" This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2. * remove date from metadata in outputs * update docstring ordering * remove print * remove print * remove print * linting, linting, linting * fix version and test * fix changelog * fix changelog * update version --------- Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-07-26 15:10:14 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
Matt Robinson	6e852cbe70	feat: track links from anchor tags in `partition_html` (#959 ) * track tags in html * pass through links as metadata * add test for grabbing links * one more link * changelog and version * update docs * fix tests * update empty link assertion * ingest-test-fixtures-update * Update ingest test fixtures (#961)	2023-07-24 18:28:56 +00:00
Jason Scheirer	196efa09b1	chore: Add encoding param to ingest (#955 ) * Add encoding param to ingest	2023-07-24 10:06:13 -07:00
Ahmet Melek	b7674fb97e	feat: confluence connector (cloud) (#906 ) * Add confluence connector and an example script * add test script, add dependency installations * add authentication secret variables for ci tests and actions * add dependency installation commands for workflows * add dependency installation commands for workflows * Update ingest test fixtures (#907) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add add ingest test fixtures update workflow for python 3.10, update example script with dummy values * change workflow name to avoid confusion * change workflow name to avoid confusion * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update * only leave 3.8 in ingest test matrix to test consistent partitioning among python versions * Update ingest test fixtures (#911) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * revert back the test python version matrix * recompile dependencies * modifications for shellcheck * update changelog and version * changelog and version * remove comments * Update ingest test fixtures (#915) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * add the option to state the number of spaces to be fetched * add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users * add help message * add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test * change test names * rename connector arg Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * change arg name for connector Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> * add comment to example * change arg names * add new tests to ingest test * shellcheck remove redundant statement * Update ingest test fixtures (#932) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * Update ingest test fixtures (#936) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * linting * change file extensions to parse as html * Update ingest test fixtures (#943) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * remove old fixtures * update version to 0.8.2-dev3 * change file to trigger CI * change file to trigger CI * change file to trigger CI * change file to trigger CI --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-18 19:29:41 +01:00
rvztz	ce20c3f2bc	feat: add OneDrive connector (#834 )	2023-07-13 20:57:54 +00:00
qued	79f734d3f9	fix: better extractable check (#900 ) auto strategy was choosing the fast strategy in cases where the pdf contents were just a flat image, resulting in no output. This PR changes the behavior of auto so that elements that can be extracted by fast are extracted, a cursory examination of the elements is made to see if there are elements with text present, and if so then these elements are used as the output. Otherwise fallback strategies come into play.	2023-07-07 23:41:37 -05:00
Ahmet Melek	4b827f0793	fix: local connector output filename when a single file is being processed (#879 ) * fix string processing error for _output_filename * Add docstring and type hint, update CHANGELOG, update version * update test fixture * simple code change commit to retrigger ci checks * update test fixture - after brew install tesseract-lang * Update ingest test fixtures (#882) Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com> * correct CHANGELOG * correct CHANGELOG --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>	2023-07-05 14:37:40 -07:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
Ahmet Melek	5ea216cf07	feat: elasticsearch connector (#817 )	2023-07-01 17:45:28 +00:00
David Potter	bec733cdf8	feat: add Dropbox connector (#844 )	2023-06-30 17:08:27 -07:00
qued	350bb1dad5	enhancement: clean pdf elements (bump unstructured-inference) (#790 ) More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4) Make large model available (from unstructured-inference bump to 0.5.3) Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2) --------- Co-authored-by: Roman Isecke <roman@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructured.io>	2023-06-29 18:35:06 -07:00
ryannikolaidis	62e20442df	chore: refactor ingest tests (#814 ) - Adds reusable validation scripts (check-x.sh) to minimize repeated (or near-repeated) code and create one source of truth - Restructures the location of download and output folders such that they are nested in the test_unstructured_ingest directory - Adds gitignore for output folders / files to avoid them accidentally getting checked into the repository - Construct paths as reusable variables declared at top of scripts - Sort order of flag for ingest calls, across all tests (this makes it easier to parse at a glance) - OVERWRITE_FIXTURES removes all old fixtures for path to guarantee no stale results are left behind - Bonus: don't check/exit on expected number of expected outputs when OVERWRITE_FIXTURES is true - Bonus: exclude file_directory from Slack and Discord test scripts (match convention in all others)	2023-06-29 23:13:41 +00:00
ryannikolaidis	8ea5f6939e	fix: parameterized ingest test overwriting (#838 ) * sets OVERWRITE_FIXTURES to default to false in test-ingest-local-single-file.sh * fixes incorrect expected results * update expected results to properly parse Korean text * bonus: installs language pack for Korean in CI and ingest fixture workflows	2023-06-29 18:37:09 +00:00
ryannikolaidis	60fe231f08	fix: use api key where needed in tests (#843 ) * passes api key for unstructured-api to unit and ingest tests as needed. * adds check for env var CI to otherwise skip tests that require an api key	2023-06-29 17:31:01 +00:00
Roman Isecke	9882c2b83f	Avoid setting metadata in constructor signature for elements (#837 ) Avoid setting metadata in constructor signature for elements because that can lead to unexpected object reuse (and modification). Bonus refactor for PageBreak to have text values of "". --------- Co-authored-by: Alan Bertl <alan@unstructured.io> Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-29 03:14:05 +00:00
Matt Robinson	44411ecc59	enhancement: `max_partition` kwarg for limiting element size (#818 ) * add max partition size logic * work splitting logic into split_by_paragraph * pass through max_partition to other functions * added test for splitting long document * add type hint * add documentation * version and changelog * ingest-test-fixtures-update * Update ingest test fixtures (#819) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * retrigger ci * ingest-test-fixtures-update * ingest-test-fixtures-update * Update ingest test fixtures (#821) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * update default for partition_xml * update version for release * update msg doc string --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 15:26:01 -04:00
Matt Robinson	38457777fa	fix: ignore escaped commas in CSV checks (#832 ) * fix file content checking bug * skip counting commas in quotes for csv detection * add test for comma count * change file content grab to -1 * version and changelog * add csv to extension check * add file to tests * ingest-test-fixtures-update * Update ingest test fixtures (#833) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * fix typo * fix changelog wording --------- Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-28 17:22:23 +00:00
kravetsmic	58e988e110	feature(html partition): parse pre tag (#642 ) * feature(html partition): parse pre tag * chore: update CHANGELOG.md * style: black format xml.py * Added tests dor html with pre tag * remove skip test, update parse pre tag * fix style * chore: spell check * chore: update changelog & version * chore: update ingest test fixtures * chore: add exception handling if `element.text` is `None` in `_read_xml` * test: add more sanity testing on the `.text` content of the element(s) * refactor: move the conditional logic for <pre> outside of the `try/except` block --------- Co-authored-by: cragwolfe <crag@unstructured.io> Co-authored-by: christinestraub <christinemstraub@gmail.com>	2023-06-27 18:52:39 +00:00
ryannikolaidis	a5c7e5b41e	chore: DRY ingest connectors (#769 )	2023-06-26 20:12:05 +00:00
Emily Chen	a8a19ceba0	chore: Add --ocr-languages parameter to unstructured ingest (#793 )	2023-06-23 12:38:33 -07:00
Matt Robinson	8683e2695c	fix: enable `partition_pdf` to recursively grab text with fast strategy (#796 ) * initial pass on text in figures * refactor text extraction * update tests * fix title test * add test for docs that require recursive text grab * version and changelog * ingest-test-fixtures-update * there are 8 pdf files now	2023-06-22 11:19:54 -04:00
David Potter	3b472cb7df	feat: add google cloud storage connector (#746 )	2023-06-21 15:14:50 -07:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
ryannikolaidis	4faa27ffe7	test: add google drive ingest test (#764 )	2023-06-16 16:28:24 +00:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
kravetsmic	7fd7d7afae	feat(biomed connector): added additional params (#468 ) (#623 ) Unstructured-ingest biomed connector: Adds max retries, max request time with backoff and decay. --------- Co-authored-by: Crag Wolfe <crag@unstructuredai.io>	2023-06-15 01:57:45 -07:00

1 2 3 4 5

201 Commits