Summary:
- bump unstructured-inference to `0.7.33`
- cut a release for `0.14.2`
- add some dependencies that previously came in through the layoutparser extras.
### Summary
Switches to installing `libreoffice` from the Wolfi repository and
upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a
medium-severity vulnerability in the old `libreoffice` version. Security scanning
with `anchore/grype` was also added to the `test_dockerfile` job.
Requirements were bumped to resolve a vulnerability in the `requests`
library.
### Testing
`test_dockerfile` passes with the updates.
### Summary
Closes #2959. Updates the dependencies and CI to add support for Python
3.12.
The MongoDB ingest tests were disabled because jobs like [this
one](https://github.com/Unstructured-IO/unstructured/actions/runs/9133383127/job/25116767333)
were failing due to issues with the `bson` package. `bson` is a dependency
of the AstraDB connector, but `pymongo` does not work when `bson` is
installed from `pip`. This issue is documented by MongoDB
[here](https://pymongo.readthedocs.io/en/stable/installation.html). Spun
off #3049 to resolve this. The issue seems unrelated to Python 3.12, though
it's unclear why it didn't surface previously.
Disables the `argilla` tests because `argilla` does not yet support
Python 3.12. We can add the `argilla` tests back once the PR
referenced below is merged. You can still use the `stage_for_argilla`
function if you're on `python<3.12` and install `argilla` yourself (see the sketch below).
- https://github.com/argilla-io/argilla/pull/4837
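A minimal usage sketch for reference, assuming `argilla` is installed separately and that the `"text_classification"` task name matches the staging docs:
```
# Hedged sketch for python<3.12 with argilla installed manually; the task
# name and example file are assumptions, not part of this PR.
from unstructured.partition.text import partition_text
from unstructured.staging.argilla import stage_for_argilla

elements = partition_text(filename="example-docs/fake-text.txt")
dataset = stage_for_argilla(elements, "text_classification")
```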
---------
Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
**Summary**
`unstructured` will use the table features added in the most recent version
of `python-docx`.
Also update the `lxml` version constraint, because `lxml>4.9.2` will not
install on Apple Silicon
(https://github.com/Unstructured-IO/unstructured/issues/1707).
`python-docx` requires `lxml`, although other file-format handlers require it as well.
Cut a release.
Run pip-compile on a Mac to avoid `nvidia-*` requirements creeping into
`requirements/extra-pdf-image.txt`. This should fix arm64 image builds
that have been breaking on main.
### Description
* The `consistent-deps.sh` script was fixed to take the ingest
dependencies into account, which surfaced some errors. New constraints were added
to make that script pass.
* Updated all requirements without a constraint on pydantic, allowing the
latest version to be pulled in.
* `pikepdf` is causing a conflict, but there's a fix on their `main`
branch; we just need the next release to be published. Opened a
question here to see if we can get that out any sooner: [Do releases
happen on a
schedule?](https://github.com/pikepdf/pikepdf/discussions/574). For now,
added `lxml<5` to the constraints.
A couple of optimizations:
* `constraints.in` renamed to `constraints.txt`, since the whole point is
that all dependencies are already pinned and the file never gets compiled
* `constraints.txt` moved to a `requirements/deps` directory, as it
never gets compiled by `pip-compile`
* Other dependency files updated to reference the new locations of
`base.in` and `constraints.txt`
* Makefile updated, since it was originally written to skip the
`base.in` and `constraints.in` files
Closes #2577
Testing:
```
from unstructured.partition.html import partition_html

cnn_lite_url = "https://lite.cnn.com/"
elements = partition_html(url=cnn_lite_url)

links = []
for element in elements:
    if element.metadata.link_urls:
        relative_link = element.metadata.link_urls[0][1:]
        if relative_link.startswith("2024"):
            links.append(f"{cnn_lite_url}{relative_link}")

print(links)
```
---------
Co-authored-by: ron-unstructured <ronny@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
This PR:
- Moves ingest dependencies into local scopes so ingest connector classes
can be imported without installing the external dependencies they use.
This allows lightweight use of the classes (not the instances; to use the
instances as intended you'll still need the dependencies). See the sketch
after this list.
- Upgrades the embed module dependencies from `langchain` to the
`langchain-community` module (to pass CI rather than introducing a pin)
- Runs pip-compile
- Does minor refactors in other files to pass the `ruff 2.0` checks
introduced by the pip-compile
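A minimal sketch of the local-scope import pattern, assuming a hypothetical connector class (the real connector classes and fields differ):
```
# Hypothetical illustration: the external dependency is imported inside the
# method, so importing the module that defines the class needs no extras.
from dataclasses import dataclass


@dataclass
class ExampleIngestDoc:
    url: str

    def get_file(self) -> bytes:
        # Deferred import; only instances that are actually used need
        # `requests` installed.
        import requests

        return requests.get(self.url).content
```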
`.heic` files are an image filetype we have not previously supported.
#### Testing
```
from unstructured.partition.image import partition_image

png_filename = "example-docs/DA-1p.png"
heic_filename = "example-docs/DA-1p.heic"

png_elements = partition_image(png_filename, strategy="hi_res")
heic_elements = partition_image(heic_filename, strategy="hi_res")

for i in range(len(heic_elements)):
    print(heic_elements[i].text == png_elements[i].text)
```
---------
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Replacement for #2311, since Python 3.8 was dropped as a supported
version.
`unstructured-client` added `api_key_auth` as a param to
`UnstructuredClient` in [version
0.9.0](8c93115c92).
This pins the version of `unstructured-client` so users do not receive
`TypeError: UnstructuredClient.__init__() got an unexpected keyword
argument 'api_key_auth'`.
### Summary
This PR is the second part of the "image extraction" refactor to move it
from the `unstructured-inference` repo to the `unstructured` repo; the first
part was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/299. This
PR adds logic to support extracting images.
### Testing
`git clone -b refactor/remove_image_extraction_code --single-branch
https://github.com/Unstructured-IO/unstructured-inference.git && cd
unstructured-inference && pip install -e . && cd ../`
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="example-docs/embedded-images.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
)
print("\n\n".join([str(el) for el in elements]))
```
### Summary
This PR is the second part of the `pdfminer` refactor to move it from
the `unstructured-inference` repo to the `unstructured` repo; the first part
was done in
https://github.com/Unstructured-IO/unstructured-inference/pull/294. This
PR adds logic to merge the extracted layout with the inferred layout.
The updated workflow for the `hi_res` strategy:
* pass the document (as data/filename) to the `inference` repo to get
`inferred_layout` (DocumentLayout)
* pass the `inferred_layout` returned from the `inference` repo and the
document (as data/filename) to the `pdfminer_processing` module, which
first opens the document (creating a temp file/dir as needed) and splits
the document by pages
* if `is_image` is `True`, return the passed `inferred_layout` (DocumentLayout)
* if `is_image` is `False`:
  * get `extracted_layout` (TextRegions) from the passed document
(data/filename) via `pdfminer`
  * merge `extracted_layout` (TextRegions) with the passed
`inferred_layout` (DocumentLayout)
  * return the `inferred_layout` (DocumentLayout) with updated elements
(all merged LayoutElements) as `merged_layout` (DocumentLayout)
* pass `merged_layout` and the document (as data/filename) to the `OCR`
module, which first opens the document (creating a temp file/dir as needed)
and splits the document by pages (converting PDF pages to image pages for
PDF files)
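Condensed as a hedged sketch (all names below are illustrative stand-ins, not the actual `pdfminer_processing` internals):
```
# Sketch of the hi_res merge flow described above; merge logic is a stand-in.
from dataclasses import dataclass, field


@dataclass
class DocumentLayout:
    elements: list = field(default_factory=list)


def merge_extracted_into_inferred(inferred, extracted):
    # Stand-in for the real merge: fold extracted text regions into the
    # inferred layout's elements.
    inferred.elements.extend(extracted)
    return inferred


def hi_res_layout(document, is_image, infer, extract_text_regions):
    inferred = infer(document)  # DocumentLayout from the layout model
    if is_image:
        return inferred  # images skip pdfminer extraction
    extracted = extract_text_regions(document)  # TextRegions via pdfminer
    return merge_extracted_into_inferred(inferred, extracted)
```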
### Note
This PR also fixes issue #2164 by using functionality similar to that
implemented in the `fast` strategy workflow when extracting elements with
`pdfminer`.
### TODO
* image extraction refactor to move it from `unstructured-inference`
repo to `unstructured` repo
* improving natural reading order by applying the current default
`xycut` sorting to the elements extracted by `pdfminer`
### Summary
Closes #2033
Updates `partition_via_api` to use `UnstructuredClient` for API calls
instead of `requests`.
Updates the associated tests.
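A quick usage sketch (the file path and key are placeholders; other parameters of the updated function are omitted):
```
# Hedged sketch of calling the updated function; only common parameters shown.
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename="example-docs/layout-parser-paper.pdf",
    api_key="YOUR_API_KEY",
)
print(elements[0])
```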
Note: This PR does **not** update `partition_multiple_via_api` as
documentation in `unstructured-python-client` indicates it does not
support multiple files. A new issue should be opened to add that
functionality to `unstructured-python-client`.
---------
Co-authored-by: Klaijan <klaijan@unstructured.io>
Co-authored-by: Roman Isecke <136338424+rbiseck3@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Add a procedure to repair a PDF when the PDF structure is invalid for
`PDFminer` to process.
This PR handles two cases of `PSSyntaxError Invalid dictionary
construct: ...`:
* PDFminer opens the entire document and creates the pages generator on
`PDFPage.get_pages(fp)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655715023/?alert_rule_id=14681339&alert_type=issue&notification_uuid=d8db4cf4-686f-4504-8a22-74a79a8e966f&project=4505909127086080&referrer=slack)
* PDFminer's interpreter processes a single page on
`interpreter.process_page(page)`: [sentry log
example](https://unstructuredio.sentry.io/issues/4655898781/?referrer=slack&notification_uuid=0d929d48-f490-4db8-8dad-5d431c8460bc&alert_rule_id=14681339&alert_type=issue)
**Additional tech details:**
* Add new dependency `pikepdf` in `requirements/extra-pdf-image.in`,
which is used for repairing PDFs.
* Add new dependency `pypdf` in `requirements/extra-pdf-image.in`,
which is used to find the error page in the document by reading the
PDF file again (couldn't find a way to split a PDF within PDFminer).
* Refactor the `is null` check for `get_uris_from_annots`: the
root cause is that `get_uris` passed a `None` `annots` to
`get_uris_from_annots`, so the null check should happen in `get_uris`.
* Add more type protection in `get_uris_from_annots` when using any
`PDFObjRef.resolve()` as a `dict` (it could still be a `PDFObjRef`). This
should fix:
  * https://github.com/Unstructured-IO/unstructured/issues/1922, where
`annotation_dict` is a `PDFObjRef`
  * https://github.com/Unstructured-IO/unstructured/issues/1921, where
`rect` is a `PDFObjRef`
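The repair idea from the first bullet, as a minimal sketch (hypothetical file names; the actual integration wires this into the partitioning flow):
```
# Re-saving a PDF with pikepdf rewrites its structure, which can fix the
# invalid dictionary constructs that make PDFminer raise PSSyntaxError.
import pikepdf

with pikepdf.open("broken.pdf") as pdf:
    pdf.save("repaired.pdf")
```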
### Test
Added three test files (all larger than 500 KB) for unit tests covering:
* Repair of the entire doc
* Repair of one page
* Reprocess failure after repairing one page (just return the elements
before the error page in this case)
* Also, it seems like splitting the document into smaller pages could fix
this problem, but I'm not sure why. For example, I saw an error from
reprocessing the whole
[cancer.pdf](https://github.com/Unstructured-IO/unstructured/files/13461616/cancer.pdf)
doc, but no error when I split the pdf at the error page.
* I tested whether I could repair the entire doc again in this case and
saw a different error, which suggests repairing is not helping here.
* PDFminer can process the whole doc after pikepdf repaired only the
entire doc in the first place, but we can't repair by pages in this way.
---------
Co-authored-by: cragwolfe <crag@unstructured.io>
Summary:
Close: https://github.com/Unstructured-IO/unstructured/issues/1920
* stop passing an empty string from `languages` to tesseract, which would
result in passing an empty string to the `-l` language config for the
tesseract CLI
* also stop passing duplicate language codes from `languages` to
tesseract OCR
* if we fail to convert any ISO languages from the `languages`
parameter, proceed with OCR using `eng` as the default
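A minimal sketch of those rules (hypothetical helper name; the real implementation in `unstructured` differs):
```
# Hypothetical helper: drop empty strings, de-duplicate codes, and fall
# back to "eng" when nothing valid remains.
def prepare_tesseract_languages(languages):
    seen = []
    for lang in languages:
        code = lang.strip()
        if code and code not in seen:
            seen.append(code)
    # Tesseract's -l flag takes "+"-joined codes, e.g. "eng+deu".
    return "+".join(seen) if seen else "eng"


print(prepare_tesseract_languages(["", "eng", "eng", "deu"]))  # eng+deu
```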
### Test
* First confirm the tesseract error `Estimating resolution as X` before
this change:
  * on the `unstructured-api` repo with the main branch, run `make
run-web-app`
  * curl to test the error from an empty string, or just any wrong input
like `-F 'languages="eng,de"'`:
```
curl -X 'POST' 'http://0.0.0.0:8000/general/v0/general' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'files=@sample-docs/layout-parser-paper-with-table.jpg' \
-F 'languages=""' \
-F 'strategy=hi_res' \
-F 'pdf_infer_table_structure=True' \
| jq -C . | less -R
```
* after this change:
  * in your unstructured API env, cd to the unstructured repo and install it locally with `pip install -e .`
  * check out this branch
  * run `make run-web-app` again in the api repo
  * the curl command returns output and you can see the warning in the log
---------
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
Summary: Added support for AWS Bedrock embeddings. Leverages
"amazon.titan-tg1-large" for the embedding model.
Test
- find your AWS secret access key and key ID; make sure the account has
access to Bedrock's Titan embed model
- follow the instructions in
d5e797cd44/docs/source/bricks/embedding.rst (bedrockembeddingencoder)
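A hedged usage sketch, assuming the constructor parameters follow the embedding docs referenced above (names may differ across versions):
```
# Assumed import path and parameters; verify against the embedding docs.
from unstructured.embed.bedrock import BedrockEmbeddingEncoder
from unstructured.partition.text import partition_text

elements = partition_text(filename="example-docs/fake-text.txt")
encoder = BedrockEmbeddingEncoder(
    aws_access_key_id="YOUR_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_KEY",
    region_name="us-west-2",
)
elements_with_embeddings = encoder.embed_documents(elements)
```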
---------
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
### Description
As we add more and more steps to the pipeline (i.e. chunking, embedding,
table manipulation), it helps to separate the responsibility of each
of these into their own processes, running each in parallel and using json
files to share data across steps. This will also help guarantee data is
serializable if this code is used in an actual pipeline. Following is a
flow diagram of the proposed changes. As part of this change:
* A parent pipeline class will be responsible for running each `node`,
which can optionally be run via multiprocessing if it supports it.
Possible nodes at this moment:
  * Doc factory: creates all the ingest docs via the source connector
  * Source: reads/downloads all of the content to process to the local
filesystem, to the location set by the `download_dir` parameter
  * Partition: runs partition on all of the downloaded content, in json
format
  * Any number of reformat nodes that modify the partitioned content. This
can include chunking, embedding, etc.
  * Write: pushes the final json into the destination via the destination
connector
* This pipeline relies on the information of the ingest docs being
available via their serialization. An optimization was introduced with
the `IngestDocJsonMixin`, which adds all the `@property` fields to the
serialized json already being created via the `DataClassJsonMixin`.
* For all intermediate steps (partitioning, reformatting), the content
is saved to a dedicated location on the local filesystem. Right now it's
set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`.
* Minor changes: it made sense to move some of the config parameters
between the read and partition configs when I explicitly divided the
responsibility to download vs partition the content in the pipeline.
* The pipeline class only makes the doc factory, source, and partition
nodes required, in keeping with the logic that has been supported so far.
All reformatting nodes and the write node are optional.
* Long term, there should also be some changes to the base configs
supported by the CLI to support pipeline-specific configs, but for now
what exists was used to minimize changes in this PR.
* A final step copies the final output to the location designated by the
`_output_filename` value of the ingest doc.
* Hashing occurs at each step by hashing the parameters of that step
(i.e. partition configs) along with those of the previous step, via the
filename used. This allows each step to produce the same result _if_ all
the parameters for it have not changed and the content so far is the same.
* The only data that is shared and written to across processes is the
dictionary of ingest json data. This dict is created using the
`multiprocessing.manager.DictProxy` to make sure any interaction with it
is behind a lock (see the sketch after this list).
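A minimal sketch of that shared-dict pattern (the key and value here are placeholders):
```
# Manager-backed dict: reads and writes from worker processes go through a
# proxy whose operations are synchronized by the manager.
import multiprocessing


def mark_done(shared, key):
    shared[key] = {"status": "partitioned"}


if __name__ == "__main__":
    manager = multiprocessing.Manager()
    ingest_data = manager.dict()  # a DictProxy
    p = multiprocessing.Process(target=mark_done, args=(ingest_data, "doc-1"))
    p.start()
    p.join()
    print(dict(ingest_data))
```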
### Minor refactors included:
* Utility methods added to extract configs from the click options
* Utility method to add common options to click commands
* All writers moved to using the class approach, which extracts a lot of
the common code so there's less copy-paste when new runners are added
* Use `@property` for source metadata on the base ingest doc, to add logic
that calls `update_source_metadata` if it's still `None` at the time it's
fetched (see the sketch after this list)
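A sketch of that lazy-property pattern, with illustrative names only:
```
# Illustrative only; the real base ingest doc has different fields.
class IngestDocSketch:
    def __init__(self):
        self._source_metadata = None

    @property
    def source_metadata(self):
        # Populate lazily on first access, as described above.
        if self._source_metadata is None:
            self.update_source_metadata()
        return self._source_metadata

    def update_source_metadata(self):
        self._source_metadata = {"date_created": None, "exists": True}
```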
### Additional bug fixes included
* Fsspec connectors were not serializable due to the `ingest_doc_cls`.
This was removed from the fields captured by the `@dataclass` decorator
and added in a `__post_init__` method.
* Various reddit connector params were missing. This doesn't have an
explicit ingest test at the moment, so it was never caught.
* The fsspec connector had the parent `update_source_metadata` misnamed as
`update_source_metadata_metadata`, so it was never being called.
### Flow Diagram

### Description
This PR is two-fold:
**Embeddings:**
* Embeddings are incorporated into the sharepoint source connector, which
will now call out to OpenAI and create embeddings if the flag is passed
in and the api key is provided.
**Writing vector content (embeddings) to the Azure cognitive search index:**
* The schema for the index expected to exist in Azure has been updated
to include the vector field type, and a test script has been added to
exercise the new content produced by the Sharepoint connector and
push the embedding content.
Some important notes about other changes in here:
* The embedding code had to be updated to patch the `to_dict` method on
elements so that `embeddings` is added to the dict output when present.
While the code originally added the embedding content, it was lost when
`to_dict` was called to save the content as json.
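A sketch of the `to_dict` patch idea with hypothetical names (the actual patching in the connector code differs):
```
# Illustrative monkey-patch: carry embeddings into the serialized dict.
def patch_to_dict(element):
    original_to_dict = element.to_dict

    def to_dict_with_embeddings():
        data = original_to_dict()
        # Without this, embeddings set on the element are dropped when
        # the element is serialized to json.
        if getattr(element, "embeddings", None) is not None:
            data["embeddings"] = element.embeddings
        return data

    element.to_dict = to_dict_with_embeddings
    return element
```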
Addresses
[#1332](https://github.com/Unstructured-IO/unstructured/issues/1332)
with `unstructured-inference` PR
[#208](https://github.com/Unstructured-IO/unstructured-inference/pull/208).
### Summary
- Add `image_path` to element metadata
- Pass parameters related to extracting images in PDF
- Preserve image elements ignored due to garbage text if
`el.metadata.image_path` is `True`
### Testing
```
from unstructured.partition.pdf import partition_pdf

f_path = "example-docs/embedded-images.pdf"

# default image output directory
elements = partition_pdf(
    f_path,
    strategy="hi_res",
    extract_images_in_pdf=True,
)

# specific image output directory
elements = partition_pdf(
    f_path,
    strategy="hi_res",
    extract_images_in_pdf=True,
    image_output_dir_path=<directory path>,
)
```
If a layout model is used from unstructured-inference, you get back
class probabilities in the element metadata from partition.
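A quick sketch of inspecting those probabilities (the `detection_class_prob` metadata field name is an assumption):
```
# Hedged sketch; the metadata field name is assumed, hence the getattr.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("example-docs/layout-parser-paper-fast.pdf", strategy="hi_res")
for el in elements:
    print(el.category, getattr(el.metadata, "detection_class_prob", None))
```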
`extra-pdf-image.in` in requirements already has the newest version of
unstructured-inference in it, without a pinned version. Is there anywhere
else the unstructured-inference version needs to be updated to the
required release version, 0.5.22?
* pip-compile in order to bump unstructured-inference
* Set the default `ocr_mode` back to `entire_page` now that [this
error](https://github.com/Unstructured-IO/unstructured-inference/pull/183)
is addressed
* Explicitly add `sphinx-tabs` to `build.in`. This file provides
`docs/requirements.txt`.
* Remove a pinned `pydantic` version
* Fix a Makefile command to `pip-compile` a missing ingest file.
Documentation Overhaul
- Added documentation hierarchy
- Added options for Bash vs Python for API & Upstream Connectors
- Added Introduction section (Overview, Key Concepts, Getting Started)
- Redid connectors section
- Installation is now broken up (needs further work)
* split dependencies by document type
* make pip-compile with new requirements
* add extra requirements to setup.py
* add in all docs; re pip-compile
* extra for all docs
* add pandas to xlsx
* dependency requires for tsv and csv
* handling for doc, docx and odt
* dependency check for pypandoc
* required dependencies for pandoc files
* xml and html
* markdown
* msg
* add in pdf
* add in pptx
* add in excel
* add lxml as base req
* extra all docs for local inference
* local inference installs all
* pin pillow version
* fixes for plain text tests
* fixes for doc
* update make commands
* changelog and version
* add xlrd
* update pip-compile
* pin numpy for python 3.8 support
* more constraints
* constraint on scipy
* update install docs
* constrain ipython
* add outlook to pip-compile
* more ipython constraints
* add extras to dockerfile
* pin office365 client
* few doc tweaks
* types as strings
* last pip-compile
* re pip-compile
* make tidy
* make tidy
* remove default strategy
* working on test
* fixed test, coordinates param needed to be included
* nits
* update changelog
* lint
* update requirements
* remove argilla; bump reqs
* enable py 3.11
* add 3.11 to setup.py
* make pip-compile
* ignore cli mypy errors
* install argilla
* fix constraints
* install argilla
* changelog and version
* skip argilla in docker
* don't import argilla in docker
* skip all of argilla if in container
* only import argilla if outside docker
* more docker skips
* remove weird pypi settings
`tabulate` is used by functions that extract tables from Microsoft documents, but nothing explicitly requires the library. This was not caught by tests because, for some reason, `tabulate` is in base.txt.
This PR adds the dependency to base.in (which also puts it in setup.py) and recompiles the dependencies.
Addresses #631.
* Uses constraints to keep dependency versions more consistent.
* Moves all dependencies to .in files, which are then ingested by setup.py.
* Adds a script to check the consistency of all extras.
* Adds the consistency check to CI.
I should note that while it shouldn't be possible to cause a conflict between base.txt and any of the extras (because base.txt constrains all the extras), it is possible to get a conflict between two of the extras files. There are ways to try to avoid that (like constraining each file by all the files that have already been processed before it, in the order given in the make pip-compile target), but the ones I could think of seemed a little overwrought and come with problems of their own. If a conflict arises, it should be flagged by CI, or locally with make check-deps. When/if that happens, you can resolve the conflict by adding appropriate global constraints in requirements/constraints.txt.
Also note that if fileA.in is constrained by fileB.txt, then fileB.in should be compiled before fileA.in in the make pip-compile target. Otherwise fileA.in will be compiled with the old version of fileB.txt, which can cause conflicts or keep dependencies from being updated properly.