unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-13 08:57:34 +00:00

Author	SHA1	Message	Date
Tracy Shen	d82a34519e	[Merge request] bug fix on table structure metric (#3089 ) Summary This fix provides better logic for matched_idx when calculating the table structure metric, giving a more accurate calculation of the acc Additional Context - this fix has passed CI run in Draft PR #3025 initially - therefore, this time we would like to merge into main branch - this commit has merged the latest change from main after the Draft PR	2024-06-10 15:14:32 +00:00
Duda Nogueira	657a949a00	chore: Weaviate pyv4 example (#3151 ) Update Unstructured example for Weaviate, now using latest python v4 client. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-06-10 10:08:46 -04:00
Steve Canny	a66661a7bf	rfctr(html): drop now dead XMLDocument and Document (#3165 ) Summary `HTMLDocument` is the class handling the core of HTML parsing. This is critical code because 8 of the 20 file-type partitioners end up using this code (`partition_html()` + 7 brokering partitioners like EPUB, MD, and RST). For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in turn subclassed `Document`, both of which are no longer relevant and unnecessarily complicate reasoning about `HTMLDocument` behavior. Remove that inheritance and dependency and drop both `XMLDocument` and `Document` modules which become dead code after no longer being used by `HTMLDocument`.	2024-06-08 07:36:18 +00:00
Matt Robinson	b4876f1b18	build: 0.14.5 release (#3164 ) ### Summary Update changelog and version for `0.14.5` release. 0.14.5	2024-06-07 17:20:30 +00:00
Roman Isecke	0fe0f15f30	feat: migrate weaviate connector to new framework (#3160 ) ### Description Add weaviate output connector to those supported in the new v2 ingest framework. Some fixes were needed to the upload stager step as this was the first connector moved over that leverages this part of the pipeline.	2024-06-06 23:18:55 +00:00
Steve Canny	a883fc9df2	rfctr(html): improve SNR in HTMLDocument (#3162 ) Summary Remove dead code and organize helpers of HTMLDocument in preparation for improvements and bug-fixes to follow	2024-06-06 21:21:33 +00:00
Steve Canny	8378ddaa3b	rfctr(html): organize and improve HTMLDocument tests (#3161 ) Summary In preparation for further work on HTMLDocument, organize the organic growth in `documents/tests_html.py` and improve typing and expression. Reviewers: Commits are groomed and review is probably eased by going commit-by-commit	2024-06-06 18:16:02 +00:00
Steve Canny	f1cab248ce	rfctr(msg): remove temporary new_msg.py (#3157 ) Summary Remove temporary `new_msg.py` module. Additional Context The rewrite of `partition_msg()` was placed in a separate file `new_msg.py` to avoid a messy diff for code-review. This PR makes that `new_msg.py` the new `msg.py`. No code changes were made in the process.	2024-06-06 08:31:56 +00:00
Steve Canny	ddbe90f6bb	rfctr(html): clean html tests in prep for PRs to follow (#3156 ) Summary Clean `tests_unstructured/partition/test_html.py` in preparation for broader refactor of HTML partitioner to follow. That refactor will address a cluster of bugs. Temporarily remove blank lines in tests so reordering tests in following commit is easier to follow. Those will go back in after that.	2024-06-05 23:11:58 +00:00
Steve Canny	e4158deaff	fix(msg): use python-oxmsg for MSG email parsing (#3142 ) Summary `partition_msg()` previously used the `msg_parser` library for parsing Outlook MSG email files (.msg files). The `msg_parser` library is unmaintained and has several major shortcomings such as not being able to parse MSG files with 8-bit encoded strings and not reliably extracting attachments. Use the new and permissively licenced `python-oxmsg` library instead. Additional Context For reviewability purposes, this PR temporarily places the new `partition_msg()` implementation in `new_msg.py` and references that implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py` in a closely following PR. This avoids a very messy interleaving of hunks in a diff between the old and re-written `partition_msg()` implementation. Fixes #2481 Fixes #3006	2024-06-05 21:12:27 +00:00
Roman Isecke	b777864296	feat: Migrate over fsspec connectors (#3066 ) ### Description Move over all fsspec connectors to the new framework. A variety of bugs were found and fixed in this PR as well: * custom json mixin being used for the enhanced dataclass would break if typing was quoted. That was fixed. A check was also added to the enhanced dataclass to prevent `InitVar` from being used in the root dataclass since this breaks serialization. * hashing for partitioner was using the filename of the raw file being partitioned rather than the file name of the file data generated from indexing. This means that multiple files could result in the same partition hash when the recursive flag is passed in. This was updated to use the file data file name instead. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-05 19:12:06 +00:00
Matt Robinson	0e16bf4bf0	enhancement: apply tar filters when using python 3.12 or above (#3124 ) ### Summary Applies tar filters when using Python 3.12 or above. This was added to the [Python `tarfile` library in 3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters) and guards against malicious content being extracted from `.tar.gz` files. ### Testing Added smoke test. If this passes for all Python versions, we're good.	2024-06-05 18:28:59 +00:00
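The tar-filter behaviour described in the commit above can be sketched with the standard library alone. This is a minimal illustration, not the code from the PR: the helper names `safe_extract`, `build_demo_archive`, and `demo` are hypothetical, and the `hasattr(tarfile, "data_filter")` feature check is an assumption about how one might detect filter support (the commit itself only says the filter is applied on Python 3.12 or above).

```python
import io
import tarfile
import tempfile
from pathlib import Path


def safe_extract(archive: tarfile.TarFile, dest: str) -> None:
    """Extract `archive` into `dest`, applying the "data" filter when available."""
    if hasattr(tarfile, "data_filter"):
        # Filter support is present: the "data" filter rejects absolute paths,
        # path traversal outside `dest`, device files, and similar members.
        archive.extractall(dest, filter="data")
    else:
        # Interpreters without extraction filters fall back to plain extraction.
        archive.extractall(dest)


def build_demo_archive() -> bytes:
    """Build a tiny in-memory .tar.gz holding a single text member."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="docs/hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def demo() -> str:
    """Round-trip the demo archive through safe_extract and read the file back."""
    with tempfile.TemporaryDirectory() as tmp:
        with tarfile.open(fileobj=io.BytesIO(build_demo_archive()), mode="r:gz") as tar:
            safe_extract(tar, tmp)
        return (Path(tmp) / "docs" / "hello.txt").read_text()
```

Checking for the feature rather than the version number keeps the sketch working on older interpreters that received the filter as a security backport.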
Yao You	fdb27378cb	chore: use python3 consistently in makefile (#3152 ) This PR changes two `python` commands in `Makefile` to use `python3` to be consistent with other make commands. This makes it more explicit on which python to use when the makefile is used outside of a controlled virtualenv where only one python exists.	2024-06-05 00:05:57 +00:00
Matt Robinson	5203390a4a	build(deps): weekly pip version bump (#3147 ) ### Summary Weekly PR to bump dependency versions.	2024-06-04 20:47:04 +00:00
Christine Straub	1dede5029d	fix: parsing pdf error - new_cells as str has no "copy" (#3130 ) Closes #3119. ### Testing Parsing the provided PDF should be successful. [testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf) ``` filename = "testing_brochure_2.pdf" with open(filename, "rb") as pdf_content: elements = partition_pdf( file=pdf_content, infer_table_structure=True, extract_image_block_types=["Image", "Table"], chunking_strategy="by_title", max_characters=1000, new_after_n_chars=3000, combine_text_under_n_chars=1000, ) print("\n\n".join([str(el) for el in elements])) ``` 0.14.4	2024-06-03 18:49:38 +00:00
Matt Robinson	1b43102762	fix: remove root handlers when they exist (#3128 ) ### Summary In some environments, such as Google Colab, loggers have a root handler that did not mask sensitive values. As a result, secrets such as API keys appeared in the logs. The PR removes root handlers when they exist to ensure sensitive values are handled properly. ### Testing Run the following in a Colab notebook. You should see two log outputs, one with the API key masked and one with it exposed. ``` !pip install unstructured ``` ```python import logging import json from unstructured.ingest.interfaces import ( ChunkingConfig, EmbeddingConfig, PartitionConfig, ProcessorConfig, ReadConfig, ) partition_config = PartitionConfig( partition_by_api=True, api_key="super secret", ) from unstructured.ingest.logger import ingest_log_streaming_init ingest_log_streaming_init(logging.INFO) logger = logging.getLogger("unstructured.ingest") logger.setLevel(logging.INFO) logger.info( f"Running partition node to extract content from json files. " f"Config: {partition_config.to_json()}, " ) ``` Now replace the first cell with the following and rerun the Python code. Only the masked logging output should remain. ``` !git clone https://github.com/Unstructured-IO/unstructured.git && cd unstructured && git checkout fix/rm-log-dupes && pip install -e . ```	2024-05-31 22:07:38 +00:00
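The idea in the commit above, detaching handlers from the root logger so that an unmasked copy of each record is not emitted, can be sketched with the standard `logging` module. This is an illustrative sketch only; the function name `remove_root_handlers` and its return value are assumptions, not the implementation in `unstructured.ingest.logger`.

```python
import logging


def remove_root_handlers() -> int:
    """Detach every handler from the root logger and return how many were removed."""
    root = logging.getLogger()
    # Copy the list first: removeHandler() mutates root.handlers while we iterate.
    handlers = list(root.handlers)
    for handler in handlers:
        root.removeHandler(handler)
    return len(handlers)
```

After this call, records from a named logger such as `unstructured.ingest` are only emitted by handlers attached to that logger (or its non-root ancestors), which is where a masking formatter would live.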
Matt Robinson	54c1e4e57f	ci: remove jira issue workflow (#3129 ) ### Summary Removes the workflow for creating Jira tickets.	2024-05-31 22:00:40 +00:00
Matt Robinson	6005abce79	feat: configure googlevisionapi (#3126 ) ### Summary Includes changes from #3117. Merged into a feature branch to run the full test suite. Original PR description: The Google Vision API allows for [configuration of the API endpoint](https://cloud.google.com/vision/docs/ocr#regionalization), to select if the data should be sent to the US or the EU. This PR adds an environment variable (`GOOGLEVISION_API_ENDPOINT`) to configure it. --------- Co-authored-by: JIAQIA <jqq1716@gmail.com> Co-authored-by: Dimitri Lozeve <dimitri@lozeve.com>	2024-05-31 18:41:04 +00:00
Yuming Long	4a96d54906	chore: move logger error to debug when pdfminer extract fails (#3028 ) ### Summary We are seeing the logger error `Invalid dictionary construct` for hosted APIs. Move this logger error to debug level - we still continue partitioning when pdfminer text extraction fails as before (we just don't log the error anymore) ### Test I was able to reproduce the logger error with an internal only file (please DM me if needed) and the error trace looks like ``` File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 709, in _process_pdfminer_pages annotation_list = get_uris(page.annots, height, coordinate_system, page_number) File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 1049, in get_uris resolved_annots = annots.resolve() ... ``` we also won't be able to repair pdf structure on `get_uris` (not a page level) so move this exception to debug level.	2024-05-31 17:58:36 +00:00
Matt Robinson	865ef496e6	ci: update `pinecone` test to use serverless (#3127 ) ### Summary Closes #3068. Updates the Pinecone connector tests to use serverless indexes, per the documentation [here](https://docs.pinecone.io/reference/api/control-plane/create_index). Also updates the CHANGELOG to mention serverless. Turns out we already supported it with the client version bump, but it hadn't been tested yet. ### Testing See [this CI job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127) that passed, running only the Pinecone test.	2024-05-31 15:24:41 +00:00
ryannikolaidis	1f8768750c	chore: add auth to s3 destination test (#3122 ) We should be validating the S3 Destination with authenticated requests, with credentials from a limited test user. ## Changes - Updates s3 destination test to point to a bucket that requires authentication. - Adds authentication to the s3 destination test request - Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination - Bonus: fix S3ConnectionConfig never registered for s3 V2 destination - Bonus: repair version and changelog version for consistency with -dev convention ## Testing Validated by changes to S3 destination ingest test	2024-05-31 07:05:09 +00:00
Matt Robinson	23e570fc8a	docs: cleanup readme; add python 3.12 (#3120 ) ### Summary Updates documentation references in the README to point to https://docs.unstructured.io and cleans up a few sections of the README. Specifically: - Removes an old API announcement - Removes the section mentioning Chipper as a beta feature. Chipper is only available through the SaaS API. Also adds a Python 3.12 tag to `setup.py` since we now support Python 3.12.	2024-05-30 16:22:54 +00:00
qued	293901e144	build: pin python-docx (#3110 ) Since we incorporate a newer feature from `python-docx` [here](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L521), we should make the version of `python-docx` that first supports that method an explicit requirement. I didn't pip recompile since our generated dependencies already have `python-docx==1.1.2`, but I can do that if someone thinks it's necessary.	2024-05-30 15:08:10 +00:00
Matt Robinson	9acf26ec2e	docs: explicitly replace all old pages with link to new docs (#3118 ) ### Summary Explicitly replaces all old docs pages with a link to the new docs. This was required because 404 redirects didn't work for pages that previously existed, though they worked for paths that never existed.	2024-05-30 13:01:33 +00:00
Matt Robinson	8415db5112	docs: make 404 pages same as index (#3114 ) ### Summary Makes a custom 404 page that's the same as `index.html`, so any path shows the URL for the new docs.	2024-05-30 07:46:38 -04:00
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Matt Robinson	2ecaf5e38c	fix: remove 404 from docs (#3112 ) ### Summary Removes 404 from the docs build to avoid rate limiting behavior.	2024-05-29 20:41:32 +00:00
ryannikolaidis	6b5d8a9785	fix: revert dropping of filename extension for some connectors (#3109 ) V2 refactor of ingest code introduces the removal of original file extensions. Since the upgrade of connectors is incomplete this means that some connectors will remove the original file extension and some will not. Still TBD whether this is actually something we want at all. This PR reverts specifically that change in the V2 ingest code so that original file extension is preserved downstream. ## Testing CI is passing with filenames updated via `Ingest Test Fixtures Update` workflow. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2024-05-29 19:14:22 +00:00
Christine Straub	f4457249a7	fix: `partition_pdf()` removes spaces from the text (#3106 ) Closes #2896. This PR aims to fix `partition_pdf()` to keep spaces in text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements. ### Testing PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf) ``` elements = partition_pdf( filename="rok_20230930_1-1.pdf", strategy="hi_res", ) print(str(elements[20])) ``` Results: - PR ``` Name of each exchange on which registered New York Stock Exchange ``` - main branch ``` Nameofeachexchangeonwhichregistered NewYorkStockExchange ``` 0.14.3	2024-05-29 04:53:17 +00:00
Matt Robinson	3158169585	fix: uninstall bson for mongo connector (#3104 ) ### Summary Closes #3049. Reenables the MongoDB connector test, which was disabled previously in #3047 due to incompatibility between the `pymongo` and the `bson` package from `pip`, which is a dependency for the Astra connector. Per the `pymongo` docs below, `pymongo` ships with its own version of `bson` and installing `bson` from `pip` breaks `pymongo`. - https://pymongo.readthedocs.io/en/stable/installation.html ### Testing Ingest tests ran successfully for the [source connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512636315) and the [destination connector](https://github.com/Unstructured-IO/unstructured/actions/runs/9273154676/job/25512635546).	2024-05-28 17:45:18 +00:00
Matt Robinson	6b400b46fe	feat: add VoyageAI embeddings (#3069 ) (#3099 ) Original PR was #3069. Merged in to a feature branch to fix dependency and linting issues. Application code changes from the original PR were already reviewed and approved. ------------ Original PR description: Adding VoyageAI embeddings Voyage AI’s embedding models and rerankers are state-of-the-art in retrieval accuracy. --------- Co-authored-by: fzowl <160063452+fzowl@users.noreply.github.com> Co-authored-by: Liuhong99 <39693953+Liuhong99@users.noreply.github.com>	2024-05-24 21:48:35 +00:00
Yao You	32df4ee1c6	fix: disable table_as_cells output by default (#3093 ) This PR changes the output of table elements: now by default the table elements' `metadata.table_as_cells` is `None`. The data will only be populated when the env `EXTRACT_TABLE_AS_CELLS` is set to `true`. The original design of the `table_as_cells` is for evaluating table extraction performance. The format itself is not as readable as the `table_as_html` metadata for human or RAG consumption. Therefore by default this data is not needed. Since this output is meant for evaluation use, this PR chooses to use an environment variable to control whether it should be present in the partitioned results. This approach avoids adding parameters to the `partition` function call. Adding a new parameter to the `partition` interface increases the complexity of the interface and adds more maintenance cost since there is a long chain of function calls to pass down this parameter to where it is needed. ## test running the following code snippet on main vs. this PR ```python from unstructured.partition.auto import partition elements = partition("example-docs/layout-parser-paper-with-table.pdf", strategy="hi_res", skip_infer_table_types=[]) table_cells = [getattr(element.metadata, "table_as_cells", None) for element in elements if element.category == "Table"] ``` on main branch `table_cells` contains cell structured data but on this branch it is a list of `None`. However, if we first set in terminal: ```bash export EXTRACT_TABLE_AS_CELLS=true ``` then run the same code again with this PR the `table_cells` would contain actual data, the same as on main branch. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: badGarnet <badGarnet@users.noreply.github.com>	2024-05-24 16:41:25 +00:00
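Gating an output field on an environment variable, as the commit above describes, might look like the following. Only the variable name `EXTRACT_TABLE_AS_CELLS` and the value `true` come from the commit message; the helper names, the case-insensitive comparison, and the optional `environ` argument (added so the behaviour can be exercised without touching the real environment) are assumptions made for this sketch.

```python
import os
from typing import Any, Mapping, Optional


def extract_table_as_cells_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when EXTRACT_TABLE_AS_CELLS is set to "true"."""
    env = os.environ if environ is None else environ
    # Assumption: comparison is case-insensitive and ignores surrounding whitespace.
    return env.get("EXTRACT_TABLE_AS_CELLS", "false").strip().lower() == "true"


def table_as_cells_or_none(cells: Any, environ: Optional[Mapping[str, str]] = None) -> Any:
    """Keep the cell data when the flag is on; otherwise drop it to None."""
    return cells if extract_table_as_cells_enabled(environ) else None
```

Reading the flag at the point of use is what lets the setting bypass the long chain of `partition` call parameters mentioned in the commit.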
Yao You	809c7e515a	chore: reduce excessive logging (#3095 ) - change some info level logging for per page processing into detail level logging on trace logger - replace the try block in `document_to_element_list` to use `getattr` instead and add comment on the reason why sometimes `type` attribute may not exist for an element	2024-05-24 14:58:47 +00:00
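Swapping a `try` block for `getattr`, as mentioned in the commit above, can be illustrated as follows. The classes and the `element_type` helper are hypothetical stand-ins and are not taken from `document_to_element_list`; the sketch only shows the pattern of reading an attribute that some objects do not define.

```python
from typing import Optional


class TypedElement:
    """Stand-in for an element that carries a `type` attribute."""

    def __init__(self, type: str) -> None:
        self.type = type


class UntypedElement:
    """Stand-in for an element that never defines `type`."""


def element_type(element: object) -> Optional[str]:
    # getattr with a default replaces try/except AttributeError:
    # a missing attribute yields None instead of raising.
    return getattr(element, "type", None)
```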
Steve Canny	26d403d7a7	fix: add missing params to ElementMetadata (#3092 ) A couple of parameters needed for DOCX image extraction were not added as parameters to the `ElementMetadata` constructor when they were added as known fields. Also repair a couple of gaps in alphabetical ordering caused by recent additions.	2024-05-23 21:30:55 +00:00
Christine Straub	35ec21ecd0	fix: decide table extraction (#3090 ) This PR aims to add backward compatibility for the deprecated `pdf_infer_table_structure` parameter. A missing part of turning table extraction for PDFs and Images off by default in https://github.com/Unstructured-IO/unstructured/pull/3035, which was turned on in https://github.com/Unstructured-IO/unstructured/pull/2588. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>	2024-05-23 20:37:15 +00:00
David Potter	31a53c8a28	Fix: Chroma Upsert instead of Add (#3086 ) Thanks to @0xjgv we have upserting instead of adding in Chroma. This will prevent duplicate embeddings. Also including a huggingface example. We had examples for all the other embedders.	2024-05-23 19:56:19 +00:00
Steve Canny	47d28612f7	feat(docx): add pluggable picture sub-partitioner (#3081 ) Summary Allow registration of a custom sub-partitioner that extracts images from a DOCX paragraph. Additional Context - A custom image sub-partitioner must implement the `PicturePartitionerT` interface defined in this PR. Basically have an `.iter_elements()` classmethod that takes the paragraph and generates zero or more `Image` elements from it. - The custom image sub-partitioner must be registered by passing the class to `register_picture_partitioner()`. - The default image sub-partitioner is `_NullPicturePartitioner` that does nothing. - The registered picture partitioner is called once for each paragraph.	2024-05-23 18:46:30 +00:00
Matt Robinson	171b5df09f	fix: set `resolve_entities=False` in `partition_xml` (#3088 ) ### Summary Closes #3078. Sets `resolve_entities=False` for parsing XML with `lxml` in `partition_xml` to avoid text being dynamically injected into the document. ### Testing `pytest test_unstructured/partition/test_xml.py` continues to pass with the update.	2024-05-23 18:38:11 +00:00
Jan Kanty Milczek	9b83330b5a	fix: added the missing function argument (#3085 )	2024-05-23 14:30:26 +00:00
Hubert Rutkowski	b8d894f963	feat/Move the category field to Element (#3056 ) It's a pretty basic change, just literally moved the category field to the Element class. Can't think of other changes that are needed here, because I think pretty much everything expected the category to be directly in the elements list. For local testing, IDEs and linters should see the difference in that `category` is now in Element.	2024-05-23 10:43:26 +00:00
Matt Robinson	c9976760c5	fix: revert back to old requirements file for sphinx docs (#3077 ) ### Summary As seen in [this job](https://github.com/Unstructured-IO/unstructured/actions/runs/9182534479/job/25251583102), the build job for sphinx docs is failing, and has been failing for quite some time. This PR reverts the requirements file back to a [previous good commit](`91b892c79d`) for that job, and also moves the `build.in` file so the requirements file doesn't get updated on `make pip-compile`. This is fine since those requirements don't get installed as part of the package, and we've deprecated the `sphinx` docs in favor of https://docs.unstructured.io anyway. ### Testing Build was [successful](https://github.com/Unstructured-IO/unstructured/actions/runs/9198605026/job/25301670934?pr=3077) on the feature branch. --------- Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-05-23 03:32:06 +00:00
Steve Canny	b4ee019170	rfctr: flatten test_unstructured/partition (#3073 ) Summary Some partitioner test modules are placed in directories by themselves or with one other test module. This unnecessarily obscures where to find the test module corresponding to a partitioner. Move partitioner test modules to mirror the directory structure of `unstructured/partition`.	2024-05-23 00:51:08 +00:00
Christine Straub	18428f24ab	chore: bump unstructured-inference 0.7.33 (#3074 ) Summary: - bump unstructured-inference to `0.7.33` - cut a release for `0.14.2` - add some dependencies that previously came through from the layoutparser extras. 0.14.2	2024-05-22 22:35:00 +00:00
Steve Canny	30e5a0cd4e	rfctr(docx): organize docx tests (#3070 ) Summary In preparation for adding DOCX pluggable image extraction, organize a few of the DOCX tests to be parallel to very similar tests for the DOC and ODT partitioners. 0.14.1	2024-05-21 22:11:46 +00:00
Matt Robinson	7832dfc723	feat: add attribution for pinecone (#3067 ) ### Summary - Updates the `pinecone-client` from v2 to v4 using the [client migration guide](https://canyon-quilt-082.notion.site/Pinecone-Python-SDK-v3-0-0-Migration-Guide-056d3897d7634bf7be399676a4757c7b#932ad98a2d33432cac4229e1df34d3d5). Version bump was required to [add attribution](https://pinecone-2-partner-integration-guide.mintlify.app/integrations/build-integration/attribute-api-activity) and will also enable us to support [serverless indexes](https://docs.pinecone.io/reference/pinecone-clients#initialize) - Adds `"unstructured.{version}"` as the source tag for the connector ### Testing Destination connection tests [pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9180305080/job/25244484432?pr=3067) with the updates.	2024-05-21 20:56:08 +00:00
Christine Straub	b0d8a779da	feat: `partition_pdf()` set inferred elements text (#3061 ) This PR adds the ability to fill inferred elements text from embedded text (`pdfminer`) without depending on `unstructured-inference` library. This PR is the second part of moving embedded text related code from `unstructured-inference` to `unstructured` and works together with https://github.com/Unstructured-IO/unstructured-inference/pull/349.	2024-05-21 19:43:38 +00:00
Matt Robinson	059fc64bd9	build: apk add libreoffice24 (#3065 ) ### Summary Switches to installing `libreoffice` from the Wolfi repository and upgrades the `libreoffice` version to `libreoffice==24.x.x`. Resolves a medium vulnerability in the old `libreoffice` version. Security scanning with `anchore/grype` was also added to the `test_dockerfile` job. Requirements were bumped to resolve a vulnerability in the `requests` library. ### Testing `test_dockerfile` passes with the updates.	2024-05-21 18:54:16 +00:00
Roman Isecke	3eaf65a8c1	feat: refactor ingest (#3009 ) ### Description This refactors the current ingest CLI process to support better granularity in how the steps are run * Both multiprocessing and async now supported. Given that a lot of the steps are IO-bound, such as downloading and uploading content, we can achieve better parallelization by using async here * Destination step broken up into a stager step and an upload step. This will allow for steps that require manipulation of the data between formats, such as converting the elements json into a csv format to upload for tabular destinations, to be pulled out of the step that does the actual upload. * The process of writing the content to a local destination was now pulled out as its own dedicated destination connector, meaning you no longer need to persist the content locally once the process is done if the content was uploaded elsewhere. * Quick update to the chunker/partition step to use the python client. * Move the uncompress support as a pipeline step since this can arbitrarily apply to any concrete files that have been downloaded, regardless of where they came from. * Leverage last modified date to mark files to be reprocessed, even if the file already exists locally. ### Callouts Retry configs haven't been moved over yet. This is an open question because the intent was for it to wrap potential connection errors but now any of the other steps that leverage an API might run into network connection issues. Should those be isolated in each of the steps and wrapped with the same retry configs? Or do we need to expose a unique retry config for each step? This would bloat the input params even more. 
### Testing * If you want to run the new code as an SDK, there's an example file that was added to highlight how to do that: [example.py](https://github.com/Unstructured-IO/unstructured/blob/roman/refactor-ingest/unstructured/ingest/v2/example.py) * If you want to run the new code as an isolated CLI: ```shell PYTHONPATH=. python unstructured/ingest/v2/main.py --help ``` * If you want to see which commands have been migrated to the new version, there's now a `v2` short help text next to those commands when running the current cli: ```shell PYTHONPATH=. python unstructured/ingest/main.py --help Usage: main.py [OPTIONS] COMMAND [ARGS]...main.py --help Options: --help Show this message and exit. Commands: airtable azure biomed box confluence delta-table discord dropbox elasticsearch fsspec gcs github gitlab google-drive hubspot jira local v2 mongodb notion onedrive opensearch outlook reddit s3 v2 salesforce sftp sharepoint slack wikipedia ``` You can run any of the local or s3 specific ingest tests and these should now work. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-05-21 17:01:49 +00:00
Matt Robinson	73739b38cc	docs: redirect to docs.unstructured.io on github pages (#3054 ) ### Summary Updates GitHub pages to redirect to the new https://docs.unstructured.io page. This will appear on GitHub pages after the next tag. ### Testing 1. From the docs directory, run `make html`. You should not see any errors or warnings 2. Open `unstructured/docs/build/html/index.html`. It should look like the following: <img width="1512" alt="image" src="https://github.com/Unstructured-IO/unstructured/assets/1635179/077626a5-d88a-467e-9e37-273a92e75d30"> 3. Open `unstructured/docs/build/html/404.html`. It should redirect back to `index.html`. Per the [GitHub pages docs](https://docs.github.com/en/pages/getting-started-with-github-pages/creating-a-custom-404-page-for-your-github-pages-site), that page will get served for 404 errors, meaning any links to old docs pages will redirect to `index.html`, which points users to the new docs page.	2024-05-21 09:38:32 -04:00
Matt Robinson	acda4d0707	fix: set `skip_infer_tables` explicitly in `test_partition_via_api_with_no_strategy` (#3057 ) ### Summary A `partition_via_api` test that only runs on `main` was [failing](https://github.com/Unstructured-IO/unstructured/actions/runs/9159429513/job/25181600959) with the following output, likely due to the change in the default behavior for `skip_infer_table_types`. This PR explicitly sets the `skip_infer_table_types` param to avoid the failure.. ```python =========================== short test summary info ============================ FAILED test_unstructured/partition/test_api.py::test_partition_via_api_with_no_strategy - AssertionError: assert 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' != 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' + where 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb9069fc610>.text + and 'Zejiang Shen® (<), Ruochen Zhang?, Melissa Dell®, Benjamin Charles Germain Lee?, Jacob Carlson®, and Weining Li®' = <unstructured.documents.elements.Text object at 0x7fb90648ad90>.text = 1 failed, 2299 passed, 9 skipped, 2 deselected, 2 xfailed, 9 xpassed, 14 warnings in 1241.64s (0:20:41) = make: *** [Makefile:302: test] Error 1 ``` ### Testing After temporarily removing the "skip if not on `main`" `pytest` mark, the [unit tests pass](https://github.com/Unstructured-IO/unstructured/actions/runs/9163268381/job/25192040902?pr=3057O) on the feature branch.	2024-05-20 19:05:13 -04:00

1447 Commits