unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-09-17 12:27:45 +00:00

Author	SHA1	Message	Date
Christine Straub	b47e6e9fdc	refactor: remove download packages step (#3225 ) This PR aims to remove the download packages step since all of that gets installed in the base images. This PR also updates the base `wolfi` image because the original base image can not be found anymore: https://github.com/Unstructured-IO/unstructured/actions/runs/9555654898/job/26339587945	2024-06-18 12:15:44 +00:00
Steve Canny	77a9e1b54d	rfctr(html): drop convert_and_partition_html() (#3215 ) Summary Remove `unstructured.partition.html.convert_and_partition_html()`. Move file-type conversion (to HTML) responsibility to each brokering partitioner that uses that strategy and let them call `partition_html()` for themselves with the result. Additional Context Rationale: - `partition_html()` does not want or need to know which partitioners might broker partitioning to it. - Different brokering partitioners have their own methods to convert their format to HTML and quirks that may be involved for their format. Avoid coupling them so they can evolve independently. - The core of the conversion work is already encapsulated in `unstructured.partition.common.convert_file_to_html_text_using_pandoc()`. - `convert_and_partition_html()` represents an additional brokering layer with the entailed complexities of an additional site for default parameter values to be (mis-)applied and/or dropped and is an additional location for new parameters to be added.	2024-06-17 19:43:18 +00:00
Roman Isecke	d876a386ed	Roman/fix ingest async connectors (#3210 ) ### Description The choice to use async must be made carefully: if a connector is set to use async, the pipeline will not fan out the inputs via multiprocessing but will instead be limited to a single process, under the assumption that it benefits more from async due to heavy network traffic. This means that blocking code not optimized for async will make the pipeline perform worse than if the connector were never marked as async, since the pipeline would otherwise fan that work out using multiprocessing. All connectors and processes in the pipeline were revisited to make sure this criterion was met and updated accordingly: * Currently the unstructured client does not support making requests async, so this was moved over to use multiprocessing * fsspec connector was updated to use the async client from the fsspec library. This also required that the client be a `@property` fetched on demand, otherwise the client would break the multiprocessing pool since it maintains a thread lock and that can't be pickled when the fsspec connector doesn't support async. * elasticsearch was also updated to use the async client * weaviate only recently came out with async support in their SDK at a version that is higher than we can use in the open source repo, so a TODO was left but otherwise moved to use multiprocessing * all underlying embedders don't use async, so the embedder step must use multiprocessing for now. A TODO was left to update underlying embedder classes to optionally support async. * Chunking parameters were not accurately being passed through from cli to chunker params; this was fixed --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-17 16:55:19 +00:00
Frederic Marvin Abraham	6220633d3f	enhancement: make tempfiles windows friendly (#3108 ) ### Summary Updates handling of tempfiles so that they work on Windows systems. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io> Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>	2024-06-17 13:28:48 -04:00
Matt Robinson	2815226b54	build(deps): version bumps for 2024-06-17 (#3220 ) ### Summary Version bumps for the week of 2024-06-17. There is now a pin on `numpy` due to a breaking change in the latest version that we'll need to investigate and remove in a subsequent PR.	2024-06-17 14:04:29 +00:00
Steve Canny	9fae0111d9	rfctr(html): drop HTML-specific elements (#3207 ) Summary Remove HTML-specific element types and return "regular" elements like `Title` and `NarrativeText` from `partition_html()`. Additional Context - An aspect of the legacy HTML partitioner was the use of HTML-specific element types used to track metadata during partitioning. - That role is no longer necessary or desirable. - HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were returned from partitioning HTML but also the seven other file-formats that broker partitioning to HTML (convert-to-HTML and partition_html()). This does not cause immediate breakage because these are still `Text` element subtypes, but it produces a confusing developer experience. - Remove the prior metadata roles from HTML-specific elements and remove those element types entirely.	2024-06-15 00:14:22 +00:00
Matt Robinson	08383a27de	build: pull from wolfi base image (#3213 ) ### Summary Updates the `wolfi` image to pull from the upstream `wolfi-base` base image to avoid maintaining the base layers in both locations. Closes #3105 by pulling in the fix from upstream. ### Testing `test_dockerfile` should continue to pass with the changes.	2024-06-14 20:41:27 +00:00
Christine Straub	9552fbbfbf	chore: bump unstructured-inference 0.7.35 (#3205 ) ### Summary - bump unstructured-inference to `0.7.35` which fixed syntax for generated HTML tables - update unit tests and ingest test fixtures to reflect changes in the generated HTML tables - cut a release for `0.14.6` --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: christinestraub <christinestraub@users.noreply.github.com> 0.14.6	2024-06-14 18:11:38 +00:00
Roman Isecke	a6c09ec621	Roman/dry ingest pipeline step (#3203 ) ### Description The main goal of this was to reduce the duplicate code that was being written for each ingest pipeline step to support async and not async functionality. Additional bug fixes found and fixed: * each logger for ingest wasn't being instantiated correctly. This was fixed to instantiate in the beginning of a pipeline run as soon as the verbosity level can be determined. * The `requires_dependencies` wrapper wasn't wrapping async functions correctly. This was fixed so that `asyncio.iscoroutinefunction()` is triggered correctly.	2024-06-14 13:46:44 +00:00
Pawel Kmiecik	29e64eb281	feat: table evaluations for fixed html table generation (#3196 ) Update to the evaluation script to handle correct HTML syntax for tables. See https://github.com/Unstructured-IO/unstructured-inference/pull/355 for details. This change: - modifies transforming HTML tables to evaluation internal `cells` format - fixes the indexing of the output (internal format cells) when HTML cells use spans	2024-06-14 09:03:27 +00:00
Roman Isecke	dadc9c6d0b	feat/tqdm ingest support (#3199 ) ### Description Add in tqdm support to show progress bar of status of each job when being run. Supported for each mode (serial, async, multiprocess). Also small timing wrapper around jobs to print out how long it took in total.	2024-06-13 18:41:54 +00:00
Steve Canny	f5ebb209a4	rfctr(html): drop page concept (#3184 ) Summary Pagination of HTML documents is currently unused. The `Page` class and concept were deeply embedded in the legacy organization of HTML partitioning code due to the legacy `Document` (= pages of elements) domain model. Remove this concept from the code such that elements are available directly from the partitioner. Additional Context - Pagination can be re-added later if we decide we want it again. A re-implementation would be much simpler and much lower impact to the structure of the code and introduce much less additional complexity, similar to the approach we take in `partition_docx()`.	2024-06-13 18:19:42 +00:00
ryannikolaidis	da3492b529	fix: dropbox source connector file path bugs (#3189 ) The Dropbox source connector currently raises exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library and a divergence in the definition of the Dropbox library's fs.info method, expecting a 'url' parameter rather than 'path'. ## Changes * add a `/` prefix to file path used by DropboxIndexer * override the fsspec sterilize_info method in DropboxIndexer to call `self.fs.info` with `url` rather than `path`, to accommodate the fact that `dropboxdrivefs` diverges with this signature * remove `dropbox.sh` from ignored source tests * update test fixtures (now that the dropbox connector has been fixed and not skipped) ## Testing `dropbox.sh` source ingest test now succeeds (and is no longer ignored) --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com> Co-authored-by: Christine Straub <christinemstraub@gmail.com>	2024-06-13 18:06:41 +00:00
Roman Isecke	f7b0a37c86	Feat/migrate elasticsearch src connector (#3174 ) ### Description Migrate elasticsearch connector with support for what used to be batch ingest docs, but now with support for the download step to generate additional file data. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-13 17:57:59 +00:00
Matt Robinson	ad69bdcd4e	build(deps): deltalake bump to `0.18.x` (#3197 ) ### Summary Closes #3173. Removes the `overwrite_schema` kwarg from the Delta Table connector and bumps the `deltalake` version. Per [this PR](https://github.com/delta-io/delta-rs/pull/2554) in the `deltalake` repo, the `overwrite_schema` kwarg is deprecated as of version `0.18.0`. Users can specify `schema_mode="merge"` to obtain the same behavior. - `schema_mode="merge"` is equivalent to `overwrite_schema=False` - `schema_mode="overwrite"` is equivalent to `overwrite_schema=True` Also adds an `engine` parameter that you can use to set `"rust"` or `"pyarrow"` as the engine. `engine` defaults to `"pyarrow"` and `schema_mode` defaults to `None`, which is consistent with the behavior in `deltalake` documented [here](https://delta-io.github.io/delta-rs/api/delta_writer/). ### Testing The Delta Table ingest tests should pass on this PR. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-06-13 15:59:34 +00:00
Steve Canny	5f582f1716	ci: update to Node 20 actions (#3200 ) Summary Silence the long list of warnings we get in CI from using Node 16 actions by updating to Node 20 versions.	2024-06-13 03:43:26 +00:00
ryannikolaidis	17bc55e7be	fix: relative path / permissions issues with v2 fsspec connectors (#3186 ) When the v2 fsspec connectors currently generate the relative path, they may introduce a path with a leading slash (this happens in the case of the Box connector, which is a subclass of fsspec). This results in the paths unintentionally being treated as absolute paths. As a result, the ingest pipeline attempts to write files to directories at root level, which in turn raises permission issues. Note: Box expected results needed to be updated now that it's no longer failing. Aside: found that our tests were unintentionally skipping `box.sh` tests because we were intending to skip `dropbox.sh` and we use regex to match if a given test is in skip tests. This adds changes to force an exact match. ## Changes * Strip leading slashes during the creation of relative paths in fsspec connectors * Add expected results for Box connector * (bonus): `make tidy` altered an unrelated file by removing an unnecessary call of `pass` * (bonus): check exact match for skipped ingest tests which fixes Box tests getting skipped ## Testing [Tests](https://github.com/Unstructured-IO/unstructured/actions/runs/9461928289/job/26093475612#step:7:2085) for the Box connector were failing. They were accidentally getting skipped (see changes above). They are now no longer skipped and are passing.	2024-06-12 03:39:35 +00:00
Filip Knefel	c2065db716	fix API-297: List parameters incorrectly passed to API requests (#3154 ) In two places parameters passed to the python client when using either Ingest workflow and `partition_via_api` function directly we parse the parameters with list values to strings e.g. ```python extract_image_block_types=["image"] -> extract_image_block_types='["image"]' ``` as of now these parameters are parsed incorrectly when given as strings and correctly when given as lists. This PR removes parsing from `PartitionConfig` and `partition_via_api`. --------- Co-authored-by: Filip Knefel <filip@unstructured.io>	2024-06-11 21:00:41 +00:00
Steve Canny	2f0400f279	rfctr(html): break coupling to DocumentLayout (#3180 ) Summary Remove use of `partition.common.document_to_element_list()` by `HTMLDocument`. The transitive coupling with layout-inference through this shared function has been the source of frustration and a drain on engineering time and there's no compelling reason for the two to share this code. Additional Context `partition_html()` uses `partition.common.document_to_element_list()` to get finalized elements from `HTMLDocument` (pages). This gives rise to a very nasty coupling between `DocumentLayout`, used by `unstructured_inference`, and `HTMLDocument`. `document_to_element_list()` has evolved to work for both callers, but they share very few common characteristics with each other. This coupling is bad news for us and also, importantly, for the inference and page layout folks working on PDF and images. Break that coupling so those inference-related functions can evolve whatever way they need to without being dragged down by legacy `HTMLDocument` connections. The initial step is to extract a `document_to_element_list()` function of our own, getting rid of the coordinates and other `DocumentLayout`-related bits we don't need. As you'll see in the next few PRs, all of this `document_to_element_list()` code will end up either going away or being relocated closer to where it's used in `HTMLDocument`.	2024-06-11 20:54:11 +00:00
Steve Canny	e39ee16161	rfctr(html): promote HTMLDoc candidate methods (#3177 ) Summary Make `._find_articles()` and `._find_main` into `._articles` and `._main` properties on HTMLDocument, respectively. Additional Context After prior refactorings, these two functions now each require only `self` and can become `@lazyproperty`s on `HTMLDocument`. This ensures they are computed at most once. In addition, their close relationship to `HTMLDocument` is indicated by their membership as methods rather than "loose" functions.	2024-06-10 22:07:21 +00:00
Matt Robinson	c822e3fd10	build(deps): weekly dependency bumps (6/10/2024) (#3170 ) ### Summary Weekly dependency bumps for the week of 6/10/2024. The `deltalake` dependency was pinned to `<0.18.0` because `0.18.0` seemed to break the connector test, per [this test](https://github.com/Unstructured-IO/unstructured/actions/runs/9450141486/job/26028131005). Opened #3173 to address.	2024-06-10 16:20:22 +00:00
Tracy Shen	d82a34519e	[Merge request] bug fix on table structure metric (#3089 ) Summary This fix provides better logic on matched_idx when calculating the table structure metric, to provide a more accurate calculation of the acc Additional Context - this fix has passed CI run in Draft PR #3025 initially - therefore, this time we would like to merge into main branch - this commit has merged the latest change from main after the Draft PR	2024-06-10 15:14:32 +00:00
Duda Nogueira	657a949a00	chore: Weaviate pyv4 example (#3151 ) Update Unstructured example for Weaviate, now using latest python v4 client. --------- Co-authored-by: Matt Robinson <mrobinson@unstructured.io>	2024-06-10 10:08:46 -04:00
Steve Canny	a66661a7bf	rfctr(html): drop now dead XMLDocument and Document (#3165 ) Summary `HTMLDocument` is the class handling the core of HTML parsing. This is critical code because 8 of the 20 file-type partitioners end up using this code (`partition_html()` + 7 brokering partitioners like EPUB, MD, and RST). For historical reasons, `HTMLDocument` subclassed `XMLDocument` which in turn subclassed `Document`, both of which are no longer relevant and unnecessarily complicate reasoning about `HTMLDocument` behavior. Remove that inheritance and dependency and drop both `XMLDocument` and `Document` modules which become dead code after no longer being used by `HTMLDocument`.	2024-06-08 07:36:18 +00:00
Matt Robinson	b4876f1b18	build: 0.14.5 release (#3164 ) ### Summary Update changelog and version for `0.14.5` release. 0.14.5	2024-06-07 17:20:30 +00:00
Roman Isecke	0fe0f15f30	feat: migrate weaviate connector to new framework (#3160 ) ### Description Add weaviate output connector to those supported in the new v2 ingest framework. Some fixes were needed to the upload stager step as this was the first connector moved over that leverages this part of the pipeline.	2024-06-06 23:18:55 +00:00
Steve Canny	a883fc9df2	rfctr(html): improve SNR in HTMLDocument (#3162 ) Summary Remove dead code and organize helpers of HTMLDocument in preparation for improvements and bug-fixes to follow	2024-06-06 21:21:33 +00:00
Steve Canny	8378ddaa3b	rfctr(html): organize and improve HTMLDocument tests (#3161 ) Summary In preparation for further work on HTMLDocument, organize the organic growth in `documents/tests_html.py` and improve typing and expression. Reviewers: Commits are groomed and review is probably eased by going commit-by-commit	2024-06-06 18:16:02 +00:00
Steve Canny	f1cab248ce	rfctr(msg): remove temporary new_msg.py (#3157 ) Summary Remove temporary `new_msg.py` module. Additional Context The rewrite of `partition_msg()` was placed in a separate file `new_msg.py` to avoid a messy diff for code-review. This PR makes that `new_msg.py` the new `msg.py`. No code changes were made in the process.	2024-06-06 08:31:56 +00:00
Steve Canny	ddbe90f6bb	rfctr(html): clean html tests in prep for PRs to follow (#3156 ) Summary Clean `tests_unstructured/partition/test_html.py` in preparation for broader refactor of HTML partitioner to follow. That refactor will address a cluster of bugs. Temporarily remove blank lines in tests so reordering tests in following commit is easier to follow. Those will go back in after that.	2024-06-05 23:11:58 +00:00
Steve Canny	e4158deaff	fix(msg): use python-oxmsg for MSG email parsing (#3142 ) Summary `partition_msg()` previously used the `msg_parser` library for parsing Outlook MSG email files (.msg files). The `msg_parser` library is unmaintained and has several major shortcomings such as not being able to parse MSG files with 8-bit encoded strings and not reliably extracting attachments. Use the new and permissively licenced `python-oxmsg` library instead. Additional Context For reviewability purposes, this PR temporarily places the new `partition_msg()` implementation in `new_msg.py` and references that implementation from `msg.py`. `new_msg.py` will be renamed to `msg.py` in a closely following PR. This avoids a very messy interleaving of hunks in a diff between the old and re-written `partition_msg()` implementation. Fixes #2481 Fixes #3006	2024-06-05 21:12:27 +00:00
Roman Isecke	b777864296	feat: Migrate over fsspec connectors (#3066 ) ### Description Move over all fsspec connectors to the new framework Variety of bug fixes found and fixed in this PR as well: * custom json mixin being used for the enhanced dataclass would break if typing was quoted. That was fixed. A check was also added to the enhanced dataclass to prevent `InitVar` from being used in the root dataclass since this breaks serialization. * hashing for partitioner was using the filename of the raw file being partitioned rather than the file name of the file data generated from indexing. This means that multiple files could result in the same partition hash when recursive flag is passed in. This was updated to use the file data file name instead. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2024-06-05 19:12:06 +00:00
Matt Robinson	0e16bf4bf0	enhancement: apply tar filters when using python 3.12 or above (#3124 ) ### Summary Applies tar filters when using Python 3.12 or above. This was added to the [Python `tarfile` library in 3.12](https://docs.python.org/3/library/tarfile.html#extraction-filters) and guards against malicious content being extracted from `.tar.gz` files. ### Testing Added smoke test. If this passes for all Python versions, we're good.	2024-06-05 18:28:59 +00:00
Yao You	fdb27378cb	chore: use python3 consistently in makefile (#3152 ) This PR changes two `python` commands in `Makefile` to use `python3` to be consistent with other make commands. This makes it more explicit on which python to use when the makefile is used outside of a controlled virtualenv where only one python exists.	2024-06-05 00:05:57 +00:00
Matt Robinson	5203390a4a	build(deps): weekly pip version bump (#3147 ) ### Summary Weekly PR to bump dependency versions.	2024-06-04 20:47:04 +00:00
Christine Straub	1dede5029d	fix: parsing pdf error - new_cells as str has no "copy" (#3130 ) Closes #3119. ### Testing Parsing the provided PDF should be successful. [testing_brochure_2.pdf](https://github.com/user-attachments/files/15518094/testing_brochure_2.pdf) ``` filename = "testing_brochure_2.pdf" with open(filename, "rb") as pdf_content: elements = partition_pdf( file=pdf_content, infer_table_structure=True, extract_image_block_types=["Image", "Table"], chunking_strategy="by_title", max_characters=1000, new_after_n_chars=3000, combine_text_under_n_chars=1000, ) print("\n\n".join([str(el) for el in elements])) ``` 0.14.4	2024-06-03 18:49:38 +00:00
Matt Robinson	1b43102762	fix: remove root handlers when they exist (#3128 ) ### Summary In some environments, such as Google Colab, loggers have a root handler that did not mask sensitive values. As a result, secrets such as API keys appeared in the logs. The PR removes root handlers when they exist to ensure sensitive values are handled properly. ### Testing Run the following in a Colab notebook. You should see two log outputs, one with the API key masked and one with it exposed. ``` !pip install unstructured ``` ```python import logging import json from unstructured.ingest.interfaces import ( ChunkingConfig, EmbeddingConfig, PartitionConfig, ProcessorConfig, ReadConfig, ) partition_config = PartitionConfig( partition_by_api=True, api_key="super secret", ) from unstructured.ingest.logger import ingest_log_streaming_init ingest_log_streaming_init(logging.INFO) logger = logging.getLogger("unstructured.ingest") logger.setLevel(logging.INFO) logger.info( f"Running partition node to extract content from json files. " f"Config: {partition_config.to_json()}, " ) ``` Now replace the first cell with the following and rerun the Python code. Only the masked logging output should remain. ``` !git clone https://github.com/Unstructured-IO/unstructured.git && cd unstructured && git checkout fix/rm-log-dupes && pip install -e . ```	2024-05-31 22:07:38 +00:00
Matt Robinson	54c1e4e57f	ci: remove jira issue workflow (#3129 ) ### Summary Removes the workflow for creating Jira tickets.	2024-05-31 22:00:40 +00:00
Matt Robinson	6005abce79	feat: configure googlevisionapi (#3126 ) ### Summary Includes changes from #3117. Merged into a feature branch to run the full test suite. Original PR description: The Google Vision API allows for [configuration of the API endpoint](https://cloud.google.com/vision/docs/ocr#regionalization), to select if the data should be sent to the US or the EU. This PR adds an environment variable (`GOOGLEVISION_API_ENDPOINT`) to configure it. --------- Co-authored-by: JIAQIA <jqq1716@gmail.com> Co-authored-by: Dimitri Lozeve <dimitri@lozeve.com>	2024-05-31 18:41:04 +00:00
Yuming Long	4a96d54906	chore: move logger error to debug when pdfminer extract fails (#3028 ) ### Summary We are seeing logger error `Invalid dictionary construct` for hosted APIs, move this logger error to debug level - we still continue partition when pdfminer text extraction fails as before (just don't throw the log error anymore) ### Test I was able to reproduce the logger error with an internal only file (please DM me if needed) and the error trace looks like ``` File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 709, in _process_pdfminer_pages annotation_list = get_uris(page.annots, height, coordinate_system, page_number) File "/Users/yumingl/develops/unstructured/unstructured/partition/pdf.py", line 1049, in get_uris resolved_annots = annots.resolve() ... ``` we also won't be able to repair pdf structure on `get_uris` (not a page level) so move this exception to debug level.	2024-05-31 17:58:36 +00:00
Matt Robinson	865ef496e6	ci: update `pinecone` test to use serverless (#3127 ) ### Summary Closes #3068. Updates the Pinecone connector tests to use serverless indexes, per the documentation [here](https://docs.pinecone.io/reference/api/control-plane/create_index). Also updates the CHANGELOG to mention serverless. Turns out we already supported it with the client version bump, but it hadn't been tested yet. ### Testing See [this CI job](https://github.com/Unstructured-IO/unstructured/actions/runs/9319836670/job/25655322433?pr=3127) that passed, running only the Pinecone test.	2024-05-31 15:24:41 +00:00
ryannikolaidis	1f8768750c	chore: add auth to s3 destination test (#3122 ) We should be validating the S3 Destination with authenticated requests, with credentials from a limited test user. ## Changes - Updates s3 destination test to point to a bucket that requires authentication. - Adds authentication to the s3 destination test request - Bonus: fix deserialization of S3ConnectionConfig for s3 V2 destination - Bonus: fix S3ConnectionConfig never registered for s3 V2 destination - Bonus: repair version and changelog version for consistency with -dev convention ## Testing Validated by changes to S3 destination ingest test	2024-05-31 07:05:09 +00:00
Matt Robinson	23e570fc8a	docs: cleanup readme; add python 3.12 (#3120 ) ### Summary Updates documentation references in the README to point to https://docs.unstructured.io and cleans up a few sections of the README. Specifically: - Removes an old API announcement - Removes the section mentioning Chipper as a beta feature. Chipper is only available through the SaaS API. Also adds a Python 3.12 tag to `setup.py` since we now support Python 3.12.	2024-05-30 16:22:54 +00:00
qued	293901e144	build: pin python-docx (#3110 ) Since we incorporate a newer feature from `python-docx` [here](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py#L521), we should make the version of `python-docx` that first supports that method an explicit requirement. I didn't pip recompile since our generated dependencies already have `python-docx==1.1.2`, but I can do that if someone thinks it's necessary.	2024-05-30 15:08:10 +00:00
Matt Robinson	9acf26ec2e	docs: explicitly replace all old pages with link to new docs (#3118 ) ### Summary Explicitly replaces all old docs pages with a link to the new docs. This was required because 404 redirects didn't work for pages that previously existed, though they worked for paths that never existed.	2024-05-30 13:01:33 +00:00
Matt Robinson	8415db5112	docs: make 404 pages same as index (#3114 ) ### Summary Makes a custom 404 page that's the same as `index.html`, so any path shows the URL for the new docs.	2024-05-30 07:46:38 -04:00
Steve Canny	f2e67539b1	rfctr: clean MSG partitioner and tests as prep (#3107 ) Summary Fix type errors and generally prepare `partition_msg()` and its tests for refactoring to use `python-oxmsg` library instead of the problematic `msg_parser` library for partitioning Outlook MSG files.	2024-05-29 21:36:05 +00:00
Matt Robinson	2ecaf5e38c	fix: remove 404 from docs (#3112 ) ### Summary Removes 404 from the docs build to avoid rate limiting behavior.	2024-05-29 20:41:32 +00:00
ryannikolaidis	6b5d8a9785	fix: revert dropping of filename extension for some connectors (#3109 ) V2 refactor of ingest code introduces the removal of original file extensions. Since the upgrade of connectors is incomplete this means that some connectors will remove the original file extension and some will not. Still TBD whether this is actually something we want at all. This PR reverts specifically that change in the V2 ingest code so that original file extension is preserved downstream. ## Testing CI is passing with filenames updated via `Ingest Test Fixtures Update` workflow. --------- Co-authored-by: ryannikolaidis <ryannikolaidis@users.noreply.github.com>	2024-05-29 19:14:22 +00:00
Christine Straub	f4457249a7	fix: `partition_pdf()` removes spaces from the text (#3106 ) Closes #2896. This PR aims to fix `partition_pdf()` to keep spaces in text. The control character `\t` is now replaced with a space instead of being removed when merging inferred and embedded elements. ### Testing PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf) ``` elements = partition_pdf( filename="rok_20230930_1-1.pdf", strategy="hi_res", ) print(str(elements[20])) ``` Results: - PR ``` Name of each exchange on which registered New York Stock Exchange ``` - main branch ``` Nameofeachexchangeonwhichregistered NewYorkStockExchange ``` 0.14.3	2024-05-29 04:53:17 +00:00
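
The tab handling described in f4457249a7 can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the `unstructured` codebase; the sketch only contrasts dropping `\t` with replacing it by a space.

```python
def strip_tabs(text: str) -> str:
    """Older behaviour as described in the commit: tabs are removed outright."""
    return text.replace("\t", "")


def tabs_to_spaces(text: str) -> str:
    """Fixed behaviour as described in the commit: each tab becomes one space."""
    return text.replace("\t", " ")


# Hypothetical input in which words are separated only by tab characters.
sample = "Name\tof\teach\texchange"

print(strip_tabs(sample))      # words run together
print(tabs_to_spaces(sample))  # words stay separated
```

Replacing rather than deleting keeps word boundaries intact when the extracted text uses tabs as its only separators, which is the symptom shown in the commit's before/after output.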


1418 Commits
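
Commit a6c09ec621 above mentions a dependency-checking wrapper that has to treat async functions differently from regular ones. The following is a rough sketch of that general technique, not the project's `requires_dependencies` implementation: `requires_modules` is a hypothetical name, and the sketch uses `inspect.iscoroutinefunction()`, the standard-library counterpart of the `asyncio.iscoroutinefunction()` check named in the commit.

```python
import asyncio
import functools
import importlib.util
import inspect


def requires_modules(*names):
    """Hypothetical decorator: raise ImportError if a top-level module is missing."""

    def check():
        missing = [n for n in names if importlib.util.find_spec(n) is None]
        if missing:
            raise ImportError(f"missing dependencies: {', '.join(missing)}")

    def decorator(func):
        if inspect.iscoroutinefunction(func):
            # Keep the wrapper a coroutine function so callers can still await it.
            @functools.wraps(func)
            async def async_wrapper(*args, **kwargs):
                check()
                return await func(*args, **kwargs)

            return async_wrapper

        @functools.wraps(func)
        def sync_wrapper(*args, **kwargs):
            check()
            return func(*args, **kwargs)

        return sync_wrapper

    return decorator


@requires_modules("json")
async def fetch():
    return "ok"


@requires_modules("json")
def compute():
    return 42


print(asyncio.run(fetch()), compute())
```

If the async case were wrapped with a plain `def`, the decorated function would no longer be recognised as a coroutine function, so a pipeline that dispatches on that check would take the synchronous path by mistake.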