unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-11-09 23:17:21 +00:00

Author	SHA1	Message	Date
John	3843af666e	feat: Enable remote chunking via unstructured-ingest (#2905 ) Update: The cli shell script works when sending documents to the free api, but the paid api is down, so waiting to test against it. - The first commit adds docstrings and fixes type hints. - The second commit reorganizes `test_unstructured_ingest` so it matches the structure of `unstructured/ingest`. - The third commit contains the primary changes for this PR. - The `.chunk()` method responsible for sending elements to the correct method is moved from `ChunkingConfig` to `Chunker` so that `ChunkingConfig` acts as a config object instead of containing implementation logic. `Chunker.chunk()` also now takes a json file instead of a list of elements. This is done to avoid redundant serialization if the file is to be sent to the api for chunking. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2024-04-25 00:24:58 +00:00
David Potter	c8cf8f31ac	bug CORE-4225: mongodb url bug (#2662 ) The mongodb redact method was created because we wanted part of the url to be exposed to the user during logging. Thus it did not use the dataclass `enhanced_field(sensitive=True)` solution. This changes it to use our standard redacted solution. This also minimizes the amount of work to be done in platform.	2024-03-28 18:38:50 +00:00
Steve Canny	31bef433ad	rfctr: prepare to add orig_elements serde (#2668 ) Summary The serialization and deserialization (serde) of `metadata.orig_elements` will be located in `unstructured.staging.base` alongside `elements_to_json()` and other existing serde functions. Improve the typing, readability, and structure of that module before adding the new serde functions for `metadata.orig_elements`. Reviewers: The commits are well-groomed and are probably quicker to review commit-by-commit than as all files-changed at once.	2024-03-20 21:27:59 +00:00
David Potter	5b92e0bb6b	bug CORE-4089: Onedrive partitioning fails - datetime formatting error (#2638 ) Fixes the OneDrive bug the same way Ryan fixed the SharePoint error (both are Microsoft products): https://github.com/Unstructured-IO/unstructured/pull/2591 https://github.com/Unstructured-IO/unstructured/pull/2592/files We are seeing inconsistent timestamps returned by OneDrive when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. Changes: Adds logic to guarantee OneDrive dates are formatted as ISO, regardless of the format provided by the onedrive library. Bumps timestamp format output to include timezone offset (as we do with others). Adds unit tests for isoformat; json_to_dict is already unit tested here: https://github.com/Unstructured-IO/unstructured/blob/main/test_unstructured_ingest/unit/test_utils.py Adds a small change for AstraDB to allow them to see what source called their API.	2024-03-15 14:01:05 +00:00
David Potter	43250d5576	bug CORE-3971: fix deserialization in google-drive source connector key path (#2586 ) The Google Drive service account key can be a dict or a file path (str). We have been using the path successfully, but the dict can also end up stored as a string that needs to be deserialized, and that deserialization can have issues with single and double quotes.	2024-03-03 15:30:35 +00:00
ryannikolaidis	71d5d513ef	fix: handling of varied SharePoint date formats (#2591 ) We are seeing occurrences of inconsistency in the timestamps returned by office365.sharepoint when fetching created and modified dates. Furthermore, in future versions of this library, a datetime object will be returned rather than a string. ## Changes - This adds logic to guarantee SharePoint dates will be properly formatted as ISO, regardless of the format provided by the sharepoint library. - Bumps timestamp format output to include timezone offset (as we do with others) ## Testing Unit test added to validate this datetime handling across various formats. --------- Co-authored-by: David Potter <potterdavidm@gmail.com>	2024-02-28 16:11:53 +00:00
Ahmet Melek	be71633415	refactor: isolate ingest dependencies into local scopes (#2509 ) This PR: - Moves ingest dependencies into local scopes so that ingest connector classes can be imported without installing the external dependencies they import. This allows lightweight use of the classes (not the instances; to use the instances as intended, you'll still need the dependencies). - Upgrades the embed module dependencies from `langchain` to `langchain-community` module (to pass CI [rather than introducing a pin]) - Does pip-compile - Does minor refactors in other files to pass `ruff 2.0` checks which were introduced by pip-compile	2024-02-06 21:28:55 +00:00
David Potter	74dcca44ca	fix: link_texts was breaking postgres destination connector (#2460 ) Formatting of link_texts was breaking metadata storage. Turns out it didn't need any conforming and came in correctly from json. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-27 04:29:38 +00:00
ryannikolaidis	2e97494613	fix: fsspec connectors returning data source version as integer (#2427 ) Connector data source versions should always be string values, however we were using the integer checksum value for the version for fsspec connectors. This casts that value to a string. ## Changes * Cast the checksum value to a string when assigning the version value for fsspec connectors. * Adds test to validate that these connectors will assign a string value when an integer checksum is fetched. ## Testing Unit test added.	2024-01-19 15:58:01 +00:00
jakub-sandomierz-deepsense-ai	411aa98bbf	feat: Salesforce connector accepts key path or value (#2321 ) (#2327 ) Solution to issue https://github.com/Unstructured-IO/unstructured/issues/2321. The simple_salesforce API allows passing either a private key path or the key value. This PR adds that support to the Ingest connector. The Salesforce parameter "private-key-file" has been renamed to "private-key". It can contain one of the following: - path to PEM encoded key file (as string) - key contents (PEM encoded string) If the provided value cannot be parsed as a PEM encoded private key, then the file existence is checked. This way private key contents are not exposed to unnecessary underlying function calls.	2024-01-11 11:15:24 +00:00
jakub-sandomierz-deepsense-ai	5581e6a4c4	fix: Ingest GCS accepts JSON auth token (#2322 ) (#2371 ) FSSpec serialization caused conversion of JSON token to string with single quotes. GCS requires JSON token in form of dict so this format is now assured. Other forms of auth are not modified but there is improved validation for all of the options.	2024-01-11 09:03:47 +00:00
Roman Isecke	8dc130c920	fix: ensure consistency in method signatures across destination connectors (#2381 ) ### Description * Make sure all destination connectors implement the base abstract methods using the same signatures. * Also leverage conform dict in the base methods to make sure it's called in a consistent fashion. * Additional updates to move the common code into the base destination connector class	2024-01-11 00:19:49 +00:00
jakub-sandomierz-deepsense-ai	0ca154a0f3	Fix: MongoDB connector URI password redaction, basic unit tests for Git connector (#2268 ) MongoDB connector: Issue: [MongoDB documentation](https://www.mongodb.com/docs/manual/reference/connection-string/) states that characters `$ : / ? # [ ] @` must be percent encoded. URI with password containing such special character will not be redacted. Fix: This fix removes usage of `unquote_plus` on password which allows detected password to match with one inside URI and successfully replace it. Git connector: Added very basic unit tests for repository filtering methods. Their impact is rather minimal but showcases current limitation in `is_file_type_supported` method.	2024-01-08 11:27:08 +00:00
rvztz	950e5d68f9	feat: adds postgresql/sqlite destination connector (#2005 ) - Adds a destination connector to upload processed output into a PostgreSQL/Sqlite database instance. - Users are responsible to provide their instances. This PR includes a couple of configuration examples. - Defines the scripts required to setup a PostgreSQL instance with the unstructured elements schema. - Validates postgres/pgvector embedding storage and retrieval --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2024-01-04 19:33:16 +00:00
Roman Isecke	cc05e948ff	chore: sensitive info connector audit (#2227 ) ### Description All other connectors that were not included in https://github.com/Unstructured-IO/unstructured/pull/2194 are now updated to follow the new pattern and mark any variables as sensitive where it makes sense. Core changes: * All connectors now support an `AccessConfig` to mark data that's needed for auth (i.e. username, password) and those that are sensitive are designated appropriately using the new enhanced field. * All cli configs on the cli definition now inherit from the base config in the connector file to reuse the variables set on that dataclass * The base writer class was updated to better generalize the new approach given better use of dataclasses * The base cli classes were refactored to also take into account the need for a connector and write config when creating the respective runner/writer classes. * Any mismatch between the cli field name and the dataclass field name were updated on the dataclass side to not impact the user but maintain consistency * Add custom redaction logic for mongodb URIs since the password is expected to be a part of it. Now this: `"mongodb+srv://ingest-test-user:r4hK3BD07b@ingest-test.hgaig.mongodb.net/"` -> `"mongodb+srv://ingest-test-user:*REDACTED@ingest-test.hgaig.mongodb.net/"` in the logs Bundle all fsspec based files into their own packages. * Refactor custom `_decode_dataclass` used for enhanced json mixin by using a monkey-patch approach. The original approach was breaking on optional nested dataclasses when serializing since the other methods in `dataclasses_json_core` weren't using the new method. By monkey-patching the original method with a new one, all other methods in that library would use the new one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-11 17:37:49 +00:00
David Potter	cde11d1eb0	feat: Add sftp source connector (#2163 ) Adds source connector for SFTP which uses fsspec and paramiko via fsspec. Paramiko is the standard sftp package for python used in pysftp etc... ``` --username foo \ --password bar \ --remote-url sftp://localhost:47474/upload/ ``` Will only download a specifically requested file if it has an extension. (i.e. `--remote-url sftp://localhost:47474/upload/bob.zip`) It will treat any other remote_url as a folder path. This is intentional. --------- Co-authored-by: potter-potter <david.potter@gmail.com>	2023-12-07 19:33:19 +00:00
Roman Isecke	f193d3d43b	feat: improve sensitive data handling by fsspec connectors (#2194 ) ### Description Building off of PR https://github.com/Unstructured-IO/unstructured/pull/2179, updating fsspec based connectors to use better authentication field handling. This PR adds in the following changes: * Update the base classes to inherit from the enhanced json mixin * Add in a new access config dataclass that should be used as a nested dataclass in the connector configs * Update the code extracting configs out of the cli options dictionary to support the nested access config if it exists on the parent config * Update all fsspec connectors with explicit access configs given what each one's SDKs support * Update the json mixin and enhanced field to support a name override when serializing/deserializing from json/dicts. This allows a different name to be used for the CLI option than what the name of the field is on the dataclass. * Update all the writers to use class-based approach and share the same structure of the runner classes * Above update allowed for better code to be used in the base source and destination CLI commands * Add in utility code around parsing a flat dictionary (coming from the click based options) into dataclass-based configs with potentially nested dataclasses. Slightly unrelated changes: * session handle removed from pinecone connector as this was breaking the serialization of the write config and didn't have any benefit as a connection was never being shared, the index used simply makes a new http call each time it's invoked. * Dedicated write configs were created for all destination connectors to better support serialization * Refactor of Elasticsearch connector included, with update to ingest test to use auth TODOs * Left a `#TODO` in the code but the way session handler is implemented right now, it breaks serialization since it adds a generic variable based on the library being used for a connector (i.e. 
`googleapiclient.discovery.Resource`) which is not serializable. This will need to be updated to omit that from serialization but still support the current workflow. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-12-05 20:55:19 +00:00
Roman Isecke	7ad8e88a95	feat: leverage logger to hide sensitive data in ingest logs (#2175 ) ### Description Modify the logger used by ingest to leverage a new class inheriting from `logging.Formatter`, which adds middleware that rewrites the message being logged to omit any sensitive content. It does this by dynamically pulling any valid json out of the string being logged and running it through a `hide_sensitive_fields` method, which updates any values that are considered sensitive. The original json strings are then replaced with the `json.dumps` version of the new dictionary.	2023-11-29 18:16:23 +00:00
Roman Isecke	30cbc420a0	bug: fsspec output filepath including base directory (#2146 ) ### Description When passing in a remote path for fsspec-based source connectors, the base directory was always being included in the output path itself. This was updated to exclude the base directory and include only the child directories relative to the base one. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-28 14:19:42 +00:00
Roman Isecke	2bb463d006	feat: support both single and batch ingest docs (#2105 ) ### Description Some source ingest connectors would be more efficient if they read content in batches rather than using an entire process per document. For example, reading from ElasticSearch. Given an index with possibly hundreds of documents, reading each one individually is not as optimal as reading in batches. To try and maintain as much of the ingest doc paradigm already being supported, a new class `BaseIngestDocBatch` was added to handle reading in batches. It produces a list of `BaseSingleIngestDoc` which is what all current implementations were renamed to. This list is generated after it runs its `get_files` method. Past the source node, all other steps in the pipeline should not be affected, this is just an optimization for the read step. Additional Changes: * Removed use of jq and instead converted this into a fields filter on the content to let the database handle the filtering and limit the amount of data being pulled in. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-27 19:25:30 +00:00
Roman Isecke	ba4477ac20	feat: support table conversion for tabular destination connectors (#1917 ) ### Description * A full schema was introduced to map the type of all output content from the json partition output and mapped to a flattened table structure to leverage table-based destination connectors. The delta table destination connector was updated at the moment to take advantage of this. * Existing method to convert to a dataframe was updated because it had a bug in it. Object content in the metadata would have the key name changed when flattened but then this would be omitted since it didn't exist in the `_get_metadata_table_fieldnames` response. * Unit test was added to make sure we handle all values possible in an Element when converting to a table * Delta table ingest test was split into a source and destination test (looking ahead to split these up in CI) --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-11-03 16:47:21 +00:00
Roman Isecke	135aa65906	update ingest pipeline to share ingest docs via multiprocessing.manager.dict (#1814 ) ### Description * If the contents of a doc were updated by the process of reading/downloading it, this was not being persisted. To fix this, the data being passed around was updated to use a multiprocessing safe dict rather than the json string. Now that dict is updated after the `get_file` method is called. * Wikipedia connector was updated to use a static filename rather than one requiring a call to fetch data. * The read config param `re_download` was not being leveraged by the source node, this was fixed. * Added fix: chunking and embedding order reversed so chunking runs before embeddings --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-25 22:04:27 +00:00
Roman Isecke	a2af72bb79	local connector metadata and deserialization fix (#1800 ) ### Description * Priority of this was to fix deserialization of ingest docs. Currently the source metadata wasn't being persisted * To help debug this, source metadata was added to the local ingest doc as well. * Unit test added to make sure the metadata itself was persisted. * As part of serialization, it was forcing docs to fetch source metadata if it hadn't already to add to the generated dict/json. This shouldn't have happened if the underlying variable `_source_metadata` was `None`. This way the doc can be serialized without any calls being made. * Serialization was moved to the `to_dict` method to make it more universal. --------- Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>	2023-10-23 15:51:52 +00:00
Roman Isecke	63861f537e	Add check for duplicate click options (#1775 ) ### Description Given that many of the options associated with the `Click` based cli ingest commands are added dynamically from a number of configs, a check was incorporated to make sure there were no duplicate entries to prevent new configs from overwriting already added options. ### Issues that were found and fixes: * duplicate api-key option set on Notion command conflicts with api key used for unstructured api. Added notion prefix. * retry logic configs had duplicates in biomed. Removed since this is not handled by the pipeline.	2023-10-20 14:00:19 +00:00
Roman Isecke	2e1404e02c	refactor: unstructured ingest as a pipeline (#1551 ) ### Description As we add more and more steps to the pipeline (i.e. chunking, embedding, table manipulation), it would help separate the responsibility of each of these into their own processes, running each in parallel using json files to share data across. This will also help guarantee data is serializable if this code was used in an actual pipeline. Following is a flow diagram of the proposed changes. As part of this change: * A parent pipeline class will be responsible for running each `node`, which can optionally be run via multiprocessing if it supports it, or not. Possible nodes at this moment: * Doc factory: creates all the ingest docs via the source connector * Source: reads/downloads all of the content to process to the local filesystem to the location set by the `download_dir` parameter. * Partition: runs partition on all of the downloaded content in json format. * Any number of reformat nodes that modify the partitioned content. This can include chunking, embedding, etc. * Write: push the final json into the destination via the destination connector * This pipeline relies on the information of the ingest docs to be available via their serialization. An optimization was introduced with the `IngestDocJsonMixin` which adds in all the `@property` fields to the serialized json already being created via the `DataClassJsonMixin` * For all intermediate steps (partitioning, reformatting), the content is saved to a dedicated location on the local filesystem. Right now it's set to `$HOME/.cache/unstructured/ingest/pipeline/STEP_NAME/`. * Minor changes: made sense to move some of the config parameters between the read and partition configs when I explicitly divided the responsibility to download vs partition the content in the pipeline. * The pipeline class only makes the doc factory, source and partition nodes required, keeping with the logic that has been supported so far. 
All reformatting nodes and write node are optional. * Long term, there should also be some changes to the base configs supported by the CLI to support pipeline specific configs, but for now what exists was used to minimize changes in this PR. * Final step to copy the final output to the location designated by the `_output_filename` value of the ingest doc. * Hashing occurs at each step by hashing the parameters of that step (i.e. partition configs) along with the previous step via the filename used. This allows each step to be the same _if_ all the parameters for it have not changed and the content so far is the same. * The only data that is shared and has writes to across processes is the dictionary of ingest json data. This dict is created using the `multiprocessing.manager.DictProxy` to make sure any interaction with it is behind a lock. ### Minor refactors included: * Utility methods added to extract configs from the click options * Utility method to add common options to click commands. * All writers moved to using the class approach which extracts a lot of the common code so there's less copy-paste when new runners are added. * Use `@property` for source metadata on base ingest doc to add logic to call `update_source_metadata` if it's still `None` at the time it's fetched. ### Additional bug fixes included * Fsspec connectors were not serializable due to the `ingest_doc_cls`. This was removed from the fields captured by the `@dataclass` decorator and added in a `__post_init__` method. * Various reddit connector params were missing. This doesn't have an explicit ingest test at the moment so was never caught. * Fsspec connector had the parent `update_source_metadata` misnamed as `update_source_metadata_metadata` so it was never being called. ### Flow Diagram ![ingest_pipeline](https://github.com/Unstructured-IO/unstructured/assets/136338424/be485606-cfe0-4931-8b81-c2bf569cf1e2)	2023-10-06 18:49:29 +00:00
shreyanid	32bfebccf7	feat: introduce language detection function for text partitioning function (#1453 ) ### Summary Uses `langdetect` to detect all languages present in the input document. ### Details - Converts all language codes (whether user inputted or detected using `langdetect`) to a standard ISO 639-3 code. - Adds `languages` field to the metadata - Will revisit how to nonstandardly represent simplified vs traditional Chinese scripts internally (separate PR). - Update ingest test results to add `languages` field to documents. Some other side effects are changes in order of some elements and changes in element categorization ### Test You can test the detect_languages function individually by importing the function and inputting a text sample and optionally a language: ``` text = "My lubimy mleko i chleb." doc_langs = detect_languages(text) print(doc_langs) ``` -> ['ces', 'pol', 'slk'] --------- Co-authored-by: Newel H <37004249+newelh@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: shreyanid <shreyanid@users.noreply.github.com> Co-authored-by: Trevor Bossert <37596773+tabossert@users.noreply.github.com> Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>	2023-09-26 18:09:27 +00:00
Newel H	cd704e873b	Feat: Create a naive hierarchy for elements (#1268 ) ## Summary By adding hierarchy to unstructured elements, users will have more information for implementing vector db/LLM chunking strategies. For example, text elements could be queried by their preceding title element. The hierarchy is implemented by a parent_id tag in the element's metadata. ### Features - Introduces a parent_id to ElementMetadata (The id of the parent element, not a pointer) - Creates a rule set for assigning hierarchies. Sensible default is assigned, with an optional override parameter - Sets element parent ids if there isn't an existing parent id or matches the ruleset ### How it works Hierarchies are assigned via a parent id field in element metadata. Elements are read sequentially and evaluated against a ruleset. For example take the following elements: 1. Title, "This is the Title" 2. Text, "this is the text" And the ruleset: `{"title": ["text"]}`. When evaluated, the parent_id of 2 will be the id of 1. The algorithm for determining this is more complex and resolves several edge cases, so please read the code for further details. ### Schema Changes ``` @dataclass class ElementMetadata: coordinates: Optional[CoordinatesMetadata] = None data_source: Optional[DataSourceMetadata] = None filename: Optional[str] = None file_directory: Optional[str] = None last_modified: Optional[str] = None filetype: Optional[str] = None attached_to_filename: Optional[str] = None + parent_id: Optional[Union[str, uuid.UUID, NoID, UUID]] = None + category_depth: Optional[int] = None ... 
``` ### Testing ``` from unstructured.partition.auto import partition from typing import List elements = partition(filename="./unstructured/example-docs/fake-html.html", strategy="auto") for element in elements: print( f"Category: {getattr(element, 'category', '')}\n"\ f"Text: {getattr(element, 'text', '')}\n" f"ID: {element.id}\n" \ f"Parent ID: {element.metadata.parent_id}\n"\ f"Depth: {element.metadata.category_depth}\n" \ ) ``` ### Additional Notes Implementing this feature revealed a possibly undesired side-effect in how element metadata are processed. In `unstructured/partition/common.py` the `_add_element_metadata` is invoked as part of the `add_metadata_with_filetype` decorator for filetype partitioning. This method is intended to add additional information to the metadata generated with the element including filename and filetype, however the existing metadata is merged into a newly created metadata object rather than the other way around. Because of the way it's structured, new metadata fields can easily be forgotten and pose debugging challenges to developers. This likely warrants a new issue. I'm guessing that the implementation is done this way to avoid issues with deserializing elements, but could be wrong. --------- Co-authored-by: Benjamin Torres <benjats07@users.noreply.github.com>	2023-09-14 11:23:16 -04:00
Roman Isecke	59e850bbd9	Roman/downstream connector cli subcommand (#1302 ) ### Description Update all other connectors to use the new downstream architecture that was recently introduced for the s3 connector. Closes #1313 and #1311	2023-09-11 11:40:56 -04:00
ryannikolaidis	92692ad8d7	fix: wrapped error handling for connectors (#1262 ) The CustomError that we use to wrap custom ingest errors inherits from BaseException rather than Exception (as we should, per specification [here](https://docs.python.org/3/library/exceptions.html#BaseException)). This resulted in exceptions not properly raising as expected. This PR changes the inheritance which resolves the known issue. Additionally, our base definition for get_file on IngestDoc was wrapped with SourceConnectionError, however this must be explicitly decorating each subclass definition in order to function. This PR does that. ## Testing Some unit test coverage was added for the error wrapping class, however this wasn't properly recreating the issue we are seeing when running ingest tests. To recreate that issue one can intentionally raise an exception in the [partition_file](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/ingest/interfaces.py#L214C9-L214C23) definition and then run any ingest test. Prior to this change: the code and logs suggest that everything ran without exception, but the partitioned output was not generated (as a result the test will fail without any clues as to what went wrong). With this update, the expected custom partition error, error message, and stack trace will be visible. --------- Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>	2023-09-04 20:52:32 +00:00
ryannikolaidis	076b1e38f4	feat: serialize ingest docs as json (#1178 )	2023-08-31 01:48:41 +00:00
ryannikolaidis	668d0f1b01	feat: per-process ingest connections (#1058 ) * adds per process connections for Google Drive connector	2023-08-17 17:34:08 +00:00
Matt Robinson	331c7faf38	build(deps): split up dependencies by document type (#986 ) * split dependencies by document type * make pip-compile with new requirements * add extra requirements to setup.py * add in all docs; re pip-compile * extra for all docs * add pandas to xlsx * dependency requires for tsv and csv * handling for doc, docx and odt * dependency check for pypandoc * required dependencies for pandoc files * xml and html * markdown * msg * add in pdf * add in pptx * add in excel * add lxml as base req * extra all docs for local inference * local inference installs all * pin pillow version * fixes for plain text tests * fixes for doc * update make commands * changelog and version * add xlrd * update pip-compile * pin numpy for python 3.8 support * more constraints * contraint on scipy * update install docs * constrain ipython * add outlook to pip-compile * more ipython constraints * add extras to dockerfile * pin office365 client * few doc tweaks * types as strings * last pip-compile * re pip-comple * make tidy * make tidy	2023-08-01 11:31:13 -04:00
cragwolfe	13d3559fa4	chore: rename Element's "date" field to "last_modified" (#997 ) Change the Element's date field name to the more specific last_modified so there is less room for confusion of what that field represents.	2023-08-01 02:55:43 +00:00
Matt Robinson	d9aed66b65	feat: add document date for remaining file types (#930 ) (#969 ) * feat: add document date for remaining file types (#930) * feat: add functions for getting modification date * feat: add date field to metadata from csv file * feat: add tests for csv patition * feat: add date field to metadata from html file * feat: add tests for html partition * fix: return file name onlyif possible * feat: add csv tests * fix: renaming * feat: add filed metadata_date as date of last mod * feat: add tests for partition_docx * feat: add filed metadata_date to .doc file * feat: add tests for partition_doc * feat: add metadata_date to .epub file * feat: add tests for partition_epub * fix: fix test mocking * feat: add metadata_date for image partition * feat: add test for image partition * feat: add coorrdinate system argument * feat: add date to element metadata * feat: add metadata_date for JSON partition * feat: add test for JSON partition * fix: rename variable * feat: add metadata_date for md partition * feat: add test for md partition * feat: update doc string * feat: add metadata_date for .odt partition * feat: update .odt string * feat: add metadata_date for .org partition * feat: add tests for .org partition * feat: add metadata_date for .pdf partition * feat: add tests for .pdf partition * feat: add metadata_date for .pptx partition * feat: add metadata_date for .ppt partition * feat: add tests for .ppt partition * feat: add tests for .pptx partition * feat: add metadata_date for .rst partition * feat: add tests for .rst partition * fix: get modification date after file checking * feat: add tests for .rtf partition * feat: add tests for .rtf partition * feat: add metadata_date for .txt partition * fix: rename argument * feat: add tests for .txt partition * feat: update doc string rst patrition function * feat: add metadata_date for .tsv partition * feat: add tests for .tsv partition * feat: add metadata_date for .xlsx partition * feat: add tests for 
.xlsx partition * fix: clean up * feat: add tests for .xml partition * feat: add tests for .xml partition * fix: use `or ` instead of `if` * fix: fix epub tests * fix: remove not used code * fix: add try block for getting file name * fix: applying linter changes * fix: fix test_partition_file * feat: add metadata_date for email * feat: add test for email partition * feat: add metadata_date for msg * feat: add tests for msg partition * feat: update CHANGELOG file * fix: update partitions doc string * don't push * fix: clean up code * linting, linting, linting * remove unnecessary example doc * update version and changelog * ingest-test-fixtures-update * set metadata date in test --------- Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io> * ingest-test-fixtures-update * Update ingest test fixtures (#970) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> * Revert "Update ingest test fixtures (#970)" This reverts commit 1d182ae474b3545b15551fffc15977757d552cd2. * remove date from metadata in outputs * update docstring ordering * remove print * remove print * remove print * linting, linting, linting * fix version and test * fix changelog * fix changelog * update version --------- Co-authored-by: kravetsmic <79907559+kravetsmic@users.noreply.github.com> Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-07-26 15:10:14 -04:00
David Potter	f7e46af22f	feat: adds Outlook connector (#939 ) * bonus: fixes issue with email partitioning where From field was being assigned the To field value.	2023-07-26 04:09:26 +00:00
Emily Chen	24ebd0fa4e	chore: Move coordinate details from Element model to a metadata model (#827 )	2023-07-05 11:25:11 -07:00
ryannikolaidis	a5c7e5b41e	chore: DRY ingest connectors (#769 )	2023-06-26 20:12:05 +00:00
qued	db4c5dfdf7	feat: coordinate systems (#774 ) Added the CoordinateSystem class for tracking the system in which coordinates are represented, and changing the system if desired.	2023-06-20 11:19:55 -05:00
Matt Robinson	a800967478	enhancements: add page numbers for word docs when available (#750 ) * add support for page numbers in docx when present * version and changelog * add comment on page numbers * add header and footer to doc elements list * update integrations docs * include_page_breaks kwarg for doc and docx * merge element metadata for pagebreaks * fix typo * fix changelog typo * change page number default to None * add initial_page_number kwarg * make page number tests in pdf more explicit * revert test file * update ingest tests * update test fixture outputs * updates to IRS forms fixtures * ingest-test-fixtures-update * Update ingest test fixtures (#759) Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com> --------- Co-authored-by: Unstructured-DevOps <111007769+Unstructured-DevOps@users.noreply.github.com> Co-authored-by: MthwRobinson <MthwRobinson@users.noreply.github.com>	2023-06-15 12:21:17 -04:00
ryannikolaidis	2094b976cf	feat: adds data_source metadata to ElementMetadata (#690 )	2023-06-07 21:22:18 -07:00
ryannikolaidis	29f0deda63	test: revive ingest unit tests (#688 )	2023-06-06 09:03:13 -07:00
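
Note on commits cc05e948ff and 0ca154a0f3: the MongoDB URI redaction they describe can be sketched roughly as follows. This is an illustration only, not the repository's code. The function name `redact_uri_password` and the `***REDACTED***` placeholder are assumptions made for the example. In line with the fix in 0ca154a0f3, the password is never percent-decoded, so a password containing encoded special characters is still matched and replaced.

```python
from urllib.parse import urlsplit

# Assumed placeholder text; the exact string used by the library may differ.
PLACEHOLDER = "***REDACTED***"


def redact_uri_password(uri: str, placeholder: str = PLACEHOLDER) -> str:
    """Return `uri` with the password in its userinfo part replaced.

    Hypothetical helper: it works on the raw (still percent-encoded) URI,
    so no decoding step can make the password fail to match.
    """
    netloc = urlsplit(uri).netloc
    # The last "@" separates "user:password" from the host list.
    userinfo, separator, host = netloc.rpartition("@")
    if not separator or ":" not in userinfo:
        return uri  # no password present, nothing to hide
    user, _, _password = userinfo.partition(":")
    redacted_netloc = f"{user}:{placeholder}@{host}"
    # The network location is the first occurrence of this text in the URI.
    return uri.replace(netloc, redacted_netloc, 1)


print(redact_uri_password("mongodb://user:p%40ss%3Aword@localhost:27017/db"))
# mongodb://user:***REDACTED***@localhost:27017/db
```

A URI without credentials is returned unchanged, which keeps the helper safe to call on any connection string before logging it.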

41 Commits
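
Note on commit cd704e873b: the basic idea of assigning `parent_id` from a ruleset can be sketched as below. The commit message itself says the real algorithm is more complex and handles several edge cases, so treat this as a simplified illustration. The `Element` dataclass, the `assign_parent_ids` function, and the choice to leave an existing `parent_id` untouched are all assumptions made for the example, not the library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for an unstructured element."""

    id: str
    category: str
    text: str
    parent_id: Optional[str] = None


def assign_parent_ids(
    elements: List[Element], ruleset: Dict[str, List[str]]
) -> List[Element]:
    """Give each element the id of the nearest preceding allowed parent.

    `ruleset` maps a parent category to the child categories it may own,
    e.g. {"title": ["text"]}.
    """
    # parent category -> (position, id) of the latest element seen so far
    latest: Dict[str, Tuple[int, str]] = {}
    for position, element in enumerate(elements):
        candidates = [
            latest[parent_category]
            for parent_category, child_categories in ruleset.items()
            if element.category in child_categories and parent_category in latest
        ]
        # Assumption: an already assigned parent_id is kept as-is.
        if candidates and element.parent_id is None:
            element.parent_id = max(candidates)[1]  # highest position wins
        if element.category in ruleset:
            latest[element.category] = (position, element.id)
    return elements


docs = assign_parent_ids(
    [
        Element(id="1", category="title", text="This is the Title"),
        Element(id="2", category="text", text="this is the text"),
    ],
    {"title": ["text"]},
)
print(docs[1].parent_id)  # 1
```

With the two elements and the ruleset from the commit message, the text element ends up with the title's id as its `parent_id`, and the title itself keeps `None`.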