"""
Salesforce Connector
Able to download Account, Case, Campaign, EmailMessage, Lead
Salesforce returns everything as a list of JSON records.
This saves each entry as a separate file to be partitioned.
Uses JWT authorization:
https://developer.salesforce.com/docs/atlas.en-us.sfdx_dev.meta/sfdx_dev/sfdx_dev_auth_key_and_cert.htm
https://developer.salesforce.com/docs/atlas.en-us.sfdx_dev.meta/sfdx_dev/sfdx_dev_auth_connected_app.htm
"""
import os
import typing as t
from collections import OrderedDict
from dataclasses import dataclass, field
from datetime import datetime
from email.utils import formatdate
from pathlib import Path
from string import Template
from textwrap import dedent

from dateutil import parser  # type: ignore

from unstructured.ingest.error import SourceConnectionError, SourceConnectionNetworkError
from unstructured.ingest.interfaces import (
    BaseConnectorConfig,
    BaseIngestDoc,
    BaseSourceConnector,
    IngestDocCleanupMixin,
    SourceConnectorCleanupMixin,
    SourceMetadata,
)
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies


class MissingCategoryError(Exception):
    """There are no categories with that name."""


SALESFORCE_API_VERSION = "57.0"

ACCEPTED_CATEGORIES = ["Account", "Case", "Campaign", "EmailMessage", "Lead"]

EMAIL_TEMPLATE = Template(
    """MIME-Version: 1.0
Date: $date
Message-ID: $message_identifier
Subject: $subject
From: $from_email
To: $to_email
Content-Type: multipart/alternative; boundary="00000000000095c9b205eff92630"
--00000000000095c9b205eff92630
Content-Type: text/plain; charset="UTF-8"
$textbody
--00000000000095c9b205eff92630
Content-Type: text/html; charset="UTF-8"
$htmlbody
--00000000000095c9b205eff92630--
""",
)


@dataclass
class SimpleSalesforceConfig(BaseConnectorConfig):
    """Connector specific attributes"""
    categories: t.List[str]
    username: str
    consumer_key: str
    private_key_path: str
    recursive: bool = False

    @requires_dependencies(["simple_salesforce"], extras="salesforce")
    def get_client(self):
        from simple_salesforce import Salesforce

        return Salesforce(
            username=self.username,
            consumer_key=self.consumer_key,
            privatekey_file=self.private_key_path,
            version=SALESFORCE_API_VERSION,
        )


@dataclass
class SalesforceIngestDoc(IngestDocCleanupMixin, BaseIngestDoc):
    connector_config: SimpleSalesforceConfig
    record_type: str
    record_id: str
    registry_name: str = "salesforce"
    _record: OrderedDict = field(default_factory=lambda: OrderedDict())

    @property
    def record(self):
        if not self._record:
            self._record = self.get_record()
        return self._record

    def _tmp_download_file(self) -> Path:
        if self.record_type == "EmailMessage":
            record_file = self.record_id + ".eml"
        elif self.record_type in ["Account", "Lead", "Case", "Campaign"]:
            record_file = self.record_id + ".xml"
        else:
            raise MissingCategoryError(
                f"There are no categories with the name: {self.record_type}",
            )
        return Path(self.read_config.download_dir) / self.record_type / record_file

    @property
    def _output_filename(self) -> Path:
        record_file = self.record_id + ".json"
        return Path(self.processor_config.output_dir) / self.record_type / record_file

    def _create_full_tmp_dir_path(self):
        self._tmp_download_file().parent.mkdir(parents=True, exist_ok=True)

    def _xml_for_record(self, record: OrderedDict) -> str:
        """Creates partitionable xml file from a record"""
        import xml.etree.ElementTree as ET

        def flatten_dict(data, parent, prefix=""):
            for key, value in data.items():
                if isinstance(value, OrderedDict):
                    flatten_dict(value, parent, prefix=f"{prefix}{key}.")
                else:
                    item = ET.Element("item")
                    item.text = f"{prefix}{key}: {value}"
                    parent.append(item)

        root = ET.Element("root")
        flatten_dict(record, root)
        xml_string = ET.tostring(root, encoding="utf-8", xml_declaration=True).decode()
        return xml_string

    def _eml_for_record(self, email_json: t.Dict[str, t.Any]) -> str:
        """Recreates standard expected .eml format using template."""
        eml = EMAIL_TEMPLATE.substitute(
            date=formatdate(parser.parse(email_json.get("MessageDate")).timestamp()),
            message_identifier=email_json.get("MessageIdentifier"),
            subject=email_json.get("Subject"),
            from_email=email_json.get("FromAddress"),
            to_email=email_json.get("ToAddress"),
            textbody=email_json.get("TextBody"),
            # TODO: This is a hack to get emails to process correctly.
            # The HTML partitioner seems to have issues with <br> and text without tags like <p>
            htmlbody=email_json.get("HtmlBody", "")  # "" because you can't .replace None
            .replace("<br />", "<p>")
            .replace("<body", "<body><p"),
        )
        return dedent(eml)

    @SourceConnectionNetworkError.wrap
    def _get_response(self):
        client = self.connector_config.get_client()
        return client.query_all(
            f"select FIELDS(STANDARD) from {self.record_type} where Id='{self.record_id}'",
        )

    def get_record(self) -> OrderedDict:
        # Get record from Salesforce based on id
        response = self._get_response()
        logger.debug(f"response from salesforce record request: {response}")
        records = response["records"]
        if not records:
            raise ValueError(f"No record found with record id {self.record_id}: {response}")
        record_json = records[0]
        return record_json

    def update_source_metadata(self) -> None:  # type: ignore
        record_json = self.record

        date_format = "%Y-%m-%dT%H:%M:%S.000+0000"
        self.source_metadata = SourceMetadata(
            date_created=datetime.strptime(record_json["CreatedDate"], date_format).isoformat(),
            date_modified=datetime.strptime(
                record_json["LastModifiedDate"],
                date_format,
            ).isoformat(),
            # SystemModstamp is the timestamp of the last modification, whether made
            # by a person or an automated system
            version=record_json.get("SystemModstamp"),
            source_url=record_json["attributes"].get("url"),
            exists=True,
        )

    @SourceConnectionError.wrap
    @BaseIngestDoc.skip_if_file_exists
    def get_file(self):
        """Saves individual json records locally."""
        self._create_full_tmp_dir_path()
        logger.debug(f"Writing file {self.record_id} - PID: {os.getpid()}")

        record = self.record

        self.update_source_metadata()

        try:
            if self.record_type == "EmailMessage":
                document = self._eml_for_record(record)
            else:
                document = self._xml_for_record(record)

            with open(self._tmp_download_file(), "w") as page_file:
                page_file.write(document)

        except Exception as e:
            logger.error(
                f"Error while downloading and saving file: {self.record_id}.",
            )
            logger.error(e)

    @property
    def filename(self):
        """The filename of the file created from a Salesforce record"""
        return self._tmp_download_file()


@dataclass
class SalesforceSourceConnector(SourceConnectorCleanupMixin, BaseSourceConnector):
    connector_config: SimpleSalesforceConfig

    def __post_init__(self):
        self.ingest_doc_cls: t.Type[SalesforceIngestDoc] = SalesforceIngestDoc

    def initialize(self):
        pass

    @requires_dependencies(["simple_salesforce"], extras="salesforce")
    def check_connection(self):
        from simple_salesforce.exceptions import SalesforceError

        try:
            self.connector_config.get_client()
        except SalesforceError as salesforce_error:
            logger.error(f"failed to validate connection: {salesforce_error}", exc_info=True)
            raise SourceConnectionError(f"failed to validate connection: {salesforce_error}")

    @requires_dependencies(["simple_salesforce"], extras="salesforce")
    def get_ingest_docs(self) -> t.List[SalesforceIngestDoc]:
        """Get Salesforce Ids for the records.
        Send them to next phase where each doc gets downloaded into the
        appropriate format for partitioning.
        """
        from simple_salesforce.exceptions import SalesforceMalformedRequest

        client = self.connector_config.get_client()

        ingest_docs = []
        for record_type in self.connector_config.categories:
            if record_type not in ACCEPTED_CATEGORIES:
                raise ValueError(f"{record_type} not currently an accepted Salesforce category")

            try:
                # Get ids from Salesforce
                records = client.query_all(
                    f"select Id from {record_type}",
                )
                for record in records["records"]:
                    ingest_docs.append(
                        SalesforceIngestDoc(
                            connector_config=self.connector_config,
                            processor_config=self.processor_config,
                            read_config=self.read_config,
                            record_type=record_type,
                            record_id=record["Id"],
                        ),
                    )
            except SalesforceMalformedRequest as e:
                raise SalesforceMalformedRequest(f"Problem with Salesforce query: {e}")

        return ingest_docs