From b2f37c3effeb114bb2d337e4aa0ff19623662b93 Mon Sep 17 00:00:00 2001
From: Sebastian Laverde Alfonso
Date: Fri, 17 Mar 2023 20:11:38 +0100
Subject: [PATCH] Docs: add Integrations section (#372)

* docs: update index, add integrations

* docs: fix typos

* docs: create integrations.rst section structure

* docs: descriptions and use for 8 integrations

* refactor: SEC example in Label Studio section

* Apply suggestions from code review

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

* docs: change links order and refactor|paraphrase

---------

Co-authored-by: qued <64741807+qued@users.noreply.github.com>
---
 docs/source/bricks.rst       |  8 ++---
 docs/source/index.rst        |  3 ++
 docs/source/integrations.rst | 67 ++++++++++++++++++++++++++++++++++++
 3 files changed, 74 insertions(+), 4 deletions(-)
 create mode 100644 docs/source/integrations.rst

diff --git a/docs/source/bricks.rst b/docs/source/bricks.rst
index 5f8f3d81a..bed7d4190 100644
--- a/docs/source/bricks.rst
+++ b/docs/source/bricks.rst
@@ -119,7 +119,7 @@ faster processing and `"hi_res"` for
 ------------------
 
 The ``partition_docx`` partitioning brick pre-processes Microsoft Word documents
-saved in the ``.docx`` format. This staging brick uses a combination of the styling
+saved in the ``.docx`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_docx`` can take a filename or file-like object
 as input, as shown in the two examples below.
@@ -148,7 +148,7 @@ Examples:
 ------------------
 
 The ``partition_doc`` partitioning brick pre-processes Microsoft Word documents
-saved in the ``.doc`` format. This staging brick uses a combination of the styling
+saved in the ``.doc`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_doc`` can take a filename or file-like object
 as input.
@@ -169,7 +169,7 @@ Examples:
 ---------------------
 
 The ``partition_pptx`` partitioning brick pre-processes Microsoft PowerPoint documents
-saved in the ``.pptx`` format. This staging brick uses a combination of the styling
+saved in the ``.pptx`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_pptx`` can take a filename or file-like object
 as input, as shown in the two examples below.
@@ -190,7 +190,7 @@ Examples:
 ---------------------
 
 The ``partition_ppt`` partitioning brick pre-processes Microsoft PowerPoint documents
-saved in the ``.ppt`` format. This staging brick uses a combination of the styling
+saved in the ``.ppt`` format. This partition brick uses a combination of the styling
 information in the document and the structure of the text to determine the type
 of a text element. The ``partition_ppt`` can take a filename or file-like object.
 ``partition_ppt`` uses ``libreoffice`` to convert the file to ``.pptx`` and then
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 663119ceb..da80c9a94 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -20,6 +20,8 @@ Library Documentation
 :doc:`examples`
   Examples of other types of workflows within the ``unstructured`` package.
 
+:doc:`integrations`
+  We make it easy for you to connect your output with other popular ML services.
 
 ..
    Hidden TOCs
@@ -32,3 +34,4 @@ Library Documentation
    getting_started
    bricks
    examples
+   integrations
diff --git a/docs/source/integrations.rst b/docs/source/integrations.rst
new file mode 100644
index 000000000..cc28bd785
--- /dev/null
+++ b/docs/source/integrations.rst
@@ -0,0 +1,67 @@
+Integrations
+============
+Integrate your model development pipeline with your favorite machine learning frameworks and libraries,
+and prepare your data for ingestion into downstream systems. Most of our integrations come in the form of
+`staging bricks `_,
+which take a list of ``Element`` objects as input and return formatted dictionaries as output.
+
+
+``Integration with Argilla``
+---------------------------------
+You can convert a list of ``Text`` elements to an `Argilla `_ ``Dataset`` using the `stage_for_argilla `_ staging brick.
+Specify the type of dataset to be generated using the ``argilla_task`` parameter. Valid values are ``"text_classification"``,
+``"token_classification"``, and ``"text2text"``. Follow the link for more details on usage.
+
+
+``Integration with Datasaur``
+---------------------------------
+You can format a list of ``Text`` elements as input to token-based tasks in `Datasaur `_ using the `stage_for_datasaur `_ staging brick.
+The output is a list of dictionaries, each with a ``"text"`` key containing the text of the element and an ``"entities"`` key containing an empty list.
+Follow the link to learn how to customise your entities and for more details on usage.
+
+
+``Integration with Hugging Face``
+---------------------------------
+You can prepare ``Text`` elements for processing in Hugging Face `Transformers `_
+pipelines by splitting the elements into chunks that fit into the model's attention window using the `stage_for_transformers `_ staging brick.
+You can customise the transformation by defining the ``buffer`` and ``window_size``, the ``split_function``, and the ``chunk_separator``. If you need to operate on
+text directly instead of ``unstructured`` ``Text`` objects, use the `chunk_by_attention_window `_ helper function. Follow the links for more details on usage.
+
+
+``Integration with Labelbox``
+---------------------------------
+You can format your outputs for use with `Labelbox `_ using the `stage_for_label_box `_ staging brick.
+Labelbox accepts cloud-hosted data and does not support importing text directly. With this integration you can stage the data files in the ``output_directory``
+to be uploaded to a cloud storage service (such as an S3 bucket) and get a config of type ``List[Dict[str, Any]]`` that can be written to a ``.json`` file and imported into Labelbox.
+Follow the link to see how to generate the ``config.json`` file that can be used with Labelbox, how to upload the staged data files to an S3 bucket, and for more details on usage.
+
+
+``Integration with Label Studio``
+---------------------------------
+You can format your outputs for upload to `Label Studio `_ using the `stage_for_label_studio `_ staging brick. After running ``stage_for_label_studio``, you can write the results
+to a JSON file that is ready to be included in a new Label Studio project. You can also include pre-annotations and predictions
+as part of your upload.
+
+See our example notebook `here `_ for formatting and uploading the risk section from an SEC filing to Label Studio for a sentiment analysis labeling task.
+Follow the link for more details on usage, and check the `Label Studio docs `_ for a full list of options for labels and annotations.
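+
+As a minimal sketch of that workflow (assuming you already have a list of ``unstructured``
+elements; the element text and output file name below are only illustrative), staging the
+results and writing them to a file ready for import might look like this:
+
+.. code:: python
+
+    import json
+
+    from unstructured.documents.elements import NarrativeText
+    from unstructured.staging.label_studio import stage_for_label_studio
+
+    # Any list of Text elements produced by a partitioning brick works here
+    elements = [NarrativeText(text="Example text for a Label Studio upload.")]
+
+    # Convert the elements into dictionaries in the Label Studio task format
+    label_studio_data = stage_for_label_studio(elements)
+
+    # Write the staged tasks to a JSON file that can be imported into a project
+    with open("label_studio.json", "w") as f:
+        json.dump(label_studio_data, f, indent=4)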
+
+
+``Integration with LangChain``
+---------------------------------
+Our integration with `LangChain `_ makes it incredibly easy to combine language models with your data, no matter what form it is in.
+The `Unstructured.io File Loader `_ extracts the text from a variety of unstructured text files using our ``unstructured`` library.
+It is designed to load data into `LlamaIndex `_ and/or to be used as a Tool in a LangChain Agent.
+See `here `_ for more `LlamaHub `_ examples.
+
+To use the ``Unstructured.io File Loader`` you will need LlamaIndex 🦙 (GPT Index) installed in your environment.
+Just run ``pip install llama-index`` and then pass in a ``Path`` to a local file. Optionally, you may specify ``split_documents``
+if you want each element generated by ``unstructured`` to be placed in a separate document. Here is a simple example of how to use it:
+
+.. code:: python
+
+    from pathlib import Path
+    from llama_index import download_loader
+
+    UnstructuredReader = download_loader("UnstructuredReader")
+
+    loader = UnstructuredReader()
+    documents = loader.load_data(file=Path('./10k_filing.html'))
+
+
+``Integration with Pandas``
+---------------------------------
+You can convert a list of ``Element`` objects to a Pandas dataframe with columns for
+the text from each element and its type, such as ``NarrativeText`` or ``Title``, using the `convert_to_dataframe `_ staging brick. Follow the link for more details on usage.
+
+
+``Integration with Prodigy``
+---------------------------------
+You can format your JSON or CSV outputs for use with `Prodigy `_ using the `stage_for_prodigy `_ and `stage_csv_for_prodigy `_ staging bricks. After running ``stage_for_prodigy`` or
+``stage_csv_for_prodigy``, you can write the results to a ``.json``, ``.jsonl``, or ``.csv`` file that is ready to be used with Prodigy. Follow the links for more details on usage.
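+
+As a rough sketch (assuming you already have a list of elements; the texts and the output
+file name are only illustrative), the ``stage_for_prodigy`` path from elements to a
+newline-delimited JSON file could look like this:
+
+.. code:: python
+
+    import json
+
+    from unstructured.documents.elements import NarrativeText, Title
+    from unstructured.staging.prodigy import stage_for_prodigy
+
+    elements = [
+        Title(text="Risk Factors"),
+        NarrativeText(text="Our business could be adversely affected by..."),
+    ]
+
+    # Convert the elements into dictionaries in the format Prodigy expects
+    prodigy_data = stage_for_prodigy(elements)
+
+    # Prodigy consumes newline-delimited JSON, so write one task per line
+    with open("prodigy_tasks.jsonl", "w") as f:
+        for task in prodigy_data:
+            f.write(json.dumps(task) + "\n")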