docs: Add source code links to bricks' docs (#923)

Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
fran-unstructured 2023-07-13 13:27:47 -04:00 committed by GitHub
parent 9b830693bd
commit 26da51c765


@ -131,6 +131,8 @@ to disable SSL verification in the request.
elements = partition(url=url)
elements = partition(url=url, content_type="text/markdown")
For more information about the ``partition`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/auto.py>`_.
``partition_csv``
------------------
@ -148,22 +150,7 @@ Examples:
elements = partition_csv(filename="example-docs/stanley-cups.csv")
print(elements[0].metadata.text_as_html)
``partition_tsv``
------------------
The ``partition_tsv`` function pre-processes TSV files. The output is a single
``Table`` element. The ``text_as_html`` attribute in the element metadata will
contain an HTML representation of the table.
Examples:
.. code:: python
from unstructured.partition.tsv import partition_tsv
elements = partition_tsv(filename="example-docs/stanley-cups.tsv")
print(elements[0].metadata.text_as_html)
For more information about the ``partition_csv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/csv.py>`_.
``partition_doc``
@ -186,6 +173,8 @@ Examples:
elements = partition_doc(filename="example-docs/fake.doc")
For more information about the ``partition_doc`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/doc.py>`_.
``partition_docx``
------------------
@ -228,6 +217,8 @@ insert page breaks when you save the document. If your Word document renderer do
you may not see page numbers in the output even if you see them visually when you open the
document. If that is the case, you can try saving the document with a different renderer.
For more information about the ``partition_docx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py>`_.
``partition_email``
---------------------
@ -288,6 +279,8 @@ workflow looks like:
filename=filename, process_attachments=True, attachment_partitioner=partition
)
For more information about the ``partition_email`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/email.py>`_.
``partition_epub``
---------------------
@ -306,6 +299,8 @@ Examples:
elements = partition_epub(filename="example-docs/winter-sports.epub")
For more information about the ``partition_epub`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/epub.py>`_.
``partition_html``
---------------------
@ -361,6 +356,8 @@ If ``html_assemble_articles`` is ``True``, each ``<article>`` tag will be treate
If ``html_assemble_articles`` is ``True`` and no ``<article>`` tags are present, the behavior
is the same as ``html_assemble_articles=False``.
For more information about the ``partition_html`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/html.py>`_.
``partition_image``
---------------------
@ -416,6 +413,8 @@ have the Korean language pack for Tesseract installed on your system.
filename = "example-docs/english-and-korean.png"
elements = partition_image(filename=filename, ocr_languages="eng+kor", strategy="ocr_only")
For more information about the ``partition_image`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/image.py>`_.
``partition_md``
---------------------
@ -432,6 +431,8 @@ Examples:
elements = partition_md(filename="README.md")
For more information about the ``partition_md`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/md.py>`_.
``partition_msg``
-----------------
@ -470,6 +471,8 @@ workflow looks like:
filename=filename, process_attachments=True, attachment_partitioner=partition
)
For more information about the ``partition_msg`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/msg.py>`_.
``partition_multiple_via_api``
------------------------------
@ -509,6 +512,8 @@ Examples:
files = [stack.enter_context(open(filename, "rb")) for filename in filenames]
documents = partition_multiple_via_api(files=files, file_filenames=filenames)
For more information about the ``partition_multiple_via_api`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/api.py>`_.
``partition_odt``
------------------
@ -525,6 +530,28 @@ Examples:
elements = partition_odt(filename="example-docs/fake.odt")
For more information about the ``partition_odt`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/odt.py>`_.
``partition_org``
---------------------
The ``partition_org`` function processes Org Mode (``.org``) documents. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_org``.
Examples:
.. code:: python
from unstructured.partition.org import partition_org
elements = partition_org(filename="example-docs/README.org")
For more information about the ``partition_org`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/org.py>`_.
``partition_pdf``
---------------------
@ -603,6 +630,8 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
For more information about the ``partition_pdf`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/pdf.py>`_.
``partition_ppt``
---------------------
@ -623,6 +652,8 @@ Examples:
elements = partition_ppt(filename="example-docs/fake-power-point.ppt")
For more information about the ``partition_ppt`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/ppt.py>`_.
``partition_pptx``
---------------------
@ -644,23 +675,7 @@ Examples:
with open("example-docs/fake-power-point.pptx", "rb") as f:
    elements = partition_pptx(file=f)
``partition_org``
---------------------
The ``partition_org`` function processes Org Mode (``.org``) documents. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_org``.
Examples:
.. code:: python
from unstructured.partition.org import partition_org
elements = partition_org(filename="example-docs/README.org")
For more information about the ``partition_pptx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/pptx.py>`_.
``partition_rst``
@ -680,6 +695,9 @@ Examples:
elements = partition_rst(filename="example-docs/README.rst")
For more information about the ``partition_rst`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/rst.py>`_.
``partition_rtf``
---------------------
@ -697,6 +715,8 @@ Examples:
elements = partition_rtf(filename="example-docs/fake-doc.rtf")
For more information about the ``partition_rtf`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/rtf.py>`_.
``partition_text``
---------------------
@ -746,6 +766,27 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
For more information about the ``partition_text`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text.py>`_.
``partition_tsv``
------------------
The ``partition_tsv`` function pre-processes TSV files. The output is a single
``Table`` element. The ``text_as_html`` attribute in the element metadata will
contain an HTML representation of the table.
Examples:
.. code:: python
from unstructured.partition.tsv import partition_tsv
elements = partition_tsv(filename="example-docs/stanley-cups.tsv")
print(elements[0].metadata.text_as_html)
For more information about the ``partition_tsv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/tsv.py>`_.
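As a rough illustration of what an HTML representation of a TSV table can look like, the sketch below uses only the standard library. The helper ``tsv_to_html`` and the sample input are hypothetical; this is not how ``partition_tsv`` is implemented, and the markup stored in ``text_as_html`` may differ.

```python
import csv
import html
import io


def tsv_to_html(tsv_text: str) -> str:
    """Render tab-separated text as a minimal HTML table (hypothetical helper)."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


# Made-up sample input, used only to show the shape of the output.
sample = "name\tcount\nalpha\t1\n"
print(tsv_to_html(sample))
```

Cell values are passed through ``html.escape`` so that characters such as ``<`` in the data do not break the table markup.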
``partition_via_api``
---------------------
@ -802,6 +843,7 @@ documentation on how to run the API as a container locally.
filename=filename, api_url="http://localhost:5000/general/v0/general"
)
For more information about the ``partition_via_api`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/api.py>`_.
``partition_xlsx``
@ -821,6 +863,8 @@ Examples:
elements = partition_xlsx(filename="example-docs/stanley-cups.xlsx")
print(elements[0].metadata.text_as_html)
For more information about the ``partition_xlsx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/xlsx.py>`_.
``partition_xml``
-----------------
@ -846,6 +890,7 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.
For more information about the ``partition_xml`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/xml.py>`_.
########
@ -931,6 +976,8 @@ Examples:
# The output should be "Hello 😀"
elements[0].text
For more information about the ``bytes_string_to_string`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean``
---------
@ -959,6 +1006,8 @@ Examples:
# Returns "ITEM 1A: RISK FACTORS"
clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)
For more information about the ``clean`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_bullets``
-----------------
@ -978,6 +1027,8 @@ Examples:
# Returns "I love Morse Code! ●●●"
clean_bullets("I love Morse Code! ●●●")
For more information about the ``clean_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_dashes``
----------------
@ -994,6 +1045,8 @@ Examples:
# Returns "ITEM 1A: RISK FACTORS"
clean_dashes("ITEM 1A: RISK-FACTORS\u2013")
For more information about the ``clean_dashes`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_extra_whitespace``
--------------------------
@ -1010,6 +1063,8 @@ Examples:
# Returns "ITEM 1A: RISK FACTORS"
clean_extra_whitespace("ITEM 1A: RISK FACTORS\n")
For more information about the ``clean_extra_whitespace`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_non_ascii_chars``
-------------------------
@ -1027,6 +1082,8 @@ Examples:
# Returns "This text containsnon-ascii characters!"
clean_non_ascii_chars(text)
For more information about the ``clean_non_ascii_chars`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_ordered_bullets``
-------------------------
@ -1045,6 +1102,8 @@ Examples:
# Returns "This is a very important point ●"
clean_ordered_bullets("a.b This is a very important point ●")
For more information about the ``clean_ordered_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_postfix``
-----------------
@ -1068,6 +1127,8 @@ Examples:
# Returns "The end!"
clean_postfix(text, r"(END|STOP)", ignore_case=True)
For more information about the ``clean_postfix`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``clean_prefix``
----------------
@ -1091,6 +1152,8 @@ Examples:
# Returns "This is the best summary of all time!"
clean_prefix(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True)
For more information about the ``clean_prefix`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
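For readers who want to see the idea behind pattern-based prefix removal, here is a minimal standard-library sketch. ``strip_prefix`` is a hypothetical stand-in written for this illustration, not the ``clean_prefix`` implementation, and the input string is assumed.

```python
import re


def strip_prefix(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Remove a regex match anchored at the start of the text (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # "^" anchors the match to the start of the string; the non-capturing
    # group keeps alternations such as "A|B" inside that anchor.
    stripped = re.sub(rf"^(?:{pattern})", "", text, flags=flags)
    return stripped.lstrip()


# Assumed input, chosen to mirror the documented example.
text = "SUMMARY: This is the best summary of all time!"
print(strip_prefix(text, r"(SUMMARY|DESCRIPTION):"))
```

A postfix variant would follow the same shape, anchoring the pattern with ``$`` and trimming trailing whitespace instead.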
``clean_trailing_punctuation``
-------------------------------
@ -1106,6 +1169,8 @@ Examples:
# Returns "ITEM 1A: RISK FACTORS"
clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")
For more information about the ``clean_trailing_punctuation`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``extract_datetimetz``
----------------------
@ -1125,6 +1190,8 @@ object from the input string.
# Returns datetime.datetime(2021, 3, 26, 11, 4, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=43200)))
extract_datetimetz(text)
For more information about the ``extract_datetimetz`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_email_address``
--------------------------
@ -1142,6 +1209,8 @@ addresses in the input string.
# Returns "['me@email.com', 'you@email.com']"
extract_email_address(text)
For more information about the ``extract_email_address`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_ip_address``
------------------------
@ -1159,6 +1228,8 @@ returns a list of all IP address in input string.
# Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
extract_ip_address(text)
For more information about the ``extract_ip_address`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_ip_address_name``
----------------------------
@ -1178,6 +1249,8 @@ IP addresses in the input string.
# Returns "['ABC.DEF.local', 'ABC.DEF.local2']"
extract_ip_address_name(text)
For more information about the ``extract_ip_address_name`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_mapi_id``
----------------------
@ -1197,6 +1270,8 @@ containing the ``mapi id`` in the input string.
# Returns "['32.88.5467.123']"
extract_mapi_id(text)
For more information about the ``extract_mapi_id`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_ordered_bullets``
---------------------------
@ -1215,6 +1290,8 @@ Examples:
# Returns ("a", "1", None)
extract_ordered_bullets("a.1 This is a very important point")
For more information about the ``extract_ordered_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_text_after``
----------------------
@ -1238,6 +1315,8 @@ Examples:
# Returns "Look at me, I'm flying!"
extract_text_after(text, r"SPEAKER \d{1}:")
For more information about the ``extract_text_after`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
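The behavior shown in the example can be approximated with ``re.search``. The sketch below is an assumption-laden illustration: ``text_after`` is a hypothetical helper, the input string is assumed, and raising ``ValueError`` on a missing match is a choice made here rather than documented library behavior.

```python
import re


def text_after(text: str, pattern: str) -> str:
    """Return the text following the first match of pattern (hypothetical helper)."""
    match = re.search(pattern, text)
    if match is None:
        # Design choice for this sketch only.
        raise ValueError(f"pattern {pattern!r} not found")
    return text[match.end():].strip()


# Assumed input, chosen to mirror the documented example.
print(text_after("SPEAKER 1: Look at me, I'm flying!", r"SPEAKER \d{1}:"))
```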
``extract_text_before``
-----------------------
@ -1261,6 +1340,8 @@ Examples:
# Returns "Here I am!"
extract_text_before(text, r"STOP")
For more information about the ``extract_text_before`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
``extract_us_phone_number``
---------------------------
@ -1276,6 +1357,8 @@ Examples:
# Returns "215-867-5309"
extract_us_phone_number("Phone number: 215-867-5309")
For more information about the ``extract_us_phone_number`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
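A simplified regular expression is enough to show the general approach to pulling a US phone number out of free text. The pattern and the ``find_us_phone_number`` helper below are assumptions made for illustration; the pattern used by the library is likely more thorough.

```python
import re

# Simplified pattern: optional country code, then 3-3-4 digits separated by
# "-", "." or whitespace. Assumed for illustration only.
US_PHONE = re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def find_us_phone_number(text: str) -> str:
    """Return the first phone-like substring, or "" if none is found."""
    match = US_PHONE.search(text)
    return match.group(0) if match else ""


print(find_us_phone_number("Phone number: 215-867-5309"))
```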
``group_broken_paragraphs``
---------------------------
@ -1319,6 +1402,8 @@ Examples:
group_broken_paragraphs(text, paragraph_split=para_split_re)
For more information about the ``group_broken_paragraphs`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``remove_punctuation``
--------------------------
@ -1334,6 +1419,8 @@ Examples:
# Returns "A lovely quote"
remove_punctuation("“A lovely quote!”")
For more information about the ``remove_punctuation`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``replace_unicode_quotes``
--------------------------
@ -1352,6 +1439,8 @@ Examples:
# Returns "‘A lovely quote!’"
replace_unicode_quotes("\x91A lovely quote!\x92")
For more information about the ``replace_unicode_quotes`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
``translate_text``
------------------
@ -1383,6 +1472,8 @@ Examples:
# Output is "I can also translate Russian!"
translate_text("Я тоже можно переводать русский язык!", "ru", "en")
For more information about the ``translate_text`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/translate.py>`_.
#######
Staging
@ -1419,6 +1510,8 @@ Examples:
elements = [Title(text="Title"), NarrativeText(text="Narrative")]
isd_csv = convert_to_csv(elements)
For more information about the ``convert_to_csv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.
``convert_to_dataframe``
------------------------
@ -1437,6 +1530,8 @@ Examples:
elements = [Title(text="Title"), NarrativeText(text="Narrative")]
df = convert_to_dataframe(elements)
For more information about the ``convert_to_dataframe`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.
``convert_to_dict``
--------------------
@ -1454,6 +1549,8 @@ Examples:
elements = [Title(text="Title"), NarrativeText(text="Narrative")]
isd = convert_to_dict(elements)
For more information about the ``convert_to_dict`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.
``dict_to_elements``
---------------------
@ -1475,6 +1572,8 @@ Examples:
# [ Title(text="My Title"), NarrativeText(text="My Narrative")]
elements = dict_to_elements(isd)
For more information about the ``dict_to_elements`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.
``stage_csv_for_prodigy``
--------------------------
@ -1497,6 +1596,8 @@ Examples:
with open("prodigy.csv", "w") as csv_file:
    csv_file.write(prodigy_csv_data)
For more information about the ``stage_csv_for_prodigy`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/prodigy.py>`_.
``stage_for_argilla``
--------------------------
@ -1523,6 +1624,8 @@ Examples:
argilla_dataset = stage_for_argilla(elements, "text_classification", metadata=metadata)
For more information about the ``stage_for_argilla`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/argilla.py>`_.
``stage_for_baseplate``
-----------------------
@ -1575,6 +1678,8 @@ The output will look like:
],
}
For more information about the ``stage_for_baseplate`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/baseplate.py>`_.
``stage_for_datasaur``
--------------------------
@ -1611,6 +1716,8 @@ Example:
entities = [[{"text": "Matt", "type": "PER", "start_idx": 11, "end_idx": 15}]]
datasaur_data = stage_for_datasaur(elements, entities)
For more information about the ``stage_for_datasaur`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/datasaur.py>`_.
``stage_for_label_box``
--------------------------
@ -1676,6 +1783,8 @@ files to an S3 bucket.
upload_staged_files()
For more information about the ``stage_for_label_box`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/label_box.py>`_.
``stage_for_label_studio``
--------------------------
@ -1838,6 +1947,8 @@ task in LabelStudio:
See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options
for labels and annotations.
For more information about the ``stage_for_label_studio`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/label_studio.py>`_.
``stage_for_prodigy``
--------------------------
@ -1879,6 +1990,8 @@ use the ``save_as_jsonl`` utility function to save the formatted data to a ``.js
# The resulting jsonl file is ready to be used with Prodigy.
save_as_jsonl(prodigy_data, "prodigy.jsonl")
For more information about the ``stage_for_prodigy`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/prodigy.py>`_.
``stage_for_transformers``
--------------------------
@ -1961,6 +2074,8 @@ The following optional keyword arguments can be specified in
results = [nlp(chunk) for chunk in chunks]
For more information about the ``stage_for_transformers`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/huggingface.py>`_.
``stage_for_weaviate``
-----------------------
@ -2012,6 +2127,8 @@ options for uploading data and querying data once it has been uploaded.
uuid=generate_uuid5(data_object),
)
For more information about the ``stage_for_weaviate`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/weaviate.py>`_.
######################
Other helper functions
@ -2035,6 +2152,8 @@ Examples:
# Returns True because the text includes a phone number
contains_us_phone_number("Phone number: 215-867-5309")
For more information about the ``contains_us_phone_number`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``contains_verb``
-----------------
@ -2066,6 +2185,8 @@ Examples:
example_2 = "A friendly dog"
contains_verb(example_2)
For more information about the ``contains_verb`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``exceeds_cap_ratio``
---------------------
@ -2092,6 +2213,8 @@ Examples:
# Returns False because the text is more than 1% caps
exceeds_cap_ratio(example_2, threshold=0.01)
For more information about the ``exceeds_cap_ratio`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``extract_attachment_info``
----------------------------
@ -2110,6 +2233,8 @@ if specified.
msg = email.message_from_file(f)
attachment_info = extract_attachment_info(msg, output_dir="example-docs")
For more information about the ``extract_attachment_info`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/email.py>`_.
``is_bulleted_text``
----------------------
@ -2129,6 +2254,8 @@ Examples:
# Returns False
is_bulleted_text("I love Morse Code! ●●●")
For more information about the ``is_bulleted_text`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``is_possible_narrative_text``
------------------------------
@ -2174,6 +2301,8 @@ Examples:
example_3 = "OLD MCDONALD HAD A FARM"
is_possible_narrative_text(example_3, cap_threshold=1.0)
For more information about the ``is_possible_narrative_text`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``is_possible_title``
---------------------
@ -2218,6 +2347,8 @@ Examples:
example_3 = "Make sure you brush your teeth. Do it before you go to bed."
is_possible_title(example_3, sentence_min_length=5)
For more information about the ``is_possible_title`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
``sentence_count``
------------------
@ -2240,3 +2371,5 @@ Examples:
# Returns 1 because the first sentence in the example does not contain five word tokens.
sentence_count(example, min_length=5)
For more information about the ``sentence_count`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
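To make the effect of ``min_length`` concrete, the following sketch counts sentences with a naive regex split. ``count_sentences`` and the example text are hypothetical; the real function relies on a proper tokenizer, so results on harder inputs may differ.

```python
import re


def count_sentences(text: str, min_length: int = 0) -> int:
    """Count sentences with at least min_length word tokens (hypothetical helper)."""
    # Naive split after ".", "!" or "?" followed by whitespace; not a real tokenizer.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return sum(1 for s in sentences if len(re.findall(r"\w+", s)) >= min_length)


# Assumed example: the first sentence has three words, the second has seven.
example = "Look at me! I am a sentence with many words."
print(count_sentences(example))                # 2
print(count_sentences(example, min_length=5))  # 1
```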