mirror of https://github.com/Unstructured-IO/unstructured.git (synced 2025-11-03 03:23:25 +00:00)
docs: Add source code links to bricks' docs (#923)
Co-authored-by: Francisco Ansaldo <franciscoansaldo@Franciscos-MacBook-Pro.local>
parent 9b830693bd
commit 26da51c765
@@ -131,6 +131,8 @@ to disable SSL verification in the request.
  elements = partition(url=url)
  elements = partition(url=url, content_type="text/markdown")

For more information about the ``partition`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/auto.py>`_.


``partition_csv``
------------------
@@ -148,22 +150,7 @@ Examples:
  elements = partition_csv(filename="example-docs/stanley-cups.csv")
  print(elements[0].metadata.text_as_html)


``partition_tsv``
------------------

The ``partition_tsv`` function pre-processes TSV files. The output is a single
``Table`` element. The ``text_as_html`` attribute in the element metadata will
contain an HTML representation of the table.

Examples:

.. code:: python

  from unstructured.partition.tsv import partition_tsv

  elements = partition_tsv(filename="example-docs/stanley-cups.tsv")
  print(elements[0].metadata.text_as_html)
For more information about the ``partition_csv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/csv.py>`_.


``partition_doc``
@@ -186,6 +173,8 @@ Examples:

  elements = partition_doc(filename="example-docs/fake.doc")

For more information about the ``partition_doc`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/doc.py>`_.


``partition_docx``
------------------
@@ -228,6 +217,8 @@ insert page breaks when you save the document. If your Word document renderer do
you may not see page numbers in the output even if you see them visually when you open the
document. If that is the case, you can try saving the document with a different renderer.

For more information about the ``partition_docx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/docx.py>`_.


``partition_email``
---------------------
@@ -288,6 +279,8 @@ workflow looks like:
      filename=filename, process_attachments=True, attachment_partitioner=partition
  )

For more information about the ``partition_email`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/email.py>`_.


``partition_epub``
---------------------
@@ -306,6 +299,8 @@ Examples:

  elements = partition_epub(filename="example-docs/winter-sports.epub")

For more information about the ``partition_epub`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/epub.py>`_.


``partition_html``
---------------------
@@ -361,6 +356,8 @@ If ``html_assemble_articles`` is ``True``, each ``<article>`` tag will be treate
If ``html_assemble_articles`` is ``True`` and no ``<article>`` tags are present, the behavior
is the same as ``html_assemble_articles=False``.

For more information about the ``partition_html`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/html.py>`_.


``partition_image``
---------------------
@@ -416,6 +413,8 @@ have the Korean language pack for Tesseract installed on your system.
  filename = "example-docs/english-and-korean.png"
  elements = partition_image(filename=filename, ocr_languages="eng+kor", strategy="ocr_only")

For more information about the ``partition_image`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/image.py>`_.


``partition_md``
---------------------
@@ -432,6 +431,8 @@ Examples:

  elements = partition_md(filename="README.md")

For more information about the ``partition_md`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/md.py>`_.


``partition_msg``
-----------------
@@ -470,6 +471,8 @@ workflow looks like:
      filename=filename, process_attachments=True, attachment_partitioner=partition
  )

For more information about the ``partition_msg`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/msg.py>`_.


``partition_multiple_via_api``
------------------------------
@@ -509,6 +512,8 @@ Examples:
  files = [stack.enter_context(open(filename, "rb")) for filename in filenames]
  documents = partition_multiple_via_api(files=files, file_filenames=filenames)

For more information about the ``partition_multiple_via_api`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/api.py>`_.


``partition_odt``
------------------
@@ -525,6 +530,28 @@ Examples:

  elements = partition_odt(filename="example-docs/fake.odt")

For more information about the ``partition_odt`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/odt.py>`_.


``partition_org``
---------------------

The ``partition_org`` function processes Org Mode (``.org``) documents. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_org``.


Examples:

.. code:: python

  from unstructured.partition.org import partition_org

  elements = partition_org(filename="example-docs/README.org")

For more information about the ``partition_org`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/org.py>`_.
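
The two-step flow described for ``partition_org`` (convert with ``pandoc``, then partition the HTML) can be sketched roughly as follows. This is only an illustration, not the code in ``unstructured/partition/org.py``: the helper names ``build_pandoc_command`` and ``org_to_html`` are hypothetical, and the sketch assumes the ``pandoc`` executable is on your ``PATH``.

```python
import shutil
import subprocess
from typing import List


def build_pandoc_command(org_path: str) -> List[str]:
    """Build the argv for converting an Org Mode file to HTML (hypothetical helper)."""
    return ["pandoc", "--from", "org", "--to", "html", org_path]


def org_to_html(org_path: str) -> str:
    """Return the HTML that pandoc writes to stdout for the given .org file."""
    if shutil.which("pandoc") is None:
        # Fail early with a clear message instead of a FileNotFoundError.
        raise RuntimeError("pandoc was not found on PATH")
    result = subprocess.run(
        build_pandoc_command(org_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The returned HTML string could then be handed to ``partition_html``, for example through its ``text`` argument.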


``partition_pdf``
---------------------
@@ -603,6 +630,8 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.

For more information about the ``partition_pdf`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/pdf.py>`_.


``partition_ppt``
---------------------
@@ -623,6 +652,8 @@ Examples:

  elements = partition_ppt(filename="example-docs/fake-power-point.ppt")

For more information about the ``partition_ppt`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/ppt.py>`_.


``partition_pptx``
---------------------
@@ -644,23 +675,7 @@ Examples:
  with open("example-docs/fake-power-point.pptx", "rb") as f:
      elements = partition_pptx(file=f)


``partition_org``
---------------------

The ``partition_org`` function processes Org Mode (``.org``) documents. The function
first converts the document to HTML using ``pandoc`` and then calls ``partition_html``.
You'll need `pandoc <https://pandoc.org/installing.html>`_ installed on your system
to use ``partition_org``.


Examples:

.. code:: python

  from unstructured.partition.org import partition_org

  elements = partition_org(filename="example-docs/README.org")
For more information about the ``partition_pptx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/pptx.py>`_.


``partition_rst``
@@ -680,6 +695,9 @@ Examples:

  elements = partition_rst(filename="example-docs/README.rst")

For more information about the ``partition_rst`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/rst.py>`_.


``partition_rtf``
---------------------

@@ -697,6 +715,8 @@ Examples:

  elements = partition_rtf(filename="example-docs/fake-doc.rtf")

For more information about the ``partition_rtf`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/rtf.py>`_.


``partition_text``
---------------------
@@ -746,6 +766,27 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.

For more information about the ``partition_text`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text.py>`_.
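
To make the effect of a character cap such as ``max_partition`` concrete, here is a minimal sketch that breaks a long string into pieces of at most ``max_partition`` characters, splitting on whitespace. The helper ``split_to_max_partition`` is hypothetical and the library's own splitting logic may differ; the sketch only mirrors the documented default of ``1500`` and the ``None`` switch.

```python
from typing import List, Optional


def split_to_max_partition(text: str, max_partition: Optional[int] = 1500) -> List[str]:
    """Split text on whitespace into chunks no longer than max_partition characters.

    Passing None disables the cap and returns the text as a single chunk.
    A single word longer than the cap is kept whole (a simplification).
    """
    if max_partition is None or len(text) <= max_partition:
        return [text]

    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) > max_partition and current:
            # Adding this word would exceed the cap, so close the current chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = "word " * 1000  # hypothetical input, 5000 characters
print([len(chunk) for chunk in split_to_max_partition(long_text)])
```

Note that this sketch normalizes runs of whitespace to single spaces when it rebuilds the chunks.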


``partition_tsv``
------------------

The ``partition_tsv`` function pre-processes TSV files. The output is a single
``Table`` element. The ``text_as_html`` attribute in the element metadata will
contain an HTML representation of the table.

Examples:

.. code:: python

  from unstructured.partition.tsv import partition_tsv

  elements = partition_tsv(filename="example-docs/stanley-cups.tsv")
  print(elements[0].metadata.text_as_html)

For more information about the ``partition_tsv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/partition/tsv.py>`_.
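
For a feel of what an HTML representation of a tab-separated table looks like, the sketch below builds one with the standard library only. It is an illustration under assumptions, not the library's implementation: ``tsv_to_html_table`` is a hypothetical helper, the sample data is made up, and the exact markup that ``text_as_html`` contains may differ.

```python
import csv
import html
import io


def tsv_to_html_table(tsv_text: str) -> str:
    """Render tab-separated text as a bare HTML table, one <tr> per row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for fields in reader:
        # Escape each cell so characters like "<" do not break the markup.
        cells = "".join(f"<td>{html.escape(field)}</td>" for field in fields)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


sample = "name\tcount\nalpha\t1\n"  # hypothetical data
print(tsv_to_html_table(sample))
```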


``partition_via_api``
---------------------
@@ -802,6 +843,7 @@ documentation on how to run the API as a container locally.
      filename=filename, api_url="http://localhost:5000/general/v0/general"
  )

For more information about the ``partition_via_api`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/api.py>`_.


``partition_xlsx``
@@ -821,6 +863,8 @@ Examples:
  elements = partition_xlsx(filename="example-docs/stanley-cups.xlsx")
  print(elements[0].metadata.text_as_html)

For more information about the ``partition_xlsx`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/xlsx.py>`_.


``partition_xml``
-----------------
@@ -846,6 +890,7 @@ The default value is ``1500``, which roughly corresponds to
the average character length for a paragraph.
You can disable ``max_partition`` by setting it to ``None``.

For more information about the ``partition_xml`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/xml.py>`_.


########
@@ -931,6 +976,8 @@ Examples:
  # The output should be "Hello 😀"
  elements[0].text

For more information about the ``bytes_string_to_string`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean``
---------
@@ -959,6 +1006,8 @@ Examples:
  # Returns "ITEM 1A: RISK FACTORS"
  clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)

For more information about the ``clean`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_bullets``
-----------------
@@ -978,6 +1027,8 @@ Examples:
  # Returns "I love Morse Code! ●●●"
  clean_bullets("I love Morse Code! ●●●")

For more information about the ``clean_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_dashes``
----------------
@@ -994,6 +1045,8 @@ Examples:
  # Returns "ITEM 1A: RISK FACTORS"
  clean_dashes("ITEM 1A: RISK-FACTORS\u2013")

For more information about the ``clean_dashes`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
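
The behaviour shown in the ``clean_dashes`` example can be approximated with one regular expression. The function below is a sketch under the assumption that hyphens and en dashes are replaced by spaces and the result is trimmed; ``clean_dashes_sketch`` is a hypothetical name, so check the linked source for the actual rules.

```python
import re


def clean_dashes_sketch(text: str) -> str:
    """Replace hyphens and en dashes (U+2013) with spaces, then trim the ends."""
    return re.sub(r"[-\u2013]", " ", text).strip()


print(clean_dashes_sketch("ITEM 1A: RISK-FACTORS\u2013"))  # ITEM 1A: RISK FACTORS
```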


``clean_extra_whitespace``
--------------------------
@@ -1010,6 +1063,8 @@ Examples:
  # Returns "ITEM 1A: RISK FACTORS"
  clean_extra_whitespace("ITEM 1A: RISK FACTORS\n")

For more information about the ``clean_extra_whitespace`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_non_ascii_chars``
-------------------------
@@ -1027,6 +1082,8 @@ Examples:
  # Returns "This text containsnon-ascii characters!"
  clean_non_ascii_chars(text)

For more information about the ``clean_non_ascii_chars`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_ordered_bullets``
-------------------------
@@ -1045,6 +1102,8 @@ Examples:
  # Returns "This is a very important point ●"
  clean_ordered_bullets("a.b This is a very important point ●")

For more information about the ``clean_ordered_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_postfix``
-----------------
@@ -1068,6 +1127,8 @@ Examples:
  # Returns "The end!"
  clean_postfix(text, r"(END|STOP)", ignore_case=True)

For more information about the ``clean_postfix`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``clean_prefix``
----------------
@@ -1091,6 +1152,8 @@ Examples:
  # Returns "This is the best summary of all time!"
  clean_prefix(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True)

For more information about the ``clean_prefix`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.
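
Prefix removal by pattern, with an ``ignore_case`` switch as in the ``clean_prefix`` example, can be sketched with ``re.sub`` anchored at the start of the string. ``clean_prefix_sketch`` is a hypothetical stand-in, and the input string is an assumption chosen to match the documented return value; the library function may treat whitespace differently.

```python
import re


def clean_prefix_sketch(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Remove a leading match of pattern, then drop leading whitespace."""
    flags = re.IGNORECASE if ignore_case else 0
    # The non-capturing group keeps "^" bound to the whole pattern,
    # even when the pattern contains a top-level alternation.
    return re.sub(rf"^(?:{pattern})", "", text, flags=flags).lstrip()


text = "SUMMARY: This is the best summary of all time!"  # assumed input
print(clean_prefix_sketch(text, r"(SUMMARY|DESCRIPTION):", ignore_case=True))
```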


``clean_trailing_punctuation``
-------------------------------
@@ -1106,6 +1169,8 @@ Examples:
  # Returns "ITEM 1A: RISK FACTORS"
  clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")

For more information about the ``clean_trailing_punctuation`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``extract_datetimetz``
----------------------
@@ -1125,6 +1190,8 @@ object from the input string.
  # Returns datetime.datetime(2021, 3, 26, 11, 4, 9, tzinfo=datetime.timezone(datetime.timedelta(seconds=43200)))
  extract_datetimetz(text)

For more information about the ``extract_datetimetz`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_email_address``
--------------------------
@@ -1142,6 +1209,8 @@ addresses in the input string.
  # Returns "['me@email.com', 'you@email.com']"
  extract_email_address(text)

For more information about the ``extract_email_address`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_ip_address``
------------------------
@@ -1159,6 +1228,8 @@ returns a list of all IP address in input string.
  # Returns "['ba23::58b5:2236:45g2:88h2', '10.0.2.01']"
  extract_ip_address(text)

For more information about the ``extract_ip_address`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_ip_address_name``
----------------------------
@@ -1178,6 +1249,8 @@ IP addresses in the input string.
  # Returns "['ABC.DEF.local', 'ABC.DEF.local2']"
  extract_ip_address_name(text)

For more information about the ``extract_ip_address_name`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_mapi_id``
----------------------
@@ -1197,6 +1270,8 @@ containing the ``mapi id`` in the input string.
  # Returns "['32.88.5467.123']"
  extract_mapi_id(text)

For more information about the ``extract_mapi_id`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_ordered_bullets``
---------------------------
@@ -1215,6 +1290,8 @@ Examples:
  # Returns ("a", "1", None)
  extract_ordered_bullets("a.1 This is a very important point")

For more information about the ``extract_ordered_bullets`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_text_after``
----------------------
@@ -1238,6 +1315,8 @@ Examples:
  # Returns "Look at me, I'm flying!"
  extract_text_after(text, r"SPEAKER \d{1}:")

For more information about the ``extract_text_after`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.
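
The idea behind ``extract_text_after`` (return whatever follows the first match of a pattern) can be sketched with ``re.search``. This is not the library code: ``extract_text_after_sketch`` is a hypothetical name, the input string is assumed, and raising ``ValueError`` when nothing matches is a choice made for the sketch.

```python
import re


def extract_text_after_sketch(text: str, pattern: str) -> str:
    """Return the text following the first match of pattern, stripped."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern {pattern!r} not found in text")
    return text[match.end():].strip()


text = "SPEAKER 1: Look at me, I'm flying!"  # assumed input
print(extract_text_after_sketch(text, r"SPEAKER \d{1}:"))  # Look at me, I'm flying!
```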


``extract_text_before``
-----------------------
@@ -1261,6 +1340,8 @@ Examples:
  # Returns "Here I am!"
  extract_text_before(text, r"STOP")

For more information about the ``extract_text_before`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``extract_us_phone_number``
---------------------------
@@ -1276,6 +1357,8 @@ Examples:
  # Returns "215-867-5309"
  extract_us_phone_number("Phone number: 215-867-5309")

For more information about the ``extract_us_phone_number`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/extract.py>`_.


``group_broken_paragraphs``
---------------------------
@@ -1319,6 +1402,8 @@ Examples:

  group_broken_paragraphs(text, paragraph_split=para_split_re)

For more information about the ``group_broken_paragraphs`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``remove_punctuation``
--------------------------
@@ -1334,6 +1419,8 @@ Examples:
  # Returns "A lovely quote"
  remove_punctuation("“A lovely quote!”")

For more information about the ``remove_punctuation`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``replace_unicode_quotes``
--------------------------
@@ -1352,6 +1439,8 @@ Examples:
  # Returns "‘A lovely quote!’"
  replace_unicode_quotes("\x91A lovely quote!\x92")

For more information about the ``replace_unicode_quotes`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/core.py>`_.


``translate_text``
------------------
@@ -1383,6 +1472,8 @@ Examples:
  # Output is "I can also translate Russian!"
  translate_text("Я тоже можно переводать русский язык!", "ru", "en")

For more information about the ``translate_text`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/cleaners/translate.py>`_.


#######
Staging
@@ -1419,6 +1510,8 @@ Examples:
  elements = [Title(text="Title"), NarrativeText(text="Narrative")]
  isd_csv = convert_to_csv(elements)

For more information about the ``convert_to_csv`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.


``convert_to_dataframe``
------------------------
@@ -1437,6 +1530,8 @@ Examples:
  elements = [Title(text="Title"), NarrativeText(text="Narrative")]
  df = convert_to_dataframe(elements)

For more information about the ``convert_to_dataframe`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.


``convert_to_dict``
--------------------
@@ -1454,6 +1549,8 @@ Examples:
  elements = [Title(text="Title"), NarrativeText(text="Narrative")]
  isd = convert_to_dict(elements)

For more information about the ``convert_to_dict`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.


``dict_to_elements``
---------------------
@@ -1475,6 +1572,8 @@ Examples:
  # [ Title(text="My Title"), NarrativeText(text="My Narrative")]
  elements = dict_to_elements(isd)

For more information about the ``dict_to_elements`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/base.py>`_.


``stage_csv_for_prodigy``
--------------------------
@@ -1497,6 +1596,8 @@ Examples:
  with open("prodigy.csv", "w") as csv_file:
      csv_file.write(prodigy_csv_data)

For more information about the ``stage_csv_for_prodigy`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/prodigy.py>`_.


``stage_for_argilla``
--------------------------
@@ -1523,6 +1624,8 @@ Examples:

  argilla_dataset = stage_for_argilla(elements, "text_classification", metadata=metadata)

For more information about the ``stage_for_argilla`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/argilla.py>`_.


``stage_for_baseplate``
-----------------------
@@ -1575,6 +1678,8 @@ The output will look like:
    ],
  }

For more information about the ``stage_for_baseplate`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/baseplate.py>`_.


``stage_for_datasaur``
--------------------------
@@ -1611,6 +1716,8 @@ Example:
  entities = [[{"text": "Matt", "type": "PER", "start_idx": 11, "end_idx": 15}]]
  datasaur_data = stage_for_datasaur(elements, entities)

For more information about the ``stage_for_datasaur`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/datasaur.py>`_.


``stage_for_label_box``
--------------------------
@@ -1676,6 +1783,8 @@ files to an S3 bucket.

  upload_staged_files()

For more information about the ``stage_for_label_box`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/label_box.py>`_.


``stage_for_label_studio``
--------------------------
@@ -1838,6 +1947,8 @@ task in LabelStudio:
See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options
for labels and annotations.

For more information about the ``stage_for_label_studio`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/label_studio.py>`_.


``stage_for_prodigy``
--------------------------
@@ -1879,6 +1990,8 @@ use the ``save_as_jsonl`` utility function to save the formatted data to a ``.js
  # The resulting jsonl file is ready to be used with Prodigy.
  save_as_jsonl(prodigy_data, "prodigy.jsonl")

For more information about the ``stage_for_prodigy`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/prodigy.py>`_.


``stage_for_transformers``
--------------------------
@@ -1961,6 +2074,8 @@ The following optional keyword arguments can be specified in

  results = [nlp(chunk) for chunk in chunks]

For more information about the ``stage_for_transformers`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/huggingface.py>`_.


``stage_for_weaviate``
-----------------------
@@ -2012,6 +2127,8 @@ options for uploading data and querying data once it has been uploaded.
      uuid=generate_uuid5(data_object),
  )

For more information about the ``stage_for_weaviate`` brick, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/staging/weaviate.py>`_.


######################
Other helper functions
@@ -2035,6 +2152,8 @@ Examples:
  # Returns True because the text includes a phone number
  contains_us_phone_number("Phone number: 215-867-5309")

For more information about the ``contains_us_phone_number`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.


``contains_verb``
-----------------
@@ -2066,6 +2185,8 @@ Examples:
  example_2 = "A friendly dog"
  contains_verb(example_2)

For more information about the ``contains_verb`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.


``exceeds_cap_ratio``
---------------------
@@ -2092,6 +2213,8 @@ Examples:
  # Returns True because the text is more than 1% caps
  exceeds_cap_ratio(example_2, threshold=0.01)

For more information about the ``exceeds_cap_ratio`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.
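
The threshold check behind a cap-ratio test can be illustrated with a simplified version. This sketch is an assumption about the general idea, not the library implementation: ``exceeds_cap_ratio_sketch`` is a hypothetical name, it tokenizes on whitespace rather than with an NLP tokenizer, and the example sentence is made up.

```python
def exceeds_cap_ratio_sketch(text: str, threshold: float) -> bool:
    """Return True when the share of capitalized tokens is above threshold."""
    tokens = text.split()
    if not tokens:
        return False
    # Count tokens that are title-cased ("Look") or fully upper-cased ("I").
    capitalized = sum(1 for token in tokens if token.istitle() or token.isupper())
    return capitalized / len(tokens) > threshold


example = "Look at me, I am no longer yelling"  # hypothetical input: 2 of 8 tokens capitalized
print(exceeds_cap_ratio_sketch(example, threshold=0.5))   # False, since 0.25 is below 0.5
print(exceeds_cap_ratio_sketch(example, threshold=0.01))  # True, since 0.25 is above 0.01
```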


``extract_attachment_info``
----------------------------
@@ -2110,6 +2233,8 @@ if specified.
      msg = email.message_from_file(f)
  attachment_info = extract_attachment_info(msg, output_dir="example-docs")

For more information about the ``extract_attachment_info`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/email.py>`_.


``is_bulleted_text``
----------------------
@@ -2129,6 +2254,8 @@ Examples:
  # Returns False
  is_bulleted_text("I love Morse Code! ●●●")

For more information about the ``is_bulleted_text`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.


``is_possible_narrative_text``
------------------------------
@@ -2174,6 +2301,8 @@ Examples:
  example_3 = "OLD MCDONALD HAD A FARM"
  is_possible_narrative_text(example_3, cap_threshold=1.0)

For more information about the ``is_possible_narrative_text`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.


``is_possible_title``
---------------------
@@ -2218,6 +2347,8 @@ Examples:
  example_3 = "Make sure you brush your teeth. Do it before you go to bed."
  is_possible_title(example_3, sentence_min_length=5)

For more information about the ``is_possible_title`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.


``sentence_count``
------------------
@@ -2240,3 +2371,5 @@ Examples:

  # Returns 1 because the first sentence in the example does not contain five word tokens.
  sentence_count(example, min_length=5)

For more information about the ``sentence_count`` function, you can check the `source code here <https://github.com/Unstructured-IO/unstructured/blob/a583d47b841bdd426b9058b7c34f6aa3ed8de152/unstructured/partition/text_type.py>`_.