Bricks
======
The ``unstructured`` library provides bricks to make it quick and
easy to parse documents and create new pre-processing pipelines. The following
documents the bricks currently available in the library.
############
Partitioning
############
The partitioning bricks in ``unstructured`` differentiate between different sections
of text in a document. For example, the partitioning bricks can help distinguish between
titles, narrative text, and tables.
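
As a rough illustration, the bricks below can be combined to classify raw lines of
text. This is a minimal sketch based only on the functions documented in this
section, not the library's full partitioning logic:

.. code:: python

    from unstructured.nlp.partition import is_possible_narrative_text, is_possible_title

    # Hypothetical raw lines pulled from a document
    lines = [
        "ITEM 1A. RISK FACTORS",
        "Make sure you brush your teeth before you go to bed.",
    ]

    # Check for narrative text first, since a short narrative
    # sentence can also pass the title checks
    for line in lines:
        if is_possible_narrative_text(line):
            print(f"NarrativeText: {line}")
        elif is_possible_title(line):
            print(f"Title: {line}")
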
``is_bulleted_text``
----------------------
Uses regular expression patterns to check if a snippet of text is a bullet point. Only
triggers if the bullet point appears at the start of the snippet.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_bulleted_text

    # Returns True
    is_bulleted_text("● An excellent point!")

    # Returns False
    is_bulleted_text("I love Morse Code! ●●●")

``is_possible_narrative_text``
------------------------------
The ``is_possible_narrative_text`` function determines if a section of text is a candidate
for consideration as narrative text. The function performs the following checks on input text:
* Empty text cannot be narrative text
* Text that is all numeric cannot be narrative text
* Text that does not contain a verb cannot be narrative text
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold
  is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text
  that is all caps.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_narrative_text

    # Returns True because the example passes all the checks
    example_1 = "Make sure you brush your teeth before you go to bed."
    is_possible_narrative_text(example_1)

    # Returns False because the text exceeds the caps ratio and does not contain a verb
    example_2 = "ITEM 1A. RISK FACTORS"
    is_possible_narrative_text(example_2)

    # Returns True because the text has a verb and does not exceed the cap_threshold
    example_3 = "OLD MCDONALD HAD A FARM"
    is_possible_narrative_text(example_3, cap_threshold=1.0)

``is_possible_title``
---------------------
The ``is_possible_title`` function determines if a section of text is a candidate
for consideration as a title. The function performs the following checks:
* Empty text cannot be a title
* Text that is all numeric cannot be a title
* If the text contains more than one sentence exceeding a certain length, it cannot be a title.
  The sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_title

    # Returns True because the text passes all the checks
    example_1 = "ITEM 1A. RISK FACTORS"
    is_possible_title(example_1)

    # Returns True because there is only one sentence
    example_2 = "Make sure you brush your teeth before you go to bed."
    is_possible_title(example_2, sentence_min_length=5)

    # Returns False because there are two sentences
    example_3 = "Make sure you brush your teeth. Do it before you go to bed."
    is_possible_title(example_3, sentence_min_length=5)

``contains_verb``
-----------------
Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
be used independently as well. The function identifies verbs using the NLTK part-of-speech
tagger. The following part-of-speech tags are identified as verbs:
* ``VB``
* ``VBG``
* ``VBD``
* ``VBN``
* ``VBP``
* ``VBZ``

Examples:

.. code:: python

    from unstructured.nlp.partition import contains_verb

    # Returns True because the text contains a verb
    example_1 = "I am going to run to the store to pick up some milk."
    contains_verb(example_1)

    # Returns False because the text does not contain a verb
    example_2 = "A friendly dog"
    contains_verb(example_2)

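If you want to inspect the tags NLTK assigns directly, you can run the tagger
yourself. This snippet uses NLTK rather than ``unstructured`` and assumes the
``punkt`` and ``averaged_perceptron_tagger`` NLTK data packages are installed:

.. code:: python

    import nltk

    # Pairs such as ("going", "VBG") and ("run", "VB") indicate verbs
    nltk.pos_tag(nltk.word_tokenize("I am going to run to the store."))
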
``sentence_count``
------------------
Counts the number of sentences in a section of text. Optionally, you can count only
sentences that exceed a specified word count. Punctuation counts as a word token
in the sentence. The function uses the NLTK sentence and word tokenizers to identify
distinct sentences and words.

Examples:

.. code:: python

    from unstructured.nlp.partition import sentence_count

    example = "Look at me! I am a document with two sentences."

    # Returns 2 because the example contains two sentences
    sentence_count(example)

    # Returns 1 because the first sentence in the example does not contain five word tokens
    sentence_count(example, min_length=5)

``exceeds_cap_ratio``
---------------------
Determines if the section of text exceeds the specified caps ratio. Used in
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
defaults to ``0.3``. Only runs on sections of text that are a single sentence.

Examples:

.. code:: python

    from unstructured.nlp.partition import exceeds_cap_ratio

    # Returns True because the text is more than 30% caps
    example_1 = "LOOK AT ME I AM YELLING"
    exceeds_cap_ratio(example_1)

    # Returns False because the text is less than 30% caps
    example_2 = "Look at me, I am no longer yelling"
    exceeds_cap_ratio(example_2)

    # Returns True because the text is more than 1% caps
    exceeds_cap_ratio(example_2, threshold=0.01)

########
Cleaning
########
The cleaning bricks in ``unstructured`` remove unwanted text from source documents.
Examples include removing extra whitespace, boilerplate, or sentence fragments.
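
Cleaning bricks are plain functions on strings, so they can be chained. A minimal
sketch using two of the bricks documented below:

.. code:: python

    from unstructured.cleaners.core import clean_bullets, clean_extra_whitespace

    # Hypothetical snippet with a leading bullet and extra whitespace
    text = "●  An excellent  point!\n"
    for cleaner in (clean_bullets, clean_extra_whitespace):
        text = cleaner(text)

    # text is now roughly "An excellent point!"
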
``clean``
---------
Cleans a section of text with options including removing bullets, extra whitespace, dashes
and trailing punctuation. Optionally, you can choose to lowercase the output.
Options:

* Applies ``clean_bullets`` if ``bullets=True``.
* Applies ``clean_extra_whitespace`` if ``extra_whitespace=True``.
* Applies ``clean_dashes`` if ``dashes=True``.
* Applies ``clean_trailing_punctuation`` if ``trailing_punctuation=True``.
* Lowercases the output if ``lowercase=True``.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean

    # Returns "an excellent point!"
    clean("● An excellent point!", bullets=True, lowercase=True)

    # Returns "ITEM 1A: RISK FACTORS"
    clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)

``clean_bullets``
-----------------
Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the
text are not removed.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_bullets

    # Returns "An excellent point!"
    clean_bullets("● An excellent point!")

    # Returns "I love Morse Code! ●●●"
    clean_bullets("I love Morse Code! ●●●")

``clean_extra_whitespace``
--------------------------
Removes extra whitespace from a section of text. Also handles special characters
such as ``\xa0`` and newlines.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_extra_whitespace

    # Returns "ITEM 1A: RISK FACTORS"
    clean_extra_whitespace("ITEM 1A:  RISK FACTORS\n")

``clean_dashes``
----------------
Removes dashes from a section of text. Also handles special characters
such as ``\u2013``.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_dashes

    # Returns "ITEM 1A: RISK FACTORS"
    clean_dashes("ITEM 1A: RISK-FACTORS\u2013")

``clean_trailing_punctuation``
-------------------------------
Removes trailing punctuation from a section of text.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_trailing_punctuation

    # Returns "ITEM 1A: RISK FACTORS"
    clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")

``replace_unicode_quotes``
--------------------------
Replaces unicode quote characters such as ``\x91`` in strings.

Examples:

.. code:: python

    from unstructured.cleaners.core import replace_unicode_quotes

    # Returns "“A lovely quote!”"
    replace_unicode_quotes("\x93A lovely quote!\x94")

    # Returns "‘A lovely quote!’"
    replace_unicode_quotes("\x91A lovely quote!\x92")

``remove_punctuation``
--------------------------
Removes ASCII and unicode punctuation from a string.

Examples:

.. code:: python

    from unstructured.cleaners.core import remove_punctuation

    # Returns "A lovely quote"
    remove_punctuation("“A lovely quote!”")

    # Returns ""
    remove_punctuation("'()[]{};:'\",.?/\\-_")

#######
Staging
#######
Staging bricks in ``unstructured`` prepare extracted text for downstream tasks such
as machine learning inference and data labeling.
``convert_to_isd``
------------------
Converts outputs to the initial structured data (ISD) format. This is the default format
for returning data in Unstructured pipeline APIs.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.base import convert_to_isd

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    isd = convert_to_isd(elements)

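Each element in ``isd`` is a dictionary containing the element's text and type. For
the elements above, the output is roughly of the following form; the exact field
names may differ by version:

.. code:: python

    [
        {"text": "Title", "type": "Title"},
        {"text": "Narrative", "type": "NarrativeText"},
    ]
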
``stage_for_label_studio``
--------------------------
Formats outputs for upload to LabelStudio. After running ``stage_for_label_studio``, you can
write the results to a JSON file that is ready to be included in a new LabelStudio project.

Examples:

.. code:: python

    import json

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.label_studio import stage_for_label_studio

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    label_studio_data = stage_for_label_studio(elements, text_field="my_text", id_field="my_id")

    # The resulting JSON file is ready to be uploaded to LabelStudio
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

You can also include pre-annotations and predictions as part of your LabelStudio upload.
The ``annotations`` kwarg is a list of lists. If ``annotations`` is specified, there must be a list of
annotations for each element in the ``elements`` list. If an element does not have any annotations,
use an empty list.
The following shows an example of how to upload annotations for the "Text Classification"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

Similar to annotations, the ``predictions`` kwarg is also a list of lists. A ``prediction`` is an annotation with
the addition of a ``score`` value. If ``predictions`` is specified, there must be a list of
predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.
The following shows an example of how to upload predictions for the "Text Classification"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioPrediction,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    predictions = [[
        LabelStudioPrediction(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ],
            score=0.68,
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        predictions=predictions,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with predictions included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

The following shows an example of how to upload annotations for the "Named Entity Recognition"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="labels",
                    value={"start": 0, "end": 9, "text": "Narrative", "labels": ["MISC"]},
                    from_name="label",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options
for labels and annotations.
``stage_for_prodigy``
--------------------------
Formats outputs as JSON for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_. After
running ``stage_for_prodigy``, you can write the results to a JSON file that is ready to
be used with Prodigy.

Examples:

.. code:: python

    import json

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting JSON file is ready to be used with Prodigy
    with open("prodigy.json", "w") as f:
        json.dump(prodigy_data, f, indent=4)

**Note**: Prodigy recommends ``.jsonl`` format for feeding data to API loaders. After running ``stage_for_prodigy``, you can
use the ``save_as_jsonl`` utility function to save the formatted data to a ``.jsonl`` file that is ready to be used with Prodigy.

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy
    from unstructured.utils import save_as_jsonl

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting jsonl file is ready to be used with Prodigy
    save_as_jsonl(prodigy_data, "prodigy.jsonl")

``stage_csv_for_prodigy``
--------------------------
Formats outputs as CSV for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_. After
running ``stage_csv_for_prodigy``, you can write the results to a CSV file that is ready
to be used with Prodigy.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_csv_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"source": "news"}]
    prodigy_csv_data = stage_csv_for_prodigy(elements, metadata)

    # The resulting CSV file is ready to be used with Prodigy
    with open("prodigy.csv", "w") as csv_file:
        csv_file.write(prodigy_csv_data)