Bricks
======

The ``unstructured`` library provides bricks to make it quick and easy to parse documents
and create new pre-processing pipelines. The following documents the bricks currently
available in the library.


############
Partitioning
############

The partitioning bricks in ``unstructured`` differentiate between different sections of text
in a document. For example, the partitioning bricks can help distinguish between titles,
narrative text, and tables.


``is_bulleted_text``
--------------------

Uses regular expression patterns to check if a snippet of text is a bullet point. Only
triggers if the bullet point appears at the start of the snippet.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_bulleted_text

    # Returns True
    is_bulleted_text("● An excellent point!")

    # Returns False
    is_bulleted_text("I love Morse Code! ●●●")


``is_possible_narrative_text``
------------------------------

The ``is_possible_narrative_text`` function determines if a section of text is a candidate
for consideration as narrative text. The function performs the following checks on input text:

* Empty text cannot be narrative text
* Text that is all numeric cannot be narrative text
* Text that does not contain a verb cannot be narrative text
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold is
  configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text that
  is all caps.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_narrative_text

    # Returns True because the example passes all the checks
    example_1 = "Make sure you brush your teeth before you go to bed."
    is_possible_narrative_text(example_1)

    # Returns False because the text exceeds the caps ratio and does not contain a verb
    example_2 = "ITEM 1A. RISK FACTORS"
    is_possible_narrative_text(example_2)

    # Returns True because the text contains a verb and does not exceed cap_threshold=1.0
    example_3 = "OLD MCDONALD HAD A FARM"
    is_possible_narrative_text(example_3, cap_threshold=1.0)


``is_possible_title``
---------------------

The ``is_possible_title`` function determines if a section of text is a candidate
for consideration as a title. The function performs the following checks:

* Empty text cannot be a title
* Text that is all numeric cannot be a title
* If the text contains more than one sentence that exceeds a certain length, it cannot
  be a title. The sentence length threshold is controlled by the ``sentence_min_length``
  kwarg and defaults to 5.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_title

    # Returns True because the text passes all the checks
    example_1 = "ITEM 1A. RISK FACTORS"
    is_possible_title(example_1)

    # Returns True because there is only one sentence
    example_2 = "Make sure you brush your teeth before you go to bed."
    is_possible_title(example_2, sentence_min_length=5)

    # Returns False because there are two sentences
    example_3 = "Make sure you brush your teeth. Do it before you go to bed."
    is_possible_title(example_3, sentence_min_length=5)


``contains_verb``
-----------------

Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
be used independently as well. The function identifies verbs using the NLTK part of speech
tagger. The following part of speech tags are identified as verbs:

* ``VB``
* ``VBG``
* ``VBD``
* ``VBN``
* ``VBP``
* ``VBZ``

Examples:

.. code:: python

    from unstructured.nlp.partition import contains_verb

    # Returns True because the text contains a verb
    example_1 = "I am going to run to the store to pick up some milk."
    contains_verb(example_1)

    # Returns False because the text does not contain a verb
    example_2 = "A friendly dog"
    contains_verb(example_2)
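Because the check is based on NLTK's tagger, you can inspect the part of speech tags
directly to see why a snippet does or does not count as containing a verb. The sketch
below calls NLTK itself rather than the ``unstructured`` API, and assumes the NLTK
tokenizer and tagger models are already downloaded; the tag output shown is indicative,
not normative:

.. code:: python

    from nltk import pos_tag, word_tokenize

    # "run" and "pick" receive VB tags, which is why contains_verb returns True above
    pos_tag(word_tokenize("I am going to run to the store to pick up some milk."))

    # Expect tags along the lines of [('A', 'DT'), ('friendly', 'JJ'), ('dog', 'NN')];
    # no VB* tag appears, which is why contains_verb returns False
    pos_tag(word_tokenize("A friendly dog"))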
``sentence_count``
------------------

Counts the number of sentences in a section of text. Optionally, you can only include
sentences that exceed a specified word count. Punctuation counts as a word token in the
sentence. The function uses the NLTK sentence and word tokenizers to identify distinct
sentences and words.

Examples:

.. code:: python

    from unstructured.nlp.partition import sentence_count

    example = "Look at me! I am a document with two sentences."

    # Returns 2 because the example contains two sentences
    sentence_count(example)

    # Returns 1 because the first sentence in the example does not contain five word tokens
    sentence_count(example, min_length=5)


``exceeds_cap_ratio``
---------------------

Determines if the section of text exceeds the specified caps ratio. Used in
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
defaults to ``0.3``. Only runs on sections of text that are a single sentence.

Examples:

.. code:: python

    from unstructured.nlp.partition import exceeds_cap_ratio

    # Returns True because the text is more than 30% caps
    example_1 = "LOOK AT ME I AM YELLING"
    exceeds_cap_ratio(example_1)

    # Returns False because the text is less than 30% caps
    example_2 = "Look at me, I am no longer yelling"
    exceeds_cap_ratio(example_2)

    # Returns True because the text is more than 1% caps
    exceeds_cap_ratio(example_2, threshold=0.01)
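The partitioning bricks compose naturally into the kind of pre-processing pipeline
mentioned at the top of this page. The following is a minimal sketch of routing raw
snippets into buckets; the input lines and the order of the checks are illustrative
choices, not library behavior:

.. code:: python

    from unstructured.nlp.partition import (
        is_bulleted_text,
        is_possible_narrative_text,
        is_possible_title,
    )

    # Hypothetical raw lines pulled from an upstream extraction step
    raw_lines = [
        "ITEM 1A. RISK FACTORS",
        "● An excellent point!",
        "Make sure you brush your teeth before you go to bed.",
    ]

    bullets, narrative, titles = [], [], []
    for line in raw_lines:
        # Check bullets first, then narrative text, then fall back to titles
        if is_bulleted_text(line):
            bullets.append(line)
        elif is_possible_narrative_text(line):
            narrative.append(line)
        elif is_possible_title(line):
            titles.append(line)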
########
Cleaning
########

The cleaning bricks in ``unstructured`` remove unwanted text from source documents.
Examples include removing extra whitespace, boilerplate, or sentence fragments.


``clean``
---------

Cleans a section of text with options including removing bullets, extra whitespace,
dashes and trailing punctuation. Optionally, you can choose to lowercase the output.

Options:

* Applies ``clean_bullets`` if ``bullets=True``.
* Applies ``clean_extra_whitespace`` if ``extra_whitespace=True``.
* Applies ``clean_dashes`` if ``dashes=True``.
* Applies ``clean_trailing_punctuation`` if ``trailing_punctuation=True``.
* Lowercases the output if ``lowercase=True``.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean

    # Returns "an excellent point!"
    clean("● An excellent point!", bullets=True, lowercase=True)

    # Returns "ITEM 1A: RISK FACTORS"
    clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)


``clean_bullets``
-----------------

Removes bullets from the beginning of text. Bullets that do not appear at the beginning
of the text are not removed.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_bullets

    # Returns "An excellent point!"
    clean_bullets("● An excellent point!")

    # Returns "I love Morse Code! ●●●"
    clean_bullets("I love Morse Code! ●●●")


``clean_extra_whitespace``
--------------------------

Removes extra whitespace from a section of text. Also handles special characters such
as ``\xa0`` and newlines.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_extra_whitespace

    # Returns "ITEM 1A: RISK FACTORS"
    clean_extra_whitespace("ITEM 1A: RISK FACTORS\n")


``clean_dashes``
----------------

Removes dashes from a section of text. Also handles special characters such
as ``\u2013``.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_dashes

    # Returns "ITEM 1A: RISK FACTORS"
    clean_dashes("ITEM 1A: RISK-FACTORS\u2013")


``clean_trailing_punctuation``
------------------------------

Removes trailing punctuation from a section of text.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_trailing_punctuation

    # Returns "ITEM 1A: RISK FACTORS"
    clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")


``replace_unicode_quotes``
--------------------------

Replaces unicode quote characters such as ``\x91`` in strings.

Examples:

.. code:: python

    from unstructured.cleaners.core import replace_unicode_quotes

    # Returns "“A lovely quote!”"
    replace_unicode_quotes("\x93A lovely quote!\x94")

    # Returns "‘A lovely quote!’"
    replace_unicode_quotes("\x91A lovely quote!\x92")


``remove_punctuation``
----------------------

Removes ASCII and unicode punctuation from a string.

Examples:

.. code:: python

    from unstructured.cleaners.core import remove_punctuation

    # Returns "A lovely quote"
    remove_punctuation("“A lovely quote!”")

    # Returns ""
    remove_punctuation("'()[]{};:'\",.?/\\-_")


#######
Staging
#######

Staging bricks in ``unstructured`` prepare extracted text for downstream tasks such as
machine learning inference and data labeling.


``convert_to_isd``
------------------

Converts outputs to the initial structured data (ISD) format. This is the default format
for returning data in Unstructured pipeline APIs.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.base import convert_to_isd

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    isd = convert_to_isd(elements)
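Since ISD is the pipeline interchange format, it helps to see what the staged output
looks like. The sketch below assumes the result is a JSON-serializable list of
dictionaries, one per element; the exact field names shown in the comment are an
assumption and may differ across versions:

.. code:: python

    import json

    # Assumed shape for the two elements above, roughly:
    # [{"text": "Title", "type": "Title"},
    #  {"text": "Narrative", "type": "NarrativeText"}]
    print(json.dumps(isd, indent=4))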
``stage_for_label_studio``
--------------------------

Formats outputs for upload to LabelStudio. After running ``stage_for_label_studio``, you
can write the results to a JSON file that is ready to be included in a new LabelStudio
project.

Examples:

.. code:: python

    import json
    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.label_studio import stage_for_label_studio

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    label_studio_data = stage_for_label_studio(elements, text_field="my_text", id_field="my_id")

    # The resulting JSON file is ready to be uploaded to LabelStudio
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)


You can also include pre-annotations and predictions as part of your LabelStudio upload.

The ``annotations`` kwarg is a list of lists. If ``annotations`` is specified, there must
be a list of annotations for each element in the ``elements`` list. If an element does not
have any annotations, use an empty list. The following shows an example of how to upload
annotations for the "Text Classification" task in LabelStudio:

.. code:: python

    import json
    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)


Similar to annotations, the ``predictions`` kwarg is also a list of lists. A ``prediction``
is an annotation with the addition of a ``score`` value. If ``predictions`` is specified,
there must be a list of predictions for each element in the ``elements`` list. If an
element does not have any predictions, use an empty list. The following shows an example
of how to upload predictions for the "Text Classification" task in LabelStudio:

.. code:: python

    import json
    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioPrediction,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    predictions = [[
        LabelStudioPrediction(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ],
            score=0.68,
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        predictions=predictions,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with predictions included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)


The following shows an example of how to upload annotations for the "Named Entity
Recognition" task in LabelStudio:

.. code:: python

    import json
    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="labels",
                    value={"start": 0, "end": 9, "text": "Narrative", "labels": ["MISC"]},
                    from_name="label",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id",
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)


See the `LabelStudio docs <https://labelstud.io/>`_ for a full list of options for labels
and annotations.
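Putting the bricks together, a minimal end-to-end sketch might clean raw snippets with the
cleaning bricks before staging them for labeling. The raw strings below are illustrative
assumptions; only ``clean``, ``NarrativeText``, and ``stage_for_label_studio`` are the
library calls documented above:

.. code:: python

    import json
    from unstructured.cleaners.core import clean
    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import stage_for_label_studio

    # Hypothetical raw snippets from an upstream extraction step
    raw_snippets = ["● Make sure you brush your teeth!   ", "● Go to bed early."]

    # Strip bullets and extra whitespace, then wrap each snippet as an element
    elements = [
        NarrativeText(text=clean(snippet, bullets=True, extra_whitespace=True))
        for snippet in raw_snippets
    ]

    label_studio_data = stage_for_label_studio(elements, text_field="my_text", id_field="my_id")
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)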
``stage_for_prodigy``
---------------------

Formats outputs in JSON format for use with `Prodigy <https://prodi.gy/>`_. After running
``stage_for_prodigy``, you can write the results to a JSON file that is ready to be used
with Prodigy.

Examples:

.. code:: python

    import json
    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting JSON file is ready to be used with Prodigy
    with open("prodigy.json", "w") as f:
        json.dump(prodigy_data, f, indent=4)


**Note**: Prodigy recommends the ``.jsonl`` format for feeding data to API loaders. After
running ``stage_for_prodigy``, you can use the ``save_as_jsonl`` utility function to save
the formatted data to a ``.jsonl`` file that is ready to be used with Prodigy.

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy
    from unstructured.utils import save_as_jsonl

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting jsonl file is ready to be used with Prodigy
    save_as_jsonl(prodigy_data, "prodigy.jsonl")


``stage_csv_for_prodigy``
-------------------------

Formats outputs in CSV format for use with `Prodigy <https://prodi.gy/>`_. After running
``stage_csv_for_prodigy``, you can write the results to a CSV file that is ready to be
used with Prodigy.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_csv_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"source": "news"}]
    prodigy_csv_data = stage_csv_for_prodigy(elements, metadata)

    # The resulting CSV file is ready to be used with Prodigy
    with open("prodigy.csv", "w") as csv_file:
        csv_file.write(prodigy_csv_data)
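Because ``stage_csv_for_prodigy`` returns the CSV as a string, you can sanity-check the
staged data before writing it out. A minimal sketch, assuming ``pandas`` is installed
(this round trip is not part of the ``unstructured`` API):

.. code:: python

    import io
    import pandas as pd

    # Load the staged CSV string into a DataFrame to confirm that the text
    # and metadata columns line up with the elements
    df = pd.read_csv(io.StringIO(prodigy_csv_data))
    print(df.head())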