Bricks
======

The ``unstructured`` library provides bricks to make it quick and
easy to parse documents and create new pre-processing pipelines. The following documents the
bricks currently available in the library.

############
Partitioning
############

The partitioning bricks in ``unstructured`` differentiate between different sections
of text in a document. For example, the partitioning bricks can help distinguish between
titles, narrative text, and tables.

``is_bulleted_text``
--------------------

Uses regular expression patterns to check if a snippet of text is a bullet point. Only
triggers if the bullet point appears at the start of the snippet.

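The check can be sketched with a small regular expression. The pattern below is illustrative only: the set of characters treated as bullets is an assumption, not the library's exact pattern.

```python
import re

# Assumed set of bullet characters; the library's actual pattern may differ
BULLET_PATTERN = re.compile(r"^[\u2022\u2023\u2043\u25cf\u25aa*\-]")

def looks_bulleted(text: str) -> bool:
    # Only matches when the bullet appears at the start of the snippet
    return BULLET_PATTERN.match(text.strip()) is not None
```
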
Examples:

.. code:: python

    from unstructured.nlp.partition import is_bulleted_text

    # Returns True
    is_bulleted_text("● An excellent point!")

    # Returns False
    is_bulleted_text("I love Morse Code! ●●●")

``is_possible_narrative_text``
------------------------------

The ``is_possible_narrative_text`` function determines if a section of text is a candidate
for consideration as narrative text. The function performs the following checks on input text:

* Empty text cannot be narrative text
* Text that is all numeric cannot be narrative text
* Text that does not contain a verb cannot be narrative text
* Text that exceeds the specified caps ratio cannot be narrative text. The threshold
  is configurable with the ``cap_threshold`` kwarg. To ignore this check, you can set
  ``cap_threshold=1.0``. You may want to ignore this check when dealing with text
  that is all caps.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_narrative_text

    # Returns True because the example passes all the checks
    example_1 = "Make sure you brush your teeth before you go to bed."
    is_possible_narrative_text(example_1)

    # Returns False because the text exceeds the caps ratio and does not contain a verb
    example_2 = "ITEM 1A. RISK FACTORS"
    is_possible_narrative_text(example_2)

    # Returns True because the text has a verb and does not exceed the cap_threshold
    example_3 = "OLD MCDONALD HAD A FARM"
    is_possible_narrative_text(example_3, cap_threshold=1.0)

``is_possible_title``
---------------------

The ``is_possible_title`` function determines if a section of text is a candidate
for consideration as a title. The function performs the following checks:

* Empty text cannot be a title
* Text that is all numeric cannot be a title
* Text that contains more than one sentence exceeding a minimum length cannot be a title.
  The sentence length threshold is controlled by the ``sentence_min_length`` kwarg and defaults to 5.

Examples:

.. code:: python

    from unstructured.nlp.partition import is_possible_title

    # Returns True because the text passes all the tests
    example_1 = "ITEM 1A. RISK FACTORS"
    is_possible_title(example_1)

    # Returns True because there is only one sentence
    example_2 = "Make sure you brush your teeth before you go to bed."
    is_possible_title(example_2, sentence_min_length=5)

    # Returns False because there are two sentences
    example_3 = "Make sure you brush your teeth. Do it before you go to bed."
    is_possible_title(example_3, sentence_min_length=5)

``contains_verb``
-----------------

Checks if the text contains a verb. This is used in ``is_possible_narrative_text``, but can
be used independently as well. The function identifies verbs using the NLTK part-of-speech
tagger. The following part-of-speech tags are identified as verbs:

* ``VB``
* ``VBG``
* ``VBD``
* ``VBN``
* ``VBP``
* ``VBZ``

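As an illustrative sketch (not the library's implementation), the check reduces to tag membership over NLTK-style ``(token, tag)`` pairs:

```python
# Penn Treebank verb tags, matching the list above
VERB_TAGS = {"VB", "VBG", "VBD", "VBN", "VBP", "VBZ"}

def tagged_text_contains_verb(tagged_tokens):
    # tagged_tokens: list of (token, tag) pairs, e.g. as produced by nltk.pos_tag
    return any(tag in VERB_TAGS for _, tag in tagged_tokens)
```
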
Examples:

.. code:: python

    from unstructured.nlp.partition import contains_verb

    # Returns True because the text contains a verb
    example_1 = "I am going to run to the store to pick up some milk."
    contains_verb(example_1)

    # Returns False because the text does not contain a verb
    example_2 = "A friendly dog"
    contains_verb(example_2)

``sentence_count``
------------------

Counts the number of sentences in a section of text. Optionally, you can only include
sentences that exceed a specified word count. Punctuation counts as a word token
in the sentence. The function uses the NLTK sentence and word tokenizers to identify
distinct sentences and words.

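A rough sketch of the logic, using a naive regex splitter in place of the NLTK tokenizers (so counts may differ from the library's on edge cases):

```python
import re

def naive_sentence_count(text, min_length=None):
    # Split on whitespace that follows sentence-final punctuation
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if min_length is not None:
        # Count words and punctuation marks alike as tokens, mirroring the
        # rule that punctuation counts as a word token
        sentences = [
            s for s in sentences
            if len(re.findall(r"\w+|[^\w\s]", s)) >= min_length
        ]
    return len(sentences)
```
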
Examples:

.. code:: python

    from unstructured.nlp.partition import sentence_count

    example = "Look at me! I am a document with two sentences."

    # Returns 2 because the example contains two sentences
    sentence_count(example)

    # Returns 1 because the first sentence in the example does not contain five word tokens
    sentence_count(example, min_length=5)

``exceeds_cap_ratio``
---------------------

Determines if the section of text exceeds the specified caps ratio. Used in
``is_possible_narrative_text`` and ``is_possible_title``, but can be used independently
as well. You can set the caps threshold using the ``threshold`` kwarg. The threshold
defaults to ``0.3``. Only runs on sections of text that are a single sentence.

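The underlying ratio can be sketched as the share of alphabetic characters that are uppercase; this is a simplified stand-in for the library's computation:

```python
def caps_ratio(text: str) -> float:
    # Fraction of alphabetic characters that are uppercase
    alpha = [c for c in text if c.isalpha()]
    if not alpha:
        return 0.0
    return sum(c.isupper() for c in alpha) / len(alpha)

def naive_exceeds_cap_ratio(text: str, threshold: float = 0.3) -> bool:
    return caps_ratio(text) > threshold
```
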
Examples:

.. code:: python

    from unstructured.nlp.partition import exceeds_cap_ratio

    # Returns True because the text is more than 30% caps
    example_1 = "LOOK AT ME I AM YELLING"
    exceeds_cap_ratio(example_1)

    # Returns False because the text is less than 30% caps
    example_2 = "Look at me, I am no longer yelling"
    exceeds_cap_ratio(example_2)

    # Returns True because the text is more than 1% caps
    exceeds_cap_ratio(example_2, threshold=0.01)

########
Cleaning
########

The cleaning bricks in ``unstructured`` remove unwanted text from source documents.
Examples include removing extra whitespace, boilerplate, or sentence fragments.

``clean``
---------

Cleans a section of text with options including removing bullets, extra whitespace, dashes
and trailing punctuation. Optionally, you can choose to lowercase the output.

Options:

* Applies ``clean_bullets`` if ``bullets=True``.
* Applies ``clean_extra_whitespace`` if ``extra_whitespace=True``.
* Applies ``clean_dashes`` if ``dashes=True``.
* Applies ``clean_trailing_punctuation`` if ``trailing_punctuation=True``.
* Lowercases the output if ``lowercase=True``.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean

    # Returns "an excellent point!"
    clean("● An excellent point!", bullets=True, lowercase=True)

    # Returns "ITEM 1A: RISK FACTORS"
    clean("ITEM 1A: RISK-FACTORS", extra_whitespace=True, dashes=True)

``clean_bullets``
-----------------

Removes bullets from the beginning of text. Bullets that do not appear at the beginning of the
text are not removed.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_bullets

    # Returns "An excellent point!"
    clean_bullets("● An excellent point!")

    # Returns "I love Morse Code! ●●●"
    clean_bullets("I love Morse Code! ●●●")

``clean_extra_whitespace``
--------------------------

Removes extra whitespace from a section of text. Also handles special characters
such as ``\xa0`` and newlines.

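A minimal sketch of the idea, collapsing runs of whitespace (including ``\xa0`` and newlines) into single spaces; the library's handling may differ in its details:

```python
import re

def squeeze_whitespace(text: str) -> str:
    # Treat non-breaking spaces as ordinary spaces, then collapse runs
    text = text.replace("\xa0", " ")
    return re.sub(r"\s+", " ", text).strip()
```
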
Examples:

.. code:: python

    from unstructured.cleaners.core import clean_extra_whitespace

    # Returns "ITEM 1A: RISK FACTORS"
    clean_extra_whitespace("ITEM 1A: RISK FACTORS\n")

``clean_dashes``
----------------

Removes dashes from a section of text. Also handles special characters
such as ``\u2013``.

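Illustratively, the operation can be sketched as replacing dash characters with spaces and trimming the result; both the dash set and the dash-to-space behavior here are assumptions for the sketch:

```python
import re

def naive_clean_dashes(text: str) -> str:
    # Replace ASCII hyphens and en dashes (\u2013) with spaces, then trim
    return re.sub(r"[-\u2013]", " ", text).strip()
```
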
Examples:

.. code:: python

    from unstructured.cleaners.core import clean_dashes

    # Returns "ITEM 1A: RISK FACTORS"
    clean_dashes("ITEM 1A: RISK-FACTORS\u2013")

``clean_trailing_punctuation``
------------------------------

Removes trailing punctuation from a section of text.

Examples:

.. code:: python

    from unstructured.cleaners.core import clean_trailing_punctuation

    # Returns "ITEM 1A: RISK FACTORS"
    clean_trailing_punctuation("ITEM 1A: RISK FACTORS.")

``replace_unicode_quotes``
--------------------------

Replaces unicode quote characters such as ``\x91`` in strings.

Examples:

.. code:: python

    from unstructured.cleaners.core import replace_unicode_quotes

    # Returns "“A lovely quote!”"
    replace_unicode_quotes("\x93A lovely quote!\x94")

    # Returns "‘A lovely quote!’"
    replace_unicode_quotes("\x91A lovely quote!\x92")

``remove_punctuation``
----------------------

Removes ASCII and unicode punctuation from a string.

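One way to sketch this, using Unicode character categories so both ASCII and unicode punctuation are caught (a stand-in, not the library's exact rule):

```python
import unicodedata

def strip_punctuation(text: str) -> str:
    # Drop every character whose Unicode category starts with "P" (punctuation)
    return "".join(
        c for c in text if not unicodedata.category(c).startswith("P")
    )
```
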
Examples:

.. code:: python

    from unstructured.cleaners.core import remove_punctuation

    # Returns "A lovely quote"
    remove_punctuation("“A lovely quote!”")

    # Returns ""
    remove_punctuation("'()[]{};:'\",.?/\\-_")

#######
Staging
#######

Staging bricks in ``unstructured`` prepare extracted text for downstream tasks such
as machine learning inference and data labeling.

``convert_to_isd``
------------------

Converts outputs to the initial structured data (ISD) format. This is the default format
for returning data in Unstructured pipeline APIs.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.base import convert_to_isd

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    isd = convert_to_isd(elements)

``stage_for_label_studio``
--------------------------

Formats outputs for upload to LabelStudio. After running ``stage_for_label_studio``, you can
write the results to a JSON file that is ready to be included in a new LabelStudio project.

Examples:

.. code:: python

    import json

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.label_studio import stage_for_label_studio

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    label_studio_data = stage_for_label_studio(elements, text_field="my_text", id_field="my_id")

    # The resulting JSON file is ready to be uploaded to LabelStudio
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

You can also include pre-annotations and predictions as part of your LabelStudio upload.

The ``annotations`` kwarg is a list of lists. If ``annotations`` is specified, there must be a list of
annotations for each element in the ``elements`` list. If an element does not have any annotations,
use an empty list.

The following shows an example of how to upload annotations for the "Text Classification"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id"
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

Similar to annotations, the ``predictions`` kwarg is also a list of lists. A ``prediction`` is an annotation with
the addition of a ``score`` value. If ``predictions`` is specified, there must be a list of
predictions for each element in the ``elements`` list. If an element does not have any predictions, use an empty list.

The following shows an example of how to upload predictions for the "Text Classification"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioPrediction,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    predictions = [[
        LabelStudioPrediction(
            result=[
                LabelStudioResult(
                    type="choices",
                    value={"choices": ["Positive"]},
                    from_name="sentiment",
                    to_name="text",
                )
            ],
            score=0.68
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        predictions=predictions,
        text_field="my_text",
        id_field="my_id"
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with predictions included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

The following shows an example of how to upload annotations for the "Named Entity Recognition"
task in LabelStudio:

.. code:: python

    import json

    from unstructured.documents.elements import NarrativeText
    from unstructured.staging.label_studio import (
        stage_for_label_studio,
        LabelStudioAnnotation,
        LabelStudioResult,
    )

    elements = [NarrativeText(text="Narrative")]
    annotations = [[
        LabelStudioAnnotation(
            result=[
                LabelStudioResult(
                    type="labels",
                    value={"start": 0, "end": 9, "text": "Narrative", "labels": ["MISC"]},
                    from_name="label",
                    to_name="text",
                )
            ]
        )
    ]]
    label_studio_data = stage_for_label_studio(
        elements,
        annotations=annotations,
        text_field="my_text",
        id_field="my_id"
    )

    # The resulting JSON file is ready to be uploaded to LabelStudio
    # with annotations included
    with open("label_studio.json", "w") as f:
        json.dump(label_studio_data, f, indent=4)

See the `LabelStudio docs <https://labelstud.io/tags/labels.html>`_ for a full list of options
for labels and annotations.

``stage_for_prodigy``
---------------------

Formats outputs in JSON format for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_. After running ``stage_for_prodigy``, you can
write the results to a JSON file that is ready to be used with Prodigy.

Examples:

.. code:: python

    import json

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting JSON file is ready to be used with Prodigy
    with open("prodigy.json", "w") as f:
        json.dump(prodigy_data, f, indent=4)

**Note**: Prodigy recommends ``.jsonl`` format for feeding data to API loaders. After running ``stage_for_prodigy``, you can
use the ``save_as_jsonl`` utility function to save the formatted data to a ``.jsonl`` file that is ready to be used with Prodigy.

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_for_prodigy
    from unstructured.utils import save_as_jsonl

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"type": "text"}]
    prodigy_data = stage_for_prodigy(elements, metadata)

    # The resulting jsonl file is ready to be used with Prodigy.
    save_as_jsonl(prodigy_data, "prodigy.jsonl")

``stage_csv_for_prodigy``
-------------------------

Formats outputs in CSV format for use with `Prodigy <https://prodi.gy/docs/api-loaders>`_. After running ``stage_csv_for_prodigy``, you can
write the results to a CSV file that is ready to be used with Prodigy.

Examples:

.. code:: python

    from unstructured.documents.elements import Title, NarrativeText
    from unstructured.staging.prodigy import stage_csv_for_prodigy

    elements = [Title(text="Title"), NarrativeText(text="Narrative")]
    metadata = [{"type": "title"}, {"source": "news"}]
    prodigy_csv_data = stage_csv_for_prodigy(elements, metadata)

    # The resulting CSV file is ready to be used with Prodigy
    with open("prodigy.csv", "w") as csv_file:
        csv_file.write(prodigy_csv_data)