Mirror of https://github.com/Unstructured-IO/unstructured.git, synced 2025-12-03 02:29:52 +00:00
Resolve numerous typos (#280)
* Resolve numerous typos
* Resolve typo in mime type
parent 956f04d770
commit 9062d25d0d
@@ -228,7 +228,7 @@ The output will look the same as the example from the document parsing section a
 ### E-mail Parsing

 The `partition_email` function within `unstructured` is helpful for parsing `.eml` files. Common
-e-mail clients such as Microsoft Outlook and Gmail support exproting e-mails as `.eml` files.
+e-mail clients such as Microsoft Outlook and Gmail support exporting e-mails as `.eml` files.
 `partition_email` accepts filenames, file-like object, and raw text as input. The following
 three snippets for parsing `.eml` files are equivalent:

@@ -20,7 +20,7 @@ titles, narrative text, and tables.
 The ``partition`` brick is the simplest way to partition a document in ``unstructured``.
 If you call the ``partition`` function, ``unstructured`` will attempt to detect the
 file type and route it to the appropriate partitioning brick. All partitioning bricks
-called within ``partition`` are called using the defualt kwargs. Use the document-type
+called within ``partition`` are called using the default kwargs. Use the document-type
 specific bricks if you need to apply non-default settings.
 ``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
 ``.png``, ``.jpg``, and ``.txt`` files.
@@ -539,7 +539,7 @@ Examples:
 ``clean_ordered_bullets``
 -------------------------

-Remove alpha-numeric bullets from the beginning of text up to three “sub-section” levels.
+Remove alphanumeric bullets from the beginning of text up to three “sub-section” levels.

 Examples:

@@ -687,7 +687,7 @@ Extracts text that occurs before the specified pattern.

 Options:

-* If ``index`` is set, extract before the ``(index + 1)``th occurence of the pattern. The default is ``0``.
+* If ``index`` is set, extract before the ``(index + 1)``th occurrence of the pattern. The default is ``0``.
 * Strips leading whitespace if ``strip`` is set to ``True``. The default is ``True``.


@@ -710,7 +710,7 @@ Extracts text that occurs after the specified pattern.

 Options:

-* If ``index`` is set, extract after the ``(index + 1)``th occurence of the pattern. The default is ``0``.
+* If ``index`` is set, extract after the ``(index + 1)``th occurrence of the pattern. The default is ``0``.
 * Strips trailing whitespace if ``strip`` is set to ``True``. The default is ``True``.


@@ -834,7 +834,7 @@ Examples:
 ``extract_ordered_bullets``
 ---------------------------

-Extracts alpha-numeric bullets from the beginning of text up to three “sub-section” levels.
+Extracts alphanumeric bullets from the beginning of text up to three “sub-section” levels.

 Examples:

@@ -2,7 +2,7 @@ Elements
 --------

 The following are the structured page elements that are available within the ``unstructured``
-package. Partioning bricks convert raw documents to this common set of elements. If you need
+package. Partitioning bricks convert raw documents to this common set of elements. If you need
 a custom element, the recommended approach is to create a sub-class of one of the default
 elements.

@@ -8,7 +8,7 @@ complete a data science project in hours that previously would have taken weeks.
 To get started, use the following steps:

 - Ensure you have Python 3.8 or higher installed on your system
-- Create a new Python virtual enviornment
+- Create a new Python virtual environment
 - Run `pip install -r requirements.txt` to install the dependencies
 - Run `PYTHONPATH=. jupyter notebook` from this directory to launch the notebook

@@ -5,7 +5,7 @@ and several bricks from the `unstructured` library to train a sentiment analysis
 risk factors section of S-1 filings. To get started, use the following steps:

 - Ensure you have Python 3.8 or higher installed on your system
-- Create a new Python virtual enviornment
+- Create a new Python virtual environment
 - Run `pip install -r requirements.txt` to install the dependencies
 - Run `PYTHONPATH=. jupyter notebook` from this directory to launch the notebook

@@ -125,7 +125,7 @@ def get_form_by_ticker(


 def _form_types(form_type: str, allow_amended_filing: Optional[bool] = True):
-"""Potentialy expand to include amended filing, e.g.:
+"""Potentially expand to include amended filing, e.g.:
 "10-Q" -> "10-Q/A"
 """
 assert form_type in VALID_FILING_TYPES
@@ -144,7 +144,7 @@ def get_form_by_cik(
 ) -> str:
 """For a given CIK, returns the most recent form of a given form_type. By default
 an amended version of the form_type may be retrieved (allow_amended_filing=True).
-E.g., if form_type is "10-Q", the retrived form could be a 10-Q or 10-Q/A.
+E.g., if form_type is "10-Q", the retrieved form could be a 10-Q or 10-Q/A.
 """
 session = _get_session(company, email)
 acc_num, _ = _get_recent_acc_num_by_cik(
@@ -187,7 +187,7 @@
 " - `Image`\n",
 " - `PageBreak`\n",
 " \n",
-"Other element types that we will add in the future include tables and figures. Different partioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below."
+"Other element types that we will add in the future include tables and figures. Different partitioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below."
 ]
 },
 {
@@ -143,7 +143,7 @@
 "id": "e3a8e7f4",
 "metadata": {},
 "source": [
-"The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partioning brick instead of `partition`:\n",
+"The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partitioning brick instead of `partition`:\n",
 "\n",
 "1. If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly will make your program run faster.\n",
 "2. Fewer dependencies. You don't need to install `libmagic` for filetype detection if you're only using document-specific bricks.\n",
@@ -312,7 +312,7 @@
 "id": "358e149b",
 "metadata": {},
 "source": [
-"Since a cleaning brick is just a `str -> str` function, users can also easily include their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the institute of the study of war and remove citations, which are not natural language text that we want to inclue for model training purposes."
+"Since a cleaning brick is just a `str -> str` function, users can also easily include their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the institute of the study of war and remove citations, which are not natural language text that we want to include for model training purposes."
 ]
 },
 {
@@ -7,7 +7,7 @@
 "source": [
 "# File Exploration\n",
 "\n",
-"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw doucments. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
+"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
 "\n",
 "- [Filetype detection in `unstructured`](#filetype)\n",
 "- [How to generate summary statistics about documents](#summary)"
@@ -15,5 +15,5 @@ types-requests
 vcrpy

 # NOTE(robinson) - The following pins are to address
-# vulernabilities in dependency scans
+# vulnerabilities in dependency scans
 certifi>=2022.12.07
@@ -23,7 +23,7 @@ def clean_bullets(text) -> str:

 def clean_ordered_bullets(text) -> str:
 """Cleans the start of bulleted text sections up to three “sub-section”
-bullets accounting numeric and alpha-numeric types.
+bullets accounting numeric and alphanumeric types.

 Example
 -------
@@ -29,7 +29,7 @@ def _get_indexed_match(text: str, pattern: str, index: int = 0) -> re.Match:

 def extract_text_before(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
 """Extracts texts that occurs before the specified pattern. By default, it will use
-the first occurence of the pattern (index 0). Use the index kwarg to choose a different
+the first occurrence of the pattern (index 0). Use the index kwarg to choose a different
 index.

 Input
@@ -44,7 +44,7 @@ def extract_text_before(text: str, pattern: str, index: int = 0, strip: bool = T

 def extract_text_after(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
 """Extracts texts that occurs before the specified pattern. By default, it will use
-the first occurence of the pattern (index 0). Use the index kwarg to choose a different
+the first occurrence of the pattern (index 0). Use the index kwarg to choose a different
 index.

 Input
@@ -99,7 +99,7 @@ def extract_us_phone_number(text: str):

 def extract_ordered_bullets(text) -> tuple:
 """Extracts the start of bulleted text sections bullets
-accounting numeric and alpha-numeric types.
+accounting numeric and alphanumeric types.

 Output
 -----
@@ -59,7 +59,7 @@ def translate_text(text, source_lang: Optional[str] = None, target_lang: str = "
 except OSError:
 raise ValueError(
 f"Transformers could not find the translation model {model_name}. "
-"The requested source/target language combo is not suppored."
+"The requested source/target language combo is not supported."
 )

 chunks: List[str] = chunk_by_attention_window(text, tokenizer, split_function=sent_tokenize)
@@ -230,7 +230,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti


 def _is_container_with_text(tag_elem: etree.Element) -> bool:
-"""Checks if a tag is a container that also happens to containe text.
+"""Checks if a tag is a container that also happens to contain text.
 Example
 -------
 <div>Hi there,
@@ -236,7 +236,7 @@ def _detect_filetype_from_octet_stream(file: IO) -> FileType:
 elif all([f in archive_filenames for f in EXPECTED_PPTX_FILES]):
 return FileType.PPTX

-logger.warning("Could not detect the filetype from application/octet-strem MIME type.")
+logger.warning("Could not detect the filetype from application/octet-stream MIME type.")
 return FileType.UNK


@@ -16,7 +16,7 @@ class BaseConnector(ABC):

 @abstractmethod
 def cleanup(self, cur_dir=None):
-"""Any additonal cleanup up need after processing is complete. E.g., removing
+"""Any additional cleanup up need after processing is complete. E.g., removing
 temporary download dirs that are empty.

 By convention, documents that failed to process are typically not cleaned up."""
@@ -98,7 +98,7 @@ def partition_docx(

 def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[Text]:
 """Converts a docx Paragraph object into the appropriate unstructured document element.
-If the paragaraph style is "Normal" or unknown, we try to predict the element type from the
+If the paragraph style is "Normal" or unknown, we try to predict the element type from the
 raw text."""
 text = paragraph.text
 style_name = paragraph.style.name
@@ -228,7 +228,7 @@ def under_non_alpha_ratio(text: str, threshold: float = 0.5):

 def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
 """Checks the title ratio in a section of text. If a sufficient proportion of the words
-are capitalized, that can be indiciated on non-narrative text (i.e. "1A. Risk Factors").
+are capitalized, that can be indicated on non-narrative text (i.e. "1A. Risk Factors").

 Parameters
 ----------
@@ -12,7 +12,7 @@ def stage_for_datasaur(
 _entities: List[List[Dict[str, Any]]] = [[] for _ in range(len(elements))]
 if entities is not None:
 if len(entities) != len(elements):
-raise ValueError("If entities is specified, it must be the same lenth as elements.")
+raise ValueError("If entities is specified, it must be the same length as elements.")

 for entity_list in entities:
 for entity in entity_list:
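
Note on the `extract_text_before` / `extract_text_after` hunks above: the `index` and `strip` options they document can be sketched in standalone Python as follows. This is an illustrative reimplementation, not the code from `unstructured`. The helper name `indexed_match` is hypothetical, and the sketch assumes that `strip` trims only the whitespace adjacent to the matched pattern.

```python
import re


def indexed_match(text: str, pattern: str, index: int = 0) -> re.Match:
    """Return the (index + 1)th match of pattern in text (hypothetical helper)."""
    if index < 0:
        raise ValueError(f"index must be non-negative, got {index}")
    for i, match in enumerate(re.finditer(pattern, text)):
        if i == index:
            return match
    raise ValueError(f"pattern {pattern!r} has fewer than {index + 1} matches")


def extract_text_before(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
    """Return everything that precedes the chosen occurrence of pattern."""
    before = text[: indexed_match(text, pattern, index).start()]
    # Assumption: strip removes the whitespace that sat next to the match.
    return before.rstrip() if strip else before


def extract_text_after(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
    """Return everything that follows the chosen occurrence of pattern."""
    after = text[indexed_match(text, pattern, index).end():]
    # Assumption: strip removes the whitespace that sat next to the match.
    return after.lstrip() if strip else after


sample = "Here I am! STOP Look at me! STOP"
print(extract_text_before(sample, "STOP"))           # Here I am!
print(extract_text_before(sample, "STOP", index=1))  # Here I am! STOP Look at me!
print(extract_text_after("SPEAKER 1: Look at me!", r"SPEAKER \d{1}:"))  # Look at me!
```

With `index=1` the second occurrence of the pattern is used, which matches the "``(index + 1)``th occurrence" wording in the documentation hunks.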