Resolve numerous typos (#280)

* Resolve numerous typos

* Resolve typo in mime type
Tom Aarsen 2023-02-25 02:48:23 +01:00 committed by GitHub
parent 956f04d770
commit 9062d25d0d
19 changed files with 27 additions and 27 deletions

@ -228,7 +228,7 @@ The output will look the same as the example from the document parsing section a
### E-mail Parsing
The `partition_email` function within `unstructured` is helpful for parsing `.eml` files. Common
e-mail clients such as Microsoft Outlook and Gmail support exproting e-mails as `.eml` files.
e-mail clients such as Microsoft Outlook and Gmail support exporting e-mails as `.eml` files.
`partition_email` accepts filenames, file-like object, and raw text as input. The following
three snippets for parsing `.eml` files are equivalent:

@ -20,7 +20,7 @@ titles, narrative text, and tables.
The ``partition`` brick is the simplest way to partition a document in ``unstructured``.
If you call the ``partition`` function, ``unstructured`` will attempt to detect the
file type and route it to the appropriate partitioning brick. All partitioning bricks
called within ``partition`` are called using the defualt kwargs. Use the document-type
called within ``partition`` are called using the default kwargs. Use the document-type
specific bricks if you need to apply non-default settings.
``partition`` currently supports ``.docx``, ``.doc``, ``.pptx``, ``.ppt``, ``.eml``, ``.html``, ``.pdf``,
``.png``, ``.jpg``, and ``.txt`` files.
@ -539,7 +539,7 @@ Examples:
``clean_ordered_bullets``
-------------------------
Remove alpha-numeric bullets from the beginning of text up to three “sub-section” levels.
Remove alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:
@ -687,7 +687,7 @@ Extracts text that occurs before the specified pattern.
Options:
* If ``index`` is set, extract before the ``(index + 1)``th occurence of the pattern. The default is ``0``.
* If ``index`` is set, extract before the ``(index + 1)``th occurrence of the pattern. The default is ``0``.
* Strips leading whitespace if ``strip`` is set to ``True``. The default is ``True``.
@ -710,7 +710,7 @@ Extracts text that occurs after the specified pattern.
Options:
* If ``index`` is set, extract after the ``(index + 1)``th occurence of the pattern. The default is ``0``.
* If ``index`` is set, extract after the ``(index + 1)``th occurrence of the pattern. The default is ``0``.
* Strips trailing whitespace if ``strip`` is set to ``True``. The default is ``True``.
@ -834,7 +834,7 @@ Examples:
``extract_ordered_bullets``
---------------------------
Extracts alpha-numeric bullets from the beginning of text up to three “sub-section” levels.
Extracts alphanumeric bullets from the beginning of text up to three “sub-section” levels.
Examples:

@ -2,7 +2,7 @@ Elements
--------
The following are the structured page elements that are available within the ``unstructured``
package. Partioning bricks convert raw documents to this common set of elements. If you need
package. Partitioning bricks convert raw documents to this common set of elements. If you need
a custom element, the recommended approach is to create a sub-class of one of the default
elements.

@ -8,7 +8,7 @@ complete a data science project in hours that previously would have taken weeks.
To get started, use the following steps:
- Ensure you have Python 3.8 or higher installed on your system
- Create a new Python virtual enviornment
- Create a new Python virtual environment
- Run `pip install -r requirements.txt` to install the dependencies
- Run `PYTHONPATH=. jupyter notebook` from this directory to launch the notebook

@ -5,7 +5,7 @@ and several bricks from the `unstructured` library to train a sentiment analysis
risk factors section of S-1 filings. To get started, use the following steps:
- Ensure you have Python 3.8 or higher installed on your system
- Create a new Python virtual enviornment
- Create a new Python virtual environment
- Run `pip install -r requirements.txt` to install the dependencies
- Run `PYTHONPATH=. jupyter notebook` from this directory to launch the notebook

@ -125,7 +125,7 @@ def get_form_by_ticker(
def _form_types(form_type: str, allow_amended_filing: Optional[bool] = True):
"""Potentialy expand to include amended filing, e.g.:
"""Potentially expand to include amended filing, e.g.:
"10-Q" -> "10-Q/A"
"""
assert form_type in VALID_FILING_TYPES
@ -144,7 +144,7 @@ def get_form_by_cik(
) -> str:
"""For a given CIK, returns the most recent form of a given form_type. By default
an amended version of the form_type may be retrieved (allow_amended_filing=True).
E.g., if form_type is "10-Q", the retrived form could be a 10-Q or 10-Q/A.
E.g., if form_type is "10-Q", the retrieved form could be a 10-Q or 10-Q/A.
"""
session = _get_session(company, email)
acc_num, _ = _get_recent_acc_num_by_cik(

@ -187,7 +187,7 @@
" - `Image`\n",
" - `PageBreak`\n",
" \n",
"Other element types that we will add in the future include tables and figures. Different partioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below."
"Other element types that we will add in the future include tables and figures. Different partitioning functions use different methods for determining the element type and extracting the associated content. Document elements have a `str` representation. You can print them using the snippet below."
]
},
{

@ -143,7 +143,7 @@
"id": "e3a8e7f4",
"metadata": {},
"source": [
"The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partioning brick instead of `partition`:\n",
"The `unstructured` library also includes partitioning bricks targeted at specific document types. The `partition` brick uses these document-specific partitioning bricks under the hood. There are a few reasons you may want to use a document-specific partitioning brick instead of `partition`:\n",
"\n",
"1. If you already know the document type, filetype detection is unnecessary. Using the document-specific brick directly will make your program run faster.\n",
"2. Fewer dependencies. You don't need to install `libmagic` for filetype detection if you're only using document-specific bricks.\n",
@ -312,7 +312,7 @@
"id": "358e149b",
"metadata": {},
"source": [
"Since a cleaning brick is just a `str -> str` function, users can also easily include their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the institute of the study of war and remove citations, which are not natural language text that we want to inclue for model training purposes."
"Since a cleaning brick is just a `str -> str` function, users can also easily include their own cleaning bricks for custom data preparation tasks. In the example below, we partition a Russian offensive campaign assessment from the institute of the study of war and remove citations, which are not natural language text that we want to include for model training purposes."
]
},
{

@ -7,7 +7,7 @@
"source": [
"# File Exploration\n",
"\n",
"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw doucments. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
"\n",
"- [Filetype detection in `unstructured`](#filetype)\n",
"- [How to generate summary statistics about documents](#summary)"

@ -15,5 +15,5 @@ types-requests
vcrpy
# NOTE(robinson) - The following pins are to address
# vulernabilities in dependency scans
# vulnerabilities in dependency scans
certifi>=2022.12.07

@ -23,7 +23,7 @@ def clean_bullets(text) -> str:
def clean_ordered_bullets(text) -> str:
"""Cleans the start of bulleted text sections up to three “sub-section”
bullets accounting numeric and alpha-numeric types.
bullets accounting numeric and alphanumeric types.
Example
-------

@ -29,7 +29,7 @@ def _get_indexed_match(text: str, pattern: str, index: int = 0) -> re.Match:
def extract_text_before(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
"""Extracts texts that occurs before the specified pattern. By default, it will use
the first occurence of the pattern (index 0). Use the index kwarg to choose a different
the first occurrence of the pattern (index 0). Use the index kwarg to choose a different
index.
Input
@ -44,7 +44,7 @@ def extract_text_before(text: str, pattern: str, index: int = 0, strip: bool = T
def extract_text_after(text: str, pattern: str, index: int = 0, strip: bool = True) -> str:
"""Extracts texts that occurs before the specified pattern. By default, it will use
the first occurence of the pattern (index 0). Use the index kwarg to choose a different
the first occurrence of the pattern (index 0). Use the index kwarg to choose a different
index.
Input
@ -99,7 +99,7 @@ def extract_us_phone_number(text: str):
def extract_ordered_bullets(text) -> tuple:
"""Extracts the start of bulleted text sections bullets
accounting numeric and alpha-numeric types.
accounting numeric and alphanumeric types.
Output
-----

@ -59,7 +59,7 @@ def translate_text(text, source_lang: Optional[str] = None, target_lang: str = "
except OSError:
raise ValueError(
f"Transformers could not find the translation model {model_name}. "
"The requested source/target language combo is not suppored."
"The requested source/target language combo is not supported."
)
chunks: List[str] = chunk_by_attention_window(text, tokenizer, split_function=sent_tokenize)

@ -230,7 +230,7 @@ def _text_to_element(text: str, tag: str, ancestortags: Tuple[str, ...]) -> Opti
def _is_container_with_text(tag_elem: etree.Element) -> bool:
"""Checks if a tag is a container that also happens to containe text.
"""Checks if a tag is a container that also happens to contain text.
Example
-------
<div>Hi there,

@ -236,7 +236,7 @@ def _detect_filetype_from_octet_stream(file: IO) -> FileType:
elif all([f in archive_filenames for f in EXPECTED_PPTX_FILES]):
return FileType.PPTX
logger.warning("Could not detect the filetype from application/octet-strem MIME type.")
logger.warning("Could not detect the filetype from application/octet-stream MIME type.")
return FileType.UNK

@ -16,7 +16,7 @@ class BaseConnector(ABC):
@abstractmethod
def cleanup(self, cur_dir=None):
"""Any additonal cleanup up need after processing is complete. E.g., removing
"""Any additional cleanup up need after processing is complete. E.g., removing
temporary download dirs that are empty.
By convention, documents that failed to process are typically not cleaned up."""

@ -98,7 +98,7 @@ def partition_docx(
def _paragraph_to_element(paragraph: docx.text.paragraph.Paragraph) -> Optional[Text]:
"""Converts a docx Paragraph object into the appropriate unstructured document element.
If the paragaraph style is "Normal" or unknown, we try to predict the element type from the
If the paragraph style is "Normal" or unknown, we try to predict the element type from the
raw text."""
text = paragraph.text
style_name = paragraph.style.name

@ -228,7 +228,7 @@ def under_non_alpha_ratio(text: str, threshold: float = 0.5):
def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
"""Checks the title ratio in a section of text. If a sufficient proportion of the words
are capitalized, that can be indiciated on non-narrative text (i.e. "1A. Risk Factors").
are capitalized, that can be indicated on non-narrative text (i.e. "1A. Risk Factors").
Parameters
----------

@ -12,7 +12,7 @@ def stage_for_datasaur(
_entities: List[List[Dict[str, Any]]] = [[] for _ in range(len(elements))]
if entities is not None:
if len(entities) != len(elements):
raise ValueError("If entities is specified, it must be the same lenth as elements.")
raise ValueError("If entities is specified, it must be the same length as elements.")
for entity_list in entities:
for entity in entity_list: