**Summary**
Remove HTML-specific element types and return "regular" elements like
`Title` and `NarrativeText` from `partition_html()`.
**Additional Context**
- An aspect of the legacy HTML partitioner was the use of HTML-specific
element types used to track metadata during partitioning.
- That role is no longer necessary or desireable.
- HTML-specific elements like `HTMLTitle` and `HTMLNarrativeText` were
returned from partitioning HTML but also the seven other file-formats
that broker partitioning to HTML (convert-to-HTML and partition_html()).
This does not cause immediate breakage because these are still `Text`
element subtypes, but it produces a confusing developer experience.
- Remove the prior metadata roles from HTML-specific elements and remove
those element types entirely.
### Summary
Rip off page_number metadata fields until we have page counting for all
kinds of html files (not just limited to news articles with multiple
`<article>` tag)
### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
removed since they relay on the `page_number` fields from the SEC html
file - now test moved to mock test for chunk_by_title -> revisit those
tests when we find test file for this
Also changed the element ids from partition outputs for html files -
element id change due to page number change (in element id hashing) ->
todo ticket: update other deterministic element id tests per crag's
comment
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
add support for start_index in html links extraction (closes#2625)
Testing
```
from unstructured.partition.html import partition_html
from unstructured.staging.base import elements_to_json
html_text = """<html>
<p>Hello there I am a <a href="/link">very important link!</a></p>
<p>Here is a list of my favorite things</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Parrot">Parrots</a></li>
<li>Dogs</li>
</ul>
<a href="/loner">A lone link!</a>
</html>"""
elements = partition_html(text=html_text)
print(elements_to_json(elements))
```
---------
Co-authored-by: Michael Niestroj <michael.niestroj@unblu.com>
Co-authored-by: christinestraub <christinemstraub@gmail.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: christinestraub <christinestraub@users.noreply.github.com>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
Addresses a cluster of HTML-related bugs:
- empty table is identified as bulleted-table
- `partition_html()` emits empty (no text) tables (#1928)
- `.text_as_html` contains inappropriate `<br>` elements in invalid
locations.
- cells enclosed in `<thead>` and `<tfoot>` elements are dropped (#1928)
- `.text_as_html` contains whitespace padding
Each of these is addressed in a separate commit below.
Fixes#1928.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: scanny <scanny@users.noreply.github.com>
Co-authored-by: Yuming Long <63475068+yuming-long@users.noreply.github.com>
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
### Description
* If the contents of a doc were updated by the process of
reading/downloading it, this was not being persisted. To fix this, the
data being passed around was updated to use a multiprocessing safe dict
rather than the json string. Now that dict is updated after the
`get_file` method is called.
* Wikipedia connector was updated to use a static filename rather than
one requiring a call to fetch data.
* The read config param `re_download` was not being leveraged by the
source node, this was fixed.
* Added fix: chunking and embedding order reversed so chunking runs
before embeddings
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Description
Set language to None by default. Update ingest test to use local file
used in language unit tests to validate.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
### Summary
Closes#1534 and #1535
Detects document language using `langdetect` package.
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)
---------
Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
Adding table extraction to HTML partitioning.
This PR utilizes 'table' HTML elements to extract and parse HTML tables
and return them in partitioning.
```
# checkout this branch, go into ipython shell
In [1]: from unstructured.partition.html import partition_html
In [2]: path_to_html = "{html sample file with table}"
In [3]: elements = partition_html(path_to_html)
```
you should see the table in the elements list!
Uncomment confluence-diff ingest test to:
- see if the test has consistent results
- keep testing the confluence connector
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* track tags in html
* pass through links as metadata
* add test for grabbing links
* one more link
* changelog and version
* update docs
* fix tests
* update empty link assertion
* ingest-test-fixtures-update
* Update ingest test fixtures (#961)
* Add confluence connector and an example script
* add test script, add dependency installations
* add authentication secret variables for ci tests and actions
* add dependency installation commands for workflows
* add dependency installation commands for workflows
* Update ingest test fixtures (#907)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add add ingest test fixtures update workflow for python 3.10, update example script with dummy values
* change workflow name to avoid confusion
* change workflow name to avoid confusion
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions, remove 3.10 workflow for the test fixtures update
* only leave 3.8 in ingest test matrix to test consistent partitioning among python versions
* Update ingest test fixtures (#911)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* revert back the test python version matrix
* recompile dependencies
* modifications for shellcheck
* update changelog and version
* changelog and version
* remove comments
* Update ingest test fixtures (#915)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* add the option to state the number of spaces to be fetched
* add scroll functionality, expose --confluence-num-of-spaces, --confluence-list-of-spaces and --confluence-num-of-docs-from-each-space to users
* add help message
* add docstrings for two tests, validate grabbing every doc in the fetched spaces, count number of files instead of diffing for confluence2 test
* change test names
* rename connector arg
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* change arg name for connector
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
* add comment to example
* change arg names
* add new tests to ingest test
* shellcheck remove redundant statement
* Update ingest test fixtures (#932)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* Update ingest test fixtures (#936)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* linting
* change file extensions to parse as html
* Update ingest test fixtures (#943)
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
* remove old fixtures
* update version to 0.8.2-dev3
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
* change file to trigger CI
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>