### Description
Migrate over the sharepoint connector to v2 and in the process refactor
the majority of the connector. It now pulls in much more content from
the SDK on index time, including permissions data is the parameters are
passed in. HTML content generated from the SitePage is isolated to the
html content in the `CanvasContent1` and `LayoutWebpartsContent`
returned by the SDK.
Some TODOs were left in there for future iterations. Currently only
document and site page content is being pulled in from sharepoint, but
sharepoint has more types of content than just that, such as lists. Note
left in there to support other sharepoint types.
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: rbiseck3 <rbiseck3@users.noreply.github.com>
Co-authored-by: vangheem <vangheem@gmail.com>
Co-authored-by: Ahmet Melek <ahmetmeleq@gmail.com>
Co-authored-by: Ahmet Melek <39141206+ahmetmeleq@users.noreply.github.com>
### Summary
Rip off page_number metadata fields until we have page counting for all
kinds of html files (not just limited to news articles with multiple
`<article>` tag)
### Test
Unit tests
`test_add_chunking_strategy_on_partition_html_respects_multipage` and
`test_add_chunking_strategy_title_on_partition_auto_respects_multipage`
removed since they relay on the `page_number` fields from the SEC html
file - now test moved to mock test for chunk_by_title -> revisit those
tests when we find test file for this
Also changed the element ids from partition outputs for html files -
element id change due to page number change (in element id hashing) ->
todo ticket: update other deterministic element id tests per crag's
comment
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: yuming-long <yuming-long@users.noreply.github.com>
Part two of: https://github.com/Unstructured-IO/unstructured/pull/2842
Main changes compared to part one:
* hash computation includes element's sequence number on page, page
number, document filename and its text
* there are more test for deterministic behavior of IDs returned by
partitioning functions + their uniqueness (guaranteed at the document
level, and high probability across multiple documents)
This PR addresses the following issue:
https://github.com/Unstructured-IO/unstructured/issues/2461
Canonicalize JSON produced for ingest tests such that incidental changes
is _form_ of the JSON objects (keys moving around) that does not change
the _content_ of that JSON object does not trigger an ingest-test
failure.
Closes: #1891 (check the issue for more info)
---------
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: ahmetmeleq <ahmetmeleq@users.noreply.github.com>
Co-authored-by: Yao You <yao@unstructured.io>