14 Commits

Author SHA1 Message Date
Steve Canny
1ce60f2bba
rfctr(xlsx): extract _XlsxPartitionerOptions (#2838)
**Summary**
As an initial step in reducing the complexity of the monolithic
`partition_xlsx()` function, extract all argument-handling to a separate
`_XlsxPartitionerOptions` object which can be fully covered by isolated
unit tests.
    
**Additional Context**
This code was from a prior XLSX bug-fix branch that did not get
committed because of time constraints. I wanted to revisit it here
because I need the benefits of this as part of some new work on PPTX
that will require a separate options object that can be passed to
delegate objects.

This approach was incubated in the chunking context and has produced a
lot of opportunities there to decompose the logic into smaller
components that are more understandable and isolated-test-able, without
having to pass an extended list of option values in ever sub-call. As
well as decluttering the code, this removes coupling where the caller
needs to know which options a subroutine might need to reference.
2024-04-03 23:27:33 +00:00
Filip Knefel
6af6604057
feat: introduce date_from_file_object parameter to partitions (#2563)
Introduce `date_from_file_object` to `partition*` functions, by default
set to `False`.
If set to `True` and file is provided via `file` parameter, partition
will attempt to infer last modified date from `file`'s contents
otherwise last modified metadata will be set to `None`.

---------

Co-authored-by: Filip Knefel <filip@unstructured.io>
Co-authored-by: Ronny H <138828701+ron-unstructured@users.noreply.github.com>
2024-03-18 01:09:44 +00:00
Steve Canny
b27ad9b6aa
fix: raises on file-like object with .name not a valid path (#2614)
**Summary**
Fixes: #2308

**Additional context**
Through a somewhat deep call-chain, partitioning a file-like object
(e.g. io.BytesIO) having its `.name` attribute set to a path not
pointing to an actual file on the local filesystem would raise
`FileNotFoundError` when the last-modified date was being computed for
the document.

This scenario is a legitimate partitioning call, where `file.name` is
used downstream to describe the source of, for example, a bytes payload
downloaded from the network.

**Fix**
- explicitly check for the existence of a file at the given path before
accessing it to get its modified date. Return `None` (already a
legitimate return value) when no such file exists.
- Generally clean up the implementations.
- Add unit tests that exercise all cases.

---------

Co-authored-by: John <43506685+Coniferish@users.noreply.github.com>
2024-03-07 19:02:04 +00:00
Steve Canny
d3242fb546
rfctr(xlsx): extract connected components (#2575)
**Summary**
Refactoring as part of `partition_xlsx()` algorithm replacement that was
delayed by some CI challenges.

A separate PR because it is cohesive and relatively independent from the
prior PR.
2024-02-22 22:50:48 +00:00
Steve Canny
d9f8467187
fix(xlsx): xlsx subtable algorithm (#2534)
**Reviewers:** It may be easier to review each of the two commits
separately. The first adds the new `_SubtableParser` object with its
unit-tests and the second one uses that object to replace the flawed
existing subtable-parsing algorithm.

**Summary**

There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).

This PR replaces the flawed existing algorithm with a `_SubtableParser`
object that encapsulates all that logic and has thorough unit-tests.

**Additional Context**

This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes. This PR fixes all these bugs:
```python
    #
    # --  CASE 1: There are no leading or trailing single-cell rows.
    #       -> this subtable functions never get called, subtable is emitted as the only element
    #
    #    a b  -> Table(a, b, c, d)
    #    c d

    # --  CASE 2: There is exactly one leading single-cell row.
    #       -> Leading single-cell row emitted as `Title` element, core-table properly identified.
    #
    #    a    -> [ Title(a),
    #    b c       Table(b, c, d, e) ]
    #    d e

    # --  CASE 3: There are two-or-more leading single-cell rows.
    #       -> leading single-cell rows are included in subtable
    #
    #    a    -> [ Table(a, b, c, d, e, f) ]
    #    b
    #    c d
    #    e f

    # --  CASE 4: There is exactly one trailing single-cell row.
    #      -> core table is dropped. trailing single-cell row is emitted as Title
    #         (this is the behavior in the reported bug)
    #
    #    a b  -> [ Title(e) ]
    #    c d
    #      e

    # --  CASE 5: There are two-or-more trailing single-cell rows.
    #      -> core table is dropped. trailing single-cell rows are each emitted as a Title
    #
    #    a b  -> [ Title(e),
    #    c d       Title(f) ]
    #      e
    #      f

    # --  CASE 6: There are exactly one each leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #    b c       Table(b, c, d, e),
    #    d e       Title(f) ]
    #    f

    # --  CASE 7: There are two leading and one trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #    a    -> [ Title(a),
    #    b         Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g) ]
    #      g

    # --  CASE 8: There are two-or-more leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #      b       Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g),
    #    g         Title(h) ]
    #    h

    # --  CASE 9: Single-row subtable, no single-cell rows above or below.
    #      -> First cell is mistakenly emitted as title, remaining cells are dropped.
    #
    #    a b c  -> [ Title(a) ]

    # --  CASE 10: Single-row subtable with one leading single-cell row.
    #      -> Leading single-row cell is correctly identified as title, core-table is mis-identified
    #         as a `Title` and truncated.
    #
    #    a      -> [ Title(a),
    #    b c d       Title(b) ]
```
2024-02-13 20:29:17 -08:00
Steve Canny
dd6576c603
rfctr(xlsx): cleaning in prep for XLSX algorithm replacement (#2524)
**Reviewers:** It may be faster to review each of the three commits
separately since they are groomed to only make one type of change each
(typing, docstrings, test-cleanup).

**Summary**

There are a cluster of bugs in `partition_xlsx()` that all derive from
flaws in the algorithm we use to detect "subtables". These are
encountered when the user wants to get multiple document-elements from
each worksheet, which is the default (argument `find_subtable = True`).

These commits clean up typing, lint, and other non-behavior-changing
aspects of the code in preparation for installing a new algorithm that
correctly identifies and partitions contiguous sub-regions of an Excel
worksheet into distinct elements.

**Additional Context**

This is a summary of the failure cases. There are a few other cases but
they're closely related and this was enough evidence and scope for my
purposes:
```python
    #
    # --  CASE 1: There are no leading or trailing single-cell rows.
    #       -> this subtable functions never get called, subtable is emitted as the only element
    #
    #    a b  -> Table(a, b, c, d)
    #    c d

    # --  CASE 2: There is exactly one leading single-cell row.
    #       -> Leading single-cell row emitted as `Title` element, core-table properly identified.
    #
    #    a    -> [ Title(a),
    #    b c       Table(b, c, d, e) ]
    #    d e

    # --  CASE 3: There are two-or-more leading single-cell rows.
    #       -> leading single-cell rows are included in subtable
    #
    #    a    -> [ Table(a, b, c, d, e, f) ]
    #    b
    #    c d
    #    e f

    # --  CASE 4: There is exactly one trailing single-cell row.
    #      -> core table is dropped. trailing single-cell row is emitted as Title
    #         (this is the behavior in the reported bug)
    #
    #    a b  -> [ Title(e) ]
    #    c d
    #      e

    # --  CASE 5: There are two-or-more trailing single-cell rows.
    #      -> core table is dropped. trailing single-cell rows are each emitted as a Title
    #
    #    a b  -> [ Title(e),
    #    c d       Title(f) ]
    #      e
    #      f

    # --  CASE 6: There are exactly one each leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #    b c       Table(b, c, d, e),
    #    d e       Title(f) ]
    #    f

    # --  CASE 7: There are two leading and one trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #    a    -> [ Title(a),
    #    b         Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g) ]
    #      g

    # --  CASE 8: There are two-or-more leading and trailing single-cell rows.
    #      -> core table is correctly identified, leading and trailing single-cell rows are each
    #         emitted as a Title.
    #
    #      a  -> [ Title(a),
    #      b       Title(b),
    #    c d       Table(c, d, e, f),
    #    e f       Title(g),
    #    g         Title(h) ]
    #    h

    # --  CASE 9: Single-row subtable, no single-cell rows above or below.
    #      -> First cell is mistakenly emitted as title, remaining cells are dropped.
    #
    #    a b c  -> [ Title(a) ]

    # --  CASE 10: Single-row subtable with one leading single-cell row.
    #      -> Leading single-row cell is correctly identified as title, core-table is mis-identified
    #         as a `Title` and truncated.
    #
    #    a      -> [ Title(a),
    #    b c d       Title(b) ]
```
2024-02-08 23:33:41 +00:00
Amanda Cameron
0584e1d031
chore: fix infer_table bug (#1833)
Carrying `skip_infer_table_types` to `infer_table_structure` in
partition flow. Now PPT/X, DOC/X, etc. Table elements should not have a
`text_as_html` field.

Note: I've continued to exclude this var from partitioners that go
through html flow, I think if we've already got the html it doesn't make
sense to carry the infer variable along, since we're not 'infer-ing' the
html table in these cases.


TODO:
  add unit tests

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: amanda103 <amanda103@users.noreply.github.com>
2023-10-24 00:11:53 +00:00
Léa
89fa88f076
fix: stop csv and tsv dropping the first line of the file (#1530)
The current code assumes the first line of csv and tsv files are a
header line. Most csv and tsv files don't have a header line, and even
for those that do, dropping this line may not be the desired behavior.

Here is a snippet of code that demonstrates the current behavior and the
proposed fix

```
import pandas as pd
from lxml.html.soupparser import fromstring as soupparser_fromstring

c1 = """
    Stanley Cups,,
    Team,Location,Stanley Cups
    Blues,STL,1
    Flyers,PHI,2
    Maple Leafs,TOR,13
    """

f = "./test.csv"
with open(f, 'w') as ff:
    ff.write(c1)
  
print("Suggested Improvement Keep First Line") 
table = pd.read_csv(f, header=None)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)

print("\n\nOriginal Looses First Line") 
table = pd.read_csv(f)
html_text = table.to_html(index=False, header=False, na_rep="")
text = soupparser_fromstring(html_text).text_content()
print(text)
```

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Yao You <theyaoyou@gmail.com>
Co-authored-by: Yao You <yao@unstructured.io>
2023-10-16 17:59:35 -05:00
qued
cf31c9a2c4
fix: use nx to avoid recursion limit (#1761)
Fixes recursion limit error that was being raised when partitioning
Excel documents of a certain size.

Previously we used a recursive method to find subtables within an excel
sheet. However this would run afoul of Python's recursion depth limit
when there was a contiguous block of more than 1000 cells within a
sheet. This function has been updated to use the NetworkX library which
avoids Python recursion issues.

* Updated `_get_connected_components` to use `networkx` graph methods
rather than implementing our own algorithm for finding contiguous groups
of cells within a sheet.
* Added a test and example doc that replicates the `RecursionError`
prior to the change.
*  Added `networkx` to `extra_xlsx` dependencies and `pip-compile`d.

#### Testing:
The following run from a Python terminal should raise a `RecursionError`
on `main` and succeed on this branch:
```python
import sys
from unstructured.partition.xlsx import partition_xlsx
old_recursion_limit = sys.getrecursionlimit()
try:
    sys.setrecursionlimit(1000)
    filename = "example-docs/more-than-1k-cells.xlsx"
    partition_xlsx(filename=filename)
finally:
    sys.setrecursionlimit(old_recursion_limit)

```
Note: the recursion limit is different in different contexts. Checking
my own system, the default in a notebook seems to be 3000, but in a
terminal it's 1000. The documented Python default recursion limit is
1000.
2023-10-14 19:38:21 +00:00
Steve Canny
d726963e42
serde tests round-trip through JSON (#1681)
Each partitioner has a test like `test_partition_x_with_json()`. What
these do is serialize the elements produced by the partitioner to JSON,
then read them back in from JSON and compare the before and after
elements.

Because our element equality (`Element.__eq__()`) is shallow, this
doesn't tell us a lot, but if we take it one more step, like
`List[Element] -> JSON -> List[Element] -> JSON` and then compare the
JSON, it gives us some confidence that the serialized elements can be
"re-hydrated" without losing any information.

This actually showed up a few problems, all in the
serialization/deserialization (serde) code that all elements share.
2023-10-12 19:47:55 +00:00
John
9500d04791
detect document language across all partitioners (#1627)
### Summary
Closes #1534 and #1535
Detects document language using `langdetect` package. 
Creates new kwargs for user to set the document language (`languages`)
or detect the language at the element level instead of the default
document level (`detect_language_per_element`)

---------

Co-authored-by: shreyanid <42684285+shreyanid@users.noreply.github.com>
Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Coniferish <Coniferish@users.noreply.github.com>
Co-authored-by: cragwolfe <crag@unstructured.io>
Co-authored-by: Austin Walker <austin@unstructured.io>
2023-10-11 01:47:56 +00:00
Klaijan
0a65fc2134
feat: xlsx subtable extraction (#1585)
**Executive Summary**
Unstructured is now able to capture subtables, along with other text
element types within the `.xlsx` sheet.

**Technical Details**
- The function now reads the excel *without* header as default
- Leverages the connected components search to find subtables within the
sheet. This search is based on dfs search
- It also handle the overlapping table or text cases
- Row with only single cell of data is considered not a table, and
therefore passed on the determine the element type as text
- In connected elements, it is possible to have table title, header, or
footer. We run the count for the first non-single empty rows from top
and bottom to determine those text

**Result**
This table now reads as:
<img width="747" alt="image"
src="https://github.com/Unstructured-IO/unstructured/assets/2177850/6b8e6d01-4ca5-43f4-ae88-6104b0174ed2">

```
[
    {
        "type": "Title",
        "element_id": "3315afd97f7f2ebcd450e7c939878429",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Financial performance"
    },
    {
        "type": "Table",
        "element_id": "17f5d512705be6f8812e5dbb801ba727",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "3315afd97f7f2ebcd450e7c939878429",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Quarterly revenue</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <td>Group financial performance</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>2</td>\n    </tr>\n    <tr>\n      <td>Segmental results</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>3</td>\n    </tr>\n    <tr>\n      <td>Segmental analysis</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>4</td>\n    </tr>\n    <tr>\n      <td>Cash flow</td>\n      <td>FY 22</td>\n      <td>FY 23</td>\n      <td></td>\n      <td>5</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nQuarterly revenue\nNine quarters to 30 June 2023\n\n\n1\n\n\nGroup financial performance\nFY 22\nFY 23\n\n2\n\n\nSegmental results\nFY 22\nFY 23\n\n3\n\n\nSegmental analysis\nFY 22\nFY 23\n\n4\n\n\nCash flow\nFY 22\nFY 23\n\n5\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "8a9db7161a02b427f8fda883656036e1",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Operational metrics"
    },
    {
        "type": "Table",
        "element_id": "d5d16f7bf9c7950cd45fae06e12e5847",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "8a9db7161a02b427f8fda883656036e1",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Mobile customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>6</td>\n    </tr>\n    <tr>\n      <td>Fixed broadband customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>7</td>\n    </tr>\n    <tr>\n      <td>Marketable homes passed</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>8</td>\n    </tr>\n    <tr>\n      <td>TV customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>9</td>\n    </tr>\n    <tr>\n      <td>Converged customers</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>10</td>\n    </tr>\n    <tr>\n      <td>Mobile churn</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>11</td>\n    </tr>\n    <tr>\n      <td>Mobile data usage</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>12</td>\n    </tr>\n    <tr>\n      <td>Mobile ARPU</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>13</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nMobile customers\nNine quarters to 30 June 2023\n\n\n6\n\n\nFixed broadband customers\nNine quarters to 30 June 2023\n\n\n7\n\n\nMarketable homes passed\nNine quarters to 30 June 2023\n\n\n8\n\n\nTV customers\nNine quarters to 30 June 2023\n\n\n9\n\n\nConverged customers\nNine quarters to 30 June 2023\n\n\n10\n\n\nMobile churn\nNine quarters to 30 June 2023\n\n\n11\n\n\nMobile data usage\nNine quarters to 30 June 2023\n\n\n12\n\n\nMobile ARPU\nNine quarters to 30 June 2023\n\n\n13\n\n\n"
    },
    {
        "type": "Title",
        "element_id": "f97e9da0e3b879f0a9df979ae260a5f7",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "Other"
    },
    {
        "type": "Table",
        "element_id": "080e1a745a2a3f2df22b6a08d33d59bb",
        "metadata": {
            "filename": "vodafone.xlsx",
            "file_directory": "example-docs",
            "last_modified": "2023-10-03T17:51:34",
            "filetype": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "parent_id": "f97e9da0e3b879f0a9df979ae260a5f7",
            "languages": [
                "spa",
                "ita"
            ],
            "page_number": 1,
            "page_name": "Index",
            "text_as_html": "<table border=\"1\" class=\"dataframe\">\n  <tbody>\n    <tr>\n      <td>Topic</td>\n      <td>Period</td>\n      <td></td>\n      <td></td>\n      <td>Page</td>\n    </tr>\n    <tr>\n      <td>Average foreign exchange rates</td>\n      <td>Nine quarters to 30 June 2023</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n    <tr>\n      <td>Guidance rates</td>\n      <td>FY 23/24</td>\n      <td></td>\n      <td></td>\n      <td>14</td>\n    </tr>\n  </tbody>\n</table>"
        },
        "text": "\n\n\nTopic\nPeriod\n\n\nPage\n\n\nAverage foreign exchange rates\nNine quarters to 30 June 2023\n\n\n14\n\n\nGuidance rates\nFY 23/24\n\n\n14\n\n\n"
    }
]
```
2023-10-04 13:30:23 -04:00
Klaijan
675a10ea69
fix: update test_json to not use auto partition (#1187)
Update `test_json` to not use auto partition due to dependencies. Previously, to run `test_json` requires full requirements installation library to read file types, including but not limited to, docx, pptx, as well as others. Therefore the test will raise error with base installation. With the update, this fix also add to other test files to check its invariant with `elements_to_json`.
2023-08-29 16:59:26 -04:00
Newel H
e4aa7373e2
test: create CI pipelines for verifying base and extras pass respective tests (#1137)
**Summary**
Closes #747
* Create CI Pipeline for running text, xml, email, and html doc tests
against the library installed without extras
* Create CI Pipeline for running each library extra against their
respective tests
2023-08-19 12:56:13 -04:00