cragwolfe 4c13d12dc3
fix: prevent spammy ListItems from images and PDFs (#1210)
The issue was that, for a block detected in an image such as:

![image](https://github.com/Unstructured-IO/unstructured/assets/28578599/a955bf2c-a683-4cef-a19f-546f9378835a)

(the full image is at
https://utic-dev-tech-fixtures.s3.us-east-2.amazonaws.com/pastebin//Users/cragwolfe/tmp/IRS-form-1987.png),
many `ListItem` elements were extracted that added little value to the
output, assuming the layout model had classified the block as type List.
This file is also used in the ingest tests; the prior output is here:


https://github.com/Unstructured-IO/unstructured/blob/483b09b/test_unstructured_ingest/expected-structured-output/azure/IRS-form-1987.png.json#L93-L280

Test Instructions:

1. Run the following snippet:

```
import json
import os
from datetime import datetime

from unstructured.__version__ import __version__
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

filename = "/opt/home/tmp/IRS-form-1987.png"
output_dir = "/opt/home/tmp/json"
base_name_with_ext = os.path.basename(filename)
output_filename_part = os.path.join(output_dir, base_name_with_ext)

# make sure the output directory exists before writing to it
os.makedirs(output_dir, exist_ok=True)

print(f"unstructured version: {__version__}")
# for strategy in ("hi_res", "fast", "auto"):
for strategy in ("hi_res",):
    d1 = datetime.now()
    elements = partition(filename=filename, strategy=strategy)
    elems_as_dicts = json.loads(elements_to_json(elements, indent=2))

    # strip out metadata for the sake of more readable results
    for element_dict in elems_as_dicts:
        del element_dict["metadata"]
    json_filename = f"{output_filename_part}-{strategy}.json"

    with open(json_filename, "w") as jsonf:
        json.dump(elems_as_dicts, jsonf, indent=2)
    d2 = datetime.now()
    print(f"num elements for {strategy}: {len(elements)}")
    print(f"time elapsed     {strategy}: {(d2 - d1).total_seconds()}")
```
Update the `filename` and `output_dir` paths for your local environment
before running it.

2. Open the JSON file that was written to your `output_dir`, named
`IRS-form-1987.png-hi_res.json`.

You should see the new element:
```
  {
    "type": "ListItem",
    "element_id": "7d3ba328af2c20ddeef5d2c1d270f60f",
    "text": "Long-term contracts.\u2014If you are required to change your method of accounting for long-term contracts under section 460, see Notice 87-61 (9/21/87), 1987-38 IRB 40, for the notification procedures that must be followed Other methods. \u2014Unless the Service has Published a regulation or procedure to the contrary, all other changes in accounting methods required by the Act are automatically considered to be approved by the Commissioner. Examples of method changes automatically approved by the Commissioner are those changes required to effect: (1) the repeal of the reserve method for bad debts of taxpayers other than financial institutions (Act section 805); (2) the repeal of the installment method for sales under a revolving credit plan (Act section 812); (3) the Inclusion of income attributable to the sale or furnishing of utility services no later than the year in which the services were provided to customers (Act section 821); and (4) the repeal of the deduction for qualified discount coupons (Act section 823). Do not file Form 3115 for these changes."
  },
```
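
To check the result without reading the whole file by eye, a minimal sketch like the one below tallies element types in the written JSON. The helper names `count_element_types` and `count_types_in_file` are hypothetical (not part of `unstructured`), and the only assumption is that each element dict carries a `"type"` key, as in the element shown above.

```python
import json
from collections import Counter


def count_element_types(elems_as_dicts):
    """Return a Counter mapping element type (e.g. "ListItem") to its count."""
    return Counter(element["type"] for element in elems_as_dicts)


def count_types_in_file(json_filename):
    """Load a JSON list of element dicts from disk and tally its element types."""
    with open(json_filename) as jsonf:
        return count_element_types(json.load(jsonf))


if __name__ == "__main__":
    # Tiny in-memory sample standing in for real output (illustrative data only).
    sample = [
        {"type": "Title", "text": "a title"},
        {"type": "ListItem", "text": "a list item"},
        {"type": "NarrativeText", "text": "some text"},
    ]
    print(count_element_types(sample)["ListItem"])
```

Pointing `count_types_in_file` at the prior expected output linked above and at the newly written file should show a lower `ListItem` count for the new one.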
2023-08-26 21:01:07 -07:00