unstructured

yujunjun/unstructured

Fork 0

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-12-19 03:10:21 +00:00

History

Klaijan 00181b88df

feat: pdf auto strategy groups broken numbered and bullet list items(#1393 )

**Summary**
Adds logic to combine broken numbered list for pdf fast strategy.

**Details**
Previously the document reads the numbered list items part of the
`layout-parser-paper-fast.pdf` file as:

```
'1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character'
'recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that'
'underlies the oﬀ-the-shelf usage'
'3. Comprehensive tools for eﬃcient document image data annotation and model'
'tuning to support diﬀerent levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

Now it reads:

```
'1. An oﬀ-the-shelf toolkit for applying DL models for layout detection, character recognition, and other DIA tasks (Section 3)'
'2. A rich repository of pre-trained neural network models (Model Zoo) that underlies the oﬀ-the-shelf usage'
'3. Comprehensive tools for eﬃcient document image data annotation and model' tuning to support diﬀerent levels of customization'
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
```

The added logic leverages `ElementType` and `coordinates` to determine
whether the following lines is a part of the previously detected
`ListItem` or not.

**Test**
Add test that checks the element length less than original version with
broken numbered list. The test also checks whether the first detected
numbered list ends with previously broken line.

---------

Co-authored-by: ryannikolaidis <1208590+ryannikolaidis@users.noreply.github.com>
Co-authored-by: Klaijan <Klaijan@users.noreply.github.com>

2023-09-13 21:30:06 +00:00

layout-parser-paper.pdf.json

feat: pdf auto strategy groups broken numbered and bullet list items(#1393 )

2023-09-13 21:30:06 +00:00