unstructured/example-docs
Newel H e34396b2c9
Feat: Native hierarchies for elements from pptx documents (#1616)
## Summary
**Improve title detection in pptx documents** The default title
textboxes on a pptx slide are now categorized as titles.
**Improve hierarchy detection in pptx documents** List items and other
slide text are now properly nested under the slide title. This enables
better chunking of pptx documents.

Hierarchy detection is improved by determining category depth via the
following:
- Check whether the paragraph item has a level parameter via the python-pptx
paragraph. If so, use the paragraph level as the category_depth level.
- If the shape being checked is a title shape and the item is not a
bullet or email, the element is set as a Title with a depth
corresponding to the paragraph's position within the shape (e.g. the first
line of the title shape is depth 0, the second is depth 1, etc.).
- If the shape is not a title shape but the paragraph is a title, the
depth is set to level + 1, so that all paragraph titles have a depth of at
least 1, which places them below the slide title element.
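
The depth rules above can be sketched in plain Python. This is only an illustration, not the code from this change: `ParagraphInfo`, `assign_depth`, and the boolean flags are hypothetical names, and the order in which the rules are checked, as well as the fallback categories for non-title text, are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ParagraphInfo:
    """Hypothetical stand-in for the facts known about one pptx paragraph."""
    text: str
    level: int = 0                  # indentation level reported by python-pptx
    is_bullet: bool = False
    is_email: bool = False
    looks_like_title: bool = False  # result of some title heuristic (assumed)


def assign_depth(para: ParagraphInfo, index_in_shape: int,
                 shape_is_title: bool) -> Tuple[str, int]:
    """Return (category, category_depth) for a paragraph."""
    # Title shape: each line becomes a Title whose depth is its position
    # within the shape (first line 0, second line 1, ...).
    if shape_is_title and not (para.is_bullet or para.is_email):
        return "Title", index_in_shape
    # Title-like paragraph outside the title shape: level + 1 keeps it
    # at depth >= 1, below the slide title.
    if para.looks_like_title:
        return "Title", para.level + 1
    # Everything else uses the paragraph level directly (assumed fallback).
    category = "ListItem" if para.is_bullet else "NarrativeText"
    return category, para.level
```

For example, under these assumptions a level-0 heading inside a body text box gets depth 1, so it nests under the slide title at depth 0.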
2023-10-05 12:55:45 -04:00

Example Docs

The sample docs directory contains the following files:

  • example-10k.html - A 10-K SEC filing in HTML format
  • layout-parser-paper.pdf - A PDF copy of the layout parser paper
  • factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test out the parsers in the library. In addition, here are instructions for pulling in some sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl command. Note that you must set a user agent in the request header, or the SEC site will reject your request.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
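
If curl is not available, the same request can be sketched with the Python standard library. This is a minimal sketch, not part of the repository: `build_sec_request` and `download` are hypothetical helper names, and the organization and email values are placeholders you would replace with your own.

```python
import urllib.request

SEC_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_sec_request(organization: str, email: str,
                      url: str = SEC_URL) -> urllib.request.Request:
    """Build a request whose User-Agent is '<organization> <email>'."""
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request: urllib.request.Request, destination: str) -> None:
    """Fetch the request and write the response body to destination."""
    with urllib.request.urlopen(request) as response, open(destination, "wb") as out:
        out.write(response.read())


# Example usage (performs a network request, so it is left commented out):
# download(build_sec_request("Example Org", "me@example.com"),
#          "0001171843-21-001344.txt")
```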

You can parse this document using the HTML parser.