unstructured/example-docs
Steve Canny 3fe5c094fa
rfctr(file): refactor detect_filetype() (#3429)
**Summary**
In preparation for fixing a cluster of bugs with automatic file-type
detection and paving the way for some reliability improvements, refactor
`unstructured.file_utils.filetype` module and improve thoroughness of
tests.

**Additional Context**
Factor the type-recognition process into three distinct strategies that
are attempted in order of preference. Type-recognition falls to the next
strategy when the one before it is not applicable or cannot determine the
file-type. This provides a clear basis for organizing the code and tests
at the top level.

Consolidate the existing tests around these strategies, adding
additional cases to achieve better coverage.

Several bugs were uncovered in the process. Small ones were fixed here;
bigger ones will be remedied in following PRs.
2024-07-23 23:18:48 +00:00

Example Docs

The sample docs directory contains the following files:

  • example-10k.html - A 10-K SEC filing in HTML format
  • layout-parser-paper.pdf - A PDF copy of the layout parser paper
  • factbook.xml/factbook.xsl - Example XML/XSL files that you can use to test stylesheets

These documents can be used to test the parsers in the library. The sections below explain how to pull in sample docs that are too big to store in the repo.

XBRL 10-K

You can get an example 10-K in inline XBRL format using the following curl command. Note that you need to set the user agent in the header, or the SEC site will reject your request.

curl -O \
  -A '${organization} ${email}' \
  https://www.sec.gov/Archives/edgar/data/311094/000117184321001344/0001171843-21-001344.txt
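For scripted downloads, the same request could be made from Python with only the standard library. This is a minimal sketch, not part of the repo: the helper names `build_request` and `download` are hypothetical, and the organization and email values are placeholders you would replace with your own.

```python
import shutil
import urllib.request

FILING_URL = (
    "https://www.sec.gov/Archives/edgar/data/311094/"
    "000117184321001344/0001171843-21-001344.txt"
)


def build_request(organization, email, url=FILING_URL):
    """Build a request whose User-Agent identifies the caller (hypothetical helper)."""
    user_agent = f"{organization} {email}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(request, dest):
    """Stream the response body to a local file (hypothetical helper)."""
    with urllib.request.urlopen(request) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


if __name__ == "__main__":
    # Placeholder identity; substitute your own organization and email.
    req = build_request("Example Org", "someone@example.com")
    download(req, "0001171843-21-001344.txt")
```

Keeping request construction separate from the network call makes it possible to check the header without contacting the SEC site.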

You can parse this document using the HTML parser.
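As a rough illustration, parsing the downloaded file might look like the sketch below. It assumes the `unstructured` package is installed and that the file was saved under the name used in the download step; `parse_filing` and `summarize_categories` are hypothetical helpers written for this example, not library functions.

```python
from collections import Counter


def summarize_categories(elements):
    """Count elements by category, falling back to the class name."""
    return Counter(
        getattr(element, "category", type(element).__name__) for element in elements
    )


def parse_filing(path):
    """Partition the filing with the HTML parser (hypothetical wrapper)."""
    # Imported lazily so the summary helper works without unstructured installed.
    from unstructured.partition.html import partition_html

    return partition_html(filename=str(path))


if __name__ == "__main__":
    # Assumes the filing was downloaded into the current directory.
    elements = parse_filing("0001171843-21-001344.txt")
    for category, count in summarize_categories(elements).most_common():
        print(category, count)
```

Printing a per-category count is a quick way to sanity-check that the parser produced a sensible mix of elements before inspecting individual ones.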