unstructured

yujunjun/unstructured

mirror of https://github.com/Unstructured-IO/unstructured.git synced 2025-07-31 04:46:07 +00:00

Commit Graph

Author	SHA1	Message	Date

Christine Straub

48bdf94656

feat: `partition_pdf()` support language specification for PaddleOCR (#3400)

Closes #3159.

This PR extends language specification capability to `PaddleOCR` in
addition to `TesseractOCR`. Users can now specify OCR languages for both
OCR engines when using `partition_pdf()`.

### Testing

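```
import os

from unstructured.partition.pdf import partition_pdf

# Select PaddleOCR as the OCR agent instead of the default.
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"

file_path = "<file_path>"  # placeholder: path to the PDF under test
strategy = "hi_res"  # assumption: the PR does not say which strategy was used

elements = partition_pdf(
    filename=file_path,
    strategy=strategy,
    languages=["chi_sim"],  # Chinese, simplified
    infer_table_structure=True,
)
```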
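
Tesseract and PaddleOCR name languages differently (`chi_sim` versus `ch`, for example), so supporting one `languages` argument for both engines implies some translation step. The following is only a sketch of what such a step could look like. The table and the function `to_paddle_language` are hypothetical names for illustration, not the code merged in this PR, and the table covers just a handful of languages.

```python
from typing import Iterable

# Hypothetical lookup from Tesseract-style codes to PaddleOCR-style codes.
# Deliberately incomplete; it only illustrates the idea.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
    "fra": "french",
    "deu": "german",
}


def to_paddle_language(languages: Iterable[str], default: str = "en") -> str:
    """Return the PaddleOCR code for the first recognized language.

    PaddleOCR is configured with a single language, so this sketch picks the
    first entry it can translate and falls back to `default` otherwise.
    """
    for code in languages:
        mapped = TESSERACT_TO_PADDLE.get(code)
        if mapped is not None:
            return mapped
    return default


print(to_paddle_language(["chi_sim"]))
```

Falling back to a default rather than raising is a design choice made for this sketch; a stricter variant could reject unknown codes.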

2024-07-16 22:19:25 +00:00

Christine Straub

493bfccddd

fix: exception handling for `OCRAgent.get_agent()` (#3335)

The purpose of this PR is to help investigate
https://github.com/Unstructured-IO/unstructured/issues/3202.

2024-07-03 17:58:04 +00:00

Steve Canny

cb55245f70

rfctr: extract `OCRAgent.get_agent()` out of PDF subtree (#2965)

**Summary**
File-types other than PDF need to use OCR on extracted images. Extract
`OCRAgent.get_agent()` such that any file-type partitioner can use it
without risking dependency on PDF-only extras.
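
One common way to let every partitioner share an OCR agent without importing PDF-only extras is to resolve the agent class lazily from a dotted path such as the `OCR_AGENT` value. The sketch below illustrates that pattern together with explicit error handling for a bad path. The helper `load_agent_class` is a hypothetical name and this is not the library's actual implementation of `OCRAgent.get_agent()`.

```python
import importlib


def load_agent_class(dotted_path: str) -> type:
    """Import and return the class named by a 'package.module.ClassName' path.

    The import happens only when this function is called, so merely importing
    the module that defines it pulls in no optional OCR dependencies.
    """
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name or not class_name:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except (ImportError, AttributeError) as exc:
        # Report the configured path so a misconfiguration is easy to spot.
        raise RuntimeError(f"could not load OCR agent {dotted_path!r}") from exc


# Demonstration with a standard-library class standing in for an OCR agent.
print(load_agent_class("collections.OrderedDict").__name__)
```

A caller would typically read the dotted path from configuration and instantiate the returned class; wrapping the import error keeps the failing path visible in the traceback.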

2024-05-03 19:39:22 +00:00

3 Commits