7 Commits

Author SHA1 Message Date
grungyfeline998
956f04d770
feat: detect filetype with extension if libmagic is unavailable (#268)
* included the previous PR changes and verified black

* resolved the issues mentioned

* make tidy and add tests

---------

Co-authored-by: Matt Robinson <mrobinson@unstructured.io>
Co-authored-by: Matt Robinson <mrobinson@unstructuredai.io>
2023-02-24 15:23:29 +00:00
Matt Robinson
47ab808e0f
feat: file info dataframe from filenames and file content (#204)
* added function for exploring a list of files

* file info from file contents

* added tests for file info from contents

* bump version and add tests

* add dev to version
2023-02-08 20:48:39 +00:00
Matt Robinson
26a5546152
fix: handle xml filetype detection on amazon linux (#173)
* fix: handle xml filetype detection on amazon linux

* option for html or xml

* fix typo

* back to dev tag
2023-01-25 11:20:01 -05:00
Matt Robinson
74ce2ae6e5
fix: update detect_filetype to properly handle older office files (#161) 2023-01-18 11:18:20 -05:00
Matt Robinson
eba4c80b1e
feat: get_directory_file_info for exploring a directory of files (#142)
* added python-pptx to requirements

* added filetype detection for powerpoint

* add more filetypes to detect

* more tests

* added tests for filetype

* reorder document types

* tests for get_directory_file_info

* added docs for get_directory_file_info

* bump version

* Word -> Office

* added test for filetype

* add group by filetype example
2023-01-11 12:40:50 -05:00
Matt Robinson
5376bc510f
feat: generic partition brick with filetype detection (#132)
* add python-magic

* first pass on filetype detection

* tests for filetype detection

* more tests for file detection

* added tests for error conditions

* install libmagic dev in github

* libmagic install instructions

* pattern for checking email files

* support reading .eml in rb mode

* add auto partition function

* auto tests for emal

* auto tests for docx

* added tests for html

* add pdf and html tests

* linting, linting, linting

* added docs for auto partitioning

* update readme with generic partition brick

* bumped version

* added test for bad type

* detect .docx files from application/octet-stream

* linting, linting, linting

* identify xlsx from octet stream

* install poppler in ci

* fix mocks; test for unknown type

* install poppler utils

* install in one line

* only poppler-utils

* file extension logic from application/octet-stream

* install local inference for ci

* install detectron2

* removing unused dockerfile
2023-01-09 16:15:14 -05:00
Matt Robinson
b14f6ac9bd
feat: extract metadata from .docx, .xlsx, and .jpg (#113)
* add python-docx dependency

* added function for extracting metadata from word documents

* add openpyxl

* added get_jpg_metadata; fixed typing

* bump changelog

* added pillow to dependencies
2022-12-26 09:34:36 -05:00