mirror of
https://github.com/Unstructured-IO/unstructured.git
synced 2026-01-05 20:00:56 +00:00
0.8.2-dev7
Enhancements
- Additional tests and refactor of JSON detection.
- Update functionality to retrieve image metadata from a page for `document_to_element_list`
- Links are now tracked in `partition_html` output.
- Set the file's current position to the beginning after reading the file in `convert_to_bytes`
- Add `min_partition` kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
- Add slide notes to pptx
- Add `--encoding` directive to ingest
- Improve json detection by `detect_filetype`
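
The `convert_to_bytes` entry above describes resetting a file's position once its contents have been read. A minimal sketch of that pattern using only the standard library follows; the helper name `read_and_rewind` is hypothetical and this is not the library's implementation.

```python
import io
from typing import IO


def read_and_rewind(file: IO[bytes]) -> bytes:
    """Read everything from a file-like object, then reset its position."""
    data = file.read()
    # Without this seek, a later read() on the same object would return b"".
    file.seek(0)
    return data


buffer = io.BytesIO(b"example payload")
payload = read_and_rewind(buffer)
```

After the call, `buffer` can be handed to another reader as if it had never been consumed.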
Features
- Adds Onedrive connector.
- Add Confluence connector for ingest cli to pull the body text from all documents from all spaces in a confluence domain.
Fixes
- Use the `image_metadata` property of the `PageLayout` instance to get the page image info in the `document_to_element_list`
- Add functionality to write images to computer storage temporarily instead of keeping them in memory for `ocr_only` strategy
- Add functionality to convert a PDF in small chunks of pages at a time for `ocr_only` strategy
- Adds `.txt`, `.text`, and `.tab` to list of extensions to check if file has a `text/plain` MIME type.
- Enables filters to be passed to `partition_doc` so it doesn't error with LibreOffice7.
- Removed old error message that's superseded by `requires_dependencies`.
0.8.1
Enhancements
- Add support for Python 3.11
Features
Fixes
- Fixed an issue where the `auto` strategy detected a scanned document as having extractable text and used the `fast` strategy, resulting in no output.
- Fix list detection in MS Word documents.
- Don't instantiate an element with a coordinate system when there isn't a way to get its location data.
0.8.0
Enhancements
- Allow model used for hi res pdf partition strategy to be chosen when called.
- Updated inference package
Features
- Add `metadata_filename` parameter across all partition functions
Fixes
- Update to ensure `convert_to_dataframe` grabs all of the metadata fields.
- Adjust encoding recognition threshold value in `detect_file_encoding`
- Fix KeyError when `isd_to_elements` doesn't find a type
- Fix `_output_filename` for local connector, allowing single files to be written correctly to the disk
- Fix for cases where an invalid encoding is extracted from an email header.
BREAKING CHANGES
- Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the `coordinates` attribute of the element's metadata.
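
For callers migrating across this change, a small accessor can tolerate both layouts. This is a sketch under assumptions: `get_coordinates` is a hypothetical helper, the old layout is assumed to be a top-level `coordinates` attribute, and the stand-in objects below only mimic the attribute layout described above rather than the library's element classes.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_coordinates(element: Any) -> Optional[Any]:
    """Return location data from the new layout, falling back to the old one."""
    metadata = getattr(element, "metadata", None)
    coords = getattr(metadata, "coordinates", None)
    if coords is not None:
        return coords
    # Assumed pre-0.8.0 layout: location stored directly on the element.
    return getattr(element, "coordinates", None)


# Stand-ins for illustration only.
new_style = SimpleNamespace(metadata=SimpleNamespace(coordinates=((0, 0), (10, 5))))
old_style = SimpleNamespace(coordinates=((1, 1), (2, 2)), metadata=SimpleNamespace())
no_location = SimpleNamespace(metadata=SimpleNamespace())
```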
0.7.12
Enhancements
- Adds `include_metadata` kwarg to `partition_doc`, `partition_docx`, `partition_email`, `partition_epub`, `partition_json`, `partition_msg`, `partition_odt`, `partition_org`, `partition_pdf`, `partition_ppt`, `partition_pptx`, `partition_rst`, and `partition_rtf`
Features
- Add Elasticsearch connector for ingest cli to pull specific fields from all documents in an index.
- Adds Dropbox connector
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.11
Enhancements
- More deterministic element ordering when using `hi_res` PDF parsing strategy (from unstructured-inference bump to 0.5.4)
- Make large model available (from unstructured-inference bump to 0.5.3)
- Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
- `partition_email` and `partition_msg` will now process attachments if `process_attachments=True` and an attachment partitioning function is passed through with `attachment_partitioner=partition`.
Features
Fixes
- Fix tests that call unstructured-api by passing through an api-key
- Fixed page breaks being given (incorrect) page numbers
- Fix skipping download on ingest when a source document exists locally
0.7.10
Enhancements
- Adds a `max_partition` parameter to `partition_text`, `partition_pdf`, `partition_email`, `partition_msg` and `partition_xml` that sets a limit for the size of individual document elements. Defaults to `1500` for everything except `partition_xml`, which has a default value of `None`.
- DRY connector refactor
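
To picture what a `max_partition` limit does, here is a minimal sketch that splits text into pieces no longer than the limit, breaking only on whitespace. `split_on_words` is a hypothetical helper written for illustration; it is not how the library implements the parameter, and keeping an over-long single word whole is an assumption of this sketch.

```python
from typing import List


def split_on_words(text: str, max_partition: int = 1500) -> List[str]:
    """Split text into chunks of at most max_partition characters on word boundaries."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # Either the word fits, or it starts a new chunk on its own.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


pieces = split_on_words("aaa bbb ccc", max_partition=7)
```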
Features
- `hi_res` model for pdfs and images is selectable via environment variable.
Fixes
- CSV check now ignores escaped commas.
- Fix for filetype exploration util when file content does not have a comma.
- Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like `-------` as list items.
- Fix pre tag parsing for `partition_html`
- Fix lookup error for annotated Arabic and Hebrew encodings
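
The negative-lookahead fix above can be illustrated with a small regular expression. The pattern below is a hypothetical stand-in, not the library's actual bullet pattern: it accepts a bullet character only when the next character is not another dash, so a run of dashes is not mistaken for a list item.

```python
import re

# Hypothetical pattern: optional indent, one bullet character, then a
# negative lookahead rejecting an immediately following dash.
BULLET_RE = re.compile(r"^\s*[-*\u2022](?!-)\s*\S")


def looks_like_list_item(line: str) -> bool:
    """Return True when the line starts with a bullet followed by content."""
    return BULLET_RE.match(line) is not None
```

With this pattern, `looks_like_list_item("- item")` is true while `looks_like_list_item("-------")` is false.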
0.7.9
Enhancements
- Improvements to string check for leafs in `partition_xml`.
- Adds --partition-ocr-languages to unstructured-ingest.
Features
- Adds `partition_org` for processed Org Mode documents.
Fixes
0.7.8
Enhancements
Features
- Adds Google Cloud Service connector
Fixes
- Updates the `parse_email` for `partition_eml` so that `unstructured-api` passes the smoke tests
- `partition_email` now works if there is no message content
- Updates the `"fast"` strategy for `partition_pdf` so that it's able to recursively
- Adds recursive functionality to all fsspec connectors
- Adds generic --recursive ingest flag
0.7.7
Enhancements
- Adds functionality to replace the `MIME` encodings for `eml` files with one of the common encodings if a `unicode` error occurs
- Adds missed file-like object handling in `detect_file_encoding`
- Adds functionality to extract charset info from `eml` files
Features
- Added coordinate system class to track coordinate types and convert to different coordinate
Fixes
- Adds an `html_assemble_articles` kwarg to `partition_html` to enable users to control whether content outside of `<article>` tags is captured when `<article>` tags are present.
- Check for the `xml` attribute on `element` before looking for pagebreaks in `partition_docx`.
0.7.6
Enhancements
- Convert fast strategy to ocr_only for images
- Adds support for page numbers in `.docx` and `.doc` when user or renderer created page breaks are present.
- Adds retry logic for the unstructured-ingest Biomed connector
Features
- Provides users with the ability to extract additional metadata via regex.
- Updates `partition_docx` to include headers and footers in the output.
- Create `partition_tsv` and associated tests. Make additional changes to `detect_filetype`.
Fixes
- Remove fake api key in test `partition_via_api` since we now require valid/empty api keys
- Page number defaults to `None` instead of `1` when page number is not present in the metadata. A page number of `None` indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
- Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide in case the shape.top and shape.left attributes are `None`.
0.7.5
Enhancements
- Adds functionality to sort elements in `partition_pdf` for `fast` strategy
- Adds ingest tests with `--fast` strategy on PDF documents
- Adds --api-key to unstructured-ingest
Features
- Adds `partition_rst` for processed ReStructured Text documents.
Fixes
- Adds handling for emails that do not have a datetime to extract.
- Adds pdf2image package as core requirement of unstructured (with no extras)
0.7.4
Enhancements
- Allows passing kwargs to request data field for `partition_via_api` and `partition_multiple_via_api`
- Enable MIME type detection if libmagic is not available
- Adds handling for empty files in `detect_filetype` and `partition`.
Features
Fixes
- Resolve `grpcio` import issue on `weaviate.schema.validate_schema` for python 3.9 and 3.10
- Remove building `detectron2` from source in Dockerfile
0.7.3
Enhancements
- Update IngestDoc abstractions and add data source metadata in ElementMetadata
Features
Fixes
- Pass `strategy` parameter down from `partition` for `partition_image`
- Filetype detection if a CSV has a `text/plain` MIME type
- `convert_office_doc` no longer prints file conversion info messages to stdout.
- `partition_via_api` reflects the actual filetype for the file processed in the API.
0.7.2
Enhancements
- Adds an optional encoding kwarg to `elements_to_json` and `elements_from_json`
- Bump version of base image to use new stable version of tesseract
Features
Fixes
- Update the `read_txt_file` utility function to keep using `spooled_to_bytes_io_if_needed` for xml
- Add functionality to the `read_txt_file` utility function to handle file-like object from URL
- Remove the unused parameter `encoding` from `partition_pdf`
- Change auto.py to have a `None` default for encoding
- Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
- Adds benchmark test with test docs in example-docs
- Re-enable test_upload_label_studio_data_with_sdk
- File detection now detects code files as plain text
- Adds `tabulate` explicitly to dependencies
- Fixes an issue in `metadata.page_number` of pptx files
- Adds showing help if no parameters passed
0.7.1
Enhancements
Features
- Add `stage_for_weaviate` to stage `unstructured` outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
- Builds from the Unstructured base image, built off of Rocky Linux 8.7, which resolves almost all CVEs in the image.
Fixes
0.7.0
Enhancements
- Installing `detectron2` from source is no longer required when using the `local-inference` extra.
- Updates `.pptx` parsing to include text in tables.
Features
Fixes
- Fixes an issue in `_add_element_metadata` that caused all elements to have `page_number=1` in the element metadata.
- Adds `.log` as a file extension for TXT files.
- Adds functionality to try other common encodings for email (`.eml`) files if an error related to the encoding is raised and the user has not specified an encoding.
- Allow passed encoding to be used in the `replace_mime_encodings`
- Fixes page metadata for `partition_html` when `include_metadata=False`
- A `ValueError` now raises if `file_filename` is not specified when you use `partition_via_api` with a file-like object.
0.6.11
Enhancements
- Supports epub tests since pandoc is updated in base image
Features
Fixes
0.6.10
Enhancements
- XLS support from auto partition
Features
Fixes
0.6.9
Enhancements
- fast strategy for pdf now keeps element bounding box data
- setup.py refactor
Features
Fixes
- Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
- Adds additional MIME types for CSV
0.6.8
Enhancements
Features
- Add `partition_csv` for CSV files.
Fixes
0.6.7
Enhancements
- Deprecate `--s3-url` in favor of `--remote-url` in CLI
- Refactor out non-connector-specific config variables
- Add `file_directory` to metadata
- Add `page_name` to metadata. Currently used for the sheet name in XLSX documents.
- Added a `--partition-strategy` parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, `--partition-strategy fast`.
- Added metadata for filetype.
- Add Discord connector to pull messages from a list of channels
- Refactor `unstructured/file-utils/filetype.py` to better utilise hashmap to return mime type.
- Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for `test_filetype.py`.
Features
- Add `partition_xml` for XML files.
- Add `partition_xlsx` for Microsoft Excel documents.
Fixes
- Supports `hml` filetype for partition as a variation of html filetype.
- Makes `pytesseract` a function level import in `partition_pdf` so you can use the `"fast"` or `"hi_res"` strategies if `pytesseract` is not installed. Also adds the `required_dependencies` decorator for the `"hi_res"` and `"ocr_only"` strategies.
- Fix to ensure `filename` is tracked in metadata for `docx` tables.
0.6.6
Enhancements
- Adds an `"auto"` strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for `partition_pdf` and `partition_image`. Users can maintain existing behavior by explicitly setting `strategy="hi_res"`.
- Added an additional trace logger for NLP debugging.
- Add `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
- Cleanup the `filename` attribute on `ElementMetadata` to remove the full filepath.
Features
- Added table reading as html with URL parsing to `partition_docx` in docx
- Added metadata field for text_as_html for docx files
Fixes
- `fileutils/file_type` check json and eml decode ignore error
- `partition_email` was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns `None` if the time does not match RFC-2822 at all.
- Include all metadata fields when converting to dataframe or CSV
0.6.5
Enhancements
- Added support for SpooledTemporaryFile file argument.
Features
Fixes
0.6.4
Enhancements
- Added an "ocr_only" strategy for `partition_pdf`. Refactored the strategy decision logic into its own module.
Features
Fixes
0.6.3
Enhancements
- Add an "ocr_only" strategy for `partition_image`.
Features
- Added `partition_multiple_via_api` for partitioning multiple documents in a single REST API call.
- Added `stage_for_baseplate` function to prepare outputs for ingestion into Baseplate.
- Added `partition_odt` for processing Open Office documents.
Fixes
- Updates the grouping logic in the `partition_pdf` fast strategy to group together text in the same bounding box.
0.6.2
Enhancements
- Added logic to `partition_pdf` for detecting copy protected PDFs and falling back to the hi res strategy when necessary.
Features
- Add `partition_via_api` for partitioning documents through the hosted API.
Fixes
- Fix how `exceeds_cap_ratio` handles empty (returns `True` instead of `False`)
- Updates `detect_filetype` to properly detect JSONs when the MIME type is `text/plain`.
0.6.1
Enhancements
- Updated the table extraction parameter name to be more descriptive
Features
Fixes
0.6.0
Enhancements
- Adds an `ssl_verify` kwarg to `partition` and `partition_html` to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
- Allows users to pass in ocr language to `partition_pdf` and `partition_image` through the `ocr_language` kwarg. `ocr_language` corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.
Features
- Table extraction is now possible for pdfs from `partition` and `partition_pdf`.
- Adds support for extracting attachments from `.msg` files
Fixes
- Adds an `ssl_verify` kwarg to `partition` and `partition_html` to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
0.5.13
Enhancements
- Allow headers to be passed into `partition` when `url` is used.
Features
- `bytes_string_to_string` cleaning brick for bytes string output.
Fixes
- Fixed typo in call to `exactly_one` in `partition_json`
- unstructured-documents encode xml string if document_tree is `None` in `_read_xml`.
- Update to `_read_xml` so that Markdown files with embedded HTML process correctly.
- Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
- unstructured-partition-text_type exceeds_cap_ratio fix returns and how capitalization ratios are calculated
- `partition_pdf` and `partition_text` group broken paragraphs to avoid fragmented `NarrativeText` elements.
- .json files resolved as "application/json" on centos7 (or other installs with older libmagic libs)
0.5.12
Enhancements
- Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
- Use the image registry as a cache when building Docker images.
- Adds the ability for `partition_text` to group together broken paragraphs.
- Added method to utils to allow date time format validation
Features
- Add Slack connector to pull messages for a specific channel
- Add --partition-by-api parameter to unstructured-ingest
- Added `partition_rtf` for processing rich text files.
- `partition` now accepts a `url` kwarg in addition to `file` and `filename`.
Fixes
- Allow encoding to be passed into `replace_mime_encodings`.
- unstructured-ingest connector-specific dependencies are imported on demand.
- unstructured-ingest --flatten-metadata supported for local connector.
- unstructured-ingest fix runtime error when using --metadata-include.
0.5.11
Enhancements
Features
Fixes
- Guard against null style attribute in docx document elements
- Update HTML encoding to better support foreign language characters
0.5.10
Enhancements
- Updated inference package
- Add sender, recipient, date, and subject to element metadata for emails
Features
- Added `--download-only` parameter to `unstructured-ingest`
Fixes
- FileNotFound error when filename is provided but file is not on disk
0.5.9
Enhancements
Features
Fixes
- Convert file to str in helper `split_by_paragraph` for `partition_text`
0.5.8
Enhancements
- Update `elements_to_json` to return string when filename is not specified
- `elements_from_json` may take a string instead of a filename with the `text` kwarg
- `detect_filetype` now does a final fallback to file extension.
- Empty tags are now skipped during the depth check for HTML processing.
Features
- Add local file system to `unstructured-ingest`
- Add `--max-docs` parameter to `unstructured-ingest`
- Added `partition_msg` for processing MSFT Outlook .msg files.
Fixes
- `convert_file_to_text` now passes through the `source_format` and `target_format` kwargs. Previously they were hard coded.
- Partitioning functions that accept a `text` kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).
- `partition_json` no longer fails if the input is an empty list.
- Fixed bug in `chunk_by_attention_window` that caused the last word in segments to be cut-off in some cases.
BREAKING CHANGES
- `stage_for_transformers` now returns a list of elements, making it consistent with other staging bricks
0.5.7
Enhancements
- Refactored codebase using `exactly_one`
- Adds ability to pass headers when passing a url in partition_html()
- Added optional `content_type` and `file_filename` parameters to `partition()` to bypass file detection
Features
- Add `--flatten-metadata` parameter to `unstructured-ingest`
- Add `--fields-include` parameter to `unstructured-ingest`
Fixes
0.5.6
Enhancements
- `contains_english_word()`, used heavily in text processing, is 10x faster.
Features
- Add `--metadata-include` and `--metadata-exclude` parameters to `unstructured-ingest`
- Add `clean_non_ascii_chars` to remove non-ascii characters from unicode string
Fixes
- Fix problem with PDF partition (duplicated test)
0.5.4
Enhancements
- Added Biomedical literature connector for ingest cli.
- Add `FsspecConnector` to easily integrate any existing `fsspec` filesystem as a connector.
- Rename `s3_connector.py` to `s3.py` for readability and consistency with the rest of the connectors.
- Now `S3Connector` relies on `s3fs` instead of on `boto3`, and it inherits from `FsspecConnector`.
- Adds an `UNSTRUCTURED_LANGUAGE_CHECKS` environment variable to control whether or not language specific checks like vocabulary and POS tagging are applied. Set to `"true"` for higher resolution partitioning and `"false"` for faster processing.
- Improves `detect_filetype` warning to include filename when provided.
- Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
- Start deprecation life cycle for `unstructured-ingest --s3-url` option, to be deprecated in favor of `--remote-url`.
Features
- Add `AzureBlobStorageConnector` based on its `fsspec` implementation inheriting from `FsspecConnector`
- Add `partition_epub` for partitioning e-books in EPUB3 format.
Fixes
- Fixes processing for text files with `message/rfc822` MIME type.
- Open xml files in read-only mode when reading contents to construct an XMLDocument.
0.5.3
Enhancements
- `auto.partition()` can now load Unstructured ISD json documents.
- Simplify partitioning functions.
- Improve logging for ingest CLI.
Features
- Add `--wikipedia-auto-suggest` argument to the ingest CLI to disable automatic redirection to pages with similar names.
- Add setup script for Amazon Linux 2
- Add optional `encoding` argument to the `partition_(text/email/html)` functions.
- Added Google Drive connector for ingest cli.
- Added Gitlab connector for ingest cli.
Fixes
0.5.2
Enhancements
- Fully move from printing to logging.
- `unstructured-ingest` now uses a default `--download_dir` of `$HOME/.cache/unstructured/ingest` rather than a "tmp-ingest-" dir in the working directory.
Features
Fixes
- `setup_ubuntu.sh` no longer fails in some contexts by interpreting `DEBIAN_FRONTEND=noninteractive` as a command
- `unstructured-ingest` no longer re-downloads files when --preserve-downloads is used without --download-dir.
- Fixed an issue that was causing text to be skipped in some HTML documents.
0.5.1
Enhancements
Features
Fixes
- Fixes an error causing JavaScript to appear in the output of `partition_html` sometimes.
- Fix several issues with the `requires_dependencies` decorator, including the error message and how it was used, which had caused an error for `unstructured-ingest --github-url ...`.
0.5.0
Enhancements
- Add `requires_dependencies` Python decorator to check dependencies are installed before instantiating a class or running a function
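
The idea behind such a decorator can be sketched with the standard library alone. The code below is an illustration under assumptions, not the library's implementation: `requires_dependencies_sketch` is a hypothetical name, it only wraps functions, and it checks importability at call time with `importlib.util.find_spec`.

```python
import functools
import importlib.util
from typing import Any, Callable, Sequence


def requires_dependencies_sketch(dependencies: Sequence[str]) -> Callable:
    """Raise ImportError before running the function if a module is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # find_spec returns None when a top-level module cannot be located.
            missing = [
                name for name in dependencies
                if importlib.util.find_spec(name) is None
            ]
            if missing:
                raise ImportError(
                    f"Missing dependencies for {func.__name__}: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])
def uses_json() -> str:
    return "ran"


@requires_dependencies_sketch(["module_that_does_not_exist_xyz"])
def uses_missing_module() -> str:
    return "ran"
```

Calling `uses_json()` runs normally, while `uses_missing_module()` raises `ImportError` naming the missing module instead of failing later inside the function body.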
Features
- Added Wikipedia connector for ingest cli.
Fixes
- Fix `process_document` file cleaning on failure
- Fixes an error introduced in the metadata tracking commit that caused `NarrativeText` and `FigureCaption` elements to be represented as `Text` in HTML documents.
0.4.16
Enhancements
- Fallback to using file extensions for filetype detection if `libmagic` is not present
Features
- Added setup script for Ubuntu
- Added GitHub connector for ingest cli.
- Added `partition_md` partitioner.
- Added Reddit connector for ingest cli.
Fixes
- Initializes connector properly in ingest.main::MainProcess
- Restricts version of unstructured-inference to avoid multithreading issue
0.4.15
Enhancements
- Added `elements_to_json` and `elements_from_json` for easier serialization/deserialization
- `convert_to_dict`, `dict_to_elements` and `convert_to_csv` are now aliases for functions that use the ISD terminology.
Fixes
- Update to ensure all elements are preserved during serialization/deserialization
0.4.14
- Automatically install `nltk` models in the `tokenize` module.
0.4.13
- Fixes unstructured-ingest cli.
0.4.12
- Adds console_entrypoint for unstructured-ingest, other structure/doc updates related to ingest.
- Add `parser` parameter to `partition_html`.
0.4.11
- Adds `partition_doc` for partitioning Word documents in `.doc` format. Requires `libreoffice`.
- Adds `partition_ppt` for partitioning PowerPoint documents in `.ppt` format. Requires `libreoffice`.
0.4.10
- Fixes `ElementMetadata` so that it's JSON serializable when the filename is a `Path` object.
0.4.9
- Added ingest modules and s3 connector, sample ingest script
- Default to `url=None` for `partition_pdf` and `partition_image`
- Add ability to skip English specific check by setting the `UNSTRUCTURED_LANGUAGE` env var to `""`.
- Document `Element` objects now track metadata
0.4.8
- Modified XML and HTML parsers not to load comments.
0.4.7
- Added the ability to pull an HTML document from a url in `partition_html`.
- Added the ability to get file summary info from lists of filenames and lists of file contents.
- Added optional page break to `partition` for `.pptx`, `.pdf`, images, and `.html` files.
- Added `to_dict` method to document elements.
- Include more unicode quotes in `replace_unicode_quotes`.
0.4.6
- Loosen the default cap threshold to `0.5`.
- Add a `UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD` environment variable for controlling the cap ratio threshold.
- Unknown text elements are identified as `Text` for HTML and plain text documents.
- `Body Text` styles no longer default to `NarrativeText` for Word documents. The style information is insufficient to determine that the text is narrative.
- Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
- Adds an `Address` element for capturing elements that only contain an address.
- Suppress the `UserWarning` when detectron is called.
- Checks that titles and narrative text have at least one English word.
- Checks that titles and narrative text are at least 50% alpha characters.
- Restricts titles to a maximum word length. Adds a `UNSTRUCTURED_TITLE_MAX_WORD_LENGTH` environment variable for controlling the max number of words in a title.
- Updated `partition_pptx` to order the elements on the page
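
A capitalization-ratio check with an environment-variable override, as described in the first two entries above, might look roughly like the following. This is a sketch under assumptions: `exceeds_cap_ratio_sketch` is a hypothetical function, "capitalized" is taken to mean that a word's first letter is upper case, and text with no alphabetic words is treated as not exceeding the threshold. The library's own rules may differ.

```python
import os
from typing import Optional


def exceeds_cap_ratio_sketch(text: str, threshold: Optional[float] = None) -> bool:
    """Return True when the share of capitalized words is above the threshold."""
    if threshold is None:
        # Environment variable name taken from the entry above; 0.5 is the stated default.
        threshold = float(
            os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", "0.5")
        )
    words = [word for word in text.split() if word[0].isalpha()]
    if not words:
        # Assumption of this sketch: nothing to measure means "does not exceed".
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold
```

Text that exceeds the ratio reads more like a heading than a sentence, which is why a check of this kind is useful when deciding whether text is narrative.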
0.4.4
- Updated `partition_pdf` and `partition_image` to return `unstructured` `Element` objects
- Fixed the healthcheck url path when partitioning images and PDFs via API
- Adds an optional `coordinates` attribute to document objects
- Adds `FigureCaption` and `CheckBox` document elements
- Added ability to split lists detected in `LayoutElement` objects
- Adds `partition_pptx` for partitioning PowerPoint documents
- LayoutParser models now download from Hugging Face Hub instead of DropBox
- Fixed file type detection for XML and HTML files on Amazon Linux
0.4.3
- Adds `requests` as a base dependency
- Fix in `exceeds_cap_ratio` so the function doesn't break with empty text
- Fix bug in `_parse_received_data`.
- Update `detect_filetype` to properly handle `.doc`, `.xls`, and `.ppt`.
0.4.2
- Added `partition_image` to process documents in an image format.
- Fixed utf-8 encoding error in `partition_email` with attachments for `text/html`
0.4.1
- Added support for text files in the `partition` function
- Pinned `opencv-python` for easier installation on Linux
0.4.0
- Added generic `partition` brick that detects the file type and routes a file to the appropriate partitioning brick.
- Added a file type detection module.
- Updated `partition_html` and `partition_eml` to support file-like objects in 'rb' mode.
- Cleaning brick for removing ordered bullets `clean_ordered_bullets`.
- Extract brick method for ordered bullets `extract_ordered_bullets`.
- Test for `clean_ordered_bullets`.
- Test for `extract_ordered_bullets`.
- Added `partition_docx` for pre-processing Word Documents.
- Added new REGEX patterns to extract email header information
- Added new functions to extract header information `parse_received_data` and `partition_header`
- Added new function to parse plain text files `partition_text`
- Added new cleaners functions `extract_ip_address`, `extract_ip_address_name`, `extract_mapi_id`, `extract_datetimetz`
- Add new `Image` element and function to find embedded images `find_embedded_images`
- Added `get_directory_file_info` for summarizing information about source documents
0.3.5
- Add support for local inference
- Add new pattern to recognize plain text dash bullets
- Add test for bullet patterns
- Fix for `partition_html` that allows for processing `div` tags that have both text and child elements
- Add ability to extract document metadata from `.docx`, `.xlsx`, and `.jpg` files.
- Helper functions for identifying and extracting phone numbers
- Add new function `extract_attachment_info` that extracts and decodes the attachment of an email.
- Staging brick to convert a list of `Element`s to a `pandas` dataframe.
- Add plain text functionality to `partition_email`
0.3.4
- Python-3.7 compat
0.3.3
- Removes BasicConfig from logger configuration
- Adds the `partition_email` partitioning brick
- Adds the `replace_mime_encodings` cleaning bricks
- Small fix to HTML parsing related to processing list items with sub-tags
- Add `EmailElement` data structure to store email documents
0.3.2
- Added `translate_text` brick for translating text between languages
- Add an `apply` method to make it easier to apply cleaners to elements
0.3.1
- Added `__init__.py` to `partition`
0.3.0
- Implement staging brick for Argilla. Converts lists of `Text` elements to `argilla` dataset classes.
- Removing the local PDF parsing code and any dependencies and tests.
- Reorganizes the staging bricks in the unstructured.partition module
- Allow entities to be passed into the Datasaur staging brick
- Added HTML escapes to the `replace_unicode_quotes` brick
- Fix bad responses in partition_pdf to raise ValueError
- Adds `partition_html` for partitioning HTML documents.
0.2.6
- Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf
- Add partitioning brick for calling the document image analysis API
0.2.5
- Update python requirement to >=3.7
0.2.4
- Add alternative way of importing `Final` to support google colab
0.2.3
- Add cleaning bricks for removing prefixes and postfixes
- Add cleaning bricks for extracting text before and after a pattern
0.2.2
- Add staging brick for Datasaur
0.2.1
- Added brick to convert an ISD dictionary to a list of elements
- Update `PDFDocument` to use the `from_file` method
- Added staging brick for CSV format for ISD (Initial Structured Data) format.
- Added staging brick for separating text into attention window size chunks for `transformers`.
- Added staging brick for LabelBox.
- Added ability to upload LabelStudio predictions
- Added utility function for JSONL reading and writing
- Added staging brick for CSV format for Prodigy
- Added staging brick for Prodigy
- Added ability to upload LabelStudio annotations
- Added text_field and id_field to stage_for_label_studio signature
0.2.0
- Initial release of unstructured