Metadata
========

The ``unstructured`` package tracks a variety of metadata about the elements it extracts from documents.
Tracking metadata lets users filter document elements downstream based on the metadata fields of interest.
For example, a user may want to select only the elements from a given page number
or from an e-mail with a given subject line.

Metadata is tracked at the element level. You can access the metadata for a given document element
with ``element.metadata``. For a dictionary representation, use ``element.metadata.to_dict()``.
All document types return the following metadata fields when the information is available from
the source file:

* ``filename``
* ``file_directory``
* ``date``
* ``filetype``
* ``page_number``

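
The downstream filtering described above can be sketched as follows. This is a minimal
illustration, not part of the library: ``StubMetadata``, ``StubElement``, and
``filter_by_metadata`` are hypothetical names, and the stub classes only imitate the
``element.metadata.to_dict()`` access pattern so that the example runs on its own.
It also assumes that fields without a value are absent from the dictionary.

.. code:: python

  from dataclasses import asdict, dataclass
  from typing import Optional


  @dataclass
  class StubMetadata:
      """Hypothetical stand-in for the library's metadata object."""

      filename: Optional[str] = None
      page_number: Optional[int] = None

      def to_dict(self):
          # Assumption: fields without a value are left out of the dictionary.
          return {k: v for k, v in asdict(self).items() if v is not None}


  @dataclass
  class StubElement:
      """Hypothetical stand-in for a document element."""

      text: str
      metadata: StubMetadata


  def filter_by_metadata(elements, **criteria):
      """Keep elements whose metadata matches every key/value pair in criteria."""
      selected = []
      for element in elements:
          meta = element.metadata.to_dict()
          if all(meta.get(key) == value for key, value in criteria.items()):
              selected.append(element)
      return selected


  elements = [
      StubElement("Introduction", StubMetadata("report.pdf", 1)),
      StubElement("Results", StubMetadata("report.pdf", 2)),
      StubElement("Untracked note", StubMetadata()),
  ]

  # Select only the elements that came from page 2.
  page_two = filter_by_metadata(elements, page_number=2)
  print([element.text for element in page_two])

Because the helper relies only on ``element.metadata.to_dict()``, the same function should
apply to real elements, provided the field names match those listed above.
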
Email
-----

Emails will include ``sent_from``, ``sent_to``, and ``subject`` metadata.
``sent_from`` is a list of strings because the `RFC 822 <https://www.rfc-editor.org/rfc/rfc822>`_
specification allows an email to have multiple sender addresses.

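
Since ``sent_from`` is a list, checking for a sender is a membership test, not a
string comparison. The sketch below works on a hand-written dictionary shaped like the
fields named above. The ``sent_by`` helper, the addresses, and the subject are made up
for illustration, and the exact format of each ``sent_from`` entry (bare address or
display name plus address) is an assumption, which is why it uses a substring match.

.. code:: python

  def sent_by(metadata, address):
      """Return True if ``address`` appears in any entry of the ``sent_from`` list."""
      senders = metadata.get("sent_from") or []
      needle = address.lower()
      # Substring match so that "Name <addr>" style entries are also found.
      return any(needle in sender.lower() for sender in senders)


  # Hand-written example dictionary; all values are invented.
  email_metadata = {
      "sent_from": ["Alice Example <alice@example.com>", "bob@example.com"],
      "sent_to": ["carol@example.com"],
      "subject": "Quarterly report",
  }

  print(sent_by(email_metadata, "bob@example.com"))
  print(sent_by(email_metadata, "dave@example.com"))
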
Microsoft Excel Documents
--------------------------

For Excel documents, ``ElementMetadata`` will contain a ``page_name`` field, which corresponds
to the sheet name in the Excel document.


Microsoft Word Documents
-------------------------

Headers and footers in Word documents include a ``header_footer_type`` field indicating which pages
the header or footer applies to. Valid values are ``"primary"``, ``"even_only"``, and ``"first_page"``.


Webpages
---------

Elements from webpages will include a ``url`` metadata field containing the URL of the webpage.

#########################
Advanced Metadata Options
#########################

Extract Metadata with Regexes
------------------------------

``unstructured`` allows users to extract additional metadata with regexes using the ``regex_metadata`` kwarg.
The kwarg takes a dictionary that maps a metadata field name to a regex pattern.
Here is an example:

.. code:: python

  from unstructured.partition.text import partition_text

  text = "SPEAKER 1: It is my turn to speak now!"

  # Map the field name "speaker" to the pattern that should be matched.
  elements = partition_text(text=text, regex_metadata={"speaker": r"SPEAKER \d{1,3}:"})

  # The matches are stored on the element's metadata.
  elements[0].metadata.regex_metadata

The result will look like:

.. code:: python

  {'speaker':
    [
      {
        'text': 'SPEAKER 1:',
        'start': 0,
        'end': 10,
      }
    ]
  }

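
To make the ``text``, ``start``, and ``end`` fields concrete, the following sketch
reproduces the same structure with Python's standard ``re`` module. It is not the
library's implementation, and ``regex_matches`` is a hypothetical helper name. It
assumes, consistent with the output above, that ``start`` and ``end`` are character
offsets into the element text with an exclusive ``end``.

.. code:: python

  import re


  def regex_matches(text, patterns):
      """Collect the matched text and character offsets for each named pattern."""
      results = {}
      for name, pattern in patterns.items():
          results[name] = [
              {"text": m.group(), "start": m.start(), "end": m.end()}
              for m in re.finditer(pattern, text)
          ]
      return results


  text = "SPEAKER 1: It is my turn to speak now!"
  matches = regex_matches(text, {"speaker": r"SPEAKER \d{1,3}:"})
  print(matches)

  # The offsets can be used to slice the matched span back out of the text.
  first = matches["speaker"][0]
  print(text[first["start"]:first["end"]])

Under that assumption, slicing the element text with the stored offsets returns the
same string as the ``text`` field.
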