Mirror of https://github.com/deepset-ai/haystack.git (synced 2025-11-04 03:39:31 +00:00)

Updated the example code in readme for Indexing PDF / Docx files (#502)

* Updated the example code to Indexing PDF / Docx files

  The example code referenced the module haystack.indexing, which no longer exists. Updated the import path and replaced the function "extract_pages" with "convert".

* Update converter example in readme

Co-authored-by: Malte Pietsch <malte.pietsch@deepset.ai>

parent 3434d5205d
commit dc16258dab

README.rst (22 lines changed)

@@ -284,15 +284,15 @@ Example:
 .. code-block:: python
 
     #PDF
-    from haystack.indexing.file_converters.pdf import PDFToTextConverter
-    converter = PDFToTextConverter(remove_header_footer=True, remove_numeric_tables=True, valid_languages=["de","en"])
-    pages = converter.extract_pages(file_path=file)
-    # => list of str, one per page
+    from haystack.file_converter.pdf import PDFToTextConverter
+    converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
+    doc = converter.convert(file_path=file, meta=None)
+    # => {"text": "text first page \f text second page ...", "meta": None}
     #DOCX
-    from haystack.indexing.file_converters.docx import DocxToTextConverter
-    converter = DocxToTextConverter()
-    paragraphs = converter.extract_pages(file_path=file)
-    #  => list of str, one per paragraph (as docx has no direct notion of pages)
+    from haystack.file_converter.docx import DocxToTextConverter
+    converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=["de","en"])
+    doc = converter.convert(file_path=file, meta=None)
+    # => {"text": "some text", "meta": None}
 
 Advanced document conversion is enabled by the mature text extraction library `Apache Tika <https://tika.apache.org/>`_, which is mostly written in Java. Although it is possible to call the Tika API from Python, the current :code:`TikaConverter` only supports RESTful calls to a Tika server running on localhost. You can either run Tika as a REST service on port 9998 (the default) or start a `docker container for Tika <https://hub.docker.com/r/apache/tika/tags>`_. The latter is recommended, as it scales easily and does not require setting up a Java runtime environment. Future updates are also handled through Docker.
 Either way, TikaConverter makes RESTful calls to convert any document format supported by Tika. Example code can be found in the :code:`tika_convert_files_to_dicts` function in :code:`indexing/file_converters/utils.py`:
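
Note on the changed return shape: the old examples returned a list of strings (one per page or paragraph), while the new ``convert`` examples return a single dict whose ``text`` joins pages with a form feed (``\f``). Callers that relied on the per-page list might recover it along the following lines. This is only a sketch: ``split_pages`` is a hypothetical helper and not part of haystack, and it assumes pages are separated by a single form-feed character, as the example comment in the diff suggests.

```python
from typing import Any, Dict, List, Optional


def split_pages(doc: Dict[str, Any]) -> List[str]:
    """Split a converter result dict into per-page strings.

    Hypothetical helper. Assumes the shape shown in the README example:
    a "text" entry whose pages are joined by a form-feed character.
    """
    text: Optional[str] = doc.get("text")
    if not text:
        return []
    # A form feed marks a page boundary; strip the padding around each page.
    return [page.strip() for page in text.split("\f")]


# Stand-in for the dict that converter.convert(...) is documented to return.
doc = {"text": "text first page \f text second page", "meta": None}
pages = split_pages(doc)
print(pages)
```

Keeping the split outside the converter leaves the new single-document return value untouched for code that indexes whole documents.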
@@ -312,9 +312,9 @@ If you feel adventurous, Tika even supports some image OCR with Tesseract, or ob
 
 .. code-block:: python
 
-    converter = TikaConverter(remove_header_footer=True)
-    pages = converter.extract_pages(file_path=path)
-    pages, meta = converter.extract_pages(file_path=path, return_meta=True)
+    converter = TikaConverter(tika_url="http://localhost:9998/tika")
+    doc = converter.convert(file_path=path)
+    # => {"text": "text first page \f text second page ...", "meta": {"Content-Type": 'application/pdf', "Last-Modified":...}}
 
 Contributing
 =============
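
For readers unfamiliar with what a REST call to a Tika server involves, the following standard-library sketch may help. It does not show how ``TikaConverter`` is implemented, which this diff does not reveal. It assumes the Tika server's ``PUT /tika`` endpoint with an ``Accept: text/plain`` header and the default URL from the example above; ``build_tika_request`` and ``extract_text`` are hypothetical names.

```python
import urllib.request

# Default endpoint taken from the README example; adjust for your own server.
DEFAULT_TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = DEFAULT_TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        tika_url,
        data=data,                         # raw bytes of the document
        method="PUT",
        headers={"Accept": "text/plain"},  # request plain-text extraction
    )


def extract_text(path: str, tika_url: str = DEFAULT_TIKA_URL) -> str:
    """Send a file to a running Tika server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), tika_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

``extract_text`` needs a reachable Tika server, for example the docker container mentioned above; ``build_tika_request`` can be inspected without one.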