haystack/tutorials/Tutorial8_Preprocessing.ipynb
Markus Sagen 69a0c9f2ed
Clarify docs for PDF conversion, languages and encodings (#1570)
* Clarify PDF conversion, languages and encodings

The parameter name `valid_languages` may be a bit miss-leading from
reading only the tutorials. Users may, incorrectly assume that it
enforces that the conversions only works for those languages, then it's
more of a check.

- Provided clarifications in the tutorials to highlight what
valid_languages does and that changing the encoding may give better
results for their language of choice
- Updated the command for `pdftotext` to the correct one

* Allow encodings for `convert_files_to_dicts`

- Set option of passing encoding to the converters. Trying even for some
Latin1 languages, the converter does not do it in a good way.

Potential issues is that the encoding defaults to None, which is default
for the other converters, but not for the PDFToTextConverter. Could add
a check and change the ending to Latin1 for pdf if set to None.

Was considering adding it to **kwargs, but since it may be a commonly
used feature to be documented, I added it as a keyword argument instead.
Would love to hear your input and feedback on in.

* Set back PDF default encoding

* Update documentation
2021-10-11 09:30:12 +02:00

524 lines
16 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Preprocessing\n",
"\n",
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial8_Preprocessing.ipynb)\n",
"\n",
"Haystack includes a suite of tools to extract text from different file types, normalize white space\n",
"and split text into smaller pieces to optimize retrieval.\n",
"These data preprocessing steps can have a big impact on the systems performance and effective handling of data is key to getting the most out of Haystack."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"Ultimately, Haystack expects data to be provided as a list documents in the following dictionary format:\n",
"``` python\n",
"docs = [\n",
" {\n",
" 'text': DOCUMENT_TEXT_HERE,\n",
" 'meta': {'name': DOCUMENT_NAME, ...}\n",
" }, ...\n",
"]\n",
"```"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"This tutorial will show you all the tools that Haystack provides to help you cast your data into this format."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 26,
"outputs": [],
"source": [
"# Let's start by installing Haystack\n",
"\n",
"# Install the latest release of Haystack in your own environment\n",
"#! pip install farm-haystack\n",
"\n",
"# Install the latest master of Haystack\n",
"!pip install grpcio-tools==1.34.1\n",
"!pip install git+https://github.com/deepset-ai/haystack.git\n",
"!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz\n",
"!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin\n",
"\n",
"# If you run this notebook on Google Colab, you might need to\n",
"# restart the runtime after installing haystack."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 2,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:49:14 - INFO - faiss - Loading faiss with AVX2 support.\n",
"01/06/2021 14:49:14 - INFO - faiss - Loading faiss.\n"
]
}
],
"source": [
"# Here are the imports we need\n",
"\n",
"from haystack.file_converter.txt import TextConverter\n",
"from haystack.file_converter.pdf import PDFToTextConverter\n",
"from haystack.file_converter.docx import DocxToTextConverter\n",
"\n",
"from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http\n",
"from haystack.preprocessor.preprocessor import PreProcessor"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 29,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/05/2021 12:02:30 - INFO - haystack.preprocessor.utils - Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip to `data/preprocessing_tutorial`\n",
"100%|██████████| 595119/595119 [00:00<00:00, 5299765.39B/s]\n"
]
},
{
"data": {
"text/plain": "True"
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This fetches some sample files to work with\n",
"\n",
"doc_dir = \"data/preprocessing_tutorial\"\n",
"s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial.zip\"\n",
"fetch_archive_from_http(url=s3_url, output_dir=doc_dir)"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Converters\n",
"\n",
"Haystack's converter classes are designed to help you turn files on your computer into the documents\n",
"that can be processed by the Haystack pipeline.\n",
"There are file converters for txt, pdf, docx files as well as a converter that is powered by Apache Tika.\n",
"The parameter `valid_langugages` does not convert files to the target language, but checks if the conversion worked as expected.\n",
"For converting PDFs, try changing the encoding to UTF-8 if the conversion isn't great."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 8,
"outputs": [],
"source": [
"# Here are some examples of how you would use file converters\n",
"\n",
"converter = TextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_txt = converter.convert(file_path=\"data/preprocessing_tutorial/classics.txt\", meta=None)\n",
"\n",
"converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_pdf = converter.convert(file_path=\"data/preprocessing_tutorial/bert.pdf\", meta=None)\n",
"\n",
"converter = DocxToTextConverter(remove_numeric_tables=True, valid_languages=[\"en\"])\n",
"doc_docx = converter.convert(file_path=\"data/preprocessing_tutorial/heavy_metal.docx\", meta=None)\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/heavy_metal.docx\n",
"01/06/2021 14:51:06 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/bert.pdf\n",
"01/06/2021 14:51:07 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/classics.txt\n"
]
}
],
"source": [
"# Haystack also has a convenience function that will automatically apply the right converter to each file in a directory.\n",
"\n",
"all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## PreProcessor\n",
"\n",
"The PreProcessor class is designed to help you clean text and split text into sensible units.\n",
"File splitting can have a very significant impact on the system's performance and is absolutely mandatory for Dense Passage Retrieval models.\n",
"In general, we recommend you split the text from your files into small documents of around 100 words for dense retrieval methods\n",
"and no more than 10,000 words for sparse methods.\n",
"Have a look at the [Preprocessing](https://haystack.deepset.ai/docs/latest/preprocessingmd)\n",
"and [Optimization](https://haystack.deepset.ai/docs/latest/optimizationmd) pages on our website for more details."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 10,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"n_docs_input: 1\n",
"n_docs_output: 51\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"# This is a default usage of the PreProcessor.\n",
"# Here, it performs cleaning of consecutive whitespaces\n",
"# and splits a single large document into smaller documents.\n",
"# Each document is up to 1000 words long and document breaks cannot fall in the middle of sentences\n",
"# Note how the single document passed into the document gets split into 5 smaller documents\n",
"\n",
"preprocessor = PreProcessor(\n",
" clean_empty_lines=True,\n",
" clean_whitespace=True,\n",
" clean_header_footer=False,\n",
" split_by=\"word\",\n",
" split_length=100,\n",
" split_respect_sentence_boundary=True\n",
")\n",
"docs_default = preprocessor.process(doc_txt)\n",
"print(f\"n_docs_input: 1\\nn_docs_output: {len(docs_default)}\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Cleaning\n",
"\n",
"- `clean_empty_lines` will normalize 3 or more consecutive empty lines to be just a two empty lines\n",
"- `clean_whitespace` will remove any whitespace at the beginning or end of each line in the text\n",
"- `clean_header_footer` will remove any long header or footer texts that are repeated on each page"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Splitting\n",
"By default, the PreProcessor will respect sentence boundaries, meaning that documents will not start or end\n",
"midway through a sentence.\n",
"This will help reduce the possibility of answer phrases being split between two documents.\n",
"This feature can be turned off by setting `split_respect_sentence_boundary=False`."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 11,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RESPECTING SENTENCE BOUNDARY\n",
"End of document: \"...cornerstone of a typical elite European education.\"\n",
"\n",
"NOT RESPECTING SENTENCE BOUNDARY\n",
"End of document: \"...on. In England, for instance, Oxford and Cambridge\"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
],
"source": [
"# Not respecting sentence boundary vs respecting sentence boundary\n",
"\n",
"preprocessor_nrsb = PreProcessor(split_respect_sentence_boundary=False)\n",
"docs_nrsb = preprocessor_nrsb.process(doc_txt)\n",
"\n",
"print(\"RESPECTING SENTENCE BOUNDARY\")\n",
"end_text = docs_default[0][\"text\"][-50:]\n",
"print(\"End of document: \\\"...\" + end_text + \"\\\"\")\n",
"print()\n",
"print(\"NOT RESPECTING SENTENCE BOUNDARY\")\n",
"end_text_nrsb = docs_nrsb[0][\"text\"][-50:]\n",
"print(\"End of document: \\\"...\" + end_text_nrsb + \"\\\"\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"A commonly used strategy to split long documents, especially in the field of Question Answering,\n",
"is the sliding window approach. If `split_length=10` and `split_overlap=3`, your documents will look like this:\n",
"\n",
"- doc1 = words[0:10]\n",
"- doc2 = words[7:17]\n",
"- doc3 = words[14:24]\n",
"- ...\n",
"\n",
"You can use this strategy by following the code below."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"source": [
"# Sliding window approach\n",
"\n",
"preprocessor_sliding_window = PreProcessor(\n",
" split_overlap=3,\n",
" split_length=10,\n",
" split_respect_sentence_boundary=False\n",
")\n",
"docs_sliding_window = preprocessor_sliding_window.process(doc_txt)\n",
"\n",
"doc1 = docs_sliding_window[0][\"text\"][:200]\n",
"doc2 = docs_sliding_window[1][\"text\"][:100]\n",
"doc3 = docs_sliding_window[2][\"text\"][:100]\n",
"\n",
"print(\"Document 1: \\\"\" + doc1 + \"...\\\"\")\n",
"print(\"Document 2: \\\"\" + doc2 + \"...\\\"\")\n",
"print(\"Document 3: \\\"\" + doc3 + \"...\\\"\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
},
"execution_count": 12,
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Document 1: \"Classics or classical studies is the study of classical antiquity,...\"\n",
"Document 2: \"of classical antiquity, and in the Western world traditionally refers...\"\n",
"Document 3: \"world traditionally refers to the study of Classical Greek and...\"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Bringing it all together"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": 16,
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/heavy_metal.docx\n",
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/bert.pdf\n",
"01/06/2021 14:56:12 - INFO - haystack.preprocessor.utils - Converting data/preprocessing_tutorial/classics.txt\n",
"[nltk_data] Downloading package punkt to /home/branden/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"n_files_input: 3\n",
"n_docs_output: 150\n"
]
}
],
"source": [
"all_docs = convert_files_to_dicts(dir_path=\"data/preprocessing_tutorial\")\n",
"preprocessor = PreProcessor(\n",
" clean_empty_lines=True,\n",
" clean_whitespace=True,\n",
" clean_header_footer=False,\n",
" split_by=\"word\",\n",
" split_length=100,\n",
" split_respect_sentence_boundary=True\n",
")\n",
"docs = preprocessor.process(all_docs)\n",
"\n",
"print(f\"n_files_input: {len(all_docs)}\\nn_docs_output: {len(docs)}\")"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## About us\n",
"\n",
"This [Haystack](https://github.com/deepset-ai/haystack/) notebook was made with love by [deepset](https://deepset.ai/) in Berlin, Germany\n",
"\n",
"We bring NLP to the industry via open source! \n",
"Our focus: Industry specific language models & large scale QA systems. \n",
" \n",
"Some of our other work: \n",
"- [German BERT](https://deepset.ai/german-bert)\n",
"- [GermanQuAD and GermanDPR](https://deepset.ai/germanquad)\n",
"- [FARM](https://github.com/deepset-ai/FARM)\n",
"\n",
"Get in touch:\n",
"[Twitter](https://twitter.com/deepset_ai) | [LinkedIn](https://www.linkedin.com/company/deepset-ai/) | [Slack](https://haystack.deepset.ai/community/join) | [GitHub Discussions](https://github.com/deepset-ai/haystack/discussions) | [Website](https://deepset.ai)\n",
"\n",
"By the way: [we're hiring!](https://apply.workable.com/deepset/) \n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}