unstructured/examples/training/2-File Exploration.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "9cf9f373",
"metadata": {},
"source": [
"# File Exploration\n",
"\n",
"In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
"\n",
"- [Filetype detection in `unstructured`](#filetype)\n",
"- [How to generate summary statistics about documents](#summary)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "59392a21",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Resolve the example documents relative to the notebook's working directory.\n",
"DIRECTORY = os.path.abspath(\"\")\n",
"EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
]
},
{
"cell_type": "markdown",
"id": "9a6ce38f",
"metadata": {},
"source": [
"## Filetype Detection <a id=\"filetype\"></a>\n",
"\n",
"The `unstructured` library includes a `detect_filetype` function that determines the type of an input file. Filetype detection relies on `libmagic`, so you will need to install that library before using these utilities. In addition to the MIME type reported by `libmagic`, `unstructured` uses the file extension and, in some cases, inspects the document's contents to determine the document type. The following example shows how to call `detect_filetype`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c6bd2f4a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<FileType.HTML: 50>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from unstructured.file_utils.filetype import detect_filetype\n",
"\n",
"filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n",
"detect_filetype(filename=filename)"
]
},
{
"cell_type": "markdown",
"id": "c68e6a41",
"metadata": {},
"source": [
"The output of `detect_filetype` is a member of the `FileType` enum, which is defined in [`filetype.py`](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py). Check out the source file for the full list of file types that `detect_filetype` supports."
]
},
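{
"cell_type": "markdown",
"id": "sketch-ext-md",
"metadata": {},
"source": [
"The extension-based part of this detection can be illustrated with Python's standard `mimetypes` module. The cell below is only a sketch under that assumption: it is not how `unstructured` implements `detect_filetype`, and the helper name `guess_mime_from_extension` is hypothetical. Because it looks only at the filename, it cannot catch a file whose extension does not match its contents, which is the gap that `libmagic` and content inspection are meant to cover."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-ext-code",
"metadata": {},
"outputs": [],
"source": [
"import mimetypes\n",
"\n",
"\n",
"def guess_mime_from_extension(filename):\n",
"    \"\"\"Hypothetical helper: guess a MIME type from the file extension alone.\"\"\"\n",
"    mime_type, _encoding = mimetypes.guess_type(filename)\n",
"    return mime_type\n",
"\n",
"\n",
"# The file is never opened, so the names below do not need to exist on disk.\n",
"for name in [\"example-10k.html\", \"fake-text.txt\", \"layout-parser-paper.pdf\"]:\n",
"    print(name, \"->\", guess_mime_from_extension(name))"
]
},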
{
"cell_type": "markdown",
"id": "8b7dc938",
"metadata": {},
"source": [
"## Summary Statistics <a id=\"summary\"></a>\n",
"\n",
"`unstructured` also provides utilities for summarizing the filetypes in a directory. You can use them for tasks such as counting files by filetype and checking the average file size for each filetype. The following example shows how to count the files in a directory by filetype and plot the results as a bar chart."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c53f054e",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
]
},
{
"data": {
"text/plain": [
"FileType.EML 4\n",
"FileType.TXT 3\n",
"FileType.HTML 2\n",
"FileType.XML 2\n",
"FileType.PDF 2\n",
"FileType.JPG 2\n",
"FileType.UNK 1\n",
"FileType.DOCX 1\n",
"FileType.PPTX 1\n",
"FileType.XLSX 1\n",
"Name: filetype, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from unstructured.file_utils.exploration import get_directory_file_info\n",
"\n",
"file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n",
"file_info.filetype.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "7e1b3300",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<AxesSubplot: >"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"file_info.filetype.value_counts().plot.bar()"
]
},
{
"cell_type": "markdown",
"id": "c756d12e",
"metadata": {},
"source": [
"You can also use this utility to find the average file size of documents in a directory, grouped by filetype."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a600fb0f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>filesize</th>\n",
" </tr>\n",
" <tr>\n",
" <th>filetype</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>FileType.DOCX</th>\n",
" <td>36602.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.EML</th>\n",
" <td>149088.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.HTML</th>\n",
" <td>1228404.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.JPG</th>\n",
" <td>64002.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.PDF</th>\n",
" <td>2429245.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.PPTX</th>\n",
" <td>38412.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.TXT</th>\n",
" <td>619.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.UNK</th>\n",
" <td>1102.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.XLSX</th>\n",
" <td>4765.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>FileType.XML</th>\n",
" <td>713.5</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" filesize\n",
"filetype \n",
"FileType.DOCX 36602.0\n",
"FileType.EML 149088.5\n",
"FileType.HTML 1228404.0\n",
"FileType.JPG 64002.5\n",
"FileType.PDF 2429245.0\n",
"FileType.PPTX 38412.0\n",
"FileType.TXT 619.0\n",
"FileType.UNK 1102.0\n",
"FileType.XLSX 4765.0\n",
"FileType.XML 713.5"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_info.groupby(\"filetype\").mean(numeric_only=True)"
]
},
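{
"cell_type": "markdown",
"id": "sketch-summary-md",
"metadata": {},
"source": [
"To make the grouping above concrete, the following cell sketches a similar summary using only `os` and `pandas`. It is an illustration, not the implementation of `get_directory_file_info`: the helper `summarize_by_extension` is hypothetical, and it groups by raw file extension rather than by detected `FileType`, so its categories will not line up exactly with the table above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-summary-code",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import pandas as pd\n",
"\n",
"\n",
"def summarize_by_extension(directory):\n",
"    \"\"\"Hypothetical helper: count files and average their size per extension.\"\"\"\n",
"    rows = []\n",
"    for entry in os.scandir(directory):\n",
"        # Skip subdirectories; only regular files contribute to the summary.\n",
"        if entry.is_file():\n",
"            extension = os.path.splitext(entry.name)[1].lower() or \"(none)\"\n",
"            rows.append({\"extension\": extension, \"filesize\": entry.stat().st_size})\n",
"    info = pd.DataFrame(rows, columns=[\"extension\", \"filesize\"])\n",
"    return info.groupby(\"extension\")[\"filesize\"].agg([\"count\", \"mean\"])\n",
"\n",
"\n",
"summarize_by_extension(EXAMPLE_DOCS_DIRECTORY)"
]
},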
{
"cell_type": "markdown",
"id": "870ede91",
"metadata": {},
"source": [
"If you want to pass in a list of filenames instead of a directory, use `get_file_info` instead of `get_directory_file_info`, as seen in the workflow below:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e5e3a24d",
"metadata": {},
"outputs": [],
"source": [
"from unstructured.file_utils.exploration import get_file_info\n",
"\n",
"filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d8e59472",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n",
" '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"filenames"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "cb0add28",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
]
},
{
"data": {
"text/plain": [
"FileType.EML 4\n",
"FileType.TXT 3\n",
"FileType.HTML 2\n",
"FileType.XML 2\n",
"FileType.PDF 2\n",
"FileType.JPG 2\n",
"FileType.UNK 1\n",
"FileType.DOCX 1\n",
"FileType.PPTX 1\n",
"FileType.XLSX 1\n",
"Name: filetype, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"file_info = get_file_info(filenames=filenames)\n",
"file_info.filetype.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ac4473b0",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}