unstructured/examples/training/2-File Exploration.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9cf9f373",
   "metadata": {},
   "source": [
    "# File Exploration\n",
    "\n",
    "In addition to core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. We will cover how to use these utilities in this notebook. At the conclusion of this notebook, you should understand:\n",
    "\n",
    "- [Filetype detection in `unstructured`](#filetype)\n",
    "- [How to generate summary statistics about documents](#summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "59392a21",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "DIRECTORY = os.path.abspath(\"\")\n",
    "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a6ce38f",
   "metadata": {},
   "source": [
    "## Filetype Detection <a id=\"filetype\"></a>\n",
    "\n",
    "The `unstructured` library includes a `detect_filetype` function that detects the type of an input file. Filetype detection requires the `libmagic` library, which `unstructured` uses under the hood, so you will need to install it first. In addition to the MIME type from `libmagic`, `unstructured` uses the file extension and, in some cases, inspects the document's contents to determine the document type. The following is an example of how to call `detect_filetype`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c6bd2f4a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<FileType.HTML: 50>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.file_utils.filetype import detect_filetype\n",
    "\n",
    "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n",
    "detect_filetype(filename=filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c68e6a41",
   "metadata": {},
   "source": [
    "The output of `detect_filetype` is a member of the `FileType` enum, which is defined in [`filetype.py`](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py). Check out the source file to see the full list of file types that `detect_filetype` supports."
   ]
  },
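  {
   "cell_type": "markdown",
   "id": "3f7a1c2e",
   "metadata": {},
   "source": [
    "Because the return value is an enum member, you can compare it directly against `FileType` members to decide how to handle a file. The cell below is a minimal sketch of that pattern, not part of the library: it assumes `FileType` can be imported from the same module as `detect_filetype`, and `describe_filetype` is a hypothetical helper name introduced only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2d94e61",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.file_utils.filetype import FileType, detect_filetype\n",
    "\n",
    "\n",
    "def describe_filetype(path):\n",
    "    \"\"\"Hypothetical helper: return a short label for the detected filetype.\"\"\"\n",
    "    filetype = detect_filetype(filename=path)\n",
    "    # Treat a missing result the same way as an unknown filetype.\n",
    "    if filetype is None or filetype == FileType.UNK:\n",
    "        return \"unrecognized filetype\"\n",
    "    if filetype == FileType.HTML:\n",
    "        return \"HTML document\"\n",
    "    # Fall back to the enum member's name, such as PDF or DOCX.\n",
    "    return filetype.name\n",
    "\n",
    "\n",
    "describe_filetype(os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\"))"
   ]
  },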
  {
   "cell_type": "markdown",
   "id": "8b7dc938",
   "metadata": {},
   "source": [
    "## Summary Statistics <a id=\"summary\"></a>\n",
    "\n",
    "`unstructured` also provides utilities for summarizing the filetypes in a directory. You can use these utilities for tasks such as counting files by filetype and checking the average file size for each filetype. The following example shows how to count the files of each filetype in a directory and plot the results as a bar chart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c53f054e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "FileType.EML     4\n",
       "FileType.TXT     3\n",
       "FileType.HTML    2\n",
       "FileType.XML     2\n",
       "FileType.PDF     2\n",
       "FileType.JPG     2\n",
       "FileType.UNK     1\n",
       "FileType.DOCX    1\n",
       "FileType.PPTX    1\n",
       "FileType.XLSX    1\n",
       "Name: filetype, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.file_utils.exploration import get_directory_file_info\n",
    "\n",
    "file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n",
    "file_info.filetype.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7e1b3300",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot: >"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "file_info.filetype.value_counts().plot.bar()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c756d12e",
   "metadata": {},
   "source": [
    "You can also use this utility to find the average file size of documents in a directory, grouped by filetype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a600fb0f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>filesize</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>filetype</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>FileType.DOCX</th>\n",
       "      <td>36602.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.EML</th>\n",
       "      <td>149088.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.HTML</th>\n",
       "      <td>1228404.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.JPG</th>\n",
       "      <td>64002.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.PDF</th>\n",
       "      <td>2429245.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.PPTX</th>\n",
       "      <td>38412.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.TXT</th>\n",
       "      <td>619.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.UNK</th>\n",
       "      <td>1102.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.XLSX</th>\n",
       "      <td>4765.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>FileType.XML</th>\n",
       "      <td>713.5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                filesize\n",
       "filetype                \n",
       "FileType.DOCX    36602.0\n",
       "FileType.EML    149088.5\n",
       "FileType.HTML  1228404.0\n",
       "FileType.JPG     64002.5\n",
       "FileType.PDF   2429245.0\n",
       "FileType.PPTX    38412.0\n",
       "FileType.TXT       619.0\n",
       "FileType.UNK      1102.0\n",
       "FileType.XLSX     4765.0\n",
       "FileType.XML       713.5"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_info.groupby(\"filetype\").mean(numeric_only=True)"
   ]
  },
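  {
   "cell_type": "markdown",
   "id": "5c8e07ad",
   "metadata": {},
   "source": [
    "The mean alone can hide how many files contribute to each group. As a sketch using standard `pandas` aggregation, the cell below reports the count, mean, and maximum size for each filetype and adds kilobyte columns for readability. It relies only on the `filetype` and `filesize` columns shown above; the `size_summary`, `mean_kb`, and `max_kb` names are illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e41b7f90",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count, mean, and maximum file size (in bytes) for each filetype.\n",
    "size_summary = file_info.groupby(\"filetype\")[\"filesize\"].agg([\"count\", \"mean\", \"max\"])\n",
    "\n",
    "# Add kilobyte versions of the byte-based columns.\n",
    "size_summary[\"mean_kb\"] = size_summary[\"mean\"] / 1024\n",
    "size_summary[\"max_kb\"] = size_summary[\"max\"] / 1024\n",
    "\n",
    "# Show the filetypes with the largest average size first.\n",
    "size_summary.sort_values(\"mean_kb\", ascending=False)"
   ]
  },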
  {
   "cell_type": "markdown",
   "id": "870ede91",
   "metadata": {},
   "source": [
    "If you want to pass in a list of filenames instead of a directory, use `get_file_info` instead of `get_directory_file_info`, as seen in the workflow below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e5e3a24d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.file_utils.exploration import get_file_info\n",
    "\n",
    "filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d8e59472",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filenames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cb0add28",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "FileType.EML     4\n",
       "FileType.TXT     3\n",
       "FileType.HTML    2\n",
       "FileType.XML     2\n",
       "FileType.PDF     2\n",
       "FileType.JPG     2\n",
       "FileType.UNK     1\n",
       "FileType.DOCX    1\n",
       "FileType.PPTX    1\n",
       "FileType.XLSX    1\n",
       "Name: filetype, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_info = get_file_info(filenames=filenames)\n",
    "file_info.filetype.value_counts()"
   ]
  },
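  {
   "cell_type": "markdown",
   "id": "7d20b6c4",
   "metadata": {},
   "source": [
    "A common follow-up is to look at the files that could not be classified. The sketch below filters the summary to the rows whose `filetype` is `FileType.UNK`. It assumes that `FileType` can be imported from `unstructured.file_utils.filetype` and that the `filetype` column holds enum members, as the output above suggests; `unknown_files` is an illustrative name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91ac3e5f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.file_utils.filetype import FileType\n",
    "\n",
    "# Keep only the rows whose detected filetype is unknown.\n",
    "unknown_files = file_info[file_info.filetype == FileType.UNK]\n",
    "unknown_files"
   ]
  },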
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac4473b0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}