{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "9cf9f373",
   "metadata": {},
   "source": [
    "# File Exploration\n",
    "\n",
    "In addition to its core document processing capabilities, the `unstructured` library includes utilities for summarizing information about raw documents. This notebook covers how to use these utilities. By the end of the notebook, you should understand:\n",
    "\n",
    "- [Filetype detection in `unstructured`](#filetype)\n",
    "- [How to generate summary statistics about documents](#summary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "59392a21",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pathlib\n",
    "\n",
    "DIRECTORY = os.path.abspath(\"\")\n",
    "EXAMPLE_DOCS_DIRECTORY = os.path.join(DIRECTORY, \"..\", \"..\", \"example-docs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a6ce38f",
   "metadata": {},
   "source": [
    "## Filetype Detection <a id=\"filetype\"></a>\n",
    "\n",
    "The `unstructured` library includes a `detect_filetype` function that detects the type of an input file. To use the filetype detection utilities, you need to install the `libmagic` library, which `unstructured` uses under the hood for filetype detection. In addition to the MIME type from `libmagic`, `unstructured` uses the file extension and, in some cases, inspects the document contents to determine the document type. The following example shows how to call `detect_filetype`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "c6bd2f4a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<FileType.HTML: 50>"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.file_utils.filetype import detect_filetype\n",
    "\n",
    "filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, \"example-10k.html\")\n",
    "detect_filetype(filename=filename)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c68e6a41",
   "metadata": {},
   "source": [
    "The output of `detect_filetype` is a member of the `FileType` enum, which is defined in [`filetype.py`](https://github.com/Unstructured-IO/unstructured/blob/main/unstructured/file_utils/filetype.py). Check out the source file to see the full list of file types that `detect_filetype` supports."
   ]
  },
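  {
   "cell_type": "markdown",
   "id": "b3e1f7a0",
   "metadata": {},
   "source": [
    "As a rough illustration of the extension-based part of this process, the sketch below uses only Python's standard-library `mimetypes` module. It is not how `unstructured` implements `detect_filetype`: the helper name `guess_by_extension` is hypothetical, and it looks only at the filename, never at the file contents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d42c9e15",
   "metadata": {},
   "outputs": [],
   "source": [
    "import mimetypes\n",
    "\n",
    "\n",
    "def guess_by_extension(filename):\n",
    "    \"\"\"Return the MIME type implied by the file extension, or None if unknown.\"\"\"\n",
    "    # guess_type returns a (type, encoding) pair; only the type is needed here.\n",
    "    mime_type, _encoding = mimetypes.guess_type(filename)\n",
    "    return mime_type\n",
    "\n",
    "\n",
    "# Hypothetical helper for illustration, not part of unstructured.\n",
    "guess_by_extension(\"example-10k.html\")"
   ]
  },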
  {
   "cell_type": "markdown",
   "id": "8b7dc938",
   "metadata": {},
   "source": [
    "## Summary Statistics <a id=\"summary\"></a>\n",
    "\n",
    "`unstructured` also provides utilities for summarizing the filetypes in a directory. You can use these utilities for tasks such as counting files by filetype and checking the average file size by filetype. The following example shows how to count the files in a directory by filetype and plot the results as a bar chart."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c53f054e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "FileType.EML 4\n",
       "FileType.TXT 3\n",
       "FileType.HTML 2\n",
       "FileType.XML 2\n",
       "FileType.PDF 2\n",
       "FileType.JPG 2\n",
       "FileType.UNK 1\n",
       "FileType.DOCX 1\n",
       "FileType.PPTX 1\n",
       "FileType.XLSX 1\n",
       "Name: filetype, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from unstructured.file_utils.exploration import get_directory_file_info\n",
    "\n",
    "file_info = get_directory_file_info(EXAMPLE_DOCS_DIRECTORY)\n",
    "file_info.filetype.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7e1b3300",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot: >"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAHzCAYAAADy/B0DAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/P9b71AAAACXBIWXMAAA9hAAAPYQGoP6dpAABDnElEQVR4nO3dd3gVZf7//9cJ5QAhCUYgoQQE6b1YCKIUI4hZMKt4IbB0WAt8hEVRUJRVhLgqgi5IUQGBpahL8cuiGEFsQaWFImtBgaAkwQIJNSC5f3/wI0sg5ZwAuc9Mno/rmj/OFM77vhjmvLhn7ns8xhgjAAAAS4JsFwAAAIo3wggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCppuwBfZGVl6cCBAwoJCZHH47FdDgAA8IExRkeOHFHVqlUVFJR3/4cjwsiBAwcUFRVluwwAAFAI+/fvV/Xq1fPc7ogwEhISIulsY0JDQy1XAwAAfJGRkaGoqKjs3/G8OCKMnLs1ExoaShgBAMBhCnrEggdYAQCAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWHVJYeS5556Tx+PRyJEj893v7bffVoMGDVSmTBk1bdpUq1evvpSvBQAALlLoMLJx40bNmjVLzZo1y3e/xMRE9erVS4MHD9bWrVsVFxenuLg47dy5s7BfDQAAXKRQYeTo0aPq06ePXnvtNV111VX57vvyyy/r9ttv1+jRo9WwYUNNmDBBrVq10rRp0wpVMAAAcJdChZFhw4YpNjZWMTExBe67YcOGi/br0qWLNmzYkOcxmZmZysjIyLEAAAB3KunvAUuWLNGWLVu0ceNGn/ZPTU1VREREjnURERFKTU3N85j4+Hg9/fTT/paWwzVj/nNJx/ti73OxV/w7AABwO796Rvbv368RI0boX//6l8qUKXOlatLYsWOVnp6evezfv/+KfRcAALDLr56RzZs36+DBg2rVqlX2ujNnzuiTTz7RtGnTlJmZqRIlSuQ4JjIyUmlpaTnWpaWlKTIyMs/v8Xq98nq9/pQGAAAcyq+ekVtvvVU7duxQUlJS9nLdddepT58+SkpKuiiISFJ0dLTWrl2bY11CQoKio6MvrXIAAOAKfvWMhISEqEmTJjnWBQcH6+qrr85e369fP1WrVk3x8fGSpBEjRqh9+/aaPHmyYmNjtWTJEm3atEmzZ8++TE0AAABOdtlnYE1OTlZKSkr257Zt22rRokWaPXu2mjdvrnfeeUcrVqy4KNQAAIDiyWOMMbaLKEhGRobCwsKUnp6u0NBQn45hNA0AAHb5+vvNu2kAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVYQRAABgFWEEAABYRRgBAABWEUYAAIBVhBEAAGAVYQQAAFhFGAEAAFYRRgAAgFWEEQAAYBVhBAAAWEUYAQAAVhFGAACAVX6FkRkzZqhZs2YKDQ1VaGiooqOj9d577+W5/7x58+TxeHIsZcqUueSiAQCAe5T0Z+fq1avrueeeU926dWWM0Ztvvqk777xTW7duVePGjXM9JjQ0VN9++232Z4/Hc2kVAwAAV/ErjHTr1i3H54kTJ2rGjBn64osv8gwjHo9HkZGRha8QAAC4WqGfGTlz5oyWLFmiY8eOKTo6Os/9jh49qpo1ayoqKkp33nmnvv766wL/7MzMTGVkZORYAACAO/kdRnbs2KHy5cvL6/Xq/vvv1/Lly9WoUaNc961fv77mzJmjlStXauHChcrKylLbtm31008/5fsd8fHxCgsLy16ioqL8LRMAADiExxhj/Dng1KlTSk5OVnp6ut
555x29/vrr+vjjj/MMJOc7ffq0GjZsqF69emnChAl57peZmanMzMzszxkZGYqKilJ6erpCQ0N9qvOaMf/xab9Lsfe52Cv+HQAAOFVGRobCwsIK/P3265kRSSpdurTq1KkjSWrdurU2btyol19+WbNmzSrw2FKlSqlly5bavXt3vvt5vV55vV5/SwMAAA50yfOMZGVl5ejFyM+ZM2e0Y8cOValS5VK/FgAAuIRfPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSnnnmGbVp00Z16tTR4cOH9cILL2jfvn0aMmTI5W8JAABwJL/CyMGDB9WvXz+lpKQoLCxMzZo105o1a3TbbbdJkpKTkxUU9L/OlkOHDmno0KFKTU3VVVddpdatWysxMdGn50sAAEDx4PcDrDb4+gDM+XiAFQAAu3z9/ebdNAAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqv8LIjBkz1KxZM4WGhio0NFTR0dF677338j3m7bffVoMGDVSmTBk1bdpUq1evvqSCAQCAu/gVRqpXr67nnntOmzdv1qZNm9SpUyfdeeed+vrrr3PdPzExUb169dLgwYO1detWxcXFKS4uTjt37rwsxQMAAOfzGGPMpfwB4eHheuGFFzR48OCLtvXs2VPHjh3TqlWrste1adNGLVq00MyZM33+joyMDIWFhSk9PV2hoaE+HXPNmP/4/OcX1t7nYq/4dwAA4FS+/n4X+pmRM2fOaMmSJTp27Jiio6Nz3WfDhg2KiYnJsa5Lly7asGFDvn92ZmamMjIyciwAAMCdSvp7wI4dOxQdHa2TJ0+qfPnyWr58uRo1apTrvqmpqYqIiMixLiIiQqmpqfl+R3x8vJ5++ml/S3MlengAAG7nd89I/fr1lZSUpC+//FIPPPCA+vfvr127dl3WosaOHav09PTsZf/+/Zf1zwcAAIHD756R0qVLq06dOpKk1q1ba+PGjXr55Zc1a9asi/aNjIxUWlpajnVpaWmKjIzM9zu8Xq+8Xq+/pQEAAAe65HlGsrKylJmZmeu26OhorV27Nse6hISEPJ8xAQAAxY9fPSNjx45V165dVaNGDR05ckSLFi3S+vXrtWbNGklSv379VK1aNcXHx0uSRowYofbt22vy5MmKjY3VkiVLtGnTJs2ePfvytwQAADiSX2Hk4MGD6tevn1JSUhQWFqZmzZppzZo1uu222yRJycnJCgr6X2dL27ZttWjRIo0bN06PP/646tatqxUrVqhJkyaXtxUAAMCx/Aojb7zxRr7b169ff9G6e+65R/fcc49fRQEAgOKDd9MAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAqwgjAADAKsIIAACwijACAACsIowAAACrCCMAAMAqwggAALCKMAIAAKwijAAAAKsIIwAAwCrCCAAAsIowAgAArCKMAAAAq/wKI/Hx8br++usVEhKiypUrKy4uTt9++22+x8ybN08ejyfHUqZMmUsqGgAAuIdfYeTjjz/WsGHD9MUXXyghIUGnT59W586ddezYsXyPCw0NVUpKSvayb9++SyoaAAC4R0l/dn7//fdzfJ43b54qV66szZs365ZbbsnzOI/Ho8jIyMJVCAAAXO2SnhlJT0+XJIWHh+e739
GjR1WzZk1FRUXpzjvv1Ndff53v/pmZmcrIyMixAAAAdyp0GMnKytLIkSN10003qUmTJnnuV79+fc2ZM0crV67UwoU
      "text/plain": [
       "<Figure size 640x480 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "file_info.filetype.value_counts().plot.bar()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c756d12e",
   "metadata": {},
   "source": [
    "You can also use this utility to find the average file size of documents in a directory, grouped by filetype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "a600fb0f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       " .dataframe tbody tr th:only-of-type {\n",
       " vertical-align: middle;\n",
       " }\n",
       "\n",
       " .dataframe tbody tr th {\n",
       " vertical-align: top;\n",
       " }\n",
       "\n",
       " .dataframe thead th {\n",
       " text-align: right;\n",
       " }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       " <thead>\n",
       " <tr style=\"text-align: right;\">\n",
       " <th></th>\n",
       " <th>filesize</th>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>filetype</th>\n",
       " <th></th>\n",
       " </tr>\n",
       " </thead>\n",
       " <tbody>\n",
       " <tr>\n",
       " <th>FileType.DOCX</th>\n",
       " <td>36602.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.EML</th>\n",
       " <td>149088.5</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.HTML</th>\n",
       " <td>1228404.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.JPG</th>\n",
       " <td>64002.5</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.PDF</th>\n",
       " <td>2429245.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.PPTX</th>\n",
       " <td>38412.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.TXT</th>\n",
       " <td>619.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.UNK</th>\n",
       " <td>1102.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.XLSX</th>\n",
       " <td>4765.0</td>\n",
       " </tr>\n",
       " <tr>\n",
       " <th>FileType.XML</th>\n",
       " <td>713.5</td>\n",
       " </tr>\n",
       " </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       " filesize\n",
       "filetype \n",
       "FileType.DOCX 36602.0\n",
       "FileType.EML 149088.5\n",
       "FileType.HTML 1228404.0\n",
       "FileType.JPG 64002.5\n",
       "FileType.PDF 2429245.0\n",
       "FileType.PPTX 38412.0\n",
       "FileType.TXT 619.0\n",
       "FileType.UNK 1102.0\n",
       "FileType.XLSX 4765.0\n",
       "FileType.XML 713.5"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_info.groupby(\"filetype\").mean(numeric_only=True)"
   ]
  },
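  {
   "cell_type": "markdown",
   "id": "e7a90c34",
   "metadata": {},
   "source": [
    "The same grouping pattern can report several statistics at once. The sketch below is self-contained and runs on a small made-up DataFrame: the `sample_info` name and its values are illustrative assumptions, not output from `unstructured`. It only assumes the `filetype` and `filesize` columns that appear in the table above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1b86d27",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical stand-in for the DataFrame returned by get_directory_file_info.\n",
    "sample_info = pd.DataFrame(\n",
    "    {\n",
    "        \"filetype\": [\"PDF\", \"PDF\", \"TXT\"],\n",
    "        \"filesize\": [2000, 4000, 600],\n",
    "    }\n",
    ")\n",
    "\n",
    "# File count, mean size, and total size for each file type.\n",
    "sample_info.groupby(\"filetype\")[\"filesize\"].agg([\"count\", \"mean\", \"sum\"])"
   ]
  },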
  {
   "cell_type": "markdown",
   "id": "870ede91",
   "metadata": {},
   "source": [
    "To pass in a list of filenames instead of a directory, use `get_file_info` in place of `get_directory_file_info`, as in the workflow below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "e5e3a24d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from unstructured.file_utils.exploration import get_file_info\n",
    "\n",
    "filenames = [os.path.join(EXAMPLE_DOCS_DIRECTORY, f) for f in os.listdir(EXAMPLE_DOCS_DIRECTORY)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d8e59472",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-html.html',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example-10k.html',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-header.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake.docx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-image-embedded.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-text.txt',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.pdf',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/email-with-image.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper-fast.jpg',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-power-point.pptx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.txt',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/README.md',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/factbook.xsl',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-excel.xlsx',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/layout-parser-paper.pdf',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/fake-email-attachment.eml',\n",
       " '/Users/mrobinson/repos/unstructured/examples/training/../../example-docs/example.jpg']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "filenames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "cb0add28",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "MIME type was message/rfc822. This file type is not currently supported in unstructured.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "FileType.EML 4\n",
       "FileType.TXT 3\n",
       "FileType.HTML 2\n",
       "FileType.XML 2\n",
       "FileType.PDF 2\n",
       "FileType.JPG 2\n",
       "FileType.UNK 1\n",
       "FileType.DOCX 1\n",
       "FileType.PPTX 1\n",
       "FileType.XLSX 1\n",
       "Name: filetype, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "file_info = get_file_info(filenames=filenames)\n",
    "file_info.filetype.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac4473b0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}