markitdown/README.md

201 lines
6.9 KiB
Markdown
Raw Permalink Normal View History

2024-11-13 14:33:31 -08:00
# MarkItDown
2024-11-13 11:56:46 -08:00
[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
2024-12-18 14:26:36 -08:00
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
2024-12-20 14:49:20 -08:00
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)
2024-12-18 14:26:36 -08:00
> [!IMPORTANT]
> MarkItDown 0.0.2 alpha 1 (0.0.2a1) introduces a plugin-based architecture. As much as was possible, command-line and Python interfaces have remained the same as 0.0.1a3 to support backward compatibility. Please report any issues you encounter. Some interface changes may yet occur as we continue to refine MarkItDown to a first non-alpha release.
2024-12-18 14:26:36 -08:00
2024-12-17 17:21:39 -08:00
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
It supports:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- ... and more!
2024-11-13 11:56:46 -08:00
To install MarkItDown, use pip: `pip install markitdown`. Alternatively, you can install it from the source:
```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown
```
2024-11-13 11:56:46 -08:00
2024-12-17 17:21:39 -08:00
## Usage
2024-12-17 17:21:39 -08:00
### Command-Line
2024-12-17 17:21:39 -08:00
```bash
markitdown path-to-file.pdf > document.md
```
Or use `-o` to specify the output file:
```bash
markitdown path-to-file.pdf -o document.md
```
You can also pipe content:
```bash
cat path-to-file.pdf | markitdown
```
### Plugins
MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
2024-12-17 17:21:39 -08:00
```bash
markitdown --list-plugins
```
To enable plugins use:
```bash
markitdown --use-plugins path-to-file.pdf
```
To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.
### Azure Document Intelligence
To use Microsoft Document Intelligence for conversion:
```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
2024-12-17 17:21:39 -08:00
### Python API
Basic usage in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
2024-12-17 17:21:39 -08:00
result = md.convert("test.xlsx")
print(result.text_content)
```
2024-11-13 11:56:46 -08:00
Document Intelligence conversion in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```
2025-02-20 16:56:47 -05:00
MarkItDown also supports converting file objects directly:
```python
from markitdown import MarkItDown
md = MarkItDown()
2025-02-21 13:31:37 -05:00
2025-02-27 15:07:46 -05:00
# Providing the file extension when converting via file objects is recommended for most consistent results
2025-02-21 13:31:37 -05:00
# Binary Mode
with open("test.docx", 'rb') as file:
2025-02-20 16:56:47 -05:00
result = md.convert(file, file_extension=".docx")
print(result.text_content)
2025-02-21 13:31:37 -05:00
# Non-Binary Mode
with open("sample.ipynb", 'rt', encoding="utf-8") as file:
result = md.convert(file, file_extension=".ipynb")
print(result.text_content)
2025-02-20 16:56:47 -05:00
```
2024-12-17 17:21:39 -08:00
To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)
```
2024-12-17 17:21:39 -08:00
### Docker
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
2024-11-13 11:56:46 -08:00
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### How to Contribute
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
<div align="center">
| | All | Especially Needs Help from Community |
|-----------------------|------------------------------------------|------------------------------------------------------------------------------------------|
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
</div>
2024-12-17 17:21:39 -08:00
### Running Tests and Checks
- Navigate to the MarkItDown package:
```sh
cd packages/markitdown
```
2024-12-17 17:21:39 -08:00
- Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```
2024-12-17 17:21:39 -08:00
(Alternative) Use the Devcontainer which has all the dependencies installed:
```sh
# Reopen the project in Devcontainer and run:
hatch test
```
2024-12-17 17:25:13 -08:00
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
### Contributing 3rd-party Plugins
You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.
2024-11-13 11:56:46 -08:00
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
2024-11-13 11:56:46 -08:00
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.