Updated README
This commit is contained in:
parent
a7ae7c53d8
commit
ae5fd74821
1 changed files with 13 additions and 2 deletions
15
README.md
15
README.md
|
|
@ -9,9 +9,9 @@
|
|||
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install markitdown[all]` to have backward-compatible behavior.
|
||||
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
|
||||
|
||||
MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It is comparable to [Apache Tika](https://tika.apache.org/) or [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?tabs=doc-intel-4.0.0), but can perform many simple operations locally, without a server or subscription. While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools. MarkItDown may not be the best option for high-fidelity document conversions for publication or document sharing, etc.
|
||||
MarkItDown is a lightwight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving mportant document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
|
||||
|
||||
At present, it supports:
|
||||
At present, MarkItDown supports:
|
||||
|
||||
- PDF
|
||||
- PowerPoint
|
||||
|
|
@ -25,6 +25,17 @@ At present, it supports:
|
|||
- Youtube URLs
|
||||
- ... and more!
|
||||
|
||||
## Why Markdown?
|
||||
|
||||
Markdown is extremely close to plain text, with minimal markup or formatting, but still
|
||||
provides a way to represent important document structure. Importantly, mainstream LLMs,
|
||||
such as OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into
|
||||
their responses unprompted. This suggests that they have been trained on vast amounts of
|
||||
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
|
||||
a are also highly token-efficient.
|
||||
|
||||
## Installation
|
||||
|
||||
To install MarkItDown, use pip: `pip install markitdown[all]`. Alternatively, you can install it from the source:
|
||||
|
||||
```bash
|
||||
|
|
|
|||
Loading…
Reference in a new issue