markitdown/docs/user-guide/readme.md
2024-12-25 13:09:27 +05:30

3.1 KiB

Important

(12/19/24) Hello! MarkItDown team members will be resting and recharging with family and friends over the holiday period. Activity/responses on the project may be delayed during the period of Dec 21-Jan 06. We will be excited to engage with you in the new year!

MarkItDown

PyPI PyPI - Downloads Built by AutoGen Team

MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It supports:

  • PDF
  • PowerPoint
  • Word
  • Excel
  • Images (EXIF metadata and OCR)
  • Audio (EXIF metadata and speech transcription)
  • HTML
  • Text-based formats (CSV, JSON, XML)
  • ZIP files (iterates over contents)

To install MarkItDown, use pip: pip install markitdown. Alternatively, you can install it from the source: pip install -e .

Usage

Command-Line

markitdown path-to-file.pdf > document.md

Or use -o to specify the output file:

markitdown path-to-file.pdf -o document.md

You can also pipe content:

cat path-to-file.pdf | markitdown

Python API

Basic usage in Python:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)

To use Large Language Models for image descriptions, provide llm_client and llm_model:

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")
print(result.text_content)

Docker

docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
Batch Processing Multiple Files

This example shows how to convert multiple files to markdown format in a single run. The script processes all supported files in a directory and creates corresponding markdown files.

from markitdown import MarkItDown
from openai import OpenAI
import os
client = OpenAI(api_key="your-api-key-here")
md = MarkItDown(llm_client=client, llm_model="gpt-4o-2024-11-20")
supported_extensions = ('.pptx', '.docx', '.pdf', '.jpg', '.jpeg', '.png')
files_to_convert = [f for f in os.listdir('.') if f.lower().endswith(supported_extensions)]
for file in files_to_convert:
    print(f"\nConverting {file}...")
    try:
        md_file = os.path.splitext(file)[0] + '.md'
        result = md.convert(file)
        with open(md_file, 'w') as f:
            f.write(result.text_content)
        
        print(f"Successfully converted {file} to {md_file}")
    except Exception as e:
        print(f"Error converting {file}: {str(e)}")

print("\nAll conversions completed!")
  1. Place the script in the same directory as your files
  2. Install required packages: like openai
  3. Run script bash python convert.py

Note that original files will remain unchanged and new markdown files are created with the same base name.