232 commits

Author SHA1 Message Date
lumin
e4238eb1ac
Merge 4050de78b6 into 82d84e3edd 2025-03-06 05:35:37 -08:00
afourney
82d84e3edd
Fixed formatting. (#1098) 2025-03-05 23:30:29 -08:00
scalabreseGD
36c4bc9ec3
Fixed deepcopy failure when passing llm_client (#1089)
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-05 23:25:37 -08:00
Andrea Pietrobon
80baa5db18
fix(README): correct pip install command formatting (#1090)
Added missing quotes around `markitdown[all]` in the installation command  
to ensure proper package resolution by pip.
2025-03-05 23:21:10 -08:00
Adam Fourney
00a65e8f8b Fixed version in README. 2025-03-05 23:10:21 -08:00
afourney
6bedf6d950
Fixed version. (#1097) 2025-03-05 22:52:52 -08:00
afourney
9380112892
Fixed loading of plugins. (#1096) 2025-03-05 22:24:08 -08:00
Adam Fourney
784c293579 Bump plugin version. 2025-03-05 21:55:20 -08:00
afourney
70e9f8c3c0
Bump version. (#1094) 2025-03-05 21:26:06 -08:00
afourney
e921497f79
Update converter API, use streams rather than file paths (#1088)
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.

---------

Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00
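The last bullet of the commit above (priority kept outside the converter) can be pictured with a minimal sketch. The names `Registration`, `Registry`, `register` and `convert` are hypothetical and are not taken from the repository. The sketch also assumes that lower priority values are tried first, which the commit message does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A converter here is just a callable that returns text, or None if it declines.
Converter = Callable[[bytes], Optional[str]]


@dataclass
class Registration:
    """Pairs a converter with a priority stored outside the converter itself."""
    converter: Converter
    priority: float


class Registry:
    """Hypothetical registry; converters stay simple and can be reprioritized."""

    def __init__(self) -> None:
        self._registrations: List[Registration] = []

    def register(self, converter: Converter, priority: float = 0.0) -> None:
        self._registrations.append(Registration(converter, priority))

    def convert(self, data: bytes) -> Optional[str]:
        # Assumption: lower priority values are tried first.
        # sorted() is stable, so equal priorities keep registration order.
        for reg in sorted(self._registrations, key=lambda r: r.priority):
            result = reg.converter(data)
            if result is not None:
                return result
        return None


def pdf_converter(data: bytes) -> Optional[str]:
    """Illustrative format-specific converter: accepts only PDF-looking input."""
    return "specific" if data.startswith(b"%PDF") else None


def generic_converter(data: bytes) -> Optional[str]:
    """Illustrative catch-all converter: accepts anything."""
    return "generic"


registry = Registry()
registry.register(generic_converter, priority=10.0)
registry.register(pdf_converter, priority=0.0)
```

Because the priority lives in the registration rather than in the converter, the same converter could be registered again with a different priority without changing its class.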
afourney
1d2f231146
Fixed property name (#1085) 2025-03-03 09:45:36 -08:00
afourney
c5cd659f63
Exploring ways to allow Optional dependencies (#1079)
* Enable optional dependencies. Starting with pptx.
* Fix CLI tests.... have them install [all]
* Added .docx to optional dependencies
* Reuse error messages for missing dependencies.
* Added xlsx and xls
* Added pdfs
* Added Ole files.
* Updated READMEs, and finished remaining feature-categories.
* Move OpenAI to hatch-test environment.
2025-03-03 09:06:19 -08:00
afourney
f01c6c5277
Exceptions should subclass Exception not BaseException. (#1082) 2025-02-28 16:28:35 -08:00
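The distinction in the commit above matters because a generic `except Exception` handler does not catch classes that derive directly from `BaseException`. A minimal sketch, using hypothetical class names that are not taken from the repository:

```python
class BadConversionError(BaseException):
    """Hypothetical error that derives directly from BaseException."""


class GoodConversionError(Exception):
    """Hypothetical error that derives from Exception."""


def caught_by_generic_handler(error_type) -> bool:
    """Return True if `except Exception` catches an instance of error_type."""
    try:
        raise error_type("conversion failed")
    except Exception:
        return True
    except BaseException:
        # Reached only for errors that bypass the generic handler above.
        return False
```

Calling `caught_by_generic_handler` with each class shows which one a typical caller's `except Exception` block would actually see.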
afourney
43bd79adc9
Print and log better exceptions when file conversions fail. (#1080)
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
afourney
9182923375
Don't have ZipConverter accept OOXML files. This will never yield a good result. (#1078) 2025-02-28 09:54:19 -08:00
afourney
9a19fdd134
Make sure extensions are unique in MarkItDown's convert methods. (#1076) 2025-02-28 07:43:03 -08:00
Matthew Powers
e82e0c1372
Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) (#331)
* Adds support for Shape Groups

* Update to Test PPtx for nested shape

* This line was accidentally removed and is added back here
2025-02-27 23:21:51 -08:00
Nima Akbarzadeh
a394cc7c27
fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035)
* fix: add error handling, refactor _findKey to use json.items()

* fix: improve metadata and description extraction logic

* fix: improve YouTube transcript extraction reliability

* fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue

* fix(readme): add youtube URLs as markitdown supports
2025-02-27 23:17:54 -08:00
tanreinama
a87fbf01ee
add necessary imports (#861)
* add necessary imports
2025-02-27 23:16:09 -08:00
André Menezes
d0ed74fdf4
Fix UnboundLocalError in MarkItDown._convert (#1038)
Initialize `res` at the beginning of `_convert`. If the first converter raises an exception, then the `res` variable was not initialized and we got an error when checking `if res is not None`
2025-02-27 23:11:27 -08:00
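The failure described in the commit above can be reproduced with a small sketch. The function and converter names below are hypothetical stand-ins, not the repository's `_convert` implementation.

```python
def failing_converter(data):
    """Illustrative converter that always raises."""
    raise ValueError("unsupported input")


def upper_converter(data):
    """Illustrative converter that always succeeds."""
    return data.upper()


def convert_buggy(converters, data):
    """`res` is never bound if the first converter raises."""
    for converter in converters:
        try:
            res = converter(data)
        except ValueError:
            pass
        if res is not None:  # UnboundLocalError when the first call raised
            return res
    return None


def convert_fixed(converters, data):
    """Same loop, with `res` initialized before any converter runs."""
    res = None
    for converter in converters:
        try:
            res = converter(data)
        except ValueError:
            pass
        if res is not None:
            return res
    return None
```

With `[failing_converter, upper_converter]` as input, `convert_buggy` raises `UnboundLocalError`, while `convert_fixed` falls through to the second converter.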
afourney
e4b419ba40
Pin Markdownify version. (#1069)
* Pin markdownify version. TODO: update code for compatibility with Markdownify 1.0.0
2025-02-27 23:09:33 -08:00
afourney
dbdf2c0c10
Added CLI tests. (#327) 2025-02-11 20:42:50 -08:00
KennyZhang1
97eeed5f32
Doc Intelligence fixes for refactored code (#325)
* added priority flag to doc intel converter constructor
* fixed analysis features bug for docx
2025-02-11 16:01:46 -08:00
afourney
935da9976c
Added priority argument to all converter constructors. (#324)
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
Ruijun Gao
5ce85c236c
Fix a typo in sample RTF plugin (#320) 2025-02-11 10:33:52 -08:00
Tomasz Kalinowski
3a5ca22a8d
Don't generate md links in 'pre' blocks (#322) 2025-02-11 07:13:17 -08:00
Adam Fourney
4b62506451 Small typo in README. 2025-02-10 15:24:28 -08:00
afourney
c73afcffea
Cleanup and refactor, in preparation for plugin support. (#318)
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.
2025-02-10 15:21:44 -08:00
wunde005
73ba69d8cd
For csv files, mimetypes.guess_type returns "application/vnd.ms-excel" on Windows, causing an invalid MIME type in plaintextconverter. See issue: https://github.com/microsoft/markitdown/issues/150 (#273) 2025-02-08 20:58:13 -08:00
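One way to guard against a platform-dependent guess like the one described above is to consult an extension override table before falling back to `mimetypes.guess_type`. This is only a sketch under that assumption, not necessarily how the repository addressed the issue. `EXTENSION_OVERRIDES` and `guess_mime_type` are hypothetical names.

```python
import mimetypes
import os

# Hypothetical override table; only the .csv entry is motivated by the commit.
EXTENSION_OVERRIDES = {".csv": "text/csv"}


def guess_mime_type(path):
    """Return a MIME type for path, preferring explicit overrides."""
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSION_OVERRIDES:
        return EXTENSION_OVERRIDES[ext]
    # Fall back to the standard library, whose answer can vary by platform.
    mime_type, _encoding = mimetypes.guess_type(path)
    return mime_type
```

The override is keyed on the lower-cased extension, so `report.CSV` and `report.csv` resolve the same way regardless of what the operating system's type registry reports.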
Werner Robitza
2a4f7bb6a8
fix: argparse CLI option ordering, fixes #268 (#290)
* fix: argparse CLI option ordering, fixes #268
* Fixed formatting.
2025-02-08 20:50:38 -08:00
masquare
7cf5e0bb23
feat(pptx): support image description with LLM for pptx files (#306) 2025-02-08 20:37:34 -08:00
James Hickey
3090917a49
Typo fixed (#270) 2025-02-08 20:30:13 -08:00
ZeyuTeng96
7bea2672a0
remove leading and trailing \n for HtmlConverter (#262) 2025-02-08 20:28:35 -08:00
KennyZhang1
bf6a15e9b5
Kennyzhang/docintel docs (#312)
* updated docs to include doc intelligence

* include reference to doc intel setup docs
2025-01-31 22:23:26 -08:00
KennyZhang1
bfde857420
Add support for conversion via Document Intelligence (#303)
* added cli params for doc intel

* added DocumentIntelligenceConverter class implementation

* initialized doc intel client instance field

* added isolated doc_intel main conversion function

* temp fix for ContentFormat import bug

* ran tests for docintel and offline for many filetypes

* push doc intel converter to the top of the stack

* formatting changes

* modified project toml file
2025-01-24 14:09:32 -08:00
afourney
f58a864951
Set exiftool path explicitly. (#267) 2025-01-06 12:43:47 -08:00
afourney
265aea2edf
Removed the holiday away message from README.md (#266) 2025-01-06 09:06:21 -08:00
afourney
05b78e7ce1
Recognize json as plain text (if no other handlers are present). (#261)
* Recognize json as plain text (if no other handlers are present).
2025-01-03 16:40:43 -08:00
afourney
436407288f
If puremagic has no guesses, try again after ltrim. (#260) 2025-01-03 16:03:11 -08:00
afourney
731b39e7f5
Added a test for leading spaces. (#258) 2025-01-03 14:34:33 -08:00
yeungadrian
08ed32869e
Feature/ Add xls support (#169)
* add xlrd
* add xls converter with tests
2025-01-03 13:58:17 -08:00
Murat Can Kurtuluş
d248621ba4
feat: outlook ".msg" file converter (#196)
* feat: outlook .msg converter
* add test, adjust docstring
2025-01-03 13:34:39 -08:00
AbSadiki
4678c8a2a4
fix(transcription): IS_AUDIO_TRANSCRIPTION_CAPABLE should be initialized (#194) 2025-01-03 13:29:26 -08:00
lumin
4050de78b6 refactor: update devcontainer configuration for clarity
Remove unnecessary INSTALL_GIT argument and set target to 
development in the devcontainer.json file. This simplifies 
the configuration and aligns it with the intended development 
environment setup.
2024-12-28 12:10:41 +09:00
lumin
5b811fd66a feat(docker): restructure Dockerfile for multi-stage build
Update the Dockerfile to implement a multi-stage build process. 
Introduce a dedicated FFmpeg stage and separate development, build, 
and production stages to optimize image size and improve build 
efficiency. Add necessary dependencies and configure the 
environment for better performance. Update the .dockerignore 
to exclude sensitive files and unnecessary directories.
2024-12-27 22:56:36 +09:00
Ikko Eltociear Ashimine
125e206047
docs: update README.md (#182)
faciliate -> facilitate
2024-12-21 01:51:30 -08:00
numekudi
f94d09990e
feat: enable Git support in devcontainer (#136)
Co-authored-by: gagb <gagb@users.noreply.github.com>
2024-12-20 18:09:17 -08:00
lumin
cfd2319c14
feat: add version option to markitdown CLI (#172)
Add a `--version` option to the markitdown command-line interface 
that displays the current version number.
2024-12-20 16:24:45 -08:00
dependabot[bot]
73161982ff
Bump actions/setup-python from 2 to 5 (#179)
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 5.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](https://github.com/actions/setup-python/compare/v2...v5)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:20:22 -08:00
dependabot[bot]
9b69467772
Bump actions/cache from 3 to 4 (#178)
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: gagb <gagb@users.noreply.github.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2024-12-20 16:17:43 -08:00