Commit graph

245 commits

Author SHA1 Message Date
Adam Fourney
84f8198d8a Fixed many mypy errors. 2025-03-05 16:41:15 -08:00
Adam Fourney
aa94bce6d9 Bumped version. 2025-03-05 15:15:23 -08:00
Adam Fourney
fe1d57a06f Updated DocumentConverter documentation. 2025-03-05 15:12:13 -08:00
Adam Fourney
1eb8b927c2 Add type hint, resolving circular import. 2025-03-05 15:04:59 -08:00
Adam Fourney
1ce769e70d Fixed formatting. 2025-03-05 14:01:53 -08:00
Kenny Zhang
a96a6a01b5 more formatting 2025-03-05 16:57:54 -05:00
Kenny Zhang
8c3dd01f2f black formatting 2025-03-05 16:54:51 -05:00
Kenny Zhang
30e5189581 removed dupe priority setting 2025-03-05 16:48:23 -05:00
Kenny Zhang
c281844c02 ported over unit tests from prev branch 2025-03-05 16:44:13 -05:00
Adam Fourney
4d097aa379 Updated markdownify dependency. 2025-03-05 13:03:48 -08:00
Adam Fourney
cc38144752 Updated project readme with notes about changes, and use-cases. 2025-03-05 11:50:56 -08:00
Adam Fourney
5f0b63bb95 Remove stale comments. 2025-03-05 11:38:43 -08:00
Adam Fourney
aa57757395 Updated plugin README. 2025-03-05 11:37:00 -08:00
Adam Fourney
36a49806b5 Updated sample plugin to new Converter interface. 2025-03-05 11:30:48 -08:00
Adam Fourney
b3d6009eb8 Small cleanup. 2025-03-05 10:42:36 -08:00
Adam Fourney
736e0ae332 Fixed exif warning test. 2025-03-05 10:39:29 -08:00
Adam Fourney
a9ceb13feb Added support for various audio files. 2025-03-05 10:15:42 -08:00
Adam Fourney
c426cb81b3 Most converters are now working. 2025-03-05 00:24:54 -08:00
Adam Fourney
4a034da269 Stream exiftool. 2025-03-04 17:18:54 -08:00
Adam Fourney
7879028c98 Added Outlook messages. 2025-03-04 16:15:07 -08:00
Adam Fourney
4d09a4c6c6 Updating converters. 2025-03-04 13:57:49 -08:00
Adam Fourney
df372fa460 Progress on HTML converter. 2025-03-04 08:33:50 -08:00
Adam Fourney
4129f30c23 More progress. 2025-03-04 00:52:57 -08:00
Adam Fourney
7bc6d827ee Experimenting with new signatures. 2025-03-03 23:01:16 -08:00
Adam Fourney
e43632b048 Initial work updating signatures. 2025-03-03 13:16:15 -08:00
afourney
1d2f231146
Fixed property name (#1085) 2025-03-03 09:45:36 -08:00
afourney
c5cd659f63
Exploring ways to allow Optional dependencies (#1079)
* Enable optional dependencies. Starting with pptx.
* Fix CLI tests.... have them install [all]
* Added .docx to optional dependencies
* Reuse error messages for missing dependencies.
* Added xlsx and xls
* Added pdfs
* Added Ole files.
* Updated READMEs, and finished remaining feature-categories.
* Move OpenAI to hatch-test environment.
2025-03-03 09:06:19 -08:00
afourney
f01c6c5277
Exceptions should subclass Exception not BaseException. (#1082) 2025-02-28 16:28:35 -08:00
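Note on f01c6c5277: the distinction matters because a generic `except Exception:` handler does not catch classes that derive directly from `BaseException`. The sketch below illustrates that behaviour only; the class and function names are hypothetical and are not taken from the MarkItDown source.

```python
# Hypothetical exception classes, for illustration only.
class ErrorFromBaseException(BaseException):
    """Derives directly from BaseException (the pattern the commit moves away from)."""


class ErrorFromException(Exception):
    """Derives from Exception (the pattern the commit title recommends)."""


def caught_by_generic_handler(exc_type):
    """Return True if `except Exception` catches an instance of exc_type."""
    try:
        raise exc_type("conversion failed")
    except Exception:
        # Typical application-level handler.
        return True
    except BaseException:
        # Only reached for classes outside the Exception hierarchy.
        return False
```

Callers that wrap a conversion in `try: ... except Exception:` would therefore miss the first class and catch the second.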
afourney
43bd79adc9
Print and log better exceptions when file conversions fail. (#1080)
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
afourney
9182923375
Don't have ZipConverter accept OOXML files. This will never yield a good result. (#1078) 2025-02-28 09:54:19 -08:00
afourney
9a19fdd134
Make sure extensions are unique in MarkItDown's convert methods. (#1076) 2025-02-28 07:43:03 -08:00
Matthew Powers
e82e0c1372
Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) (#331)
* Adds support for Shape Groups
* Update to Test PPtx for nested shape
* This line was accidentally removed and is added back here
2025-02-27 23:21:51 -08:00
Nima Akbarzadeh
a394cc7c27
fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue (#1035)
* fix: add error handling, refactor _findKey to use json.items()
* fix: improve metadata and description extraction logic
* fix: improve YouTube transcript extraction reliability
* fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue
* fix(readme): add youtube URLs as markitdown supports
2025-02-27 23:17:54 -08:00
tanreinama
a87fbf01ee
add necessary imports (#861)
* add necessary imports
2025-02-27 23:16:09 -08:00
André Menezes
d0ed74fdf4
Fix UnboundLocalError in MarkItDown._convert (#1038)
Initialize `res` at the beginning of `_convert`. If the first converter raises an exception, then the `res` variable was not initialized and we got an error when checking `if res is not None`
2025-02-27 23:11:27 -08:00
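Note on d0ed74fdf4: the failure mode described in that message can be reproduced with a small loop that tries converters in turn. This is a minimal sketch under assumed names (`first_successful`, plain callables as converters), not the actual body of `MarkItDown._convert`.

```python
def first_successful(converters, source):
    """Try each converter in order; return (result, errors).

    `converters` is assumed to be a list of callables that either return
    a result, return None to decline, or raise an exception.
    """
    # Initialise up front. Without this line, an exception raised by the
    # first converter leaves `res` unbound, and the `if res is not None`
    # check below raises UnboundLocalError.
    res = None
    errors = []
    for convert in converters:
        try:
            res = convert(source)
        except Exception as exc:
            # Record the failure and fall through to the check below.
            errors.append(exc)
        if res is not None:
            return res, errors
    return None, errors
```

With the initialisation in place, a failing first converter is recorded in `errors` and the loop simply moves on to the next one.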
afourney
e4b419ba40
Pin Markdownify version. (#1069)
* Pin markdownify version. TODO: update code for compatibility with Markdownify 1.0.0
2025-02-27 23:09:33 -08:00
afourney
dbdf2c0c10
Added CLI tests. (#327) 2025-02-11 20:42:50 -08:00
KennyZhang1
97eeed5f32
Doc Intelligence fixes for refactored code (#325)
* added priority flag to doc intel converter constructor
* fixed analysis features bug for docx
2025-02-11 16:01:46 -08:00
afourney
935da9976c
Added priority argument to all converter constructors. (#324)
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
Ruijun Gao
5ce85c236c
Fix a typo in sample RTF plugin (#320) 2025-02-11 10:33:52 -08:00
Tomasz Kalinowski
3a5ca22a8d
Don't generate md links in 'pre' blocks (#322) 2025-02-11 07:13:17 -08:00
Adam Fourney
4b62506451 Small typo in README. 2025-02-10 15:24:28 -08:00
afourney
c73afcffea
Cleanup and refactor, in preparation for plugin support. (#318)
* Work started moving converters to individual files.
* Significant cleanup and refactor.
* Moved everything to a packages subfolder.
* Added sample plugin.
* Added instructions to the README.md
* Bumped version, and added a note about compatibility.
2025-02-10 15:21:44 -08:00
wunde005
73ba69d8cd
For csv files mimetypes.guess_type is returning "application/vnd.ms-excel" on windows causing an invalid mime type in plaintextconverter. In reference to issue: https://github.com/microsoft/markitdown/issues/150 (#273) 2025-02-08 20:58:13 -08:00
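Note on 73ba69d8cd: `mimetypes.guess_type` can be influenced by the host platform's type registrations, which is how a `.csv` file can be reported as `application/vnd.ms-excel` on Windows. One common way to guard against this is to check a small extension table before falling back to the standard library. The sketch below shows that approach; the function name and override table are hypothetical, and the commit may have solved the problem differently.

```python
import mimetypes
import os

# Hypothetical override table: extensions whose type should not depend
# on the host platform's registrations.
EXTENSION_OVERRIDES = {".csv": "text/csv"}


def guess_content_type(filename):
    """Return a MIME type for filename, preferring the explicit overrides."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTENSION_OVERRIDES:
        return EXTENSION_OVERRIDES[ext]
    # Fall back to the standard library for everything else.
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type
```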
Werner Robitza
2a4f7bb6a8
fix: argparse CLI option ordering, fixes #268 (#290)
* fix: argparse CLI option ordering, fixes #268
* Fixed formatting.
2025-02-08 20:50:38 -08:00
masquare
7cf5e0bb23
feat(pptx): support image description with LLM for pptx files (#306) 2025-02-08 20:37:34 -08:00
James Hickey
3090917a49
Typo fixed (#270) 2025-02-08 20:30:13 -08:00
ZeyuTeng96
7bea2672a0
remove leading and trailing \n for HtmlConverter (#262) 2025-02-08 20:28:35 -08:00
KennyZhang1
bf6a15e9b5
Kennyzhang/docintel docs (#312)
* updated docs to include doc intelligence
* include reference to doc intel setup docs
2025-01-31 22:23:26 -08:00
KennyZhang1
bfde857420
Add support for conversion via Document Intelligence (#303)
* added cli params for doc intel
* added DocumentIntelligenceConverter class implementation
* initialized doc intel client instance field
* added isolated doc_intel main conversion function
* temp fix for ContentFormat import bug
* ran tests for docintel and offline for many filetypes
* push doc intel converter to the top of the stack
* formatting changes
* modified project toml file
2025-01-24 14:09:32 -08:00