Yuzhong Zhang
1eaa879b25
Use **kwargs to pass the keep_data_uri parameter.
...
Add module cli vector tests
2025-03-21 00:49:36 +08:00
Yuzhong Zhang
4899148310
fix linter
2025-03-18 20:30:44 +08:00
Yuzhong Zhang
41cd9b5e2a
Add support for other converter parameters
2025-03-18 20:14:46 +08:00
Yuzhong Zhang
9f1bcf3b83
Optionally keep base64 strings in Markdown
...
_CustomMarkdownify and pptx
2025-03-18 20:01:35 +08:00
afourney
a93e0567e6
EPub Support. Adapted #123 to not use epublib. ( #1131 )
...
* Adapted #123 to not use epublib.
* Updated README.md
2025-03-17 07:48:15 -07:00
afourney
c5f70b904f
Have magika read from the stream. ( #1136 )
2025-03-17 07:39:19 -07:00
afourney
53834fdd24
Investigate and silence warnings. ( #1133 )
2025-03-15 23:41:35 -07:00
afourney
5c565b7d79
Fix remaining mypy errors. ( #1132 )
2025-03-15 23:12:48 -07:00
afourney
a78857bd43
Added epub test file. ( #1130 )
2025-03-15 18:34:51 -07:00
afourney
09df7fe8df
Small fixes for autogen integration. ( #1124 )
2025-03-12 19:18:11 -07:00
Adam Fourney
6a9f09b153
Updated Magika dependency.
2025-03-12 16:15:33 -07:00
afourney
0b815fb916
Bumping version to 0.1.0a2 ( #1123 )
2025-03-12 11:44:19 -07:00
Emanuele Meazzo
12620f1545
Handle unsupported plot types in pptx ( #1122 )
...
* Handle unsupported plot types in pptx
* Fixed formatting.
2025-03-12 11:26:23 -07:00
afourney
5f75e16d20
Refactored tests. ( #1120 )
...
* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.
2025-03-12 11:08:06 -07:00
yushihang
75140a90e2
fix: correct f-string formatting in FileConversionException ( #1121 )
2025-03-12 10:15:09 -07:00
afourney
af1be36e0c
Added CLI options for extension, mimetypes, and charset. ( #1115 )
2025-03-11 13:16:33 -07:00
Adam Fourney
2a2ccc86aa
Added mimetypes to _rss_converter
2025-03-10 16:17:41 -07:00
Adam Fourney
2e51ba22e7
Enhance type guessing.
2025-03-10 16:05:41 -07:00
afourney
8f8e58c9bb
Minimize guesses when guesses are compatible. ( #1114 )
...
* Minimize guesses when guesses are compatible.
2025-03-10 15:30:44 -07:00
afourney
8e73a325c6
Switch from puremagic to magika. ( #1108 )
2025-03-10 12:49:52 -07:00
Mohit Agarwal
2405f201af
fix typo in well-known path list ( #1109 )
2025-03-08 19:32:44 -08:00
afourney
99d8e562db
Fix exiftool in well-known paths. ( #1106 )
2025-03-07 21:47:20 -08:00
Sebastian Yaghoubi
515fa854bf
feat(docker): improve dockerfile build ( #220 )
...
* refactor(docker): remove unnecessary root user
The USER root directive isn't needed directly after FROM
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* fix(docker): use generic nobody nogroup default instead of uid gid
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* fix(docker): build app from source locally instead of installing package
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* fix(docker): use correct files in dockerignore
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* chore(docker): don't install recommended packages with git
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* fix(docker): run apt as non-interactive
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
* Update Dockerfile to new package structure, and fix streaming bugs.
---------
Signed-off-by: Sebastian Yaghoubi <sebastianyaghoubi@gmail.com>
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-07 20:07:40 -08:00
Richard Ye
0229ff6cb7
feat: sort pptx shapes to be parsed in top-to-bottom, left-to-right order ( #1104 )
...
* Sort PPTX shapes to be read in top-to-bottom, left-to-right order
Referenced from 39bef65b31/pptx2md/parser.py (L249)
* Update README.md
* Fixed formatting.
* Added missing import
2025-03-07 15:45:14 -08:00
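The reading-order sort described in 0229ff6cb7 can be sketched roughly as follows. This is a minimal illustration, not the code from that commit: the `Shape` dataclass and `sort_reading_order` are hypothetical names, and the sketch assumes every shape has integer `top` and `left` offsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shape:
    # Hypothetical stand-in for a slide shape; only the position matters here.
    name: str
    top: int
    left: int


def sort_reading_order(shapes: List[Shape]) -> List[Shape]:
    # Sort by vertical offset first, then by horizontal offset,
    # which yields a top-to-bottom, left-to-right order.
    return sorted(shapes, key=lambda s: (s.top, s.left))


shapes = [Shape("footer", 200, 0), Shape("right", 100, 50), Shape("left", 100, 10)]
ordered_names = [s.name for s in sort_reading_order(shapes)]
```

Sorting on a `(top, left)` tuple keeps the comparison simple; shapes that share a `top` value are ordered by `left` alone.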
afourney
82d84e3edd
Fixed formatting. ( #1098 )
2025-03-05 23:30:29 -08:00
scalabreseGD
36c4bc9ec3
Fixed deepcopy failure when passing llm_client ( #1089 )
...
Co-authored-by: afourney <adamfo@microsoft.com>
2025-03-05 23:25:37 -08:00
Andrea Pietrobon
80baa5db18
fix(README): correct pip install command formatting ( #1090 )
...
Added missing quotes around `markitdown[all]` in the installation command
to ensure proper package resolution by pip.
2025-03-05 23:21:10 -08:00
Adam Fourney
00a65e8f8b
Fixed version in README.
2025-03-05 23:10:21 -08:00
afourney
6bedf6d950
Fixed version. ( #1097 )
2025-03-05 22:52:52 -08:00
afourney
9380112892
Fixed loading of plugins. ( #1096 )
2025-03-05 22:24:08 -08:00
Adam Fourney
784c293579
Bump plugin version.
2025-03-05 21:55:20 -08:00
afourney
70e9f8c3c0
Bump version. ( #1094 )
2025-03-05 21:26:06 -08:00
afourney
e921497f79
Update converter API, use streams rather than file paths ( #1088 )
...
* Updated DocumentConverter interface
* Updated all DocumentConverter classes
* Added support for various new audio files.
* Updated sample plugin to new DocumentConverter interface.
* Updated project README with notes about changes, and use-cases.
* Updated DocumentConverter documentation.
* Move priority to outside DocumentConverter, allowing them to be reprioritized, and keeping the DocumentConverter interface simple.
---------
Co-authored-by: Kenny Zhang <kzhang678@gmail.com>
2025-03-05 21:16:55 -08:00
afourney
1d2f231146
Fixed property name ( #1085 )
2025-03-03 09:45:36 -08:00
afourney
c5cd659f63
Exploring ways to allow Optional dependencies ( #1079 )
...
* Enable optional dependencies. Starting with pptx.
* Fix CLI tests by having them install [all]
* Added .docx to optional dependencies
* Reuse error messages for missing dependencies.
* Added xlsx and xls
* Added pdfs
* Added Ole files.
* Updated READMEs, and finished remaining feature-categories.
* Move OpenAI to hatch-test environment.
2025-03-03 09:06:19 -08:00
afourney
f01c6c5277
Exceptions should subclass Exception not BaseException. ( #1082 )
2025-02-28 16:28:35 -08:00
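A short sketch of why the base class in f01c6c5277 matters: a generic `except Exception` handler does not catch classes derived directly from `BaseException`. The class and function names below are hypothetical and are not taken from the project.

```python
class ConversionError(Exception):
    # Hypothetical error type deriving from Exception (the recommended base).
    pass


class MisplacedError(BaseException):
    # Hypothetical error type deriving directly from BaseException.
    pass


def caught_by_generic_handler(exc_type) -> bool:
    # Returns True when a plain `except Exception` clause catches exc_type.
    try:
        raise exc_type("conversion failed")
    except Exception:
        return True
    except BaseException:
        return False
```

Callers that wrap conversions in `except Exception` would therefore miss errors of the second kind.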
afourney
43bd79adc9
Print and log better exceptions when file conversions fail. ( #1080 )
...
* Print and log better exceptions when file conversions fail.
* Added unit tests for exceptions.
2025-02-28 16:07:47 -08:00
afourney
9182923375
Don't have ZipConverter accept OOXML files. This will never yield a good result. ( #1078 )
2025-02-28 09:54:19 -08:00
afourney
9a19fdd134
Make sure extensions are unique in MarkItDown's convert methods. ( #1076 )
2025-02-28 07:43:03 -08:00
Matthew Powers
e82e0c1372
Add Support For PPTX Shape Groups (Fix in code design to not miss out on slide content) ( #331 )
...
* Adds support for Shape Groups
* Update to Test PPtx for nested shape
* This line was accidentally removed and is added back here
2025-02-27 23:21:51 -08:00
Nima Akbarzadeh
a394cc7c27
fix: Implement retry logic for YouTube transcript fetching and fix URL decoding issue ( #1035 )
...
* fix: add error handling, refactor _findKey to use json.items()
* fix: improve metadata and description extraction logic
* fix: improve YouTube transcript extraction reliability
* fix: implement retry logic for YouTube transcript fetching and fix URL decoding issue
* fix(readme): add youtube URLs as markitdown supports
2025-02-27 23:17:54 -08:00
tanreinama
a87fbf01ee
add necessary imports ( #861 )
...
* add necessary imports
2025-02-27 23:16:09 -08:00
André Menezes
d0ed74fdf4
Fix UnboundLocalError in MarkItDown._convert ( #1038 )
...
Initialize `res` at the beginning of `_convert`. Previously, if the first converter raised an exception, `res` was never assigned, and the later `if res is not None` check failed with an UnboundLocalError.
2025-02-27 23:11:27 -08:00
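The pattern behind the d0ed74fdf4 fix can be sketched as below. This is an illustrative reduction under assumed names (`convert_with_fallback`, `failing`, `upper` are hypothetical), not the body of `MarkItDown._convert`.

```python
from typing import Callable, List, Optional


def convert_with_fallback(
    converters: List[Callable[[str], Optional[str]]], source: str
) -> Optional[str]:
    # Bind `res` before the loop. Without this line, an exception in the
    # first converter would leave `res` unassigned, and the check below
    # would raise UnboundLocalError.
    res: Optional[str] = None
    errors: List[Exception] = []
    for converter in converters:
        try:
            res = converter(source)
        except Exception as exc:
            # Record the failure and let the next converter try.
            errors.append(exc)
        if res is not None:
            return res
    return None


def failing(_: str) -> Optional[str]:
    raise ValueError("unsupported input")


def upper(text: str) -> Optional[str]:
    return text.upper()
```

With `res` bound up front, `convert_with_fallback([failing, upper], "hi")` falls through to the second converter instead of failing on the `res` check.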
afourney
e4b419ba40
Pin Markdownify version. ( #1069 )
...
* Pin markdownify version. TODO: update code for compatibility with Markdownify 1.0.0
2025-02-27 23:09:33 -08:00
afourney
dbdf2c0c10
Added CLI tests. ( #327 )
2025-02-11 20:42:50 -08:00
KennyZhang1
97eeed5f32
Doc Intelligence fixes for refactored code ( #325 )
...
* added priority flag to doc intel converter constructor
* fixed analysis features bug for docx
2025-02-11 16:01:46 -08:00
afourney
935da9976c
Added priority argument to all converter constructors. ( #324 )
...
* Added priority argument to all converter constructors.
2025-02-11 12:36:32 -08:00
Ruijun Gao
5ce85c236c
Fix a typo in sample RTF plugin ( #320 )
2025-02-11 10:33:52 -08:00
Tomasz Kalinowski
3a5ca22a8d
Don't generate md links in 'pre' blocks ( #322 )
2025-02-11 07:13:17 -08:00
Adam Fourney
4b62506451
Small typo in README.
2025-02-10 15:24:28 -08:00