Microsoft open-sourced a Python tool for converting files and office documents to Markdown

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 3 days ago

utopiah@lemmy.ml · edited 1 day ago

FWIW, if you are interested in such tooling, also consider soffice and pandoc, which have (as far as I can tell) similar features but have been around for years and are not related to Microsoft.

Edit: not related to Microsoft AND Google. It seems the transcription aspect (which IMHO is still weird in that context, but OK) is done via Google servers, cf. https://lemmy.ml/post/23629310/15586865

haverholm@kbin.earth · 2 days ago

The single exception to this (which is actually buried fairly deep in the feature list) is the audio transcription tool. I didn’t take a closer look at what is used to perform this, but at least it’s not “just” document conversion like pandoc.

utopiah@lemmy.ml · 2 days ago

> audio transcription tool

Thanks for the clarification, but I’m a bit confused here. Do you mean audio transcription, i.e. STT, done by e.g. Whisper? If so, what’s the use case? When I think of Office documents, audio transcription is not something I have in mind.

utopiah@lemmy.ml · 2 days ago

PS: related, I asked on GitHub too: https://github.com/microsoft/markitdown/issues/20#issuecomment-2544630753

JackbyDev@programming.dev · 2 days ago

You should open a fresh issue for questions like that instead of asking on an unrelated one.

haverholm@kbin.earth · 2 days ago

I’m not completely clear either on how Microsoft have implemented this previously. As I said, I didn’t look very deep into the repository.

If these are indeed other Python projects they piled together, as others suggest, I’d be happy to hear what speech recognition library this might’ve been built on.

davel@lemmy.ml · 3 days ago

Huh, Beautiful Soup is still relevant. I was using it twenty years ago when it first came out.

loathsome dongeater@lemmygrad.ml · 3 days ago

This could be useful to me. A while ago I was trying to make something that takes all unread posts from my feed reader, makes an epub out of them, and then puts it behind an OPDS server.

I found that first converting the HTML from RSS to Markdown and then compiling the Markdown to an epub was the most reliable way to strip the unnecessary markup from the source HTML. I used pandoc for this.
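
A minimal sketch of that two-step pipeline, assuming pandoc is installed and on PATH. The function names, the output file name `unread.epub`, and the default title are hypothetical placeholders, not the script described above.

```python
import shutil
import subprocess
from pathlib import Path


def html_to_markdown_cmd(html_file: Path, md_file: Path) -> list[str]:
    """Build the pandoc command that converts one HTML file to Markdown (GFM)."""
    return ["pandoc", "--from", "html", "--to", "gfm",
            str(html_file), "--output", str(md_file)]


def markdown_to_epub_cmd(md_files: list[Path], epub_file: Path, title: str) -> list[str]:
    """Build the pandoc command that compiles Markdown files into one EPUB."""
    return ["pandoc", "--metadata", f"title={title}",
            "--output", str(epub_file), *[str(p) for p in md_files]]


def build_epub(html_files: list[Path], out_dir: Path, title: str = "Unread posts") -> Path:
    """Convert each HTML file to Markdown, then bundle the results into an EPUB."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)

    md_files = []
    for html_file in html_files:
        md_file = out_dir / (html_file.stem + ".md")
        subprocess.run(html_to_markdown_cmd(html_file, md_file), check=True)
        md_files.append(md_file)

    epub_file = out_dir / "unread.epub"  # hypothetical output name
    subprocess.run(markdown_to_epub_cmd(md_files, epub_file, title), check=True)
    return epub_file
```

Going through Markdown as an intermediate step is what drops most of the source markup: anything Markdown cannot express is simply not carried over into the epub.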

utopiah@lemmy.ml · 2 days ago

> I used pandoc for this.

Please come back and share whether it does better or worse, and if so along which dimensions. I’m quite curious to better understand the differences.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 3 days ago

oh yeah that’s definitely a good use case

hexaflexagonbear [he/him]@hexbear.net · 3 days ago

Oh wow, this might actually be incredibly useful

Max-P@lemmy.max-p.me · edited 2 days ago

~Not really. All the features of that tool are basic functions we’ve had since LibreOffice was still OpenOffice.~

~Since this converts to Markdown, it’s inherently a very lossy conversion. What’s hard to pull off is preserving the full formatting when converting to an odt or something.~

Someone pointed out it doesn’t just convert Word documents to Markdown, it can also transcribe and OCR, so I guess it does have some usefulness!

davel@lemmy.ml · 2 days ago

In saying this isn’t useful, you’re making a lot of assumptions about how someone might want to use this.

They may not care that it is lossy in the way that it is lossy.
They may want a CLI tool instead of a GUI tool.
They may want it as a Python library rather than as a stand-alone tool.

vort3@lemmy.ml · 2 days ago

I convert from docx to md specifically to get rid of Microsoft formatting, i.e. almost converting to plaintext while preserving at least some structure.

utopiah@lemmy.ml · 2 days ago

soffice works as a CLI, can be called from Python, and has plenty of related tooling, e.g. https://pypi.org/project/unoserver/, so I agree. I’m confused about what’s actually novel here and better than that, or even than dedicated long-standing FLOSS projects like pandoc.
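
To make "can be called from Python" concrete, here is a rough sketch, assuming both `soffice` and `pandoc` are on PATH. Chaining them this way (soffice normalizes a file to docx, pandoc turns the docx into Markdown) is just one possible combination, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def soffice_to_docx_cmd(src: Path, out_dir: Path) -> list[str]:
    """Headless LibreOffice command that converts src to docx inside out_dir."""
    return ["soffice", "--headless", "--convert-to", "docx",
            "--outdir", str(out_dir), str(src)]


def pandoc_docx_to_md_cmd(docx: Path, md: Path) -> list[str]:
    """pandoc command that converts a docx file to GitHub-flavored Markdown."""
    return ["pandoc", "--from", "docx", "--to", "gfm",
            "--output", str(md), str(docx)]


def office_to_markdown(src: Path, out_dir: Path) -> Path:
    """Convert an office document to Markdown, going through docx if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)

    if src.suffix.lower() == ".docx":
        docx = src  # pandoc can read it directly
    else:
        # soffice writes <stem>.docx into out_dir
        subprocess.run(soffice_to_docx_cmd(src, out_dir), check=True)
        docx = out_dir / (src.stem + ".docx")

    md = out_dir / (src.stem + ".md")
    subprocess.run(pandoc_docx_to_md_cmd(docx, md), check=True)
    return md
```

Something like `office_to_markdown(Path("report.odt"), Path("out"))` would then leave `out/report.md` behind, with no network access involved.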

django@discuss.tchncs.de · 2 days ago

I like LibreOffice, but converting audio files to Markdown must be a pretty recent feature, because I have never heard of it being part of LibreOffice.

utopiah@lemmy.ml · 2 days ago

> converting audio files to markdown must be a pretty recent feature

Quite curious… does it actually do that, and if so how? STT to get a plaintext file or subtitles (so with timing) has been available via e.g. Whisper quite efficiently for a while now. If this does more, though, e.g. structure (differentiating a title, list, etc.), I’d like to learn how.

django@discuss.tchncs.de · 2 days ago

There is nothing special going on. This whole project is just a bunch of Python libraries glued together into a CLI tool. It uses the SpeechRecognition package to connect to the Google speech recognition API: https://github.com/microsoft/markitdown/blob/main/src/markitdown/_markitdown.py#L691

Pretty uninteresting and a bit disappointing. Pandoc is a lot more interesting.

utopiah@lemmy.ml · 1 day ago

Thanks for the clarification. I checked the code you linked and noticed recognize_google. It seems to rely on https://github.com/Uberi/speech_recognition, which in turn seems to rely on https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/recognizers/google.py. So basically they are using an API and sending all the audio data to Google servers?

django@discuss.tchncs.de · 1 day ago

Yes, this is how I read it as well. The library supports using a local model, but they decided to just send the audio data to Google.
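
For illustration, a small sketch of what that choice looks like with the SpeechRecognition package. This is not markitdown's code: `pick_recognizer` and `transcribe` are hypothetical helpers, and I picked `recognize_sphinx` (which needs the separate pocketsphinx package) as one example of a local backend.

```python
def pick_recognizer(recognizer, offline: bool):
    """Return the recognizer method to call for local or remote transcription.

    recognize_sphinx runs locally (requires pocketsphinx);
    recognize_google sends the audio to Google's web speech API.
    """
    name = "recognize_sphinx" if offline else "recognize_google"
    return getattr(recognizer, name)


def transcribe(wav_path: str, offline: bool = True) -> str:
    """Transcribe an audio file, keeping the data local unless offline=False."""
    # Third-party dependency: pip install SpeechRecognition
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    return pick_recognizer(recognizer, offline)(audio)
```

Defaulting to `offline=True` in this sketch means audio only leaves the machine when the caller explicitly asks for the remote API.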

utopiah@lemmy.ml · 1 day ago

Might open up a GDPR-related issue there. I don’t think people using such a library assume they need connectivity, or that their data would be sent to a third party.

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown.