Convert Word documents to beautiful Markdown, via the command line or in your browser. An even better version of the original word-to-markdown.
- Paragraphs
- Numbered lists
- Bullet lists
- Nested lists
- Headings
- Lists
- Tables
- Footnotes and endnotes
- Images
- Bold, italics, underlines, strikethrough, superscript, and subscript
- Links
- Line breaks
- Text boxes
- Comments
TL;DR: This project is a complete rewrite, using modern tools and libraries, and is much faster and more reliable. The output should be the same or better. Feedback welcome!
Word to Markdown can be run from the command line or in your browser. In either case, the conversion happens on your own machine, and no information ever leaves it.
- Clone the repo
- Run npm install
- Run w2m path/to/your/file.docx
To use the browser version instead, start the web server:
npm run server:web
You can also run Word to Markdown as an HTTP API server, where you can make requests from elsewhere.
npm run server
The server exposes a POST /raw endpoint, which returns the converted Markdown.
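As a rough illustration of calling the endpoint from another program, the sketch below builds and sends a request to POST /raw. The README does not document the port or the request body format, so the base URL (http://localhost:3000), the multipart field name (doc), and the helper names buildRawRequest and convertViaApi are all assumptions; check the server source before relying on them. It uses only the fetch, FormData, and Blob globals available in Node 18+ and in browsers.

```javascript
// Hypothetical helper: builds the URL and fetch options for POST /raw.
// ASSUMPTIONS (not confirmed by the README): the server accepts a
// multipart/form-data upload, the file field is named "doc", and the
// server listens on http://localhost:3000.
function buildRawRequest(bytes, filename, baseUrl = "http://localhost:3000") {
  const form = new FormData();
  form.append("doc", new Blob([bytes]), filename);
  return {
    url: new URL("/raw", baseUrl).toString(),
    init: { method: "POST", body: form },
  };
}

// Sends the document and resolves with the Markdown text the server returns.
async function convertViaApi(bytes, filename, baseUrl) {
  const { url, init } = buildRawRequest(bytes, filename, baseUrl);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP ${response.status}`);
  }
  return response.text();
}
```

Splitting the request construction from the network call keeps the assumed parts (URL, field name) in one place, so they are easy to adjust once the actual request format is known.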
See the README of the original Word to Markdown for the project's motivation.
The original Word to Markdown is 10 years old. Its conversion process was as follows:
- Use LibreOffice to convert the Word document to HTML.
- Use a bunch of regular expressions to clean up the HTML.
- Use Premailer to inline the CSS.
- Use Nokogiri to manipulate the HTML further.
- Use Reverse Markdown to convert the HTML to Markdown.
- Use a bunch of regular expressions to clean up the Markdown.
Not only did this process require installing and shelling out to a huge binary (LibreOffice), but it was also very fragile, and key projects like Reverse Markdown are no longer maintained. I tried experimenting with Pandoc, but it had many of the same limitations.
The rewrite uses a much simpler process:
- Use Mammoth.js to convert the Word document to HTML.
- Use Turndown to convert the HTML to Markdown.
- Use Markdownlint to clean up the Markdown.
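The three steps above can be sketched as a small pipeline. To keep the example self-contained, the stages are passed in as functions and the demo uses trivial stand-ins. The names convertDocx, toHtml, toMarkdown, tidy, and demoStages are hypothetical and not taken from the project's source.

```javascript
// Hypothetical pipeline: each stage is injected so the flow is easy to follow
// and to test. These names are illustrative, not the project's actual API.
async function convertDocx(input, { toHtml, toMarkdown, tidy }) {
  const html = await toHtml(input);        // Word document -> HTML (Mammoth.js in the real project)
  const markdown = await toMarkdown(html); // HTML -> Markdown (Turndown in the real project)
  return tidy(markdown);                   // clean-up pass (Markdownlint in the real project)
}

// Stand-in stages for demonstration only; they are NOT the real libraries.
const demoStages = {
  toHtml: async (text) => `<h1>${text}</h1>`,
  toMarkdown: (html) => html.replace(/<h1>(.*?)<\/h1>/g, "# $1"),
  tidy: (markdown) => markdown.trim() + "\n",
};

convertDocx("Hello", demoStages).then((markdown) => {
  console.log(JSON.stringify(markdown)); // prints "# Hello\n"
});
```

With the real libraries installed, toHtml could be written around mammoth.convertToHtml({ path }), which resolves to an object whose value property holds the HTML, and toMarkdown around new TurndownService().turndown(html). How the project actually wires these together may differ.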
All three of these projects are actively maintained and heavily used, and they allow us to convert the document faster and entirely in JavaScript. Heck, I think this could theoretically run in the browser for added privacy.
It's still in beta, but so far, I've found the output to be better, with much less manual cleanup required. Notice something is off? Please open an issue.
One note: This project does not yet attempt to guess heading levels based on font size. It could, but it's not yet implemented.
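For readers curious what such a heuristic might look like, here is one possible approach. It is not something the project does: it simply treats the most common font size as body text and ranks any larger sizes as headings. The function name guessHeadingLevels and its input (a list of per-paragraph font sizes in points) are hypothetical.

```javascript
// Hypothetical heuristic, NOT implemented in the project: guess heading levels
// from the font size (in points) of each paragraph in a document.
function guessHeadingLevels(fontSizes) {
  // Count how often each size occurs; the most frequent is assumed to be body text.
  const counts = new Map();
  for (const size of fontSizes) counts.set(size, (counts.get(size) || 0) + 1);
  let bodySize = null;
  let bodyCount = 0;
  for (const [size, count] of counts) {
    if (count > bodyCount) {
      bodySize = size;
      bodyCount = count;
    }
  }
  // Sizes larger than body text become headings: largest -> 1, next -> 2, capped at 6.
  const headingSizes = [...counts.keys()]
    .filter((size) => size > bodySize)
    .sort((a, b) => b - a);
  return fontSizes.map((size) => {
    const rank = headingSizes.indexOf(size);
    return rank === -1 ? 0 : Math.min(rank + 1, 6); // 0 means "not a heading"
  });
}

console.log(JSON.stringify(guessHeadingLevels([24, 11, 11, 16, 11, 11, 16, 11])));
// prints [1,0,0,2,0,0,2,0]
```

A real implementation would also need to read font sizes out of the Word document and handle documents that use bold or spacing, rather than size, to mark headings.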