nmaniwa/word-to-markdown-js

 
 

Word to Markdown

Convert Word documents to beautiful Markdown. Via command line or in your browser. An even better version of the original word-to-markdown.

Supports

  • Paragraphs
  • Numbered lists
  • Bullet lists
  • Nested lists
  • Headings
  • Tables
  • Footnotes and endnotes
  • Images
  • Bold, italics, underlines, strikethrough, superscript and subscript
  • Links
  • Line breaks
  • Text boxes
  • Comments

How is this different from the original?

TL;DR: This project is a complete rewrite, using modern tools and libraries, and is much faster and more reliable. The output should be the same or better. Feedback welcome!

A note on privacy

Word to Markdown can be run locally or in your browser. In either case, the conversion happens locally, and no information ever leaves your machine.

Running Locally

Get set up

  1. Clone the repo
  2. Run npm install

Command line

Run w2m path/to/your/file.docx

Web server (static HTML)

npm run server:web

Web server (HTTP API)

You can also run Word to Markdown as an HTTP API server, so that other programs can send it conversion requests.

npm run server

The server exposes a POST /raw endpoint, which returns the converted Markdown.
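
As a rough illustration, a client for that endpoint might look like the sketch below. This README does not document the request format, so the multipart upload, the `doc` field name, and the plain-text response are all assumptions. `buildUploadForm` and `convertViaApi` are hypothetical helper names, and the base URL is left to the caller because the port is not stated here.

```typescript
// Assumption: the upload field name is not documented in this README.
// "doc" is a placeholder; check the server source for the real name.
const ASSUMED_FIELD_NAME = "doc";

/** Wrap the .docx contents in a multipart form (assumed request shape). */
function buildUploadForm(
  docx: Blob,
  fileName: string,
  fieldName: string = ASSUMED_FIELD_NAME,
): FormData {
  const form = new FormData();
  form.append(fieldName, docx, fileName);
  return form;
}

/** POST the document to <baseUrl>/raw and return the response body as text. */
async function convertViaApi(
  baseUrl: string,
  docx: Blob,
  fileName: string,
): Promise<string> {
  const endpoint = new URL("/raw", baseUrl).toString();
  const response = await fetch(endpoint, {
    method: "POST",
    body: buildUploadForm(docx, fileName),
  });
  if (!response.ok) {
    throw new Error(`Conversion failed with HTTP ${response.status}`);
  }
  // Assumption: the Markdown comes back as the raw response body.
  return response.text();
}
```

Splitting the form construction from the network call keeps the assumed request shape in one place, so it is easy to adjust once the real field name is confirmed.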

More context

See the README of the original Word to Markdown for the project's motivation.

The old way

The original Word to Markdown is 10 years old. The conversion process was as follows:

  1. Use LibreOffice to convert the Word document to HTML.
  2. Use a bunch of RegEx to clean up the HTML.
  3. Use Premailer to inline the CSS.
  4. Use Nokogiri to manipulate the HTML further.
  5. Use Reverse Markdown to convert the HTML to Markdown.
  6. Use a bunch of RegEx to clean up the Markdown.

Not only did this process require installing and shelling out to a huge binary (LibreOffice), but it was also very fragile, and key projects like Reverse Markdown are no longer maintained. I tried experimenting with Pandoc, but it had many of the same limitations.

The new way

  1. Use Mammoth.js to convert the Word document to HTML.
  2. Use Turndown to convert the HTML to Markdown.
  3. Use Markdownlint to clean up the Markdown.

All three of these projects are actively maintained and heavily used, and they allow us to convert the document faster and entirely in JavaScript. Heck, I think, theoretically, this could run in the browser for added privacy.
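
The three steps can be pictured as a small pipeline. The sketch below is not this project's actual code: `Stages`, `convertDocx`, and `stubStages` are hypothetical names, and the stages are injected so the example runs without any dependencies. In a real setup, `toHtml` would presumably wrap Mammoth's `convertToHtml` (which resolves to an object whose `value` is the HTML), `toMarkdown` would wrap a Turndown service's `turndown` method, and `tidy` would apply Markdownlint fixes.

```typescript
// Each stage consumes the previous stage's output.
interface Stages {
  toHtml: (docx: Uint8Array) => Promise<string>; // step 1: Word -> HTML
  toMarkdown: (html: string) => string;          // step 2: HTML -> Markdown
  tidy: (markdown: string) => string;            // step 3: clean up Markdown
}

/** Run the three stages in order and return the final Markdown. */
async function convertDocx(docx: Uint8Array, stages: Stages): Promise<string> {
  const html = await stages.toHtml(docx);
  const markdown = stages.toMarkdown(html);
  return stages.tidy(markdown);
}

// Stand-in stages for demonstration only; they do NOT parse real documents.
const stubStages: Stages = {
  // Pretend every document contains a single top-level heading.
  toHtml: async () => "<h1>Title</h1>",
  // Handle only <h1>, just enough to show the data flow.
  toMarkdown: (html) => html.replace(/<h1>(.*?)<\/h1>/, "# $1"),
  // Normalize to exactly one trailing newline.
  tidy: (markdown) => markdown.trimEnd() + "\n",
};

convertDocx(new Uint8Array(), stubStages).then((markdown) => {
  console.log(markdown);
});
```

Keeping each stage behind a plain function signature means any one library can be swapped or tested in isolation, which is the main contrast with the six-step process described under "The old way".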

It's still in beta, but so far, I've found the output to be better, with much less manual cleanup required. Notice something is off? Please open an issue.

One note: This project does not yet attempt to guess heading levels based on font size. It could, but it's not yet implemented.
