fix(slb-495): add readme documentation for converter

AmazeeLabs · Dec 30, 2024 · 4d96d48 · 4d96d48
1 parent c524171
commit 4d96d48
Showing 1 changed file with 92 additions and 0 deletions.
diff --git a/apps/converter/README.md b/apps/converter/README.md
@@ -0,0 +1,92 @@
+# Silverback Converter
+
+The converter is a Node.js application designed to convert documents from various formats (DocX, PDF, and HTML) into Markdown.
+
+This tool is particularly useful for developers and content creators who need to transform documents into a format suitable for further processing, analysis, or integration with other systems.
+
+## Features
+
+- **DocX to Markdown**: Convert Word documents (`.docx`) to Markdown.
+- **PDF to Markdown**: Convert PDF files to Markdown.
+- **HTML to Markdown**: Extract main content from web pages and convert it to Markdown.
+- **Jina AI Integration**: Fetch and convert content using the Jina AI API. (ATTENTION: EXPERIMENTAL, DO NOT USE THIS)
+
+## Setup and Installation
+
+### Prerequisites
+
+- Node.js (version 18 or higher)
+- npm (Node Package Manager)
+
+### Installation
+
+1. **Install dependencies**:
+   ```bash
+   npm i
+   ```
+
+3. **Set up environment variables** (optional):
+   - Create a `.env` file in the root directory.
+   - Add your Jina AI API key if you plan to use the Jina AI integration:
+     ```env
+     JINA_AI_API_KEY=your_jina_ai_api_key
+     ```
+
+### Running the Application
+
+To start the application, run the following command:
+```bash
+npm start
+```
+
+The server will start on `http://localhost:3000`.
+
+## Usage
+
+### Endpoints
+
+- **Convert DocX to Markdown**:
+  ```
+  GET /convert?path=/path/to/your/document.docx
+  ```
+
+- **Convert PDF to Markdown**:
+  ```
+  GET /pdf-convert?path=/path/to/your/document.pdf
+  ```
+
+- **Convert HTML to Markdown**:
+  ```
+  GET /html-convert?path=https://example.com
+  ```
+
+- **Fetch and Convert Content with Jina AI**:
+  ```
+  GET /jina-convert?path=https://example.com
+  ```
+
+### Example
+
+To convert a Word document to Markdown, make a GET request to:
+```
+http://localhost:3000/convert?path=/path/to/your/document.docx
+```
+
+The response will include the converted Markdown content, the output directory, and any warnings generated during the conversion process.
+
+## Configuration
+
+- **Output Directory**: By default, converted files are saved in a directory named after the input file's hash. You can customize the output directory by modifying the `outputDir` variable in the respective conversion scripts.
+- **Image Handling**: Images extracted from documents are saved in an `images` subdirectory within the output directory.
+
+## Dependencies
+
+The application relies on several npm packages, including:
+
+- `mammoth` for DocX conversion
+- `@opendocsg/pdf2md` for PDF conversion
+- `@extractus/article-extractor` for HTML content extraction
+- `turndown` for HTML to Markdown conversion
+- `express` for the server
+
+For a complete list of dependencies, refer to the `package.json` file.