Add support for markdown file input and update documentation #79
Conversation
This looks great and is nicely organized and documented. However, the automated tests did not pass. Could you please review?

Okay, will check again. I ran the test on the markdown extractor and it passed: `tests\test_markdown_extractor.py ... [100%]`

We need to make sure all tests in tests/*.py pass to avoid regressions.

Please see the pytest logs at https://github.com/souzatharsis/podcastfy/actions/runs/11384115244/job/31671236157?pr=79

Okay, thank you for pointing that out. Will take a look.
```diff
@@ -71,14 +71,20 @@ def process_content(
         qa_content = file.read()
     else:
         content_generator = ContentGenerator(
-            api_key=config.GEMINI_API_KEY, conversation_config=conv_config.to_dict()
+            api_key=config.OPENAI_API_KEY, conversation_config=conv_config.to_dict()
```
This is the issue causing the tests to fail. Why did you replace Gemini's API key with OpenAI's, given that ContentGenerator runs on Gemini by default?
```diff
         )

     if urls:
         logger.info(f"Processing {len(urls)} links")
         content_extractor = ContentExtractor()
-        # Extract content from links
-        contents = [content_extractor.extract_content(link) for link in urls]
+        # Extract content from links or file paths
```
It's better to move this logic (testing whether the source is a link or a file path) inside extract_content and keep this call site as a list comprehension.
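A minimal sketch of that suggestion (the helper method names and the simplistic URL check below are hypothetical, not the project's actual API): extract_content dispatches per source, so the caller stays a one-line list comprehension.

```python
import os

class ContentExtractor:
    def extract_content(self, source: str) -> str:
        """Dispatch on the source type so callers can stay a list comprehension."""
        if source.startswith(("http://", "https://")):  # simplistic URL check for illustration
            return self._extract_from_url(source)
        if os.path.isfile(source) and source.lower().endswith(".md"):
            return self._extract_from_markdown(source)
        raise ValueError(f"Unsupported source type: {source}")

    def _extract_from_url(self, url: str) -> str:
        # Placeholder: the real project fetches and cleans web content here.
        return f"<web content from {url}>"

    def _extract_from_markdown(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            return f.read()

# The call site keeps its original shape:
# contents = [content_extractor.extract_content(source) for source in sources]
```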
```diff
@@ -3,7 +3,7 @@ output_directories:
   audio: "./data/audio"

 content_generator:
-  gemini_model: "gemini-1.5-pro-latest"
+  openai_model: "gpt-4o-mini"
```
Let's keep Gemini as the default.
gpt-4o-mini has not been tested extensively.
We use Gemini 1.5 Pro due to its massive context window of 2M tokens, which allows processing a list of input sources without trouble.
We should consider generalizing to other LLM backends in a separate PR; we should use LangChain for that purpose instead of implementing one backend at a time.
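As a rough illustration of that future generalization (function name, defaults, and structure here are assumptions, not the project's design), a LangChain-based factory could select the backend while keeping Gemini as the default:

```python
def create_llm(model_name: str = "gemini-1.5-pro-latest",
               backend: str = "gemini",
               temperature: float = 0.7):
    """Pick a chat model via LangChain integrations; Gemini stays the default."""
    if backend == "gemini":
        # Requires the langchain-google-genai package and GOOGLE_API_KEY.
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
    if backend == "openai":
        # Requires the langchain-openai package and OPENAI_API_KEY.
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model_name, temperature=temperature)
    raise ValueError(f"Unsupported LLM backend: {backend}")
```

Because the imports live inside each branch, only the dependency for the chosen backend needs to be installed.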
Thank you for pointing that out. I was testing on my local machine and didn't switch it back. The reason I changed from Gemini to OpenAI was to generate output in Thai, as OpenAI seemed to provide better translations in terms of context. However, it's not an issue; I'll switch it back.
```diff
@@ -18,6 +18,7 @@
 from podcastfy.utils.config import load_config
 import logging
 from langchain.prompts import HumanMessagePromptTemplate
+from langchain_openai import ChatOpenAI
```
see above comment
```diff
@@ -48,10 +49,10 @@ def __init__(
         if is_local:
             self.llm = Llamafile()
         else:
-            self.llm = ChatGoogleGenerativeAI(
-                model=model_name,
+            self.llm = ChatOpenAI(
```
see above comment
```diff
@@ -63,10 +64,10 @@ def __init__(
         Initialize the ContentGenerator.

         Args:
-            api_key (str): API key for Google's Generative AI.
+            api_key (str): API key for OpenAI.
```
see above comment
```diff
             conversation_config (Optional[Dict[str, Any]]): Custom conversation configuration.
         """
-        os.environ["GOOGLE_API_KEY"] = api_key
+        os.environ["OPENAI_API_KEY"] = api_key
```
same here
```diff
@@ -166,7 +167,7 @@ def generate_qa_content(
         ),
         model_name=(
             self.content_generator_config.get(
-                "gemini_model", "gemini-1.5-pro-latest"
+                "openai_model", "gpt-4o-mini"
```
and here
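The fallback the reviewer asks to restore is the standard `dict.get` default pattern (the key and default value are taken from the diff above):

```python
# When the config omits the key, .get falls back to the Gemini default
# the reviewer wants kept.
content_generator_config = {}  # no "gemini_model" key configured
model_name = content_generator_config.get("gemini_model", "gemini-1.5-pro-latest")
print(model_name)  # falls back to the default
```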
```diff
@@ -210,9 +211,9 @@ def main(seed: int = 42, is_local: bool = False) -> None:
     """
     try:
         config = load_config()
-        api_key = config.GEMINI_API_KEY if not is_local else ""
+        api_key = config.OPENAI_API_KEY if not is_local else ""
```
and here
```diff
         if not is_local and not api_key:
-            raise ValueError("GEMINI_API_KEY not found in configuration")
+            raise ValueError("OPENAI_API_KEY not found in configuration")
```
and here
```diff
@@ -62,15 +66,17 @@ def extract_content(self, source: str) -> str:
             ValueError: If the source type is unsupported.
         """
         try:
-            if self.is_url(source):
```
It's better to keep a dedicated method for this check (self.is_url); I'm curious about the motivation for removing it.
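A dedicated helper keeps the check testable on its own. A minimal sketch (the project's actual is_url may use different heuristics):

```python
from urllib.parse import urlparse

def is_url(source: str) -> bool:
    """Treat a source as a URL only if it has an http(s) scheme and a host."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```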
```diff
@@ -11,7 +11,7 @@ dialogue_structure:
   - "Conclusion"
 podcast_name: "PODCASTFY"
 podcast_tagline: "Your Personal Generative AI Podcast"
-output_language: "English"
+output_language: "Thai"
```
please keep English as default
```diff
+langchain-openai = "^0.2.2"
```
please drop openai for now
```diff
@@ -67,11 +67,13 @@ langchain-community==0.3.2 ; python_version >= "3.11" and python_version < "4.0"
 langchain-core==0.3.11 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-genai==2.0.1 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-vertexai==2.0.5 ; python_version >= "3.11" and python_version < "4.0"
+langchain-openai==0.2.2 ; python_version >= "3.11" and python_version < "4.0"
```
please drop openai for now
is this file required?
is this required?
```python
def run_command(command):
    subprocess.run(command, shell=True, check=True)

def main():
```
You can simply run `make gen-doc` in bash instead; there is a Makefile in the repo root that does the below for you.
```
API Key Requirements:
- GEMINI_API_KEY: Required for transcript generation if not using a [local llm](local_llm.md). (get your [free API key](aistudio.google.com/app/apikey))
- OPENAI_API_KEY or ELEVENLABS_API_KEY: Required for audio generation if not using Microsoft Edge TTS `tts_model=edge`.
- OPENAI_API_KEY: Mandatory for all operations. (get your API key from [OpenAI](https://platform.openai.com/account/api-keys))
```
please drop openai for now
Thanks for your PR. I've added comments across several files.
The core observation is that you are replacing the ContentGenerator LLM backend, Gemini, with OpenAI. We should deal with generalizing the backend in a separate PR; this one should stay focused on markdown support.
Thanks!
Understood, and will do.
This Pull Request introduces the following changes:

Markdown File Support: Updated `content_extractor.py` and `markdown_extractor.py` to support parsing of markdown files.

Tests: Added `tests/test_markdown_extractor.py` to ensure that markdown extraction works correctly.

Documentation Updates: Updated `README.md` and usage guides to reflect the new functionality.

Miscellaneous: Updated `pyproject.toml` and `requirements.txt` to include any new dependencies related to markdown support.

Please review the changes, and let me know if any updates or adjustments are required.