
Add support for markdown file input and update documentation #79

Closed

Conversation

tkhongsap

This Pull Request introduces the following changes:

  1. Markdown File Support:

    • Added functionality to handle markdown file input for the content extraction process.
    • Updated content_extractor.py and markdown_extractor.py to support parsing of markdown files.
  2. Tests:

    • Added unit tests in tests/test_markdown_extractor.py to ensure that markdown extraction works correctly.
  3. Documentation Updates:

    • Updated the usage documentation to include information about markdown file input.
    • Updated README.md and usage guides to reflect the new functionality.
  4. Miscellaneous:

    • Updated pyproject.toml and requirements.txt to include any new dependencies related to markdown support.

Please review the changes, and let me know if any updates or adjustments are required.

@souzatharsis
Owner

This looks great, and it's nicely organized and documented.

However, the automated tests did not pass. Could you please review?

@tkhongsap
Author

Okay, I'll check again. I ran the tests on the markdown extractor, and they passed:

```
tests\test_markdown_extractor.py ... [100%]
============================================================= 3 passed in 0.24s =============================================================
```

@souzatharsis
Owner

We need to make sure all tests in tests/*.py pass to avoid regressions.

@souzatharsis
Owner

Please see pytest logs at https://github.com/souzatharsis/podcastfy/actions/runs/11384115244/job/31671236157?pr=79

@tkhongsap
Author

Okay, thank you for pointing that out. I'll take a look.

@@ -71,14 +71,20 @@ def process_content(
         qa_content = file.read()
     else:
         content_generator = ContentGenerator(
-            api_key=config.GEMINI_API_KEY, conversation_config=conv_config.to_dict()
+            api_key=config.OPENAI_API_KEY, conversation_config=conv_config.to_dict()
Owner

This is the issue causing the tests to fail. I was wondering why you replaced Gemini's API key with OpenAI's, since ContentGenerator runs on Gemini by default?

)

         if urls:
             logger.info(f"Processing {len(urls)} links")
             content_extractor = ContentExtractor()
-            # Extract content from links
-            contents = [content_extractor.extract_content(link) for link in urls]
+            # Extract content from links or file paths
Owner

It's better to move the logic below (testing whether the source is a link or a file path) inside extract_content and keep this as a list comprehension.
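A minimal sketch of the dispatch the reviewer suggests, assuming a simplified ContentExtractor; everything here other than the extract_content and is_url names is hypothetical, not the project's actual code:

```python
import os
from urllib.parse import urlparse


class ContentExtractor:
    def is_url(self, source: str) -> bool:
        # A string counts as a URL when it parses with an http(s) scheme and a host.
        parsed = urlparse(source)
        return parsed.scheme in ("http", "https") and bool(parsed.netloc)

    def extract_content(self, source: str) -> str:
        # Dispatch on source type here, so call sites can stay a list comprehension.
        if self.is_url(source):
            return self._extract_from_url(source)  # hypothetical web-extraction helper
        if source.endswith(".md") and os.path.isfile(source):
            with open(source, encoding="utf-8") as f:
                return f.read()
        raise ValueError(f"Unsupported source type: {source}")

    def _extract_from_url(self, url: str) -> str:
        raise NotImplementedError  # web extraction elided in this sketch
```

With the type test inside extract_content, the caller keeps its original shape: `contents = [content_extractor.extract_content(s) for s in sources]`.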

@@ -3,7 +3,7 @@ output_directories:
   audio: "./data/audio"

 content_generator:
-  gemini_model: "gemini-1.5-pro-latest"
+  openai_model: "gpt-4o-mini"
Owner

Let's keep Gemini as the default.

gpt-4o-mini has not been tested extensively. We use Gemini 1.5 Pro for its massive context window of 2M tokens, which allows processing a list of input sources without trouble.

We should consider generalizing to other LLM backends in a separate PR; we should use LangChain for that purpose instead of implementing one backend at a time.

Author

Thank you for pointing that out. I was testing on my local machine and didn't switch it back. I changed from Gemini to OpenAI to extract Thai-language content, since OpenAI seemed to provide better translations in terms of context. It's not an issue, though; I'll switch it back.

@@ -18,6 +18,7 @@
 from podcastfy.utils.config import load_config
 import logging
 from langchain.prompts import HumanMessagePromptTemplate
+from langchain_openai import ChatOpenAI
Owner

see above comment

@@ -48,10 +49,10 @@ def __init__(
         if is_local:
             self.llm = Llamafile()
         else:
-            self.llm = ChatGoogleGenerativeAI(
-                model=model_name,
+            self.llm = ChatOpenAI(
Owner

see above comment

@@ -63,10 +64,10 @@ def __init__(
         Initialize the ContentGenerator.

         Args:
-            api_key (str): API key for Google's Generative AI.
+            api_key (str): API key for OpenAI.
Owner

see above comment

             conversation_config (Optional[Dict[str, Any]]): Custom conversation configuration.
         """
-        os.environ["GOOGLE_API_KEY"] = api_key
+        os.environ["OPENAI_API_KEY"] = api_key
Owner

same here

@@ -166,7 +167,7 @@ def generate_qa_content(
             ),
             model_name=(
                 self.content_generator_config.get(
-                    "gemini_model", "gemini-1.5-pro-latest"
+                    "openai_model", "gpt-4o-mini"
Owner

and here

@@ -210,9 +211,9 @@ def main(seed: int = 42, is_local: bool = False) -> None:
     """
     try:
         config = load_config()
-        api_key = config.GEMINI_API_KEY if not is_local else ""
+        api_key = config.OPENAI_API_KEY if not is_local else ""
Owner

and here

         if not is_local and not api_key:
-            raise ValueError("GEMINI_API_KEY not found in configuration")
+            raise ValueError("OPENAI_API_KEY not found in configuration")
Owner

and here

@@ -62,15 +66,17 @@ def extract_content(self, source: str) -> str:
             ValueError: If the source type is unsupported.
         """
         try:
-            if self.is_url(source):
Owner

It's better to have a dedicated method for that, such as self.is_url. I'm curious about the motivation for removing it.
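If the dedicated check comes back, a plausible sketch of is_url might look like the following. This is an assumption about the original implementation, not the project's actual code:

```python
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """Return True when the source string looks like a web URL."""
    parsed = urlparse(source)
    # Windows paths such as "C:\notes.md" parse with a one-letter scheme,
    # so require a real http(s) scheme plus a network location.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Keeping this as its own method also makes the URL-vs-file decision independently unit-testable.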

@@ -11,7 +11,7 @@ dialogue_structure:
   - "Conclusion"
 podcast_name: "PODCASTFY"
 podcast_tagline: "Your Personal Generative AI Podcast"
-output_language: "English"
+output_language: "Thai"
Owner

Please keep English as the default.




+langchain-openai = "^0.2.2"
Owner

Please drop OpenAI for now.

@@ -67,11 +67,13 @@ langchain-community==0.3.2 ; python_version >= "3.11" and python_version < "4.0"
 langchain-core==0.3.11 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-genai==2.0.1 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-vertexai==2.0.5 ; python_version >= "3.11" and python_version < "4.0"
+langchain-openai==0.2.2 ; python_version >= "3.11" and python_version < "4.0"
Owner

Please drop OpenAI for now.

Owner

Is this file required?

Owner

Is this required?

+def run_command(command):
+    subprocess.run(command, shell=True, check=True)
+
+def main():
Owner

You can simply run `make gen-doc` in bash instead; there is a Makefile in the root that does the below for you.

```
API Key Requirements:
- GEMINI_API_KEY: Required for transcript generation if not using a [local llm](local_llm.md). (get your [free API key](aistudio.google.com/app/apikey))
- OPENAI_API_KEY or ELEVENLABS_API_KEY: Required for audio generation if not using Microsoft Edge TTS `tts_model=edge`.
- OPENAI_API_KEY: Mandatory for all operations. (get your API key from [OpenAI](https://platform.openai.com/account/api-keys))
```

Owner

Please drop OpenAI for now.

@souzatharsis
Owner

Thanks for your PR. I've added comments across several files.

The core observation is that you are replacing the ContentGenerator LLM backend, swapping Gemini for OpenAI. We should handle generalizing the backend in a separate PR; this one should stay focused on markdown support.

Thanks!

@tkhongsap tkhongsap closed this by deleting the head repository Oct 21, 2024
@tkhongsap
Author

Understood and will do.
