Add support for markdown file input and update documentation #79
Conversation
This looks great and is nicely organized and documented. However, the automated tests did not pass. Could you please review?

Okay, will check again. I ran the test on the markdown extractor and it passed: `tests\test_markdown_extractor.py ... [100%]`

We need to make sure all tests in tests/*.py pass to avoid regressions.

Please see the pytest logs at https://github.com/souzatharsis/podcastfy/actions/runs/11384115244/job/31671236157?pr=79

Okay, thank you for pointing that out. Will take a look.
```diff
@@ -71,14 +71,20 @@ def process_content(
         qa_content = file.read()
     else:
         content_generator = ContentGenerator(
-            api_key=config.GEMINI_API_KEY, conversation_config=conv_config.to_dict()
+            api_key=config.OPENAI_API_KEY, conversation_config=conv_config.to_dict()
```
This is the issue causing the tests to fail. Why did you replace Gemini's API key with OpenAI's, given that ContentGenerator runs on Gemini by default?
```diff
         )

     if urls:
         logger.info(f"Processing {len(urls)} links")
         content_extractor = ContentExtractor()
-        # Extract content from links
-        contents = [content_extractor.extract_content(link) for link in urls]
+        # Extract content from links or file paths
```
It's better to move this logic (testing whether the source is a link or a file path) inside extract_content and keep this call site as a list comprehension.
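A minimal sketch of that suggestion (the helper method names and the simplistic URL check below are hypothetical, not the project's actual API): extract_content dispatches per source, so the caller stays a one-line list comprehension.

```python
import os

class ContentExtractor:
    def extract_content(self, source: str) -> str:
        """Dispatch on the source type so callers can stay a list comprehension."""
        if source.startswith(("http://", "https://")):  # simplistic URL check for illustration
            return self._extract_from_url(source)
        if os.path.isfile(source) and source.lower().endswith(".md"):
            return self._extract_from_markdown(source)
        raise ValueError(f"Unsupported source type: {source}")

    def _extract_from_url(self, url: str) -> str:
        # Placeholder: the real project fetches and cleans web content here.
        return f"<web content from {url}>"

    def _extract_from_markdown(self, path: str) -> str:
        with open(path, encoding="utf-8") as f:
            return f.read()

# The call site keeps its original shape:
# contents = [content_extractor.extract_content(source) for source in sources]
```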
```diff
@@ -3,7 +3,7 @@ output_directories:
   audio: "./data/audio"

 content_generator:
-  gemini_model: "gemini-1.5-pro-latest"
+  openai_model: "gpt-4o-mini"
```
Let's keep Gemini as the default.
gpt-4o-mini has not been tested extensively.
We use Gemini 1.5 Pro due to its massive context window of 2M tokens, which allows processing a list of input sources without trouble.
We should consider generalizing to other LLM backends in a separate PR; we should use LangChain for that purpose instead of implementing one backend at a time.
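As a rough illustration of that future generalization (function name, defaults, and structure here are assumptions, not the project's design), a LangChain-based factory could select the backend while keeping Gemini as the default:

```python
def create_llm(model_name: str = "gemini-1.5-pro-latest",
               backend: str = "gemini",
               temperature: float = 0.7):
    """Pick a chat model via LangChain integrations; Gemini stays the default."""
    if backend == "gemini":
        # Requires the langchain-google-genai package and GOOGLE_API_KEY.
        from langchain_google_genai import ChatGoogleGenerativeAI
        return ChatGoogleGenerativeAI(model=model_name, temperature=temperature)
    if backend == "openai":
        # Requires the langchain-openai package and OPENAI_API_KEY.
        from langchain_openai import ChatOpenAI
        return ChatOpenAI(model=model_name, temperature=temperature)
    raise ValueError(f"Unsupported LLM backend: {backend}")
```

Because the imports live inside each branch, only the dependency for the chosen backend needs to be installed.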
Thank you for pointing that out. I was testing on my local machine and didn't switch it back. The reason I changed from Gemini to OpenAI was to generate output in Thai, as OpenAI seemed to provide better translations in terms of context. However, it's not an issue; I'll switch it back.
```diff
@@ -18,6 +18,7 @@
 from podcastfy.utils.config import load_config
 import logging
 from langchain.prompts import HumanMessagePromptTemplate
+from langchain_openai import ChatOpenAI
```
see above comment
```diff
@@ -48,10 +49,10 @@ def __init__(
         if is_local:
             self.llm = Llamafile()
         else:
-            self.llm = ChatGoogleGenerativeAI(
-                model=model_name,
+            self.llm = ChatOpenAI(
```
see above comment
```diff
@@ -63,10 +64,10 @@ def __init__(
         Initialize the ContentGenerator.

         Args:
-            api_key (str): API key for Google's Generative AI.
+            api_key (str): API key for OpenAI.
```
see above comment
```diff
             conversation_config (Optional[Dict[str, Any]]): Custom conversation configuration.
         """
-        os.environ["GOOGLE_API_KEY"] = api_key
+        os.environ["OPENAI_API_KEY"] = api_key
```
same here
```diff
@@ -166,7 +167,7 @@ def generate_qa_content(
         ),
         model_name=(
             self.content_generator_config.get(
-                "gemini_model", "gemini-1.5-pro-latest"
+                "openai_model", "gpt-4o-mini"
```
and here
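The fallback the reviewer asks to restore is the standard `dict.get` default pattern (the key and default value are taken from the diff above):

```python
# When the config omits the key, .get falls back to the Gemini default
# the reviewer wants kept.
content_generator_config = {}  # no "gemini_model" key configured
model_name = content_generator_config.get("gemini_model", "gemini-1.5-pro-latest")
print(model_name)  # falls back to the default
```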
```diff
@@ -210,9 +211,9 @@ def main(seed: int = 42, is_local: bool = False) -> None:
     """
     try:
         config = load_config()
-        api_key = config.GEMINI_API_KEY if not is_local else ""
+        api_key = config.OPENAI_API_KEY if not is_local else ""
```
and here
```diff
         if not is_local and not api_key:
-            raise ValueError("GEMINI_API_KEY not found in configuration")
+            raise ValueError("OPENAI_API_KEY not found in configuration")
```
and here
```diff
@@ -62,15 +66,17 @@ def extract_content(self, source: str) -> str:
             ValueError: If the source type is unsupported.
         """
         try:
-            if self.is_url(source):
```
It's better to keep a dedicated method for this check (self.is_url); I'm curious about the motivation for removing it.
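A dedicated helper keeps the check testable on its own. A minimal sketch (the project's actual is_url may use different heuristics):

```python
from urllib.parse import urlparse

def is_url(source: str) -> bool:
    """Treat a source as a URL only if it has an http(s) scheme and a host."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```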
```diff
@@ -11,7 +11,7 @@ dialogue_structure:
   - "Conclusion"
 podcast_name: "PODCASTFY"
 podcast_tagline: "Your Personal Generative AI Podcast"
-output_language: "English"
+output_language: "Thai"
```
please keep English as default
```diff
+langchain-openai = "^0.2.2"
```
please drop openai for now
```diff
@@ -67,11 +67,13 @@ langchain-community==0.3.2 ; python_version >= "3.11" and python_version < "4.0"
 langchain-core==0.3.11 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-genai==2.0.1 ; python_version >= "3.11" and python_version < "4.0"
 langchain-google-vertexai==2.0.5 ; python_version >= "3.11" and python_version < "4.0"
+langchain-openai==0.2.2 ; python_version >= "3.11" and python_version < "4.0"
```
please drop openai for now
is this file required?
is this required?
```python
def run_command(command):
    subprocess.run(command, shell=True, check=True)

def main():
```
You can simply run `make gen-doc` in bash instead; there is a Makefile in the repo root that does the below for you.
```
API Key Requirements:
- GEMINI_API_KEY: Required for transcript generation if not using a [local llm](local_llm.md). (get your [free API key](aistudio.google.com/app/apikey))
- OPENAI_API_KEY or ELEVENLABS_API_KEY: Required for audio generation if not using Microsoft Edge TTS `tts_model=edge`.
- OPENAI_API_KEY: Mandatory for all operations. (get your API key from [OpenAI](https://platform.openai.com/account/api-keys))
```
please drop openai for now
Thanks for your PR. I've added comments across several files.
The core observation is that you are replacing the ContentGenerator LLM backend, Gemini, with OpenAI. We should deal with generalizing the backend in a separate PR; this one should stay focused on markdown support.
Thanks!
Understood, and will do.
This Pull Request introduces the following changes:

Markdown File Support: Updated `content_extractor.py` and `markdown_extractor.py` to support parsing of markdown files.

Tests: Added `tests/test_markdown_extractor.py` to ensure that markdown extraction works correctly.

Documentation Updates: Updated `README.md` and usage guides to reflect the new functionality.

Miscellaneous: Updated `pyproject.toml` and `requirements.txt` to include any new dependencies related to markdown support.

Please review the changes, and let me know if any updates or adjustments are required.