Add Link Parsing Functionality #93

Open
gromdimon opened this issue Jan 19, 2025 · 1 comment
Labels
feature New feature or request
@gromdimon
Contributor

Is your feature request related to a problem? Please describe.

Currently, Nevron lacks the ability to parse content from web links. Users cannot provide a link to fetch and process its content, limiting the framework's ability to gather information directly from online resources such as articles, blogs, and news sites.


Describe the solution you'd like

Add functionality to fetch and parse content from a given web link. This feature should enable the agent to extract meaningful information from web pages for use in workflows like memory updates, action planning, or contextual analysis.

Proposed Implementation Steps:

  1. Add a Link Parsing Utility:

    • Use the requests library to fetch the content of the provided link.
    • Use BeautifulSoup and Goose3 to parse and extract meaningful content, such as:
      • Article title
      • Main text/body
      • Meta description
      • Relevant keywords
    • Handle various web content structures, including standard HTML and minimal HTML layouts.
  2. Integration with Workflows:

    • Extend workflows (e.g., analyze_signal, analyze_news_workflow) to accept and process web links.
    • Store parsed content in the memory module for future reference.
  3. Error Handling:

    • Gracefully handle exceptions like:
      • Invalid or unreachable links.
      • Unsupported or malformed HTML.
      • Parsing errors due to complex or unexpected layouts.
    • Log detailed error messages for debugging.
  4. Configuration Options:

    • Allow users to configure link parsing settings in settings.py, such as:
      • User-Agent for HTTP requests.
      • Maximum allowed content size.
      • Timeout for HTTP requests.
  5. Unit Tests:

    • Write unit tests to validate link parsing functionality using mock web pages:
      • Valid web pages with standard HTML structures.
      • Edge cases, such as pages with minimal or malformed HTML.
      • Links that are unreachable or return HTTP errors.
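The configuration options in step 4 could be sketched as follows. This is a minimal, stdlib-only sketch; the setting and environment-variable names are illustrative assumptions, not Nevron's actual `settings.py` layout (which may use pydantic settings or another convention):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkParserSettings:
    # Hypothetical names; Nevron's settings.py may organize these differently.
    user_agent: str = os.environ.get("LINK_PARSER_USER_AGENT", "Nevron/1.0")
    # Maximum response body size to accept, in bytes.
    max_content_size: int = int(os.environ.get("LINK_PARSER_MAX_CONTENT_SIZE", str(2 * 1024 * 1024)))
    # HTTP request timeout, in seconds.
    request_timeout: float = float(os.environ.get("LINK_PARSER_TIMEOUT", "10.0"))


settings = LinkParserSettings()
```

Reading defaults from environment variables keeps the parser configurable per deployment without code changes.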

Describe alternatives you've considered

  • Use dedicated scraping tools/APIs: Services like Scrapy or Puppeteer could be used for more advanced scraping but may introduce overhead and complexity.
  • Rely only on BeautifulSoup: A simpler approach but limited in extracting structured content like article metadata and main body text.

Additional Context

  • Suggested utility function for link parsing:
    import requests
    from bs4 import BeautifulSoup
    from goose3 import Goose
    
    def parse_link_content(url: str, timeout: float = 10.0) -> dict:
        """
        Parse the content of a web link.
    
        Args:
            url (str): The URL to fetch and parse.
            timeout (float): Timeout in seconds for the HTTP request.
    
        Returns:
            dict: Parsed content including title, body text, and meta description.
    
        Raises:
            RuntimeError: If the link cannot be fetched or parsed.
        """
        try:
            response = requests.get(
                url, headers={"User-Agent": "Nevron/1.0"}, timeout=timeout
            )
            response.raise_for_status()
            # Goose extracts the title, meta description, and cleaned body text.
            goose = Goose()
            article = goose.extract(raw_html=response.text)
            # Fall back to the raw <title> tag when Goose finds no article title.
            title = article.title
            if not title:
                soup = BeautifulSoup(response.text, "html.parser")
                title = soup.title.string.strip() if soup.title and soup.title.string else ""
            return {
                "title": title,
                "meta_description": article.meta_description,
                "content": article.cleaned_text,
            }
        except requests.RequestException as e:
            raise RuntimeError(f"Failed to fetch link content: {e}") from e
        except Exception as e:
            raise RuntimeError(f"Failed to parse link content: {e}") from e
  • Example use case:
    • A user provides a link to a news article. The framework fetches and parses the content, storing the extracted text in memory for further analysis.
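The unit-test approach from step 5 could look like the sketch below. To keep it self-contained it uses a stdlib `urllib` fetch and a minimal `html.parser`-based title extractor as stand-ins for the requests/Goose utility above; all function names here are illustrative, not Nevron's actual API:

```python
import sys
import urllib.request
from html.parser import HTMLParser
from unittest import mock


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch raw HTML; a stdlib stand-in for the requests-based fetch above."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


class _TitleParser(HTMLParser):
    """Minimal <title> extractor so this sketch needs no third-party parser."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


def test_parses_title_from_mocked_page() -> None:
    fake_html = "<html><head><title>Example Article</title></head><body>...</body></html>"
    # Patch the fetch so the test never touches the network.
    with mock.patch.object(sys.modules[__name__], "fetch_html", return_value=fake_html):
        html = fetch_html("https://example.com/article")
    assert extract_title(html) == "Example Article"
```

The same patching pattern covers the other cases in step 5: make the mocked fetch return malformed HTML, or raise an exception to simulate unreachable links and HTTP errors.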
@gromdimon gromdimon added the feature New feature or request label Jan 19, 2025
@gromdimon gromdimon added this to the v0.2.0 milestone Jan 19, 2025
@gromdimon gromdimon modified the milestones: v0.2.0, v0.3.0 Jan 24, 2025
@gromdimon gromdimon modified the milestones: v0.3.0, v0.2.1 Feb 8, 2025
@gromdimon
Contributor Author

There's a tool for that: JinaReader. We should simply integrate it.
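For reference, a minimal sketch of what a Jina Reader integration might look like: the public Reader service returns LLM-friendly plain text when the target URL is prefixed with `https://r.jina.ai/`, optionally with an API key in the `Authorization` header. This is an assumption about how Nevron would wire it in, not a finished integration:

```python
import urllib.request
from typing import Optional

JINA_READER_PREFIX = "https://r.jina.ai/"  # public Reader endpoint


def reader_url(url: str) -> str:
    """Jina Reader is invoked by prefixing the target URL with the Reader endpoint."""
    return JINA_READER_PREFIX + url


def read_link(url: str, api_key: Optional[str] = None, timeout: float = 15.0) -> str:
    """Fetch LLM-ready plain text for `url` via Jina Reader (sketch, untested against quotas)."""
    request = urllib.request.Request(reader_url(url))
    if api_key:  # optional; an API key raises Reader's rate limits
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

This would replace the requests/BeautifulSoup/Goose pipeline proposed above with a single hosted extraction step.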
