Add Link Parsing Functionality #93

Open
gromdimon opened this issue Jan 19, 2025 · 1 comment
Labels
feature New feature or request
@gromdimon
Contributor

Is your feature request related to a problem? Please describe.

Currently, Nevron lacks the ability to parse content from web links. Users cannot provide a link to fetch and process its content, limiting the framework's ability to gather information directly from online resources such as articles, blogs, and news sites.


Describe the solution you'd like

Add functionality to fetch and parse content from a given web link. This feature should enable the agent to extract meaningful information from web pages for use in workflows like memory updates, action planning, or contextual analysis.

Proposed Implementation Steps:

  1. Add a Link Parsing Utility:

    • Use the requests library to fetch the content of the provided link.
    • Use BeautifulSoup and Goose3 to parse and extract meaningful content, such as:
      • Article title
      • Main text/body
      • Meta description
      • Relevant keywords
    • Handle various web content structures, including standard HTML and minimal HTML layouts.
  2. Integration with Workflows:

    • Extend workflows (e.g., analyze_signal, analyze_news_workflow) to accept and process web links.
    • Store parsed content in the memory module for future reference.
  3. Error Handling:

    • Gracefully handle exceptions like:
      • Invalid or unreachable links.
      • Unsupported or malformed HTML.
      • Parsing errors due to complex or unexpected layouts.
    • Log detailed error messages for debugging.
  4. Configuration Options:

    • Allow users to configure link parsing settings in settings.py, such as:
      • User-Agent for HTTP requests.
      • Maximum allowed content size.
      • Timeout for HTTP requests.
  5. Unit Tests:

    • Write unit tests to validate link parsing functionality using mock web pages:
      • Valid web pages with standard HTML structures.
      • Edge cases, such as pages with minimal or malformed HTML.
      • Links that are unreachable or return HTTP errors.
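The configuration options in step 4 could be sketched as follows. This is a minimal, stdlib-only sketch; the setting and environment-variable names are illustrative assumptions, not Nevron's actual `settings.py` layout (which may use pydantic settings or another convention):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkParserSettings:
    # Hypothetical names; Nevron's settings.py may organize these differently.
    user_agent: str = os.environ.get("LINK_PARSER_USER_AGENT", "Nevron/1.0")
    # Maximum response body size to accept, in bytes.
    max_content_size: int = int(os.environ.get("LINK_PARSER_MAX_CONTENT_SIZE", str(2 * 1024 * 1024)))
    # HTTP request timeout, in seconds.
    request_timeout: float = float(os.environ.get("LINK_PARSER_TIMEOUT", "10.0"))


settings = LinkParserSettings()
```

Reading defaults from environment variables keeps the parser configurable per deployment without code changes.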

Describe alternatives you've considered

  • Use dedicated scraping tools/APIs: Services like Scrapy or Puppeteer could be used for more advanced scraping but may introduce overhead and complexity.
  • Rely only on BeautifulSoup: A simpler approach but limited in extracting structured content like article metadata and main body text.

Additional Context

  • Suggested utility function for link parsing:
    import requests
    from bs4 import BeautifulSoup
    from goose3 import Goose
    
    def parse_link_content(url: str, timeout: float = 10.0) -> dict:
        """
        Parse the content of a web link.
    
        Args:
            url (str): The URL to fetch and parse.
            timeout (float): Timeout in seconds for the HTTP request.
    
        Returns:
            dict: Parsed content including title, body text, and meta description.
    
        Raises:
            RuntimeError: If the link cannot be fetched or parsed.
        """
        try:
            response = requests.get(
                url, headers={"User-Agent": "Nevron/1.0"}, timeout=timeout
            )
            response.raise_for_status()
            # Goose extracts the title, meta description, and cleaned body text.
            goose = Goose()
            article = goose.extract(raw_html=response.text)
            # Fall back to the raw <title> tag when Goose finds no article title.
            title = article.title
            if not title:
                soup = BeautifulSoup(response.text, "html.parser")
                title = soup.title.string.strip() if soup.title and soup.title.string else ""
            return {
                "title": title,
                "meta_description": article.meta_description,
                "content": article.cleaned_text,
            }
        except requests.RequestException as e:
            raise RuntimeError(f"Failed to fetch link content: {e}") from e
        except Exception as e:
            raise RuntimeError(f"Failed to parse link content: {e}") from e
  • Example use case:
    • A user provides a link to a news article. The framework fetches and parses the content, storing the extracted text in memory for further analysis.
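The unit-test approach from step 5 could look like the sketch below. To keep it self-contained it uses a stdlib `urllib` fetch and a minimal `html.parser`-based title extractor as stand-ins for the requests/Goose utility above; all function names here are illustrative, not Nevron's actual API:

```python
import sys
import urllib.request
from html.parser import HTMLParser
from unittest import mock


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch raw HTML; a stdlib stand-in for the requests-based fetch above."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


class _TitleParser(HTMLParser):
    """Minimal <title> extractor so this sketch needs no third-party parser."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = _TitleParser()
    parser.feed(html)
    return parser.title.strip()


def test_parses_title_from_mocked_page() -> None:
    fake_html = "<html><head><title>Example Article</title></head><body>...</body></html>"
    # Patch the fetch so the test never touches the network.
    with mock.patch.object(sys.modules[__name__], "fetch_html", return_value=fake_html):
        html = fetch_html("https://example.com/article")
    assert extract_title(html) == "Example Article"
```

The same patching pattern covers the other cases in step 5: make the mocked fetch return malformed HTML, or raise an exception to simulate unreachable links and HTTP errors.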
@gromdimon gromdimon added the feature New feature or request label Jan 19, 2025
@gromdimon gromdimon added this to the v0.2.0 milestone Jan 19, 2025
@gromdimon gromdimon modified the milestones: v0.2.0, v0.3.0 Jan 24, 2025
@gromdimon gromdimon modified the milestones: v0.3.0, v0.2.1 Feb 8, 2025
@gromdimon
Contributor Author

There's a tool for that: JinaReader. We should simply integrate it.
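For reference, a minimal sketch of what a Jina Reader integration might look like: the public Reader service returns LLM-friendly plain text when the target URL is prefixed with `https://r.jina.ai/`, optionally with an API key in the `Authorization` header. This is an assumption about how Nevron would wire it in, not a finished integration:

```python
import urllib.request
from typing import Optional

JINA_READER_PREFIX = "https://r.jina.ai/"  # public Reader endpoint


def reader_url(url: str) -> str:
    """Jina Reader is invoked by prefixing the target URL with the Reader endpoint."""
    return JINA_READER_PREFIX + url


def read_link(url: str, api_key: Optional[str] = None, timeout: float = 15.0) -> str:
    """Fetch LLM-ready plain text for `url` via Jina Reader (sketch, untested against quotas)."""
    request = urllib.request.Request(reader_url(url))
    if api_key:  # optional; an API key raises Reader's rate limits
        request.add_header("Authorization", f"Bearer {api_key}")
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

This would replace the requests/BeautifulSoup/Goose pipeline proposed above with a single hosted extraction step.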
