feat: added markdownify and localscraper tools #2

Merged · 3 commits · Dec 5, 2024
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,8 +1,11 @@
## 1.0.0 (2024-12-05)

## 1.0.0-beta.1 (2024-12-05)


### Features

* added markdownify and localscraper tools ([03e49dc](https://github.com/ScrapeGraphAI/langchain-scrapegraph/commit/03e49dce84ef5a1b7a59b6dfd046eb563c14d283))
* tools integration ([dc7e9a8](https://github.com/ScrapeGraphAI/langchain-scrapegraph/commit/dc7e9a8fbf4e88bb79e11a9253428b2f61fa1293))


141 changes: 140 additions & 1 deletion README.md
@@ -1 +1,140 @@
# langchain-scrapegraph
# 🕷️🦜 langchain-scrapegraph

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Support](https://img.shields.io/pypi/pyversions/langchain-scrapegraph.svg)](https://pypi.org/project/langchain-scrapegraph/)
[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://scrapegraphai.com/docs)

Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between [LangChain](https://github.com/langchain-ai/langchain) and [ScrapeGraph AI](https://scrapegraphai.com), enabling your agents to extract structured data from websites using natural language.

## 📦 Installation

```bash
pip install langchain-scrapegraph
```

## 🛠️ Available Tools

### 📝 MarkdownifyTool
Convert any webpage into clean, formatted markdown.

```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)
```

### 🔍 SmartScraperTool
Extract structured data from any webpage using natural language prompts.

```python
from langchain_scrapegraph.tools import SmartScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartScraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph",
})

print(result)
```

### 💻 LocalScraperTool
Extract information from HTML content using AI.

```python
from langchain_scrapegraph.tools import LocalScraperTool

tool = LocalScraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>",
})

print(result)
```

## 🌟 Key Features

- 🐦 **LangChain Integration**: Seamlessly works with LangChain agents and chains
- 🔍 **AI-Powered Extraction**: Use natural language to describe what data to extract
- 📊 **Structured Output**: Get clean, structured data ready for your agents
- 🔄 **Flexible Tools**: Choose from multiple specialized scraping tools
- ⚡ **Async Support**: Built-in support for async operations
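The async support noted above follows LangChain's standard tool interface: each tool exposes `ainvoke` alongside `invoke`. A minimal sketch of the calling pattern (the `StubTool` class below is a hypothetical stand-in so the snippet runs without an API key; a real ScrapeGraph tool is awaited the same way):

```python
import asyncio


class StubTool:
    """Hypothetical stand-in that mirrors a LangChain tool's async surface."""

    async def ainvoke(self, payload: dict) -> str:
        # A real ScrapeGraph tool would call the hosted API here.
        return f"# Markdown for {payload['website_url']}"


async def main() -> None:
    tool = StubTool()
    markdown = await tool.ainvoke({"website_url": "https://example.com"})
    print(markdown)


asyncio.run(main())
```

Swapping `StubTool()` for `MarkdownifyTool()` keeps the same `await tool.ainvoke(...)` shape inside any async agent loop.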

## 💡 Use Cases

- 📖 **Research Agents**: Create agents that gather and analyze web data
- 📊 **Data Collection**: Automate structured data extraction from websites
- 📝 **Content Processing**: Convert web content into markdown for further processing
- 🔍 **Information Extraction**: Extract specific data points using natural language

## 🤖 Example Agent

```python
from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartScraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartScraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

# Use the agent
response = agent.run("""
Visit example.com, make a summary of the content and extract the main heading and first paragraph
""")
```

## ⚙️ Configuration

Set your ScrapeGraph API key in your environment:
```bash
export SGAI_API_KEY="your-api-key-here"
```

Or set it programmatically:
```python
import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
```
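Both approaches feed the same lookup: the tools prefer an explicitly passed `api_key` and only fall back to the `SGAI_API_KEY` environment variable when none is given. A small sketch of that precedence (the `resolve_api_key` helper is hypothetical, written only to illustrate the order; it is not part of the package):

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None) -> str:
    # Hypothetical helper: an explicitly passed key wins,
    # otherwise SGAI_API_KEY is read from the environment.
    key = explicit or os.environ.get("SGAI_API_KEY")
    if not key:
        raise ValueError("Set SGAI_API_KEY or pass api_key explicitly")
    return key


os.environ["SGAI_API_KEY"] = "env-key"
print(resolve_api_key())            # -> env-key (environment fallback)
print(resolve_api_key("ctor-key"))  # -> ctor-key (explicit key wins)
```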

## 📚 Documentation

- [API Documentation](https://scrapegraphai.com/docs)
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction.html)
- [Examples](examples/)

## 💬 Support & Feedback

- 📧 Email: [email protected]
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues/new)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This project is built on top of:
- [LangChain](https://github.com/langchain-ai/langchain)
- [ScrapeGraph AI](https://scrapegraphai.com)

---

Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)
57 changes: 57 additions & 0 deletions examples/agent_example.py
@@ -0,0 +1,57 @@
"""
Remember to install the additional dependencies for this example to work:
pip install langchain-openai langchain
"""

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    SmartScraperTool,
)

load_dotenv()

# Initialize the tools
tools = [
    SmartScraperTool(),
    LocalScraperTool(),
    GetCreditsTool(),
]

# Create the prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful AI assistant that can analyze websites and extract information. "
                "You have access to tools that can help you scrape and process web content. "
                "Always explain what you're doing before using a tool."
            )
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

# Initialize the LLM
llm = ChatOpenAI(temperature=0)

# Create the agent
agent = create_openai_functions_agent(llm, tools, prompt)

# Create the executor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example usage
query = """Extract the main products from https://www.scrapegraphai.com/"""

print("\nQuery:", query, "\n")
response = agent_executor.invoke({"input": query})
print("\nFinal Response:", response["output"])
14 changes: 9 additions & 5 deletions examples/get_credits_tool.py
@@ -1,9 +1,13 @@
+from scrapegraph_py.logger import sgai_logger
+
 from langchain_scrapegraph.tools import GetCreditsTool
 
-# Will automatically get SGAI_API_KEY from environment, or set it manually
+sgai_logger.set_logging(level="INFO")
+
+# Will automatically get SGAI_API_KEY from environment
 tool = GetCreditsTool()
-credits = tool.run()
 
-print("\nCredits Information:")
-print(f"Remaining Credits: {credits['remaining_credits']}")
-print(f"Total Credits Used: {credits['total_credits_used']}")
+# Use the tool
+credits = tool.invoke({})
+
+print(credits)
28 changes: 28 additions & 0 deletions examples/localscraper_tool.py
@@ -0,0 +1,28 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import LocalScraperTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = LocalScraperTool()

# Example website and prompt
html_content = """
<html>
<body>
    <h1>Company Name</h1>
    <p>We are a technology company focused on AI solutions.</p>
    <div class="contact">
        <p>Email: [email protected]</p>
        <p>Phone: (555) 123-4567</p>
    </div>
</body>
</html>
"""
user_prompt = "Make a summary of the webpage and extract the email and phone number"

# Use the tool
result = tool.invoke({"website_html": html_content, "user_prompt": user_prompt})

print(result)
16 changes: 16 additions & 0 deletions examples/markdownify_tool.py
@@ -0,0 +1,16 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import MarkdownifyTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = MarkdownifyTool()

# Example website and prompt
website_url = "https://www.example.com"

# Use the tool
result = tool.invoke({"website_url": website_url})

print(result)
18 changes: 10 additions & 8 deletions examples/smartscraper_tool.py
@@ -1,15 +1,17 @@
-from langchain_scrapegraph.tools import SmartscraperTool
+from scrapegraph_py.logger import sgai_logger
 
-# Will automatically get SGAI_API_KEY from environment, or set it manually
-tool = SmartscraperTool()
+from langchain_scrapegraph.tools import SmartScraperTool
+
+sgai_logger.set_logging(level="INFO")
+
+# Will automatically get SGAI_API_KEY from environment
+tool = SmartScraperTool()
 
 # Example website and prompt
 website_url = "https://www.example.com"
 user_prompt = "Extract the main heading and first paragraph from this webpage"
 
-# Use the tool synchronously
-result = tool.run({"user_prompt": user_prompt, "website_url": website_url})
+# Use the tool
+result = tool.invoke({"website_url": website_url, "user_prompt": user_prompt})
 
-print("\nExtraction Results:")
-print(f"Main Heading: {result['main_heading']}")
-print(f"First Paragraph: {result['first_paragraph']}")
+print(result)
6 changes: 4 additions & 2 deletions langchain_scrapegraph/tools/__init__.py
@@ -1,4 +1,6 @@
 from .credits import GetCreditsTool
-from .smartscraper import SmartscraperTool
+from .localscraper import LocalScraperTool
+from .markdownify import MarkdownifyTool
+from .smartscraper import SmartScraperTool
 
-__all__ = ["SmartscraperTool", "GetCreditsTool"]
+__all__ = ["SmartScraperTool", "GetCreditsTool", "MarkdownifyTool", "LocalScraperTool"]
55 changes: 51 additions & 4 deletions langchain_scrapegraph/tools/credits.py
@@ -7,25 +7,72 @@
from langchain_core.tools import BaseTool
from langchain_core.utils import get_from_dict_or_env
from pydantic import model_validator
-from scrapegraph_py import SyncClient
+from scrapegraph_py import Client


class GetCreditsTool(BaseTool):
    """Tool for checking remaining credits on your ScrapeGraph AI account.

    Setup:
        Install ``langchain-scrapegraph`` python package:

        .. code-block:: bash

            pip install langchain-scrapegraph

        Get your API key from ScrapeGraph AI (https://scrapegraphai.com)
        and set it as an environment variable:

        .. code-block:: bash

            export SGAI_API_KEY="your-api-key"

    Key init args:
        api_key: Your ScrapeGraph AI API key. If not provided, will look for SGAI_API_KEY env var.
        client: Optional pre-configured ScrapeGraph client instance.

    Instantiate:
        .. code-block:: python

            from langchain_scrapegraph.tools import GetCreditsTool

            # Will automatically get SGAI_API_KEY from environment
            tool = GetCreditsTool()

            # Or provide API key directly
            tool = GetCreditsTool(api_key="your-api-key")

    Use the tool:
        .. code-block:: python

            result = tool.invoke({})

            print(result)
            # {
            #     "remaining_credits": 100,
            #     "total_credits_used": 50
            # }

    Async usage:
        .. code-block:: python

            result = await tool.ainvoke({})
    """

    name: str = "GetCredits"
    description: str = (
        "Get the current credits available in your ScrapeGraph AI account"
    )
    return_direct: bool = True
-    client: Optional[SyncClient] = None
+    client: Optional[Client] = None
    api_key: str
    testing: bool = False

    @model_validator(mode="before")
    @classmethod
    def validate_environment(cls, values: Dict) -> Dict:
        """Validate that api key exists in environment."""
        values["api_key"] = get_from_dict_or_env(values, "api_key", "SGAI_API_KEY")
-        values["client"] = SyncClient(api_key=values["api_key"])
+        values["client"] = Client(api_key=values["api_key"])
        return values

    def __init__(self, **data: Any):