403 as response #283

fx71 · 2024-05-21T13:32:21Z


Just double-checking that I understand the project's assumptions correctly.
While running the example web scraper I'm getting an error (most probably a 403) from the site, which looks like some protection mechanism.
I guess there is no easy way to override the user agent, since the whole project seems to obey robots.txt.

Is this only a matter of the site's response, or does Scrapegraph-ai additionally restrict access based on e.g. robots.txt?
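
For anyone wanting to rule out robots.txt independently of Scrapegraph-ai, the rules can be checked with Python's standard `urllib.robotparser`. This is only a sketch: `is_allowed`, the sample rules, the `my-scraper` agent name and the example.com URLs are made up for illustration and say nothing about how Scrapegraph-ai itself treats robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# A real check would download the target site's own /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/listing"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/x"))
```

If the rules allow the URL and the request still fails, the block is more likely coming from the site's response than from robots.txt.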


VinciGit00 · 2024-05-21T14:06:47Z

Maintainer

I think we can overcome the robots.txt problem. Please post your code.

5 replies

fx71 May 22, 2024
Author

It's a kind of hello-world example:

from scrapegraphai.graphs import SmartScraperGraph
from scrapegraphai.utils import prettify_exec_info


if __name__ == '__main__':
    print("Hello ScrapeGraphAi")

    graph_config = {
        "llm": {
            "model": "ollama/llama3",
            "temperature": 0,
            "format": "json",  # Ollama needs the format to be specified explicitly
            "base_url": "http://localhost:11434",  # set Ollama URL
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",  # set Ollama URL
        }
    }

    smart_scraper_graph = SmartScraperGraph(
        prompt="Fetch house price, location and number of rooms",
        # also accepts a string with the already downloaded HTML code
        source="https://www.otodom.pl/pl/wyniki/sprzedaz/dom/dolnoslaskie/wroclaw/wroclaw/wroclaw/psie-pole/widawa?limit=36&ownerTypeSingleSelect=ALL&by=DEFAULT&direction=DESC&viewType=listing",
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

    graph_exec_info = smart_scraper_graph.get_execution_info()
    print(prettify_exec_info(graph_exec_info))

This returns

{'house_price': None, 'location': None, 'number_of_rooms': None}
        node_name  total_tokens  ...  total_cost_USD  exec_time
0           Fetch             0  ...             0.0   0.370372
1           Parse             0  ...             0.0   0.139742
2             RAG             0  ...             0.0   1.022172
3  GenerateAnswer             0  ...             0.0  10.439219
4    TOTAL RESULT             0  ...             0.0  11.971505

With the prompt replaced by prompt="Describe content of the page", as a way to debug this:

{'content': "The page contains an error message indicating that the request could not be satisfied. The error is a 403 error and it's caused by too much traffic or a configuration issue. The page also provides some troubleshooting steps to help resolve the issue."}
        node_name  total_tokens  ...  total_cost_USD  exec_time
0           Fetch             0  ...             0.0   0.555157
1           Parse             0  ...             0.0   0.139933
2             RAG             0  ...             0.0   0.302496
3  GenerateAnswer             0  ...             0.0   7.082565
4    TOTAL RESULT             0  ...             0.0   8.080151

So this looks like a 403 from the scraped site, but I wasn't sure whether it comes only from the site's response or whether Scrapegraph-ai adds something more.
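
One way to separate the two cases is to request the page directly, outside Scrapegraph-ai, and look at the raw status code. Below is a minimal sketch using only the standard library. `fetch_status` is a hypothetical helper name, and the default User-Agent string and timeout are arbitrary assumptions, not values taken from Scrapegraph-ai.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str = "Mozilla/5.0", timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for a plain GET request."""
    # The User-Agent value is an assumption; try different values to compare.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return error.code
```

Calling `fetch_status(...)` with the listing URL from the script above and getting 403 back would suggest the site is rejecting plain HTTP clients on its own, independently of the scraping library.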

PeriniM May 22, 2024
Maintainer

Hey @fx71, try setting the headless flag to False and you should be able to fetch the HTML. This sometimes happens with JavaScript-heavy websites.

graph_config = {
    "llm": {
        ...
    },
    "verbose": True,
    "headless": False,
}
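
For reference, here is a sketch of how the two flags would sit next to the Ollama settings from the original post. The `llm` and `embeddings` values are copied from that post. Only `verbose` and `headless` are new, and placing them at the top level follows the snippet above.

```python
# Ollama settings copied from the original post in this thread.
graph_config = {
    "llm": {
        "model": "ollama/llama3",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    # Flags suggested in the reply above.
    "verbose": True,
    "headless": False,  # run the browser with a visible window
}
```

This dictionary is passed as `config=graph_config` to `SmartScraperGraph`, exactly as in the original script.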

Answer selected by fx71

fx71 May 22, 2024
Author

Thanks, looks like this solved the 403 issue.

fx71 May 22, 2024
Author

A different challenge (probably for a separate discussion) is a GDPR or cookie popup. It looks like the popup is what gets parsed instead of the page content underneath. Is there any known workaround for this?

PeriniM May 22, 2024
Maintainer

Yes, this is tricky, but it can be solved using a proxy (we have an example here) or by bypassing it directly in the webdriver. In pre/beta we have added undetected-playwright (#269) if you want to check it out.

VinciGit00 · 2024-05-22T09:42:00Z

Maintainer

We suggest using this proxy for proxy rotation: https://dashboard.statproxies.com/?refferal=scrapegraph

0 replies
