Unable to use WholeSiteReader on the Cloud #12790

Open
VaradDeshmukh97 opened this issue Apr 12, 2024 · 5 comments
Labels
question Further information is requested

Comments

@VaradDeshmukh97

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am using WholeSiteReader() for loading websites into LlamaIndex Document objects. On my local system it works fine, but when I run my notebook on the Cloud, it says

ValueError: No chrome executable found on PATH

I understand that the driver needs to be given the path to the Chrome Executable, but I am unable to work it out. Any help will be appreciated.
Thanks!

@VaradDeshmukh97 VaradDeshmukh97 added the question Further information is requested label Apr 12, 2024

dosubot bot commented Apr 12, 2024

Hey @VaradDeshmukh97, great to see you around here again! 🚀 Your adventures in the cloud sound like quite the saga. I'm on it - will get back to you with a strategy to conquer this error shortly! 🌩️

@VaradDeshmukh97
Author

When I run this,

from llama_index.readers.web import WholeSiteReader
from selenium import webdriver
def load_websites(prefix, base_url, max_depth=1):
    '''
    Recursively searches and loads a website into LlamaIndex Document objects.
    Args:
        prefix = url of the website
        base_url = url of the page from where to start the scraping
    '''
    driver = webdriver.Chrome()
    scraper = WholeSiteReader(
        prefix=prefix,
        max_depth=max_depth,
        driver=driver
    )
    websites = scraper.load_data(
        base_url=base_url
    )
    return websites

I get the following error:

SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /teamspace/studios/this_studio/.cache/selenium/chrome/linux64/123.0.6312.122/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55b5d0b0f863 <unknown>
#1 0x55b5d08058c6 <unknown>
#2 0x55b5d0839d34 <unknown>
#3 0x55b5d0835d3d <unknown>
#4 0x55b5d087eaed <unknown>
#5 0x55b5d0872343 <unknown>
#6 0x55b5d0843593 <unknown>
#7 0x55b5d0843f5e <unknown>
#8 0x55b5d0ad384b <unknown>
#9 0x55b5d0ad77a5 <unknown>
#10 0x55b5d0ac1571 <unknown>
#11 0x55b5d0ad8332 <unknown>
#12 0x55b5d0aa687f <unknown>
#13 0x55b5d0afe728 <unknown>
#14 0x55b5d0afe8fb <unknown>
#15 0x55b5d0b0e9b4 <unknown>
#16 0x7ff2fb9b8609 start_thread


dosubot bot commented Apr 12, 2024

The error you're encountering suggests a compatibility issue between Chrome and ChromeDriver or a problem with locating the Chrome executable in your cloud environment. Here's how to address these issues:

  1. Check Chrome and ChromeDriver Compatibility: Ensure the versions of Chrome and ChromeDriver you're using are compatible. Mismatches can lead to the SessionNotCreatedException.

  2. Use Remote WebDriver: In cloud environments, it's common to use a Remote WebDriver setup. This involves configuring your code to use webdriver.Remote() instead of webdriver.Chrome(), specifying the URL of the Selenium Grid or cloud-based testing service and the desired capabilities.

  3. Specify Chrome Binary Location: If the Chrome binary is in a non-standard location, you may need to explicitly set its path using ChromeOptions. This is done by setting opt.binary_location = '/path/to/chrome' before initializing the WebDriver.

Here's a snippet for using a Remote WebDriver, which you should adjust based on your environment:

from selenium import webdriver

def setup_driver(self):
    # Placeholder URL: replace with your Selenium Grid hub or cloud service endpoint
    selenium_grid_url = "http://your.selenium.grid.url:4444/wd/hub"

    opt = webdriver.ChromeOptions()
    # Specify binary location if necessary
    # opt.binary_location = '/path/to/chrome'

    # Pass the options object directly; recent Selenium 4 releases
    # no longer accept the desired_capabilities argument
    return webdriver.Remote(command_executor=selenium_grid_url, options=opt)

Replace "http://your.selenium.grid.url:4444/wd/hub" with your Selenium Grid hub URL or the endpoint of your cloud-based Selenium service. This setup requires the remote environment to have the correct Chrome and ChromeDriver versions installed. If you're still facing issues, please share more details about your cloud setup, including the versions of Chrome and ChromeDriver and the location of the Chrome executable.
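Before setting a binary location, it can help to confirm whether any Chrome executable is visible on PATH in the cloud environment. The following is a minimal sketch using only the standard library. The helper name find_chrome_binary and the list of candidate executable names are assumptions for illustration; they are not part of Selenium or LlamaIndex.

```python
import shutil
from typing import Optional, Sequence

# Assumed executable names used by common Linux Chrome/Chromium packages.
CANDIDATE_NAMES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_chrome_binary(names: Sequence[str] = CANDIDATE_NAMES) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    chrome_path = find_chrome_binary()
    if chrome_path is None:
        print("No Chrome executable found on PATH")
    else:
        # This path could then be assigned to opt.binary_location.
        print(f"Chrome found at {chrome_path}")
```

If the helper returns None, Chrome is either not installed or installed outside PATH, in which case the full path has to be supplied by hand.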


@suryaanshrai

Hey @VaradDeshmukh97, did you find a solution or workaround for this issue? I am encountering the same problem and would really appreciate your help!

@3sakshij

3sakshij commented Sep 25, 2024

Hi @suryaanshrai, I got the same errors, and this solution works for me.
You can edit the setup_driver function and add these options:
def setup_driver(self):
    """
    Sets up the Selenium WebDriver for Chrome.

    Returns:
        WebDriver: An instance of Chrome WebDriver.
    """
    try:
        import chromedriver_autoinstaller
    except ImportError:
        raise ImportError("Please install chromedriver_autoinstaller")

    opt = webdriver.ChromeOptions()
    opt.add_argument("--start-maximized")
    # Run without a display, which cloud machines typically lack
    opt.add_argument("--headless")
    opt.add_argument("--no-sandbox")
    opt.add_argument("--disable-dev-shm-usage")
    chromedriver_autoinstaller.install()
    return webdriver.Chrome(options=opt)
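
The same flags can probably be supplied without editing the library, since WholeSiteReader accepts a driver argument (as used in the original post). The following is a sketch under that assumption. The names HEADLESS_FLAGS and build_headless_driver are hypothetical, and the code assumes a Selenium version that can locate or download a matching ChromeDriver on its own.

```python
from typing import Sequence

# Flags from the comment above; the tuple name is a hypothetical helper constant.
HEADLESS_FLAGS = ("--headless", "--no-sandbox", "--disable-dev-shm-usage")


def build_headless_driver(flags: Sequence[str] = HEADLESS_FLAGS):
    """Create a Chrome WebDriver started with the given command-line flags."""
    # Imported lazily so this module loads even where selenium is missing.
    from selenium import webdriver

    opt = webdriver.ChromeOptions()
    for flag in flags:
        opt.add_argument(flag)
    return webdriver.Chrome(options=opt)


def load_websites(prefix: str, base_url: str, max_depth: int = 1):
    """Load a website into LlamaIndex Document objects using a headless driver."""
    from llama_index.readers.web import WholeSiteReader

    driver = build_headless_driver()
    scraper = WholeSiteReader(prefix=prefix, max_depth=max_depth, driver=driver)
    return scraper.load_data(base_url=base_url)
```

Passing the driver in keeps the installed package untouched, so the change survives upgrades of llama_index.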
