Crawl datasets and ingest them into Vectara using pre-built crawlers or by building your own. With Vectara’s APIs you can create conversational experiences with your data, such as chatbots, semantic search, and workplace search.
vectara-ingest
is an open source Python project that demonstrates how to crawl datasets and ingest them into Vectara. It provides a step-by-step guide on building your own crawler, as well as pre-built crawlers for ingesting data from sources such as: websites, RSS feeds, Jira, Notion, Docusaurus and more.
Let’s create a basic crawler to scrape content from Paul Graham's website, and ingest it into Vectara. This guide assumes you’ve already signed up for a free Vectara account, created a corpus to contain your data (default settings are fine), and created an API key with write-access to this corpus.
Install python >= 3.8 if it's not already installed.
Install pyyaml: pip3 install pyyaml
.
Install Docker.
For Windows,
- Start a Windows Powershell terminal
- Update WSL:
wsl --update
- Ensure WSL has the right version of Linux available:
wsl --install ubuntu-20.04
Clone this repository:
git clone https://github.com/vectara/vectara-ingest.git
Duplicate the secrets.example.toml
file and rename the copy to secrets.toml
.
Edit the secrets.toml
file and change the api_key value to your Vectara API Key.
Since the Paul Graham website does not have a standard sitemap, you're going to crawl its content using the RSS feed built by Aaron Swartz. You can do this by looking inside the config/
directory, duplicating the news-bbc.yaml
config file, and renaming it to pg-rss.yaml
.
Edit the pg-rss.yaml
file and make the following changes:
- Change vectara.corpus_id value to the ID of the corpus into which you want to ingest the content of the website.
- Change vectara.account_id value to the ID of your account. You can click on your username in the top-right corner to copy it to your clipboard.
- Change rss_crawler.source to
pg
. - Change rss_crawler.rss_pages to
["http://www.aaronsw.com/2002/feeds/pgessays.rss"]
. - Change rss_crawler.days_past to 365.
Make sure the Docker app is running.
Then execute the run script from the root directory using your config file to tell it what to crawl and where to ingest the data, assigning the “default” profile from your secrets file:
Start the crawler job:
bash run.sh config/pg-rss.yaml default
For Microsoft Windows, make sure you run this command from within the WSL 2 environment (after running wsl
to enter the Linux environment).
The crawler executes inside of a Docker container to protect your system’s resources and make it easier to move your crawlers to the cloud. This is a good time to grab a coffee or snack – the container needs to install a lot of dependencies, which will take some time.
When the container is set up, the process will exit and you’ll be able to track your crawler’s progress:
docker logs -f vingest
Paul Graham is a prolific writer, so ingesting all of his work might take a few minutes. But even while your crawler ingesting data into your Vectara corpus, you can build applications that query it! Try out some sample queries with the Vectara Console ("search" tab) -- one of my favorites is "What is a maker schedule?"
The codebase includes the following components.
Fundamental utilities depended upon by the crawlers:
ingest.py
: The main entry point for a crawl job.indexer.py
: Defines theIndexer
class which implements helpful methods to index data into Vectara such asindex_url
,index_file()
andindex_document()
.crawler.py
: Defines theCrawler
class which implements a base class for crawling, where each specific crawler should implement thecrawl()
method specific to its type.pdf_convert.py
: Helper class to convert URLs into local PDF documents.utils.py
: Some utility functions used by the other code.
Includes implementations of the various specific crawlers.
Includes example YAML configuration files for various crawling jobs.
The main shell script to execute when you want to launch a crawl job (see below for more details).
Defines the Docker container image.
To crawl and index a source you run a crawl "job", which is controlled by several paramters that you can define in a YAML configuration file. You can see example configuration files in the config/ directory.
Each configuration YAML file includes a set of standard variables, for example:
vectara:
# the corpus ID for indexing
corpus_id: 4
# the Vectara customer ID
customer_id: 1234567
# flag: should vectara-ingest reindex if document already exists (optional)
reindex: false
# timeout (optional); sets the URL crawling timeout in seconds
timeout: 45
crawling:
# type of crawler; valid options are website, docusaurus, notion, jira, rss, mediawiki, discourse, github and others (this continues to evolve as new crawler types are added)
crawler_type: XXX
Following that, where needed, the same YAML configuration file will include crawler-specific section with crawler-specific parameters (see about crawlers):
XXX_crawler:
# specific parameters for the crawler XXX
We use a secrets.toml
file to hold secret keys and parameters. You need to create this file in the root directory before running a crawl job. This file can hold multiple "profiles", and specific specific secrets for each of these profiles. For example:
[profile1]
api_key="<VECTAR-API-KEY-1>
[profile2]
api_key="<VECTARA-API-KEY-2>"
[profile3]
api_key="<VECTARA-API-KEY-3>"
MOTION_API_KEY="<YOUR-NOTION-API-KEY>
This allows easy secrets management when you have multiple crawl jobs that may not share the same secrets. For example when you have a different Vectara API key for indexing differnet corpora.
Many of the crawlers have their own secrets, for example Notion, Discourse, Jira, or GitHub. These are also kept in the secrets file in the appropriate section and need to be all upper case (e.g. NOTION_API_KEY
or JIRA_PASSWORD
).
The Indexer
class provides useful functionality to index documents into Vectara.
This is probably the most useful method. It takes a URL as input and extracts the content from that URL (using the playwright
and Goose3
libraries), then sends that content to Vectara using the standard indexing API. If the URL points to a PDF document, special care is taken to ensure proper processing.
Please note that we use Goose3
to extract the main (most important) content of the article, ignoring links, ads and other not-important content. If your crawled content has different requirements you can change the code to use a different extraction mechanism (html to text).
Use this when you have a file that you want to index using Vectara's file_uplaod API, so that it takes care of format identification, segmentation of text and indexing.
Use these when you build the document
JSON structure directly and want to index this document in the Vectara corpus.
Specifically, the reindex
parameter determines whether an existing document should be reindexed or not. If reindexing is required, the code automatically takes care of that by calling delete_doc()
to first remove the document from the corpus and then sends it to the corpus index.
The project is designed to be used within a Docker container, so that a crawl job can be run anywhere - on a local machine or on any cloud machine. See the Dockerfile for more information on the Docker file structure and build.
To run vectara-ingest
locally, perform the following steps:
- Make sure you have Docker installed on your machine.
- Clone this repo locally with
git clone https://github.com/vectara/vectara-ingest.git
. - Enter the directory with
cd vectara-ingest
. - Choose the configuration file for your project and run
bash run.sh config/<config-file>.yaml <profile>
. This command creates the Docker container locally, configures it with the parameters specified in your configuration file (with secrets taken from the appropriate<profile>
insecrets.toml
), and starts up the Docker container.
vectara-ingest
can be easily deployed on any cloud platform such as AWS, Azure or GCP.
- Create your configuration file for the project under the
config/
directory. - Run
docker build . --tag=vectara-ingest:latest
to generate the Docker container. - Push the docker to the cloud specific docker container registry:
- Launch the container on a VM instance based on the Docker image now hosted in your cloud environment.
The vectara-ingest
container is available for easy deployment via docker-hub.
👤 Vectara
- Website: vectara.com
- Twitter: @vectara
- GitHub: @vectara
- LinkedIn: @vectara
- Discord: @vectara
Contributions, issues and feature requests are welcome and appreciated!
Feel free to check issues page. You can also take a look at the contributing guide.
Give a ⭐️ if this project helped you!
Copyright © 2023 Vectara.
This project is Apache 2.0 licensed.