vectara-ingest
includes a number of crawlers that make it easy to crawl data sources and index the results into Vectara.
vectara-ingest
uses Goose3 and JusText in Indexer.index_url to enhance text extraction from HTML content, ensuring relevant material is gathered while excluding ads and irrelevant content. vectara-ingest
supports crawling and indexing web content in 42 languages currently. To determine the language of a given webpage, we utilize the langdetect package, and adjust the use of Goose3 and JusText accordingly.
Let's go through each of these crawlers to explain how they work and how to customize them to your needs. This will also provide good background to creating (and contributing) new types of crawlers.
...
website_crawler:
urls: [https://vectara.com]
url_regex: []
delay: 1
pages_source: sitemap
extraction: playwright
max_depth: 3 # only needed if pages_source is set to 'crawl'
ray_workers: 0
...
The website crawler indexes the content of a given web site. It supports two modes for finding pages to crawl (defined by pages_source
):
sitemap
: in this mode the crawler retrieves the sitemap of the target websites (urls
: list of URLs) and indexes all the URLs listed in each sitemap. Note that some sitemaps are partial only and do not list all content of the website - in those cases,crawl
may be a better option.crawl
: in this mode the crawler starts from the homepage and crawls the website, following links no more thanmax_depth
The extraction
parameter defines how page content is extracted from URLs.
- The default value is
pdf
which means the target URL is rendered into a PDF document, which is then uploaded to Vectara. This is the preferred method as the rendering operation is helpful to extract any text that may be due to Javascript or other scriping execution. - The other option is
playwright
which results in using playwright to render the page content including JS and then extracting the HTML.
delay
specifies the number of seconds to wait between to consecutive requests to avoid rate limiting issues.
url_regex
if it exists defines a regular expression that is used to filter URLs. For example, if I want only the developer pages from vectara.com to be indexed, I can use ".vectara.com/developer."
ray_workers
if it exists defines the number of ray workers to use for parallel processing. ray_workers=0 means dont use Ray. ray_workers=-1 means use all cores available.
Note that ray with docker does not work on Mac M1/M2 machines.
...
database_crawler:
db_url: "postgresql://<username>:<password>@my_db_host:5432/yelp"
db_table: yelp_reviews
select_condition: "city='New Orleans'"
doc_id_columns: [postal_code]
text_columns: [business_name, review_text]
metadata_columns: [city, state, postal_code]
The database crawler can be used to read data from a relational database and index relevant columns into Vectara.
db_url
specifies the database URI including the type of database, host/port, username and password if needed.- For MySQL: "mysql://username:password@host:port/database"
- For PostgreSQL:"postgresql://username:password@host:port/database"
- For Microsoft SQL Server: "mssql+pyodbc://username:password@host:port/database"
- For Oracle: "oracle+cx_oracle://username:password@host:port/database"
db_table
the table name in the databaseselect_condition
optional condition to filter rows in the table bydoc_id_columns
defines one or more columns that will be used as a document ID, and will aggregate all rows associated with this value into a single Vectara document. This will also be used as the title. If this is not specified, the code will aggregate everyrows_per_chunk
(default 500) rows.text_columns
a list of column names that include textual information we want to usetitle_column
is an optional column name that will hold textual information to be used as titlemetadata_columns
a list of column names that we want to use as metadata
In the above example, the crawler would
- Include all rows in the database "yelp" that are from the city of New Orleans (
SELECT * FROM yelp WHERE city='New Orleans'
) - Group all rows that have the same values for
postal_code
into the same Vectara document - Each such Vectara document that is indexed, will include several section (one per row), each representing the textual fields
business_name
andreview_text
and including the meta-data fieldscity
,state
andpostal_code
.
...
csv_crawler:
csv_path: "/path/to/Game_of_Thrones_Script.csv"
doc_id_columns: [Season, Episode]
text_columns: [Name, Sentence]
metadata_columns: ["Season", "Episode", "Episode Title"]
separator: ','
The csv crawler is similar to the database crawler, but instead of pulling data from a database, it uses a local CSV file.
select_condition
optional condition to filter rows in the table bydoc_id_columns
defines one or more columns that will be used as a document ID, and will aggregate all rows associated with this value into a single Vectara document. This will also be used as the title. If this is not specified, the code will aggregate everyrows_per_chunk
(default 500) rows.text_columns
a list of column names that include textual information we want to usetitle_column
is an optional column name that will hold textual information to be used as titlemetadata_columns
a list of column names that we want to use as metadataseparator
a string that will be used as a separator in the CSV file (default ',')
In the above example, the crawler would
- Read all the data from the local CSV file under
/path/to/Game_of_Thrones_Script.csv
- Group all rows that have the same values for both
Season
andEpisode
into the same Vectara document - Each such Vectara document that is indexed, will include several section (one per row), each representing the textual fields
Name
andSentence
and including the meta-data fieldsSeason
,Episode
andEpisode Title
.
...
rss_crawler:
source: bbc
rss_pages: [
"http://feeds.bbci.co.uk/news/rss.xml", "http://feeds.bbci.co.uk/news/world/rss.xml", "http://feeds.bbci.co.uk/news/uk/rss.xml",
"http://feeds.bbci.co.uk/news/business/rss.xml", "http://feeds.bbci.co.uk/news/politics/rss.xml",
"http://feeds.bbci.co.uk/news/health/rss.xml", "http://feeds.bbci.co.uk/news/education/rss.xml",
"http://feeds.bbci.co.uk/news/science_and_environment/rss.xml", "http://feeds.bbci.co.uk/news/technology/rss.xml",
"http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml", "http://feeds.bbci.co.uk/news/world/middle_east/rss.xml",
"http://feeds.bbci.co.uk/news/world/us_and_canada/rss.xml", "http://feeds.bbci.co.uk/news/world/asia/rss.xml",
"http://feeds.bbci.co.uk/news/world/europe/rss.xml"
]
days_past: 90
delay: 1
extraction: playwright # pdf or playwright
The RSS crawler can be used to crawl URLs listed in RSS feeds such as on news sites. In the example above, the rss_crawler is configured to crawl various newsfeeds from the BBC.
source
specifies the name of the rss data feedrss_pages
defines one or more RSS feed locations.days_past
specifies the number of days backward to crawl; for example with a value of 90 as in this example, the crawler will only index news items that have been published no earlier than 90 days in the past.delay
defines the number of seconds between to wait between news articles, so as to make the crawl more friendly to the hosting site.extraction
defines how we process URLs (pdf or html) as as with the website crawler.
...
docs_crawler:
docs_repo: "https://github.com/vectara/vectara-docs"
docs_homepage: "https://docs.vectara.com/docs"
docs_system: "docusaurus"
ray_workers: 0
The Docs crawler processes and indexes content published on different documentation systems. It has two parameters
base_urls
defines one or more base URLS for the documentation content.pos_regex
defines one or more (optional) regex expressions defining URLs to match for inclusionneg_regex
defines one or more (optional) regex expressions defining URLs to match for exclusionextensions_to_ignore
specifies one or more file extensions that we want to ignore and not index into Vectara.doc_system
defines the type of documentation system we will crawl for content. Currently supporteddocusaurus
orreadthedocs
remove_code
if true (default) then removes all code segments (with tags<script> and <code>
) in what is indexed, otherwise keeps those sectionsray_workers
if it exists defines the number of ray workers to use for parallel processing. ray_workers=0 means dont use Ray. ray_workers=-1 means use all cores available. Note that ray with docker does not work on Mac M1/M2 machines.
...
discourse_crawler:
base_url: "https://discuss.vectara.com"
The discourse forums crawler requires a single parameter, base_url
, which specifies the home page for the public forum we want to crawl.
In the secrets.toml
file you should have DISCOURSE_API_KEY="<YOUR_DISCOURSE_KEY>" which will provide the needed authentication for the crawler to access this data.
The mediawiki crawler can crawl content in any wikimedia-powered website such as Wikipedia or others, and index it into Vectara.
...
wikimedia_crawler:
project: "en.wikipedia"
api_url: "https://en.wikipedia.org/w/api.php"
n_pages: 1000
The mediawiki crawler first looks at media statistics to determine the most viewed pages in the last 7 days, and then based on that picks the top n_pages
to crawl.
api_url
defines the base URL for the wikiproject
defines the mediawiki project name.
...
github_crawler:
owner: "vectara"
repos: ["getting-started", "protos", "slackbot", "magazine-search-demo", "web-crawler", "Search-UI", "hotel-reviews-demo"]
crawl_code: false
The GitHub crawler indexes content from GitHub repositories into Vectara.
repos
: list of repository names to crawlowner
: GitHub repository ownercrawl_code
: by default the crawler indexes only issues and comments; if this is set to True it will also index the source code (but that's usually not recommended).
It is highly recommended to add a GITHUB_TOKEN
to your secret.toml
file under the specific profile you're going to use. The GITHUB_TOKEN (see here for how to create one for yourself) ensures you don't run against rate limits.
...
jira_crawler:
jira_base_url: "https://vectara.atlassian.net/"
jira_username: [email protected]
jira_jql: "created > -365d"
The JIRA crawler indexes issues and comments into Vectara.
jira_base_url
: the Jira base_urljira_username
: the user name that the crawler should use (JIRA_PASSWORD
should be separately defined in thesecrets.toml
file)jira_jql
: a Jira JQL condition on the issues identified; in this example it is configured to only include items from the last year.
The Notion crawler has no specific parameters, except the NOTION_API_KEY
that needs to be specified in the secrets.toml
file.
The crawler will index any content on your notion instance.
The HubSpot crawler has no specific parameters, except the HUBSPOT_API_KEY
that needs to be specified in the secrets.toml
file. The crawler will index the emails on your Hubspot instance. The crawler also uses clean_email_text()
module which takes the email message as a parameter and cleans it to make it more presentable. This function in core/utils.py
is taking care of indentation character >
.
The crawler leverages Presidio Analyzer and Anonymizer to accomplish PII masking, achieving a notable degree of accuracy in anonymizing sensitive information with minimal error.
...
folder_crawler:
path: "/Users/ofer/Downloads/some-interesting-content/"
extensions: ['.pdf']
The folder crawler simple indexes all content that's in a specified local folder.
path
: the local folder locationextensions
: list of file extensions to be included. If one of those extensions is '*' then all files would be crawled, disregarding any other extensions in that list.source
: a string that is added to each file's metadata under the "source" field
Note that the local path you specify is mapped into a fixed location in the docker container /home/vectara/data
, but that is a detail of the implementation that you don't need to worry about in most cases, just specify the path to your local folder and this mapping happens automatically.
...
s3_crawler:
s3_path: s3://my-bucket/files
extensions: ['*']
The S3 crawler indexes all content that's in a specified S3 bucket path.
s3_path
: a valid S3 location where the files to index resideextensions
: list of file extensions to be included. If one of those extensions is '*' then all files would be crawled, disregarding any other extensions in that list.
Edgar
crawler: crawls SEC Edgar annual reports (10-K) and indexes those into Vectarafmp
crawler: crawls information about public companies using the FMP APIHacker News
crawler: crawls the best, most recent an most popular Hacker News storiesPMC
crawler: crawls medical articles from PubMed Central and indexes them into Vectara.Arxiv
crawler: crawls the top (most cited or latest) Arxiv articles about a topic and indexes them into Vectara.