A graph-based knowledge search engine powered by Wikipedia
The first link in the main body text identifies a hierarchical relationship between articles: banana links to fruit, piano to musical instruments, and so on.
The search engine uses the first link to construct a directed graph connecting the 11 million English articles (6 million redirects). For those curious about the network's topology, peek at the research that inspired this project.
Example: Piano
| Parent | Comparable | Children |
|---|---|---|
| musical instruments | Music box, Violin family, Glass harmonica | Piano Music, Piano music, Grand Piano, Lily Maisky, William Merrigan Daly |
- Download the entire XML dump, available at https://dumps.wikimedia.org/enwiki/
- Extract the first link in the main body text (see get_first_link.py); a PySpark sketch follows this list
- Distribute the computation using Spark DataFrames (on an 8-node AWS cluster)
- The Databricks spark-xml package is used to split the dump into individual pages: https://github.com/databricks/spark-xml
- Store the graph in Neo4j
- Index articles by title and add page views as a property of each article
- The build uses Neo4j's bulk import, which also loads page views as an attribute of each node
- Queries can filter results by page views to return the most relevant articles (see the Cypher sketch after this list)
Note: matching titles between the available hourly page-view data and the displayed titles is imperfect (see match_views.py).
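A minimal sketch of the extraction step (the real logic lives in get_first_link.py). The column layout below reflects how spark-xml typically flattens the dump, and the regex is a simplification that ignores templates and parenthetical text:

```python
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Requires the spark-xml package on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-xml_2.11:0.5.0 ...
spark = SparkSession.builder.appName("first-link").getOrCreate()

pages = (
    spark.read.format("xml")
    .option("rowTag", "page")  # one DataFrame row per <page> element
    .load("enwiki-latest-pages-articles.xml")
)

def first_link(text):
    """Return the target of the first [[wikilink]] in the wikitext, if any."""
    if text is None:
        return None
    match = re.search(r"\[\[([^\[\]|#]+)", text)
    return match.group(1).strip() if match else None

first_link_udf = udf(first_link, StringType())

# revision.text._VALUE is where spark-xml usually places the wikitext body
links = pages.select(
    col("title"),
    first_link_udf(col("revision.text._VALUE")).alias("first_link"),
)
```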
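The title index and the page-view filter map onto short Cypher statements. Below is a sketch using the official neo4j Python driver; the `Article` label, `FIRST_LINK` relationship, and `views` property are assumed names, and the real build goes through neo4j-admin bulk import rather than the driver:

```python
import os

from neo4j import GraphDatabase

import configs  # sets neo4j_ip / neo4j_pass (see configs.py below)

driver = GraphDatabase.driver(
    f"bolt://{os.environ['neo4j_ip']}:7687",
    auth=("neo4j", os.environ["neo4j_pass"]),
)

with driver.session() as session:
    # Index articles by title so lookups from the search box are fast.
    session.run("CREATE INDEX ON :Article(title)")

    # Page views let queries keep only the most relevant neighbours.
    top_children = session.run(
        "MATCH (child)-[:FIRST_LINK]->(a:Article {title: $title}) "
        "RETURN child.title ORDER BY child.views DESC LIMIT 10",
        title="Piano",
    )
```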
In addition to the graph, the first 2000 characters of the main body text are indexed for fuzzy title searching.
- Powered by Elasticsearch
- Indexing is distributed using Scala
- Note: the PySpark Databricks XML package and EsSpark (the Spark-to-Elasticsearch connector) are not compatible
- Build the jar with Maven and run the Scala job (see pom.xml and index_wiki.scala)
- The Elasticsearch query weighs the title 2x more heavily relative to the introductory body text (see the query sketch after the example below)
Example: "paper" --> "Pulp (paper)"
The lower-case "paper" is matched to the correct Wikipedia article title.
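As a rough sketch of what that weighting looks like in the query DSL, using the Python elasticsearch client (the index and field names here are assumptions, not the repo's actual mapping):

```python
import os

from elasticsearch import Elasticsearch

import configs  # sets elasticsearch_node_dns (see configs.py below)

es = Elasticsearch([os.environ["elasticsearch_node_dns"]])

def fuzzy_title_search(term):
    """Match against the title (boosted 2x) and the indexed introductory text."""
    return es.search(
        index="wikipedia",  # hypothetical index name
        body={
            "query": {
                "multi_match": {
                    "query": term,
                    "fields": ["title^2", "intro_text"],  # ^2 doubles the title weight
                    "fuzziness": "AUTO",
                }
            }
        },
    )
```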
To install dependencies:

```bash
pip install -r requirements.txt
```
For distributed computations, the program also requires Spark, Java > 7, and Scala.
The program expects a configs.py file, which sets environment variables for the database and Spark nodes:
```python
import os

os.environ["master_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["elasticsearch_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["neo4j_pass"] = "xxxx"
os.environ["neo4j_ip"] = "xx.xxx.xx"
```
Flow for a given search term:

- Match the search term to the closest title (using the Elasticsearch query above)
- Fetch a subset of the network (the parent, comparable, and child articles)
- Filter the subgraph by page views to return only the most relevant subset (see the sketch below)
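A sketch of the subgraph query in Cypher, run through the Python driver. It assumes the schema from the earlier sketch (`Article` nodes connected by `FIRST_LINK`, each carrying a `views` property), so the parent is the first-link target, children are articles that link here, and comparables share the same parent:

```python
SUBGRAPH_QUERY = """
MATCH (a:Article {title: $title})-[:FIRST_LINK]->(parent)
OPTIONAL MATCH (c)-[:FIRST_LINK]->(parent) WHERE c <> a
WITH a, parent, c ORDER BY c.views DESC
WITH a, parent, collect(c.title)[..10] AS comparable
OPTIONAL MATCH (child)-[:FIRST_LINK]->(a)
WITH parent, comparable, child ORDER BY child.views DESC
RETURN parent.title AS parent,
       comparable,
       collect(child.title)[..10] AS children
"""

def fetch_subgraph(driver, title):
    """Return the parent, comparable, and child articles, most viewed first."""
    # `driver` is the neo4j driver instance from the earlier sketch.
    with driver.session() as session:
        return session.run(SUBGRAPH_QUERY, title=title).single()
```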
The front end renders this result as a directed D3 graph.