Knowledge Search

a graph-based knowledge search engine powered by Wikipedia

Connecting Articles in a Graph
- Graph Implementation
Fuzzy Title Matching
Setup
Application

Connecting Articles in a Graph

The first link in the main body text identifies a hierarchical relationship between articles: banana links to fruit, piano to musical instruments, and so on.

The search engine constructs a directed graph connecting the 11 million English articles (6 million redirects), with an edge from each article to the article its first link points to. For those curious about the network's topology, see the research that inspired this project.

Example: Piano

Parent	Comparable	Children
musical instruments	Music box, Violin family, Glass harmonica	Piano Music, Piano music, Grand Piano, Lily Maisky, William Merrigan Daly
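As a rough illustration of how the three columns above could be read off a first-link graph, here is a minimal in-memory sketch. It assumes "comparable" means articles that share the same parent, which is an interpretation of the example rather than something taken from the project's code. The function names and the toy links are hypothetical.

```python
from collections import defaultdict

def build_graph(first_links):
    """Return (parent, children) lookups from an {article: first_link_target} map."""
    parent = dict(first_links)
    children = defaultdict(list)
    for article, target in first_links.items():
        children[target].append(article)
    return parent, children

def neighborhood(title, parent, children):
    """Collect the parent, same-parent siblings, and children of one article."""
    up = parent.get(title)
    # Assumption: "comparable" articles are those whose first link matches ours.
    siblings = [a for a in children.get(up, []) if a != title] if up is not None else []
    return {
        "parent": up,
        "comparable": siblings,
        "children": list(children.get(title, [])),
    }

# Toy data for illustration only, not real Wikipedia first links.
first_links = {
    "Piano": "Musical instruments",
    "Music box": "Musical instruments",
    "Violin family": "Musical instruments",
    "Grand Piano": "Piano",
    "Piano music": "Piano",
}
parent, children = build_graph(first_links)
result = neighborhood("Piano", parent, children)
```

At the scale described above the project keeps these relationships in a graph database rather than in Python dictionaries; the sketch only shows the shape of the lookup.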

Graph Implementation

Download the entire XML dump, available here: https://dumps.wikimedia.org/enwiki/
Extract the first link in the main body text (get_first_link.py)
- distributed computation using Spark DataFrames (on an 8-node AWS cluster)
- the Databricks XML package is used to delineate each page: https://github.com/databricks/spark-xml
Store the graph in Neo4j
- index articles by title and add page views as a property of each article
  - uses bulk import, which also includes page views as an attribute for each node
  - queries can filter results by page views to return the most relevant articles

Note: matching titles between the available hourly page view data and the displayed article titles is imperfect (see match_views.py).
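To make the extraction step more concrete, the following is a simplified sketch of pulling the first link out of one page's wikitext. It is not the contents of get_first_link.py: the function names, the regular expression, and the list of skipped namespaces are all assumptions, and it ignores cases such as links nested inside image captions or parenthesized text.

```python
import re

# Matches [[Target]], [[Target|label]], and [[Target#Section|label]].
LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Assumption: links into these namespaces are not part of the body text.
SKIPPED_PREFIXES = ("file:", "image:", "category:")

def strip_nested(text, open_tok, close_tok):
    """Remove balanced open_tok...close_tok spans, including nested ones."""
    out, depth, i = [], 0, 0
    while i < len(text):
        if text.startswith(open_tok, i):
            depth += 1
            i += len(open_tok)
        elif depth and text.startswith(close_tok, i):
            depth -= 1
            i += len(close_tok)
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)

def first_link(wikitext):
    """Return the target of the first body link, or None if there is none."""
    body = strip_nested(wikitext, "{{", "}}")  # drop infoboxes and other templates
    for match in LINK.finditer(body):
        target = match.group(1).strip()
        if not target.lower().startswith(SKIPPED_PREFIXES):
            return target
    return None

sample = (
    "{{Infobox|image=[[File:x.jpg]]}}\n"
    "The '''piano''' is a [[musical instrument|instrument]] with a [[keyboard]]."
)
print(first_link(sample))  # musical instrument
```

In a Spark job, a function like this could be applied to the text column of each page row; how the project actually wires it into DataFrames is not shown here.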

Fuzzy Title Matching

In addition to the graph, the first 2000 characters of the main body text are indexed for fuzzy title searching.

powered by Elasticsearch
indexing is distributed using Scala
- note: the PySpark Databricks XML package and EsSpark (the connector from Spark to Elasticsearch) are not compatible
- build the jar using Maven and run the Scala job (see pom.xml and index_wiki.scala)
the Elasticsearch query weights the title twice as heavily as the introductory body text

Example: "paper" --> "Pulp (paper)"

lower-case "paper" is matched to the correct Wikipedia article title
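One way such a query might look is a multi_match query with a boost on the title field. This is a sketch under assumptions: the field names "title" and "text", the use of "fuzziness": "AUTO", and the helper name build_title_query are illustrative and are not taken from the project's code.

```python
def build_title_query(search_term, size=1):
    """Build an Elasticsearch query body that favors title matches 2x."""
    return {
        "size": size,  # only the closest match is needed
        "query": {
            "multi_match": {
                "query": search_term,
                # "^2" boosts the title field relative to the body text.
                "fields": ["title^2", "text"],
                # Assumption: tolerate small typos in the search term.
                "fuzziness": "AUTO",
            }
        },
    }

query = build_title_query("paper")
```

The resulting dictionary could be sent as the request body to the index's _search endpoint. The top hit's title would then be the entry point into the graph.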

Setup

To install dependencies:

pip install -r requirements.txt

For distributed computations, the program also requires Spark, Java > 7, and Scala.

The program expects its configuration in configs.py, which sets environment variables for the database and Spark nodes:

import os

os.environ["master_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["elasticsearch_node_dns"] = "ec2-xx-xx-xx.compute-1.amazonaws.com"
os.environ["neo4j_pass"] = "xxxx"
os.environ["neo4j_ip"] = "xx.xxx.xx"
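Code that depends on these variables may want to fail early when one is missing. The helper below is a hypothetical sketch of such a check, not part of the project. The name load_settings is an assumption, and the key list simply mirrors the variables shown above.

```python
import os

REQUIRED_KEYS = (
    "master_node_dns",
    "elasticsearch_node_dns",
    "neo4j_pass",
    "neo4j_ip",
)

def load_settings(environ=os.environ):
    """Return the required settings, raising KeyError if any are unset or empty."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise KeyError("missing configuration: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}
```

Calling load_settings() after importing configs.py would then return a plain dictionary of the four values.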

Application

Flow based on search term:

match search term to the closest title (using Elasticsearch query above)
fetch a subset of the network (parent, comparable, and child articles)
filter the subgraph by page views to return only the most relevant subset

The front-end serves this result in a directed D3 graph.
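The page view filter in the last step of the flow can be pictured as a simple ranking. The sketch below assumes page views are available as a title-to-count mapping and that "most relevant" means "most viewed"; the function name, the default limit, and the view counts are made up for illustration.

```python
def most_viewed(articles, page_views, limit=5):
    """Keep the `limit` articles with the highest page view counts."""
    ranked = sorted(
        articles,
        key=lambda title: page_views.get(title, 0),  # unmatched titles count as 0
        reverse=True,
    )
    return ranked[:limit]

# Illustrative counts only, not real page view data.
views = {"Grand Piano": 900, "Piano music": 300, "Lily Maisky": 40}
top_children = most_viewed(["Lily Maisky", "Grand Piano", "Piano music"], views, limit=2)
```

In the project itself this filtering happens in the graph query, since page views are stored as a property on each node.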

marksibrahim/knowledge_search
