
Finding concepts for keywords

Gregor Leban edited this page Mar 20, 2019 · 4 revisions

When you are searching for articles or events, one of the conditions you can specify is a set of keywords. For example, to find articles about Barack Obama, you can make the following query:

from eventregistry import *

er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticles(keywords = "Barack Obama")
q.addRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

res will contain a list of articles that mention the words "Barack" and "Obama". A better and faster way to make such queries, however, is to use a concept search. In Event Registry, articles are annotated with entities (people, organizations, and locations) and important words that are mentioned in them. These annotations are called concepts. Since articles are annotated with concepts, we are also able to annotate events with concepts: an event is annotated with those concepts that appear frequently enough in the articles describing it.

Now, why would you care about using concepts? The main reason is that natural language is ambiguous, so using keywords can yield undesirable results. The same word can mean different things, and different words can often mean the same thing. This is where concepts come in. Each concept in Event Registry is represented by a unique identifier (URI), which in our case is the URL of the concept's Wikipedia page. For "Barack Obama", for example, the concept URI is http://en.wikipedia.org/wiki/Barack_Obama. Using the concept URI, we can repeat the query above in this way:

er = EventRegistry(apiKey = YOUR_API_KEY)
q = QueryArticles(conceptUri = "http://en.wikipedia.org/wiki/Barack_Obama")
q.addRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

The results would differ in two main ways:

  1. The results would also include articles in languages that use a different script, such as Russian, Arabic or Chinese. This is possible because we know how each concept is spelled in different languages, and we use the same concept URI regardless of the language in which the concept is mentioned.

  2. The results would also include articles where Barack Obama is mentioned simply as "Obama". This would be even more common with organizations or things that are often mentioned using different words, phrases or abbreviations. This feature is available because we use Wikipedia as a knowledge base and we are aware of several ways in which concepts can be mentioned.
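
To make the two differences concrete, here is a toy sketch of the idea. It is not how Event Registry annotates articles: the ALIASES table, the annotate() and keyword_match() helpers, and the sample articles are all hypothetical, and the real annotation relies on Wikipedia and is far more sophisticated.

```python
# Toy illustration only: a hypothetical alias table that maps several
# surface forms (including one in another script) to one concept URI.
OBAMA_URI = "http://en.wikipedia.org/wiki/Barack_Obama"

ALIASES = {
    "barack obama": OBAMA_URI,
    "obama": OBAMA_URI,
    "барак обама": OBAMA_URI,  # Russian spelling
}

def annotate(text):
    """Return the set of concept URIs whose aliases occur in the text."""
    lowered = text.lower()
    return {uri for alias, uri in ALIASES.items() if alias in lowered}

def keyword_match(text, keywords):
    """Naive keyword search: every word of the query must occur in the text."""
    lowered = text.lower()
    return all(word in lowered for word in keywords.lower().split())

# Made-up sample articles
articles = [
    "Barack Obama spoke in Chicago.",
    "Obama signed the bill on Tuesday.",
    "Барак Обама посетил Берлин.",
]

by_keyword = [a for a in articles if keyword_match(a, "Barack Obama")]
by_concept = [a for a in articles if OBAMA_URI in annotate(a)]

print(len(by_keyword))  # 1: only the first article contains both words
print(len(by_concept))  # 3: the alias and the Russian spelling match too
```

In this sketch the keyword search misses the article that says only "Obama" and the one written in Russian, while the concept lookup finds all three because every spelling resolves to the same URI.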

I hope the reason for preferring concepts over simple keywords is evident by now. The only question that remains is how you can find the concept URI for a concept you are interested in. The simplest way to find the concept URI from the label of some entity or word is to use the getConceptUri() API call. Here is an example:

er = EventRegistry(apiKey = YOUR_API_KEY)
uri = er.getConceptUri("sandra bullock")
q = QueryArticles(conceptUri = uri)
q.addRequestedResult(RequestArticlesInfo())
res = er.execQuery(q)

As you can see, the call uri = er.getConceptUri("sandra bullock") searches for the concept URI that best matches the label "sandra bullock". In this case, uri would get the value http://en.wikipedia.org/wiki/Sandra_Bullock. When multiple concepts match the given label, the one that appears most often in the news articles will be returned.
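
The rule for choosing between several matching concepts can be pictured with a small sketch. This is not the library's code: pick_concept(), the candidate list, and the article counts below are invented for illustration, and the actual ranking is done by Event Registry itself.

```python
def pick_concept(label, candidates):
    """Among candidates whose label contains the given text, return the URI
    of the one with the highest article count, or None if nothing matches."""
    wanted = label.lower()
    matching = [c for c in candidates if wanted in c["label"].lower()]
    if not matching:
        return None
    return max(matching, key=lambda c: c["article_count"])["uri"]

# Hypothetical candidates; the counts are made up for this example.
candidates = [
    {"label": "Paris",
     "uri": "http://en.wikipedia.org/wiki/Paris",
     "article_count": 1000},
    {"label": "Paris Hilton",
     "uri": "http://en.wikipedia.org/wiki/Paris_Hilton",
     "article_count": 50},
]

uri = pick_concept("paris", candidates)
print(uri)  # http://en.wikipedia.org/wiki/Paris
```

Both candidates match the label "paris", and the sketch returns the one with the larger count. It also returns None when nothing matches, so in real code it may be worth checking that a URI was actually found before building the query.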