Text analytics
The EventRegistry module also has an Analytics class that can be used to perform various text analytics. The class will be extended with additional functionality, but for now it allows you to:
- semantically annotate your documents with entities and non-entities mentioned in the document,
- categorize the document into a list of predefined categories based on the DMOZ.org taxonomy,
- compute the sentiment of the document,
- determine the language of the document.
To visually test different methods please visit our demo pages.
To semantically annotate a given document, use code such as:
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
ann = analytics.annotate("Microsoft released a new version of Windows OS.")
Categorization is currently supported only for English. To categorize a document into a predefined set of categories and identify the top related keywords, use code such as:
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
cat = analytics.categorize("Microsoft released a new version of Windows OS.")
Here is sample code to detect the sentiment expressed in the document:
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
sent = analytics.sentiment("Microsoft released a new version of Windows OS.")
Here is sample code to detect the language of the document:
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
langInfo = analytics.detectLanguage("Microsoft released a new version of Windows OS.")
By analyzing several dozen documents, you can identify the concepts and categories that the documents have in common. The sample code below demonstrates this part of the API:
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
ret = analytics.trainTopicCreateTopic("my topic")
uri = ret["uri"]
# add the documents relevant for your topic of interest
analytics.trainTopicAddDocument(uri, "Facebook has removed 18 accounts and 52 pages associated with the Myanmar military, including the page of its commander-in-chief, after a UN report accused the armed forces of genocide and war crimes.")
analytics.trainTopicAddDocument(uri, "Emmanuel Macron’s climate commitment to “make this planet great again” has come under attack after his environment minister dramatically quit, saying the French president was not doing enough on climate and other environmental goals.")
analytics.trainTopicAddDocument(uri, "Theresa May claimed that a no-deal Brexit “wouldn’t be the end of the world” as she sought to downplay a controversial warning made by Philip Hammond last week that it would cost £80bn in extra borrowing and inhibit long-term economic growth.")
# finish training of the topic
ret = analytics.trainTopicFinishTraining(uri)
assert "topic" in ret
# use the "concepts" and "categories" properties in the topic. They represent what your documents are mostly about
topic = ret["topic"]
You can analyze a larger number of tweets matching search criteria and build a topic with common concepts and categories associated with the tweets. You can determine the set of tweets to analyze by either identifying tweets based on the username (using @ as a prefix), using a hashtag (using # as a prefix) or using a regular keyword. You can choose to analyze the content of the tweets or to just analyze the links provided in the tweets.
import time
import eventregistry as ER
er = ER.EventRegistry()
analytics = ER.Analytics(er)
# enqueue the task of building a topic based on the tweets from a user
ret = analytics.trainTopicOnTweets("@SeanEllis", useTweetText = True, maxConcepts = 50, maxCategories = 20, maxTweets = 400)
assert ret and "uri" in ret
uri = ret["uri"]
# the training of the topic can take several minutes. For this reason you have to use the uri provided in the response and
# get the topic after a while
time.sleep(5)
# retrieve the topic definition. If the topic is not built yet, it will not be returned
ret = analytics.trainTopicGetTrainedTopic(uri)
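Because training can take several minutes, a single short sleep may not be enough. The sketch below polls until a topic is available or a timeout expires. Note that `wait_for_topic` is a hypothetical helper, not part of the eventregistry package. It takes the fetch function as an argument and assumes, based on the examples on this page, that a finished result is a dict containing a "topic" key while an unfinished one is not.

```python
import time


def wait_for_topic(fetch, uri, timeout=600.0, interval=10.0, sleep=time.sleep):
    """Poll fetch(uri) until the result contains "topic" or the timeout expires.

    fetch is any callable taking the topic uri, for example
    analytics.trainTopicGetTrainedTopic. Returns the topic, or None if it
    was not ready before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        ret = fetch(uri)
        # Assumption: an unfinished topic yields a response without a "topic" key.
        if isinstance(ret, dict) and "topic" in ret:
            return ret["topic"]
        if time.monotonic() >= deadline:
            return None
        sleep(interval)
```

With the objects from the example above, this could be called as `topic = wait_for_topic(analytics.trainTopicGetTrainedTopic, uri)`. The timeout and interval defaults are arbitrary and should be adjusted to your needs.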
Example of the data returned by categorize():
{
    "dmoz": {
        // top categories associated with the text
        "categories": [
            {
                // category ID
                "label": "dmoz/Computers/Companies/Microsoft_Corporation",
                // relevance of the category to the document
                "score": 0.456
            },
            ...
        ],
        // top keywords that summarize the document and their weights
        "keywords": [
            {
                "keyword": "Computers",
                "wgt": 0.160
            },
            ...
        ]
    }
}
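Assuming the sample above is what `categorize` returns, the categories can be filtered and ranked with a few lines of code. This is a minimal sketch: `top_categories` is a hypothetical helper, not part of the eventregistry package, and the second category in the sample data is invented for illustration.

```python
def top_categories(result, min_score=0.0, limit=5):
    """Return (label, score) pairs with score >= min_score, highest score first."""
    categories = result.get("dmoz", {}).get("categories", [])
    kept = [c for c in categories if c["score"] >= min_score]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return [(c["label"], c["score"]) for c in kept[:limit]]


# Sample shaped like the response above; "dmoz/Computers/Software" is invented.
category_sample = {
    "dmoz": {
        "categories": [
            {"label": "dmoz/Computers/Software", "score": 0.2},
            {"label": "dmoz/Computers/Companies/Microsoft_Corporation", "score": 0.456},
        ],
        "keywords": [{"keyword": "Computers", "wgt": 0.160}],
    }
}

for label, score in top_categories(category_sample, min_score=0.3):
    print(f"{score:.3f}  {label}")
```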
Example of the data returned by detectLanguage():
{
    "reliable": true,
    "textBytes": 32,
    // the language candidates for the document
    "languages": [
        {
            "name": "ENGLISH",
            // ISO2 code of the language
            "code": "en",
            // probability of the document being in this language
            "percent": 96,
            "score": 1321
        },
        ...
    ]
}
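Assuming the sample above is what `detectLanguage` returns, the most likely language can be picked by comparing the "percent" values. `best_language` is a hypothetical helper, not part of the eventregistry package. The "reliable" flag is passed through unchanged because this page does not document its exact meaning, and the French candidate in the sample data is invented for illustration.

```python
def best_language(result):
    """Return (code, reliable) for the candidate with the highest "percent".

    Returns (None, False) when the result lists no candidates.
    """
    candidates = result.get("languages", [])
    if not candidates:
        return None, False
    best = max(candidates, key=lambda c: c["percent"])
    return best["code"], bool(result.get("reliable", False))


# Sample shaped like the response above; the "fr" entry is invented.
language_sample = {
    "reliable": True,
    "textBytes": 32,
    "languages": [
        {"name": "FRENCH", "code": "fr", "percent": 3, "score": 40},
        {"name": "ENGLISH", "code": "en", "percent": 96, "score": 1321},
    ],
}

code, reliable = best_language(language_sample)
print(code, reliable)
```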
Example of the data returned by trainTopicGetTrainedTopic():
{
    "name": "@SeanEllis",
    "topic": {
        "concepts": [
            {
                "uri": "https://en.wikipedia.org/wiki/Amazon_(company)",
                "type": "org",
                "label": "Amazon (company)",
                "wgt": 50
            },
            ...
        ],
        "categories": [
            {
                "uri": "dmoz:Business/Investing",
                "label": "Business/Investing",
                "wgt": 42
            },
            ...
        ]
    }
}
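A trained topic can be summarized by listing its highest-weighted concepts and categories. The sketch below assumes that the topic is an object with "concepts" and "categories" lists, as the comment in the training example indicates. `top_items` is a hypothetical helper, not part of the eventregistry package, and the "Example concept" entry in the sample data is invented for illustration.

```python
def top_items(topic, kind="concepts", n=3):
    """Return the n highest-weighted (label, wgt) pairs from topic[kind]."""
    items = sorted(topic.get(kind, []), key=lambda item: item["wgt"], reverse=True)
    return [(item["label"], item["wgt"]) for item in items[:n]]


# Sample shaped like the "topic" value above; "Example concept" is invented.
topic_sample = {
    "concepts": [
        {"uri": "https://example.org/concept", "type": "org", "label": "Example concept", "wgt": 31},
        {"uri": "https://en.wikipedia.org/wiki/Amazon_(company)", "type": "org", "label": "Amazon (company)", "wgt": 50},
    ],
    "categories": [
        {"uri": "dmoz:Business/Investing", "label": "Business/Investing", "wgt": 42},
    ],
}

print(top_items(topic_sample, "concepts"))
print(top_items(topic_sample, "categories"))
```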
Example of the data returned by annotate():
{
    // the list of annotations
    "annotations": [
        {
            // the URL that uniquely identifies the concept represented by the annotation
            "url": "http://en.wikipedia.org/wiki/Microsoft",
            // the label that can be used to represent the annotation (in the language of the document)
            "title": "Microsoft",
            // the input language
            "lang": "en",
            // secondary URL that uniquely identifies the concept as a concept on English wikipedia
            "secUrl": "http://en.wikipedia.org/wiki/Microsoft",
            // label that can represent the concept in English language
            "secTitle": "Microsoft",
            "secLang": "en",
            // dbpedia URI of the concept
            "dbPediaIri": "http://dbpedia.org/resource/Microsoft",
            // dbpedia types for the concept
            "dbPediaTypes": [
                "Agent",
                "Organisation",
                "Company"
            ],
            // general categorization of the concept (person, org or loc)
            "type": "org",
            // importance of the concept for the whole document
            "wgt": 0.6666,
            // mentions of the concept in the document
            "support": [
                {
                    // character positions in text
                    "chFrom": 0,
                    "chTo": 8,
                    // based on the word(s) mentioned in the text, how likely it is that this is the correct annotation
                    "pMentionGivenSurface": 0.253001126280801,
                    "pageRank": 0.03690052603740375,
                    // the word/phrase that is used to mention the concept in the text
                    "text": "Microsoft",
                    // word indices
                    "wFrom": 0,
                    "wTo": 0,
                    "wikiLang": "en"
                }
            ],
            "pageRank": 0.2520778231483313,
            // wikidata id for the concept
            "wikiDataItemId": "Q2283",
            // wikidata class ids for the concept
            "wikiDataClassIds": [
                "Q891723",
                "Q1058914",
                "Q4830453",
                "Q43229",
                "Q874405",
                "Q24229398",
                "Q16334295",
                "Q58778",
                "Q35120",
                "Q16334298",
                "Q286583",
                "Q17519152",
                "Q517966",
                "Q223557",
                "Q16889133",
                "Q18844919",
                "Q488383",
                "Q5127848"
            ],
            // wikidata class ids and names
            "wikiDataClasses": [
                {
                    "enLabel": "public company",
                    "itemId": "Q891723"
                },
                {
                    "enLabel": "software house",
                    "itemId": "Q1058914"
                },
                ...
            ]
        },
        ...
    ],
    // list of nouns identified in the document
    "nouns": [
        {
            // starting and ending indices of the noun
            "iFrom": 25,
            "iTo": 31,
            // normalized form of the text
            "normForm": "version",
            // list of Wordnet synset IDs for the word
            "synsetIds": [
                "101267901",
                "105840650",
                "105928513",
                "106408779",
                "106536389",
                "107173585"
            ]
        },
        ...
    ],
    // list of adjectives found in the document
    "adjectives": [
        {
            // position in the document
            "iFrom": 21,
            "iTo": 23,
            // normalized form of the adjective
            "normForm": "new",
            // wordnet synset ids
            "synsetIds": [
                "300024996",
                "300128733",
                "300818008",
                "300937186",
                "301640850",
                "301687167",
                "301687965",
                "302070491",
                "302584699"
            ]
        },
        ...
    ],
    // list of verbs identified in the document
    "verbs": [
        {
            // text positions
            "iFrom": 10,
            "iTo": 17,
            // normalized form of the verb
            "normForm": "release",
            // wordnet synsets
            "synsetIds": [
                "200069295",
                "200104868",
                "200269682",
                "200967625",
                "201436518",
                "201474550",
                "201757994",
                "202316304",
                "202421374",
                "202494047"
            ]
        },
        ...
    ],
    // list of adverbs
    "adverbs": [
    ],
    // there are other returned properties that don't have significant importance for the user
}
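Assuming the sample above is what `annotate` returns, the mentions of each concept can be mapped back to the original text using the character offsets under "support". This is a minimal sketch: `entity_mentions` is a hypothetical helper, not part of the eventregistry package, and it assumes that "chFrom" and "chTo" are inclusive offsets, which is what the sample suggests ("Microsoft" spans 0 to 8).

```python
def entity_mentions(text, result):
    """Yield (title, type, surface text) for every mention listed under "support"."""
    for ann in result.get("annotations", []):
        for mention in ann.get("support", []):
            # Assumption: chFrom/chTo are inclusive character offsets into text.
            start, end = mention["chFrom"], mention["chTo"]
            yield ann["title"], ann.get("type"), text[start:end + 1]


text = "Microsoft released a new version of Windows OS."

# Trimmed-down sample with only the fields used here, copied from the response above.
annotation_sample = {
    "annotations": [
        {
            "title": "Microsoft",
            "type": "org",
            "wgt": 0.6666,
            "support": [{"chFrom": 0, "chTo": 8, "text": "Microsoft"}],
        }
    ]
}

for title, kind, surface in entity_mentions(text, annotation_sample):
    print(title, kind, repr(surface))
```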