Erispoe edited this page Sep 3, 2014 · 4 revisions

GCorpusAnalytics (GCA) is a Python package that serializes queries on Google services such as Books, Patents, and Scholar. When you query one of these services, Google returns the results matching your criteria along with an estimate of the total number of results. This estimate indicates how many items in the corpus meet the request. The count changes across dimensions, especially time. Provided that we can evaluate the quality of the queried Google corpus, we can learn a great deal by studying how this number changes across dimensions. Doing this by hand is cumbersome, however, as testing several expressions over numerous timespans can require several hundred queries. GCA does the job for you: feed it a well-formatted request indicating which dimensions you want to serialize, and it will return a table with result counts for every combination.
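To give a feel for why the number of queries grows so quickly, here is a minimal sketch of expanding a set of dimensions into one query per combination. It does not use GCA's actual API or request format: the `expand_queries` helper and the `dimensions` dictionary are hypothetical names chosen for illustration only.

```python
from itertools import product


def expand_queries(dimensions):
    """Return one dict per combination of dimension values.

    `dimensions` maps a dimension name to the list of values to test.
    This is an illustrative helper, not part of GCA.
    """
    names = list(dimensions)
    value_lists = [dimensions[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical dimensions: two expressions crossed with three years.
dimensions = {
    "expression": ["internet", "internet virtual"],
    "year": [1999, 2000, 2001],
}

queries = expand_queries(dimensions)
for query in queries:
    print(query)
```

Two expressions over three years already mean six queries; a handful of expressions over a few decades quickly reaches the hundreds, which is the tedious part GCA is meant to automate.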

For instance, here is a graph of the relative proportion of books mentioning the word virtual among all the books talking about internet:

It clearly shows a peak, around 2000, in associating the two words together, possibly linked to the dot-com bubble. To learn more about this story, the method behind it, and the rationale behind GCA, read Quantitative analysis on the Google Books corpus.
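The relative proportion plotted above is simply the count for the combined expression divided by the count for the base expression, year by year. The sketch below works through that arithmetic, assuming the counts are already available as plain dictionaries. The `relative_proportion` function is a hypothetical helper, and the numbers are placeholders for illustration, not real Google Books figures.

```python
def relative_proportion(counts_both, counts_base):
    """Share of base-term results that also match the second term, per year.

    Years with a zero base count, or missing from `counts_both`, are skipped
    to avoid dividing by zero.
    """
    return {
        year: counts_both[year] / counts_base[year]
        for year in counts_base
        if counts_base[year] > 0 and year in counts_both
    }


# Placeholder counts for illustration only; NOT real Google Books data.
internet = {1999: 1000, 2000: 2000, 2001: 0}
internet_virtual = {1999: 100, 2000: 500, 2001: 0}

shares = relative_proportion(internet_virtual, internet)
print(shares)
```

Normalizing by the base expression matters because the size of the corpus itself varies from year to year; a raw count could rise simply because more books were indexed for that period.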

Here is how to get started with GCA:

Install

Format your request

Use in command line

Use in a python script

Examples
