Lab notes about data exploration and visualization of Wikipedia page corpora.
All the initial content and the produced datasets are stored as csv and gexf files in the data folder of each topic folder.
This analysis covers the set of pages imported from the Wikipedia List of geometry topics page. All the heavy lifting for data retrieval is done by the wekeypedia/python-toolkit/analysis-data-retrieval.py macro.
- pages: 303
- editors: 15857
- revisions: 101462
- multivariate analysis: main exploration and results from the geometry pages dataset
- time series: pageviews and revisions time series for all pages (a pandas sketch follows this list)
- page explorer: a notebook to explore data about individual pages. Very helpful if you are looking for local relationships. It includes all the analytics used by knowledge path reconstruction
- construction of the hyperlink graph: reconstruction of the natural hyperlink network between the corpus pages. It also adds an extension of the graph based on page name appearances (a networkx sketch follows this list)
- reading map based on a reduced graph of hyperlink terms
- construction of the pages-editors bipartite graph: construction of a bipartite graph made of pages and editors. It also builds a page-page graph from the projection of this bipartite graph, where two pages are linked if they share an editor (a bipartite-graph sketch follows this list)
- building a reading map based on a reduced graph of co-edited pages: a first attempt at building a stronger reading map, using the co-edited pages network instead of the hyperlink network. This includes a fair amount of mid-range lifting with networkx
- index of bots: list of bots active in the corpus
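
As a rough illustration of the time series notebook, here is a minimal pandas sketch that turns a revision history into monthly revision counts per page. The file name `data/revisions.csv` and its `page`/`timestamp` columns are assumptions about the data layout, not the actual schema.

```python
import pandas as pd

# assumed layout: one row per revision, with "page" and "timestamp" columns
revisions = pd.read_csv("data/revisions.csv", parse_dates=["timestamp"])

# monthly number of revisions for each page
monthly = (
    revisions.set_index("timestamp")
             .groupby("page")
             .resample("M")
             .size()
             .rename("revisions")
)

print(monthly.head())
```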
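
The hyperlink graph construction can be sketched along these lines with networkx: restrict the raw link list to pairs of corpus pages, then save the result as gexf. The file names `data/pages.csv` and `data/hyperlinks.csv` are assumptions, not the actual files produced by the notebooks.

```python
import csv
import networkx as nx

# corpus page titles, one per line (assumed layout)
with open("data/pages.csv", newline="") as f:
    corpus = {row[0] for row in csv.reader(f) if row}

g = nx.DiGraph()
with open("data/hyperlinks.csv", newline="") as f:
    for source, target in csv.reader(f):
        # keep only links whose both endpoints belong to the corpus
        if source in corpus and target in corpus:
            g.add_edge(source, target)

nx.write_gexf(g, "data/hyperlink-graph.gexf")
```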
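
The pages-editors bipartite graph and its page-page projection can be sketched with networkx's bipartite module; the same sketch shows the kind of weight threshold used to reduce the co-edited graph into a reading map. The contributions file, its columns, and the threshold value are assumptions.

```python
import csv
import networkx as nx
from networkx.algorithms import bipartite

# assumed layout: one (page, editor) pair per contribution
b = nx.Graph()
with open("data/contributions.csv", newline="") as f:
    for page, editor in csv.reader(f):
        b.add_node(page, bipartite=0)
        b.add_node(editor, bipartite=1)
        b.add_edge(page, editor)

pages = {n for n, d in b.nodes(data=True) if d["bipartite"] == 0}

# page-page projection: edge weight = number of shared editors
pp = bipartite.weighted_projected_graph(b, pages)

# reduce the projection by keeping only strongly co-edited pairs
threshold = 5  # assumption: minimum number of shared editors
reduced = nx.Graph()
reduced.add_nodes_from(pages)
reduced.add_edges_from(
    (u, v, d) for u, v, d in pp.edges(data=True) if d["weight"] >= threshold
)

nx.write_gexf(b, "data/pages-editors-bipartite.gexf")
nx.write_gexf(reduced, "data/co-edited-reading-map.gexf")
```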
- pages: 1239
- editors: 81604
- revisions: 697011
- pages: 4
- parsing and decoding wikipedia page diffs: a guide to using the wekeypedia toolkit to make sense of diffs (a sketch using the public MediaWiki compare API follows this list)
- words of love and wisdom: numerical exploration of term distributions over the corpus pages (an NLTK sketch follows this list)
- find love on wikipedia with NLTK: builds on the diff work to extract signs of love on wikipedia
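
The wekeypedia toolkit itself is not reproduced here; as a rough equivalent, this sketch fetches and decodes a diff through the public MediaWiki compare API and pulls out the added and removed fragments. The revision ids are placeholders.

```python
import requests
from bs4 import BeautifulSoup

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "compare",
    "fromrev": 123456789,  # placeholder revision ids
    "torev": 123456790,
    "format": "json",
}

# the compare endpoint returns the diff as an HTML table fragment
diff_html = requests.get(API, params=params).json()["compare"]["*"]
soup = BeautifulSoup(diff_html, "html.parser")

# text of the lines added and removed between the two revisions
added = [td.get_text() for td in soup.select("td.diff-addedline")]
removed = [td.get_text() for td in soup.select("td.diff-deletedline")]
```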
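
The term-distribution notebooks boil down to counting tokens over the corpus pages; here is a minimal NLTK sketch of that count. The input file `data/pages-content.csv` and its (title, text) layout are assumptions.

```python
import csv
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist

nltk.download("punkt")
nltk.download("stopwords")

stop = set(stopwords.words("english"))
dist = FreqDist()

# assumed layout: one (title, text) pair per corpus page
with open("data/pages-content.csv", newline="") as f:
    for title, text in csv.reader(f):
        tokens = nltk.word_tokenize(text.lower())
        dist.update(t for t in tokens if t.isalpha() and t not in stop)

# the 20 most frequent terms over the whole corpus
print(dist.most_common(20))
```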