<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Discovering structure in your data: an overview of clustering</title>
<link rel="stylesheet" href="_static/basic.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/bootswatch-3.3.4/lumen/bootstrap.min.css" type="text/css" />
<link rel="stylesheet" href="_static/bootstrap-sphinx.css" type="text/css" />
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT: './',
VERSION: '0.1.0',
COLLAPSE_INDEX: false,
FILE_SUFFIX: '.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="_static/jquery.js"></script>
<script type="text/javascript" src="_static/underscore.js"></script>
<script type="text/javascript" src="_static/doctools.js"></script>
<script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script type="text/javascript" src="_static/js/jquery-1.11.0.min.js"></script>
<script type="text/javascript" src="_static/js/jquery-fix.js"></script>
<script type="text/javascript" src="_static/bootstrap-3.3.4/js/bootstrap.min.js"></script>
<script type="text/javascript" src="_static/bootstrap-sphinx.js"></script>
<link rel="top" title="None" href="index.html" />
<link rel="up" title="Tutorials" href="docs.html" />
<link rel="next" title="Finding the number of clusters with the Dirichlet Process" href="ncluster.html" />
<link rel="prev" title="Tutorials" href="docs.html" />
<meta charset='utf-8'>
<meta http-equiv='X-UA-Compatible' content='IE=edge,chrome=1'>
<meta name='viewport' content='width=device-width, initial-scale=1.0, maximum-scale=1'>
<meta name="apple-mobile-web-app-capable" content="yes">
</head>
<body role="document">
<div id="navbar" class="navbar navbar-inverse navbar-default navbar-fixed-top">
<div class="container">
<div class="navbar-header">
<!-- .btn-navbar is used as the toggle for collapsed navbar content -->
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">
datamicroscopes</a>
<span class="navbar-text navbar-version pull-left"><b>0.1</b></span>
</div>
<div class="collapse navbar-collapse nav-collapse">
<ul class="nav navbar-nav">
<li><a href="https://github.com/datamicroscopes">GitHub</a></li>
<li><a href="https://qadium.com/">Qadium</a></li>
<li class="dropdown globaltoc-container">
<a role="button"
id="dLabelGlobalToc"
data-toggle="dropdown"
data-target="#"
href="index.html">Site <b class="caret"></b></a>
<ul class="dropdown-menu globaltoc"
role="menu"
aria-labelledby="dLabelGlobalToc"><ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="">Discovering structure in your data: an overview of clustering</a></li>
<li class="toctree-l1"><a class="reference internal" href="ncluster.html">Finding the number of clusters with the Dirichlet Process</a></li>
<li class="toctree-l1"><a class="reference internal" href="enron_blog.html">Network Modeling with the Infinite Relational Model</a></li>
<li class="toctree-l1"><a class="reference internal" href="topic.html">Bayesian Nonparametric Topic Modeling with the Daily Kos</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="datatypes.html">Datatypes and Bayesian Nonparametric Models</a></li>
<li class="toctree-l1"><a class="reference internal" href="bb.html">Binary Data with the Beta Bernoulli Distribution</a></li>
<li class="toctree-l1"><a class="reference internal" href="dd.html">Categorical Data and the Dirichlet Discrete Distribution</a></li>
<li class="toctree-l1"><a class="reference internal" href="niw.html">Real Valued Data and the Normal Inverse-Wishart Distribution</a></li>
<li class="toctree-l1"><a class="reference internal" href="nic.html">Univariate Data with the Normal Inverse Chi-Square Distribution</a></li>
<li class="toctree-l1"><a class="reference internal" href="gamma_poisson.html">Count Data and Ordinal Data with the Gamma-Poisson Distribution</a></li>
</ul>
<ul>
<li class="toctree-l1"><a class="reference internal" href="gauss2d.html">Inferring Gaussians with the Dirichlet Process Mixture Model</a></li>
<li class="toctree-l1"><a class="reference internal" href="mnist_predictions.html">Digit recognition with the MNIST dataset</a></li>
<li class="toctree-l1"><a class="reference internal" href="enron_email.html">Clustering the Enron e-mail corpus using the Infinite Relational Model</a></li>
<li class="toctree-l1"><a class="reference internal" href="hdp.html">Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process</a></li>
</ul>
<ul class="current">
<li class="toctree-l1 current"><a class="reference internal" href="docs.html">Tutorials</a><ul class="current">
<li class="toctree-l2 current"><a class="current reference internal" href="">Discovering structure in your data: an overview of clustering</a></li>
<li class="toctree-l2"><a class="reference internal" href="ncluster.html">Finding the number of clusters with the Dirichlet Process</a></li>
<li class="toctree-l2"><a class="reference internal" href="enron_blog.html">Network Modeling with the Infinite Relational Model</a></li>
<li class="toctree-l2"><a class="reference internal" href="topic.html">Bayesian Nonparametric Topic Modeling with the Daily Kos</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="docs.html#datatypes-and-likelihood-models-in-datamicroscopes">Datatypes and likelihood models in datamicroscopes</a><ul>
<li class="toctree-l2"><a class="reference internal" href="datatypes.html">Datatypes and Bayesian Nonparametric Models</a></li>
<li class="toctree-l2"><a class="reference internal" href="bb.html">Binary Data with the Beta Bernoulli Distribution</a></li>
<li class="toctree-l2"><a class="reference internal" href="dd.html">Categorical Data and the Dirichlet Discrete Distribution</a></li>
<li class="toctree-l2"><a class="reference internal" href="niw.html">Real Valued Data and the Normal Inverse-Wishart Distribution</a></li>
<li class="toctree-l2"><a class="reference internal" href="nic.html">Univariate Data with the Normal Inverse Chi-Square Distribution</a></li>
<li class="toctree-l2"><a class="reference internal" href="gamma_poisson.html">Count Data and Ordinal Data with the Gamma-Poisson Distribution</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="docs.html#examples">Examples</a><ul>
<li class="toctree-l2"><a class="reference internal" href="gauss2d.html">Inferring Gaussians with the Dirichlet Process Mixture Model</a></li>
<li class="toctree-l2"><a class="reference internal" href="mnist_predictions.html">Digit recognition with the MNIST dataset</a></li>
<li class="toctree-l2"><a class="reference internal" href="enron_email.html">Clustering the Enron e-mail corpus using the Infinite Relational Model</a></li>
<li class="toctree-l2"><a class="reference internal" href="hdp.html">Learning Topics in The Daily Kos with the Hierarchical Dirichlet Process</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="api.html">API Reference</a><ul>
<li class="toctree-l2"><a class="reference internal" href="microscopes.common.dataview.html">dataviews</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.common.util.html">util</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.common.random.html">microscopes.common.random</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.common.query.html">query</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.common.validator.html">microscopes.common.validator</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.kernels.parallel.html">parallel</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.mixture.html">mixturemodel</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.irm.html">irm</a></li>
<li class="toctree-l2"><a class="reference internal" href="microscopes.kernels.html">kernels</a></li>
<li class="toctree-l2"><a class="reference internal" href="api.html#indices-and-tables">Indices and tables</a></li>
</ul>
</li>
</ul>
</ul>
</li>
<li class="dropdown">
<a role="button"
id="dLabelLocalToc"
data-toggle="dropdown"
data-target="#"
href="#">Contents <b class="caret"></b></a>
<ul class="dropdown-menu localtoc"
role="menu"
aria-labelledby="dLabelLocalToc"><ul>
<li><a class="reference internal" href="#">Discovering structure in your data: an overview of clustering</a></li>
</ul>
</ul>
</li>
<li class="hidden-sm">
<div id="sourcelink">
<a href="_sources/intro.txt"
rel="nofollow">Source</a>
</div></li>
</ul>
<form class="navbar-form navbar-right" action="search.html" method="get">
<div class="form-group">
<input type="text" name="q" class="form-control" placeholder="Search" />
</div>
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
</div>
<div class="container">
<div class="row">
<div class="col-md-12">
<div class="section" id="discovering-structure-in-your-data-an-overview-of-clustering">
<span id="intro"></span><h1>Discovering structure in your data: an overview of clustering<a class="headerlink" href="#discovering-structure-in-your-data-an-overview-of-clustering" title="Permalink to this headline">¶</a></h1>
<hr class="docutils" />
<p>Often a dataset can be explained by a small number of unknown groups, or clusters. These clusters can arise because the data has a small number of underlying causes. Formally, the assumption behind clustering is that there exist <span class="math">\(K\)</span> latent classes in the data. The goal of clustering is to classify the observations in the data into these <span class="math">\(K\)</span> latent classes.</p>
<p>In the Bayesian context, we assume each cluster is characterized by a probability distribution conditioned on its cluster assignment; these distributions are the likelihoods of the clusters. For examples of the kinds of distributions available for modeling, see our list of <a class="reference internal" href="docs.html#docs"><span>available likelihood models</span></a>.</p>
<p>To be specific, we assume the data is generated from a mixture of distributions. As a result, the models used to learn these underlying probability distributions are called mixture models.</p>
<p>Let’s consider the example of 2-dimensional real valued data:</p>
<img alt="_images/sim2d.png" src="_images/sim2d.png" />
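<p>To make the generative story above concrete, data like this can be simulated with a short NumPy sketch. This is an illustrative generator, not the tutorial’s actual one: the mixture weights, means, and shared identity covariance here are assumptions chosen for clarity.</p>
<div class="code python highlight-python"><div class="highlight"><pre>import numpy as np

rng = np.random.RandomState(0)

# assumed mixture parameters, chosen for illustration only
weights = np.array([0.5, 0.3, 0.2])                      # P(cluster = k)
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])  # one mean per cluster
cov = np.eye(2)                                          # shared identity covariance

# each observation first draws a latent class, then a point from that class
z = rng.choice(3, size=500, p=weights)
Y = np.array([rng.multivariate_normal(means[k], cov) for k in z])
</pre></div></div>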
<p>Since the data is real valued, we’ll assume the data is distributed <a class="reference internal" href="niw.html#niw"><span>multivariate Gaussian</span></a>:</p>
<div class="math">
\[\mathbf{x} \mid \mathrm{cluster}=k \sim \mathcal{N}(\mu_{k},\Sigma_{k})\]</div>
<p>As a result, we will use a Gaussian Mixture Model to learn these underlying clusters, estimating the mean and covariance of each Gaussian. We’ll select the <a class="reference internal" href="niw.html#niw"><span>normal-inverse-Wishart</span></a> likelihood model, since the normal-inverse-Wishart is the conjugate prior of the multivariate Gaussian distribution. With these parameters, we can estimate the probability that a new observation was generated by each of the <span class="math">\(K\)</span> clusters.</p>
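<p>The last step, scoring a new observation against each cluster, follows directly from Bayes’ rule: weight each cluster’s Gaussian density by its mixture weight and normalize. A minimal NumPy sketch with assumed, not learned, parameters for two clusters:</p>
<div class="code python highlight-python"><div class="highlight"><pre>import numpy as np

def gaussian_pdf(x, mu, cov):
    """Density of a multivariate Gaussian evaluated at x."""
    d = len(mu)
    diff = x - mu
    quad = diff.dot(np.linalg.solve(cov, diff))
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# assumed parameters for K = 2 clusters, for illustration only
mus = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
pis = [0.6, 0.4]  # mixture weights

x_new = np.array([4.5, 5.2])
likes = np.array([pi * gaussian_pdf(x_new, mu, cov)
                  for pi, mu, cov in zip(pis, mus, covs)])
resp = likes / likes.sum()  # P(cluster = k | x_new), the responsibilities
</pre></div></div>
<p>Here <code class="docutils literal"><span class="pre">x_new</span></code> lies near the second cluster’s mean, so its responsibility for that cluster dominates.</p>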
<p>Most clustering algorithms require the number of clusters to be known in advance: because cluster assignments are categorical, <span class="math">\(K\)</span> must be fixed up front so that the space of possible assignments is finite.</p>
<p>Using the Dirichlet Process, we can learn the number of clusters as we learn each cluster’s parameters.</p>
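<p>One way to see how the Dirichlet Process sidesteps fixing <span class="math">\(K\)</span> is its Chinese restaurant process representation: each observation joins an existing cluster with probability proportional to that cluster’s size, or opens a new cluster with probability proportional to a concentration parameter <span class="math">\(\alpha\)</span>. A toy sketch of this prior over assignments (the prior only, not the inference datamicroscopes performs):</p>
<div class="code python highlight-python"><div class="highlight"><pre>import numpy as np

def crp_assignments(n, alpha, rng):
    """Sequentially assign n observations to clusters under a CRP(alpha) prior."""
    counts = []       # number of observations in each existing cluster
    assignments = []
    for i in range(n):
        # existing clusters are weighted by size; a new cluster by alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, len(counts)

rng = np.random.RandomState(0)
assignments, k = crp_assignments(1000, alpha=1.0, rng=rng)
# k is random: the number of clusters grows slowly (roughly alpha * log n)
</pre></div></div>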
<p>We will visualize our data to examine the cluster assignments.</p>
<div class="code python highlight-python"><div class="highlight"><pre><span class="k">def</span> <span class="nf">plot_assignment</span><span class="p">(</span><span class="n">assignment</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">Y</span><span class="p">):</span>
<span class="n">cl</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="s">'x'</span><span class="p">,</span><span class="s">'y'</span><span class="p">])</span>
<span class="n">cl</span><span class="p">[</span><span class="s">'cluster'</span><span class="p">]</span> <span class="o">=</span> <span class="n">assignment</span>
<span class="n">n_clusters</span> <span class="o">=</span> <span class="n">cl</span><span class="p">[</span><span class="s">'cluster'</span><span class="p">]</span><span class="o">.</span><span class="n">nunique</span><span class="p">()</span>
<span class="n">sns</span><span class="o">.</span><span class="n">lmplot</span><span class="p">(</span><span class="s">'x'</span><span class="p">,</span> <span class="s">'y'</span><span class="p">,</span> <span class="n">hue</span><span class="o">=</span><span class="s">"cluster"</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="n">cl</span><span class="p">,</span> <span class="n">fit_reg</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">legend</span><span class="o">=</span><span class="p">(</span><span class="n">n_clusters</span><span class="o">&lt;</span><span class="mi">10</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Simulated Gaussians: </span><span class="si">%d</span><span class="s"> Learned Clusters'</span> <span class="o">%</span> <span class="n">n_clusters</span><span class="p">)</span>
</pre></div>
</div>
<p>Let’s peek at the starting state of one of our chains:</p>
<div class="code python highlight-python"><div class="highlight"><pre><span class="n">plot_assignment</span><span class="p">(</span><span class="n">latents</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">assignments</span><span class="p">())</span>
</pre></div>
</div>
<img alt="_images/gauss2d_14_0.png" src="_images/gauss2d_14_0.png" />
<p>Let’s watch one of the chains evolve for a few steps:</p>
<div class="code python highlight-python"><div class="highlight"><pre><span class="n">first_runner</span> <span class="o">=</span> <span class="n">runners</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="n">first_runner</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">r</span><span class="o">=</span><span class="n">prng</span><span class="p">,</span> <span class="n">niters</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plot_assignment</span><span class="p">(</span><span class="n">first_runner</span><span class="o">.</span><span class="n">get_latent</span><span class="p">()</span><span class="o">.</span><span class="n">assignments</span><span class="p">())</span>
</pre></div>
</div>
<img alt="_images/gauss2d_16_0.png" src="_images/gauss2d_16_0.png" />
<img alt="_images/gauss2d_16_1.png" src="_images/gauss2d_16_1.png" />
<img alt="_images/gauss2d_16_2.png" src="_images/gauss2d_16_2.png" />
<img alt="_images/gauss2d_16_3.png" src="_images/gauss2d_16_3.png" />
<img alt="_images/gauss2d_16_4.png" src="_images/gauss2d_16_4.png" />
<p>Now let’s burn all our runners in for 100 iterations.</p>
<div class="code python highlight-python"><div class="highlight"><pre><span class="k">for</span> <span class="n">runner</span> <span class="ow">in</span> <span class="n">runners</span><span class="p">:</span>
<span class="n">runner</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">r</span><span class="o">=</span><span class="n">prng</span><span class="p">,</span> <span class="n">niters</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</pre></div>
</div>
<p>Let’s now peek again at the state of the first chain:</p>
<img alt="_images/gauss2d_20_0.png" src="_images/gauss2d_20_0.png" />
<p>Because this model was run on simulated data, we can compare the results to our actual underlying assignments:</p>
<img alt="_images/gauss2d_8_1.png" src="_images/gauss2d_8_1.png" />
<p>To learn more about the code that generated this example, see <a class="reference internal" href="gauss2d.html#gauss2d"><span>Inferring Gaussians with the Dirichlet Process Mixture Model</span></a>.</p>
</div>
</div>
</div>
</div>
<center> Datamicroscopes is developed by <a href="http://www.qadium.com">Qadium</a>, with funding from the <a href="http://www.darpa.mil">DARPA</a> <a href="http://www.darpa.mil/program/xdata">XDATA</a> program. Copyright Qadium 2015. </center>
</body>
</html>