revising docs
restructuring toc

restructured documentation

rewriting main doc page

updated readme and docs intro
bmcfee committed Aug 25, 2017
1 parent ac89968 commit d1fba77
Showing 7 changed files with 189 additions and 101 deletions.
25 changes: 23 additions & 2 deletions README.md
@@ -7,9 +7,30 @@ pescador
[![Documentation Status](https://readthedocs.org/projects/pescador/badge/?version=latest)](https://readthedocs.org/projects/pescador/?badge=latest)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.400700.svg)](https://doi.org/10.5281/zenodo.400700)

Pescador is a library for streaming (numerical) data, primarily for use in machine learning applications.

Pescador addresses the following use cases:

- **Hierarchical sampling**
- **Out-of-core learning**
- **Parallel streaming**

These use cases arise in the following common scenarios:

- Say you have three data sources `(A, B, C)` that you want to sample.
  Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
  The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like, as the sketch after this list shows!

- Now, say you have 3000 data sources that you want to sample, and they're too large to fit in RAM all at once.
  Pescador makes it easy to interleave these sources while maintaining a small *working set*.
  Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.

- If loading data incurs substantial latency (e.g., due to storage access or pre-processing), it can become a bottleneck.
  Pescador makes it easy to move data loading into a background process, so that your main thread can continue working.
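
A minimal sketch of the first scenario. The `Mux` arguments `k` (number of simultaneously active streams) and `weights` used here are assumptions about the API, not a definitive signature; see the documentation for details.

```python
import numpy as np
import pescador
from itertools import islice

# Three toy data sources, standing in for (A, B, C)
A, B, C = (np.random.randn(100, 5) for _ in range(3))

def sampler(data):
    """Yield random rows of `data` indefinitely, as pescador-style dicts."""
    while True:
        yield dict(X=data[np.random.randint(len(data))])

# Wrap each source in a Streamer so it can be (re)started on demand
streams = [pescador.Streamer(sampler, source) for source in (A, B, C)]

# Interleave into a single randomized stream D <- (A, B, C),
# sampled 50/25/25 rather than uniformly
mux = pescador.Mux(streams, k=3, rate=None, weights=[0.5, 0.25, 0.25])

for sample in islice(mux, 10):
    print(sample['X'].shape)
```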


Want to learn more? [Read the docs!](http://pescador.readthedocs.org)


Installation
31 changes: 18 additions & 13 deletions docs/example1.rst
@@ -1,18 +1,20 @@
.. _example1:

Streaming data
==============

This example will walk through the basics of using pescador to stream samples from a generator.

Our running example will be learning from an infinite stream of stochastically perturbed samples from the Iris dataset.


Sample generators
-----------------
Streamers are intended to transparently pass data without modifying it.
However, Pescador assumes that Streamers produce output in a particular format.
Specifically, each sample is expected to be a Python dictionary in which each value is an `np.ndarray`.
For unsupervised learning (e.g., scikit-learn's `MiniBatchKMeans`), a sample might contain only one key: `X`.
For supervised learning (e.g., `SGDClassifier`), a valid sample would contain both `X` and `Y` keys, of equal length.
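
For instance, a valid supervised-learning sample might look like the following (a purely illustrative construction):

.. code-block:: python

    import numpy as np

    # One sample: a feature vector `X` with a matching label `Y`
    sample = dict(X=np.random.randn(1, 4), Y=np.array([0]))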

Here's a simple example generator that draws random samples of data from the Iris dataset and adds Gaussian noise to the features.

@@ -43,7 +45,6 @@
sample['Y'] is a scalar `np.ndarray` of shape `()`
'''
n, d = X.shape
while True:
@@ -53,16 +54,20 @@
yield dict(X=X[i] + noise, Y=Y[i])
In the code above, `noisy_samples` is a generator function that can be sampled indefinitely, because it contains an infinite loop.
Each iterate of `noisy_samples` will be a dictionary containing the sample's features and labels.


Streamers
---------

Generators in Python have a couple of limitations for common stream-learning pipelines.
First, once instantiated, a generator cannot be "restarted".
Second, an instantiated generator cannot be serialized directly, which makes generators difficult to use in distributed computation environments.
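
For example, a plain generator is exhausted after a single pass and cannot be rewound:

.. code-block:: python

    def count_to(n):
        yield from range(n)

    g = count_to(3)
    print(list(g))  # [0, 1, 2]
    print(list(g))  # [] -- exhausted; the generator cannot be restarted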

Pescador provides the `Streamer` class to circumvent these issues.
`Streamer` provides an object container for an uninstantiated generator (and its parameters), along with an access method `generate()`.
Calling `generate()` multiple times on a `Streamer` object is equivalent to restarting the generator, which makes it easy to implement multiple-pass streams.
Similarly, because a `Streamer` can be serialized, it is simple to pass a streamer object to a separate process for parallel computation.

Here's a simple example, using the generator from the previous section.
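
A sketch of this pattern, assuming the `Streamer(generator, *args)` construction and `generate()` access method described above; `X` and `Y` stand for the Iris features and labels:

.. code-block:: python

    import pescador
    from itertools import islice

    # Wrap the *uninstantiated* generator function and its arguments
    streamer = pescador.Streamer(noisy_samples, X, Y)

    # generate() instantiates a fresh copy of the generator;
    # calling it again later restarts the stream from scratch
    for sample in islice(streamer.generate(), 5):
        print(sample['X'].shape, sample['Y'])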

8 changes: 5 additions & 3 deletions docs/example2.rst
@@ -1,13 +1,14 @@
.. _example2:

This example demonstrates how to re-use and multiplex streamers.

We will assume a working understanding of the simple example in the previous section.

Stream re-use and multiplexing
==============================

The `Mux` streamer provides a powerful interface for randomly interleaving samples from multiple input streams.
`Mux` can also dynamically activate and deactivate individual `Streamers`, which allows it to operate on a bounded subset of streams at any given time.

As a concrete example, we can simulate a mixture of noisy streams with differing variances.

@@ -66,7 +67,8 @@
print('Test accuracy: {:.3f}'.format(accuracy_score(Y[test], Ypred)))
In the above example, each `Streamer` in `streams` can produce infinitely many samples.
The `rate=64` argument to `Mux` says that each stream should produce some `n` samples, where `n` is drawn from a Poisson distribution with rate parameter `rate`.
When a stream exceeds its bound, it is deactivated, and a new streamer is activated to fill its place.

Setting `rate=None` disables the random stream bounding, and `mux()` simply runs each active stream until exhaustion.
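
A sketch of both constructions; the `k` argument, controlling how many streams are simultaneously active, is an assumption about the `Mux` signature, and `streams` refers to the list of streamers built in the example above:

.. code-block:: python

    import pescador

    # Keep 3 streams active; each activated stream yields ~Poisson(64)
    # samples before being deactivated and replaced with a fresh one
    mux = pescador.Mux(streams, k=3, rate=64)

    # With rate=None, each active stream runs until exhaustion instead
    mux_unbounded = pescador.Mux(streams, k=3, rate=None)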

7 changes: 5 additions & 2 deletions docs/example3.rst
@@ -3,7 +3,11 @@
Sampling from disk
==================

A common use case for `pescador` is to sample data from a large collection of existing archives.
As a concrete example, consider the problem of fitting a statistical model to a large corpus of musical recordings.
When the corpus is sufficiently large, it is impossible to fit the entire set in memory while estimating the model parameters.
Instead, one can pre-process each song to store pre-computed features (and, optionally, target labels) in a *numpy zip* `NPZ` archive.
The problem then becomes sampling data from a collection of `NPZ` archives.

Here, we will assume that the pre-processing has already been done so that each `NPZ` file contains a numpy array of features `X` and labels `Y`.
We will define infinite samplers that pull `n` examples per iterate.
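
A sketch of such a sampler; the function name and contiguous-batch strategy here are illustrative, and the full example may differ in details:

.. code-block:: python

    import numpy as np

    def npz_sampler(npz_file, n):
        """Yield batches of `n` consecutive examples from one NPZ archive, forever."""
        data = np.load(npz_file)
        X, Y = data['X'], data['Y']

        num_examples = len(X)
        while True:
            # Pick a random starting position for a contiguous batch
            idx = np.random.randint(num_examples - n)
            yield dict(X=X[idx:idx + n], Y=Y[idx:idx + n])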
@@ -86,7 +90,6 @@ Alternatively, *memory-mapping* can be used to only load data as needed, but req…
yield dict(X=X[idx:idx + n],
Y=Y[idx:idx + n])
# Using this streamer is similar to the first example, but now you need a separate
# NPY file for each X and Y
npy_x_files = #LIST OF PRE-COMPUTED NPY FILES (X)
136 changes: 57 additions & 79 deletions docs/index.rst
@@ -3,96 +3,72 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. _pescador:

2. An "iterable" is an object that can produce iterators, i.e. via `__iter__` / `iter()`. (`iterable definition <https://docs.python.org/3/glossary.html#term-iterable>`_)
########
Pescador
########

3. A "stream" is the sequence of objects produced by an iterator.
Pescador is a library for streaming (numerical) data, primarily for use in machine learning applications.

4. A "generator" (or more precisely "generator function") is a callable object that returns a single generator iterator. (`generator definition <https://docs.python.org/3/glossary.html#term-generator>`_)
Pescador addresses the following use cases:

- **Hierarchical sampling**
- **Out-of-core learning**
- **Parallel streaming**

These use cases arise in the following common scenarios:

- Say you have three data sources `(A, B, C)` that you want to sample.
  Pescador can dynamically interleave these sources to provide a randomized stream `D <- (A, B, C)`.
  The distribution over `(A, B, C)` need not be uniform: you can specify any distribution you like!

- Now, say you have 3000 data sources that you want to sample, and they're too large to fit in RAM all at once.
  Pescador makes it easy to interleave these sources while maintaining a small *working set*.
  Not all sources are simultaneously active, but Pescador manages the working set so you don't have to.

- If loading data incurs substantial latency (e.g., due to storage access or pre-processing), it can become a bottleneck.
  Pescador makes it easy to move data loading into a background process, so that your main thread can continue working.


To make this all possible, Pescador provides the following utilities:

- :ref:`Streamer` objects encapsulate data generators for re-use, infinite sampling, and inter-process communication.
- :ref:`Mux` objects allow flexible sampling from multiple streams.
- :ref:`ZMQStreamer` provides parallel processing with low communication overhead.
- Maps transform or modify streams (see :ref:`processing-data-streams`).
- Buffering collects sampled data into fixed-size batches (see :ref:`pescador.maps.buffer_stream`).
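
The basic use case: define a generator function `g` that yields dictionaries of numpy arrays, construct a :ref:`Streamer` around it, and iterate over the examples it generates. A minimal sketch (`islice` merely truncates the infinite stream):

.. code-block:: python

    import numpy as np
    import pescador
    from itertools import islice

    def g(dimension):
        """Yield dictionaries of numpy arrays, indefinitely."""
        while True:
            yield dict(X=np.random.randn(1, dimension))

    # The Streamer holds the uninstantiated generator and its arguments
    stream = pescador.Streamer(g, 5)

    # Iterate over examples generated by stream()
    for sample in islice(stream(), 10):
        print(sample['X'].shape)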

************
Installation
************

Pescador can be installed from PyPI through `pip`:

.. code-block:: bash

    pip install pescador
or via `conda` using the `conda-forge` channel:

.. code-block:: bash

    conda install -c conda-forge pescador
************
Introduction
************
.. toctree::
    :maxdepth: 2

    intro

**************
Basic examples
**************
.. toctree::
    :maxdepth: 2

@@ -101,39 +77,41 @@
    example3
    bufferedstreaming

*****************
Advanced examples
*****************
.. toctree::
    :maxdepth: 2

    auto_examples/index

*************
API Reference
*************
.. toctree::
    :maxdepth: 2

    api


*************
Release notes
*************
.. toctree::
    :maxdepth: 2

    changes


**********
Contribute
**********
- `Issue Tracker <http://github.com/pescadores/pescador/issues>`_
- `Source Code <http://github.com/pescadores/pescador>`_
- `Contributing guidelines <https://github.com/pescadores/pescador/blob/master/CONTRIBUTING.md>`_


******************
Indices and tables
******************

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

