Define the interface of Dataset.py #4

MarcCote · 2015-05-21T23:25:32Z

We should discuss here the interface of the datasets the smartpy library will be manipulating. In contrast with https://github.com/SMART-Lab/mldata, this is not a generic dataset manager.

Some questions

What types of datasets do we want to support?
How will we support the notion of inputs and targets (i.e. a variable number of sets of data)?
Are trainset, validset and testset only three instances of Dataset, or should the split be part of the class?
Should we provide symbolic variables for inputs and targets? Yes, it is a convenient place to put them, as they would be easily accessible instead of being hidden in some obscure function responsible for compiling the Theano graph.
How do we support targets for unlabeled datasets? Usually, models using an unlabeled dataset use the input as their target; should this be the default behaviour when there are no targets?
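
To make the last question concrete, here is a minimal sketch of a container whose targets fall back to the inputs when none are given. The class name `Dataset`, its attributes, and the fallback rule are assumptions for illustration only; nothing here is an agreed smartpy interface, and plain lists stand in for arrays.

```python
class Dataset:
    """Hypothetical container pairing inputs with optional targets."""

    def __init__(self, inputs, targets=None):
        self.inputs = inputs
        # Assumed default: an unlabeled dataset uses its inputs as targets.
        self.targets = inputs if targets is None else targets
        if len(self.inputs) != len(self.targets):
            raise ValueError("inputs and targets must have the same length")

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # Return one (input, target) example.
        return self.inputs[index], self.targets[index]


# Toy data, made up for the example.
labeled = Dataset([[0.0, 1.0], [1.0, 0.0]], targets=[1, 0])
unlabeled = Dataset([[0.0, 1.0], [1.0, 0.0]])
```

With this fallback, a model can always read `targets` without checking whether the dataset is labeled, at the cost of hiding the distinction from code that does care about it.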

mgermain · 2015-05-22T00:47:10Z

I think you are pretty much asking the questions we were not quite able to answer in mldata. I think for starters we should only support the basic kind for classification, with a tuple (input, target) where input is an array and target a one-hot vector. After that, we could make a small modification to support the "unsupervised kind" (input, input), and slowly but surely add support for new dataset types.

I suggest that, for now, we focus only on answering "Are trainset, validset and testset only three instances of Dataset, or should the split be part of the class?", since we will need that for the basic case.

MarcCote · 2015-05-25T15:08:48Z

I agree we should try to be neither too general nor too specific. I say target should simply be another array (not necessarily a one-hot vector!).
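
A minimal sketch of what that could look like: targets are stored as a plain array of integer class labels, and a one-hot encoding is derived only when a model needs it. The helper name `to_one_hot` and its `nb_classes` parameter are assumptions for illustration, not part of smartpy.

```python
def to_one_hot(labels, nb_classes):
    """Convert integer class labels to one-hot rows (hypothetical helper)."""
    one_hot = []
    for label in labels:
        row = [0] * nb_classes
        row[label] = 1  # mark the position of the class
        one_hot.append(row)
    return one_hot


# Targets kept as a plain array of labels (toy values).
targets = [2, 0, 1]
# One-hot form computed on demand for a classifier that wants it.
encoded = to_one_hot(targets, nb_classes=3)
```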

Regarding the question "Are trainset, validset and testset only three instances of Dataset, or should the split be part of the class?", the way I see it is: trainset, validset and testset are sets of data, so they should be different instances of Dataset. But I don't know what to name the MNIST, CalTech101, CIFAR, etc. "datasets". Maybe DatasetCollection?
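
Roughly, the idea could be sketched as follows, with three separate Dataset instances grouped under a named collection. The field names and the use of `namedtuple` are assumptions made for the example, and the data is a toy stand-in rather than real MNIST.

```python
from collections import namedtuple

# Hypothetical minimal types; only the names Dataset and
# DatasetCollection come from the discussion.
Dataset = namedtuple("Dataset", ["inputs", "targets"])
DatasetCollection = namedtuple(
    "DatasetCollection", ["name", "trainset", "validset", "testset"]
)

# Toy stand-in data: each split is its own Dataset instance.
mnist = DatasetCollection(
    name="MNIST",
    trainset=Dataset(inputs=[[0.1], [0.2], [0.3]], targets=[0, 1, 0]),
    validset=Dataset(inputs=[[0.4]], targets=[1]),
    testset=Dataset(inputs=[[0.5]], targets=[0]),
)

nb_train_examples = len(mnist.trainset.inputs)
```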

mgermain · 2015-05-25T16:49:22Z

Good point about the one-hot. As for the train, valid and test sets, I see them more as views over a given dataset.
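
A minimal sketch of the "views" idea, assuming each split is a contiguous index range over shared storage. `DatasetView` and its constructor arguments are hypothetical names, and plain lists stand in for arrays.

```python
class DatasetView:
    """Hypothetical view exposing a contiguous slice of shared data."""

    def __init__(self, inputs, targets, start, stop):
        # Keep references to the shared storage; nothing is copied.
        self._inputs = inputs
        self._targets = targets
        self._indices = range(start, stop)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        j = self._indices[i]  # map the view index to the underlying index
        return self._inputs[j], self._targets[j]


# Toy data: ten examples shared by the three views.
inputs = [[float(i)] for i in range(10)]
targets = list(range(10))

trainset = DatasetView(inputs, targets, 0, 6)
validset = DatasetView(inputs, targets, 6, 8)
testset = DatasetView(inputs, targets, 8, 10)
```

Because the views share the underlying storage, changing the split boundaries does not require copying any data.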

MarcCote changed the title ~~Redefine the interface of dataset.py~~ Define the interface of Dataset.py May 22, 2015
