Define the interface of Dataset.py #4

Open
1 of 5 tasks
MarcCote opened this issue May 21, 2015 · 3 comments

Comments

@MarcCote
Member

We should discuss here the interface of the datasets the smartpy library will be manipulating. In contrast with https://github.com/SMART-Lab/mldata, this is not a generic dataset manager.

Some questions

  • What types of datasets do we want to support?
  • How will we support the notion of inputs and targets (i.e. variable number of sets of data)?
  • Are trainset, validset and testset only three instances of Dataset or the split should be part of the class?
  • Should we provide symbolic variables for inputs and targets? Yes, it is a convenient place to put them, as they would be easily accessible instead of being hidden in some obscure function responsible for compiling the Theano graph.
  • How do we support targets for unlabeled datasets? Usually, models trained on unlabeled datasets use the input as their target; should this be the default behaviour when there are no targets?
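
To make the last question concrete, here is a minimal sketch of what "use the input as the target by default" could look like. The class name `Dataset` comes from the issue title, but the constructor signature and attribute names are assumptions for illustration, not the smartpy API. Plain Python sequences stand in for arrays.

```python
class Dataset:
    """Hypothetical container pairing inputs with optional targets."""

    def __init__(self, inputs, targets=None):
        # Assumption: for unlabeled data, fall back to the inputs as targets.
        if targets is None:
            targets = inputs
        elif len(targets) != len(inputs):
            raise ValueError("inputs and targets must have the same length")
        self.inputs = inputs
        self.targets = targets

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index):
        # Return one (input, target) example.
        return self.inputs[index], self.targets[index]


unlabeled = Dataset([[0.0, 1.0], [1.0, 0.0]])
labeled = Dataset([[0.0, 1.0], [1.0, 0.0]], targets=[1, 0])
```

With this default, a model can always read `dataset.targets` without checking whether the dataset is labeled.
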
@mgermain
Member

I think you are pretty much asking the questions we were not quite able to answer in mldata. I think for starters we should only support the basic kind for classification with a tuple (input, target) where input is an array and target a one-hot vector. After that, we could make a small modification to support the "unsupervised kind" (input, input) and slowly but surely add support for new dataset types.
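
For reference, a one-hot target for the basic classification case could be built as in the sketch below. The helper name `to_one_hot` is hypothetical, and nested lists stand in for arrays.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into one-hot vectors (lists of 0/1)."""
    vectors = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label %r is outside [0, %d)" % (label, num_classes))
        row = [0] * num_classes
        row[label] = 1  # mark the position of the true class
        vectors.append(row)
    return vectors


targets = to_one_hot([0, 2, 1], num_classes=3)
```
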

I suggest that, for now, we focus only on answering "Are trainset, validset and testset only three instances of Dataset or the split should be part of the class?", which we will need for the basic case.

@MarcCote MarcCote changed the title Redefine the interface of dataset.py Define the interface of Dataset.py May 22, 2015
@MarcCote
Member Author

I agree we should not try to be too general nor too specific. I say target should simply be another array (not necessarily a one-hot vector!).

Regarding the question "Are trainset, validset and testset only three instances of Dataset or the split should be part of the class?", the way I see it is: trainset, validset and testset are sets of data, so they should be different instances of Dataset. But I don't know how to name the MNIST, CalTech101, CIFAR, etc. "datasets"? Maybe DatasetCollection?
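
A rough sketch of that option, with three separate `Dataset` instances grouped under a `DatasetCollection`. Only the two class names and the trainset/validset/testset names come from this thread; the fields and the `sizes` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Dataset:
    """One set of data: inputs plus targets (any array-like sequences)."""
    inputs: Sequence
    targets: Sequence

    def __len__(self) -> int:
        return len(self.inputs)


@dataclass
class DatasetCollection:
    """Hypothetical grouping for a named benchmark such as MNIST."""
    name: str
    trainset: Dataset
    validset: Dataset
    testset: Dataset

    def sizes(self) -> Dict[str, int]:
        # Number of examples in each split.
        return {
            "trainset": len(self.trainset),
            "validset": len(self.validset),
            "testset": len(self.testset),
        }


toy = DatasetCollection(
    name="toy",
    trainset=Dataset([[0], [1], [2]], [0, 1, 0]),
    validset=Dataset([[3]], [1]),
    testset=Dataset([[4], [5]], [0, 1]),
)
```
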

@mgermain
Member

Good point about the one-hot. As for train, valid and test, I see them more as views over a given dataset.
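
One way to read "views" is as index ranges over shared underlying data, with no copying. The sketch below assumes that reading; `DatasetView` and `split_views` are hypothetical names, and a contiguous split is only one possible policy.

```python
class DatasetView:
    """A window onto shared inputs/targets, selected by a list of indices."""

    def __init__(self, inputs, targets, indices):
        self._inputs = inputs      # shared with the other views, not copied
        self._targets = targets
        self._indices = list(indices)

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, i):
        j = self._indices[i]       # map view position to underlying position
        return self._inputs[j], self._targets[j]


def split_views(inputs, targets, n_train, n_valid):
    """Split into contiguous train/valid/test views; the rest goes to test."""
    n = len(inputs)
    if n_train + n_valid > n:
        raise ValueError("n_train + n_valid exceeds the dataset size")
    train = DatasetView(inputs, targets, range(0, n_train))
    valid = DatasetView(inputs, targets, range(n_train, n_train + n_valid))
    test = DatasetView(inputs, targets, range(n_train + n_valid, n))
    return train, valid, test


inputs = list(range(10))
targets = [2 * x for x in inputs]
trainset, validset, testset = split_views(inputs, targets, n_train=6, n_valid=2)
```

Compared with three independent instances, the views keep a single copy of the data, at the cost of one extra index lookup per access.
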

On Mon, May 25, 2015 at 11:08 AM, Marc-Alexandre Côté wrote:

I agree we should not try to be too general nor too specific. I say target
should simply be another array (not necessarily a one-hot vector!).

Regarding the question "Are trainset, validset and testset only three
instances of Dataset or the split should be part of the class?", the way I
see it is: trainset, validset and testset are sets of data, so they should
be different instances of Dataset. But I don't know how to name the
MNIST, CalTech101, CIFAR, etc. "datasets"? Maybe DatasetCollection?


