Guava Table<Integer, String, String> to JSAT DataSet #43

Open
salamanders opened this issue Aug 12, 2016 · 6 comments

@salamanders

I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.

  1. Parsing the Strings into Longs, Doubles, Strings
  2. Finding out the "worst" type for each column and normalizing across the column
  3. Making lookup tables for each column that needs it (small number of ints, or Strings)
  4. Generating a dataset based on the output column name
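
For readers following along, steps 1 and 2 above can be sketched roughly as below. This is not the code from the issue: the class `ColumnTypeSketch`, the enum `CellType`, and both method names are hypothetical, and the input is assumed to be the non-null cell values of one column (for a Guava `Table`, `table.column(name).values()` would supply them).

```java
import java.util.Collection;

/** Hypothetical helper: finds the most general ("worst") type needed to hold a column. */
class ColumnTypeSketch {
    // Declared from most specific to most general, so compareTo() orders them.
    enum CellType { LONG, DOUBLE, STRING }

    /** Classifies a single raw cell. Assumes raw is non-null. */
    static CellType typeOf(String raw) {
        String s = raw.trim();
        try {
            Long.parseLong(s);
            return CellType.LONG;
        } catch (NumberFormatException notALong) {
            // Not an integer; try a floating-point parse next.
        }
        try {
            // Note: parseDouble also accepts forms such as "NaN", "1e3" and "1d".
            Double.parseDouble(s);
            return CellType.DOUBLE;
        } catch (NumberFormatException notADouble) {
            return CellType.STRING;
        }
    }

    /** Returns the most general type seen in the column (LONG for an empty column). */
    static CellType worstType(Collection<String> cells) {
        CellType worst = CellType.LONG;
        for (String cell : cells) {
            CellType t = typeOf(cell);
            if (t.compareTo(worst) > 0) {
                worst = t;
            }
            if (worst == CellType.STRING) {
                break; // nothing is more general than STRING
            }
        }
        return worst;
    }
}
```

Once the worst type is known, every cell in the column can be converted to it, which is the "normalizing" part of step 2.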

Is there an easier way to do this?
Can it be part of the library?

class TableDataLoader

  • TableDataLoader(Table<Long, String, String>)
  • getDataSet(String)
  • tableToDataSet_Classification(ColumnInfo, List, SortedSet, int, int)
  • tableToDataSet_Regression(ColumnInfo, List, SortedSet, int, int)

class ColumnInfo

  • ColumnInfo(String, Map<Long, String>)
  • collectionToSortedUniqueStringList(Collection)
  • parseColumn(Map<Long, String>)
  • parseToLowestObject(String, Class<?>)
  • constructJSATCategoricalData()
  • constructLabelLookups()
  • getCategoricalData()
  • getName()
  • getType()
  • isLookup()
  • getRowValue(Number)
  • getKeyFromLookupId(int)
  • getAllRowKeys()
@EdwardRaff
Owner

I'm a little confused here. What is the Guava Table object representing with strings and 3 generic types? The super lazy thing would be to convert your table to a CSV and then use the CSV reader... though I feel a little dirty just typing that out.
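
A minimal sketch of that "lazy" route, for illustration only. The class `TableToCsvSketch` and its methods are hypothetical. The table is represented with plain `java.util` maps so the snippet runs without Guava; with a real `Table<Integer, String, String>`, `table.columnKeySet()` and `table.rowMap()` return collections of these shapes. Missing cells are written as empty fields, and whether the CSV reader handles quoted fields or empty fields the way you want is an assumption to check before relying on this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: renders a row -> (column -> value) map as CSV text. */
class TableToCsvSketch {
    /** Quotes a cell if it contains a comma, quote, or line break; null becomes empty. */
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    /** Writes a header line followed by one line per row, in the maps' iteration order. */
    static String toCsv(Set<String> columns, Map<Integer, Map<String, String>> rows) {
        StringBuilder out = new StringBuilder();
        List<String> header = new ArrayList<>();
        for (String column : columns) {
            header.add(escape(column));
        }
        out.append(String.join(",", header)).append('\n');
        for (Map<String, String> row : rows.values()) {
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                cells.add(escape(row.get(column))); // absent cell -> empty field
            }
            out.append(String.join(",", cells)).append('\n');
        }
        return out.toString();
    }
}
```

The resulting text could be written to a temporary file and handed to the CSV reader, at the cost of a round trip through disk.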

The CSV parser code isn't necessarily the best to read for understanding how to do something. That code (and the LIBSVM parser) is written to have a low GC impact by using a small state machine. This was done because for work I have some 100GB-500GB CSV and LIBSVM files that would fail with a JVM GC overhead exception if parsed any other way.

@salamanders
Author

I picked Table<row:Integer, columnName:String, cellValue:String> because it represents pretty much any tabular data structure read from disk or from a form post -- as long as it is small enough!

I think Table<row:Integer, columnName:String, cellValue:**(Long or Double or String)**> is the way to go, because if the entire column's values are all Longs, all Doubles, or all Strings, then it maps well to how you need to transform the column (or, if the column is the target, to deciding whether it is a Classification or Regression problem).

The code I wrote tries to parse the String values to longs or doubles, decides on the "worst" type per column and makes sure they are all the same type, then builds up a lookup table for the columns that need it, and outputs it all into a DataSet. But ya... I think if I made better use of the various data row/point constructors, it could be half as much code.
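
The lookup-table step might look something like the sketch below. It is an assumption about the general approach, not the code from the gist: `LookupSketch` and its methods are hypothetical names, and sorting the labels is just one way to make the integer codes deterministic.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper: maps the distinct labels of a column to integer codes. */
class LookupSketch {
    /** Distinct values in natural (sorted) order, so codes are stable across runs. */
    static List<String> sortedUnique(Collection<String> values) {
        return new ArrayList<>(new TreeSet<>(values));
    }

    /** Label -> position in the sorted list. */
    static Map<String, Integer> buildLookup(List<String> labels) {
        Map<String, Integer> lookup = new HashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            lookup.put(labels.get(i), i);
        }
        return lookup;
    }

    /** Replaces each cell with its integer code; unknown labels are rejected. */
    static int[] encode(List<String> column, Map<String, Integer> lookup) {
        int[] codes = new int[column.size()];
        for (int i = 0; i < column.size(); i++) {
            Integer code = lookup.get(column.get(i));
            if (code == null) {
                throw new IllegalArgumentException("Unknown label: " + column.get(i));
            }
            codes[i] = code;
        }
        return codes;
    }
}
```

The sorted label list doubles as the reverse lookup (code back to label), and its size is the number of categories for that column.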

@EdwardRaff
Owner

Hmm, do you really need Long as an option? For all but the largest values a double can store them losslessly. JSAT is going to save it as a double in the end anyway. That would simplify your code too.
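
To make the "all but the largest values" point concrete: a double has a 53-bit significand, so every integer with magnitude up to 2^53 is represented exactly, and beyond that some integers are rounded. The helper below is a hypothetical illustration (the name `DoublePrecisionSketch` is not from JSAT). It compares through `BigDecimal` rather than casting back to `long`, because the cast saturates near `Long.MAX_VALUE` and would hide the rounding.

```java
import java.math.BigDecimal;

/** Hypothetical helper: checks whether a long is exactly representable as a double. */
class DoublePrecisionSketch {
    static boolean survivesDouble(long value) {
        // new BigDecimal(double) gives the exact value the double holds.
        BigDecimal asDouble = new BigDecimal((double) value);
        return asDouble.compareTo(BigDecimal.valueOf(value)) == 0;
    }

    public static void main(String[] args) {
        long limit = 1L << 53; // 9007199254740992
        System.out.println(survivesDouble(limit));          // exactly representable
        System.out.println(survivesDouble(limit + 1));      // rounded when stored as double
        System.out.println(survivesDouble(Long.MAX_VALUE)); // rounded up to 2^63
    }
}
```

Row counts, IDs, and category codes in small tabular datasets are typically far below 2^53, which is why dropping Long loses nothing in practice here.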

How many use cases would an abstract class for this be helpful for? Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

@salamanders
Author

I was using Long (or Integer would be fine) for class lookup columns, with some threshold like "if the column is a String, or an Int where there are < 20 unique values, then treat it as a catFeats column."
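
That rule could be sketched as follows. The class `CategoricalRuleSketch`, the method `isCategorical`, and the constant name are hypothetical; the threshold of 20 comes from the comment above, and treating an empty column as categorical is an arbitrary choice made here.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: decides whether a column should be a categorical feature. */
class CategoricalRuleSketch {
    // Integer columns with fewer distinct values than this count as categorical.
    static final int MAX_INT_CATEGORIES = 20;

    /** Assumes all cells are non-null. */
    static boolean isCategorical(Collection<String> cells) {
        Set<Long> distinctInts = new HashSet<>();
        boolean allIntegers = true;
        for (String cell : cells) {
            String s = cell.trim();
            try {
                distinctInts.add(Long.parseLong(s));
                continue; // integer cell, nothing more to check
            } catch (NumberFormatException notAnInteger) {
                allIntegers = false;
            }
            try {
                Double.parseDouble(s);
            } catch (NumberFormatException notANumber) {
                return true; // a non-numeric cell makes this a String column
            }
        }
        if (!allIntegers) {
            return false; // numeric with at least one non-integer: real-valued
        }
        return distinctInts.size() < MAX_INT_CATEGORIES;
    }
}
```

Counting distinct parsed values rather than distinct strings means "1" and "01" are treated as the same category.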

> Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?

Exactly.

@salamanders
Author

Here is the code. I warned you - ugly. But maybe I could hack half of it out using better constructors?

https://gist.github.com/salamanders/cd42f99b8483e8d0d89f6edfa5b43a10

tableToDataSet_Classification and tableToDataSet_Regression have the interesting code, the rest is support material.

@salamanders
Author

Fixed a bug in the number of cats and simplified the "int double or string" logic, now working pretty fast!

2 participants