Guava Table<Integer, String, String> to JSAT DataSet #43
Comments
I'm a little confused here. What is the Guava Table object representing with strings and 3 generic types? The super lazy thing would be to convert your table to a CSV and then use the CSV reader... though I feel a little dirty just typing that out. The CSV parser code isn't necessarily the best to read for understanding how to do something. That code (and the LIBSVM parser) is written to have a low GC impact by using a small state machine. This was done because for work I have some 100GB-500GB CSV and LIBSVM files that will fail with a JVM GC overhead exception if parsed any other way.
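For reference, the "convert the table to a CSV" route could look roughly like the sketch below. It is only an illustration: it takes a plain `Map<Integer, Map<String, String>>`, which is the shape Guava's `Table.rowMap()` returns, so that it runs without Guava on the classpath. The class name `TableToCsv` and its methods are made up for this example, and feeding the result to the JSAT CSV reader is left out.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

public class TableToCsv {
    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /**
     * rows maps row key -> (column name -> cell value).
     * Assumption: a missing cell is written as an empty field.
     */
    public static String toCsv(Map<Integer, Map<String, String>> rows) {
        // Collect column names in first-seen order so the header is stable.
        Set<String> columns = new LinkedHashSet<>();
        for (Map<String, String> row : rows.values()) {
            columns.addAll(row.keySet());
        }

        StringBuilder csv = new StringBuilder();
        StringJoiner header = new StringJoiner(",");
        for (String column : columns) {
            header.add(escape(column));
        }
        csv.append(header).append('\n');

        for (Map<String, String> row : rows.values()) {
            StringJoiner line = new StringJoiner(",");
            for (String column : columns) {
                line.add(escape(row.getOrDefault(column, "")));
            }
            csv.append(line).append('\n');
        }
        return csv.toString();
    }
}
```

With a real Guava table this would be called as `TableToCsv.toCsv(table.rowMap())`. Everything is built in memory, so it only makes sense for tables that already fit on the heap.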
I picked I think The code I wrote tries to parse the String values to longs or doubles, decides on the "worst" type per column and makes sure they are all the same type, then builds up a lookup table for the columns that need it, and outputs it all into a DataSet. But ya... I think if I made better use of the various data row/point constructors, it could be half as much code.
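The steps described above (parse each value, keep the "worst" type per column, build a lookup table for string columns) can be sketched as follows. This is not the code from the gist, just a minimal illustration under stated assumptions: the names `ColumnTypeInference`, `ColType`, `worstType`, and `buildLookup` are hypothetical, the ordering LONG < DOUBLE < STRING is one reading of "worst", and the final step of writing into a JSAT DataSet is omitted.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnTypeInference {
    /** Ordered from most specific to least specific; a later constant is "worse". */
    public enum ColType { LONG, DOUBLE, STRING }

    /** Classifies one cell. Note that Double.parseDouble also accepts "NaN" and "Infinity". */
    public static ColType typeOf(String value) {
        String v = value.trim();
        try {
            Long.parseLong(v);
            return ColType.LONG;
        } catch (NumberFormatException notLong) {
            // fall through to the double check
        }
        try {
            Double.parseDouble(v);
            return ColType.DOUBLE;
        } catch (NumberFormatException notDouble) {
            return ColType.STRING;
        }
    }

    /** Returns the least specific type seen in the column (LONG for an empty column). */
    public static ColType worstType(Collection<String> column) {
        ColType worst = ColType.LONG;
        for (String value : column) {
            ColType t = typeOf(value);
            if (t.compareTo(worst) > 0) {
                worst = t;
            }
            if (worst == ColType.STRING) {
                break; // cannot get any worse
            }
        }
        return worst;
    }

    /** Assigns each distinct string a dense index in first-seen order. */
    public static Map<String, Integer> buildLookup(Collection<String> column) {
        Map<String, Integer> lookup = new LinkedHashMap<>();
        for (String value : column) {
            lookup.putIfAbsent(value, lookup.size());
        }
        return lookup;
    }
}
```

With a Guava table, each column's values would come from something like `table.column(name).values()`. Columns whose worst type is STRING would go through `buildLookup`, and the resulting indices could serve as category codes.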
Hmm, do you really need How many use cases would an abstract class for this be helpful for? Is the use case you are imagining where you get datasets at runtime and don't know what types of features are in the data in advance?
I was using Long (or Integer would be fine) for class lookup columns. Some threshold like "if the column is a String, or an Int with fewer than 20 unique values, then treat it as a catFeats column."
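A rough sketch of that threshold rule, for illustration only. The class name `CategoricalHeuristic` and the constant are hypothetical, the cutoff of 20 comes from the comment above, and treating a column that contains any non-integer number as numeric is an assumption rather than something stated in the thread.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class CategoricalHeuristic {
    /** Integer columns with fewer distinct values than this are treated as categorical. */
    public static final int MAX_INT_CATEGORIES = 20;

    /**
     * Returns true when the column should be treated as categorical:
     * it contains at least one non-numeric string, or it is all integers
     * with fewer than MAX_INT_CATEGORIES distinct values.
     * Assumption: an empty column counts as categorical (zero distinct values).
     */
    public static boolean isCategorical(Collection<String> column) {
        Set<Long> distinctInts = new HashSet<>();
        boolean sawDouble = false;
        for (String raw : column) {
            String v = raw.trim();
            try {
                distinctInts.add(Long.parseLong(v));
                continue;
            } catch (NumberFormatException notLong) {
                // not an integer, try a double next
            }
            try {
                Double.parseDouble(v);
                sawDouble = true;
            } catch (NumberFormatException notNumber) {
                return true; // a plain string makes the whole column categorical
            }
        }
        return !sawDouble && distinctInts.size() < MAX_INT_CATEGORIES;
    }
}
```

For example, a column of "0"/"1" flags comes out categorical, while an integer ID column with 100 distinct values or a column of prices does not.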
Exactly.
Here is the code. I warned you - ugly. But maybe I could hack half of it out using better constructors? https://gist.github.com/salamanders/cd42f99b8483e8d0d89f6edfa5b43a10 tableToDataSet_Classification and tableToDataSet_Regression have the interesting code, the rest is support material.
Fixed a bug in the number of cats and simplified the "int double or string" logic, now working pretty fast!
I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.
Is there an easier way to do this?
Can it be part of the library?
class TableDataLoader
class ColumnInfo