Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

*: framework agnostic loaders #682

Open
wants to merge 21 commits into
base: develop
Choose a base branch
from

Conversation

tharvik
Copy link
Collaborator

@tharvik tharvik commented Jun 5, 2024

on the road of supporting ONNX, we first need to extract TFJS from the code base, making discojs framework agnostic.
based on #735.

  • adding a Dataset type, wrapping standard AsyncGenerator with some helpers (map, zip, split, batch, ...)
    • for now, disco.fit, validator.{assess,predict}, requires more type informations (via TypedDataset) as theses are supporting any dataset for any task. it should be enforced at some point via Task splitting so that we can drop it.
    • now, any users of disco have to use it, keeping the remaining Datasplit, Data, tf.* internal to disco (and removing theses when preprocessing and task typing are reworked)
  • also exposes types: Image, Tabular & Text
  • redo loaders to be small functions rather than cumbersome-to-use classes
  • added some convertors as a potential new way to do preprocessing
    • flat functions, that encodes type changes instead of wrapping in tf.TensorContainer (pretty much equivalent to any)
  • webapp are now typing its dataset, having a clear separation between Image, Tabular & Text, with Data component emitting changes

@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 333bd90 to 0560bb0 Compare June 5, 2024 12:52
@tharvik tharvik marked this pull request as ready for review June 5, 2024 12:59
@tharvik tharvik requested a review from JulienVig June 5, 2024 12:59
Copy link
Collaborator

@JulienVig JulienVig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huge amount of work, well done!! Things feel much cleaner now!

I think there are a few issues with the webapp right now:

  • Clicking next on a task without connecting data doesn't display anthing anymore at the Training step (while it used to display the training board)
    Screenshot 2024-06-06 at 13 54 23
  • Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.
  • Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar errors occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000
  • Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear. (Sidenote titanic training currently fails on develop with error: Error: A Dataset iterator for fitDataset() is expected to generate objects of the form {xs: xVal, ys: yVal}, where the two values may be tf.Tensor, an array of Tensors, or a map of string to Tensor. The provided Dataset instead generates Tensor so you seemed to have fixed this error)
    Screenshot 2024-06-06 at 14 01 45
  • Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.
  • There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp Screenshot 2024-06-06 at 14 10 57

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

  • Dataset is not a data structure composed of Data elements
  • Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)
  • DataSplit represents a train-test split of Data rather than a dataset
  • dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

Can we either add more class documentation to clarify the relationships or rework the oriented-object architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore otherwise I'm also happy to create an additional PR myself.

cli/src/benchmark_gpt.ts Outdated Show resolved Hide resolved
discojs-node/src/loaders/index.ts Show resolved Hide resolved
discojs-node/src/loaders.spec.ts Show resolved Hide resolved
discojs-node/src/loaders/image.ts Outdated Show resolved Hide resolved
cli/src/data.ts Show resolved Hide resolved
discojs/src/task/training_information.ts Outdated Show resolved Hide resolved
webapp/src/components/data/TabularDatasetInput.vue Outdated Show resolved Hide resolved
docs/examples/wikitext.ts Outdated Show resolved Hide resolved
webapp/src/components/containers/ImageCard.vue Outdated Show resolved Hide resolved
webapp/src/components/data/ImageDatasetInput.vue Outdated Show resolved Hide resolved
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 0560bb0 to a58143c Compare June 11, 2024 08:34
@tharvik
Copy link
Collaborator Author

tharvik commented Jun 11, 2024

  • Clicking next on a task without connecting data doesn't display anthing anymore at the Training step (while it used to display the training board)

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

  • Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.

good catch! I indeed forgot to propagate the "clear file" event

  • Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar errors occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000

indeed, I mixed RGBA & RGB images, added a real Image type and related remove_alpha convertor.

  • Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear.

right, thanks for catching that, I missplaced a toRaw in there. NB: this is needed as VueJS proxies every value, but it doesn't work with JS private fields (# prefixed values).

  • Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.

ho oupsi, I'm generating an empty dataset when uploading a folder, should have added a throwing TODO, thanks for catching that.

  • There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

correct, there isn't any. we can add one in a next iteration but that out of scope of the PR. from what I remember, one student should have worked on that but it wasn't clear if it happened in the end.

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

I totally agree with you, it's not understandable. that's also why this PR is introducing a new one that should replace all of theses in its next iterations (when I reworked the convertors).

  • Dataset is not a data structure composed of Data elements

that would be the central (and only) way to have a serie of data (image/tabular/text/...) with transformations applied on it.

  • Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)

it will be dropped, replaced by Dataset<Image>, Dataset<Tabular> & Dataset<Text> and some convertors applied on it.

  • DataSplit represents a train-test split of Data rather than a dataset

it will be replaced by the return type of Dataset<T>.split() ie [Dataset<T>, Dataset<T>].

  • dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

it will be real values, not fields, by shaping Dataset<T> via convertors functions. for example, Dataset<Tabular> -> extract_columns -> convert_to_numbers -> Dataset<[features: number[], label: number]>

Can we either add more class documentation to clarify the relationships or rework the oriented-object architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore otherwise I'm also happy to create an additional PR myself.

as I'm hoping to remove most of theses soon, it seems a bit premature to update the doc now. I'll update the doc accordingly in the next one, but feel free to setup a PR if you feel that's it's taking too long.

@JulienVig
Copy link
Collaborator

Sounds good! Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

@JulienVig
Copy link
Collaborator

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 88b3514 to d1437de Compare June 12, 2024 15:22
@tharvik
Copy link
Collaborator Author

tharvik commented Jun 12, 2024

Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

hum, I think that's doable, not for the webapp Tester (it's not really supported now anyway) but for the others parts, yes.

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

hum, users are weird, but yeah, make sense, it's very demo-y; adding it back.

@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from d1437de to fbbd48b Compare June 12, 2024 15:46
@tharvik tharvik changed the base branch from develop to 669-wikitext-web-julien June 15, 2024 15:02
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 976b163 to 649c7a0 Compare June 15, 2024 15:03
Base automatically changed from 669-wikitext-web-julien to develop June 17, 2024 07:07
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 649c7a0 to 15a2a27 Compare July 3, 2024 23:30
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 3c1b6ef to 667cf26 Compare July 15, 2024 13:32
@tharvik tharvik added this to the v4.0.0 milestone Jul 23, 2024
@tharvik tharvik mentioned this pull request Jul 31, 2024
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 667cf26 to 6c4ce8a Compare August 7, 2024 22:44
@tharvik tharvik changed the base branch from develop to NAN-scraped-cleanups-tharvik August 7, 2024 22:45
@tharvik tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from e1f1ff4 to aae81eb Compare August 8, 2024 12:15
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 6c4ce8a to 54478b6 Compare August 8, 2024 12:38
@tharvik tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from aae81eb to bb30a7e Compare August 9, 2024 11:59
Base automatically changed from NAN-scraped-cleanups-tharvik to develop August 9, 2024 12:09
@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 54478b6 to 1a96493 Compare August 12, 2024 12:37
@tharvik
Copy link
Collaborator Author

tharvik commented Aug 12, 2024

so, it's alive again! way smaller to review thanks to extracted scraps. description updated.

@tharvik tharvik requested a review from JulienVig August 12, 2024 12:42
@JulienVig
Copy link
Collaborator

There seems to be something wrong with the image data loaders: connecting images (by group for lus_covid and by csv for mnist) results in an error (Something went wrong on our side) when clicking train (both alone and collaboratively). Titanic and the llm task work fine.

@tharvik
Copy link
Collaborator Author

tharvik commented Aug 13, 2024

There seems to be something wrong with the image data loaders

right, I missplaced a toRaw call. should be fixed. and added a test for lus_covid's training to catch this earlier.

Copy link
Collaborator

@JulienVig JulienVig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work! You've virtually rewritten most of the codebase haha

The evaluation seems to have an issue with data loading:

  • Go to the Evaluation tab, test a GPT language model, connect a text file, click next: Toaster error "Something went wrong" and nothing displayed. Now If I click Previous the text file I previously connected has been cleared
  • Same error with lus_covid (test a model, connect images, next -> nothing displayed).

cli/src/data.ts Outdated Show resolved Hide resolved
discojs-node/src/loaders.spec.ts Show resolved Hide resolved
discojs/src/dataset/dataset.ts Outdated Show resolved Hide resolved
discojs/src/dataset/image.ts Show resolved Hide resolved
cli/src/benchmark_gpt.ts Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: this is a reply to a comment in the previous review that has been detached from its thread

I like the idea of having simple stateless functions for preprocessing. I would prefer if it was more organized, maybe by dataset type but that can be handled later on as we add more functions.
I think there may be some overlap between this file and data/helpers.ts, could you add some doc about the scope of each file to clarify where functions should go?
On that matter maybe convertors.ts is a confusing name as not all functions there convert things (and some functions in helpers.ts also convert things). What about something like processing.ts (for both pre- and post-processing)?

discojs/src/convertors.ts Outdated Show resolved Hide resolved
docs/examples/training.ts Outdated Show resolved Hide resolved
@@ -1,4 +1,4 @@
import tf from '@tensorflow/tfjs'
import tf from '@tensorflow/tfjs-node'
Copy link
Collaborator

@JulienVig JulienVig Aug 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably due to something else than this import but running npm run custom_task hangs forever (or at least a long time)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hum, it's already the case in develop, is it not wanted? from a quick look, it seems to be only starting a server and never closing it (node doesn't exit as long a their are some stuff/promises running)

@tharvik tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 4d3961b to d089ec6 Compare August 14, 2024 21:49
@tharvik
Copy link
Collaborator Author

tharvik commented Aug 15, 2024

Impressive work! You've virtually rewritten most of the codebase haha

haha, slowly getting there! when changing something as fundamental as tfjs, there is some good rewrite indeed.

The evaluation seems to have an issue with data loading:

that should be fixed now, and some cypress tests to ensure that also. we still have a memory leak of one tensor every round. I didn't wanted to sink too much time into it.

@tharvik tharvik requested a review from JulienVig August 15, 2024 10:58
Copy link
Collaborator

@JulienVig JulienVig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements!! I found a few things remaining:

  • Training on lus_covid and then testing the model displays 0% accuracy and 0 samples visited
  • Trying to test/predict a model on titanic data by connecting the titanic_test.csv or titanic_train.csv doesn't display enough and there is a toaster error
  • Idem when trying to test a language model
  • The Stop training button doesn't work when the training fails (for example if you train collaboratively with mnist with a single user, the user timeouts waiting for other contributions. After the timeout, hitting stop training doesn't work)
  • (At some point I also got the error "multiple dataset entered info=render function thrown undefined" but I can't reproduce it...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants