*: framework agnostic loaders #682

tharvik · 2024-06-05T12:51:59Z

on the road of supporting ONNX, we first need to extract TFJS from the code base, making discojs framework agnostic.
based on #735.

adding a Dataset type, wrapping standard AsyncGenerator with some helpers (map, zip, split, batch, ...)
- for now, disco.fit, validator.{assess,predict}, requires more type informations (via TypedDataset) as theses are supporting any dataset for any task. it should be enforced at some point via Task splitting so that we can drop it.
- now, any users of disco have to use it, keeping the remaining Datasplit, Data, tf.* internal to disco (and removing theses when preprocessing and task typing are reworked)
also exposes types: Image, Tabular & Text
redo loaders to be small functions rather than cumbersome-to-use classes
added some convertors as a potential new way to do preprocessing
- flat functions, that encodes type changes instead of wrapping in tf.TensorContainer (pretty much equivalent to any)
webapp are now typing its dataset, having a clear separation between Image, Tabular & Text, with Data component emitting changes

JulienVig

Huge amount of work, well done!! Things feel much cleaner now!

I think there are a few issues with the webapp right now:

Clicking next on a task without connecting data doesn't display anthing anymore at the Training step (while it used to display the training board)
Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.
Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar errors occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000
Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear. (Sidenote titanic training currently fails on develop with error: Error: A Dataset iterator for fitDataset() is expected to generate objects of the form {xs: xVal, ys: yVal}, where the two values may be tf.Tensor, an array of Tensors, or a map of string to Tensor. The provided Dataset instead generates Tensor so you seemed to have fixed this error)
Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.
There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

Dataset is not a data structure composed of Data elements
Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)
DataSplit represents a train-test split of Data rather than a dataset
dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

Can we either add more class documentation to clarify the relationships or rework the oriented-object architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore otherwise I'm also happy to create an additional PR myself.

cli/src/benchmark_gpt.ts

discojs-node/src/loaders/index.ts

discojs-node/src/loaders.spec.ts

discojs-node/src/loaders/image.ts

cli/src/data.ts

discojs/src/task/training_information.ts

webapp/src/components/data/TabularDatasetInput.vue

docs/examples/wikitext.ts

webapp/src/components/containers/ImageCard.vue

webapp/src/components/data/ImageDatasetInput.vue

tharvik · 2024-06-11T08:37:13Z

Clicking next on a task without connecting data doesn't display anthing anymore at the Training step (while it used to display the training board)

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

Connecting data, clicking next to the Training board now displays the training things correctly but if I go back I can't clear the connected files anymore.

good catch! I indeed forgot to propagate the "clear file" event

Connecting the LUS COVID dataset and training fails with error Error: Based on the provided shape, [224,224,3], the tensor should have 150528 values but has 200704, maybe linked to the new image loading function? A similar errors occurs on simple face: Error: Based on the provided shape, [200,200,3], the tensor should have 120000 values but has 160000

indeed, I mixed RGBA & RGB images, added a real Image type and related remove_alpha convertor.

Training on titanic works well, however the webapp fails to display the testing page correctly afterward and the Test button doesn't appear.

right, thanks for catching that, I missplaced a toRaw in there. NB: this is needed as VueJS proxies every value, but it doesn't work with JS private fields (# prefixed values).

Training on the Skin condition and CIFAR10 tasks fails with error TypeError: valueScalar is undefined, you can find the sample skin condition dataset here.

ho oupsi, I'm generating an empty dataset when uploading a folder, should have added a throwing TODO, thanks for catching that.

There is no TextDatasetInput.vue so there is no way to connect data to the wikitext task in the webapp

correct, there isn't any. we can add one in a next iteration but that out of scope of the PR. from what I remember, one student should have worked on that but it wasn't clear if it happened in the end.

I've been terribly confused with the naming of Dataset, Data, DataSplit and tf.data.Dataset. A list of specific things I don't like about the current abstractions:

I totally agree with you, it's not understandable. that's also why this PR is introducing a new one that should replace all of theses in its next iterations (when I reworked the convertors).

Dataset is not a data structure composed of Data elements

that would be the central (and only) way to have a serie of data (image/tabular/text/...) with transformations applied on it.

Data is actually a dataset (and it is Data that wraps tf.data.Dataset rather than Dataset)

it will be dropped, replaced by Dataset<Image>, Dataset<Tabular> & Dataset<Text> and some convertors applied on it.

DataSplit represents a train-test split of Data rather than a dataset

it will be replaced by the return type of Dataset<T>.split() ie [Dataset<T>, Dataset<T>].

dataset ends up being an attribute of data like in data.train.preprocess().batch().dataset

it will be real values, not fields, by shaping Dataset<T> via convertors functions. for example, Dataset<Tabular> -> extract_columns -> convert_to_numbers -> Dataset<[features: number[], label: number]>

Can we either add more class documentation to clarify the relationships or rework the oriented-object architecture in a way that makes things self-explanatory? This is a bit beyond the scope of this PR but it feels like the right time to address it? Feel free to ignore otherwise I'm also happy to create an additional PR myself.

as I'm hoping to remove most of theses soon, it seems a bit premature to update the doc now. I'll update the doc accordingly in the next one, but feel free to setup a PR if you feel that's it's taking too long.

JulienVig · 2024-06-11T08:45:44Z

Sounds good! Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

JulienVig · 2024-06-11T14:29:37Z

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

tharvik · 2024-06-12T15:22:23Z

Is there a quick fix to support text dataset? Merging this PR means that wikitext will not be available on discolab.ai until a new PR implements text datasets.

hum, I think that's doable, not for the webapp Tester (it's not really supported now anyway) but for the others parts, yes.

as no dataset was entered (it's undef), one can't train anything. I added a small message to tell users to enter some. one more functional way would be to rework the steps to disable the next step until data are entered.

I quite liked simply displaying an error message when training without data ("input a dataset first"). That allowed new users to walk around and see things when they don't want to train an actual model. What do you think of it?

hum, users are weird, but yeah, make sense, it's very demo-y; adding it back.

tharvik · 2024-08-12T12:42:49Z

so, it's alive again! way smaller to review thanks to extracted scraps. description updated.

JulienVig · 2024-08-13T09:35:23Z

There seems to be something wrong with the image data loaders: connecting images (by group for lus_covid and by csv for mnist) results in an error (Something went wrong on our side) when clicking train (both alone and collaboratively). Titanic and the llm task work fine.

tharvik · 2024-08-13T12:27:50Z

There seems to be something wrong with the image data loaders

right, I missplaced a toRaw call. should be fixed. and added a test for lus_covid's training to catch this earlier.

JulienVig

Impressive work! You've virtually rewritten most of the codebase haha

The evaluation seems to have an issue with data loading:

Go to the Evaluation tab, test a GPT language model, connect a text file, click next: Toaster error "Something went wrong" and nothing displayed. Now If I click Previous the text file I previously connected has been cleared
Same error with lus_covid (test a model, connect images, next -> nothing displayed).

cli/src/data.ts

discojs-node/src/loaders.spec.ts

discojs/src/dataset/dataset.ts

discojs/src/dataset/image.ts

cli/src/benchmark_gpt.ts

webapp/src/components/dataset_input/LabeledImageDatasetInput/__tests__/ByCSV.spec.ts

JulienVig · 2024-08-13T12:20:43Z

discojs/src/convertors.ts

Edit: this is a reply to a comment in the previous review that has been detached from its thread

I like the idea of having simple stateless functions for preprocessing. I would prefer if it was more organized, maybe by dataset type but that can be handled later on as we add more functions.
I think there may be some overlap between this file and data/helpers.ts, could you add some doc about the scope of each file to clarify where functions should go?
On that matter maybe convertors.ts is a confusing name as not all functions there convert things (and some functions in helpers.ts also convert things). What about something like processing.ts (for both pre- and post-processing)?

discojs/src/convertors.ts

docs/examples/training.ts

JulienVig · 2024-08-13T12:30:31Z

docs/examples/custom_task.ts

@@ -1,4 +1,4 @@
-import tf from '@tensorflow/tfjs'
+import tf from '@tensorflow/tfjs-node'


This is probably due to something else than this import but running npm run custom_task hangs forever (or at least a long time)

hum, it's already the case in develop, is it not wanted? from a quick look, it seems to be only starting a server and never closing it (node doesn't exit as long a their are some stuff/promises running)

tharvik · 2024-08-15T10:57:52Z

Impressive work! You've virtually rewritten most of the codebase haha

haha, slowly getting there! when changing something as fundamental as tfjs, there is some good rewrite indeed.

The evaluation seems to have an issue with data loading:

that should be fixed now, and some cypress tests to ensure that also. we still have a memory leak of one tensor every round. I didn't wanted to sink too much time into it.

JulienVig

Thanks for the improvements!! I found a few things remaining:

Training on lus_covid and then testing the model displays 0% accuracy and 0 samples visited
Trying to test/predict a model on titanic data by connecting the titanic_test.csv or titanic_train.csv doesn't display enough and there is a toaster error
Idem when trying to test a language model
The Stop training button doesn't work when the training fails (for example if you train collaboratively with mnist with a single user, the user timeouts waiting for other contributions. After the timeout, hitting stop training doesn't work)
(At some point I also got the error "multiple dataset entered info=render function thrown undefined" but I can't reproduce it...)

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 333bd90 to 0560bb0 Compare June 5, 2024 12:52

tharvik marked this pull request as ready for review June 5, 2024 12:59

tharvik requested a review from JulienVig June 5, 2024 12:59

JulienVig requested changes Jun 6, 2024

View reviewed changes

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 0560bb0 to a58143c Compare June 11, 2024 08:34

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 88b3514 to d1437de Compare June 12, 2024 15:22

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from d1437de to fbbd48b Compare June 12, 2024 15:46

tharvik changed the base branch from develop to 669-wikitext-web-julien June 15, 2024 15:02

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 976b163 to 649c7a0 Compare June 15, 2024 15:03

Base automatically changed from 669-wikitext-web-julien to develop June 17, 2024 07:07

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 649c7a0 to 15a2a27 Compare July 3, 2024 23:30

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch 2 times, most recently from 3c1b6ef to 667cf26 Compare July 15, 2024 13:32

tharvik added this to the v4.0.0 milestone Jul 23, 2024

tharvik mentioned this pull request Jul 31, 2024

scraps from #682 #735

Merged

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 667cf26 to 6c4ce8a Compare August 7, 2024 22:44

tharvik changed the base branch from develop to NAN-scraped-cleanups-tharvik August 7, 2024 22:45

tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from e1f1ff4 to aae81eb Compare August 8, 2024 12:15

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 6c4ce8a to 54478b6 Compare August 8, 2024 12:38

tharvik force-pushed the NAN-scraped-cleanups-tharvik branch from aae81eb to bb30a7e Compare August 9, 2024 11:59

Base automatically changed from NAN-scraped-cleanups-tharvik to develop August 9, 2024 12:09

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 54478b6 to 1a96493 Compare August 12, 2024 12:37

tharvik requested a review from JulienVig August 12, 2024 12:42

JulienVig requested changes Aug 13, 2024

View reviewed changes

tharvik added 17 commits August 14, 2024 23:48

discojs/convertors: initial

d18bd4a

discojs/dataset: revamp

5b0c81f

discojs-*: rework loaders

0c29f15

webapp/data: rework

c0e7ae0

*: use Dataset

d0450de

discojs*: rm *Loader & DatasetBuilder

6df47b7

discojs/Image: add full type

137a81e

*: import tfjs-node for speed

8e8ee75

webapp/training: stop via AbortController

7b259af

webapp/tests: train lus covid

32018a7

webapp/training: fix unraw dataset use

a925953

cli: use path.join where possible

af9fcca

webapp/tests: commonize server setup

66d0525

webapp/tests: add testing e2e

2f69fc5

webapp/testing: fix text test

9699392

*: avoid some tf memory leak

d93f88e

discojs/prepocessing/text: fix type check

d089ec6

tharvik force-pushed the 650-framework-agnostic-loaders-tharvik branch from 4d3961b to d089ec6 Compare August 14, 2024 21:49

tharvik added 4 commits August 15, 2024 12:39

discojs: convertors -> processing

e91caef

discojs: add some comments

7c8e74e

examples/training: fix comment

812de12

webapp: small fixes

f58325a

tharvik requested a review from JulienVig August 15, 2024 10:58

JulienVig requested changes Aug 15, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: framework agnostic loaders #682

*: framework agnostic loaders #682

tharvik commented Jun 5, 2024 •

edited

Loading

JulienVig left a comment

tharvik commented Jun 11, 2024

JulienVig commented Jun 11, 2024

JulienVig commented Jun 11, 2024

tharvik commented Jun 12, 2024

tharvik commented Aug 12, 2024

JulienVig commented Aug 13, 2024

tharvik commented Aug 13, 2024

JulienVig left a comment •

edited

Loading

JulienVig Aug 13, 2024

JulienVig Aug 13, 2024 •

edited

Loading

tharvik Aug 15, 2024

tharvik commented Aug 15, 2024

JulienVig left a comment

		@@ -1,4 +1,4 @@
		import tf from '@tensorflow/tfjs'
		import tf from '@tensorflow/tfjs-node'

*: framework agnostic loaders #682

Are you sure you want to change the base?

*: framework agnostic loaders #682

Conversation

tharvik commented Jun 5, 2024 • edited Loading

JulienVig left a comment

Choose a reason for hiding this comment

tharvik commented Jun 11, 2024

JulienVig commented Jun 11, 2024

JulienVig commented Jun 11, 2024

tharvik commented Jun 12, 2024

tharvik commented Aug 12, 2024

JulienVig commented Aug 13, 2024

tharvik commented Aug 13, 2024

JulienVig left a comment • edited Loading

Choose a reason for hiding this comment

JulienVig Aug 13, 2024

Choose a reason for hiding this comment

JulienVig Aug 13, 2024 • edited Loading

Choose a reason for hiding this comment

tharvik Aug 15, 2024

Choose a reason for hiding this comment

tharvik commented Aug 15, 2024

JulienVig left a comment

Choose a reason for hiding this comment

tharvik commented Jun 5, 2024 •

edited

Loading

JulienVig left a comment •

edited

Loading

JulienVig Aug 13, 2024 •

edited

Loading