Deepscatter Review
Disclaimer: deepscatter is a relatively small project with just 9 contributors, primarily @bmschmidt. It is also under active development / improvement:
API
This is still subject to change and is not fully documented. The encoding portion of the API mimics Vega-Lite with some minor distinctions to avoid deeply-nested queries and to add animation and jitter parameters.
The team has been super supportive and responsive and I am writing this first and foremost out of a place of good faith as an end user and to give the developer team feedback which may be of interest to them.
TL;DR
4/5 + GitHub ⭐️
The good: Built for a niche use case and great for what it is. Its maintainers are active and supportive. Despite being under active development, it works (not always the case with in-development software) and examples work as advertised.
The bad: I am impatient! I want it to be finished and fully polished already because the potential is amazing.
Requested features
3D
🙏 PLEASE 🙏
Resize-able
This is very important to my use case. Who likes an overflowing or underfilled div?
I know that my prayers are in the process of being answered (PR#81) @bschmidt = 😇.
Flip y-axis
In a discussion with @bschmidt we talked about why a plot I created was "flipped" when rendered. For the unfamiliar, this stems from how the web renders pages, with $y=0$ at the top of the page and $y$ increasing as you go down, whereas most plots we are familiar with have lower $y$-values at the bottom and go up.
I firmly think the default behavior should be flipped. Why? Well, they use and adopt many of d3's conventions. While this is also the default behavior of d3, the difference is exposure. The Scatterplot object returned by new Deepscatter(htmlID, width, height) does not have a convenient method exposing the y-axis's range, e.g. plot.yAxis.range(). Instead all of this must be specified fully in the plotAPI.encoding (based on, but divergent from, Vega-Lite).
The point is that deepscatter, just by having "scatter" in its name, sits at a higher level of abstraction than d3, and yet requiring the user to specify the range in the plotAPI.encoding.y channel keeps it coupled to those lower levels. By default it should render the way most people expect.
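To illustrate why the range matters here, a minimal sketch in plain TypeScript (this is not deepscatter's or d3's API): a linear scale maps a data domain onto pixel coordinates, and swapping the two endpoints of the range is all it takes to put low y-values at the bottom. The helper name linearScale and the 500px height are assumptions made for illustration.

```typescript
// Hypothetical helper, not part of deepscatter or d3.
type Pair = [number, number];

function linearScale(domain: Pair, range: Pair): (v: number) => number {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  // Linearly interpolate v from the domain onto the range.
  return (v) => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
}

const height = 500; // assumed plot height in pixels

// Screen convention: y = 0 is the top of the page.
const screenY = linearScale([0, 10], [0, height]);
// Plot convention: swap the range so low values sit at the bottom.
const plotY = linearScale([0, 10], [height, 0]);

console.log(screenY(0), plotY(0)); // 0 500
console.log(screenY(10), plotY(10)); // 500 0
```

A default that renders like plotY would match what most people expect from a scatterplot, while still letting users opt back into the screen convention.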
Calculate domain and range
Similar to flipping the y-axis: since there is no plot.yAxis.range() method, there are no plot.getDomain(feature) and plot.getRange(feature) methods either. This matters. As stated, deepscatter is at a higher level of abstraction. So without these methods we, the end users, must also schlep around metadata that isn't even accessible under plot._iroot.schema, plot._iroot.schema.fields, etc. This is even more frustrating as an end user because to use deepscatter we explicitly have to use quadfeather. Additionally, as stated above, how plot handles the arguments of plotAPI, especially plotAPI.encoding, changes depending on how much information is provided. So specifying the correct range but not the domain yields undesirable behavior and vice versa (doubly so for colors).
Thankfully I am not completely alone in this opinion:
"The challenge here is that we can’t confidently calculate extent of metadata in deepscatter because there is not a guarantee that the browser has access to all the data in the dataset--if you have 100 million points, we won’t load all of them."
It might be possible and reasonable to do this in quadfeather though and add it as column metadata. @bschmidt
Some other metadata for consideration:
the total number of points in the dataset should also be included somewhere (to set a "max number of points" slider).
categories of categorical features (or at least a method that loops over the current categories)
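As a rough sketch of the kind of per-column metadata meant here (extent, point count, categories), the function below summarizes one column. It assumes the full column is available where it runs, for example at tiling time as the quote above suggests, and it is written in TypeScript only to show the shape. ColumnMeta and summarizeColumn are hypothetical names, not part of quadfeather or deepscatter.

```typescript
// Hypothetical metadata shape; none of these names come from quadfeather or deepscatter.
type ColumnMeta =
  | { kind: "numeric"; count: number; domain: [number, number] }
  | { kind: "categorical"; count: number; categories: string[] };

function summarizeColumn(values: Array<number | string>): ColumnMeta {
  const count = values.length;
  if (count > 0 && values.every((v) => typeof v === "number")) {
    let min = Infinity;
    let max = -Infinity;
    for (const v of values as number[]) {
      if (v < min) min = v;
      if (v > max) max = v;
    }
    return { kind: "numeric", count, domain: [min, max] };
  }
  // Anything else is treated as categorical; categories keep first-seen order.
  const categories = Array.from(new Set(values.map(String)));
  return { kind: "categorical", count, categories };
}

const expression = summarizeColumn([-3.97, 0.5, 3.25]);
const conditions = summarizeColumn(["condition 1", "condition 2", "condition 1"]);
console.log(expression, conditions);
```

With something like this stored next to the tiles, a "max number of points" slider could read count, and a color channel could read domain or categories, without the browser having to load every point.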
Fix encodings
By far the most aggravating part of deepscatter as an end user is the encoding channels. While having the ability to specify these things is nice, all my previously mentioned notes combined with the number of channels add up to a very clunky-feeling interface.
Additionally, per this demo (featherplot (svelte) Pages) one can see that:

```js
encoding = {
  color: {
    field: "conditions",
    domain: [0, 3]
    // NOTE: <--- categorical i.e. 0, 1, 2, 3, but values
    // are "condition 1", "condition 2", "condition 3", etc
  },
}
```

it doesn't seem easy to have categorical colors, even though the exported column was categorical.
Likewise, if range: "viridis" is not specified, even with domain: [-3.97, 3.25], you will not see a color gradient.
Under the section recommendations, I suggest some ways encoding can be made more friendly for users.
Fix plotAPI
Sometimes plot.plotAPI(...opts) can handle rapid changes in values. Other times, it cannot. Going back to featherplot (svelte) Pages, moving the number of points slider rapidly works.
BUG:
Moving the zoom box slider does not (well at least not when deployed. It sometimes works with hot reloading in development mode.)
[Error] Error: (regl) cannot cancel a frame twice
f — deepscatter.39d51a89.js:2:90
c — deepscatter.39d51a89.js:2:149
Be — deepscatter.39d51a89.js:8:52497
(anonymous function) — deepscatter.39d51a89.js:1688:31555
(anonymous function) (2.5a5494e2.js:32:8792)
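One way a caller might guard against this is to let only one call run at a time and keep just the most recent request made in the meantime. The sketch below assumes that each plotAPI call returns a promise that settles once the update has been applied, which is an assumption and not confirmed here. LatestOnlyQueue is a hypothetical helper, and whether it would avoid the regl error above is untested.

```typescript
// Hypothetical helper: runs one update at a time and drops stale intermediate requests.
type Apply<T> = (opts: T) => Promise<unknown>;

class LatestOnlyQueue<T> {
  private apply: Apply<T>;
  private busy = false;
  private hasPending = false;
  private pending: T | undefined = undefined;

  constructor(apply: Apply<T>) {
    this.apply = apply;
  }

  push(opts: T): void {
    if (this.busy) {
      // An update is in flight: remember only the newest request.
      this.pending = opts;
      this.hasPending = true;
      return;
    }
    void this.run(opts);
  }

  private async run(opts: T): Promise<void> {
    this.busy = true;
    try {
      await this.apply(opts);
    } catch (err) {
      console.error(err); // keep the queue alive if one update fails
    } finally {
      this.busy = false;
    }
    if (this.hasPending) {
      const next = this.pending as T;
      this.pending = undefined;
      this.hasPending = false;
      void this.run(next);
    }
  }
}

// Hypothetical usage with a slider (names are illustrative only):
// const zoomQueue = new LatestOnlyQueue((opts) => plot.plotAPI(opts));
// slider.oninput = () => zoomQueue.push({ /* new options */ });
```

The design choice is to coalesce rather than debounce by time: the first change is applied immediately, and a burst of slider events results in at most one follow-up call carrying the latest values.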
Inconsistencies
Types vs Usage
plotAPI(...opts) takes an APICall object whose type includes background_options, but background_color is read directly from prefs. So setting prefs.background_options.color does nothing (yes, see the disclaimer at the top of this; the API is subject to change).
Complex Encoding Types
Disclaimer: I make no claims at being an expert programmer, especially in TypeScript. This is the most complex project that I am not the primary author of using TypeScript that I have investigated.
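It appears that there is muddled use of undefined and null. Consider the Encoding type:
So what is the difference between undefined and null?
undefined: declared variable, but unassigned value
null: can be used to explicitly indicate absence, since undefined is "missing".
Yet these null values are used for whether or not the value exists at all, and internally. But these cases could be handled by checking the argument input, or with conditional operators: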
```js
// option 1: check if null then coerce to defaults
if (prefs.zoom === null) {
  // set to default values, set to undefined, or raise error
}
// ...
// with defaults set if zoom is provided
if (prefs?.zoom) {}

// option 2: just add it as part of if / else
if (prefs.zoom === null) {
  // ...
} else if (prefs?.zoom) {
  // ...
}

// option 3: break channels into 4 cases:
// null, undefined, empty object, non-empty value
const isObj = (o) => o !== null && typeof o === 'object'
const isEmptyObj = (o) => (isObj(o) && Object.keys(o).length === 0)

switch (prefs.zoom) {
  case null:
    // handle explicitly missing
    break;
  case undefined:
    // handle not set
    break;
  default:
    if (isEmptyObj(prefs.zoom)) break;
    // not empty obj
}

const handleArg = (a, doNull, doUndef, doArg, doEmpty) => {
  switch (a) {
    case null:
      return doNull(a)
    case undefined:
      return doUndef(a)
    default:
      if (isEmptyObj(a)) return doEmpty(a)
      return doArg(a)
  }
}
```
Why complain about this? RootChannel is too complex a type and it gives me squiggles when working with it. Encoding is even worse because encoding.x can be null, undefined, a string, or an object with different fields. I don't like squiggles, and type guards (at least the ones I wrote) don't handle it well.
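One possible way to quiet the squiggles is to normalize a channel once at the boundary, so the rest of the code only ever sees "an object with a field" or "nothing". The types below are simplified, hypothetical stand-ins; deepscatter's real Encoding and RootChannel types are richer, and normalizeChannel is not part of the library.

```typescript
// Hypothetical, simplified stand-ins for the real channel types.
interface ChannelObject {
  field: string;
  domain?: [number, number];
  range?: [number, number] | string;
}
type ChannelInput = null | undefined | string | ChannelObject;

// Type guard: true only for a non-null object carrying a string `field`.
function isChannelObject(c: ChannelInput): c is ChannelObject {
  return typeof c === "object" && c !== null && typeof c.field === "string";
}

function normalizeChannel(c: ChannelInput): ChannelObject | null {
  if (c === null || c === undefined) return null; // explicitly cleared or never set
  if (typeof c === "string") return { field: c }; // bare field name
  if (isChannelObject(c)) return c; // already a full channel object
  return null; // malformed input
}

console.log(normalizeChannel("x")); // { field: 'x' }
console.log(normalizeChannel(undefined)); // null
```

This deliberately collapses null and undefined into one "absent" case; if the distinction matters, the same guard can feed the four-way handling shown above instead.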
Quadfeather
Remove required columns
Per this note:
Future codebase splits
The plotting components and the tiling components are logically quite separate; I may break the tiling strategy into a separate JS library called 'quadfeather'.
quadfeather was in fact split out into its own library. Going back and forth with some reproducible MWEs, the development team and I found some bugs with CSV file formats (see MWE: missing points from CSV). As their docs show:
"There was some old code hanging around, that was designed to aid a 3d version of quadfeather and so treated columns with the name z differently than other metadata columns (basically, it tried to start building an oct-tree instead of a quadtree, and had nowhere to put half the data.)"
I still have grievances with quadfeather. Namely, it does not work if you do not have an x and a y column in the DataFrame. Very annoying. Per my previous comments, deepscatter is at a higher level of abstraction than d3, so my data shouldn't need an x and y column. I should only need to tell deepscatter what x and y are. However, due to quadfeather, I am going to have to have x and y.
Inefficient sidecars
In my discussions with @bschmidt I informed them of my use case. Namely, I am working with single-cell data. I have an embedding (like PCA, t-SNE, PHATE, etc.) with $\gt 40,000$ cells and I want to dynamically change the color of points based on the expression of a given gene. I subsampled the $\gt 30,000$ genes to $\approx 6,000$ and I have even subsampled my data down to $\approx 20,000$ cells. So one matrix of $(20000,~~ 3)$ and another of $(20000,~~ 6000)$.
Using a yet-to-be-published quadfeather script, add_sidecars.py (courtesy of @bschmidt), I was able to partition my gene features and add them to my dataset (see MWE: quadfeather with sidecars, and live here).
However, when scaled to my actual data, a 2.56 GiB parquet file balloons to a comical 474.31 GiB. So that is not really a practical option. Yes, I did double check to make sure I was using single-precision floats (float32). I compared the initial data sizes against the sizes after add_sidecars.py with float32 and with float16.
This is especially bad as my original solution was $6000$ single-column CSV files that I had statically hosted on GitHub and just fetched one at a time. That should not be more efficient than quadfeather.
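For a sense of scale, here is a back-of-the-envelope calculation of how large the subsampled gene matrix would be if stored densely and uncompressed. This is only a rough baseline under stated assumptions: it uses the subsampled $(20000,~~ 6000)$ shape rather than the actual data behind the 2.56 GiB and 474.31 GiB figures, and it ignores file-format overhead and compression. denseSizeGiB is a hypothetical helper.

```typescript
const GiB = Math.pow(1024, 3);

// Size of a dense rows x cols matrix with a fixed number of bytes per value.
function denseSizeGiB(rows: number, cols: number, bytesPerValue: number): number {
  return (rows * cols * bytesPerValue) / GiB;
}

const asFloat32 = denseSizeGiB(20000, 6000, 4);
const asFloat16 = denseSizeGiB(20000, 6000, 2);

console.log(asFloat32.toFixed(2)); // 0.45
console.log(asFloat16.toFixed(2)); // 0.22
```

So the raw values of the subsampled matrix fit in well under 1 GiB, which suggests that an output in the hundreds of GiB comes from how the sidecars are written rather than from the data itself.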
Recommendations
It wouldn't be fair to just state my grievances with the package without providing some ideas on how to tackle them.
In general my recommendations break down into 3 parts:
create a new class (or augment the Scatterplot class) to handle the interface and tradeoff of state between the user and the deepscatter library,
expose methods for setting and getting individual aspects of APICall, and
develop individual svelte components for each argument of APICall.
So let's break down why I think this is the way to go moving forward and how it may help improve the deepscatter library.
A case for a state class
I understand deepscatter is meant to be framework agnostic. So I am not going to explicitly state that this should be a svelte store (see the next section, Contributions). However, I believe having an interface layer between the Scatterplot and the end user may be of interest to the developers of deepscatter, especially as things (like the API) are still in flux. By having an intermediate layer you can change anything you want about deepscatter's Scatterplot without affecting the user endpoints. Additionally, it will make it easier for others to contribute in a non-breaking way. For example, if they want to add features by working on this state middleware, the complex core logic of reading the quadfeather data remains safe and unchanged. In addition, it allows more control over when the middleware state changes actually trigger Scatterplot changes (e.g. by debouncing / checking whether or not the plot is ready, like its queue, etc.).
Expose getters and setters
Much like the state class, these will allow the user's endpoint to be unaffected by internal changes. It also makes deepscatter more declarative and d3-like. For example one could imagine:
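A minimal sketch of what such accessors might look like, assuming only that the plot exposes a plotAPI(opts) method as used earlier in this review. PlotLike, Channel, PlotState, getChannel and setChannel are hypothetical names for illustration and are not part of deepscatter.

```typescript
// Hypothetical, simplified channel shape.
type Channel = {
  field: string;
  domain?: [number, number];
  range?: [number, number] | string;
};

// Hypothetical stand-in for the object returned by deepscatter.
interface PlotLike {
  plotAPI(opts: { encoding: Record<string, Channel> }): unknown;
}

class PlotState {
  private plot: PlotLike;
  private channels: Record<string, Channel> = {};

  constructor(plot: PlotLike) {
    this.plot = plot;
  }

  // Getter: read back what was last set, without touching the plot.
  getChannel(name: string): Channel | undefined {
    return this.channels[name];
  }

  // Setter: record the channel, forward only that change, and allow chaining.
  setChannel(name: string, channel: Channel): this {
    this.channels = { ...this.channels, [name]: channel };
    this.plot.plotAPI({ encoding: { [name]: channel } });
    return this;
  }
}

// Usage with a fake plot that just records what it was sent.
const sent: unknown[] = [];
const fakePlot: PlotLike = { plotAPI: (opts) => sent.push(opts) };
const state = new PlotState(fakePlot)
  .setChannel("x", { field: "x" })
  .setChannel("color", { field: "conditions", domain: [0, 3] });
console.log(state.getChannel("color"), sent.length);
```

Because callers only touch getChannel and setChannel, the shape of the underlying plotAPI call could change without breaking them.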
Development via individual svelte components
A couple of reasons.
It is sort of like extra tests and auto documentation all in one. Want to work on animations for changing the x channel of encoding? An axis select lets you do this and you can point users there for how to deal with encodings.
It makes it easier for users to build on top of. Rather than having to go through the code base to find out what has changed, etc., they can start with that component.
It makes it easier for users to use your library. They can just take the components and run with them.
Contributions
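Note: to the developers of deepscatter, feel free to take any of the code, ideas, MWEs, etc. from the links below if you feel like they help. These are how I found the bug regarding quadfeather and CSVs, the discrepancies between APICall and prefs, and the lack of scalability of add_sidecars.py.
Given the discussions I have had with @bschmidt, my experience with deepscatter (frustration and excitement), and this post, I am pasting some links, which include some notebooks for reproducing MWEs and demonstrating how I am using deepscatter:
featherplot (python)
featherplot (svelte)
featherplot (python) Pages
featherplot (svelte) Pages
MWE: quadfeather CSV pipeline
MWE: quadfeather inefficient with sidecars
DEMO: featherplot (python)
Of note, I made two libraries, both called featherplot. One is in python, the other in svelte.
Python
This notebook from the python library showcases some of the frustrations I have had with quadfeather and deepscatter.
Svelte
Likewise the svelte version does a few things of note:
usage
On +page.svelte we can see how a store can simplify the rendering of a deepscatter object:

```svelte
<script>
// imports
// ...
// for data from +page.ts
export let data
$: meta = data?.meta
onMount(async () => {
  await meta
  plotStore.url = `${base}${meta.tiles_dir}`
  plotStore.totalPoints = meta.n_points // for max point slider
  plotStore.embedding = meta.embedding // columns for showing on x / y
  plotStore.sidecars = meta.sidecars // columns for coloring by
  plotStore.columns = meta.columns_metadata // domain / range / categories of all columns
  // NOTE: setters will auto create the RootChannels based on metadata :)
  plotStore.xField = meta.embedding[0] // 'x'
  plotStore.yField = meta.embedding[1] // 'y'
  plotStore.cField = meta.sidecars[0] // 'conditions'
})
</script>
```

then for the HTML:
yay, so simple.
types
We can also simplify types.ts:
Now all the channels have a common ancestor due to their requirement of the field field.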
state
and in PlotStore.ts, we find a bunch of useful stuff:

```ts
// ...
public async LoadDeepscatter() {
  const Scatterplot = await import('deepscatter');
  this.Deepscatter = Scatterplot.default;
  this.debug({ status: 'Deepscatter imported' }, Scatterplot);
}
// ...
// the autofit stuff from earlier
// ...
private _schema: any | null = null;
get schema() {
  return this._schema;
}
set schema(s: any | null) {
  this._schema = s;
}
get fields() {
  return (this.ready ? this.plot?._root?.schema?.fields : []);
}
get extents() {
  return (this.ready ? this.plot?._root?._extents : {}) as Extents;
}
get xExtent() {
  return this.extents?.x || null;
}
get yExtent() {
  return this.extents?.y || null;
}
get zExtent() {
  return this.extents?.z || null;
}
// ...
// ...
get xField() {
  return this.getEncodingField('x');
}
// NOTE: setting store.xField automatically creates the RootChannel!
set xField(x: string | undefined) {
  this.setEncoding('x', this.getColumnAsEncoding(x));
}
// ...
// can easily check x, y, and color encodings
get x() {
  if (!this.xField) return undefined;
  if (!this.columns) return undefined;
  return this.getColumnAsEncoding(this.xField as EncodingKey);
}
get y() {
  if (!this.yField) return undefined;
  if (!this.columns) return undefined;
  return this.getColumnAsEncoding(this.yField as EncodingKey);
}
get color() {
  if (!this.cField) return undefined;
  if (!this.columns) return undefined;
  return this.getColumnAsEncoding(this.cField as EncodingKey);
}
// ...
```
This is how, in the usage, plotStore.xField = 'x' works: the setter for xField will automatically create the RootEncoding from the metadata. Of course we could always manually set the x encoding if we needed more customization, but generally the domain of x is constant, its range is dependent on the transformation, etc.
Looking at the demo, we see how much the components help with making deepscatter more accessible so others can appreciate all of your hard work.
Conclusion
deepscatter is a great library. I can't wait for the NOMIC.ai team to continue to improve upon it. Special thanks to @bschmidt.
Acknowledgements
I'd like to take a moment to express my heartfelt gratitude to @bschmidt for all of their assistance while I was working with their library, and for their support, feedback, and even suggestions to improve my project based on an M.W.E. Truly a kind person who had no reason to be so patient and understanding. 😇
🙏🙏🙏🙏🙏🙏🙏🙏
Links
d3
@bschmidt
deepscatter
quadfeather
add_sidecars.py
Vega-Lite
featherplot (python)
featherplot (svelte)
featherplot (python) Pages
featherplot (svelte) Pages
MWE: missing points from CSV
MWE: quadfeather CSV pipeline
MWE: quadfeather inefficient with sidecars
DEMO: featherplot (python)
featherplot on NPM
featherplot on PyPI