Deepscatter Review from an end-user and developer (usage, recommendations, feature requests, bugs, MWEs, troubleshooting, and more!) #85

dsm-72 · 2023-06-19T16:28:13Z

Deepscatter Review

Disclaimer: deepscatter is a relatively small project with just 9 contributors, primarily @bmschmidt. It is also under active development / improvement:

API

This is still subject to change and is not fully documented. The encoding portion of the API mimics Vega-Lite with some minor distinctions to avoid deeply-nested queries and to add animation and jitter parameters.

The team has been super supportive and responsive and I am writing this first and foremost out of a place of good faith as an end user and to give the developer team feedback which may be of interest to them.

TL;DR

4/5 + GitHub ⭐️

The good: Built for a niche use case and great for what it is. It's maintainers are active and supportive. Despite being under active development, it works (not always the case with in-development software) and examples work as advertised.

The bad: I am inpatient! I want it to be finished and fully polished already because the potential is amazing.

Requested features

3D

🙏 PLEASE 🙏

Resize-able

This is very important to my use case. Who likes an overflowing or underfilled div?

I know that my prayers are in the process of being answered (PR#81) @bschmidt = 😇.

Flip y-axis

In a discussion with @bschmidt we talked about why a plot I created was "flipped" when rendered. To the unfamilair this stems from how the web renders pages with $y=0$ being the top of the page and increasing as you go down, whereas most plots we are familiar have lower $y$-values at the bottom and go up.

I firmly think the default behavior should be flipped. Why? Well they use and adopt many of d3's conventions. While this is also the default behavior of d3, the difference is exposure. The Scatterplot object returned by new Deepscatter(htmlID, width, height) does not have a convenient method exposing the y-axis's range e.g. plot.yAxis.range(). Instead all of this must be specified fully in the plotAPI.encoding (based on, but divergent from Vega-Lite).

The point being, that clearly deepscatter, just by the name having "scatter" in it means that it is at a higher level of abstraction than d3 and yet, requiring the user to specify the range in the plotAPI.encoding.y channel is still coupled to these lower levels. By default it should render the way most people expect that it should render as.

Calculate domain and range

Similar to flipping the y-axis since there is not a plot.yAxis.range() method, there are not plot.getDomain(feature) and plot.getRange(feature) methods. This matters. As stated deepscatter is at a higher level of abstraction. So without these methods we, the end user, must also schlep around meta-data that isn't even accessible under plot._iroot.schema, plot._iroot.schema.fields, etc. This is even more frustrating as an end user as to use deepscatter we explictily have to use quadfeather. Additionally, as stated above, how plot handles the arguments of plotAPI, especially plotAPI.encoding changes depending on how much information is provided. So specifying the correct range but not the domain yields undesirable behavior and vis-versa (doubly so for colors).

Thankfully I am not completely alone in this opinion:

"The challenge here is that we can’t confidently calculate extent of metadata in deepscatter because there is not a guarantee that the browser has access to all the data in the dataset--if you have 100 million points, we won’t load all of them."

It might be possible and reasonable to do this in quadfeather though and add it as column metadata.
@bschmidt

Some other meta data for consideration:

the total number of points in the dataset should also be included somewhere (to set a "max number of points" slider).
categories of categorical features (or at least a method that loops over the current categories)

Fix encodings

By far the most aggrevating part of deepscatter as an end user are the channels encoding. While having the ability to specify these things is nice, per all my previously mentioned notes combined the number of channels, it aggreagates to be a very clunky feeling interface.

Additionally, per this demo (featherplot (svelte) Pages) one can see that

encoding = {
    color: {
        field: "conditions",
        domain: [0, 3] // NOTE: <--- categorical i.e. 0, 1, 2, 3, but values 
                       // are "condition 1", "condition 2", "condition 3", etc        
    },
}

it doesn't seem easy to have a categorical colors, even though the exported column was categorical.

Likewise, if range: "viridis" is not specified, even with domain: [-3.97, 3.25], you will not see a color gradient.

Under the section recommendations, I suggest some ways encoding can be made more friendly for users.

Fix plotAPI

Sometimes plot.plotAPI(...opts) can handle rapid changes in values. Othertimes, it can not. Going back to featherplot (svelte) Pages, moving the number of points slider rapidly works.

BUG: Moving the zoom box slider does not (well at least not when deployed. It sometimes works with hot reloading in development mode.)

[Error] Error: (regl) cannot cancel a frame twice
f — deepscatter.39d51a89.js:2:90
c — deepscatter.39d51a89.js:2:149
Be — deepscatter.39d51a89.js:8:52497
(anonymous function) — deepscatter.39d51a89.js:1688:31555
	(anonymous function) (2.5a5494e2.js:32:8792)

Inconsistencies

Types vs Usage

plotAPI(...opts) takes an APICall object which has:

type APICall = {
    ...,
    background_options?: BackgroundOptions
}

type BackgroundOptions = {
    ...,
    color?: string
}

but... background_color is directly read from prefs:

ctx.fillStyle = prefs.background_color || 'rgba(133, 133, 111, .8)';

So setting prefs.background_options.color does nothing (yes, see disclaimer at the top of this. API is subject to change).

Complex Encoding Types

Disclaimer: I make no claims at being an expert programmer, especially in TypeScript. This is the most complex project that I am not the primary author of using TypeScript that I have investigated.

It appears that there is muddled use of undefined and null. Consider the Encoding type:

export type Encoding = {
    x?: RootChannel;
    y?: RootChannel;
    color?: null | ColorChannel;
    size?: null | RootChannel;
    shape?: null | RootChannel;
    filter?: null | FunctionalChannel;
    filter2?: null | FunctionalChannel;
    jitter_radius?: null | JitterChannel;
    jitter_speed?: null | RootChannel;
    x0?: null | RootChannel;
    y0?: null | RootChannel;
    position?: string;
    position0?: string;
    foreground?: null | FunctionalChannel;
  };

So what is the difference between undefined and null

undefined: declared variable, but unassigned value
null: can be used to explicity indicate abscence since undefined is "missing".

Yet these null values are used for whether or not the value exists at all, and internally

if (prefs.zoom !== undefined) {
    // ...
    if (prefs.zoom === null) {
        // ...
    }
    // ...
}
// ...
if (prefs.label !== undefined) {
    // ...
    if prefs.labels === null {
        // ...
    }
    // ...
}

put these cases could be handled by checking argument input, or conditional operators

// option 1: check if null then coerce to defaults
if (prefs.zoom === null) {
    // set to default values, set to undefined, or raise error
}
// ...
// with defaults set if zoom is provided
if (prefs?.zoom) {

}

// option 2: just add it as part of if / else
if (prefs.zoom === null) {
    // ...
} else if (prefs?.zoom) {
    // ...
}


// option 3: break channels into 4 cases:
// null, undefined, empty object, non-empty value
const isObj = (o) => o !== null && typeof o === 'object'
const isEmptyObj = (o) => (isObj(o) && Object.keys(0).length === 0)
switch (prefs.zoom) {
    case null:
        // handle explicilty missing
        break;
    case undefined:
        // handle not set
        break;
    default:
        if (isEmptyObj(o)) break;
        // not empty obj
}

const handleArg = (a, doNull, doUndef, doArg, doEmpty) => {
    switch (a) {
        case null:
            return doNull(a)            
        case undefined:
            return doUndef(a)            
        default:
            if (isEmptyObj(a)) return doEmpty(a)
            return doArg(a)
    }
}

Why basically complain about this? RootChannel is too complex a type and it gives me squiggles when working with it. Encoding is even worse because it encoding.x can be null, undefined, string, or basically an object with differnet fields. I don't like squiggles and type guards (at least the ones I wrote) don't handle it well.

Quadfeather

Remove required columns

Per this note:

Future codebase splits

The plotting components and the tiling components are logically quite separate; I may break the tiling strategy into a separate JS library called 'quadfeather'.

quadfeather was in fact abstracted away. While we - the development team and I going back and forth with some reproducable MWEs - found some bugs with CSV file formats (see MWE: missing points from CSV), as on their docs it shows:

quadfeather --files tmp.csv --tile_size 50_000 --destination tiles

"There was some old code hanging around, that was designed to aid a 3d version of quadfeather and so treated columns with the name z differently than other metadata columns (basically, it tried to start building an oct-tree instead of a quadtree, and had nowhere to put half the data.)"

@bschmidt

I still have greviances with quadfeather. Namely, it does not work if you do not have a x and y column in the DataFrame. Very annoying. Per my previous comments, deepscatter is at a higher level of abstract than d3, my data doesn't need an x and y column. I should only need to tell deepscatter what x and y are. However, due to quadfeather, I am going to have to have x and y.

Inefficient sidecars

In my discussions with @bschmidt I informed them of my use case. Namely, I am working with Single-Cell data. I have an embedding (like PCA, t-SNE, PHATE, etc) with $\gt 40,000$ cells and I want to dynamically change the color of points based on the expression of a given gene. I subsampled the $\gt30,000$ genes to $\approx 6,000$ and I have even subsampled my data down to $\approx 20,000$ cells. So one matrix of $(20000,~~ 3)$ and another of $(20000,~~ 6000)$.

Using a yet published to quadfeather script add_sidecars.py (courtesy of @bschmidt) I was able to partition my gene features and add them to my dataset (see MWE: quadfeather with sidecars, and live here).

However: when scaled to my actual data. A 2.56 GiB parquet file balloons to comical 474.31 GiB. So that is not really a practical option. Yes, I did double check to make sure I was using doubles (float32).

Initial data sizes:

After add_sidecars.py with float32:

After add_sidecars.py with float16:

This is especially bad as my original solution was $6000$ single column CSV files that I had statically hosted on GitHub and just fetched them one at a time. That should not be efficient than quadfeather.

Recommendations

It wouldn't be fair to just state my grievances when working the package without providing some ideas on how to perhaps tackle it, if not improve it.

In general my recommendations break down into 3 parts:

create a new class (or augment the Scatterplot class) to handle the interface and tradeoff of state between the user and the deepscatter library,
expose methods for setting and getting individual aspects of APICall, and
develop individual svelte components for each argument of APICall.

So lets break down why I think this is the way to go moving forward and how it may help improve the deepscatter library.

A case for a state class

I understand deepscatter is meant to be framework agnostic. So I am not going to explicilty state that this should be a svelte store (see the next section contributions). However, I believe having an interface layer between the Scatterplot and the end user may be of interest to the developers of deepscatter, espeically as things (like the API) are still in flux. By having an intermediate layer you can change anything you want about deepscatter's Scatterplot without affecting the user endpoints. Additionally it will also make it easier for others to contribute in a non-breaking way. For example, if they want to add features, by working on this state middleware, the complex and core logic of reading the quadfeather data remains safe and unchanged. In addition, it allows more control over when the middleware state changes actually trigger Scatterplot changes (e.g. by debouncing / checking whether or not the plot is ready like its queue, etc).

Expose getters and setters

Much like an the state class, these will allow the user's endpoint to be unaffected by internal changes. It also makes deepscatter more declaritive and d3-like. For example one could imagine:

const plot = new Deepscatter(...args, width, height)
                .x('PCA 1').xscale('log').xdomain('auto')
                .y('PCA 2').xscale('linear').ydomain('auto')
                .yrange([1, 0])
                .c('index').color('viridis')
                .autosize(true)
                .bgColor('transparent')

Development via individual svelte components

A couple of reasons.

it is sort of like extra tests and auto documentation all in one. Want to work on animations for changing the x channel of encoding? An axis select lets you do this and you can point users there for how to deal with encodings.
it makes it easier for users to build on top of. Rather than have to go through the code base to find out what has changed, etc they can start with that component.
it makes it easier for users to use your library. They can just take the components and run with it.

Contributions

Note: to the developers of deepscatter, feel free to take any of the code, ideas, MWEs, etc from the links below if you feel like they help. These are how I found the bug regarding quadfeather and CSVs, the discrepancies of APICall and prefs, the lack of scalability for add_sidecars.py.

Given the discussions I have had with @bschmidt, my experience with deepscatter (frustration and excitement), and this post I pasting some links which has
some notebooks for reproducing MWEs and demonstrating how I am using deepscatter.

Of note, I made two libraries, both called featherplot. One is in python, the other in svelte. The python

Python

This notebook from the python library show cases some of the frustrations I have had with quadfeather with deepscatter.

# load data

# ...

# Class for renaming embedding columns automatically for quadfeather
qfr = QuadFeatherRenamer(df_all)

# ...

# Grab metadata from dataframe and be ready to partition into main dataframe and sidecars
d2m = DataFrameToMetadata(
    df_q, 
    include_index=True,
    embedding='x y z'.split(),
    alt_names={v:k for k,v in renamed.items()}
)

succ, fail = d2m.convert()
meta = d2m.to_meta()

# ...

!quadfeather --files {p_file} \
             --tile_size {tile_size} \
             --destination {qf_dir}

# ... 

feather.write_feather(d2m.df.drop(columns=d2m.embedding), f_file)
!featherplot add-sidecars --tileset {qf_dir}\
                         --sidecar {f_file} --key {d2m.df.index.name};

Svelte

Likewise the svelte version does a few things of note:

simplifies types
implements store
MWE playground for debugging

usage

On +page.svelte we can see how a store can simplify the rendering of a deepscatter object:

// imports
// ...
// for data from +page.ts
export let data
$: meta = data?.meta

onMount(async() => {
    await meta        
    plotStore.url = `${base}${meta.tiles_dir}`

    plotStore.totalPoints = meta.n_points // for max point slider

    plotStore.embedding = meta.embedding // columns for showing on x / y
    plotStore.sidecars = meta.sidecars // columns for coloring by
    plotStore.columns = meta.columns_metadata // domain / range / categories of all columns

    // NOTE: setters will auto create the RootChannels based on metadata :)
    plotStore.xField = meta.embedding[0] // 'x'
    plotStore.yField = meta.embedding[1] // 'y'
    plotStore.cField = meta.sidecars[0] // 'conditions'
})
</script>

then for the HTML:

<div class="grid grid-cols-4 h-screen justify-content-center">
    <div class="">
    </div>
    <div class="col-span-2">
        <FeatherPlot class="border-2 border-sky-400"/>
        <div class="bg-slate-100 p-4 max-h-92 overflow-scroll">
            <MaxPointsSlider on:change={handleChange} />
            <AxesSelect on:change={handleChange} />
            <ColorSelect on:change={handleChange} />
            <ZoomCall on:change={handleChange} />             
        </div>        
    </div>
    <div class="">
        
    </div>
</div>

yay, so simple.

types

We can also simplify types.ts:

// ...
export type BaseConstantChannel<T> = {
    constant: T;
};

export type ConstantBool = BaseConstantChannel<boolean>;
export type ConstantNumber = BaseConstantChannel<number>;
export type ConstantColorChannel = BaseConstantChannel<string>;
export type ConstantChannel = ConstantBool | ConstantNumber | ConstantColorChannel;

export interface ChannelBase {
    field: string;
    [name: string]: any;
}

export interface ConditionalChannel extends ChannelBase {
    a: number;
    b?: number;
    op: Conditional;
}

export interface LambdaChannel extends ChannelBase {
    lambda: string;
    domain: Domain;
    range: Range;
}

export interface BasicChannel extends ChannelBase {
    transform?: Transform;
    domain?: MinMax;
    range?: MinMax;
    [name: string]: any;
}
export interface CategoricalChannel extends ChannelBase {}

Now all the channels have a common ancestor due to their requirement of the field field.

state

and in PlotStore.ts, we find a bunch of useful stuff

// ...

public async LoadDeepscatter() {
    const Scatterplot = await import('deepscatter'); 
    this.Deepscatter = Scatterplot.default; 
    this.debug({ status: 'Deepscatter imported'}, Scatterplot);
}

// ...

// the autofit stuff from ealier

// ...

private _schema: any | null = null;
get schema() {
    return this._schema;
}
set schema(s: any | null) {
    this._schema = s;
}
get fields() {
    return (this.ready ? this.plot?._root?.schema?.fields : []);
}
get extents() {
    return (this.ready ? this.plot?._root?._extents : {}) as Extents;
}
get xExtent() { return this.extents?.x || null; }
get yExtent() { return this.extents?.y || null; }
get zExtent() { return this.extents?.z || null; }

// ...


// ...
get xField() {
    return this.getEncodingField('x');
}

// NOTE: setting store.xField automatically create the RootChannel!
set xField(x: string | undefined) {
    this.setEncoding('x', this.getColumnAsEncoding(x));
}


// ...
// can easily check x, y, and color encodings
get x() {
    if (!this.xField) return undefined;
    if (!this.columns) return undefined;
    return this.getColumnAsEncoding(this.xField as EncodingKey);
}
get y() {
    if (!this.yField) return undefined;
    if (!this.columns) return undefined;
    return this.getColumnAsEncoding(this.yField as EncodingKey);
}
get color() {
    if (!this.cField) return undefined;
    if (!this.columns) return undefined;
    return this.getColumnAsEncoding(this.cField as EncodingKey);
}

// ...

This is how in the usage we saw plotStore.xField = 'x' works as the setter for xField will automatically create the RootEncoding from the metadata. Of course we could always manually set the x encoding if we needed more customization, but generally the domain of x is constant, its range is depednent on the transformation, etc.

Looking at the demo, we see how much the components help with making deepscatter more accessible so others can appreciate all of your hard work.

Conclusion

deepscatter is a great library. I can't wait for the NOMIC.ai team to continue to improve upon it. Special thanks to @bschmidt.

Acknowledgements

I'd like to take a moment to express my heartfelt gratitude to @bschmidt for all of their assistance while working with their library, their support, feedback and even provided suggestions to improve my project based on a M.W.E. Truly a kind person and had no reason to be so patient and understanding. 😇

🙏🙏🙏🙏🙏🙏🙏🙏

Links

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deepscatter Review from an end-user and developer (usage, recommendations, feature requests, bugs, MWEs, troubleshooting, and more!) #85

Deepscatter Review from an end-user and developer (usage, recommendations, feature requests, bugs, MWEs, troubleshooting, and more!) #85

dsm-72 commented Jun 19, 2023

Deepscatter Review from an end-user and developer (usage, recommendations, feature requests, bugs, MWEs, troubleshooting, and more!) #85

Deepscatter Review from an end-user and developer (usage, recommendations, feature requests, bugs, MWEs, troubleshooting, and more!) #85

Comments

dsm-72 commented Jun 19, 2023

Deepscatter Review

TL;DR

Requested features

3D

Resize-able

Flip y-axis

Calculate domain and range

Fix encodings

Fix plotAPI

Inconsistencies

Types vs Usage

Complex Encoding Types

Quadfeather

Remove required columns

Inefficient sidecars

Recommendations

A case for a state class

Expose getters and setters

Development via individual svelte components

Contributions

Python

Svelte

usage

types

state

Conclusion

Acknowledgements

Links