[ENH] Update t-SNE widget #6345

Merged
merged 20 commits into from
Jan 12, 2024

pavlin-policar
Collaborator

@pavlin-policar pavlin-policar commented Feb 23, 2023

Description of changes

This PR revamps the t-SNE widget.

Changes include:

  • Update parameter defaults to be in line with modern best practices. Rely on openTSNE for defaults in the future
  • Add an option to disable PCA preprocessing, which has no benefit on data sets with a small number of columns. Warn the user if PCA would be useful but is disabled
  • Add support for spectral initialization (in addition to the existing PCA initialization)
  • Add support for different distance metrics, currently L1, L2, and cosine distances
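
For reference, the three metrics can be written out in plain Python. This is only an illustrative sketch of the definitions, not the code the widget or openTSNE uses; the function names are made up for this example.

```python
import math


def l1_distance(a, b):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean (L2) distance: square root of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

Note that cosine distance ignores vector length, so two rows that differ only by a scale factor end up at distance (approximately) zero.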

The main change is added support for a Distances signal, as requested by @BlazZupan, which required a large number of changes. I've tried to make the behaviour of the t-SNE widget as similar as possible to the MDS widget, which also supports both Data and Distances signals. Using a distance matrix defaults to spectral initialization, and ignores any preprocessing steps.

The only discernible difference between the two widgets is the following:

When the MDS widget gets a Distance matrix, it computes the embedding. If it then gets an incompatible Data table, it will hide the embedding, which makes sense. If, however, the data table has other issues which get caught by MDS's error checking, e.g. the data table has fewer than two rows, it will show an error for that, and still keep the embedding visible. The t-SNE widget does not do this. The t-SNE widget shows the "Incompatible data" error and hides the embedding for any error.

I have also found a potential improvement for both MDS and t-SNE, which will not be included in this PR: If the widget gets a valid distance matrix, it computes the embedding. If, however, it then gets an incompatible data table, it clears everything. Once the offending data table is removed from the input, the embedding has to be recomputed. If the widgets cached the embedding for a given distance matrix, even when the input data signal is invalid, the recomputation could be avoided.
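
A minimal sketch of the caching idea, assuming a hypothetical helper class (`EmbeddingCache` and `get_or_compute` are not part of Orange): the cache keeps a reference to the last distance matrix and reuses the stored embedding as long as the same matrix object is still on the input.

```python
class EmbeddingCache:
    """Hypothetical helper: keeps the embedding of the last distance matrix."""

    def __init__(self):
        self._matrix = None
        self._embedding = None

    def get_or_compute(self, matrix, compute):
        """Return the cached embedding, recomputing only for a new matrix."""
        if matrix is None:
            return None
        if matrix is not self._matrix:
            # A different matrix object arrived, so the old result is stale.
            self._embedding = compute(matrix)
            self._matrix = matrix
        return self._embedding


calls = []


def fake_embedding(matrix):
    """Stand-in for the expensive t-SNE/MDS computation."""
    calls.append(matrix)
    return [(0.0, 0.0) for _ in matrix]


cache = EmbeddingCache()
distances = [[0, 1], [1, 0]]

first = cache.get_or_compute(distances, fake_embedding)
# An incompatible data table arrives and is removed again; the widget would
# hide the embedding in between, but the cache itself is left untouched.
second = cache.get_or_compute(distances, fake_embedding)
```

Here the second call returns the stored result without calling `fake_embedding` again. Comparing by identity is a simplification; a real widget would also have to drop the cache whenever a parameter that affects the embedding changes.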

Includes
  • Code changes
  • Tests
  • Documentation

@codecov

codecov bot commented Feb 23, 2023

Codecov Report

Merging #6345 (3b0a64d) into master (ad5dc9a) will decrease coverage by 0.36%.
Report is 97 commits behind head on master.
The diff coverage is 96.86%.

❗ Current head 3b0a64d differs from pull request most recent head 34c79d3. Consider uploading reports for the commit 34c79d3 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6345      +/-   ##
==========================================
- Coverage   87.78%   87.43%   -0.36%     
==========================================
  Files         321      317       -4     
  Lines       69445    68455     -990     
==========================================
- Hits        60962    59852    -1110     
- Misses       8483     8603     +120     

@pavlin-policar
Collaborator Author

One more thing I'd like feedback on. Both t-SNE and MDS use ContextSettings, and use the "effective data" to open the context. The "effective data" is either taken from the distance matrix .row_items, or from the Data signal. This is good because, e.g., if we calculate Iris distances, send that as the Distance signal, set a color attribute, remove the Distance signal, and add the original data table to the Data signal, the data will still be colored according to the feature chosen before.

However, with t-SNE, a problem arises:

  1. Data (iris) → t-SNE
    Notice that the "Normalize" checkbox is checked
  2. Remove widgets.
  3. Data (iris) → Distances → t-SNE
    The entire preprocessing box is disabled, the "Normalize" checkbox is unticked to indicate it isn't being applied
  4. Remove widgets.
  5. Data (iris) → t-SNE
    The "Normalize" checkbox is disabled!

This happens because in both scenarios, the context being opened and saved is on the Iris data table. When a distance matrix is provided, the preprocessing controls are disabled and unchecked to indicate they aren't being used. However, when we re-open Iris using a data table (not distances), these settings would still be useful. Ideally, the context would remember whether or not we were using a DistMatrix or a Table instance.

This issue never arises in MDS because MDS has no such parameters.

@pavlin-policar pavlin-policar marked this pull request as ready for review February 23, 2023 17:55
@janezd janezd self-assigned this Feb 24, 2023
@janezd
Contributor

janezd commented Feb 24, 2023

Don't set self.normalize to False. Instead, initialize the checkbox like this:

        self.normalize_cbx = gui.checkBox(
            self.preprocessing_box, self, "normalize", "Normalize data",
            callback=self._invalidate_normalized_data, stateWhenDisabled=False,
        )

That is, add stateWhenDisabled=False. With this, a disabled checkbox will be shown as unchecked, but the attribute will keep its value (True if the box was checked, False if not). Of course, the widget must then take into account that the attribute can be True even when the option is unavailable.

You can do the same for others, e.g. PCA preprocessing.
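
A rough sketch of what "taking this into account" could look like, in plain Python with no Orange imports. The class and the `effective_*` property names are hypothetical and only illustrate the idea: the stored setting stays as the user left it, and the code that runs the preprocessing consults a derived value instead.

```python
class PreprocessingSettings:
    """Hypothetical stand-in for the widget's preprocessing settings."""

    def __init__(self, normalize=True, use_pca_preprocessing=True):
        # Stored settings: these keep their value while the controls are
        # disabled, which is what stateWhenDisabled=False allows.
        self.normalize = normalize
        self.use_pca_preprocessing = use_pca_preprocessing
        self.distance_matrix = None

    @property
    def effective_normalize(self):
        """Normalization applies only when there is no distance matrix."""
        return self.normalize and self.distance_matrix is None

    @property
    def effective_pca_preprocessing(self):
        """PCA preprocessing applies only when there is no distance matrix."""
        return self.use_pca_preprocessing and self.distance_matrix is None


settings = PreprocessingSettings()
settings.distance_matrix = [[0, 1], [1, 0]]  # Distances signal connected
stored_with_distances = settings.normalize
effective_with_distances = settings.effective_normalize

settings.distance_matrix = None  # back to a plain Data signal
effective_with_table = settings.effective_normalize
```

With a distance matrix on the input, `settings.normalize` is still True while `effective_normalize` is False; once the matrix is removed, the original choice applies again without anything having been overwritten.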

@pavlin-policar
Collaborator Author

Thanks! I didn't know about this option. I believe this now works. It works properly in Orange, but I can't get my test test_controls_ignored_by_distance_matrix_retain_values_on_table_signal to work. It looks like the context settings aren't being restored properly within the test. Any ideas why this might be happening?

@pavlin-policar
Collaborator Author

If I remember correctly, this PR still has a bug where the widget crashes if, for instance, the data contains NaN, and Normalize data is not checked. I am speaking from memory from a few months ago, so it might not be this exact thing, but something related to this. I'll need to fix that before this PR can be merged.

@BlazZupan
Contributor

Bombs when "Apply PCA preprocessing" is switched off and the data contains unknowns. Try with HDI data set from Datasets widget. I suggest to use the same preprocessor for imputation as used with PCA initialization.

@BlazZupan
Contributor

When distances are provided as input, the slider with PCA components should be disabled as well (not to confuse the user that PCA applies to this data).

@BlazZupan
Contributor

Could initialization of positions of the data instances be the same if distances are given in the input and if they are computed by t-SNE? When I compare the result on Iris data set, for example, and use the same distance function in Distances, the visualizations are slightly different. It would be great if they would be the same.

image

@pavlin-policar
Collaborator Author

pavlin-policar commented Sep 22, 2023

> Bombs when "Apply PCA preprocessing" is switched off and the data contains unknowns. Try with HDI data set from Datasets widget. I suggest to use the same preprocessor for imputation as used with PCA initialization.

This should now be fixed. We use the default t-SNE preprocessor at the start of the pipeline now.

> When distances are provided as input, the slider with PCA components should be disabled as well (not to confuse the user that PCA applies to this data).

I can't reproduce this; it's always disabled for me. Could you maybe provide your Qt version?

> Could initialization of positions of the data instances be the same if distances are given in the input and if they are computed by t-SNE? When I compare the result on Iris data set, for example, and use the same distance function in Distances, the visualizations are slightly different. It would be great if they would be the same.

In general, no. By default, t-SNE uses PCA initialization, which can only be computed when the original data matrix is available. If we pass in a distance matrix, we can't compute the PCA projection, so we instead initialize the t-SNE embedding using a spectral layout, which is computed from the similarity graph derived from the distance matrix.

However, the results should be the same if we use spectral initialization for both the Table input and the Distance matrix input. I see that this is, in fact, not the case, and some rotation seems to be happening. I will need to look into this further.
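
The rule described above can be summarized as a small decision function. This is a sketch only; `choose_initialization` is a hypothetical name and not the widget's actual code.

```python
def choose_initialization(has_data_table, requested="pca"):
    """Pick an initialization scheme that the available input can support.

    PCA needs the original data matrix, so with only a distance matrix on
    the input we fall back to spectral initialization, which works from
    the similarity graph.
    """
    if requested not in ("pca", "spectral"):
        raise ValueError(f"unknown initialization: {requested!r}")
    if requested == "pca" and not has_data_table:
        return "spectral"
    return requested


table_default = choose_initialization(has_data_table=True)
distances_default = choose_initialization(has_data_table=False)
table_spectral = choose_initialization(has_data_table=True, requested="spectral")
```

Only the last case, spectral initialization requested on a data table, can be expected to match the layout obtained from a distance matrix.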

@pavlin-policar
Collaborator Author

I've finally tracked down the bug. It turns out that when testing comboboxes, we can't just write

combo_box.setCurrentIndex(2)

in the tests. This doesn't update the widget settings. It seems as though we have to instead use the simulate.combobox_activate_index function, and change combobox values like so:

simulate.combobox_activate_index(combo_box, 2)

No idea why this works, but I'm just glad to have tracked down this bug.
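
A likely explanation, offered as an assumption rather than something confirmed in this thread: in Qt, QComboBox.setCurrentIndex() changes the selection and emits currentIndexChanged, but the activated signal is emitted only when the user picks an item. If the widget setting is updated from a handler connected to activated, a programmatic setCurrentIndex() never reaches it. The toy model below uses plain Python (no Qt; `ToyComboBox` and its methods are made up) to show the effect.

```python
class ToyComboBox:
    """Toy stand-in for a combo box with two separate notification lists."""

    def __init__(self):
        self.index = 0
        self.on_current_index_changed = []  # fires on any index change
        self.on_activated = []              # fires only on user choice

    def set_current_index(self, index):
        """Programmatic change: only the 'changed' handlers fire."""
        if index != self.index:
            self.index = index
            for handler in self.on_current_index_changed:
                handler(index)

    def user_activate(self, index):
        """Simulated user choice: changes the index, then fires 'activated'."""
        self.set_current_index(index)
        for handler in self.on_activated:
            handler(index)


settings = {"metric_index": 0}
combo = ToyComboBox()
# The setting is synced only from the 'activated' handler (assumption).
combo.on_activated.append(lambda i: settings.update(metric_index=i))

combo.set_current_index(2)
after_programmatic = settings["metric_index"]

combo.user_activate(1)
after_activation = settings["metric_index"]
```

In this model the setting is unchanged after `set_current_index(2)` and only updates after `user_activate(1)`, which mirrors the behaviour seen in the tests.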

This PR is now probably ready for review.

@pavlin-policar
Collaborator Author

Never mind. Apparently this times out on the CI servers. Does anyone have any idea how to fix this? @janezd @ales-erjavec @VesnaT

@ales-erjavec
Contributor

> Apparently this times out on the CI servers. Does anyone have any idea how to fix this?

Don't call the WidgetTest.show() method in the tests.

@pavlin-policar
Collaborator Author

Oh my, I forgot to remove that... Sorry, that was very silly of me! Hopefully it should work now then.

@pavlin-policar
Collaborator Author

This should now (really) be ready for review.

@janezd janezd removed their assignment Dec 8, 2023
@BlazZupan BlazZupan merged commit c0a60f2 into biolab:master Jan 12, 2024
22 of 23 checks passed
@pavlin-policar pavlin-policar deleted the update-tsne branch January 12, 2024 19:18

4 participants