blog/2023/sparse-autoencoders-for-interpretable-rlhf/ #16

utterances-bot · 2024-01-08T14:48:48Z

Sparse Autoencoders for a More Interpretable RLHF | Naomi Bashkansky

Extending Anthropic's recent monosemanticity results toward a new, more interpretable way to fine-tune.

https://naomibashkansky.com/blog/2023/sparse-autoencoders-for-interpretable-rlhf/

Butanium · 2024-01-08T14:48:49Z

Hi, nice blog, thanks for sharing it!
Just wanted to warn you that your HF and wandb keys are still in the training Colab you linked. Could you make the wandb report public? It would be helpful for checking how much compute you needed.
Also, there is an image missing for this description:

The second most frequent feature (feature index ...) in the Pythia 6.9B sparse autoencoder activates on the token "·the".

