notes.txt

1. synth_weights.npy contains two rows, one each for conditions 0 and 1, and 7 columns, one for each of the 7 classes.
2. Why is synth_condition_0_seg_bouttypes.npy shorter than the total length of the data? Because some motif instances contain insertions.
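3. Why are the 4 highest-probability motifs missing from synth_condition_0_seg_words.npy? (Only 9 motifs appear, and most of them are single bouts.) Although "a motif may never appear in the most likely partitioning even though it has non-zero pm", surely the few highest-probability motifs should not be absent altogether?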
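4. The README says: "If instead you have the data as a sequence of cluster labels, i.e., `hard' clustered data, then convert it into a sequence of probability vectors, and define a gmm model with means as the centers of the clusters and the circular standard deviation of 1.0". This approach is not good: a std of 1.0 is too large. Two alternatives:
   a. Use the labels directly as the means and set the std to a very small positive number. (Drawback: this ignores cases where the RNN output assigns similar probabilities to several classes; crudely using the hard cluster label is inappropriate there.)
   b. Use the RNN output probabilities as the feature (13-dimensional) and fit the GMM with the normal procedure. (Drawback: the distribution of this feature is certainly not Gaussian, because every coordinate is greater than 0, so it must be asymmetric. Then again, the data in the paper is unlikely to be Gaussian either.) (Note that this feature is constrained to a hyperplane, since the coordinates sum to 1, so the estimated covariance is necessarily degenerate, with one eigenvalue equal to 0.) (Fitting the GMM runs into ill-conditioned covariance matrices, and dropping a few features does not help. Instead, fit a Gaussian to each bout type separately and pass the resulting covariance, mean and weight directly to the later steps.) (How to guarantee positive semidefiniteness? Dropping 1 feature does not work either. Regularization? Ensure positive semidefiniteness by checking eigenvalues? See the solutions suggested by ChatGPT.)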
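5. The length of each segment can be written into dataname_lengths_condition0.npy (see the toy example), so that multiple segments from the same fish are treated as one condition and analyzed together with run_bass.py.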
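
A rough sketch for option b in note 4, using only NumPy. It is an illustration, not BASS's own API or file format: the function name fit_per_class_gaussians, the ridge parameter and the synthetic softmax data are all assumptions made here. It shows that probability vectors summing to 1 give a covariance with a (numerically) zero eigenvalue, and that adding a small ridge to the diagonal is one possible way to get a positive definite matrix, confirmed by checking eigenvalues.

```python
import numpy as np


def fit_per_class_gaussians(probs, labels, n_classes, ridge=1e-6):
    """Fit one Gaussian per class (hypothetical helper, not part of BASS).

    Returns means (K, D), ridge-regularized covariances (K, D, D) and
    weights (K,) given by the fraction of samples in each class.
    """
    dim = probs.shape[1]
    means = np.zeros((n_classes, dim))
    covs = np.zeros((n_classes, dim, dim))
    weights = np.zeros(n_classes)
    for k in range(n_classes):
        x = probs[labels == k]
        if len(x) < 2:
            raise ValueError(f"class {k} has fewer than 2 samples")
        weights[k] = len(x) / len(probs)
        means[k] = x.mean(axis=0)
        raw = np.cov(x, rowvar=False)
        raw = 0.5 * (raw + raw.T)  # enforce exact symmetry
        # The ridge lifts the zero eigenvalue caused by the sum-to-1 constraint.
        covs[k] = raw + ridge * np.eye(dim)
    return means, covs, weights


# Synthetic stand-in for the RNN output: softmax of random logits, 13 classes.
rng = np.random.default_rng(0)
logits = 2.0 * rng.normal(size=(5000, 13))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
labels = probs.argmax(axis=1)  # hard label = most probable class

means, covs, weights = fit_per_class_gaussians(probs, labels, n_classes=13)

# Unregularized covariance of one class: smallest eigenvalue is ~0.
raw_min_eig = np.linalg.eigvalsh(np.cov(probs[labels == 0], rowvar=False)).min()
# After the ridge, every covariance should have strictly positive eigenvalues.
reg_min_eig = min(np.linalg.eigvalsh(c).min() for c in covs)
print("raw min eigenvalue:", raw_min_eig)
print("regularized min eigenvalue:", reg_min_eig)
```

The ridge value trades off fidelity against conditioning: a larger value gives a better-conditioned matrix but inflates every variance, so it probably needs tuning on the real data.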