
MLrefine3D

Adrian Quintana edited this page Dec 11, 2017 · 1 revision

ml_refine3d

Purpose

This utility separates structurally heterogeneous data sets into homogeneous classes by multi-reference 3D angular refinement, using a maximum-likelihood (ML) target function. The mathematical details behind this approach are explained in


Scheres et al. (2007) Nat. Methods, 4, pp. 27-29.


Please cite this paper if this program is of use to you!

An intuitive understanding of the algorithm is that we perform an iterative angular refinement procedure using several reference volumes, where the discrete assignment of orientation and class (conventionally based on maximum cross-correlation) is replaced by probability-weighted integrals over all possible assignments. These probability weights are calculated using an explicit statistical model of the experimental noise, which is assumed to be white and Gaussian. The probability-weighted integrals are performed using routines from the maximum-likelihood multi-reference refinement program MLalign2D. The intermediate reconstructions of the reference volumes are performed using a modified version of the ART with blobs algorithm, which is able to solve the corresponding weighted least-squares problem.

Our recommended way of separating structurally heterogeneous data sets is to perform K independent runs of one iteration against K random subsets of the data, using a low-pass filtered version of the average map of the entire dataset as single reference. Then, the resulting maps are used as initial seeds for a K-reference refinement against the entire dataset. See the above paper for more details.

Note that the likelihood calculations are based on squared differences. This means that the absolute grey-scale of the reference volumes is very relevant (in contrast to refinement methods using cross-correlation coefficients). So make sure that projections of your initial reference volume yield images with the same grey-scales as your experimental images. A reconstruction using ART guarantees that the correct grey-scale is maintained. (See also F.A.Q. nr. 2!)

These calculations are extremely CPU- and memory-intensive. For example, using images of 64x64 pixels, 4 reference volumes and an angular sampling of 10 degrees requires ~1 Gb of memory, and each iteration will take several days of CPU time (depending on your computer) per 10,000 experimental images. More reference volumes or a finer angular sampling will take more memory and more CPU. Therefore, you will typically run this program in parallel on a powerful computer cluster. For this purpose, a parallel version of this program has been implemented in Mpi_MLrefine3D.

Usage


$ xmipp_ml_refine3d ...


Parameters

  • -i <selfile> = Input selfile with experimental images
  • -vol <volume/selfile> = Selfile with initial reference volumes, or a single reference volume
  • -o [rootname=ML3D] = Rootname for all output files
  • -iter [int=100] = Total number of iterations (of expectation maximization) to perform
  • -ang [float=10] = Angular sampling rate for the reference library projection directions (degrees)
  • -psi_step [float=ang] = Angular sampling rate for the in-plane rotations (degrees; defaults to the value of -ang)
  • -l [float=0.2] = Lambda relaxation parameter for the wlsART reconstruction algorithm
  • -k [float=0.5] = Kappa relaxation parameter for the wlsART reconstruction algorithm
  • -n [int=10] = Number of inner iterations for the wlsART reconstruction algorithm
  • -sym [symfile=""] = Enforce symmetry on the reference volume(s), and limit the orientational search to the asymmetric part
  • -sym_mask [maskfile=""] = Local symmetry: the symmetry is applied only within the mask. (Consequently, the orientational search will not be limited!)
  • -tilt0 [float=0.] / -tiltF [float=90.] = Restrict the angular search to tilt angles between tilt0 and tiltF
  • -restart [logfile=""] = Resume a run that has halted prematurely, after the last (completely finished) iteration
  • -norm = Use internal normalization (MLn3D algorithm) for data with normalization errors
  • See also the MLalign2D and ART programs for more options

Examples and notes

Separate a data set into 3 classes

Note that this functionality has been implemented in the much easier-to-use protocols! (see GettingStartedWithProtocols)

  1. Correct the absolute greyscale of the initial reference volume (also see the F.A.Q.s below):
    mkdir CorrectGreyscale
    xmipp_header_extract -i experimental_images.sel -o experimental_images.doc
    xmipp_angular_project_library -i bad_greyscale.vol -experimental_images experimental_images.doc -o CorrectGreyscale/ref -sampling_rate 15 -sym c1h -compute_neighbors -angular_distance -1
    xmipp_angular_projection_matching -i experimental_images.doc -o CorrectGreyscale/corrected_reference -ref CorrectGreyscale/ref
    xmipp_mpi_angular_class_average -i CorrectGreyscale/corrected_reference.doc -lib CorrectGreyscale/ref_angles.doc -o CorrectGreyscale/corrected_reference
    xmipp_mpi_reconstruct_wbp -i CorrectGreyscale/corrected_reference_classes.sel -o corrected_reference.vol -threshold 0.02 -sym c1 -use_each_image -weight
    Note that at this point the angular sampling (-sampling_rate 15 degrees) can be quite coarse, as we will typically low-pass filter the resulting map anyway.

  2. Low-pass filter the greyscale-corrected initial reference (also see the F.A.Q.s below):
    xmipp_fourier_filter -i corrected_reference.vol -o filtered_reference.vol -low_pass 80 -sampling 5.6
    Here we low-pass filter at 80 Angstroms (-low_pass 80), and the pixel size (-sampling) is 5.6 Angstroms.

  3. Create three random subsets of your data:
    xmipp_selfile_split -i experimental_all.sel -n 3 -o experimental_subset

  4. Perform a single iteration of ml_refine3d for each of the three subsets, using the low-pass filtered map as a single reference:
    mkdir GenerateSeed_1 GenerateSeed_2 GenerateSeed_3
    xmipp_ml_refine3d -i experimental_subset_1.sel -vol filtered_reference.vol -iter 1 -o GenerateSeed_1/seeds_split_1
    xmipp_ml_refine3d -i experimental_subset_2.sel -vol filtered_reference.vol -iter 1 -o GenerateSeed_2/seeds_split_2
    xmipp_ml_refine3d -i experimental_subset_3.sel -vol filtered_reference.vol -iter 1 -o GenerateSeed_3/seeds_split_3

  5. Perform the actual ML3D-classification:
    mkdir RunML3D
    xmipp_selfile_create "GenerateSeed_?/seeds_split_?_it000001.vol" > ml3d_seeds.sel
    xmipp_ml_refine3d -i experimental_all.sel -vol ml3d_seeds.sel -iter 25 -o RunML3D/ml3d

Note that many of the output files of the ml_refine3d program are the same ones as described for the MLalign2D program (selfiles with classified particles, logfiles, images for the refined reference library, etc.) and for the ART program (hist-files, volumes in voxels and basis functions).

F.A.Q.

1. How should I prepare my experimental images?

  • As with all Xmipp programs, your images should be in individual Spider file format, and you should have a Spider-like selection file with the names of all images

  • We recommend using CTF-phase corrected images (e.g. by phase flipping). You may do this at the micrograph level, or at the individual image level. For this, you may use any program you like, or the Xmipp program correctphase

  • You may want to down-scale your originally picked images, because ML3D-classification takes a lot of memory and CPU. We normally work with images of 64x64 pixels, although for coarse angular samplings and few reference volumes you may use, for example, up to 90x90 (all depending on your computer, of course). To down-scale your images you may use the following command (implementing a nice B-spline interpolation scheme; also see the scale program manual):

$ xmipp_scale -i experimental_big.sel -oext small -xdim 64

  • Then, you will need to normalize your images. By default, the initial estimate for the standard deviation in the experimental noise of the likelihood optimization algorithm is set to one. Therefore, you should normalize your images in such a way that this is (roughly) true. We advise to use the following command (also see the normalize program manual):

$ xmipp_normalize -i experimental_all.sel -method Ramp -background_circle 32

  • Note that, in order not to mess up the statistics of your noise, we recommend that you do NOT use any type of masking on your experimental images! For the same reason we also recommend that you do NOT filter your images (neither high- nor low-pass)

2. The program gives very strange volumes after the first iteration...

Check whether you prepared your images as described above: are they correctly normalized? If so, then most probably the pixel values of your initial starting volume are not on the correct scale. The probability-weight calculations are based on __squared differences__ between experimental images and projections from the reference volumes. This means that the absolute grey-scale values of the reference volumes are very important! Therefore, you have to make sure that the grey-scale of the initial low-pass filtered reference volume is such that projections of this volume are on the same scale as the experimental projections. (If you use any reconstruction program in Xmipp to create this reconstruction from your (normalized) images, this will always be the case. Different packages (e.g. Spider) may handle the absolute grey-scale of your volumes differently!) An easy way to correct your volume is given in the first two steps of the example above.

3. Can I restart a run that has stopped before reaching its end?

Yes, you can. Just type the following (or its MPI version):


$ xmipp_ml_refine3d -restart yourrun_it000010.log


where yourrun_it000010.log is the log-file of the last, completely finished iteration (incl. the reconstruction step).

4. How can I see which image belongs to which class?

Strictly speaking, the question is not theoretically correct. The maximum-likelihood estimate is a weighted average over all possible assignments, so in theory each image contributes to every class. However, the probability distribution functions typically converge to delta functions, so in practice such a "hard" classification is allowed. The program writes out the corresponding selection files: yourrun_it0000??_class00000?.sel
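If you want a quick overview of this "hard" classification, you can simply count the lines in each per-class selection file. A minimal shell sketch (the helper name is hypothetical, not part of Xmipp; the filename pattern is the one described above):

```shell
# Count the images assigned to each class selfile (one image entry per line).
# count_class_images is an illustrative helper, not an Xmipp program.
count_class_images() {
  for f in "$@"; do
    [ -f "$f" ] && echo "$f: $(wc -l < "$f") images"
  done
}

# e.g.  count_class_images yourrun_it0000??_class00000?.sel
```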

5. What can I analyze to see whether my calculations have gone OK?

  • The most important indicator will be the quality of the reference volumes (yourrun_it0000??_vol00000?.vol)

  • The logfiles (yourrun_it0000??.log) will look as follows:

;MLalign2D-logfile: Number of images= 20000 LL= -1.1774e+08 <Pmax/sumP>= 0.9872
; -noise 1.014957 -offset 2.751146 -istart 12
; -i experimental_all.sel -vol references.sel -o yourrun -iter 25
; columns: model fraction (1); mirror fraction (2); 1000x signal change (3)
; yourrun_it000011_ref000001.xmp 1 3 0.00010 0.00000 -1000.00000
; yourrun_it000011_ref000002.xmp 2 3 0.00025 0.00000 -1000.00000

Where:

    • LL is the log-likelihood target value, which should increase at each iteration
    • <Pmax/sumP> is an indicator of the "spikiness" of the probability distribution functions; values near zero mean very broad distributions, delta functions have a value of 1
    • -noise is the estimated value for the noise in the images. This should be close to 1
    • -offset is the estimated standard deviation in the origin offsets of your images (in pixels); this value reflects how well your particles were centered. Very large values may indicate problems
    • The listing then contains for each of the projections in the reference library (yourrun_it000011_ref000???.xmp):
      • the fraction of assigned particles (absolute number is "total-number-of-particles" times this fraction)
      • the percentage of mirrored images
      • not-used
  • The document files (yourrun_it0000??.doc) contain, for each of the experimental images, the "optimal" values of the 1st, 2nd and 3rd Euler angles (rot, tilt & psi), the origin offsets (Xoff & Yoff), the number of the optimal projection in the reference library (Ref), a flag indicating whether the optimal orientation of this image is mirrored (Flip), and the Pmax/sumP value, as explained above. In this context, the "optimal" values refer to the maximum of the probability distribution (see also FAQ 4). They may serve several purposes:
    • Make a postscript figure of the average (over all references) angular distribution:

$ xmipp_angular_distribution_show -ang yourrun_it000010.doc -ps distribution.ps

    • Use these values as a start for other angular refinement programs (that may allow using larger images and finer angular samplings, to reach higher resolutions). Note that if you want to do this in Xmipp, you should first assign these angles to the headers of the images (as the MLrefine3D program does NOT do so!). You may do that as follows:

$ xmipp_headerinfo -assign -i yourrun_it000010.doc -mirror

6. How can I see whether my run has converged?

You may have a look at the following numbers:

  • The algorithm aims to optimize the logarithm of the likelihood. This value is reported in the logfiles (as LL, see also FAQ 5). You may make a plot of these values to see whether the LL is still rising:

$ grep LL yourrun_it0000??.log | awk '{i++; print i,$9}'

  • You may analyze the docfile after every iteration to see how many particles still change their optimal angles, offsets or class. Upon convergence, relatively few particles should still be moving around. To analyze these numbers, you may run the MovingParticles csh-script (in the directory where you also ran the ML3D-calculation).
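As an alternative to counting awk fields by hand, the LL and <Pmax/sumP> values can be pulled out of the logfiles by keyword. A minimal sketch (the helper name is hypothetical, and it assumes the logfile line format shown in FAQ 5):

```shell
# Scan MLalign2D-logfile lines and print the values following the "LL=" and
# "<Pmax/sumP>=" keywords. extract_ll_pmax is an illustrative helper.
extract_ll_pmax() {
  awk '/MLalign2D-logfile/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "LL=")          ll = $(i + 1)
      if ($i == "<Pmax/sumP>=") pm = $(i + 1)
    }
    print ll, pm
  }' "$@"
}

# e.g.  extract_ll_pmax yourrun_it0000??.log
echo ';MLalign2D-logfile: Number of images= 20000 LL= -1.1774e+08 <Pmax/sumP>= 0.9872' \
  | extract_ll_pmax   # prints: -1.1774e+08 0.9872
```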

7. How can I speed up my calculations?

Firstly, the required computing time depends strongly on the angular sampling, so in general don't use samplings finer than 10 degrees. Furthermore, it depends quite strongly on the size of the images. Often you don't need large images to classify: we classify ribosomes at 64x64 pixels and it works well, so in general don't use much bigger images than that.

Secondly, in theory one has to perform exhaustive integrations over all translations and all rotations of all models. This may take a LONG time. We implemented a short-cut, as published in Scheres et al. Bioinformatics (2005). During every iteration, one stores, for each experimental image, the most probable translation for each projection direction and for each model. In the subsequent iteration, one first calculates the probability for that translation; if it is very low, one may assume that all the other translations for that rotation and reference are also very low, and that part of the search space is skipped. This has been shown to run approximately 10x faster, with hardly any effect on convergence behavior. During the first iteration, by default, the program performs an exhaustive search over all translations (as there are no optimal translations from the previous iteration yet!). Therefore, by default, the first iteration takes much longer than the rest. Alternatively, one may use the option -zero_offsets to set the optimal translations to (0,0) for each image and all projection directions and references. If your images are not too badly centered you will probably get away with this, and the first iteration will also be fast.

The stored information (X and Y translations for 211 projection directions, for K reference maps, for each experimental image) adds up quickly. One may keep it in memory (if available), which is faster (less disc access), or write it to disc. The advantage of writing to disc is that one may restart a run and still have exactly the same information; everything held in memory is lost on a restart (this also explains a large increase in CPU time after restarting). The disadvantage of writing to disc is massive disc access, which may significantly slow down the calculation. As of Xmipp-2.0, the default is to store the translations in memory. If you want to write them to disc, use the option -write_offsets.
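To get a feel for how much data that is, a back-of-the-envelope sketch (the 4 bytes per stored value, K=4 and N=10000 are illustrative assumptions, not program defaults):

```shell
# 2 coordinates (X, Y) x 211 projection directions x K reference maps x N
# experimental images, at an assumed 4 bytes per stored value.
K=4 N=10000
bytes=$(( 2 * 211 * K * N * 4 ))
echo "stored offsets: $(( bytes / 1024 / 1024 )) MB"   # prints: stored offsets: 64 MB
```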

8. The program did not classify my data, what can I do?

  • Clean up your data. The basic assumption of the theory behind ML3D-classification is that every image is a projection of one of K 3D objects (where you choose K). If this is not the case, the program is likely to yield suboptimal classifications. A common example is a data set containing many bad/junk images. As junk particles are usually unique, they will probably NOT group together to form a "junk volume", and therefore their classification will be difficult.

  • Try a different initial reference volume. In our experience, the optimization algorithm does suffer from model bias, despite its maximum-likelihood formalism. This means that it may be hard to remove incorrect features from your initial reference volume, which should "somehow" represent the different classes present in the data, albeit at very low resolution. The following rule-of-thumb is not based on much experimental evidence and may seem useless in practice, but still: if the data were already separated into homogeneous subsets, and standard (SPIDER-like?) angular refinements could converge from the initial reference volume to each of the different conformations present, then ML3D-classification may also yield good results... In practice, FILTER THE INITIAL VOLUME AS MUCH AS YOU CAN!!

--Main.AlfredoSolano - 25 Jan 2007
