Fast-Higashi Usage
Fast-Higashi shares the same set of input files as Higashi, and thus uses the same code to parse them.
Run the following commands to process the input data (only needs to be run once).
from higashi.Higashi_wrapper import Higashi
higashi_model = Higashi(config_path)
higashi_model.process_data()
This function will carry out the following tasks:
- generate a dictionary that maps genomic bin loci to node ids
- extract data from `data.txt` and turn it into the hyperedge (triplet) format
- create contact maps from the sparse scHi-C data for visualization and the baseline model, and generate node attributes
The above function is equivalent to running
higashi_model.generate_chrom_start_end()
higashi_model.extract_table()
higashi_model.create_matrix()
Before each step is executed, a message is printed indicating the progress, which helps with debugging.
See the next subsection (the `model.fast_process_data()` part).
from fasthigashi.FastHigashi_Wrapper import *
# Initialize the model
model = FastHigashi(config_path=config_path,
path2input_cache="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
path2result_dir="/work/magroup/ruochiz/fast_higashi_git/pfc_500k",
off_diag=100,
filter=False,
do_conv=False,
do_rwr=False,
do_col=False,
no_col=False)
# Processing from contacts to sparse mtx
model.fast_process_data()
# Pack from sparse mtx to tensors
model.prep_dataset(batch_norm=True)
**required arguments:**
- `config_path`: The path to the configuration JSON file that you created.
- `path2input_cache`: The path to the directory where the cached tensor files will be stored.
- `path2result_dir`: The path to the directory where the results will be stored.
- `off_diag`: Maximum number of diagonals to consider. When set to 100, the 0th-100th diagonals will be considered.
- `filter`: Whether to only use cells that pass the quality-control standard to learn the meta-interactions, and then infer the embeddings for the rest of the cells.
- `do_conv`: Whether to use linear convolution or not.
- `do_rwr`: Whether to use partial random walk with restart or not.
- `do_col`: Whether to use sqrt_vc normalization or not; the program will automatically use it when needed.
- `no_col`: Whether to force the program not to use sqrt_vc normalization (otherwise the program will automatically use it when needed).
- `batch_norm`: Whether to use batch-corrected normalization or not.
How to select these important parameters (an example sketch follows the list below):
- `filter`: recommended to be `True`, unless you find that few or no cells pass QC (you will see a printed line summarizing the numbers of good cells, bad cells, and total cells).
- `do_conv`:
  - Does your data have similar coverage (contact pairs per cell) to Ramani et al. and Kim et al.? If so, `True`.
  - Are you running Fast-Higashi at a resolution higher (bin size smaller) than 500kb? If so, `True`.
  - Otherwise, `False`.
- `do_rwr`: recommended to be `True`, but if you have high-quality data (similar or higher contact pairs per cell than the sn-m3C-seq human PFC data), test out `False` first, as it runs faster and sometimes gives better results at resolutions of 500kb or lower (bin size >= 500kb).
- `do_col`: recommended to be `False`; this option exists for debugging purposes.
- `no_col`: recommended to be `False`; this option exists for debugging purposes.
- `batch_norm`: if you have `batch_id` provided in the configuration JSON file, set it to `True`.
model.run_model(dim1=.6,
rank=256,
n_iter_parafac=1,
extra="")
**required arguments:**
- `dim1`: The scaling factor used to calculate the chromosome-specific embedding size. For a chromosome with n bins, studied at a resolution of x bp, its chromosome-specific embedding size will be `int(dim1 * n * x / 1000000)`.
- `rank`: Size of the meta-embeddings (embeddings shared across all chromosomes).
- `n_iter_parafac`: Number of PARAFAC iterations within each Fast-Higashi iteration.
- `extra`: Extra annotation string appended when saving the results.
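As a quick worked example of the `dim1` formula above (the bin count is a made-up round number, not taken from any specific genome build):

```python
# Chromosome-specific embedding size implied by dim1 (illustrative values only).
dim1 = 0.6
n_bins = 500          # hypothetical: a chromosome with 500 bins
resolution = 500000   # studied at 500kb resolution, i.e. x = 500,000 bp
embed_size = int(dim1 * n_bins * resolution / 1000000)
print(embed_size)     # int(0.6 * 500 * 0.5) = 150
```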
embedding = model.fetch_cell_embedding(final_dim=256,
restore_order=False)
Fetch the embeddings. The `final_dim` argument should be equal to or smaller than the `rank` parameter used when running Fast-Higashi.
The `restore_order` argument controls whether the cell embeddings are generated in the same order as the input scHi-C data. By default, Fast-Higashi sorts cells such that cells with good quality are placed before those that do not pass the quality check; this order is usually not consistent with the input cell order. Set this to `True` to generate cell embeddings in the same order as the input.
model.correct_batch_linear(embedding=embedding,
var_to_regress='exp')
Linear correction of batch effects or read-coverage bias. `var_to_regress` should be a key name in the `label_info.pickle` file you provided, corresponding to the variable / condition of each cell that will be regressed out.
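If you are unsure which key names are available for `var_to_regress`, a quick check (a sketch; the path below is a placeholder) is to load `label_info.pickle` and list its keys:

```python
import pickle

# label_info.pickle is the per-cell label dictionary provided as part of the input.
with open("/path/to/label_info.pickle", "rb") as f:  # placeholder path
    label_info = pickle.load(f)

# Any of these keys (e.g. 'exp') can be passed as var_to_regress.
print(list(label_info.keys()))
```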
The embeddings are stored as a dictionary, which can be retrieved either via `embedding = model.embedding_storage` or as the return value of `model.fetch_cell_embedding(...)`.
By default it has the following keys: `'embed_raw'`, `'embed_l2_norm'`, `'embed_correct_coverage_fh'`, `'embed_l2_norm_correct_coverage_fh'`. These correspond to the raw embeddings, the embeddings after L2 normalization, the embeddings after linear correction of coverage, and the embeddings after linear correction of coverage plus L2 normalization.
If you additionally run the optional linear correction of a variable `'abc'`, there will be two extra keys: `'embed_correct_abc'` and `'embed_l2_norm_correct_abc'`.
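A minimal sketch for checking which embeddings are available and their dimensions (assuming `embedding` was fetched as above):

```python
# Each value should be an array of shape (n_cells, final_dim).
for key, value in embedding.items():
    print(key, value.shape)
```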
So, with this many embeddings, which one should you use?
It is suggested to try `'embed_l2_norm'` or `'embed_l2_norm_correct_coverage_fh'`.
Although technically `'embed_l2_norm_correct_coverage_fh'` offers embeddings that are unbiased with respect to the sequencing depths of cells, for some datasets the cell-type information is highly correlated with sequencing depth, and a simple regression would lead to much worse embedding results. This is caused by the non-orthogonality of the biological and technical variance, and usually happens when two scHi-C libraries contain different (or even non-overlapping) cell types but also have different sequencing depths. Thus, it is suggested to test out both embeddings and see which gives more reasonable results.
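One simple way to compare the two candidates is to project each with UMAP and see which separates your expected cell populations more cleanly. This sketch assumes the `umap-learn` and `matplotlib` packages, which are not part of Fast-Higashi:

```python
import umap                      # from the umap-learn package (assumed installed)
import matplotlib.pyplot as plt

for key in ["embed_l2_norm", "embed_l2_norm_correct_coverage_fh"]:
    proj = umap.UMAP(random_state=0).fit_transform(embedding[key])
    plt.figure()
    plt.scatter(proj[:, 0], proj[:, 1], s=2)
    plt.title(key)
    plt.savefig(f"umap_{key}.png", dpi=300)
```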