lcMLkin is a powerful tool to estimate kinship coefficient on
low-coverage genome data. Current release has been implemented on
Python2. Here, I slightly polished the script and implemented it for
Python3. Now, one can use it with both PL or GL according to your called
vcf. The most important difference between v1 (Lipatov et al. 2015) and v2 (Žegarac et al. 2021) of lcMLkin is the genotyping pipeline. In v1, the authors provided a genotyping script, however this is extremely slow. In v2, one can easily calculate their own genotype likelihoods. Below, a standard pipeline using bcftools
has been provided.
It is a simple python script. You need a vcf file (zipped or not) and plink files to be calculate AFs (or pre-calculated allele frequencies). In this version, lcMLkin does LD-pruning for each pair. That's why binary plink files are needed to run the script. Other flags are optional. See help output below.
See the original papers for a detailed explanation. To accurately calculate kinship coefficients on low-coverage data, background allele frequencies are important. For ancient DNA data or for a couple of individuals, generally it is not possible to calculate highly accurate allele frequency of the population. Providing background allele frequencies is a work-around to solve this problem. You can use either present-day high-coverage populations or pooled ancient populations (see Altınışık et al. 2022 for a possible approach). These populations should not deviate a lot from the population that you analyse (Žegarac et al. 2021). In plink files that was provided to lcmlkinv2.py
, you should have only individuals that you intend to use for allele frequency calculation.
python lcmlkinv2.py -h
usage: lcmlkinv2.py [-h] -v VCF -p PLINK [-f [FST]] [-t [NBTHREADS]] [-o OUT]
lcmlkin v2 adapted for python3
optional arguments:
-h, --help show this help message and exit
-v VCF, --vcf VCF vcf file must have PL or GL fields.
-p PLINK, --plink PLINK
plink file prefix to calculate allele frequencies.
Even if pre-calculated frequencies are the folder
(prefix.frq), one should have plink files for LD-
pruning.
-f [FST], --fst [FST]
a priori defined fst (see Zegarac 2021)
-t [NBTHREADS], --nbthreads [NBTHREADS]
number of threads
-o OUT, --out OUT output file name. if you do not provide a output name,
script will generate a name.
A sample output:
Ind1 Ind2 Z0ag Z1ag Z2ag PI_HATag nbSNP
Inda Indb 1.0 0.0 0.0 0.0 1195
Inda Indc 0.61 0.37 0.03 0.21 236
Inda Indd 1.0 0.0 0.0 0.0 110
Indb Indc 1.0 0.0 0.0 0.0 114
Indb Indd 0.77 0.22 0.01 0.116 471
- First, download 1000G sites from the website. This is needed for correct ref/alt info of positions.
wget -c https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz{,.tbi}
- Prepare inputs and arrange variables
snplist.txt
: tab-separated file with two columns,CHR
andPosition
. Seevcftools
documentation for more info about the format. You can use pre-selected SNPs, such as 1240K, or you can generate your own positions. To have some idea about the SNP selection, see Altınışık et al. 2022 (Material-Methods). The SNP list ascertained to Yoruba population used in Altınışık et al. 2022 is openly available here (the file named as Cayonu.1KGYoruba.snp).bam.list
: path to bam files one in each line.
vcftools=/PATH/TO/vcftools
bcftools=/PATH/TO/bcftools
reffa=/PATH/TO/reference.fa
lcmlkin=/PATH/TO/lcmlkinv2.py
out=outputname
- Subset bi-allelic SNPs which are referred in
snplist.txt
from whole 1000G dataset. Then, index the file.
$vcftools --gzvcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz --positions snplist.txt --remove-indels --min-alleles 2 --max-alleles 2 --recode-INFO-all --recode --stdout | bgzip -c > ${out}.vcf.gz &
$bcftools index -f ${out}.vcf.gz &
- Create a file including correct order of ref/alt alleles.
$bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' ${out}.vcf.gz | bgzip -c > ${out}.tsv.gz &
tabix -s1 -b2 -e2 ${out}.tsv.gz
- Calculate genotype likelihoods of selected SNPs in your own data. Since lcMLkin uses info from low-quality SNPs (see this), it is better to use low quality thresholds, but you can use your own parameters. Be careful not use these less confident genotype likelihoods in other analyses.
$bcftools mpileup -f ${reffa} -B -q20 -Q5 -I -a 'FORMAT/DP' -T {out}.vcf.gz -b bam.list -Ou | \
$bcftools call -Am -C alleles -f GP,GQ -T {out}.tsv.gz -Oz -o mydata.vcf.gz &
- Run lcMLkin v2.1
python lcmlkinv2.py -v mydata.vcf.gz -p PlinkPrefix
If you use this script, please cite original papers.
- The first version was published in Lipatov et al. 2015: https://doi.org/10.1101/023374
- The second version was published in Žegarac et al. 2021: https://doi.org/10.1038/s41598-021-89090-x