Thursday, November 11, 2021

HOW TO Part 2:: How to test models using qpAdm - qpAdm 101

qpAdm is a statistical tool for studying the ancestry of populations with histories that involve admixture between two or more source populations. I hope this tutorial helps people interested in doing own research as qpAdm validated models which are constructed well provide a high standard of evidence to the model.

qpAdm is used to validate models that are fed into it by the user. The details that need to passed to the qpAdm program are as follows.

  1. Target population
  2. List of 2 or more source populations 
  3. List of Right populations or Right Pops.
  4. The populations in 1 & 2 are together called Left Populations or Left Pops and the first population in this list is considered as target population by qpAdm.
  5. The first population among the right pops has to be a basal population (Outgroup) and usually an african population like Mbuti, ShumLaka or Mota etc is chosen for this purpose.

A standard example of a qpAdm model is 


 Target population (Target) = source population 1 (Source 1) + source population 2 (Source 2) 

The qpAdm output will contain a p-value (also called tail probability or tailprob), admixture coefficients x & y for Source1 and Source2 respectively such that x+y = 1 (or 100%) and standard errors for those coefficients. 

 A successful model will have 

  1. A high p-value, and all models above a given threshold are to be accepted as valid. The common threshold used in published pop genomics papers is 0.05. 
  2. Low standard errors in the admixture coefficients.
  3. Positive admixture co-efficients.

Steps to run qpfstats

  1. Install Admixtools on the linux system. The step wise how to for that is provided in this tutorial. also download the eigentrat files from the Harvard Lab database as described in the linked tutorial.
  2. Create Fstats output file for use as input into qpAdm. For that
    1. Create a new folder in your linux system and create 2 new text files. Name them
      1. parqpfstat.txt (parameter file)
      2. lista.txt (Write the label names of each population that you will use in the qpAdm for multiple models, one per line. Max 20-23 populations)
    2. run the qpfstats program in terminal while being inside this newly created folder.
qpfstats -p parqpfstat.txt >qpfstatlog.txt

Let the qpfstats program finish running. This step takes the most time so be patient.
This is how the output file looks. Refer the video  below to gain more clarity


Steps to run qpAdm

  1. Create 3 new text files in same folder. Click on the files below to view example. Name them
    1. parqpadm.txt (parameter file)
    2. left.txt (list of left populations, 1 population per line, name has to match label in eigenstrat .ind file)
    3. right.txt (list of right populations, 1 population per line, name has to match label in eigenstrat .ind file)
  2. Run the qpAdm program from terminal while being inside the folder which contains these files.
qpAdm -p parqpadm > outputfile.txt

Refer the below video for clarity




Output file will contain all the necessary details for model evaluation.

How to read the qpAdm output file

In the above video, I tried to model Irula as GanjDareh_N + Onge by specifying a few random right populations.

The output file is here. Please open it as I will be referring to it.

fixed pat  wt  dof     chisq       tail prob
       00  0    4    19.593   0.000600701   0.423  0.577 
       01  1    5   293.832             0   1.000  0.000 
       10  1    5   142.822   4.49515e-29   0.000  1.000 

The first line is our model GanjDareh + Onge and its a fail as p-value (tail prob) < 0.05.
The 2nd and 3rd lines are the submodels with 1 source only, Ganj Dareh only and Onge only respectively, both of which fail.

To know why the model fail, we refer to the generated Dstats, also known as gendstats in the output file. These are fstats which compare the simulated model we proposed to the actual target sample (Irula in our case), and big Z scores (above 3 or below -3) can tell us why our model failed.

gendstat:             Mbuti.DG Russia_MLBA_Sintashta -2.496
gendstat:             Mbuti.DG    Russia_HG_Karelia -3.280
gendstat:             Mbuti.DG     Russia_HG_Tyumen -2.328
gendstat:             Mbuti.DG   Turkey_N_published -0.628
gendstat:             Mbuti.DG          Tarim_EMBA1 -3.415
gendstat: Russia_MLBA_Sintashta    Russia_HG_Karelia -1.909
gendstat: Russia_MLBA_Sintashta     Russia_HG_Tyumen -0.876
gendstat: Russia_MLBA_Sintashta   Turkey_N_published 2.493
gendstat: Russia_MLBA_Sintashta          Tarim_EMBA1 -2.024
gendstat:    Russia_HG_Karelia     Russia_HG_Tyumen 0.709
gendstat:    Russia_HG_Karelia   Turkey_N_published 2.953
gendstat:    Russia_HG_Karelia          Tarim_EMBA1  0.100
gendstat:     Russia_HG_Tyumen   Turkey_N_published 1.973
gendstat:     Russia_HG_Tyumen          Tarim_EMBA1 -0.795
gendstat:   Turkey_N_published          Tarim_EMBA1 -3.118

For example, in this case the highest Z score gendstat tells us that there is too less Tarim_EMBA in the model as compared to Mbuti.

In this manner, you can tweak the model and add more or better sources so that the tail prob eventually becomes > 0.05.

The admixture weights are here (the 2 numbers on the right)

 00  0     4    19.593     0.000600701     0.423     0.577 


The standard errors to the admixture weights are here.

std. errors:     0.032     0.032 

PRO TIPS

1. Make a backup copy of the .ind file first. Then you can rename the labels in the 3rd column of each sample in the .ind file so it gives you flexibility of choosing a subset of samples from a label. DO NOT ALTER THE ORDER OF THE SAMPLES IN THE .ind FILE.

2. Put as many samples in the lista file for qpfstats as you think you will need for your project. Max limit in my opinion is 20-25. Once fstatsa.txt is generated, qpAdm run is lightning fast as it does not have to calculate fstats again.

3. always use allsnps: YES in qpfstats parameter file. No need to mention it again in qpAdm parameter file.

4. In the reich database annotation file, check the captured number of snps of the sample you are using. Please avoid any sample with <50k SNPs captured. ie low coverage samples.

5. Bad right population selection will make bad models pass. Choose a comprehensive list of right populations. Max 13-15.

6. Same population cant be there in both left and right.

See how i resolved the Irula model by adding Tarim_EMBA to left as source and therefore removing it from right.


Common Errors and Solutions:

1. fatalx: unable to allocate 324xxxxxxx units for item - This is a RAM issue, the geno file is larger than the RAM size. Increase the Virtualbox RAM for the guest to above 4gb.

2. zero popsize, Aborted (core dumped) - Check for spellings of your labels in the lista file, or left and right files. Correct spelling and ensure that the population exists in the .ind list or the qpfstat output.


Ask queries in comments section. Thanks.

Edit:
Suggesting Reading to know the limitations of qpAdm and best practices in qpAdm modeling.


Éadaoin Harney, Nick Patterson, David Reich, John Wakeley, Assessing the performance of qpAdm: a statistical tool for studying population admixture, Genetics, Volume 217, Issue 4, April 2021, iyaa045, https://doi.org/10.1093/genetics/iyaa045

Also refer

37 comments:

vAsiSTha said...

The irula model I showed is just for example purposes.the right pops can be expanded to make a better model.

Gaska said...

congratulations for the blog, it is always good to be able to consult different opinions.

vAsiSTha said...

Thanks Gaska

Gaska said...

you're welcome, I have seen how Wang 2019 successfully modeled (using qpAdm) eneolithic steppe as a mixture of Yamnaya (Caucasus, Samara, Kalmykia, Ukraine) and GAC (approx 17%). Since GAC has a high WHG component, that model is also valid using Iberia chalcolithic (approx 15%). Actually none of the Yamnaya culture autosomal components (EHG, WHG, CHG/Iran_neo, EEF) are foreign to Western Europe and I guess the big difference will simply be the percentages of each component. For years we have been seeing how Harvardians and dozens of amateur geneticists try to Yamnayize Europe to make their steppe theory more credible and although Yamnaya-related migrations reached central Europe (r1b-Z2103, I2a-L699, R1a-M417, Q, even R1b-V1636) the situation in western Europe and the Italian and Balkan peninsulas is much more complicated than they want us to believe. The vast majority of Iberian BBs (R1b-P312) theoretically have such small percentages of Yamnaya component that this similarity could be due to the fact that they share that WHG-rich EEF component. What do you think about this as an expert handling qpAdm?

On the other hand, the difference between CHG and Iran_Neo remains a mystery to many people although I suppose you will be able to model Yamnaya (or Khvalynsk, Sredni Stog, eneolithic steppe, afanasievo) convincingly using either CHg or Iran_Neo interchangeably and I suppose those models will be equally valid.

But, can you successfully model the Basques (and the rest of Spaniards) using Iran_neo or CHG? Can you detect that component in ancient Iberian samples and in contemporary Spaniards without using Yamnaya?

I don't know in detail the situation in South Asia, because Iberians and Indians are at the opposite geographic extremes of the supposed Yamnaya culture expansion, but I can understand the debate you have in eurogenes regarding the CHG-Iran-neo component in Yamnaya. I am not a fan of Harvard because many of their conclusions seem to me biased, risky. D. Anthony and colleagues strike me as charlatans with little scientific rigor and I think their interpretation of the genetic results is a mockery. Only a stupid supremacist attitude can lead these guys in eurogenes, anthrogenica etc to continue defending the purity of Europeans and continue denying the existence of components coming from South Asia, Levant or even Africa in the Yamnaya culture (I am thinking of Lazaridis, Dzudzuana, natufians, mbuti............. etc).

They have always pretended that we Basques are ultranationalist westerners to hide the fact that the only ultras are them.They do not understand that we will never accept that a bunch of Americans who have no fucking idea of the prehistory of Europe tell us what we are or what we are not (unless they present us with irrefutable scientific proof of their theories). I say this in relation to the non-Indo-European western languages (Basque, Iberian, Tartessian, Aquitanian, Etruscan...) and to the origin of the R1b-L51>P312 lineage. The quarrel between Europeans is not the business of the Asians but maybe your opinion is interesting in this respect

vAsiSTha said...

"I have seen how Wang 2019 successfully modeled (using qpAdm) eneolithic steppe as a mixture of Yamnaya (Caucasus, Samara, Kalmykia, Ukraine) and GAC (approx 17%)"

Maybe you meant Yamnaya as 83% steppeEn and 17% GAC. That would sound about right to me. Because yamnaya is much later than steppe eneolithic.

"What do you think about this as an expert handling qpAdm?"
I have yet to run western european or iberian peninsula, so cant comment much.

"On the other hand, the difference between CHG and Iran_Neo remains a mystery to many people although I suppose you will be able to model Yamnaya (or Khvalynsk, Sredni Stog, eneolithic steppe, afanasievo) convincingly using either CHg or Iran_Neo interchangeably and I suppose those models will be equally valid."

So far, yes. But im working on a qpgraph model which will be able to tell the difference between the 2. As more paleolithic samples get published we will be able to tell CHG vs IranN or IndiaN very clearly.

"But, can you successfully model the Basques (and the rest of Spaniards) using Iran_neo or CHG? "
As a matter of principle i dont model moderns using ancients, its flawed.'

Regarding the politics of IE, some of it is conscious bias and some unconscious, but its there and has been there for indian subcontinent since british colonial times of IE studies.

3rdacc said...

@Gaska

It is nice to see peopel resistant to mainstream academia. May I ask what exactly is your theory regarding IE dispersals?

Gaska said...

Yeah steppe eneolithic (83% or 85%) + GAC (17%) or Iberia (15%), the percentages of WHG in these cultures are very similar so they can be used interchangeably and both models are acceptable.

I hope that one day you will have time to run Iberia or France in an unbiased way, when you do it we will comment it.

Gaska said...

@3rdacc

The only valid solution to determine what language a prehistoric culture spoke is to demonstrate that there is genetic continuity between that culture and later or current cultures. That genetic continuity in principle has to be linked to male unipersonal markers because we know that prehistoric societies were fundamentally patrilineal and habitually practiced exogamy. I cannot speak about the situation in Central and South Asia because I do not know well the cultures of Iran and India so I could only say nonsense about it.

But if I know well the situation in Western Europe and the linkage of R1b-P312 with the spread of IE seems to me (at the moment) nonsense. As papers are published, the steppe theory seems more difficult to defend. In Iberia, southern France and Italy the main historical Iron Age cultures and peoples were overwhelmingly R1b-P312 and yet all of them spoke Non-Indo-European languages (I have already mentioned Basques, Iberians, Etruscans, Aquitans and Tartessians). In Greece and Anatolia they have not even found R1b-M269 (Z2103 or L51) linked to Mycenaean, Minoan or Hittite cultures. The situation has long been comical and the Kurganists keep inventing absurd arguments to defend the indefensible (scarcity of ancient DNA, errors in the cultures that have been analyzed, matrilocalism, adopted children)- From my point of view the major expansion of Indo-European languages in Europe is a matter of the Iron Age (Celts and later Romans) not the Chalcolithic or the Bronze Age, and the obvious association of R1b-P312 with the BBC means that this culture never spoke IE but a kind of Ibero-Basque (or maybe the European Neolithic farmers spoke other languages such as Sardinian or Etruscan). In any case, I have never understood the supremacist ambition to claim that the horsemen of the steppes conquered mainland europe on the back of their horses thanks to their physical, social, economic and technological superiority. For me these are just fairy tales that have absolutely nothing to do with reality. Genetic, archaeological, anthropological and linguistic evidence points in another direction.

On the one hand, we should thank Harvard and European geneticists for the effort they are making to reconstruct humanity's past. They are intelligent scientists and they use interesting tools to analyze genomes but this does not mean that we have to agree with their conclusions because many of them are totally absurd. I have already said that Anthony seems to me to be a charlatan and that his arguments regarding the origin of IE are biased, weak, and inadequate.

Reich and the Harvardians are of the opinion that IE originated in Iran? Ok, what's the problem? I have no idea what the solution is but I think that regarding the spread of languages in antiquity there are factors that we cannot yet control or understand (autosomal composition, role of women, trade, migrations, exogamy, religious cults, elites, population shortage). Perhaps Yamnaya spoke an IE language and perhaps the unipersonal markers linked to that culture spread iE in central Europe and the Balkans, but at the moment they cannot say that this was done by R1b-L51>P312. It may be that Late Bronze Age steppe cultures carried IE into Central and South Asia and perhaps R1a is linked to this issue but much more genetic studies in Iran, Afghanistan, and India are needed to reach a satisfactory explanation. For the moment we can only speculate and elaborate theories to be discussed in a scientific and unbiased way.



vAsiSTha said...

Let me just say that the whole idea of linking y haplogroups to languages is ludicrous in a multi y haplo society.

This is again bias developed by kurganists due to sintashta aNd steppe Mlba being 100% z93. No other non isolated society (modern or ancient) has this sort of concentrated y haplo frequencies.

dosas said...

I am stuck at not being able to find the packedancestrymapgeno file anywhere (10Oct21.packedancestrymapgeno 2.8gb file in your video).

vAsiSTha said...

You won't have that file. That's my custom file. Use the downloaded file from Harvard Allen database. Extensions will be .Geno, .SNP and .IND

vAsiSTha said...

i believe downloaded file names are v50.0_1240k_public.geno, v50.0_1240k_public.snp, v50.0_1240k_public.ind.

Use these names in the parameter files, or rename them and use the new name.

dosas said...
This comment has been removed by the author.
dosas said...

I get this error when I run the qpfstats -p parqpfstat.txt command:

fatalx:
(stats) zero popsize
Aborted (core dumped)

Any ideas where I might have gone wrong?

vAsiSTha said...

Yes. The program cannot find your label in the .ind file.

Crosscheck that all the labels you use in Lista file are present in the 3rd column of the .ind file.

ind file 1st column = sample id, 2nd =M/F, 3rd is group label. Feed this group label into Lista file, with exact spelling.

dosas said...

Thank you, I managed to get it work, your help was invaluable.

I, now, am trying to figure out how to build the right pops.

Also, do you think you can do one more tutorial of how we can add our own raw DNA files to the dataset?

Thanks, again.

vAsiSTha said...

If you don't mind, could you email your data to me? vikramraj12344321@gmail.com.

I haven't needed to do it yet but I will figure out how to make it work using your data. i haven't got my dna tested yet.

vAsiSTha said...

"I, now, am trying to figure out how to build the right pops."

For right pops, there is just one rule

None of the right pops should have provided ancestry to any of the sources AFTER the source splits from the true source of the target.

Eg. If true sources are x and y for target T but we don't have the actual sample for x but rather a close cousin X, you should avoid using any right pop which provides ancestry to X AFTER the split between x and X.

You of course won't know what x is apriori, that is why you are testing models, but just use judgement

For Eurasia bronze age I use right pops which cover all bases, eg Mbuti, romania_oase/zlatykun, Morocco iberomaurisian, ehg, whg, Anatolia, ppn, ganj Dareh, chg, Wshg/botai/tarim, onge, mongoliaNorth.

vAsiSTha said...

As per Harney et al 2021, any of the sources giving ancestry to any of the reference populations in right is allowed.

H said...
This comment has been removed by the author.
vAsiSTha said...

go to this folder

cd /home/Downloads/bruh/qpadmexample/

and run the command. make sure even parameter file is in this folder. ensure qdata1.snp exists

H said...

Sorry for the silly question:
Do I make those files manually (with the names) or are they supposed to be there? if its the latter then they weren't there before I made two empty files and gave them these names.

Again, bear with me please, lol

vAsiSTha said...

no need to apologize.

you need to download this file and extract the zip.

you will get these files after extraction.

v50.0_1240K_public.geno
v50.0_1240K_public.snp
v50.0_1240K_public.ind

use these files as names in the parameter text file and specify the location too. eg.

genotypename: /home/ash/Downloads/harvarddata/v50.0_1240K_public.geno

vAsiSTha said...

In my blogpost and screenvideos, im using a merged dataset hence the name is different (10Oct21.snp etc).

You need to use the name of the downloaded files. worry about custom files and merging etc later.

H said...

I see.

So, now I have a question

In the parameter folder, should I write it like this? :

/home/ash/Downloads/yes/v50.0_1240K_public.geno
/home/ash/Downloads/yes/v50.0_1240K_public.snp
/home/ash/Downloads/yes/v50.0_1240K_public.ind
v50.0_1240K_public.geno
v50.0_1240K_public.snp
v50.0_1240K_public.ind

(yes being the name of the file I keep the extracted files in)

also, should I add anything else to the qpadmexample file (like the files extracted etc)?

vAsiSTha said...

I have given link to my files in the post. Just adapt it to your own directory name and file name

H said...

Thanks a ton. I appreciate your help and patience

H said...

hmm...I tried using your files (of course, after adapting them to my own directory name and file name) but the error still doesn't go..it says that it cannot open the snp file (same issue I've been going through since the comment I deleted for some reason)

vAsiSTha said...

mail me a screen record video

vAsiSTha said...

@H

everything working now?

H said...

Yessir

If I want to run a specific sample, do I do this? (example: Iran_HajjiFiruz_C_Iran_I4349)

vAsiSTha said...

Change the 3rd column label name for the specific sample and use that new label

dosas said...

Can people post some right pop lists for ancients and moderns, for West Eurasians, so I can get some ideas, I think mine is blotted (below).

Mbuti.DG
Russia_Ust_Ishim.DG
China_Tianyuan
Goyet_Neanderthal.SG
Russia_Sunghir3.SG
Russia_Kostenki14.SG
Spain_ElMiron
Russia_MA1_HG.SG
Georgia_Satsurblia.SG
Papuan.DG
Switzerland_Bichon.SG
Israel_Natufian_published
Morocco_Iberomaurusian
Russia_AfontovaGora3
ONG.SG
Russia_DevilsCave_N.SG


Thanks.

vAsiSTha said...

My fave for bronze and iron ages eurasians, especially west and SC asians is

Mbuti.DG
EHG - karelia
WHG - iron gates
WSHG/Tarim
Barcin_N
Kotias_CHG
Onge
Sarazm
GanjDareh
Irula
Mongolia_N_North

Konowryy said...

I always get the " (openit) can't open file /home/konowryy/v50.0_1240k_public.ind.snp of type r
error info: No such file or directory" error when typing the code in terminal and idk what to do, Your videos are not clear because of all these renamed files. I'm not even sure I will get an answer but i post this comment anyway.

dosas said...

Can one rename samples in the .ind file?

Unknown said...

Why isnt TarimEMBA in the .ind file?
How do I fix that?