qpAdm is used to validate models that are fed into it by the user. The details that need to passed to the qpAdm program are as follows.
- Target population
- List of 2 or more source populations
- List of Right populations or Right Pops.
- The populations in 1 & 2 are together called Left Populations or Left Pops and the first population in this list is considered as target population by qpAdm.
- The first population among the right pops has to be a basal population (Outgroup) and usually an african population like Mbuti, ShumLaka or Mota etc is chosen for this purpose.
A standard example of a qpAdm model is
Target population (Target) = source population 1 (Source 1) + source population 2 (Source 2)
The qpAdm output will contain a p-value (also called tail probability or tailprob), admixture coefficients x & y for Source1 and Source2 respectively such that x+y = 1 (or 100%) and standard errors for those coefficients.
A successful model will have
- A high p-value, and all models above a given threshold are to be accepted as valid. The common threshold used in published pop genomics papers is 0.05.
- Low standard errors in the admixture coefficients.
- Positive admixture co-efficients.
Steps to run qpfstats
- Install Admixtools on the linux system. The step wise how to for that is provided in this tutorial. also download the eigentrat files from the Harvard Lab database as described in the linked tutorial.
- Create Fstats output file for use as input into qpAdm. For that
- Create a new folder in your linux system and create 2 new text files. Name them
- parqpfstat.txt (parameter file)
- lista.txt (Write the label names of each population that you will use in the qpAdm for multiple models, one per line. Max 20-23 populations)
- run the qpfstats program in terminal while being inside this newly created folder.
qpfstats -p parqpfstat.txt >qpfstatlog.txt
Let the qpfstats program finish running. This step takes the most time so be patient.
This is how the output file looks. Refer the video below to gain more clarity
Steps to run qpAdm
- Create 3 new text files in same folder. Click on the files below to view example. Name them
- parqpadm.txt (parameter file)
- left.txt (list of left populations, 1 population per line, name has to match label in eigenstrat .ind file)
- right.txt (list of right populations, 1 population per line, name has to match label in eigenstrat .ind file)
- Run the qpAdm program from terminal while being inside the folder which contains these files.
qpAdm -p parqpadm > outputfile.txt
Refer the below video for clarity
Output file will contain all the necessary details for model evaluation.
How to read the qpAdm output file
In the above video, I tried to model Irula as GanjDareh_N + Onge by specifying a few random right populations.
The output file is here. Please open it as I will be referring to it.
fixed pat wt dof chisq tail prob
00 0 4 19.593 0.000600701 0.423 0.577
01 1 5 293.832 0 1.000 0.000
10 1 5 142.822 4.49515e-29 0.000 1.000
The first line is our model GanjDareh + Onge and its a fail as p-value (tail prob) < 0.05.
The 2nd and 3rd lines are the submodels with 1 source only, Ganj Dareh only and Onge only respectively, both of which fail.
To know why the model fail, we refer to the generated Dstats, also known as gendstats in the output file. These are fstats which compare the simulated model we proposed to the actual target sample (Irula in our case), and big Z scores (above 3 or below -3) can tell us why our model failed.
gendstat: Mbuti.DG Russia_MLBA_Sintashta -2.496
gendstat: Mbuti.DG Russia_HG_Karelia -3.280
gendstat: Mbuti.DG Russia_HG_Tyumen -2.328
gendstat: Mbuti.DG Turkey_N_published -0.628
gendstat: Mbuti.DG Tarim_EMBA1 -3.415
gendstat: Russia_MLBA_Sintashta Russia_HG_Karelia -1.909
gendstat: Russia_MLBA_Sintashta Russia_HG_Tyumen -0.876
gendstat: Russia_MLBA_Sintashta Turkey_N_published 2.493
gendstat: Russia_MLBA_Sintashta Tarim_EMBA1 -2.024
gendstat: Russia_HG_Karelia Russia_HG_Tyumen 0.709
gendstat: Russia_HG_Karelia Turkey_N_published 2.953
gendstat: Russia_HG_Karelia Tarim_EMBA1 0.100
gendstat: Russia_HG_Tyumen Turkey_N_published 1.973
gendstat: Russia_HG_Tyumen Tarim_EMBA1 -0.795
gendstat: Turkey_N_published Tarim_EMBA1 -3.118
For example, in this case the highest Z score gendstat tells us that there is too less Tarim_EMBA in the model as compared to Mbuti.
In this manner, you can tweak the model and add more or better sources so that the tail prob eventually becomes > 0.05.
The admixture weights are here (the 2 numbers on the right)
00 0 4 19.593 0.000600701 0.423 0.577
The standard errors to the admixture weights are here.
std. errors: 0.032 0.032
PRO TIPS
1. Make a backup copy of the .ind file first. Then you can rename the labels in the 3rd column of each sample in the .ind file so it gives you flexibility of choosing a subset of samples from a label. DO NOT ALTER THE ORDER OF THE SAMPLES IN THE .ind FILE.
2. Put as many samples in the lista file for qpfstats as you think you will need for your project. Max limit in my opinion is 20-25. Once fstatsa.txt is generated, qpAdm run is lightning fast as it does not have to calculate fstats again.
3. always use allsnps: YES in qpfstats parameter file. No need to mention it again in qpAdm parameter file.
4. In the reich database annotation file, check the captured number of snps of the sample you are using. Please avoid any sample with <50k SNPs captured. ie low coverage samples.
5. Bad right population selection will make bad models pass. Choose a comprehensive list of right populations. Max 13-15.
6. Same population cant be there in both left and right.
See how i resolved the Irula model by adding Tarim_EMBA to left as source and therefore removing it from right.
Common Errors and Solutions:
1. fatalx: unable to allocate 324xxxxxxx units for item - This is a RAM issue, the geno file is larger than the RAM size. Increase the Virtualbox RAM for the guest to above 4gb.
2. zero popsize, Aborted (core dumped) - Check for spellings of your labels in the lista file, or left and right files. Correct spelling and ensure that the population exists in the .ind list or the qpfstat output.
Ask queries in comments section. Thanks.
Edit:
Suggesting Reading to know the limitations of qpAdm and best practices in qpAdm modeling.
Éadaoin Harney, Nick Patterson, David Reich, John Wakeley, Assessing the performance of qpAdm: a statistical tool for studying population admixture, Genetics, Volume 217, Issue 4, April 2021, iyaa045, https://doi.org/10.1093/genetics/iyaa045
Also refer
The irula model I showed is just for example purposes.the right pops can be expanded to make a better model.
ReplyDeletecongratulations for the blog, it is always good to be able to consult different opinions.
ReplyDeleteThanks Gaska
ReplyDeleteyou're welcome, I have seen how Wang 2019 successfully modeled (using qpAdm) eneolithic steppe as a mixture of Yamnaya (Caucasus, Samara, Kalmykia, Ukraine) and GAC (approx 17%). Since GAC has a high WHG component, that model is also valid using Iberia chalcolithic (approx 15%). Actually none of the Yamnaya culture autosomal components (EHG, WHG, CHG/Iran_neo, EEF) are foreign to Western Europe and I guess the big difference will simply be the percentages of each component. For years we have been seeing how Harvardians and dozens of amateur geneticists try to Yamnayize Europe to make their steppe theory more credible and although Yamnaya-related migrations reached central Europe (r1b-Z2103, I2a-L699, R1a-M417, Q, even R1b-V1636) the situation in western Europe and the Italian and Balkan peninsulas is much more complicated than they want us to believe. The vast majority of Iberian BBs (R1b-P312) theoretically have such small percentages of Yamnaya component that this similarity could be due to the fact that they share that WHG-rich EEF component. What do you think about this as an expert handling qpAdm?
ReplyDeleteOn the other hand, the difference between CHG and Iran_Neo remains a mystery to many people although I suppose you will be able to model Yamnaya (or Khvalynsk, Sredni Stog, eneolithic steppe, afanasievo) convincingly using either CHg or Iran_Neo interchangeably and I suppose those models will be equally valid.
But, can you successfully model the Basques (and the rest of Spaniards) using Iran_neo or CHG? Can you detect that component in ancient Iberian samples and in contemporary Spaniards without using Yamnaya?
I don't know in detail the situation in South Asia, because Iberians and Indians are at the opposite geographic extremes of the supposed Yamnaya culture expansion, but I can understand the debate you have in eurogenes regarding the CHG-Iran-neo component in Yamnaya. I am not a fan of Harvard because many of their conclusions seem to me biased, risky. D. Anthony and colleagues strike me as charlatans with little scientific rigor and I think their interpretation of the genetic results is a mockery. Only a stupid supremacist attitude can lead these guys in eurogenes, anthrogenica etc to continue defending the purity of Europeans and continue denying the existence of components coming from South Asia, Levant or even Africa in the Yamnaya culture (I am thinking of Lazaridis, Dzudzuana, natufians, mbuti............. etc).
They have always pretended that we Basques are ultranationalist westerners to hide the fact that the only ultras are them.They do not understand that we will never accept that a bunch of Americans who have no fucking idea of the prehistory of Europe tell us what we are or what we are not (unless they present us with irrefutable scientific proof of their theories). I say this in relation to the non-Indo-European western languages (Basque, Iberian, Tartessian, Aquitanian, Etruscan...) and to the origin of the R1b-L51>P312 lineage. The quarrel between Europeans is not the business of the Asians but maybe your opinion is interesting in this respect
"I have seen how Wang 2019 successfully modeled (using qpAdm) eneolithic steppe as a mixture of Yamnaya (Caucasus, Samara, Kalmykia, Ukraine) and GAC (approx 17%)"
ReplyDeleteMaybe you meant Yamnaya as 83% steppeEn and 17% GAC. That would sound about right to me. Because yamnaya is much later than steppe eneolithic.
"What do you think about this as an expert handling qpAdm?"
I have yet to run western european or iberian peninsula, so cant comment much.
"On the other hand, the difference between CHG and Iran_Neo remains a mystery to many people although I suppose you will be able to model Yamnaya (or Khvalynsk, Sredni Stog, eneolithic steppe, afanasievo) convincingly using either CHg or Iran_Neo interchangeably and I suppose those models will be equally valid."
So far, yes. But im working on a qpgraph model which will be able to tell the difference between the 2. As more paleolithic samples get published we will be able to tell CHG vs IranN or IndiaN very clearly.
"But, can you successfully model the Basques (and the rest of Spaniards) using Iran_neo or CHG? "
As a matter of principle i dont model moderns using ancients, its flawed.'
Regarding the politics of IE, some of it is conscious bias and some unconscious, but its there and has been there for indian subcontinent since british colonial times of IE studies.
@Gaska
ReplyDeleteIt is nice to see peopel resistant to mainstream academia. May I ask what exactly is your theory regarding IE dispersals?
Yeah steppe eneolithic (83% or 85%) + GAC (17%) or Iberia (15%), the percentages of WHG in these cultures are very similar so they can be used interchangeably and both models are acceptable.
ReplyDeleteI hope that one day you will have time to run Iberia or France in an unbiased way, when you do it we will comment it.
@3rdacc
ReplyDeleteThe only valid solution to determine what language a prehistoric culture spoke is to demonstrate that there is genetic continuity between that culture and later or current cultures. That genetic continuity in principle has to be linked to male unipersonal markers because we know that prehistoric societies were fundamentally patrilineal and habitually practiced exogamy. I cannot speak about the situation in Central and South Asia because I do not know well the cultures of Iran and India so I could only say nonsense about it.
But if I know well the situation in Western Europe and the linkage of R1b-P312 with the spread of IE seems to me (at the moment) nonsense. As papers are published, the steppe theory seems more difficult to defend. In Iberia, southern France and Italy the main historical Iron Age cultures and peoples were overwhelmingly R1b-P312 and yet all of them spoke Non-Indo-European languages (I have already mentioned Basques, Iberians, Etruscans, Aquitans and Tartessians). In Greece and Anatolia they have not even found R1b-M269 (Z2103 or L51) linked to Mycenaean, Minoan or Hittite cultures. The situation has long been comical and the Kurganists keep inventing absurd arguments to defend the indefensible (scarcity of ancient DNA, errors in the cultures that have been analyzed, matrilocalism, adopted children)- From my point of view the major expansion of Indo-European languages in Europe is a matter of the Iron Age (Celts and later Romans) not the Chalcolithic or the Bronze Age, and the obvious association of R1b-P312 with the BBC means that this culture never spoke IE but a kind of Ibero-Basque (or maybe the European Neolithic farmers spoke other languages such as Sardinian or Etruscan). In any case, I have never understood the supremacist ambition to claim that the horsemen of the steppes conquered mainland europe on the back of their horses thanks to their physical, social, economic and technological superiority. For me these are just fairy tales that have absolutely nothing to do with reality. Genetic, archaeological, anthropological and linguistic evidence points in another direction.
On the one hand, we should thank Harvard and European geneticists for the effort they are making to reconstruct humanity's past. They are intelligent scientists and they use interesting tools to analyze genomes but this does not mean that we have to agree with their conclusions because many of them are totally absurd. I have already said that Anthony seems to me to be a charlatan and that his arguments regarding the origin of IE are biased, weak, and inadequate.
Reich and the Harvardians are of the opinion that IE originated in Iran? Ok, what's the problem? I have no idea what the solution is but I think that regarding the spread of languages in antiquity there are factors that we cannot yet control or understand (autosomal composition, role of women, trade, migrations, exogamy, religious cults, elites, population shortage). Perhaps Yamnaya spoke an IE language and perhaps the unipersonal markers linked to that culture spread iE in central Europe and the Balkans, but at the moment they cannot say that this was done by R1b-L51>P312. It may be that Late Bronze Age steppe cultures carried IE into Central and South Asia and perhaps R1a is linked to this issue but much more genetic studies in Iran, Afghanistan, and India are needed to reach a satisfactory explanation. For the moment we can only speculate and elaborate theories to be discussed in a scientific and unbiased way.
Let me just say that the whole idea of linking y haplogroups to languages is ludicrous in a multi y haplo society.
ReplyDeleteThis is again bias developed by kurganists due to sintashta aNd steppe Mlba being 100% z93. No other non isolated society (modern or ancient) has this sort of concentrated y haplo frequencies.
I am stuck at not being able to find the packedancestrymapgeno file anywhere (10Oct21.packedancestrymapgeno 2.8gb file in your video).
ReplyDeleteYou won't have that file. That's my custom file. Use the downloaded file from Harvard Allen database. Extensions will be .Geno, .SNP and .IND
Deletei believe downloaded file names are v50.0_1240k_public.geno, v50.0_1240k_public.snp, v50.0_1240k_public.ind.
ReplyDeleteUse these names in the parameter files, or rename them and use the new name.
This comment has been removed by the author.
ReplyDeleteI get this error when I run the qpfstats -p parqpfstat.txt command:
ReplyDeletefatalx:
(stats) zero popsize
Aborted (core dumped)
Any ideas where I might have gone wrong?
Yes. The program cannot find your label in the .ind file.
ReplyDeleteCrosscheck that all the labels you use in Lista file are present in the 3rd column of the .ind file.
ind file 1st column = sample id, 2nd =M/F, 3rd is group label. Feed this group label into Lista file, with exact spelling.
Thank you, I managed to get it work, your help was invaluable.
ReplyDeleteI, now, am trying to figure out how to build the right pops.
Also, do you think you can do one more tutorial of how we can add our own raw DNA files to the dataset?
Thanks, again.
If you don't mind, could you email your data to me? vikramraj12344321@gmail.com.
ReplyDeleteI haven't needed to do it yet but I will figure out how to make it work using your data. i haven't got my dna tested yet.
"I, now, am trying to figure out how to build the right pops."
ReplyDeleteFor right pops, there is just one rule
None of the right pops should have provided ancestry to any of the sources AFTER the source splits from the true source of the target.
Eg. If true sources are x and y for target T but we don't have the actual sample for x but rather a close cousin X, you should avoid using any right pop which provides ancestry to X AFTER the split between x and X.
You of course won't know what x is apriori, that is why you are testing models, but just use judgement
For Eurasia bronze age I use right pops which cover all bases, eg Mbuti, romania_oase/zlatykun, Morocco iberomaurisian, ehg, whg, Anatolia, ppn, ganj Dareh, chg, Wshg/botai/tarim, onge, mongoliaNorth.
As per Harney et al 2021, any of the sources giving ancestry to any of the reference populations in right is allowed.
ReplyDeleteThis comment has been removed by the author.
ReplyDeletego to this folder
ReplyDeletecd /home/Downloads/bruh/qpadmexample/
and run the command. make sure even parameter file is in this folder. ensure qdata1.snp exists
Sorry for the silly question:
ReplyDeleteDo I make those files manually (with the names) or are they supposed to be there? if its the latter then they weren't there before I made two empty files and gave them these names.
Again, bear with me please, lol
no need to apologize.
ReplyDeleteyou need to download this file and extract the zip.
you will get these files after extraction.
v50.0_1240K_public.geno
v50.0_1240K_public.snp
v50.0_1240K_public.ind
use these files as names in the parameter text file and specify the location too. eg.
genotypename: /home/ash/Downloads/harvarddata/v50.0_1240K_public.geno
In my blogpost and screenvideos, im using a merged dataset hence the name is different (10Oct21.snp etc).
ReplyDeleteYou need to use the name of the downloaded files. worry about custom files and merging etc later.
I see.
ReplyDeleteSo, now I have a question
In the parameter folder, should I write it like this? :
/home/ash/Downloads/yes/v50.0_1240K_public.geno
/home/ash/Downloads/yes/v50.0_1240K_public.snp
/home/ash/Downloads/yes/v50.0_1240K_public.ind
v50.0_1240K_public.geno
v50.0_1240K_public.snp
v50.0_1240K_public.ind
(yes being the name of the file I keep the extracted files in)
also, should I add anything else to the qpadmexample file (like the files extracted etc)?
I have given link to my files in the post. Just adapt it to your own directory name and file name
ReplyDeleteThanks a ton. I appreciate your help and patience
ReplyDeletehmm...I tried using your files (of course, after adapting them to my own directory name and file name) but the error still doesn't go..it says that it cannot open the snp file (same issue I've been going through since the comment I deleted for some reason)
ReplyDeletemail me a screen record video
ReplyDelete@H
ReplyDeleteeverything working now?
Yessir
ReplyDeleteIf I want to run a specific sample, do I do this? (example: Iran_HajjiFiruz_C_Iran_I4349)
Change the 3rd column label name for the specific sample and use that new label
ReplyDeleteCan people post some right pop lists for ancients and moderns, for West Eurasians, so I can get some ideas, I think mine is blotted (below).
ReplyDeleteMbuti.DG
Russia_Ust_Ishim.DG
China_Tianyuan
Goyet_Neanderthal.SG
Russia_Sunghir3.SG
Russia_Kostenki14.SG
Spain_ElMiron
Russia_MA1_HG.SG
Georgia_Satsurblia.SG
Papuan.DG
Switzerland_Bichon.SG
Israel_Natufian_published
Morocco_Iberomaurusian
Russia_AfontovaGora3
ONG.SG
Russia_DevilsCave_N.SG
Thanks.
My fave for bronze and iron ages eurasians, especially west and SC asians is
ReplyDeleteMbuti.DG
EHG - karelia
WHG - iron gates
WSHG/Tarim
Barcin_N
Kotias_CHG
Onge
Sarazm
GanjDareh
Irula
Mongolia_N_North
I always get the " (openit) can't open file /home/konowryy/v50.0_1240k_public.ind.snp of type r
ReplyDeleteerror info: No such file or directory" error when typing the code in terminal and idk what to do, Your videos are not clear because of all these renamed files. I'm not even sure I will get an answer but i post this comment anyway.
Can one rename samples in the .ind file?
ReplyDeleteWhy isnt TarimEMBA in the .ind file?
ReplyDeleteHow do I fix that?