The Archaeogenetics blog: R1a Explained

Friday, October 14, 2022

R1a Explained

Y Haplogroups - A Primer

Y Chromosome haplogroup is a defined set of mutations in the (non-recombinant) portion of the DNA from male specific Y-chromosome. Y-Chr is passed down from father to son along with all the mutations that have been accumulated till that point. All the male descendants of a man will share the mutations in the Y-Chr of that man, plus all the additional mutations that have been collected in the generations between that man and the descendant on his specific male ancestral lineage chain. This feature helps us in identifying the most recent common paternal ancestor of a group of men and is a great tool for population geneticists. Do note that the same set of additional mutations cannot be created in two different men at the same time, the probability of that happening randomly is 0.

mtDNA haplogroups are mutations found in the mitochondrial DNA. Unlike autosomal chromosomes and Y chromosomes, mtDNA is found outside the cell nucleus. mtDNA is passed down from mother to children only, and therefore is informative only of the maternal lineage of a person.

In this article, we will discuss Y haplogroups which can only inform us of the paternal ancestry of males.

Y Hg simple tree: Source

The nomenclature for assigning haplogroup varies, and the most commonly used notations today are ISOGG and YFull. ISOGG notations go like this:

R>R1>R1a>R1a1>R1a1a>R1a1a1>.......

R>R1>R1b>R1b1>R1b1a>R1b1b1>.......

R>R2>R2a>....

Here, each subclade has an additional set of mutations than the previous ancestor. So, R1a and R1b have accumulated a different set of mutations from a common ancestor with Y-hg clade R1 which sets their lineages apart from then onwards. The common paternal ancestor shared by all individuals in the R1a, R1b and R2-carrying individuals is an ancient man who carried the R haplogroup.

A similar notation is applied to other all other haplogroups.

Here, it is worthwhile to get into the terminology that will be used in the subsequent portions of this article, explained at a very basic level.

SNP - single nucleotide polymorphism. Pronounced 'snips'. Each SNP represents a difference in a single DNA building block, called a nucleotide. Each SNP is found in a particular position of the genome (defined by physical position) and is given a name and a reference allele (A, C, T, or G). Harvard ancient DNA database currently provides samples in 1240K SNP resolution (across all chromosomes, in reality not all SNPs can be retrieved from ancient samples). Of these 1.24 million, 32670 SNPs are on the Y Chromosome.
Mutation - for a sample, a SNP for which the sampled allele is different than the reference allele.
eg. These are some of the mutations which define the Haplogroup R as per ISOGG.
In the first line, SNP Names M207, Page37, and UTY2 are given to the unique rsID rs2032658 found at position 15581983 (Human genome reference build #37). A mutation is detected if the allele 'G' is found at this SNP instead of the reference allele 'A'. This mutation, among many others, defines the Haplogroup R.
Positive SNP: If a mutation is found at a SNP. eg. 'M207+' implies that 'G' was found at the above-mentioned SNP.
Negative SNP: If a mutation is not found at a SNP. eg. 'M207-' implies that the haplogroup is likely not 'R'.
Haplogroup - a major branch, such as Haplogroup E or Haplogroup I
Clade - from the Greek word klados, clade means branch, and subclade means a further branch. eg. I2 is a branch of Haplogroup I
Terminal Hg - the branch with the youngest estimated age (age closest to the present) for a sample in the currently available database. For eg., if a sample is found positive for R1a1a, and no further mutations in the available database are found positive, the terminal Hg for the sample is marked as R1a1a. This is important as it helps us ascertain male movements in ancient times more precisely.
Upstream - all SNPs which are a link in the chain of ancestors till a given mutation.
Downstream - all SNPs which are descendants of the ancestor with a given mutation.
Asterisk * - An asterisk after a subclade means that there are further mutations after the terminal Hg but no other available sample shares those additional mutations, therefore they remain unclassified.
Plus + - Means a positive SNP, indicating that terminal Hg is either at this SNP or downstream of it. eg Z93+ includes terminal Hgs Z93, Z94, Z2124, Y3, Y2, L657 and everything else under Z93.
Minus - : This means a negative SNP call, indicating that SNP and all downstream SNPs are negative as well.

Yfull names clades and subclades in a more precise manner and has a larger Y chromosome SNP database. ISOGG nomenclature (R1a1a1, R1b1a2 etc) can keep on changing as more intermediate mutations are found between two levels, causing confusion.

Whereas, Yfull names clades with the positive SNP. eg R1a1a1b2 is named R-Z93 because the defining SNP Z93 is found positive. This has the advantage of being static, as the naming is not dependent on the names of the ancestors as in the case of the ISOGG convention. Furthermore, if the terminal Hg of a sample is R-Z93, it is assumed that the sample is positive for all the previous mutations (upstream) in the chain. eg. the sample is R+, R1+, R1a+, R-M459+, R-M735+, R-M198+, R-M417+, and R-Z645+ while being negative for all subclades downstream of Z93.

We shall mostly use Yfull notation for the R1a analysis below, and sometimes ISOGG notation will be additionally used within brackets.

R1a and R1b Found in ancient DNA

In the aDna database, R* is found in 22000BCE Siberia. So we can assume that R was prominent in the ANE people (Ancient North East). The ultimate origin of Hg R is from the subclades of Hg P. Hg P is widely present among Malaysians and Filipinos. So, the ultimate origin of R is SE Asia (Hallast et al 2021). The route taken to reach Siberia might have been via India (P-M1254 in Andamanese). There are also two P-P337+ samples found in the 30000BCE site of Yana in Siberia. So whatever the case is, the Y haplogroups reached Siberia from Southeast Asia very early.

Clade R2

Formation date ~26000BCE. Subclade R2 appears in the oldest samples from Iran (Ganj Dareh 8000BCE). R2 is also found in SC Eneolithic samples (oldest from that region), but absent from other Central Asian and Siberian sites. So, we can hypothesize that some R men migrated from Siberia to SC Asia/Iran/India where it mutated into R2 in one of their descendants and spread across the region. A more likely scenario is that on the route from SE Asia to Siberia, some R men spread in India/Iran and their descendants mutated into R2.

Clade R1b

R1b splits into R-L754 (R1b1, seen in 12000BCE Villabruna) and whose descendants are visible in the neolithic steppe as well as modern Europeans. The other split is into R-PH155 (R1b2) only seen in the east (Southern Central Asia and Tarim). Basal R1b* (pre PH155 - it has some mutations of PH155, but not all) is seen in a BMAC sample from Uzbekistan. Furthermore, all the 2000BCE Tarim Basin samples are R1b-PH155 or pre-PH155.

With this, we can hypothesize that R1b was born in Siberia/Central Asia from R1 and spread to Western Europe where it mutated into R-L754 and mutated locally into R-PH155 in central Asia.

Clade R1a

The oldest sample is a 10700 BCE man from NW Russia, with R-M459 (R1a1). R1a1 is also seen in the EHG hunter-gatherers (Karelia) and the Eneolithic steppe (Khvalynsk). Local continuity is also seen till the bronze age, where in Fatyanovo we see R-Z645 (R1a1a1b) and the descendant R-Z93 (R1a1a1b2).

The brother clade of R-Z93 is R-Z283 (R1a1a1b1), which is first found in Estonia Corded ware 2600BCE and now is primarily present among central and east Europeans.

The absence of R1a in every sample from Iran and South Central Asia till 2000 BCE makes it very likely that R1a is indeed not local to this region. The earliest sampled R1a-Z94 in this region is associated with autosomal steppe ancestry.

R-Z94 Tree

In the next sections, we will focus on R-Z94 and its subclades. This branch is important to understand because it is often claimed to be the carrier of Indo-Iranian languages from the Steppe into India and Iran. So, it is important to understand the main subclades of Z94 for a nuanced discussion about it's spread.

Z94 tree as per Yfull

Note: Formation dates are taken from Yfull. Here's their methodology for calculating age:

The second formula uses an assumed mutation rate of 144.41 years (0.8178*10-9, which is the average of the mutation rates of the ancient Anzick-1 sample and of a group of known genealogies, and an assumed age of 60 years for living providers of YFull samples.

Their Y-chromosome has a total length of 8,473,821 base pairs. So, 0.8178*10-9 * 8,473,821=0.00692989081 mutations per year. 1/0.00692989081 ~ 144.41 years per mutation.

But Ding et al 2021 suggest that the mutation rate for R1a subclades is lower than for other haplogroups, therefore suggesting older divergence dates. In this article, I will just go by the Yfull dates.

Distribution of R1a and subclades in India by Region - Dr Chaubey Data

The below data is from one slide in Dr Gyaneshwar Chaubey's presentation (unpublished work)

Of the 1196 samples analyzed from all over India, 23% belong to R1a or downstream. 16% belong to R-Y3+ (including L657) whereas 7% belong to the non-Y3 branch of R1a, and can be assumed to be of steppe origin, especially the ones which are Z2124+.

NW India has the highest % of R1a at 41%, of which the non-Y3 branches are 20% and the Y3+ samples are at 21%. Definite steppe impact is seen here with 13% Z2124+ samples. Ganga plains, central, east & south India have 23, 33, 20, and 17% R1a respectively. The bulk of that is Y3+, and only 4-10% is Y3-.

None of the samples from the central, east and Ganga plains show Z2124+. This difference in frequency in Z2124+ and Y3+ samples is important and I will come to it a bit later, just keep the numbers in mind.

Eurasian R1a distribution, from 11036 samples uploaded on Yfull as of Oct 2022

Y haplogroups of 11036 male samples available on the Yfull tree (modern and some ancient) from (select countries of) Europe and Asia are presented below. This is a great dataset, you can browse it at leisure (use a PC/laptop for a better experience with the Tableau dashboard). I have only focused on Z93 and downstream subclades for this dashboard. Note: Saudi Arabia label includes samples from Saudi, Kuwait, and Oman.

Of the 349 Indian samples on Yfull, 99 are R1a+, ie ~28%. 26 belong to Y3- (22 Z2124+, rest Z93+) ~ 7.4%.

While 69 belong to Y3+, L657+ ~ 19.7% of all males. Note that since YFull samples are (mostly) paying customers, the data will be skewed towards the relatively wealthy and tech-savvy demography.

So we can say that ~70% (60-80% range) of all Indian R1a is R1a-Y3+.

These numbers are in line with the Chaubey data presented in the previous section.

Let's have a look at some Specific Branches:

1. Z-93. This clade has many downstream subclades, 3 of which are minor and have a clear origin in the steppe region. These minor ones are

R-BY226207 (most modern samples found in Tajikistan, Xinjiang, Italy & Kyrgyzstan. One ancient sample from Srubnaya_alakul bronze age steppe site as per Yfull).
R-YP5585 (1 ancient sample from 1800BCE srubnaya_alakul, most modern samples from England, 1 from Kuwait and 1 from madhya Pradesh, India (community unknown)).
R-FGC82884 (One notable ancient sample which falls downstream of this- DA382 Turkmenistan_IA 850BCE. Another ancient sample downstream is from bronze age Srubnaya context. Other modern samples are from Hungary, Poland, China, Kyrgyzstan, Kazakhstan, Russia, Arab countries).

2. Z94. Formed ~2600BCE from Z93. Z94 terminal has been found at steppe bronze age sites. It has also been found at Swat valley Pakistan iron age sites, and among some Saka samples, which makes sense given the steppe connection to SC Asia and Swat.

R-Z94 has 1 minor subclade and 2 main subclades. The minor subclade is R-Y40. The main subclades are Z2124 and Y3.

3. Z94>R-Y40. formation date ~2600BCE- No ancient sample under this subclade has yet been found. In terms of modern distribution, Yfull has downstream samples from a Tamil speaker, Singapore (likely a Tamil speaker), Bangladesh, Pakistan, Jordan, Saudi, Syria, Iraq, Italy (likey middle east connection), Xinjiang, Kuwait, Qatar, Kerala, Konkan, Gujarati, Tajikistan and Turkey. On FT DNA as well, samples with Y40+ are from Turkey, Saudi Arabia, Italy, Bulgaria (name Ismail), South India and Kuwait. In the Yfull data, this subclade of Z94 (and its downstream mutations) is only found in India, Pakistan, Bangladesh, China, the Middle East and Tajikistan even though its formation date is ~2600BCE. This mutation wasn't found in any of the 2200 Russian samples on Yfull. The birth of this subclade can be considered to be in the Indian subcontinent.

4. Z94>Z2124. Formation date ~ 2600BCE. The oldest Z2124 sample is from Poltavka, Samara Oblast, Russia ~2600BCE in sample id I0432, perfectly in line with the proposed formation date. All the Sintashta 2000BCE samples belong to Z94>Z2124+, making it the quintessential steppe_mlba marker.

It is relatively minor in the Pakistani & Indian Yfull samples (~7 - 9%) but is present in 80% of Afghan samples. I think that this increase is a result of the brutal Ghaznavid invasion of Gandhara from Central Asia post 900CE. Ghaznavids were Persianized Mamluk Turks who followed Sunni Islam. This is corroborated by the Swat Valley Medieval Ghaznavid period samples who have higher steppe ancestry than the Swat iron Age or Buddhist historical periods. 3/6 Swat Medieval samples carry R1a, 2 are Z93+ and one is Z2124+.

5. Z94>Y3. Formation date ~ 2600BCE. no definite ancient sample found [1 found at Ukraine 2000BCE is questionable because of a single C--->T mutation] but no modern Ukrainian data that is available publicly is R-Y3+ (378 UKR samples on Yfull). Someone pointed out the presence of R-Y2 in one Ukrainian Cossack on https://www.theytree.com/tree/R-Y3, which could have come from Tatar/Bashir Kipchaks. However, that tree has Mongolian Sagly_EIA sample I7030 marked as R-Y3, which is clearly wrong. That sample is R-Z2121+ (as confirmed by the Harvard annotation file as well). So the accuracy of other samples on this website is also questionable.

Most of the Indian subcontinent R1a falls under R-Y3. In the Chaubey data, there are 16 Indian samples which are R-Y3+ but Y27-. These 16 should have terminal Hgs Y3, Y2 or FT12493.

3. Z94>Y3>Y2. Formation date ~2600BCE. Only one sample with terminal Hg Y2 is available on Yfull. He is a man from Kuwait. Further context is not known. In the Chaubey data, there are 16 Indian samples which are R-Y3+ but Y27-. These 16 should have terminal Hgs Y3, Y2 or FT12493.

4. Z94>Y3>FT12493. Formation date ~2600BCE. Only 2 samples from the Middle East have this rare terminal Hg. It is absent from all steppe countries. Therefore, I will consider its formation to be in the Indian subcontinent. In the Chaubey data, there are 16 Indian samples which are R-Y3+ but Y27-. These 16 should have terminal Hgs Y3, Y2 or FT12493.

5. Z94>Y3>Y2>Y27>F1417. Formation date ~2300BCE. There are 10 men downstream of this mutation from the Middle East. On the FTDNA database, 1 Turkish Romani male is positive for this mutation. 1 ancient sample ALN005 (Alai Nura, Kyrgyzstan) from 400CE and 4 Kipchak Turkic-speaking Tatars and Bashkirs from Russia fall downstream of this mutation.

This deserves an explanation. The Kipchak language was spoken by Buddhists from central Asia. Alai Nura also lies on the silk road route from Pakistan to Xinjiang. The earliest inscriptions of Kipchak are derived from Buddhist inscriptions written in Indo-Aryan languages. The Buddhism connection explains how Y27+ from the Indian subcontinent reached the Tatars and Bashkirs.

6. Z94>Y3>Y2>Y27>L657. Formation date ~2300BCE. L657 is the most common R1a subclade in the Indian subcontinent, however, it is entirely absent from the ancient DNA database between 2300BCE-0CE. The earliest L657 sample in the aDNA record is from 100BCE Xinjiang, more than 2000 years after its formation. (However, it's based on a dubious G-->A call. Otherwise it's simply Y27+ rather than L657, C->T and G->A calls are often ignored while calling ancient Y chromosomes). This was likely carried to Xinjiang from the Indian subcontinent, as Xinjiang's genetic contact with India started as early as 500BCE as per Kumar et al 2022. We know that Buddhism also spread to Xinjiang via the silk route post 200BCE.

From Yfull data, L657+ is present in ~19% of Indians, 13% Pakistanis, 19% Bangladeshis, 15% Sri Lankans, and just 1 sample each from Afghanistan, Iran, Kyrgyzstan, Tajikistan multiple multiple samples from the Arab world. However, it is completely absent from 2200+ Russians, and 1200+ Ukrainian/Poles/Baltic samples combined. Therefore, the formation of this subclade has an extremely high probability of being local to the Indian subcontinent.

DISCUSSION

There are a lot of Z94+ samples in the Middle East, but as a percentage, the number is quite small. This region is highly oversampled on Yfull as compared to India. For eg, on Yfull, Saudi Arabia is sampled at 98 times more than the per capita world average, Kuwait 360 times, whereas India is sampled 0.07 times per capita world average. Although there aren't many studies on the presence of India-specific R1a lineages in the Arab world, I believe this can be linked to the migration of Romani/ Kawliyah/ Nawar/ Dom people from India to Syria, Iraq, Jordan, Saudi Arabia, Kuwait regions in the last 1000 years. Some samples may even be from recent work migrants from India, Pakistan and Bangladesh to the middle east, although enough context of these samples is not made available by Yfull.

The Romani R-Y3 connection to the middle east is supported by the following statements made by Underhill et al 2015. Note that R-M780 is another name for R-Y3.

Notably, R1a-M780 occurs at high frequency in South Asia: India, Pakistan, Afghanistan, and the Himalayas. The group also occurs at >3% in some Iranian populations and is present at >30% in Roma from Croatia and Hungary, consistent with previous studies reporting the presence of R1a-Z93 in Roma.

There could also be bronze and Iron age connections from the Indian subcontinent to the middle east, however, we don't have ancient DNA from these regions to confirm or deny this.

Modern distribution of R-Y3+ (M780+). From Underhill et al 2015. Used under CC license

The density (and diversity) of R-Y3+ is highest in the Ganga plains and East India, so that remains the likely region of expansion of this clade.

Did an army of macho Aryan men invade India and give the Vedic language and culture?

It is important to say at the outset that Y haplogroups do not biologically encode for the language spoken, that is a sociological outcome. In the modern world, there are R1a people who speak European languages, Indo-Aryan languages or Iranian languages but many R1a also speak Dravidian, Semitic, or Turkic languages. The most robust reason for the spread of language families like Indo-European is, as Bellwood 2020 puts it, "that they travelled originally in the mouths of migrating populations."

This means that autosomal ancestry change (agnostic to sex bias) is a stronger predictor of language change than Y-haplogroups. Unlike the Y chromosome, autosomal chromosomes (1-22) undergo recombination and both son and daughter receive ~50% admixture from each parent. So the level of external autosomal admixture coefficient can tell us about the ratio of migrants-to-locals, and analysis of the X & Y chromosomes of the samples can tell us about the sex ratio of those migrants.

The larger the autosomal ancestry impact from migrants, the more likely it is that there would be a large impact on the local languages of the region, and given a large enough migration (still a subjective %), there will be an eventual language replacement. This is exemplified by the presence of a high percentage of 'Yamnaya steppe' autosomal ancestry in Corded-ware bronze age cultures of Europe (>70% as per Scorrano et al 2021), which is the reason for the Indo-European languages in those parts of the world as per most researchers. At the same time, Basques are one of the few non-Indo-European speaking groups of modern Europe and they show lesser (~40%) Steppe ancestry but with 80+% Basque males being Y Hg R1b (Olalde et al 2019 and Luis et al 2021). R1b is considered to be a part of this steppe expansion west into Europe. So clearly in the case of Basques, high R1b frequency was not successful in replacing the language of all the Basque ancestors, but the autosomal ancestry gives us a better picture - lower (40%) in Basques vs 70%+ in IE-speaking Germany.

Without knowing the exact societal processes of a given part of the ancient world, the one statement which can withstand scrutiny is 'The larger the migration, the more are the chances of eventual replacement of the local language.' There are also recorded instances of languages being replaced due to migrant women, eg. the replacement and (almost) extinction of Badeshi by Torwali in Swat, Pakistan within a few generations due to Badeshi men marrying Torwali wives.

From the aDNA database, Yfull and Chaubey data, we have seen that apart from the Z2124 subclade of R1a-Z94 whose origin is directly in the bronze age steppe cultures, no other subclades of Z94 can be said to have been born in the steppe cultures, although they still descend from an ancestor who lived on the steppe.

This distinction is important, at least in the context of the Aryan theories of IE language introduction into the Indian subcontinent (and Iran). The reason it is important is that a new mutation can only be born in one man - so if the bulk of the R1a in modern Indians is on the R-Y3 line, and Y3 was born in the Indian subcontinent around 2600BCE, it invalidates the claim of 'mass migration of steppe men' into the Indian subcontinent. Rather what it points to is an organic growth of R-Y3+ and especially R-L657+ lineages which are the descendants of a single Indian man who had the R-Y3 mutation around 2600BCE.

There are papers which corroborate this expansion post 2600BCE. Poznik et al 2016 states

In South Asia, we detect eight lineage expansions dating to ~4.0–7.3 kya and involving haplogroups H1-M52, L-M11, and R1a-Z93. The most striking are expansions within R1a-Z93, ~4.0–4.5 kya. This time predates by a few centuries the collapse of the Indus Valley Civilization, associated by some with the historical migration of Indo-European speakers from the western steppes into the Indian sub-continent

Note that 4.5 kya means 4500 years ago, ie 2500 BCE.

Starting with the Swat valley bronze age, the Indian subcontinent receives Steppe_MLBA (Sintashta R1a-Z2124 related) autosomal admixture. Analysis of the Swat valley Iron age samples reveals that the region received 15-20% ancestry from steppe sources around 1650 BCE. Narasimhan et al 2019 also note that steppe migration was mediated by women. This is because they only found 2 out of 44 male samples with R1a, and neither of these 2 was R-Y3+. Both were R-Z94 terminal Hg.

However, a strange (but not unexpected) claim is made by Narasimhan et al 2019:

Using previously reported calls on 1000 Genomes Project Y chromosomes, we observe that 62 out of the 221 South Asian males have an R1a Y chromosome corresponding to a ninety-five percent binomial confidence interval of 22-34% for Steppe MLBA ancestry on the entirely male line, which is significantly higher than the ninety-five percent confidence interval of 9-14% on the autosomes in the same set of individuals. These results shows the process of admixture of Central_Steppe_MLBA into the ancestors of the ANI was male biased, and reveal that the directionality of sex bias was opposite to the pattern observed for the contribution of Central_Steppe_MLBA to SPGT.

To put it simply, the authors are claiming that although the swat valley data says the opposite, modern data shows them that the steppe autosomal ancestry in modern South Asians must have come primarily via men from the steppe carrying R1a-Z93. The 221 South Asians are from four communities: Sri Lankan Tamil in the UK, Gujaratis in Houston, Telugu in the UK, and Punjabi in Lahore. Notice the erroneous assumption made here: All of Indian R1a is treated as Z93 without any of the nuances that they offer to other regions of the world. Eg. The recent 'Southern Arc' paper analyzes various subclades within R1b with nuance for Steppe, Armenia & NW Iran. The presence of M780 (aka Y3) and L657 in Indians was known since 2015, so that is not a valid reason to ignore the subclades of Z93.

If we substitute what we know from this article, only 14 of the 64 (Narasimhan reports 62, Poznik et al 2016 have 64) South Asian R1a from the 1000 Genome Project are Z2124+. Considering that 2000-1000BCE steppe only had Z2124+ samples, we see 14/221 (3% - 9.5% at 95% CI) of Steppe_MLBA on the Y chromosome instead of the claim of 22-34% by Narasimhan et al. The conclusion of male-mediated steppe migration completely falls apart. And now, the data from modern South Asians matches with the data seen at the Swat valley Iron age sites. In typical Eurocentric fashion, the Harvard researchers are just trying to force-fit the events of male-mediated steppe dispersals in Europe onto the genetic prehistory of the Indian subcontinent. The archaeological and genetic histories of the two could not be any more different.

Instead, we must propose the idea of a few men who entered India early (~2600BCE) and settled into Indian society. That a small number of people could change the language of a large country is not a good explanation for the Indo-Aryan language family. So this early migration should not be linked to language change.

The mechanism of Autosomal & Paternal ancestry transmission

Swat is the widely accepted route for the entry of steppe ancestry into the Indian subcontinent around 1650 BCE. No concrete date for the entry of steppe autosomal ancestry into modern Indian populations is available so far.

If Y3 and Y40 mutations from R-Z94 were formed in the Indian subcontinent around 2600 BCE, then what happened to the steppe autosomal component? Well, it could have become diluted.

Suppose that 1 R-Z94 man from the steppe migrated to Haryana and married a local woman there. They had 2 sons. The sons would have ~50% steppe ancestry and Z94 Y Hg. Suppose that both sons underwent separate mutation on their Y Hg and formed Z94>Y40 and Z94>Y3 respectively. Their male descendants would inherit these 2 subclades, but with each generation of marrying local women, eventually, the steppe ancestry would fall to ~0% in 100-120 years (50% >25% >12.5% >6.25% >3.125% >1.55% >~0)

In theory, just the migration of one Z94 man into India is sufficient to explain the current prevalence of his descendant R-Y3+ and Y40+ lineages in the Indian subcontinent. But in reality, it could have been more than one although not large enough to leave a detectable autosomal genetic impact. This is because the ~2000BCE Rakhigarhi sample has no autosomal steppe ancestry (Shinde et al 2019), and there are no other indications of the presence of early (~2600BCE) steppe autosomal ancestry as per admixture dating of various groups computed by Narasimhan et al 2019.

In this way, a Y Hg originally associated with the steppe autosomal ancestry could lose its association in a new population. There are various examples of this.

1. The only 2 R1a males in the 850CE Roopkund_A cluster have zero to minimal steppe ancestry, whereas all the other male samples with much higher steppe ancestry are non R1a.

2. Chenchu tribe of south India has ~25% R1a frequency but no steppe ancestry (Kivisild et al 2003)

3. There's a CHG/Iran related Y hg J sample in Karelia, Russia which has only Eastern Hunter Gatherer (EHG) autosomal ancestry.

and many more.

As another example, in the 'Southern Arc' supplement, the following point is noted about R1b:

However, the complete lack of association of R-haplogroup descendants and EHG ancestry in either Armenia or Iran is consistent with either a massive dilution of EHG ancestry in these populations resulting in the dissociation of Y chromosome lineages from autosomal ancestry over time, or with a scenario in which R-M269 was not associated with substantial EHG ancestry to begin with.

In the above statement, you could replace 'EHG' with 'steppe', 'Armenia or Iran' with 'India', and 'R-M269' with 'R1a' and the statement will match with what likely happened in India.

It must also be said that just like Y haplogroups can dissociate with autosomal ancestry, they can again reassociate with said autosomal ancestry by pure chance. Lots of changes can happen in 4000 years when we study the genetics of modern people, therefore making conclusive statements based on the correlation = causation fallacy should be avoided. Time transect of ancient DNA from the same region is the best method to tell us exactly what changes occurred at each step.

CONCLUSION

The likely birth of various Haplogroups and clades:

1. R1: West Siberia/Central Asia

2. R1b: West Siberia/Central Asia

3. R1a: West Siberia/Central Asia/Eastern Europe

4. R-Z645, Z-93, Z-94 and Z-2124: Eastern Europe

5. R-Y40 & Y3: Indian Subcontinent

6. R2: India/Iran/SC Asia

The distribution of Z2124 and Y3 samples in India makes it unlikely that there was a male-mediated steppe migration into India post 1700BCE. The presence of some Y3+ lineages in a few Turkic speakers (Russia/China) can be explained via movement from the Indian subcontinent in the Buddhist spread period. The movement of Y3+ lineages to the middle east is likely explained by the migration of the Romani people from the West and NW part of the Indian subcontinent post 900CE, however, more research is needed here.

REFERENCES

Narasimhan VM, Patterson N, Moorjani P, et al. The formation of human populations in South and Central Asia. Science. 2019;365(6457):eaat7487. doi:10.1126/science.aat7487

Underhill, P., Poznik, G., Rootsi, S. et al. The phylogenetic and geographic structure of Y-chromosome haplogroup R1a. Eur J Hum Genet 23, 124–131 (2015). https://doi.org/10.1038/ejhg.2014.50

Poznik GD, Xue Y, Mendez FL, et al. Punctuated bursts in human male demography inferred from 1,244 worldwide Y-chromosome sequences. Nat Genet. 2016;48(6):593-599. doi:10.1038/ng.3559

Qiliang Ding, Ya Hu, Amnon Koren, Andrew G Clark, Mutation Rate Variability across Human Y-Chromosome Haplogroups, Molecular Biology and Evolution, Volume 38, Issue 3, March 2021, Pages 1000–1005, https://doi.org/10.1093/molbev/msaa268

Chaubey, Gyaneshwar. "Dr Gyaneshwar Chaubey talk on Genetics behind the Aryan Migration Theory." YouTube, uploaded by Aditya Bakshi, 20 Sept. 2019, www.youtube.com/watch?v=PIeiHsGUeEU.

Kumar, Vikas et al. “Bronze and Iron Age population movements underlie Xinjiang population history.” Science (New York, N.Y.) vol. 376,6588 (2022): 62-69. doi:10.1126/science.abk1534

YFull Tree - https://www.yfull.com/tree/

FTDNA R1a Project - https://www.familytreedna.com/public/R1a/default.aspx?section=ycolorized

Hallast, P., Agdzhoyan, A., Balanovsky, O. et al. A Southeast Asian origin for present-day non-African human Y chromosomes. Hum Genet 140, 299–307 (2021). https://doi.org/10.1007/s00439-020-02204-9

Bellwood, P. (2020). Some observations on the trans-Eurasian language family, from the perspective of the Farming/Language Dispersal Hypothesis. Evolutionary Human Sciences, 2, E37. doi:10.1017/ehs.2020.34

Gabriele Scorrano, Fulya Eylem Yediay, Thomaz Pinotti, Motahareh Feizabadifarahani & Kristian Kristiansen (2021) The genetic and cultural impact of the Steppe migration into Europe, Annals of Human Biology, 48:3, 223-233, DOI: 10.1080/03014460.2021.1942984

Olalde I, Mallick S, Patterson N, et al. The genomic history of the Iberian Peninsula over the past 8000 years. Science. 2019;363(6432):1230-1234. doi:10.1126/science.aav4040

Luis, J.R., Palencia-Madrid, L., Mendoza, V.C. et al. The Y chromosome of autochthonous Basque populations and the Bronze Age replacement. Sci Rep 11, 5607 (2021). https://doi.org/10.1038/s41598-021-84915-1

Kivisild T, Rootsi S, Metspalu M, et al. The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. Am J Hum Genet. 2003;72(2):313-332. doi:10.1086/346068

(2018, August 26). Badeshi: Only three people speak this 'extinct' language. BBC News. Retrieved October 16, 2022, from https://www.bbc.com/news/world-asia-43194056

Shinde V, Narasimhan VM, Rohland N, et al. An Ancient Harappan Genome Lacks Ancestry from Steppe Pastoralists or Iranian Farmers. Cell. 2019;179(3):729-735.e10. doi:10.1016/j.cell.2019.08.048

Lazaridis I, Alpaslan-Roodenberg S, Acar A, et al. The genetic history of the Southern Arc: A bridge between West Asia and Europe. Science. 2022;377(6609):eabm4247. doi:10.1126/science.abm4247

22 comments:

Abhi said...: I was wondering why the Narasimhan paper doesn’t add up. The dominant r1a clade in India is L657, which is found earliest at 500 bce in China, not even in the Swat valley. So how can the swat valley steppe be source of Indian steppe ancestry; October 14, 2022 at 11:21 PM
vAsiSTha said...: Not 500bce, L657 is found in 100bce Xinjiang. saomple id C3316 China_Xinjiang_Guanjingtai_IA; October 15, 2022 at 12:11 AM
Mayuresh Madhav Kelkar said...: As usual, a cutting edge analysis. I am reading the following sentence

"Of the 1196 samples analyzed from all over India, 23% belong to R1a or downstream. 16% belong to R-Y3+ (including L657) whereas 7% belong to the non-Y3 branch of R1a, and can be assumed to be of steppe origin, especially the ones which are Z2124+. "

and trying to relate it to your Yfull box chart and unable to visualize how the R1-M780 relate to Y2/Y3/Y4. Another chart with only the India relavant Y at the top and then all the R's further down would help. Thank you.; October 15, 2022 at 8:26 PM
vAsiSTha said...: R-M780 is another name for R-Y3.

As per YFull, it is defined by mutations at 2 SNPs- Y3/F2597/M727 and Y26/M780.

If both these mutations are found, Yfull calls it R-Y3, whereas some research papers call it R-M780.; October 16, 2022 at 1:01 AM
Abhi said...: I looked at the YFupl for Y3, a lot of the people taken were Russian or middle eastern. How do you know of Y3 originated in India?; October 16, 2022 at 1:04 AM
vAsiSTha said...: "I looked at the YFull for Y3, a lot of the people taken were Russian or middle eastern. How do you know of Y3 originated in India?"

Middle east has many samples, but it is the most oversampled region on Yfull. Check this map https://phylogeographer.com/yfull-world-sampling-rate-map/
only 30-40 ME samples out of 5000 are R-Y3+. This is less than 1%, and makes it unlikely that it was formed there. The Romani connection from India is a much better reason for R-Y3 presence in Arab world.

There are precisely 4 Russian samples, all with Z94>Y3>Y2>Y27>F1417, 2 of them are Tatars, 2 are Bashkirs. All 4 are Kipchak Turkic speakers and not Indo European. The earliest Kipchak inscriptions are about Buddhism. So a late spread due to Buddhist movements is the reason for this. These central asian Buddhists later became Turkic speakers. Alai Nura, Kyrgyzstan 400CE sample also falls downstream of this mutation. Samples with south asian admixture have been found in other sites near Alai Nura at the same time. This is the period when Buddhism was at its peak.; October 16, 2022 at 8:18 PM
Abhi said...: Good, I saw a video in a Youtube channel called World of Antiquity explaining the origin of Sanskrit using genetics. there was a link by someone to his channel explaining how genetics proves an archeic indo european presence in India. That was struck down due to the argument that ancient dna, somehow better, shows it is steppe peoplemoved after 2000 bce, not before. I wish they could see your work.; October 16, 2022 at 9:33 PM
Anveṣaṇam said...: I always thought forcing the "male mediated steppe ancestry" on modern day South Asians was a deceiving layered lie. And for the Romanian explanation for presence of Y3 in Middle Easterners, do Romani themselves have this mutation (R-Y3+ and R-Y2-)?; October 17, 2022 at 10:57 PM
vAsiSTha said...: On Ftdna theres this Turkish Roma guy with Y3>Y2>Y27 L657-

E13038 Sarmaşık, Turkish Roma (Horahane), Babaeski Turkey R-M634

On Yfull, there's a Y3+ Y2- (Y3>FT12493) Turkish sample, more context is not given. He could possibly be Romani but cant say more.

There's an Omani as well as a Brazilian (Minas Gerais state) sample as well (Y3>FT12493>FT13945), with no extra known info.

The Brazilian sample could be Brazilian Romani, who have had presence in Minas Gerais after 1600ce, but also in other brazilian regions. https://en.wikipedia.org/wiki/Romani_people_in_Brazil

Older studies on Roma people don't differentiate between Y3, Y2 or L657, so theyre useless in this regard.; October 18, 2022 at 1:01 AM
mzp1 said...: I am not sure R1A is the original IE.

J1 and R1A appear together in a few places. Also IJ is kinda close. Why is IJ and R1a/b appearing so close to each other.

It looks like IJ left the IE homeland a long time ago, and then that homeland was left with R1A/B mostly, which is the last expansion we are seeing.

All of these HGs probably expanded out from the Gujarat region. R does seem intrusive into the IJ and G HGs which are phylogenetically closer and then R would have separated earlier than mixed in again or something.

I actually have J1-P58 but my ethnic group looking at 23andme is like 50% R1A and 50% J1; October 18, 2022 at 6:33 PM
Daniel de França MTd2 said...: Shouldn't the dates of mutation be in general much older than they are attested? For example, for Z93 or Z94 to be found at a certain in certain individual, it is much more likely that it appeared much more in the past only to become common at a much later date.

It's not like that a certain mutation will spread so soon, unless it yields some evolutionary advantage. So, Z93 could be a couple of thousand years old for a "random walk" spread. I don't have the mathematics means to calculate that, but I've never seen this point addressed.; October 20, 2022 at 5:40 AM
vAsiSTha said...: The formation dates given are ranges at 95% confidence interval. eg. R-Z2124 date as per Yfull is 4900 - 4200 years ago at 95% confidence.
But the Y hg has also been found in a sample dated to 4600years ago. So we can say that Z2124 was formed between 4900-4600 years ago.

Ding et al believe that the ages of R1a subclades are underestimated and that they should be older. If that is the case, then even Yfull formation dates should be pushed back, maybe by 10%.; October 20, 2022 at 5:48 AM
Daniel de França MTd2 said...: @Vasistha but that is not what I asked. It seems to me that these dates refer to the error in estimating the oldest sample. What I want to know is that if, in this dating, they estimate the time it takes for a mutation to happen until it is common enough in the population so that it is not a "fluke".

For all I know, it Z93 could appear, say, 4500BCE, in a very isolated population, but it and only centuries later diffuse. Or, maybe, it doesn't just diffuse by mere chance.; October 21, 2022 at 3:45 AM
vAsiSTha said...: There is an error margin in the estimation, and if the clade formed much earlier than estimate then we should be able to find it somewhere in the ancient records.

Otherwise, the calculations are our best bet.; October 21, 2022 at 9:09 AM
Abhi said...: VAsishta You should send your work to the so called authorities and see if they will change their views. Also could you put this article on a YouTube video?; October 21, 2022 at 7:13 PM
Piyush said...: "is present in 80% of Afghan samples. I think that this increase is a result of the brutal Ghaznavid invasion of Gandhara from Central Asia post 900CE."

Hey Ashish, If I understand this correctly, are you claiming that increased frequency of R1a-Z2124 in Afghanistan is the result of Ghaznavid invasion ? If yes, then by extension, are you also implying that pashtuns, who are rich in R1-Z2124 and are the dominant ethnic group in afghanistan , received major patrilineal gene flow during this ghaznavid invasion ?
Folks were saying that the swat valley medieval ghaznavid era sample was very similar to modern day pashtuns.; October 26, 2022 at 12:44 PM
vAsiSTha said...: "Hey Ashish, If I understand this correctly, are you claiming that increased frequency of R1a-Z2124 in Afghanistan is the result of Ghaznavid invasion ? If yes, then by extension, are you also implying that pashtuns, who are rich in R1-Z2124 and are the dominant ethnic group in afghanistan , received major patrilineal gene flow during this ghaznavid invasion ?
Folks were saying that the swat valley medieval ghaznavid era sample was very similar to modern day pashtuns."

There's a caveat, the 47 Afghan samples on Yfull are quite less in number. Maybe many of those individuals are related as well. So the 80% Z2124 on Yfull may actually not be accurate for Afg as a whole.

Yes, swat valley medieval samples do look similar to north_afg_pashtun on G25, but qpAdm is needed to validate that. What is certain is that there is increased steppe autosomal ancestry and R1a frequency after the Buddhist period, as seen in the medieval period samples in Pakistan.

R1a Frequency in Swat

IA - 2/44 R1a
Historical - 2/9 R1a
Medieval - 3/7 R1a

But maybe I'm not so off the mark. Underhill et al 2014 say
"R1a-Z2125 (Figure 3c) occurs at highest frequencies in Kyrgyzstan and in Afghan Pashtuns (>40%)."

Z2125 is downstream of Z2124, and is found at highest freq in Afghanistan as per Underhill, as is shown by Yfull data as well.; October 26, 2022 at 2:47 PM
vAsiSTha said...: Update: Have added Yfull data for various European countries to the dashboard.

For Kazakhstan, Uzbekistan, Tajikistan, Kyrgyzstan and Turkmenistan the Yfull samples are less in number. So I have also added 166 total samples from these countries from the following 2022 paper.

"Ancient Components and Recent Expansion in the Eurasian Heartland: Insights into the Revised Phylogeny of Y-Chromosomes from Central Asia" - Zhabagin et al. 2022 https://doi.org/10.3390/genes13101776; November 6, 2022 at 6:00 PM
Rig Vedic Clans said...: That guy is wrong.; February 16, 2023 at 1:23 AM
Unknown said...: Hi Vasishtha

So based on your modeling, you found that 1 or very few R1a1 people came from outside which resulted in 20% R1a1 presence in India.

What do you think of this paper? They are suggesting R1a1 possibly has founding behavior, I believe they are suggesting R1a1 roots are in India. Please comment.
https://www.nature.com/articles/jhg20082?fbclid=IwAR1ANswvT3Mv0qz2zNgJAMR3yF8cLwqBKE694jhtpZ2FMRC39lEZZBZdV-8; February 24, 2023 at 10:30 AM
Xax said...: The entire ludicrous idea that R1a originates from Siberia or Northern Eurasia is completely without basis or merit, the claim that R1a originates from Siberia or Northern Eurasia COMES FROM ONE SINGLE INDIVIDUAL SAMPLE AND NOT MANY SAMPLES!! This is a case of cherry picking and pure obfuscation, and absence of evidence does not equal evidence of absence. IN REALITY IT SHOULD BE India-Southern Asia-Iran-------------------->>>>Central Asia------------->>>>Siberia and Northern Asia---------------------------->Then Europe. Haplogroups aren't the be all and end all for genetics and actually for individuals is a poor measurer of genetic estimation, however the genetic ancestral clade to Haplogroup R is Haplogroup P and P is an Indian South Asian origin genetic haplogroup and South East Asian so it's irrefutable that R had to have based on that reality alone originated from a South Asian-Indian and Iranian location origin. Secondly the OP is completely ignorant of the reality and fact not aside from Haplogroup "P" the ancestral genetic siblings of Haplogroup "R" such as Haplogroups Q, M, S, N are found oldest in either Indian South Asian origin population groups or had origin within South Asian populations in the distant past or found in geographic adjacent regions of India-Southern Asia region; for example Australasians-Aboriginals are the only people that today carry Y-Haplogroups "M" and "S" and not anyone else (definitely not Northern Eurasians), Haplogroups "O" and "N" which are the other genetic paternal lineages closest to "R" originated and migrated-split out of North-East India-South East Asia and it's accepted as fact that "O" and "N" originated from mainland South East Asia-Eastern Himilayas which is near historic South Asian "Onge" admixed populations or came about from Onge admixed population groups (accepting the premise that "Onge" represents ancestral SA HG populations which they certainly didn't) so it's pretty clear based on simple evidence that the R1 and "R" had to have Indian South Asian-Iranian origin source.

Thirdly the entire premise of "R" originating from a Northern Eurasian location is based on cherry picking evidence and convenience sampling and absence of evidence does not equal evidence of absence. Because those locations had the privilege and finances and right conditions to uncover bodies and samples (single digit ones) researchers can cherry pick and claim X or Y genetic lineage originated from this or that location then ignore all contrary evidence which is what the OP is doing. Northern Eurasia and Central Asia today and in the past were barren cold wastelands incapable of hosting or housing large scale human populations in the past based on climatic conditions alone and still can't today specially to generate genetic diversity to originate genetic haplogroups. So, the idea that a major genetic lineage like "R" could originate from such desolate low population count regions such as Northern Eurasia especially right after the ice age is preposterous and unbelievable. So more non-sensical in-diot coolie non-sense and idiocy from the original OP anything other than a South to North migration is unbelievable.; April 9, 2023 at 12:27 AM
Dr. Ratnesh Shrivastava said...: Look at the Time to most recent common ancestor for R-M207 (Arunkumar et al), R-M198 (Mirabal et al), R-M17 (Swarkar Sharma et al, R-M780 (Underhill) and R-Z2125 (Underhill). The ratio between times for India versus Central Asia is 1.5 i.e Indian subclades are 1.5 times more ancient - i.e. they are pre-neolithic. R or R1 came to India during the LGM according to me. Wells et al and Mahal also said that it came in the pre-neolithic period. The Aryans just did not come in 1500 BCE, they did not even come during the neolithic they came to India as a result of the very bad climatic conditions during the LGM to India because it was the ony place with great climate even during LGM.; August 25, 2025 at 7:23 AM

The Archaeogenetics blog