
Saturday, July 2, 2016

4Mix Ancients for PuntK12 Calculator

Overview

4Mix is a nifty supplementary tool run on GEDMatch calculator or ADMIXTURE outputs to establish the genetic distance and ancestral proportions for a given set of population combinations. Originally conceived by "DESEUK1" (a Eurogenes Ancestry Project participant), it has been implemented numerous times across the wider genetic genealogy community.
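To picture the kind of search such a tool performs, here is a minimal sketch in Python. It is not DESEUK1's R script: the function names, the 5% step size, the use of Euclidean distance and the four-component example profiles are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt

def mixture_distance(target, sources, props):
    """Euclidean distance between the target and a props-weighted blend of sources."""
    blend = [sum(p / 100 * src[k] for p, src in zip(props, sources))
             for k in range(len(target))]
    return sqrt(sum((t - b) ** 2 for t, b in zip(target, blend)))

def best_four_way_mix(target, sources, step=5):
    """Brute-force search over proportions summing to 100 (step must divide 100)."""
    assert len(sources) == 4
    grid = range(0, 101, step)
    best = None
    for a, b, c in product(grid, repeat=3):
        d = 100 - a - b - c          # the fourth share is whatever remains
        if d < 0:
            continue
        dist = mixture_distance(target, sources, (a, b, c, d))
        if best is None or dist < best[0]:
            best = (dist, (a, b, c, d))
    return best

# Hypothetical component profiles (percent), invented for this example only.
sources = [
    [80.0, 10.0, 5.0, 5.0],
    [10.0, 70.0, 10.0, 10.0],
    [5.0, 15.0, 70.0, 10.0],
    [5.0, 5.0, 15.0, 75.0],
]
# A target built as 50% / 25% / 25% / 0% of the four sources above.
target = [43.75, 26.25, 22.5, 7.5]

distance, proportions = best_four_way_mix(target, sources)
print(proportions, round(distance, 4))
```

A coarser step keeps the search fast; a 1% step examines many more combinations and takes correspondingly longer.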

Following Lazaridis et al. 2016's recent "The genetic structure of the world's first farmers" [Link], crucial aDNA from the Near East has been published and put to use by citizen scientists.

This brief entry provides users with an immediate means of assessing their ancestral proportions with the new releases through the PuntDNAL K12 calculator.

The R script, an example target file, the population source data and ReadMes (DESEUK1's original and my own contribution outlining the "sink" version's procedure) can be found in the link below:



Purpose of the Package

This modification was simply designed to give the wider genetic genealogy community an easy and informal means of manipulating this recent data to explore ethnogenesis or personal ancestries at their own discretion. This is not a formal assessment of the above.

Limitations

Those intending to use this 4Mix package must be aware of the following:

1) The Iran_N, Iran_ChL and Levant_N samples here are GEDMatch contributions by genome bloggers "Kurd" and "Srkz". These currently number one, two and one respectively.

2) The utilisation of these samples as references is a short-term convenience and should not be considered equivalent to ADMIXTURE runs that include these samples. The methodology described above opens up the potential for Davidski's "Calculator Effect" to manifest.

3) Due to the continued absence of Ancestral South Indian (ASI) aDNA, the Paniya were considered a "last resort" surrogate to address the ancestral proportions South/South-Central Asian samples would generate. Furthermore, additional modern reference populations (i.e. Yoruba, Nganasan) were used to furnish other worldwide aDNA deficiencies. These populations were chosen based on their peak modal status in the K's determined by the PuntDNAL K12 calculator.

Contributions

A very special thank you to the users "jesus" and "khanabadoshi" from Anthrogenica for their guidance and assistance in modifying the package for your use. Another extended thank you to the user "surbakhunWessste" (also from Anthrogenica) for outlining the "sink" procedure here.

Thursday, August 6, 2015

Steppe Ancestry Estimations in West, Central & South Asia (Ancestral Proportions Method) [Original Work]

Disclaimer
This is largely a re-post, albeit with additional explanations, from a recent ADMIXTURE autosomal run (Eurasia K20) performed at Anthrogenica by the user Kurd. Full technical information and the original files may be found in his original thread. Full acknowledgement is provided to him for the great work. Unless stated otherwise, assume the contents refer to the Eurasia K20 run. This entry may be updated at any time to include further investigations based on future runs. Finally, this entry assumes the mainstream Pontic-Caspian theory for the genesis of the Indo-Europeans to be fully accurate.

Preamble
This entry/repost contains a "quick and dirty" method for a preliminary attempt at deriving Sintashta admixture levels in West, Central and South Asians based on the Eurasia K20 scores. Given the different admixture histories elsewhere in Eurasia, this probably won't be very informative for users with ancestral backgrounds outside the lands between Kurdistan and the Indo-Gangetic plains. This is especially the case with modern Europeans, who share the same core components with Sintashta, while also deriving their own Indo-European ancestries from different archaeological cultures and time periods.

Establishing the Context
According to this Eurasia K20 run, Sintashta are approximately 62% Yamnaya, 22% EEF, 10% European and 3% SHG_WHG. Sintashta, at present, appear to be the best proxy for the Indo-Iranians that arrived in West and South Asia. The above four components define the majority (94%) of Sintashta's autosomal profile here.

As discussed elsewhere in Anthrogenica (kudos to user Sein for pointing this out), Sintashta should be considered better surrogates for the Andronovo-related waves which reached West, Central and South Asia than the actual Andronovo samples derived from Allentoft et al. 2015. This is due to the Andronovo samples being derived from the extreme northeast of the archaeological horizon (above the Altai, near Afanasievo). Their position opens up the possibility for extraneous admixture from other steppe groups (including early speakers of Tocharian through Afanasievo?).

The user Kurd has previously demonstrated that recent steppe-related admixture may be segregated from other components. While undertaking this exercise, it also looks like Kurd has done an excellent job addressing the "teal" component that defined up to half of Samara Yamnaya and a big chunk of Sintashta. Kurd's K20 is, in my view, the most effective attempt thus far at separating the complicated autosomal overlap in West and Central Eurasia.

Introducing the Ancestral Proportions Method (APM)
At present, the genetic landscape in West, Central and South Asia presents as a triple conundrum:

  1. There is, to date (with the exception of the poor-quality Barcin Neolithic sample from Turkey), no interpretable ancient DNA (aDNA) from any of these regions at any point in this broad area's history. This is perhaps the greatest obstacle at present.
  2. Autosomal and uniparental marker data from across the region are either inconsistent in sampling strategy or outdated, preventing a knowledge-based approach towards interpreting results.
  3. Archaeological evidence is inconsistent across the region; some cultures are well-studied, whereas others have been left to speculation or cannot be readily assigned to any particular prehistoric group.

The APM is, in principle, unconcerned with these issues. Instead, it relies on objective data from a single ancient population to discern the numerical degree of overlap with modern populations.

Whether or not Iranians, Punjabis or Nepalis derive the bulk of their ancestry from unrelated group X or Y is beside the point. The sole purpose of the APM is, therefore, to establish whether or not ancient population Z has left any genetic imprint on modern populations A-K, and if so, to what extent. As such, the methodology described here is completely different in that it is asymmetrical: it models one-way gene flow across space and time from one ancestral (extinct) population to numerous extant populations. APM or derivative approaches should be considered supplementary to, rather than directly competing with, symmetrical modelling techniques such as f3 statistics.

The APM was specifically designed to answer the question: to what degree did Sintashta-related populations contribute to the modern groups of West, Central and South Asia? This simple inquiry has a tendency to attract considerable debate and wildly differing estimates in online discussion boards. Today, using the APM and recently generated data from the Eurasia K20 run, I hope to provide one set of estimations completely independent of extraneous modelling factors.

This approach is entirely reliant on high component specificity (e.g. minimal overlap or bleed-over from one component to another). This particular parameter is not within my control in this instance. As such, the outputs from APM here should be considered cautionary preliminary estimates at best, given the potential for ADMIXTURE-related shortcomings in the absence of relevant aDNA. I anticipate this approach will be much more effective at gleaning admixture extents once aDNA from West and Central Asia dating <2200 B.C. are retrieved.

The APM Approach
To contrast against the ADMIXTURE Sintashta scores, two different approaches are utilised together:

1) Direct Overlap (DO): In summary, for each component, the maximum overlap between a given population average and Sintashta's score is calculated. This is done individually across all four components (Yamnaya, EEF, European, SHG_WHG) with the outputs added. See image below for schematic (conceptual breakdown of how the DO principle works between hypothetical samples 1 and 2, with Components a-d representing the distinct components).

Schematic diagram showing the principle behind Direct Overlap calculation


2) Component Proportions (CP): A single dominating component (frequency > 50%) is considered modal for the ancestral population of choice, with the other values considered as a fraction of this in modern populations. Given the Yamnaya component makes up almost two-thirds of Sintashta, the ratio between a population's and Sintashta's Yamnaya score is calculated and re-applied to the rest.
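The two calculations can be sketched in a few lines of Python. This is my reading of the descriptions above rather than the exact spreadsheet formulae: DO is taken as the per-component minimum summed over the four components, and CP as the Yamnaya ratio multiplied through the sum of Sintashta's four quoted scores. The function names and the example population are hypothetical.

```python
# Sintashta's Eurasia K20 scores as quoted in this entry (percent).
SINTASHTA = {"Yamnaya": 62.0, "EEF": 22.0, "European": 10.0, "SHG_WHG": 3.0}

def direct_overlap(pop, ref=SINTASHTA):
    """DO: per component, the share present in both profiles, then summed."""
    # min(a, b) is equivalent to (a + b - |a - b|) / 2
    return sum(min(pop.get(c, 0.0), ref[c]) for c in ref)

def component_proportions(pop, ref=SINTASHTA, modal="Yamnaya"):
    """CP (assumed reading): scale the reference profile by the modal-component ratio."""
    ratio = pop.get(modal, 0.0) / ref[modal]
    return ratio * sum(ref.values())

# Hypothetical population average, invented purely for illustration.
example = {"Yamnaya": 15.5, "EEF": 12.0, "European": 12.0, "SHG_WHG": 0.5}

print("DO:", direct_overlap(example))         # European is capped at Sintashta's 10
print("CP:", component_proportions(example))  # a quarter of Sintashta's Yamnaya score
```

Note how the example's 12% European only contributes 10 to DO, since the overlap with Sintashta cannot exceed Sintashta's own score.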

There are, however, problems with either approach:

1) DO becomes increasingly inflated the more West Eurasian a population is. For example, several of the Iranian or Kurdish users at Anthrogenica had component scores greater than what is found in Sintashta (e.g. European being 12% in one sample, when it's 10% in Sintashta). This biases the results for Iranians and Kurds greatly, even with the absolute value adjustments built into the formula (it is highly improbable that an Iranian with 10% European derived all of it from Sintashta).

2) CP is more accurate given the Yamnaya component appears highly steppe-oriented in Eurasia K20 and can therefore serve as a direct admixture marker. However, some of the South Asians score very low in, or have almost none of, the other key components found in Sintashta (e.g. EEF). Due to this, CP doesn't fully account for the "missing variation" in South Asians, biasing the results slightly in their favour.

One convenient workaround is to take an average of both scores. However, given CP is intuitively more accurate due to the reasonable specificity of the Yamnaya component, a weighted average biased in favour of CP by a ratio of 3:1 was undertaken. The ratio choice in this variant of the APM is arbitrary here. Other variants (2:1, 4:1) would not result in radically different outcomes.
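A short Python sketch shows how little the choice of ratio moves the result. The function name and the DO/CP values below are hypothetical; only the idea of weighting CP more heavily than DO comes from the text.

```python
def weighted_apm(do, cp, cp_weight=3.0, do_weight=1.0):
    """Weighted mean of the DO and CP estimates (both in percent)."""
    return (cp_weight * cp + do_weight * do) / (cp_weight + do_weight)

# Hypothetical DO and CP estimates for a single population.
do_score, cp_score = 20.0, 16.0

for w in (2, 3, 4):
    print(f"{w}:1 -> {weighted_apm(do_score, cp_score, cp_weight=w):.2f}%")
```

With a four-point gap between DO and CP, the three weightings land within about half a percentage point of one another.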

Results
Full results from up to 24 populations are shown in the Data Sink (interactive chart below). In summary, Pamiri Tajiks are the most Sintashta-derived at 31.9%. North Caucasian (Ossetian) and Central Asian ethnic groups (Pashtuns, Uzbek, other Tajiks, Turkmen) follow at 19-22%. Various other ethnic groups across West, Central and South Asia come next. The lowest-scoring population sampled here is the Makrani at 9.2%.

Internal Validation
The output (Data Sink) readily demonstrates strong correlation between DO and CP scores per population (e.g. Tajik Pamiris at 34.2% & 30.8%, Nepalis at 15% & 14.5% and Makrani at 10.2% & 8.7% respectively account for the top, middle and bottom pairs). The only marked deviation between the DO and CP scores was noted in West Asian populations (Armenians, Kurds, Iranians), as mentioned previously. Thus, empirical confirmation of correlation (e.g. Spearman's rank order) is unwarranted here.

Another means of confirming the validity of APM is to confirm Andronovo is a descendant of Sintashta. As the Andronovo archaeological horizon originates from Sintashta directly, one would expect very high (>90%) Sintashta-derived ancestry among them.

This appears to be the case. Compared against Sintashta, Andronovo exhibits DO = 83.9%, CP = 97.3%, an average of 90.6% and a weighted average of 92.8%.
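The arithmetic behind these figures can be checked with a short Python sketch, using only numbers quoted in this entry. The helper name `combine` is hypothetical. When run, the quoted weighted figures (92.8%, 31.9%, 9.2%) are reproduced by weighting CP to DO at 2:1, whereas a 3:1 weighting gives slightly different values (93.95% for Andronovo), so the weight is left as a parameter.

```python
def combine(do, cp, cp_weight):
    """Weighted mean of DO and CP, with CP counted `cp_weight` times."""
    return (cp_weight * cp + do) / (cp_weight + 1)

# (DO, CP, weighted figure) exactly as quoted in this entry, in percent.
quoted = {
    "Andronovo": (83.9, 97.3, 92.8),
    "Tajik Pamiri": (34.2, 30.8, 31.9),
    "Makrani": (10.2, 8.7, 9.2),
}

for name, (do, cp, stated) in quoted.items():
    print(name, "stated:", stated,
          "| 2:1:", round(combine(do, cp, 2), 2),
          "| 3:1:", round(combine(do, cp, 3), 2))

# The simple (1:1) average for Andronovo.
print("Andronovo simple average:", round(combine(83.9, 97.3, 1), 1))
```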

Summarised, these two results (dataset-wide correlation, ancestral-immediate successor high overlap) validate the outcomes of the APM.

Closing Thoughts
The results featured in this entry are in line with broad uniparental marker data and previously published IBD results (unfortunately removed from sources), and are largely (though not fully) compatible with the degree of archaeological input from Andronovo-derived cultures in Asia. As stated previously, due to the shortcomings outlined earlier, they should not be considered definitive.

Given the CP here is not exclusively associated with Sintashta, I anticipate this technique will be more accurate if future "steppe"/"Yamnaya"/"Yamnaya_related" components are shown to define more of the Sintashta samples. I look forward to extending this method in the near future.

Acknowledgements
Special thanks to the user Kurd from Anthrogenica for making this data available and obliging member inquiries with productive responses, as well as the user Sapporo for generating several of the population averages.

Wednesday, August 6, 2014

Anchored in Armenia: An Exercise in Genetic Relativity [Original Work]


Introduction

Location of the Armenian Highlands in West Asia
As is the case with many groups in the region, the Armenians are, anthropologically speaking, a unique modern ethnicity. Situated in the Armenian Highlands (an expansive area straddling the Zagros and Caucasus ranges) with a settlement history dating back to the Neolithic, the modern Armenian people have maintained a distinct culture both shaped and shielded by the mountainous territory they inhabit. [1] One unique aspect of the Armenian people is their language; Modern Armenian is an Indo-European language belonging to its own branch. There has long been scholarly debate regarding its linguistic exodus from the Proto-Indo-European homeland (commonly accepted by modern linguists as the Pontic-Caspian steppe) [2] through to its historical seat in the South Caucasus. As is evident from the attested Urartian and Hurrian loanwords in later forms of the language, Armenian must have been spoken by the forebears of its current speakers since before 500 B.C. [3] Various genetics enthusiasts (including myself) have on differing occasions cited this as an indication of an aboriginal West Asian genetic layer accompanying the Urartian-Hurrian vocabulary substratum.

Presumably due to the ongoing political instability in West Asia, there has been an unfortunate lack of ancient DNA (aDNA) recovery in the areas adjacent to the Armenian Highlands. Alongside the Armenians, West Asia proper is also home to Anatolian Turks, numerous Kurdish groups, the Assyrians, several Jewish minorities and various ethnic groups within Iran. The inter-relation of all these groups to differing extents has been demonstrated in both published studies [4] and open-source projects. [5,6]

Mount Ararat - A symbolic item in Armenian culture
Although they have most likely experienced their own demic events in prehistoric times, the insular nature of the Armenians relative to their neighbours allows them to be used as a stand-in for the aDNA we currently lack in this part of the world. In this blog entry, the Armenians will therefore be considered as a surrogate for autochthonous West Asian ancestry. They will be treated as a primary donor population (PDP) for several other West Asian groups, in an attempt to flesh out the degree of mutual shared ancestry, as well as the directions of added affinities beyond the region. This is by no means an authoritative attempt to promote a particular image of the West Asian genetic landscape, but an attempt instead to provoke discussion and explore the underlying structure of the region in a manner that should hopefully yield fruitful results despite the glaring absence of aDNA.


Working Hypotheses

1. Given the demonstrated similarity in autosomal DNA profiles (here and here), modern Armenians will serve as a reasonable PDP for all tested populations.

2. Furthermore, the genetic difference (GD) will likely be dictated by geographical proximity to the Armenians, or a (lack of) history of admixture with them.

3. Finally, the other donor populations will be anticipated either by virtue of geography or language.


Method

The Dodecad K12b Oracle was used to undertake this small project (please visit link for technical information). When executed through R, the program was set to Mixed Mode and fixed to 500 results for every iteration per population. The command entered therefore remained the same each time:

DodecadOracle("WestAsianPopulation",mixedmode=T,k=500)

Samples consist of nine location-specific populations (Iranians, Kurds_Y, Azerbaijan_Jews, Iraq_Jews, Iran_Jews, Turks, Turks_Aydin*, Turks_Kayseri*, Turks_Istanbul*) and four Dodecad participant averages (Iranian_D, Kurd_D, Assyrian_D, Turkish_D). A total of thirteen populations were therefore included.

From the output, only those combinations expressing an Armenian population as a PDP were selected. In this context, the Armenians will be considered a PDP if their "ancestral" percentage exceeds 50%. A maximum of ten were collected per population. In the event the number of combinations exceeded this, the subsequent combination lists are terminated with an ellipsis.

* Although not included in the original Dodecad K12b Oracle dataset, Dienekes has conveniently shared the population averages for these samples here. These were manually inserted into the command.
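The selection rule described in this section can be sketched in Python. This is not Dienekes' Oracle code: the two-way fit below is a generic least-squares mixture of two reference profiles scored by Euclidean distance, and the function names, the three-component profiles and the rule of matching population names starting with "Armenian" are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def two_way_fit(target, a, b):
    """Best share p of `a` (remainder `b`) and the resulting Euclidean distance."""
    diff = [x - y for x, y in zip(a, b)]
    denom = sum(d * d for d in diff)
    p = 0.5 if denom == 0 else sum((t - y) * d for t, y, d in zip(target, b, diff)) / denom
    p = min(1.0, max(0.0, p))  # keep the share between 0 and 1
    blend = [p * x + (1 - p) * y for x, y in zip(a, b)]
    return p, sqrt(sum((t - m) ** 2 for t, m in zip(target, blend)))

def armenian_pdp_rows(target, refs, top_k=500, keep=10):
    """Take the top_k closest pairs, keep up to `keep` where an Armenian group exceeds 50%."""
    rows = []
    for (name_a, a), (name_b, b) in combinations(refs.items(), 2):
        p, dist = two_way_fit(target, a, b)
        rows.append((dist, name_a, 100 * p, name_b, 100 * (1 - p)))
    rows.sort()                      # smallest distance first
    hits = [r for r in rows[:top_k]
            if (r[1].startswith("Armenian") and r[2] > 50)
            or (r[3].startswith("Armenian") and r[4] > 50)]
    return hits[:keep]

# Hypothetical three-component averages (percent), invented for this example.
refs = {
    "Armenians": [60.0, 30.0, 10.0],
    "Balochi": [20.0, 20.0, 60.0],
    "Saudis": [30.0, 60.0, 10.0],
}
target = [48.0, 27.0, 25.0]  # built as 70% Armenians + 30% Balochi

hits = armenian_pdp_rows(target, refs)
for dist, pop1, pct1, pop2, pct2 in hits:
    print(f"{pct1:.1f}% {pop1} + {pct2:.1f}% {pop2} @ {dist:.2f}")
```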


Results

Iranian and Kurdish Oracle results
Unsurprisingly, the Iranians and Kurds all display similar results, specifically the adoption of either Makrani or Balochi as the secondary donors when Armenians are fixed as a PDP. The proportions are also comparable between all. The Iranians appear to fit the Armenian + Balochi/Makrani combination slightly better than the Kurds (GD=4.04-5.16 vs. 5.03-6.65 to 2 d.p. respectively). It is also worth observing that both Iranians and Kurds, irrespective of sampling strategy (location-specific or Dodecad average), do not have Mixed Mode results which exceed ten.

Assyrian and select Near-Eastern Jewish Oracle results
The Assyrians are one of the groups of interest, given the demonstrated autosomal similarity between them and Armenians (here). As anticipated, their Mixed Mode results well exceed ten and the best fits (GD=1.66-1.82 to 2 d.p.) are all, coincidentally, with the Near-Eastern Jewish groups studied here. Subsequent matches include additional populations (e.g. Saudi, Bedouin, Syrian) where the GD remains relatively small compared to the Iranian and Kurdish values (>3.15 to 2 d.p.).

The Near-Eastern Jewish groups largely mirror the Assyrian results, although some key differences should be outlined:

  • The Azerbaijani Jews have a GD similar to the Assyrians in range, setting them apart from the Iraqi and Iranian Jews. This seems to fit geography. However, if the association was strictly geographical, one would expect the Assyrians to lie between the Azerbaijani Jews and the Iraqi and Iranian Jews. This may be genetic evidence of additional and direct ancestry between Armenians and Assyrians at some (or various) point(s) after the Near-Eastern Jewish groups had formalised their identities.
  • Saudis appear as a secondary donor population in all groups. Interestingly, they appear to have an inverse relationship with geographic proximity to the Armenian Highlands; Iraqi, Iranian and Azerbaijani Jews are 20.4%, 16.1% and 7.8% "Saudi" respectively. The Assyrians too fall on this cline despite the point raised above.

Anatolian Turkish Oracle results
Finally, the Anatolian Turks provide us with another set of interesting values and pairs:

  • Mixed Mode results from Western Turkey (Aydin, Istanbul) largely exhibit a combination of Armenian with various European ethnic groups or nationalities, which can be predominantly ascribed to geography. Please note the comparatively large GD among the Aydin average (>9.93 to 2 d.p.), which contrasts with Istanbul. I suspect the cosmopolitan nature of Istanbul has resulted in an artefactual lowering of the GD, given Anatolian Turks from across the country have moved there for employment purposes. [7]
  • In contrast, the samples listed as "Turks" in Dodecad K12b (from the Behar et al. dataset, located in Central-South Turkey) model well as a combination of Armenian with either the Chuvash, Nogay, Uzbek or Uyghur. European secondary donors do make an appearance once more. Please also note their GD is the smallest out of the Turkish averages investigated (4.20 to 2 d.p.).
  • The Kayseri average (Central Turkey) yielded no results matching the criteria outlined in "Method". However, the Assyrians instead made a frequent appearance as primary donors from GD=6.17 onwards. Given the genetic affinity between Assyrians and Armenians (refer above), and the consistency displayed by the Armenians as a PDP for other Turkish averages, this result can be considered anomalous. A close inspection of the Dodecad K12b proportions reveals the Kayseri Turks were on average approximately 1.5% more Southwest Asian than all other Turkish populations, explaining why Assyrians took preferential placing over Armenians as the PDP. The cause of this slight increase is unknown at present.
  • The Turkish_D average best resembled that of Istanbul, albeit with slightly more Armenian and less European proportions. This would suggest that, overall, the Dodecad Turkish participants map somewhere just east of Istanbul despite the presumably diverse backgrounds. 
  • Finally, all averages produced Mixed Mode results which exceeded ten in number.

IBD Segment Indications

To corroborate the findings of this investigation with additional genetic data, I refer to the Dodecad Project's fastIBD analysis of Italy/Balkans/Anatolia and fastIBD analysis of several Jewish and non-Jewish groups. As the analyses do not completely encompass those groups studied here, the results cannot be accepted wholesale. However, there does appear to be a broad agreement with some of the results in this investigation. For example, the Armenians and Assyrians have a demonstrated level of "warmth" to one another beyond background sharing.


Further Work

This investigation would have benefited from Azeri Turkish samples from the Republic of Azerbaijan. Additionally, a better breakdown of Kurdish, Iranian and Assyrian samples, akin to the site-specific sampling seen here in the Anatolian Turks, would have been ideal. Finally, as stated above, this investigation would have benefited from the inclusion of IBD segment analysis specific to the studied groups. Should time permit and the desired samples be made available in the future, this would be a natural line of inquiry to further what has been explored here.


Conclusion

Addressing the three hypotheses stated at the beginning in order:

1. Armenians certainly have behaved as a reasonable proxy for an autochthonous West Asian PDP in most of the populations tested (sole exception being the Kayseri Turks although this appears to be an anomalous response to slightly more Southwest Asian scores). The scores vary depending on the presence of the secondary donors, but Assyrians and Jewish populations from Azerbaijan, Iran and Iraq appear to have the largest proportion of this (occasionally surpassing 90%). All Iranians and Kurds, on the other hand, scored the least overall (approximately 65-75%). The Turkish range lies in-between these two.

2. Unfortunately, this isn't clear. The lack of regional results for Kurds and Iranians, together with a lack of samples specifically from Eastern Turkey, prevents any conclusion being reached on this point. The Near-Eastern Jewish populations studied here certainly do form a cline of Armenian "admixture" that is fully in line with geography. Furthermore, the large GD observed in Aydin Turks does support this idea, leading me to cautiously propose geography does indeed play a role. The second point also provides us with a partial answer, as the Assyrians demonstrate more of this than one would expect given their geographical placement based on GD, as well as fastIBD evidence from elsewhere.

3. With the exception of the Assyrians and Near-Eastern Jewish groups, the secondary donors overwhelmingly matched my expectations regarding their placement for whichever group was studied (e.g. Iranians and Kurds towards South-Central Asia, Turks towards either Europe or Central Asia proper).

Over the coming years, with the availability of more data, we should hopefully move away from the population averages that have been used by various open-source projects. It has been empirically demonstrated here that regional results will differ significantly from nationwide averages (e.g. Aydin Turks vs. Turkish_D).

This also holds true on an individual basis; the best Oracle match for one Iranian via the described methodology was 56.4% Armenians_15_Y + 43.6% Tajiks_Y (GD=5.44 to 2 d.p.), differing significantly from both the Iranian and Kurdish averages.

I suspect the gentlemen running the numerous open-source projects are aware of this caveat and are, justifiably so in my opinion, making do with currently available data.

In closing, this investigation has also determined that, on the basis of the presumption of an Armenian-like autochthonous West Asian substrate, the studied populations as a whole have an apparent degree of inter-relatedness by virtue of this common South Caucasian autosomal heritage, albeit with the presence of highly significant affinities to elsewhere in Eurasia, be it population-wide, regional or even individual.


Speculations

The first topic concerns the Iranians and Kurds; why were their average secondary donors always the Balochi and Makrani, rather than more northern groups, such as the Tajiks? I suspect, when applied to population averages, the Oracle program effectively minimises intra-population variation to the point where only the broadest of affinities are indicated. In the case of Iranians, the secondary donor would therefore be one with genetic features that tend to emphasise the difference between Armenians and Iranians (e.g. additional South Asian and Gedrosian admixture). A similar conclusion can be reached with respect to the Turks.

Another interesting point is the demonstrated close relationship between the Assyrians and various Near-Eastern Jewish groups. This has been speculated upon in various discussion forums in the past. More precise tools will be required to elucidate whether these populations share legitimate ancestry with one another, or whether the affinity is happenstance, instead reflecting the mixture of similar Near-Eastern groups with (again) similar Caucasus-derived groups at some point in history.

[Addendum I, 07/08/2014]: For a continuation on this with a fellow genome blogger, please read the Comments below.


Acknowledgements

Full credit for both the generation of raw population data and the Oracle program go to Dienekes Pontikos (Dodecad Ancestry Project).

Map of Armenian Highlands from Wikipedia.org. Photo of Mount Ararat courtesy of NoahsArkSearch.com.

Finally, I must refer all visitors interested in understanding the genetic makeup of the Armenian people to the FTDNA Armenian DNA Project. For a more interactive learning experience, two of the administrators (Messrs. Simonian and Hrechdakian) recently delivered a lecture on this topic, garnishing it with a deeper description of anthropological and geographical aspects as described here.


References

1. Samuelian TJ. Armenian Origins: An Overview of Ancient and Modern Sources and Theories. [Last Accessed 3/08/2014]: http://www.arak29.am/PDF_PPT/origins_2004.pdf

2. Clackson J. Indo-European Linguistics: An Introduction. Cambridge Textbooks in Linguistics [Last Accessed 4/08/2014]: http://caio.ueberalles.net/Indo-European-Linguistics-Introduction/Indo-European%20Linguistics%20-%20James%20Clackson.pdf

3. Greppin JAC. The Urartian Substratum in Armenian. [Last Accessed 4/08/2014]: http://science.org.ge/2-2/Grepin.pdf

4. Grugni V, Battaglia V, Hooshiar Kashani B, Parolo S, Al-Zahery N et al. Ancient migratory events in the Middle East: new clues from the Y-chromosome variation of modern Iranians. PLoS One. 2012;7(7):e41252.

5. Dodecad Ancestry Project: ChromoPainter/fineSTRUCTURE Analysis of Balkans/West Asia [Last Accessed 4/08/2014]: http://dodecad.blogspot.com/2012/02/chromopainterfinestructure-analysis-of.html

6. Eurogenes Genetic Ancestry Project: Updated Eurogenes K13 and K15 population averages [Last Accessed 4/08/2014]: http://bga101.blogspot.com/2014/03/updated-eurogenes-k13-and-k15.html

7. Filiztekin A, Gokhan A. The Determinants of Internal Migration In Turkey. [Last Accessed 05/08/2014]: http://research.sabanciuniv.edu/11336/1/749.pdf

Sunday, August 19, 2012

Introducing the ACD Tool [Original Work]

It is with satisfaction I announce the release of my first ever population genetics spreadsheet for fellow researchers. The Ancestral Component Dissection (ACD) Tool is a piece of freeware I have developed to give those with a similar knack for fiddling with ADMIXTURE, Y-SNP and mtDNA frequency data a better means to flesh out inter-population differences.


ACDTool (v1.0)
How Does The ACD Tool Work?

The ACD Tool relies on the frequencies of "ancestral components", a general catch-all term for uniparental markers (Y-SNP's, mtDNA) and Autosomal DNA (auDNA). These form the mainstay of much of the work that has been done in population genetics for the past few decades. The advent of "genome blogger" projects has brought the immediacy of these techniques to those who have tested with personal genetics companies, such as Family Tree DNA (FTDNA) and 23andMe. The ACD Tool should therefore be considered a supplementary item by those interested in these results, as well as data procured from current literature.

The level of commonality that occurs between many populations and ethnic groups poses a problem for those interested in investigating what differences arise between them.

To solve this, the ACD Tool works by removing mutually shared component frequencies between sample averages within a region. The idea is to lessen the amount of regional similarity and intentionally exaggerate those differences that exist between neighbours.

This is achieved by removing congruent component values across all populations (using the lowest value as a benchmark), leaving only the differences behind.
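The subtraction step can be illustrated with a minimal Python sketch. The ACD Tool itself is a macro-enabled spreadsheet, so this is only a restatement of the idea above; the function name and the three population averages are hypothetical, and no rescaling is applied after the subtraction.

```python
def acd(averages):
    """Subtract, per component, the lowest value found across all populations."""
    components = sorted({c for scores in averages.values() for c in scores})
    # The benchmark for each component is its minimum across the populations.
    floor = {c: min(scores.get(c, 0.0) for scores in averages.values())
             for c in components}
    return {pop: {c: scores.get(c, 0.0) - floor[c] for c in components}
            for pop, scores in averages.items()}

# Hypothetical component averages (percent), invented for this example.
averages = {
    "PopA": {"X": 50.0, "Y": 30.0, "Z": 20.0},
    "PopB": {"X": 45.0, "Y": 40.0, "Z": 15.0},
    "PopC": {"X": 60.0, "Y": 25.0, "Z": 15.0},
}

dissected = acd(averages)
for pop, scores in dissected.items():
    print(pop, scores)
```

Every population shares at least 45 of X, 25 of Y and 15 of Z, so those amounts are removed and only the excess over the regional minimum remains.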


What Experiments Are Ideal?

As the ACD Tool is intended for finer inter-population analysis, it is best applied in a regional context. It serves the purpose of better revealing genetic differences which may account for linguistic or micro-regional trends.

Example #1: Northeast Europeans (Dodecad)

Once the Polish, Russian and Finnish Dodecad cohort averages were run through the ACD Tool, I simply used Excel to create the charts. The "Before-After" feature is used to highlight that the tool has achieved its desired goal of amplifying the genetic differences between them:


NE European auDNA (Dodecad) through the ACD Tool



Example #2: West Asians (Harappa)
Using the Harappa Ancestry Project this time, I ran the data of Armenians, Assyrians, Kurds and Iranians (mostly from the Harappa cohort) into the ACD Tool once more and presented the differences as above:

W Asian auDNA (Harappa) through the ACD Tool


Example #3: South-Central Asians (Eurogenes)
A final example pits Pathans, Jatts, the Burusho, Balochis and Brahuis against one another:

SC Asian auDNA (Eurogenes) through the ACD Tool



Are There Any Drawbacks?
The efficacy of the ACD Tool depends on the number of populations, cohort size and cohort specificity. As the examples above show, the level of inter-population component sharing may decrease greatly if groups that are from more genetically diverse regions are compared.

In addition, using the ACD Tool on populations that are too different (i.e. Han Chinese and Yoruba) will not work given the genetic overlap through either ADMIXTURE, Y-SNP's or mtDNA is negligible. Of course, this defeats the point of the tool in the first place.

Lastly, the tool requires Macros to be enabled for the instructions to work.


Disclaimer

The ACD Tool is an open-source, free-to-use spreadsheet. Those wishing to modify the spreadsheet for their personal use are welcome to do so. However, anyone modifying the ACD Tool with the intent of redistributing it is kindly asked to contact the creator (myself) beforehand, out of common courtesy.

Please also note the ACD Tool is a first attempt at giving back to the genealogy world I have been a part of for several years. Though functional (as shown above), it is not without bugs. In light of this, I am not responsible for any loss of data that may occur from its use.

Finally, I hope the genealogy world finds some use for this nifty piece of kit.



Acknowledgements

To the Dodecad Ancestry Project, Harappa Ancestry Project and Eurogenes Genetic Ancestry Project (auDNA used in Examples).

Addendum I [20/08/2012]: ACDTool v1.1 replaces v1.0, with Macros smoothed and instructions refined. Eurogenes South-Central Asian example also added.

Tuesday, June 26, 2012

Worldwide Distribution of Dodecad K10a Components [Review]

Numerous ADMIXTURE runs have been completed by the Dodecad Ancestry Project since its inception approximately two years ago. The status of certain components remained tenuous despite subsequent runs, whilst others provided fairly stable values for the bulk of the project's participants.

With the completion of the latest K10a run, I have composed a series of geographically accurate frequency maps with the intention of effectively presenting the trends that can be seen through the raw data.


Method

Data; values from over 130 groups were obtained through the Dodecad K10a Spreadsheet. Only groups with at least 5 participants were considered. Composites of populations were taken where appropriate and denoted with _cmp. Labels shown are otherwise identical to the source. The O_Italian_D group was excluded because no information on its origins was found online.

Mapping; Dodecad participant populations were allocated to national capitals. The exact location of reference populations was obtained where possible (see Citations); however, some allowances were made for those accompanied by scant information. Refer to the Data Sink for the population list, coordinates and commentary made during the mapping process. No numerical data, aside from the values shown for certain populations, are displayed, to minimise clutter and remain faithful to the intention of this entry.

Population depiction; I deemed it necessary to separately consider the genetic structure of Jewish, Indian and expatriate/New World populations and exclude them from the rest of Europe, Asia or Africa. Including Jewish minorities with their gentile compatriots would render the maps uninformative. The complexity of India's demographics, particularly because of the caste system, makes frequency maps an improper choice for revealing inter-group genetic differences. 


Results





















Acknowledgement

The raw values used in this investigation are attributed to Dienekes Pontikos, author of the Dodecad Ancestry Project.


Addendum I [04/07/2012]: Inclusion of All Components Colourised map, shown below:




Citations
http://www.uvm.edu/~rsingle/stat295/F05/papers/Cavalli-Sforza-NRG-2005_Ceph-HGDP-CDP.pdf
http://www.1000genomes.org/about 

Thursday, February 9, 2012

Autosomal variation from Anatolia to the Tarim periphery [Original Work]

The nature of ADMIXTURE as a tool for inferring ancestral components makes it difficult to discern the nature of a shared Autosomal component between several populations. For instance, a given component may originate in one population and be donated to others (e.g. purported African admixture in the Arabian Peninsula), stem from a mutual population (e.g. West Eurasian-specific components in low K=n runs between the Druze and the French Basque) or be the result of genetic drift (e.g. potentially, the peaking of East Asian-specific components in Korea and Japan).

Nevertheless, using results from the latest Dodecad Ancestry Project K12b run (link), I have investigated the component variation across a horizontal axis from Anatolia to the Tarim periphery in West China, with the intention of establishing the nature of the observed components across this area of interest. Raw values can be viewed on the newly-published Vaêdhya Data Sink. Populations are listed in a geographical cline.


One of the most immediate observations is the similarity between Kurdish and Iranian populations, with both expressing similar admixture percentages (deviation per component usually not >1%). This suggests that Kurds and Iranians have common origins, with the former largely maintaining those ancestral signals despite moving further westwards relative to their linguistic cousins in Iran.
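
The "deviation per component" check can be sketched as below. This is a minimal illustration, assuming each population is held as a mapping of component name to percentage; the function names, the two group labels and all of the numbers are placeholders, not the Dodecad K12b values (those are in the Data Sink).

```python
def component_deviations(pop_a, pop_b):
    """Absolute difference per component, in percentage points."""
    components = sorted(set(pop_a) | set(pop_b))
    return {
        c: round(abs(pop_a.get(c, 0.0) - pop_b.get(c, 0.0)), 2)
        for c in components
    }


def exceeding(deviations, threshold=1.0):
    """Components whose deviation is greater than the threshold."""
    return [c for c, d in deviations.items() if d > threshold]


# Placeholder averages for two hypothetical groups (NOT real K12b results).
group_1 = {"Gedrosia": 28.0, "Caucasus": 35.5,
           "Southwest_Asian": 12.0, "North_European": 6.0}
group_2 = {"Gedrosia": 28.4, "Caucasus": 34.9,
           "Southwest_Asian": 13.5, "North_European": 5.5}

deviations = component_deviations(group_1, group_2)
outliers = exceeding(deviations)
print(deviations)
print(outliers)
```

With these placeholder figures only one component differs by more than one percentage point, which is the kind of pattern described above as "usually not >1%".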

Near-congruency between the Assyrians and Armenians is also striking, bar the variations in the North European, Caucasus and Southwest Asian components. It is again tempting to postulate that the two descend, for the most part, from a similar root population, with the aforementioned component differences accompanying the linguistic differences.

If one allocates the Kurds alongside the Iranians, several of the Autosomal components shown here have a distribution that appears to be determined by geography alone:


  • South Asian peaks in Tajiks, who are situated approximately north-northwest of the Indian Subcontinent.
  • Caucasus reaches a maximum in Armenians and adjacent populations.
  • Atlantic Med steadily decreases as one moves further away from Europe.
  • Southeast Asian has an inverse relationship to the above, reaching significant levels only in the Uyghurs.

Other components appear to have more complicated distributions:

  • Interestingly, the East Asian and Siberian components are not too dissimilar in the populations containing them. The elevation of both in populations which speak Turkic/Altaic languages, relative to neighbours speaking other languages, confirms genetic input from the Turkic steppe nomads who expanded from the eastern side of Central Asia, eventually reaching the Iranian plateau and Anatolia. However, it is possible that some of the Siberian and East Asian values are simply the result of prehistoric demic diffusion across Eurasia (suggested by a potential gradient between Kurds/Iranians <-> Tajiks), although this gradient may itself derive from medieval steppe ancestry.
  • Southwest Asian peaks in Assyrians, the only Semitic-speaking population shown in this analysis. This component falls rapidly beyond the Iranian plateau but is found at a background frequency east of Turkmenistan. Whether this is again an artefact of prehistoric demic movements or of more recent migrations (e.g. Silk Road, various Persian empires) is debatable. As with the Siberian and East Asian components, there is an elevation which defies a geographical pattern and agrees with historical accounts: the Tajiks, who descend in part from Persian speakers escaping Iran after the Sassanid collapse, show an elevation relative to the Uzbeks and Uyghurs. The greater frequency in Christian Armenians relative to the predominantly Muslim Kurdish territories and Iran contradicts outright the notion that it was introduced by the Islamic expansion out of the Arabian Peninsula.
  • The Gedrosia component has a bifurcated peak between Iranians and Tajiks, implying an ultimate peak in the region of Pakistan (corroborated by other Dodecad population results, such as the Balochis of Pakistan). However, the Gedrosian frequency drops from a stable 28% across West Iranic-speaking populations to 13-18% in Anatolian Turks, Armenians and Assyrians. It is again impossible to infer whether this is of prehistoric origins (i.e. mutual Neolithic phenomena between the Iranian plateau and South-Central Asia) or more recent (inflated Gedrosian values a function of Median, Persian and Parthian ancestry).
  • The North European component has what appears to be a dual geographic and linguistically-oriented distribution, which may be confounded further by recent interactions between Europe and some of the populations shown here (Anatolian Turks may potentially be the greatest example of this). It is interesting to note that the Assyrians and Armenians show an inverse pattern in the North European and Southwest Asian components despite otherwise appearing identical. The elevated frequency of this component in Central Asia will hopefully be covered in a future entry.

Despite the usefulness of ADMIXTURE in determining approximate ancestral origins of populations and individuals, it is impossible to ascertain the nature of component X between populations A and B; such Autosomal results should ideally be complementary to historical, linguistic, archaeological and even deep paternal and maternal evidence (Y-DNA, mtDNA).

Some of the observations made in this entry have been gleaned with earlier renditions of population data; through the use of deeper penetrating Autosomal techniques (such as IBD), the exact nature of the component variations should hopefully be resolved in the future.


Reference

The raw values used in this investigation are attributed to Dienekes Pontikos, author of the Dodecad Ancestry Project.