Interpreting Y-DNA Test Results
Contents
1. Introduction
2. STR test results
3. SNP test results and haplotrees
4. Latest Analysis Update
5. Further information
1. Introduction
As explained in "Background and Goals", surname studies depend on Y-DNA tests (because Y-DNA, like surnames, is only inherited through the male line). Several companies used to offer Y-DNA tests, but today the Houston-based company FamilyTreeDNA ("FTDNA") has easily the largest Y-DNA database in the world (a critical criterion). They also market a comprehensive range of DNA tests, and so are the most popular company marketing Y-DNA test kits.
FTDNA's test kits most relevant to this Study are:
(A) their 37-marker STR test, which gives (1) the repeat counts for each of 37 STR markers, (2) "Matches", (3) an estimated date of the common ancestor of these Matches (new), (4) a "Predicted Haplogroup" (aka SNP, pronounced "snip") and its estimated date, and (5) various other features, including ethnicity, which is not relevant to this Study;
(B) their BigY700 test, which though more expensive, gives (1) the repeat counts for over 700 STR markers and, more importantly, (2) identifies about 3,000 SNPs which have been inherited by the tester, including any "Private" SNPs which at the time of testing were unique to the tester, together with relevant (3) haplotrees, (4) estimated dates, and (5) various other features.
So both (A) and (B) above give both STR and SNP data. An analogy is that STRs identify the leaves of a tree, while SNPs identify the twigs and branches. STR and SNP test results complement one another. STRs are great for predicting from which branch of our surname each Irwin is descended. However, for various reasons it is now recognised that the much more reliable SNPs are needed to identify how the Irwins within each branch are related to one another.
FTDNA also market 111 STR marker tests, which we don't recommend, as well as their autosomal DNA Family Finder test. They no longer market 12, 25 or 67 STR marker tests, single SNP tests or SNP Pack tests.
2. STR test results
2.1 STRs. FTDNA publish the results of an STR (Short Tandem Repeat) test on the tester's personal on-line webpage and (if the tester has agreed) publish them anonymously on the project's public page at https://www.familytreedna.com/public/irwin/default.aspx?section=yresults. These results, commonly known as the tester's haplotype, or genetic signature, comprise a number of markers, typically 37, each identified by a "DYS" (or similar) ID, and a count of the number of times each of these markers is repeated, known as the marker count. Thus for example the genetic signature of a man who had taken a 4-marker STR test might be DYS393 with a count of 13 repeats, DYS390 with 24 repeats, DYS394 with 14 repeats and DYS391 with 11 repeats, summarised as "13, 24, 14, 11". 37 markers are generally needed, but are also sufficient, to identify which branch of the Irwin surname the tester is descended from. Experience has shown that the first 37 marker counts are the most useful, and 67, 111 or even 700+ STRs add little further value for money.
The Y-DNA STR test results of a single tester are of little value until they are compared with those of another tester. While determining the marker counts that make up a genetic signature is a strict scientific process, determining and expressing the probable relatedness of two genetic signatures sharing a common ancestor is much less precise. The probabilities are complex mathematical functions dependent on many variables, including the number of markers tested, the number and magnitude of the mismatching markers, and the different rates of mutation of individual markers. Further assumptions are needed of the likely number of generations elapsed since the most recent common paternal ancestor (MRCA).
Several tools may be used to interpret STR test results. Most depend on Genetic Distance.
2.2 Genetic distance ("GD"). Genetic Distance is a crude measure of relatedness. It may be defined as the number of differences, or steps, or mutations, between the genetic signatures of two testers. Thus if two 4-marker test results are compared and both have marker counts of 13, 24, 14, 11, their GD is zero and the two men are an "exact match". But if the count of the second man is, say, 13, 23, 14, 11, or 13, 24, 15, 11, then their GD is 1; and if the count of the second man is 13, 23, 15, 11 (i.e. two 1-step mismatches), or 13, 24, 14, 9 (i.e. one 2-step mismatch) then their GD is 2. So the basics of this comparison method are simple and informative, but also note that:
There are various subtly-different models for calculating GDs ("stepwise", "infinite alleles" and hybrids).
Different rules apply for multi-copy markers such as DYS 385, 389, 464 and YCA: see https://dna-explained.com/2016/07/27/y-dna-match-changes-at-family-tree-dna-affect-genetic-distance/
The average mutation rates of STR markers differ widely, so counting the differences is comparing "apples and pears". FTDNA's former TiP tool took account of this factor, but their new TiP tool no longer does so.
STR counts are unstable and thus liable to Convergence, or back mutations. Thus, for example, two men may today have identical STR signatures, but the signatures of their recent ancestors may have differed. Convergence is very difficult to recognise, but is less common with 67 and 111 marker genetic signatures than with 12 or 25 marker signatures.
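As a rough illustration of the counting described in 2.2, the following Python sketch computes the GD between two signatures under a simple stepwise model and under an infinite-alleles model. The function names are illustrative only, the sketch ignores the special rules for multi-copy markers and the differing mutation rates noted above, and it should not be taken as FTDNA's own algorithm.

```python
from typing import Sequence


def gd_stepwise(a: Sequence[int], b: Sequence[int]) -> int:
    """Stepwise model: add up the size of the difference at each marker."""
    if len(a) != len(b):
        raise ValueError("both signatures must cover the same markers")
    return sum(abs(x - y) for x, y in zip(a, b))


def gd_infinite_alleles(a: Sequence[int], b: Sequence[int]) -> int:
    """Infinite-alleles model: count mismatching markers, whatever their size."""
    if len(a) != len(b):
        raise ValueError("both signatures must cover the same markers")
    return sum(1 for x, y in zip(a, b) if x != y)


# The 4-marker examples used in the text.
first = (13, 24, 14, 11)
print(gd_stepwise(first, (13, 24, 14, 11)))         # exact match
print(gd_stepwise(first, (13, 23, 14, 11)))         # one 1-step mismatch
print(gd_stepwise(first, (13, 23, 15, 11)))         # two 1-step mismatches
print(gd_stepwise(first, (13, 24, 14, 9)))          # one 2-step mismatch
print(gd_infinite_alleles(first, (13, 24, 14, 9)))  # same pair, other model
```

The last two lines show why the choice of model matters: the single 2-step mismatch counts as 2 under the stepwise model but only as 1 under the infinite-alleles model.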
Notwithstanding these limitations, FTDNA show the Genetic Distances for all close matches, and there are three ways in which these GDs are often used: "Matches" pages, TMRCAs, and determining from which branch of a surname a tester is descended.
2.3 Using FTDNA's Y-DNA Matches pages. This is a popular albeit crude tool to identify other DNA testers who have an exact or close "Match", which is arbitrarily defined as another tester having a GD of 1/12, 2/25, 4/37, 7/67 or 10/111, or less. These pages have been prepared by FTDNA to help their customers exploit their Y-DNA test results. Testers in a large and well-developed surname project such as this are better served by our Main Results table, which shows members from which branch of the surname they are descended and hence the testers to whom they are most closely related, but some explanation of the Matches pages is in order. The following points are relevant:
FTDNA identify a tester's "Matches" by their name and e-mail address but not by their kit number. For privacy reasons neither FTDNA nor administrators ever include e-mail address and kit number at the same time. "Matches" pages help matching testers to contact each other, but for reasons explained below this exercise is unlikely to be fruitful, and will typically result in a very poor response level as recipients soon tire of unsolicited and poorly-justified approaches.
Matches are listed at 12, 25, 37, 67 and 111 markers, but a man tested to, say, 37 markers cannot have matches at more than 37 markers.
The number of "Matches" listed for each individual tester can vary widely. Matches at 12 markers are often so numerous and unreliable that they are best ignored. Conversely few or even no "Matches" may be listed at the 67 or 111 marker resolutions simply because few or no testers have (yet) tested to that level. The number of Matches at mid-level, say at 37 markers, will vary from tester to tester depending on the number of testers with similar surnames who have tested. Testers with a large number of "Matches" with other testers whose surnames are similar to the project surname will probably be members of a large branch of the surname.
The number of Matches at mid-level, say at 37 markers, will also vary from tester to tester depending on how close their STR signature is to that of a popular haplogroup such as M222, U106, M269 etc. Such individuals may have hundreds of "Matches", but many of these will be "False positives", i.e. "Matches" with surnames dissimilar to the tester's surname. Most of such Matches will be caused by Convergence, and so be irrelevant. This will contrast with a NPE "Match" (see below) which is typically characterised by his "Matches" being limited to two surnames.
FTDNA's arbitrary "cut-off"s of 1/12, 2/25, 4/37, 7/67 and 10/111 mean that a few matches with men with the project surname at, for example, 5/37 and 6/37 are not listed, i.e. are "False Negatives". In theory if another tester is not a "Match" at a given level he will not be a "Match" at a higher level, although in practice a "Match" can occasionally appear at a higher level because of convergence. Several of our surname branches include a small proportion of "False negatives".
The number of "Matches" a man has is constantly increasing as more men take STR tests.
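The cut-offs quoted in 2.3 can be written down as a small lookup table. The Python sketch below is illustrative only: the names MATCH_CUTOFFS, is_listed_match and comparable_levels are hypothetical, and the code simply restates the thresholds given above rather than reproducing FTDNA's software.

```python
from typing import Dict, List

# Largest GD still listed as a "Match" at each test level (from section 2.3).
MATCH_CUTOFFS: Dict[int, int] = {12: 1, 25: 2, 37: 4, 67: 7, 111: 10}


def is_listed_match(gd: int, markers: int) -> bool:
    """True if a GD at the given marker level falls within the cut-off."""
    if markers not in MATCH_CUTOFFS:
        raise ValueError(f"no cut-off quoted for {markers} markers")
    return gd <= MATCH_CUTOFFS[markers]


def comparable_levels(markers_a: int, markers_b: int) -> List[int]:
    """Levels at which two men can be compared: limited by the smaller test."""
    limit = min(markers_a, markers_b)
    return [level for level in sorted(MATCH_CUTOFFS) if level <= limit]


print(is_listed_match(4, 37))      # on the cut-off, so listed
print(is_listed_match(5, 37))      # just outside: a potential "False negative"
print(comparable_levels(37, 111))  # a 37-marker tester cannot match at 67 or 111
```

A hard threshold of this kind is what produces the "False negatives" described above: a GD of 5/37 or 6/37 with a man of the same surname is simply not listed.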
2.4 Predicting Time since Most Recent Common Ancestor (TMRCA). Various tables have been developed by FTDNA and others to use Genetic Distance to predict the probable date of birth of the most recent ancestor common to the particular two men whose STR signatures are being compared. Of these tools FTDNA's new TiP table "Most Recent Common Ancestor Time Predictor based on Y-STR Genetic Distance" (presently to be found as the "New" right-hand icon after each "Match") should be the most reliable as it is based in part on FTDNA's experience with their more reliable SNP data. The tables imply their prediction varies for each "Match", but in fact the dates in the table remain unchanged for every "Match". For surname studies we are not concerned with dates before c.AD1100, and FTDNA's table may be summarised thus:
All TMRCAs are actually a probability function and so the Mean TMRCA estimate should always be accompanied by Confidence Intervals ("CI"s). The table above shows 95% CIs, which indicate the date range within which 95% of TMRCAs will lie. But the date estimates, even taking into account the CIs, can be misleading. For example our Study has two living brothers who have a GD of 2/25, implying their father was probably born before AD1800 (!), and a dozen men with GDs of 0/67 whose genealogies typically go back to the 18th century, but without evidence of a common ancestor.
This table also shows why FTDNA's "Matches" criteria such as a GD "cut off" of 4/37 may be misleading and other factors such as the surname and places of origin need to be considered. Thus two men with a GD of 2/37 but different surnames probably don't have a common ancestor within the surname era (their Match is probably a "false positive"), but two men with a GD of 6/37 and the same surname very possibly do share a common ancestor within the surname era (their not being listed as a Match by FTDNA is probably a "false negative").
Note that these predominantly STR-based TMRCAs are less reliable than SNP-based TMRCAs and TimeTrees (see section 3 below), even when CIs are taken into account.
2.5 Determining from which branch (aka genetic family) of a surname each tester is descended. FTDNA's Colourised version of a surname project's Results table shows the Mode for each branch, i.e. the modal counts of each STR of all the members of each branch. If a tester shares the project surname and has a GD of less than, say, 6/37 from these modal values then he is a member of this branch. Counter-intuitively, these modal values rarely change as new members join each branch, and this tool has proved a reliable method of identifying different branches of a surname and determining from which branch each project member is descended (notwithstanding the inherent weakness of all Genetic Distances).
A genetic signature with no Irwin matches and thus not (yet) belonging to any branch of the surname is known as a Singleton. Some singletons are NPEs; others are simply awaiting a matching tester, at which point the pair can establish a new branch.
2.6 NPEs. Although in theory surnames and Y-DNA signatures are both inherited through successive generations of a male ancestral line, in practice such lineages occasionally experience a change in the surname. In Surname DNA studies instances of such events are euphemistically termed Non Paternal Events (NPEs). This term has many synonyms and such events have a variety of causes. Examples of NPEs include:
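The branch-assignment idea in 2.5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the branch labels and the 4-marker toy data are all hypothetical, GD is computed with a plain stepwise count, and the surname check that the text also requires is left to the reader. With real 37-marker data the limit would be the "less than, say, 6" mentioned above.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def stepwise_gd(a: Sequence[int], b: Sequence[int]) -> int:
    """Genetic Distance as the sum of absolute repeat-count differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def modal_signature(members: Sequence[Sequence[int]]) -> Tuple[int, ...]:
    """Most common repeat count at each marker across a branch's members."""
    # zip(*members) regroups the counts marker by marker; on a tie, Counter
    # returns the count it met first.
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))


def assign_branch(
    signature: Sequence[int],
    modes: Dict[str, Sequence[int]],
    gd_limit: int,
) -> Optional[str]:
    """Nearest branch whose mode is less than gd_limit away, else None."""
    if not modes:
        return None
    nearest = min(modes, key=lambda name: stepwise_gd(signature, modes[name]))
    if stepwise_gd(signature, modes[nearest]) < gd_limit:
        return nearest
    return None  # no branch is close enough: a Singleton


# Hypothetical 4-marker toy data for two branches.
modes = {
    "Branch A": modal_signature([(13, 24, 14, 11), (13, 24, 14, 11), (13, 23, 14, 11)]),
    "Branch B": modal_signature([(14, 22, 15, 10), (14, 22, 15, 10), (14, 22, 16, 10)]),
}
print(modes["Branch A"])
print(assign_branch((13, 24, 15, 11), modes, gd_limit=2))
print(assign_branch((12, 26, 17, 9), modes, gd_limit=2))
```

Because the mode is taken marker by marker, one new member with an unusual count rarely shifts it, which is consistent with the observation that modal values seldom change as a branch grows.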
A formal change of surname, typically a 20th century event, but sometimes earlier, e.g. to inherit land from a father-in-law.
An informal change of surname, typically in the 13th to 19th centuries, for example when a young boy's father died and he was given the surname of his mother (in Scotland females retained their maiden names until the 19th century) or, if she remarried, of his step-father; or if a boy was orphaned or a waif or foundling, and was given the name of his guardian.
A change of surname before these had become strictly hereditary, typically in the 12th to 19th centuries, for example a patronymic when a boy was given the forename of his father, or a man became known by his nickname or occupation, or by where he lived or came from, or when a clan member, tenant, apprentice, servant or slave took the surname of his chief, laird or master. This practice seems to have been particularly prevalent in the Scottish Borders. Sometimes such 'alias' surnames were used concurrently with paternal surnames, which later lapsed.
An illegitimacy or infidelity, covert or otherwise, at any period, where a young boy was given the surname of his mother or of her partner or husband.
NPEs in our Study can be manifest in two ways: those testers who today use the Irwin surname or similar but share the genetic signature of some other surname, and those who share the genetic signature of one of the branches of the Irwin surname but today use a different surname. For further discussion of the interpretation of test results see section 7 and Appendix D of the accompanying Supplementary Paper 8, slides 25-30 of the lecture at Supplementary Paper 9, and my contribution at http://www.isogg.org/wiki/NPE.
Awareness that one's paternal ancestry included a NPE can be a very disappointing surprise, particularly to genealogists who have long believed they are descended from a particular branch of their surname. But it is important to remember that a majority of NPEs were not associated with any untoward event, that most surnames are not derived from a single ancestor, that DNA evidence is never 100% proof of anything, and that some NPE branches of a surname may be older than branches that are not NPEs. NPEs traditionally occur at a rate of 1-2% per generation in paternal ancestral lines, but these rates are cumulative and NPE ancestry is thus much more common in the general population than is widely assumed. Testers with NPE ancestry can join another surname project and explore the heritage of two surnames, just as the heritage of a surname can be shared by all its branches.
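To see how a small per-generation rate accumulates, the Python sketch below applies the 1-2% figure over a number of generations. It assumes, purely for illustration, that the rate is constant and that each generation is independent; the figure of 25 generations is an assumption chosen to represent roughly the surname era, not a value taken from the Study.

```python
def cumulative_npe_probability(rate_per_generation: float, generations: int) -> float:
    """Chance of at least one NPE somewhere in a paternal line."""
    # The line is NPE-free only if every generation is NPE-free.
    return 1.0 - (1.0 - rate_per_generation) ** generations


for rate in (0.01, 0.02):
    p = cumulative_npe_probability(rate, 25)
    print(f"{rate:.0%} per generation over 25 generations: {p:.0%}")
```

Under these assumptions roughly a fifth to two fifths of paternal lines would include at least one NPE, which illustrates why NPE ancestry is more common than is widely assumed.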
For inspiring examples of how genealogical research can resolve NPE test results see the accompanying Supplementary Paper No.8 and, if you can get hold of a copy, Richard Hill's fascinating book Finding Family.
2.7 Caution. Prospective testers should be aware that some DNA test results have unexpected implications. Disappointments can occur for several reasons:
if there has been a NPE in the paternal ancestral line;
if the results contradict some cherished genealogical understanding, tradition or research;
if the comparisons are indeterminate, e.g. when a tester does not (yet) appear to be genetically related to anyone else in the study, or indeed to anyone who has taken a similar test;
if the test does not lead to identifying any "new" genealogical relatives (typically because few surname DNA studies have sampled more than 1% of those with the surname who are alive around the world today).
Notwithstanding these contingencies over 90% of testers in our Study have been shown to be members of one of the various branches that have been identified, and as more Irwins take a 37 marker or BigY700 test, the more relatives are identified and the more we learn about the evolution of our surname.
3. SNP test results and haplotrees
DNA tests such as FTDNA's BigY700, which identifies most of the SNPs (Single-Nucleotide Polymorphisms, pronounced "snip"s) that a man has inherited, are a relatively new but exciting development. It is first necessary to understand that (1) unlike STRs, SNPs are stable, i.e. are not prone to convergence, that (2) every man inherits every SNP of his father, plus sometimes one or two new mutations, i.e. additional SNPs that are unique to him, and so that (3) SNPs are hierarchical, and can be connected by a haplotree (aka phylogeny or phylogenetic tree). Haplotrees may be presented top-down (i.e. the oldest SNP at the top, most recent at the bottom), left-to-right, or circular. The male haplotree is a genetic family tree which in its entirety stems from a genetic Adam down to every living man. Many editions of this haplotree are evolving, e.g. those of ISOGG (www.isogg.org/tree) and Alex Williamson (www.ytree.net). FTDNA's haplotrees are the most extensive, containing nearly 1,000,000 SNPs. Confusingly the early, "high-level" SNPs, thousands of years old, are commonly known as haplogroups, and younger, more recent "downstream" SNPs are sometimes termed Variants.
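Because every man inherits all his father's SNPs, a haplotree can be held as a simple child-to-parent table. The Python sketch below is a toy illustration only: R-M269 and R-L555 are SNPs mentioned in this paper, but the many SNPs between them are omitted, and the "R-EXAMPLE" names and function names are hypothetical placeholders.

```python
from typing import Dict, List, Optional

# Child SNP -> parent SNP. R-M269 is treated as the root of this toy tree.
PARENT: Dict[str, Optional[str]] = {
    "R-M269": None,
    "R-L555": "R-M269",        # intermediate SNPs omitted
    "R-EXAMPLE1": "R-L555",    # hypothetical downstream SNPs
    "R-EXAMPLE2": "R-L555",
    "R-EXAMPLE3": "R-EXAMPLE1",
}


def lineage(snp: str) -> List[str]:
    """All SNPs a man carries, from his youngest SNP back to the root."""
    path: List[str] = []
    current: Optional[str] = snp
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def youngest_shared_snp(a: str, b: str) -> Optional[str]:
    """The most recent SNP that two terminal SNPs have in common."""
    carried_by_a = set(lineage(a))
    for snp in lineage(b):  # walks from youngest to oldest
        if snp in carried_by_a:
            return snp
    return None


print(lineage("R-EXAMPLE3"))
print(youngest_shared_snp("R-EXAMPLE3", "R-EXAMPLE2"))
```

The youngest shared SNP marks the point at which two men's paternal lines diverged, which is why SNPs show how testers within a branch are related to one another.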
FTDNA include in the results of their 37 marker test a "Predicted haplogroup" in red, such as R-M269. Such predictions, based on STR data, are generally reliable, and of interest to ethnicity studies, but are generally so ancient that they are of little relevance to Surname studies, except occasionally for helping to reject some possible NPEs (two men with different surnames and different predicted haplogroups are unlikely to be NPEs).
A single SNP test, a SNP Pack test or a BigY test enables FTDNA to replace this red prediction with the youngest confirmed SNP, which FTDNA call the Terminal SNP, in green, e.g. R-L555. However the single SNP and SNP Pack tests do not discover new SNPs and we no longer recommend them.
FTDNA's new "Discover" and "Timetree" tools now show the estimated dates associated with these predicted haplogroups and terminal SNPs.
FTDNA's powerful NGS (Next Generation Sequencing) BigY700 test "discovers" most of a man's SNPs (typically about 3,000!), right down to the SNPs presently unique to him (aka Private Variants or PVs), at least until a relative's BigY700 test shows that he shares some of these PVs. A man may have 40 PVs if no close relatives have taken a BigY test, or even no PVs if a very close relative has taken a BigY test. However there are several practical problems in interpreting BigY700 test results:
(1) like all DNA tests, BigY700 tests of at least two related men are needed to make most use of this tool. So being the first in a branch to take a BigY700 test will yield little immediate benefit, but hopefully will stimulate other members of the branch to follow this example. Conversely, our large Borders branch, now with more than 100 BigY testers, has a haplotree with ever increasing detail and insight to the evolution of this branch.
(2) the BigY700 test itself does not identify the sequence in which a man's SNP mutations occurred, i.e. which of his SNPs are old and which are new. FTDNA automates much of this process, but the final "polishing" is done manually, typically a fortnight after the test results are first published. Further refinements may be published, again without warning, when subsequent BigY700 tests by other men help to refine the haplotree and reduce the number of PVs. In this sense the BigY700 test results are dynamic and need checking from time to time.
(3) not only can we not tell which of a man's PVs are the oldest and which are the most recent (they thus form a "block" of SNPs), but blocks of older SNPs still exist where not enough BigY700 or similar tests have yet been taken to subdivide the SNPs in these blocks to different lineages of descendants. Some of these blocks still contain scores of SNPs, so that the c.3,000 SNPs of most testers is typically reduced to about 50 single SNPs or blocks of SNPs. The more recent blocks in the ancestry of BigY testers are shown on their "Block Tree". Somewhat confusingly the SNPs within a block are known as "equivalent" SNPs. The sequence in which these "equivalent" SNPs are shown within a block is not significant. A block of SNPs may be named after one of these "equivalent" SNPs, often the "top" SNP, but future research may show this SNP is not the oldest. The TMRCA associated with a block of SNPs is that of the youngest SNP, even if we do not (yet) know which of the equivalents is the youngest. Similarly the estimated date that the block was formed is the date of its oldest SNP, even if we don't (yet) know which of the equivalent SNPs is the oldest.
(4) the nomenclature of SNPs is confusing. On discovery each SNP is known by its 7 or 8 digit position in the male chromosome, e.g. 779294GT. FTDNA retain this label for PVs and only give a name such as R-FT34569, or R-L555, or L555+, when a specific PV is found to be shared by a new tester. FTDNA call the youngest named SNP the terminal SNP (though there are some exceptions to this practice). Confusingly another laboratory may prefer to give a SNP a different name, so for example R-L555 is also known as R-S393. The ISOGG naming system (e.g. R1b1a2a1a2c1a5a) is no longer used as it needed periodic updating and was becoming too cumbersome. The prefix R-, or R1b-, in these examples, sometimes omitted, refers to the haplogroup (i.e. very old or "high-level" SNP). See ybrowse.org for a full list of SNPs with their synonyms and locations on the human chromosome.
(5) The rate at which SNPs mutate varies widely. Thus for example the date of a man's terminal SNP depends on how many of his relatives have taken a BigY test, how many PVs he has that his relatives do not share, and how frequently these PVs have mutated. The average mutation rate for all BigY700 SNPs is 83 years per SNP (i.e. about once every 2 or 3 generations), and for our large Borders branch of Irwins is currently about the same (depending on how it is calculated). However different lineages within this branch have average mutation rates ranging from 1 SNP mutation per generation to 1 per 10 generations, and within some lineages the range is even wider. However FTDNA's new "Discover" haplotree and "Timeline" give estimated dates for all named SNPs. Note that SNP-based TMRCAs are much more reliable than STR-based TMRCAs.
(6) The BigY700 test results include much detailed information that I find to be of little relevance to our surname study. The pages that I find most useful are "Block Tree", Results - Private Variants, and under "Discover": Haplogroup Story, Time Tree and Scientific Details. These pages deserve close attention.
(7) Project administrators can embellish haplotrees of downstream SNPs by adding STR data (making what is known as a "mutation history tree") and other data (making what I call a "genetic family tree").
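The way a man's count of Private Variants falls as relatives test can be shown with simple set arithmetic. The Python sketch below is illustrative only: the kit labels, the "pos_" position labels and the function name are hypothetical, and real BigY700 results contain thousands of SNPs rather than a handful.

```python
from typing import Dict, Set


def private_variants(tester: str, results: Dict[str, Set[str]]) -> Set[str]:
    """SNPs found in this tester and in no other tester so far."""
    seen_in_others: Set[str] = set()
    for name, snps in results.items():
        if name != tester:
            seen_in_others |= snps
    return results[tester] - seen_in_others


# Hypothetical results for two testers in the same branch.
results: Dict[str, Set[str]] = {
    "kit_A": {"R-M269", "R-L555", "pos_1001", "pos_1002", "pos_1003"},
    "kit_B": {"R-M269", "R-L555", "pos_1001", "pos_2001"},
}
print(sorted(private_variants("kit_A", results)))

# A closer relative tests and shares pos_1002, which then ceases to be private.
results["kit_C"] = {"R-M269", "R-L555", "pos_1001", "pos_1002"}
print(sorted(private_variants("kit_A", results)))
```

This is why BigY700 results are dynamic: nothing about kit_A's own test changes, yet his list of PVs shrinks each time a relative's test reveals a shared SNP.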
4. Latest Analysis Update.
This accompanying webpage, updated every six months, includes a bespoke Main Results table with STR and SNP data for each project member, a bespoke Clan Irwin Haplotree showing how all the 40+ branches of the Irwin surname had split off before the beginning of the surname era, and a bespoke genetic family tree including all Border Irwin L555 BigY testers and their tested close relatives and showing how they are related to one another. These focussed interpretations and accompanying discussions show how individual testers are contributing to the study of the evolution of the many branches of our surname. The picture is continuously evolving and each member has personal interests or concerns, so as the Study's Administrator I am happy to answer queries from both members and non-members.
5. Further guidance on understanding Y-DNA test results
See www.isogg.org/wiki.