Interpreting Y-DNA Test Results

Contents

1.  Introduction

2.  STR test results

3.  SNP test results and haplotrees

4.  Latest Analysis Update

5.  Further guidance on understanding Y-DNA test results

1.  Introduction

As explained in "Background and Goals", surname studies depend on Y-DNA tests (because Y-DNA, like surnames, is only inherited through the male line).  Several companies used to offer Y-DNA tests, but today the Houston-based company FamilyTreeDNA ("FTDNA") has easily the largest Y-DNA database in the world (a critical criterion).  They also market a comprehensive range of DNA tests, and so are the most popular company marketing Y-DNA test kits.   

FTDNA's test kits most relevant to this Study are:

(A) their 37-marker STR test, which gives (1) the repeat counts for each of 37 STR markers, (2) "Matches", (3) an estimated date of the common ancestor of these Matches (new), (4) a "Predicted Haplogroup" (aka SNP, pronounced "snip") and its estimated date, and (5) various other features, including ethnicity, which is not relevant to this Study;    

(B) their BigY700 test, which though more expensive, gives (1) the repeat counts for over 700 STR markers and, more importantly, (2) identifies about 3,000 SNPs which have been inherited by the tester, together with relevant (3) haplotrees, (4) estimated dates, and (5) various other features.  

So both (A) and (B) above give both STR and SNP data.  An analogy is that STRs identify the leaves of a tree, while SNPs identify the twigs and branches. STR and SNP test results complement one another.  STRs are great for predicting from which branch of our surname each Irwin is descended.  However, for various reasons it is now recognised that the much more reliable SNPs are needed to identify how the Irwins within each branch are related to one another.  

FTDNA also market 111-marker STR tests, single SNP tests and SNP Pack tests, none of which we now recommend, as well as their autosomal DNA Family Finder test.

2.  STR test results

2.1 STRs.  FTDNA publish the results of an STR (Short Tandem Repeat) test on the tester's personal online webpage and (if the tester has agreed) publish them anonymously on the project's public page at https://www.familytreedna.com/public/irwin/default.aspx?section=yresults.  These results, commonly known as the tester's haplotype, or genetic signature, comprise a number of markers, typically 37, each identified by a "DYS" (or similar) ID, and a count of the number of times each of these markers is repeated, known as the marker count.  Thus for example the genetic signature of a man who had taken a 4-marker STR test might be DYS393 with a count of 13 repeats, DYS390 with 24 repeats, DYS394 with 14 repeats and DYS391 with 11 repeats, summarised as "13, 24, 14, 11".  37 markers are generally needed, but are also sufficient, to identify which branch of the Irwin surname the tester is descended from.  Experience has shown that the first 37 marker counts are the most useful, and 67, 111 or even 700+ STRs add little further value for money.

The Y-DNA STR test results of a single tester are of little value until they are compared with those of another tester.  While determining the marker counts that make up a genetic signature is a strict scientific process, determining and expressing the probable relatedness of two genetic signatures sharing a common ancestor is much less precise.  The probabilities are complex mathematical functions dependent on many variables, including the number of markers tested, the number and magnitude of the mismatching markers, and the different rates of mutation of individual markers.  Further assumptions are needed about the likely number of generations elapsed since the most recent common paternal ancestor (MRCA).

Several tools may be used to interpret STR test results.  Most depend on Genetic Distance. 

2.2    Genetic distance ("GD").   Genetic Distance is a crude measure of relatedness.  It may be defined as the number of differences, or steps, or mutations, between the genetic signatures of two testers.  Thus if two 4-marker test results are compared and both have marker counts of 13, 24, 14, 11, their GD is zero and the two men are an "exact match".  But if the count of the second man is, say, 13, 23, 14, 11, or 13, 24, 15, 11, then their GD is 1; and if the count of the second man is 13, 23, 15, 11 (i.e. two 1-step mismatches), or 13, 24, 14, 9 (i.e. one 2-step mismatch) then their GD is 2.  So the basics of this comparison method are simple and informative, but also note that:

Notwithstanding these limitations, FTDNA show the Genetic Distances for all close matches, and there are three ways in which these GDs are often used: "Matches" pages, TMRCAs, and determining from which branch of a surname a tester is descended.
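The step-counting definition of Genetic Distance in section 2.2 can be sketched in a few lines of Python.  This is only an illustration of the simple definition given above, using the 4-marker examples from the text; the function name genetic_distance is hypothetical, and the sketch ignores any special treatment that FTDNA's own calculation may apply to particular markers.

```python
def genetic_distance(signature_a, signature_b):
    """Sum of the absolute differences in repeat counts, marker by marker.

    Both signatures must list the same markers in the same order.
    """
    if len(signature_a) != len(signature_b):
        raise ValueError("signatures must cover the same markers")
    return sum(abs(a - b) for a, b in zip(signature_a, signature_b))


# Marker order: DYS393, DYS390, DYS394, DYS391 (the 4-marker example above)
first_man = [13, 24, 14, 11]

print(genetic_distance(first_man, [13, 24, 14, 11]))  # exact match
print(genetic_distance(first_man, [13, 23, 14, 11]))  # one 1-step mismatch
print(genetic_distance(first_man, [13, 23, 15, 11]))  # two 1-step mismatches
print(genetic_distance(first_man, [13, 24, 14, 9]))   # one 2-step mismatch
```

Note that the last two comparisons give the same GD even though they arise from different patterns of mutation, which is one reason why GD is only a crude measure.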

2.3  Using FTDNA's Y-DNA Matches pages.  This is a popular albeit crude tool to identify other DNA testers who have an exact or close "Match", which is arbitrarily defined as another tester having a GD of 1/12, 2/25, 4/37, 7/67 or 10/111, or less. These pages have been prepared by FTDNA to help their customers exploit their Y-DNA test results. Testers in a large and well-developed surname project such as this are better served by our Main Results table, which shows members from which branch of the surname they are descended  and hence the testers to whom they are most closely related, but some explanation of the Matches pages is in order.  The following points are relevant:

The number of "Matches" a man has is constantly increasing as more men take STR tests.    
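The "Match" cut-offs quoted in section 2.3 amount to a simple lookup by test size.  The Python sketch below is illustrative only: the names MATCH_CUTOFFS and is_listed_match are hypothetical and are not part of any FTDNA tool, and the sketch assumes that the cut-off depends solely on the number of markers compared.

```python
# Maximum GD for a "Match" at each test size, as quoted in section 2.3
MATCH_CUTOFFS = {12: 1, 25: 2, 37: 4, 67: 7, 111: 10}


def is_listed_match(gd, markers_compared):
    """Return True if a GD at this test size falls within the quoted cut-off."""
    if markers_compared not in MATCH_CUTOFFS:
        raise ValueError("no cut-off quoted for this number of markers")
    return gd <= MATCH_CUTOFFS[markers_compared]


print(is_listed_match(4, 37))  # within the 4/37 cut-off
print(is_listed_match(6, 37))  # outside the 4/37 cut-off
```

Because the cut-offs are arbitrary, a pair that falls just outside one of them is not necessarily unrelated, and a pair inside one is not necessarily related within the surname era.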

2.4  Predicting  Time since Most Recent Common Ancestor (TMRCA).  Various tables have been developed by FTDNA and others to use Genetic Distance to predict the probable date of birth of the most recent ancestor common to the particular two men whose STR signatures are being compared.  Of these tools FTDNA's new TiP table "Most Recent Common Ancestor Time Predictor based on Y-STR Genetic Distance" (presently to be found as the "New" right-hand icon after each "Match") should be the most reliable as it is based in part on FTDNA's experience with their more reliable SNP data.  The tables imply their prediction varies for each "Match", but in fact the dates in the table remain unchanged for every "Match".   For surname studies we are not concerned with dates before c.AD1100, and FTDNA's table may be summarised thus: 

All TMRCAs are actually probability functions, and so the Mean TMRCA estimate should always be accompanied by Confidence Intervals ("CI"s).  The table above shows 95% CIs, which indicate the date range within which 95% of TMRCAs will lie.  But the date estimates, even taking into account the CIs, can be misleading.  For example our Study has two living brothers who have a GD of 2/25, implying their father was probably born before AD1800 (!), and a dozen men with GDs of 0/67 whose genealogies typically go back to the 18th century, but without evidence of a common ancestor.

This table also shows why FTDNA's "Matches" criteria such as a GD "cut off" of 4/37 may be misleading and other factors such as the surname and places of origin need to be considered.  Thus two men with a GD of 2/37 but different surnames probably don't have a common ancestor within the surname era (their Match is probably a "false positive"), but two men with a GD of 6/37 and the same surname very possibly do share a common ancestor within the surname era (their not being listed as a Match by FTDNA is probably a "false negative").   

Note that these predominantly STR-based TMRCAs are less reliable than SNP-based TMRCAs and TimeTrees (see section 3 below), even when CIs are taken into account.

2.5   Determining from which branch (aka genetic family) of a surname each tester is descended.  FTDNA's Colourised version of a surname project's Results table shows the Mode for each branch, i.e. the modal counts of each STR of all the members of each branch.   If a tester shares the project surname and has a GD of less than, say, 6/37 from these modal values then he is a member of this branch.   Counter-intuitively, these modal values rarely change as new members join each branch, and this tool has proved a reliable method of identifying different branches of a surname and determining from which branch each project member is descended (notwithstanding the inherent weakness of all Genetic Distances). 

A genetic signature with no Irwin matches and thus not (yet) belonging to any branch of the surname is known as a Singleton.  Some singletons are NPEs; others are simply awaiting a match to test, when the pair can establish a new branch.
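The modal approach of section 2.5 can be sketched as follows.  This is a minimal Python illustration under stated assumptions: the 4-marker signatures and branch names are invented for the example (real comparisons use 37 markers), the function names are hypothetical, and the limit parameter stands in for the "less than, say, 6/37" rule of thumb quoted above.  It is not FTDNA's implementation.

```python
from collections import Counter


def genetic_distance(a, b):
    """Sum of absolute repeat-count differences, marker by marker."""
    return sum(abs(x - y) for x, y in zip(a, b))


def modal_signature(signatures):
    """Most common repeat count at each marker across a branch's members."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*signatures)]


def assign_branch(signature, branch_modes, limit=6):
    """Name of the nearest branch whose mode is within GD < limit, else None."""
    best_name, best_gd = None, None
    for name, mode in branch_modes.items():
        gd = genetic_distance(signature, mode)
        if best_gd is None or gd < best_gd:
            best_name, best_gd = name, gd
    if best_gd is not None and best_gd < limit:
        return best_name
    return None  # no branch close enough: a Singleton for now


# Invented 4-marker data for two hypothetical branches
branches = {
    "Branch A": [[13, 24, 14, 11], [13, 24, 14, 11], [13, 23, 14, 11]],
    "Branch B": [[12, 22, 15, 10], [12, 22, 15, 10], [12, 22, 16, 10]],
}
modes = {name: modal_signature(members) for name, members in branches.items()}

# A smaller limit is used here only because the toy data has 4 markers
print(assign_branch([13, 24, 15, 11], modes, limit=2))
print(assign_branch([14, 26, 13, 12], modes, limit=2))
```

In this toy data the first tester is one step from the Branch A mode and is assigned to it, while the second is too far from either mode and remains unassigned.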

2.6    NPEs.  Although in theory surnames and Y-DNA signatures are both inherited through successive generations of a male ancestral line, in practice such lineages occasionally experience a change in the surname.  In Surname DNA studies instances of such events are euphemistically termed Non-Paternal Events (NPEs).  This term has many synonyms and such events have a variety of causes.   Examples of NPEs include:

NPEs in our Study can be manifest in two ways: those testers who today use the Irwin surname or similar but share the genetic signature of some other surname, and those who share the genetic signature of one of the branches of the Irwin surname but today use a different surname.  For further discussion of the interpretation of test results see section 7 and Appendix D of the accompanying Supplementary Paper 8, slides 25-30 of the lecture at Supplementary Paper 9, and my contribution at http://www.isogg.org/wiki/NPE

Awareness that one's paternal ancestry included an NPE can be a very disappointing surprise, particularly to genealogists who have long believed they are descended from a particular branch of their surname.  But it is important to remember that a majority of NPEs were not associated with any untoward event, that most surnames are not derived from a single ancestor, that DNA evidence is never 100% proof of anything, and that some NPE branches of a surname may be older than branches that are not NPEs.  NPEs are traditionally estimated to occur at a rate of 1-2% per generation in paternal ancestral lines, but these rates are cumulative and NPE ancestry is thus much more common in the general population than is widely assumed.  Testers with NPE ancestry can join another surname project and explore the heritage of two surnames, just as the heritage of a surname can be shared by all its branches.
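The cumulative effect of a 1-2% per-generation NPE rate can be illustrated with a short calculation.  The Python sketch below assumes that the rate is constant and that each generation is independent, so that the chance of at least one NPE in n generations is 1 - (1 - p)^n; the figure of 25 generations is a hypothetical example, not a value from this Study.

```python
def chance_of_npe(rate_per_generation, generations):
    """Probability of at least one NPE somewhere in a paternal line.

    Assumes a constant rate and independence between generations.
    """
    return 1 - (1 - rate_per_generation) ** generations


GENERATIONS = 25  # hypothetical depth of a paternal line

for rate in (0.01, 0.02):
    chance = chance_of_npe(rate, GENERATIONS)
    print(f"{rate:.0%} per generation over {GENERATIONS} generations: {chance:.0%}")
```

Under these assumptions the chance of at least one NPE in a 25-generation line is roughly 22% at a 1% rate and roughly 40% at a 2% rate.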

For inspiring examples of how genealogical research can resolve NPE test results see the accompanying Supplementary Paper No.8 and, if you can get hold of a copy, Richard Hill's fascinating book Finding Family.

2.7     Caution.  Prospective testers should be aware that some DNA test results have unexpected implications.  Disappointments can occur for several reasons:

Notwithstanding these contingencies over 90% of testers in our Study have been shown to be members of one of the various branches that have been identified, and as more Irwins take a 37 marker or BigY700 test, the more relatives are identified and the more we learn about the evolution of our surname.

3.  SNP test results and haplotrees

DNA tests such as FTDNA's BigY700, which identifies most of the SNPs (Single-Nucleotide Polymorphisms, pronounced "snips") that a man has inherited, are a relatively new but exciting development.  It is first necessary to understand that (1) unlike STRs, SNPs are stable, i.e. are not prone to convergence, that (2) every man inherits every SNP of his father, plus sometimes one or two new mutations, i.e. additional SNPs that are unique to him, and so that (3) SNPs are hierarchical, and can be connected by a haplotree (aka phylogeny or phylogenetic tree).  Haplotrees may be presented top-down (i.e. the oldest SNP at the top, most recent at the bottom), left-to-right, or circular.  The male haplotree is a genetic family tree which in its entirety stems from a genetic Adam down to every living man.  Many editions of this haplotree are evolving, e.g. those of ISOGG (www.isogg.org/tree) and Alex Williamson (www.ytree.net).  FTDNA's haplotrees are the most extensive, containing nearly 1,000,000 SNPs.  Confusingly the early, "high-level" SNPs, thousands of years old, are commonly known as haplogroups, and younger, more recent "downstream" SNPs are sometimes termed Variants.
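Because every man inherits all of his father's SNPs, a haplotree can be modelled as a simple parent-child structure in which the youngest SNP shared by two men marks their common branch.  The Python sketch below is a toy illustration only: R-M269 and R-L555 are taken from the examples in this paper with the many intermediate SNPs omitted, the labels SNP-A, SNP-A1 and SNP-B are hypothetical, and the function names are assumptions.

```python
# Each SNP points to the SNP immediately upstream of it in this toy tree
PARENT = {
    "R-M269": None,      # treated as the root of the toy tree
    "R-L555": "R-M269",  # intermediate SNPs omitted
    "SNP-A": "R-L555",   # hypothetical downstream SNPs
    "SNP-A1": "SNP-A",
    "SNP-B": "R-L555",
}


def lineage(snp):
    """List a SNP and everything upstream of it, youngest first."""
    path = []
    while snp is not None:
        path.append(snp)
        snp = PARENT[snp]
    return path


def youngest_shared_snp(terminal_a, terminal_b):
    """Youngest SNP found in the lineages of both terminal SNPs."""
    upstream_of_a = set(lineage(terminal_a))
    for snp in lineage(terminal_b):
        if snp in upstream_of_a:
            return snp
    return None


print(lineage("SNP-A1"))
print(youngest_shared_snp("SNP-A1", "SNP-B"))
```

In this toy tree two men whose terminal SNPs are SNP-A1 and SNP-B share R-L555 as their youngest common SNP, so their common ancestor lived at or after the time that SNP arose.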

FTDNA include in the results of their 37 marker test a "Predicted haplogroup" in red, such as R-M269.  Such predictions, based on STR data, are generally reliable, and of interest to ethnicity studies, but are generally so ancient that they are of little relevance to Surname studies, except occasionally for helping to reject some possible NPEs (two men with different surnames and different predicted haplogroups are unlikely to be NPEs).  

A single SNP test, a SNP Pack test or a BigY test enables FTDNA to replace this red prediction with the youngest confirmed SNP, which FTDNA call the Terminal SNP, in green, e.g. R-L555.  However the single SNP and SNP Pack tests do not discover new SNPs and we no longer recommend them.

FTDNA's new "Discover" and "Time Tree" tools now show the estimated dates associated with these predicted haplogroups and terminal SNPs.

FTDNA's powerful NGS (Next Generation Sequence) BigY700 test "discovers" most of a man's SNPs (typically about 3,000!), right down to the SNPs presently unique to him (aka Private Variants or PVs), at least until a relative's BigY700 test shows that he shares some of these PVs.  A man may have 40 PVs if no close relatives have taken a BigY test, or even no PVs if a very close relative has a BigY test.  However there are several practical problems in interpreting BigY700 test results:

(1) like all DNA tests, BigY700 tests of at least two related men are needed to make most use of this tool.  So being the first in a branch to take a BigY700 test will yield little immediate benefit, but hopefully will stimulate other members of the branch to follow this example.  Conversely, our large Borders branch, now with more than 100 BigY testers, has a haplotree with ever increasing detail and insight into the evolution of this branch.

(2) the BigY700 test itself does not identify the sequence in which a man's SNP mutations occurred, i.e. which of his SNPs are old and which are new.  FTDNA automates much of this process, but the final "polishing" is done manually, typically a fortnight after the test results are first published.  Further refinements may be published, again without warning, when subsequent BigY700 tests by other men help to refine the haplotree and reduce the number of PVs.  In this sense the BigY700 test results are dynamic and need checking from time to time.  

(3) not only can we not tell which of a man's PVs are the oldest and which are the most recent (they thus form a "block" of SNPs), but blocks of older SNPs still exist where not enough BigY700 or similar tests have yet been taken to subdivide the SNPs in these blocks to different lineages of descendants.  Some of these blocks still contain scores of SNPs, so that the c.3,000 SNPs of most testers is typically reduced to about 50 single SNPs or blocks of SNPs.   The more recent blocks in the ancestry of BigY testers are shown on their "Block Tree".  Somewhat confusingly the SNPs within a block are known as "equivalent" SNPs.  The sequence in which these "equivalent" SNPs are shown within a block is not significant.  A block of SNPs may be named after one of these "equivalent" SNPs, often the "top" SNP, but future research may show this SNP is not the oldest.  The TMRCA associated with a block of SNPs is that of the youngest SNP, even if we do not (yet) know which of the equivalents is the youngest.  Similarly the estimated date that the block was formed is the date of its oldest SNP, even if we don't (yet) know which of the equivalent SNPs is the oldest.     

(4)  the nomenclature of SNPs is confusing.  On discovery each SNP is known by its 7 or 8 digit position in the male chromosome, e.g. 779294GT.  FTDNA retain this label for PVs and only give a name such as R-FT34569, or R-L555, or L555+, when a specific PV is found to be shared by a new tester.  FTDNA call the youngest named SNP the terminal SNP (though there are some exceptions to this practice).  Confusingly another laboratory may prefer to give a SNP a different name, so for example R-L555 is also known as R-S393.  The ISOGG naming system (e.g. R1b1a2a1a2c1a5a) is no longer used as it needed periodic updating and was becoming too cumbersome.  The prefix R-, or R1b-, in these examples, sometimes omitted, refers to the haplogroup (i.e. very old or "high-level" SNP).  See ybrowse.org for a full list of SNPs with their synonyms and locations on the human chromosome.

(5) The rate at which SNPs mutate varies widely.  Thus for example the date of a man's terminal SNP depends on how many of his relatives have taken a BigY test, how many PVs he has that his relatives do not share, and how frequently these PVs have mutated.  The average mutation rate for all BigY700 SNPs is one SNP per 83 years (i.e. about once every 2 or 3 generations), and for our large Borders branch of Irwins is currently about the same (depending on how it is calculated).  However different lineages within this branch have average mutation rates ranging from 1 SNP mutation per generation to 1 per 10 generations, and within some lineages the range is even wider.  Nevertheless FTDNA's new "Discover" haplotree and "Time Tree" give estimated dates for all named SNPs.  Note that SNP-based TMRCAs are much more reliable than STR-based TMRCAs.

(6)  The BigY700 test results include much detailed information that I find to be of little relevance to our surname study.  The pages that I find most useful are "Block Tree", Results - Private Variants, and under "Discover":  Haplogroup Story, Time Tree and Scientific Details.  These pages deserve close attention.

(7) Project administrators can embellish haplotrees of downstream SNPs by adding STR data (making what is known as a "mutation history tree") and other data (making what I call a "genetic family tree").  
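The spread of mutation rates described in point (5) above shows why simply counting SNPs gives only a rough age.  The Python sketch below is a naive illustration and is not the method used by FTDNA's dating tools: the 83 years per SNP and the range of 1 SNP per generation to 1 per 10 generations come from the text, while the generation length of 30 years and the names used are assumptions.

```python
AVERAGE_YEARS_PER_SNP = 83  # average quoted in point (5)
YEARS_PER_GENERATION = 30   # assumption made for this illustration only


def naive_age_range(snp_count):
    """Rough (shortest, average, longest) age in years for a run of SNPs.

    Shortest assumes 1 SNP per generation, longest 1 SNP per 10 generations.
    """
    shortest = snp_count * 1 * YEARS_PER_GENERATION
    average = snp_count * AVERAGE_YEARS_PER_SNP
    longest = snp_count * 10 * YEARS_PER_GENERATION
    return shortest, average, longest


# Five SNPs separating a tester from an ancestor on the haplotree
print(naive_age_range(5))
```

With these assumptions five SNPs could represent anything from 150 to 1,500 years, with 415 years as the average-rate estimate, which is why published SNP dates should always be read together with their confidence intervals.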

4.  Latest Analysis Update.

This accompanying webpage, updated every six months, includes a bespoke Main Results table with STR and SNP data for each project member, a bespoke Clan Irwin Haplotree showing how all the 40+ branches of the Irwin surname had split off before the beginning of the surname era, and a bespoke genetic family tree including all Border Irwin L555 BigY testers and their tested close relatives and showing how they are related to one another.  These focussed interpretations and accompanying discussions show how individual testers are contributing to the study of the evolution of the many branches of our surname.  The picture is continuously evolving and each member has personal interests or concerns, so as the Study's Administrator I am happy to answer queries from both members and non-members.

5.  Further guidance on understanding Y-DNA test results

See www.isogg.org/wiki.