Interpreting yDNA Test Results

Contents

1. Introduction

2. STR test results

3. SNP test results and haplotrees

4. NGS test results, including BigY

5. Size of FTDNA database and test kit prefixes

6. Further information

1. Introduction

There are two types of basic yDNA tests: STRs and SNPs (pronounced "snip"s). An analogy is that STRs identify the leaves of a tree, while SNPs identify the twigs and branches.

STRs are liable to relatively frequent mutations and to convergence, whereas SNP mutations are much more stable and reliable.

ySTR tests and ySNP tests complement one another. However the first test is always a STR test, and individuals tackling genetic genealogy for the first time can ignore SNP tests to start with.

ySTR tests predict, and ySNP tests confirm a tester's haplogroup (analogy: the branches of a tree). Haplogroups are used in ethnicity/Deep Ancestry studies, which relate to the pre-surname era and are not directly relevant to surname genealogy. But two testers with the same surname but different haplogroups cannot be paternally related to one another during the surname era, that is the last 600-1,000 years.

STRs are great for predicting which branch of our surname an Irwin is descended from, but are not a reliable indicator of how closely one Irwin is related to another. For members of the large Border Irwin ("L555") branch of our surname, SNPs identify which sub-group of this branch they are descended from, and who their closest relatives are.

2. STR test results

FTDNA publish the results of an STR test in the tester's personal on-line webpage. The results, commonly known as the tester's haplotype, or genetic signature, comprise a number of markers, typically 37, mostly identified by a "DYS" number, and a count of the number of times each of these markers is repeated, known as the marker count.

The yDNA STR test results of a single tester are of little value until they are compared with those of another tester. From such comparisons a very approximate probability of the two testers sharing a common ancestor can be assessed. Clusters of testers sharing a common ancestor within the surname era, typically the last millennium, are known as surname branches (aka genetic families or groups; the terms "lineage" and "cluster" are also used, but the latter term may include testers whose common ancestor lived before the surname era). Despite their limitations discussed below, STR genetic signatures are a reliable tool for determining to which branch of a surname a tester belongs.

While determining the marker counts that make up a genetic signature is a strict scientific process, determining and expressing the probabilities of the comparisons of genetic signatures sharing a common ancestor is much less precise. The probabilities are complex mathematical functions dependent on many variables, including the number of markers tested, the number and magnitude of the mismatching markers, and the different rates of mutation of individual markers (slow mutating markers are useful for grouping testers' results, fast mutating markers for differentiating between results). And several assumptions are needed of the likely number of generations elapsed since the most recent common paternal ancestor (MRCA).

Several tools may be used to assess comparisons of two testers' genetic signatures. The tools I use include the following. Some FTDNA customers and other project administrators prefer to rely on FTDNA's "Matches" pages, which I discuss in section 2.8 below.

2.1 Haplogroups, the DNA signatures associated with basic ethnic groups used in Deep Ancestry Studies. FTDNA predict the relevant haplogroup from each genetic signature (in red in results tables). Testers with different haplogroups are not genealogically related within the surname era. These haplogroup predictions are generally reliable, and can be confirmed by SNP ("snip") tests - see section 3 below.

2.2 Number of matching markers. This simple indicator can be used to give a very approximate indication of the probability of the number of generations since the two DNA signatures shared a common ancestor. The following table is from FTDNA's former faq512:

STR TMRCA probabilities

To place these numbers of generations in context, a popular "rule of thumb" is 3 generations per century.

It is important to note that this table, and others like it, are only a very rough guide. The probabilities shown are averages, and are often misleading for individual comparisons. As will be seen, our Study includes two brothers who have only 23 of 25 matching markers, while we have about two dozen testers with 37 of 37 matching markers, of whom half also have 67 of 67 matching markers, none of whom have been able to use this Study to determine their genealogical relationship (unlike several other testers with non-identical, albeit close, genetic matches who have succeeded in identifying genealogical relations). These examples illustrate that the mutation of individual markers is a random process.

2.3 Genetic distance ("GD"). This popular measure is a slightly more sophisticated version of the number of matching markers. Genetic Distances are expressed in terms of the differences or "steps" between each marker, in terms of the number of markers compared, e.g. ‘0/12’ or ‘1/37’. There are various models for calculating genetic distance. FTDNA now calculate Genetic Distance as the sum of the differences of individual marker counts, e.g. a distance of 3 may include three 1-step mismatches, or one 2-step mismatch plus one 1-step mismatch. Note:

  • Different rules apply for multi-copy markers such as DYS 385, 389, 464 and YCA: see https://dna-explained.com/2016/07/27/y-dna-match-changes-at-family-tree-dna-affect-genetic-distance/ ;

  • FTDNA's Y-DNA "Matches" pages arbitrarily assume a "Match" when testers have STRs with a GD of 4/37 or less (or 1/12 or 2/25 or 7/67 or 10/111 or less). See section 2.8 below.

  • Small Genetic Distances alone (the basis of FTDNA's "Matches") should not be seen as proof of a close genealogical relationship; other evidence, such as similarity of surname AND some geographical link or shared common ancestor, should be established before attempting to contact a "Match". This is especially relevant to members of large genetic families such as the Border Irwins, at least until SNP evidence also suggests such a relationship.
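As a rough illustration of sections 2.2 and 2.3, the sketch below counts matching markers and sums the step differences between two genetic signatures. The marker names and counts are invented for the example, and the special rules for multi-copy markers (DYS 385, 389, 464 and YCA) are deliberately ignored, so this is not FTDNA's actual calculation.

```python
# Illustrative sketch only: the marker counts below are made up, and the
# special rules for multi-copy markers (DYS 385, 389, 464, YCA) are NOT applied.

def matching_markers(sig_a, sig_b):
    """Count the markers present in both signatures that have identical counts."""
    shared = sig_a.keys() & sig_b.keys()
    return sum(1 for marker in shared if sig_a[marker] == sig_b[marker])


def genetic_distance(sig_a, sig_b):
    """Sum the absolute differences ("steps") in marker counts over shared markers."""
    shared = sig_a.keys() & sig_b.keys()
    return sum(abs(sig_a[marker] - sig_b[marker]) for marker in shared)


# Two hypothetical testers compared on four markers.
tester_1 = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
tester_2 = {"DYS393": 13, "DYS390": 25, "DYS19": 14, "DYS391": 9}

gd = genetic_distance(tester_1, tester_2)      # one 1-step + one 2-step mismatch
same = matching_markers(tester_1, tester_2)
print(f"GD {gd}/{len(tester_1)}, {same} of {len(tester_1)} markers match")
```

Here one 1-step mismatch and one 2-step mismatch give a Genetic Distance of 3, although only two markers differ, which is why GD is a slightly finer measure than a simple count of matching markers.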

2.4 Time since Most Recent Common Ancestor (TMRCA). Tables and graphs may be used to convert Genetic Distance into the number of generations since two testers shared a common ancestor. Although TMRCAs are expressed in years, making them readily comprehensible, and are a powerful tool for deep ancestry studies, alas the associated margins of error are so great that it is a most unreliable tool when used to date ancestors within the surname era. And like the number of matching markers and Genetic Distance, TMRCA dates are also unreliable because they assume some single average mutation rate for all markers, while in practice the average mutation rates for individual markers vary enormously.

TMRCA's may also be calculated from SNPs (see section 3 below). Average SNP mutation rates are more reliable than average STR mutation rates - for BigY700 tests the average mutation rate is once per 83 years, or about one SNP mutation every three generations. But the SNP mutation rate for a particular individual may be far from this average.

Both STR- and SNP-based TMRCA predictions should always be accompanied by Confidence Intervals, but even these often give a misleading impression of both accuracy and precision.

2.5 FTDNA’s ‘TiP’ probability. This is a sophisticated STR-based tool that encompasses a large number of variables in a single probability figure. Unlike Matches, GDs and TMRCAs it takes account of differing average mutation rates of individual markers, and responds to "resolution" (i.e. the number of markers analysed): as the resolution is increased from 12 to 37 to 67, so the TiP probabilities of common ancestry of two testers tend to polarise towards 0% or 100%. But there are many exceptions to this generalization, and 12-marker TiP %s are particularly unreliable. The number of generations since a possible MRCA is also important. For most testers with the same or similar surnames this is typically a maximum of about 20 generations.

FTDNA's TiP tool represents their best understanding of the impact of differing average mutation rates of individual STR markers. It is still almost unique in utilizing a weighted average STR mutation rate to calculate TMRCA (Time to Most Recent Common Ancestor) data. However I believe that the published TiP % probabilities give a misleading impression of accuracy and, since 2016, that they are biased, i.e. that the true TiP % probabilities should be lower than shown.

Nevertheless this Study still uses TiPs as the best available tool for assessing the relative probabilities of an individual tester being related to the tester whose genetic signature is closest to the modal (i.e. most common) signature in a branch (aka genetic family). This Study assumes that if the 24-generation, no-paper-trail TiP % for the highest available resolution (known as the TiP Score) is over 60% (formerly over 80% - see Appendix C of the accompanying Supplementary Paper No.1) the genetic signatures being compared can be assumed to be those of members of the same surname branch. TiP % are thus used to group together the members of each branch. The genetic signature that is most common within each branch is known as the modal genetic signature. The modal genetic signature may be the signature of the common ancestor of the branch, but this is not necessarily so.
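The grouping rule and the idea of a modal signature can be sketched as follows. The kit labels, TiP percentages and four-marker signatures are invented for illustration; in practice the TiP Scores would be read from FTDNA's TiP tool rather than computed here.

```python
from collections import Counter

# Hypothetical data: kit labels, TiP Scores (%) and signatures are invented.
BRANCH_THRESHOLD = 60.0  # TiP Score above which this Study assumes the same branch


def same_branch(tip_score, threshold=BRANCH_THRESHOLD):
    """Apply the Study's rule: a TiP Score over the threshold implies the same branch."""
    return tip_score > threshold


# TiP Score of each kit against the tester closest to the branch modal signature.
tip_scores = {"kit A": 97.2, "kit B": 61.5, "kit C": 42.0}
members = [kit for kit, score in tip_scores.items() if same_branch(score)]

# The modal signature is taken here as the most common complete signature.
signatures = {
    "kit A": (13, 24, 14, 11),
    "kit B": (13, 24, 14, 11),
    "kit D": (13, 25, 14, 11),
}
modal_signature, occurrences = Counter(signatures.values()).most_common(1)[0]

print(members)
print(modal_signature, occurrences)
```

With these made-up figures kits A and B clear the 60% threshold and kit C does not, and the signature shared by two of the three kits is the modal one.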

2.6 NPEs. Although in theory yDNA and surnames are both inherited through successive generations of the male ancestral line, in practice such lines occasionally experienced a change in the surname. In Surname DNA studies instances of such events are euphemistically termed Non Paternal Events. This term has many synonyms. Examples of NPEs include:

  • A formal change of surname, typically a 20th century event, but sometimes earlier, e.g. to inherit land from a father-in-law.

  • An informal change of surname, typically in the 13th to 19th centuries, for example when a young boy's father died and he was given the surname of his mother (in Scotland females retained their maiden names until the 19th century) or, if she remarried, of his step-father; or if a boy was orphaned or a waif, and was given the name of his guardian.

  • A change of surname before these had become strictly hereditary, typically in the 12th to 19th centuries, for example a patronymic when a boy was given the forename of his father, or a man became known by his nickname or occupation, or by where he lived or came from, or when a clan member, tenant, apprentice, servant or slave took the surname of his chief, laird or master. This practice seems to have been particularly prevalent in the Scottish Borders. Sometimes such 'alias' surnames were used concurrently with paternal surnames, which later lapsed.

  • An illegitimacy or infidelity, covert or otherwise, at any period, and the child was given the surname of his mother or her partner or husband.

NPEs in our Study can be manifest in two ways: those testers who today use the Irwin surname or similar but share the genetic signature of some other surname, and those who share the genetic signature of one of the branches of the Irwin surname but today use a different surname. In the case of the latter I require a TiP Score of 95% for an individual testee to "qualify" for membership of our Study. For further discussion of the interpretation of test results see section 7 and Appendix D of the accompanying Supplementary Paper 8, slides 25-30 of the lecture at Supplementary Paper 9, and my contribution at http://www.isogg.org/wiki/NPE.

Awareness that one's paternal ancestry included a NPE can be disappointing, particularly to genealogists who have long believed they are descended from a particular branch of their surname. But it is important to remember that a majority of NPEs were not associated with any untoward event, that most surnames are not derived from a single ancestor, that DNA evidence is never 100% proof of anything, and that some NPE branches of a surname may be older than branches that are not NPEs. NPEs traditionally occur at a rate of 1-2% per generation in paternal ancestral lines, but these rates are cumulative and NPE ancestry is thus much more common than is widely assumed. Testers with NPE ancestry can celebrate a heritage of two surnames, just as the heritage of a surname can be shared by all its branches.
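To see how quickly a 1-2% rate accumulates, the sketch below computes the chance of at least one NPE somewhere in a paternal line. It assumes, purely for illustration, that the rate is constant and that each generation is independent; the 20-generation span is the approximate surname-era maximum mentioned in section 2.5.

```python
def cumulative_npe_probability(rate_per_generation, generations):
    """Probability of at least one NPE in a line of the given length,
    assuming a constant, independent rate in every generation."""
    return 1.0 - (1.0 - rate_per_generation) ** generations


for rate in (0.01, 0.02):
    p = cumulative_npe_probability(rate, 20)
    print(f"{rate:.0%} per generation over 20 generations: {p:.0%}")
```

Under these assumptions the chance of at least one NPE over 20 generations is roughly 18% at a 1% rate and roughly 33% at a 2% rate.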

For inspiring examples of how genealogical research can resolve NPE test results see the accompanying Supplementary Paper No.8 and, if you can get hold of a copy, Richard Hill's fascinating book Finding Family.

2.7 Singletons, Mismatches, False Positives and Convergence/Back mutations. Testers with the surname Irwin or similar who do not (yet) closely match any other tester are considered to be "Singletons". This is, of course, hopefully only a temporary status! Many are NPEs.

Testers with a surname dissimilar to Irwin who have a close match with an Irwin at low STR resolutions but who fail to make the TiP Score 95% cut-off at higher STR resolutions, or whose SNP test results are not compatible with the Irwin tester with whom they match, are termed False Positives.

False Positives are one example of "convergence", a term used in genetic genealogy to describe the process whereby two different STR genetic signatures have mutated over time and experienced "back mutations" to become identical or near identical, resulting in an accidental or coincidental match. Many of the "Matches" identified on the FTDNA YDNA "Matches" web pages which have different surnames can be explained by convergence. Convergence is said to be more likely at lower resolutions (1-12 or 1-25 markers) than high (1-67 or 1-111 markers), and to be more common in some haplogroups than others, but as convergence is "invisible" little is yet known of its true extent.

2.8 FTDNA's Y-DNA Matches pages. These pages have been prepared by FTDNA to help their customers understand the results of their yDNA STR tests. Testers in a well-developed surname project such as this are better served by our Main Results table, which shows the testers to whom they are most closely related, but some explanation of the Matches pages is in order. The following points are relevant:

  • FTDNA identify a tester's "Matches" by their name and e-mail address but not by their kit number. For privacy reasons neither FTDNA nor administrators ever include e-mail address and kit number at the same time. "Matches" pages help matching testers to contact each other, but for reasons I explain below this exercise is unlikely to be fruitful, and will typically result in a very poor response level as recipients soon tire of unsolicited and poorly-justified approaches.

  • "Matches" can be identified at different levels of resolution (i.e. 12, 25, 37, 67 or 111 markers), up to the highest level of the two testers concerned. Thus if a tester has only tested to 37 markers he cannot have matches at 67 or 111 markers.

  • "Matches" are ranked by Genetic Distance (see above). "Matches" at 12 marker level include testers with GDs or "steps" of 0 and 1; at 25 marker level they include GDs of 0, 1 and 2; at 37 marker level they include GDs of 0, 1, 2, 3 and 4; at 67 marker level they include GDs of 0, 1, 2, 3, 4, 5, 6 and 7; at 111 markers they include GDs of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.

  • These "cut-off" GDs of 1, 2, 4, 7 and 10 are arbitrary. Testers with higher GDs may be related within the surname era, but the probability of this is lower and these testers are not listed on FTDNA's Matches pages - they are thus "false negatives". For example our Study has identified some men who have tested L555 positive but have GDs of 5/37 or more from the L555 modal STR values. We even have examples of two men who are L555 positive but have a GD of 13/37 from one another. The 37-marker modal GDs of the B(1) section of our Study's results on 1 November 2021 were thus:

GD               0/37  1/37  2/37  3/37  4/37  5/37  6/37  7/37  8/37  9/37  10/37  Total
No. of testers     61    70    83    63    49    15     8     0     0     1      0    350
%                 17%   20%   24%   18%   14%    4%    2%     0     0  0.3%      0   100%

  • Similarly testers who are listed as "Matches" (i.e. have GDs of 1/12, 2/25, 4/37, 7/67 or 10/111 or less) are not necessarily related within the surname era. This is especially likely if the surnames are dissimilar, or if their haplogroups are different. These "false positives" occur because of convergence (see above).

  • Conversely "Matches" of testers with dissimilar surnames but the same predicted haplogroup may be true matches disguised by a NPE (see above) in the ancestry of one of the testers.

  • Any individual tester will obviously have many more "Matches" at 12 markers than at 111 markers. Indeed "Matches" at 12 markers are usually best ignored. In theory if another tester is not a "Match" at a given level he will not be a "Match" at a higher level, although in practice a "Match" can occasionally appear at a higher level because of convergence.

  • The number of "Matches" that individual testers have can vary widely. Few or even no "Matches" may be listed at higher resolutions simply because few or no testers have (yet) tested to that level. The number of Matches at mid-level, say at 37 markers, will vary from tester to tester depending on the number of testers with similar surnames who have tested. Testers with a large number of "Matches" with other testers with similar surnames will probably be members of a large branch of the surname.

  • The number of Matches at mid-level, say at 37 markers, will also vary from tester to tester depending on how close their STR signature is to that of a popular haplogroup such as M222, U106, M269 etc. Such individuals may have hundreds of "Matches", but many of these will be False Positives, i.e. with surnames dissimilar to their own and also dissimilar to each other. This contrasts with a tester with NPE ancestry, whose "Matches" are typically limited to two surnames.

  • The number of "Matches" a man has is constantly increasing as more men take yDNA tests.
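The cut-offs listed above can be written as a simple lookup, and applied to the B(1) table of 1 November 2021 to see how many testers fall outside the 37-marker cut-off measured from the modal signature. This is a sketch of the published thresholds only; the function and variable names are hypothetical and it is not FTDNA's own code.

```python
# Cut-off Genetic Distances per resolution, as listed in the points above.
MATCH_CUTOFFS = {12: 1, 25: 2, 37: 4, 67: 7, 111: 10}


def is_listed_match(gd, markers):
    """True if a Genetic Distance would be listed as a "Match" at this resolution."""
    return gd <= MATCH_CUTOFFS[markers]


# B(1) testers by 37-marker Genetic Distance from the modal (table above).
testers_by_gd = {0: 61, 1: 70, 2: 83, 3: 63, 4: 49, 5: 15, 6: 8, 7: 0, 8: 0, 9: 1, 10: 0}

total = sum(testers_by_gd.values())
beyond = sum(n for gd, n in testers_by_gd.items() if not is_listed_match(gd, 37))
print(f"{beyond} of {total} testers ({beyond / total:.1%}) lie beyond the 4/37 cut-off")
```

On the table's figures, 24 of the 350 testers (about 7%) sit at 5/37 or more from the modal, so a tester carrying the modal signature would not see them listed as "Matches".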

2.9 Caution. Prospective testers should be aware that some DNA test results have unexpected implications. Disappointments can occur for several reasons:

  • if there has been a NPE in the paternal ancestral line;

  • if the results contradict some cherished genealogical research, tradition or understanding;

  • if the comparisons are indeterminate, e.g. if a tester does not (yet) appear to be genetically related to anyone else in the study;

  • if the test does not lead to identifying any "new" genealogical relatives (because few surname DNA studies have sampled more than 1% of those with the surname who are alive today).

Notwithstanding these contingencies over 90% of testers in our Study have been shown to be members of one of the various branches that have been identified.

3. SNP test results and haplotrees

3.1 SNPs. SNPs (pronounced "snip"s) are another feature of yDNA samples which can be analysed. An analogy is that STRs identify the leaves of a tree, but SNPs identify the twigs, while sub-clades and haplogroups identify the main branches of a tree. Although ySTR tests and ySNP tests complement one another, STRs are liable to relatively frequent mutations, whereas SNP mutations are much more stable, and so analyses based on SNP data are much more reliable. SNPs are cumulative, each man inheriting all the SNPs of his paternal ancestors, plus sometimes one or even two unique to himself. Thus each man inherits hundreds of SNPs. However, because the ages of the mutations that gave rise to the younger SNPs are not readily apparent, these analyses are not straightforward.

SNP test results are very different from STR test results. There are two types of basic SNP test: single SNP tests and (multiple) SNP Pack tests. In both, the saliva sample simply tests positive or negative for a given SNP, e.g. L555+ or L555-, i.e. they are binary, and not probabilistic or "discovery" tests. Next Generation Sequencing (NGS) tests such as BigY are "discovery" tests that identify new SNPs and are much more sophisticated and comprehensive. For more details of SNP tests see Supplementary Paper 5.

The nomenclature used in describing SNP test results can be confusing. SNP tests are usually identified by an alphanumeric such as R-L21-L555, R-L555 or just L555. L555 or L555+ indicates the test was positive for this SNP, L555- means it was negative. Negative test results are usually not reported as they are so numerous. The prefix letter indicates the laboratory/individual which/who has named the SNP, e.g. SNPs “L” were named by FTDNA (who also use the prefixes BY (for BigY500 discoveries) and FT (for BigY700 discoveries)). Other organizations use different prefixes (for readers curious about these prefixes see www.isogg.org/tree). Confusingly many SNPs have synonyms, e.g. L21 = M529, L555 = S393. SNPs are also known by their location on the human genome, e.g. L555 aka S393 is located at position 7779294. See ybrowse.org for a full list of SNPs with their synonyms and locations on the human genome. FTDNA no longer "name" Private SNPs (see below) but refer to them by their position on the human genome.

A further nomenclature challenge is the meaning of terms such as "known SNPs", "private SNPs" and "terminal SNPs", not least because as more SNPs are discovered these labels are liable to change. Private SNPs were sometimes used to relate to those specific to a surname, and terminal SNPs to the most recent/youngest known SNP. FTDNA are now using the term "Terminal SNP" to denote the most recent SNP that is shared by two or more BigY testees, and "Private SNPs" as those SNPs not shared by any other BigY testee. Of course these demarcations are transient as more individuals take the BigY test, and may change as the FTDNA haplotree is further refined (see below). Because of lack of definition and the inherent instability of these terms I prefer not to use them. The relative terms "upstream" or "older" SNPs and "downstream" or "younger" SNPs are also sometimes used - see below. SNPs are sometimes known as "Variants", a term which technically also includes STRs and Indels.
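The short-form notation can be read mechanically, as in the sketch below. It handles only the simple forms such as L555, L555+ and L555- (not chained forms such as R-L21-L555), and the synonym table holds just the two examples given above; the function name and the returned fields are hypothetical.

```python
import re

# Only the two synonym pairs quoted in the text; ybrowse.org has the full list.
SYNONYMS = {"M529": "L21", "S393": "L555"}


def parse_snp_result(text):
    """Split a result such as 'L555+' into its name, naming prefix and sign.

    A missing sign is read as positive, since negative results are
    usually not reported.
    """
    match = re.fullmatch(r"([A-Z]+)(\d+)([+-]?)", text.strip())
    if match is None:
        raise ValueError(f"not a simple SNP result: {text!r}")
    prefix, number, sign = match.groups()
    name = prefix + number
    return {
        "name": name,
        "canonical": SYNONYMS.get(name, name),  # map a known synonym to one name
        "prefix": prefix,                       # e.g. L, BY or FT for FTDNA-named SNPs
        "positive": sign != "-",
    }


print(parse_snp_result("L555+"))
print(parse_snp_result("S393-"))
```

Treating S393 and L555 as one SNP in this way avoids counting a synonym as a separate result.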

3.2 Haplotrees. All SNPs can be placed on the haplotree (aka phylogenetic tree) of mankind, a genetic tree that goes back to the genetic Adam. Until recently haplotrees were mainly relevant to genetic anthropologists and others interested in ethnicity/Deep Ancestry studies, and their relevance to genetic genealogists in general and to this Study in particular had been very limited. However since about 2010 SNPs and haplotrees have become increasingly relevant to genetic genealogists as more and more SNPs are discovered and haplotrees expand downstream towards and now even into the surname era, i.e. the last millennium and up to the present day. SNPs have been identified from mutations that occurred during the 20th century, and two brothers can get different SNPs from their BigY tests. Downstream haplotrees such as the L555 Border Irwin haplotree (see below) are "at the cutting edge" and are throwing new light on many genealogical challenges.

The position of a SNP on a haplotree may be indicated in various ways, thus L21 may be termed R-L21 or R1b-L21, where R is the haplogroup and R1b is the sub-clade. Or some call R1b the haplogroup and R1b1a2a2a1a2c the sub-clade that defines L21. The latter hierarchical form is logical but both clumsy and liable to be changed, so more descriptive forms such as R>R1b>M343>M269>P312>L21 are now more popular. Thus L555 may be termed R1b>M343>M269>P312>L21>Z251>L555, or simply as R-L555, or some intermediate description if preferred.
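The descriptive path form lends itself to simple "upstream"/"downstream" checks. The sketch below uses the L555 path quoted above; the function name is hypothetical and only SNPs on this one path can be compared.

```python
# The L555 path exactly as written above, oldest on the left.
L555_PATH = "R1b>M343>M269>P312>L21>Z251>L555".split(">")


def is_upstream(older, younger, path=L555_PATH):
    """True if `older` appears before `younger` on the path.

    Raises ValueError if either name is not on this path.
    """
    return path.index(older) < path.index(younger)


print(is_upstream("L21", "L555"))   # L21 is older than (upstream of) L555
print(is_upstream("L555", "P312"))  # L555 is younger than (downstream of) P312
print(f"R-{L555_PATH[-1]}")         # the short form of the same position
```

The short form keeps only the haplogroup letter and the youngest SNP on the path, which is why it survives revisions to the intermediate steps.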

SNPs often appear on haplotrees as a "block" of phylogenetically equivalent SNPs. The order of mutations of the SNPs within each block is currently unknown, but may be resolved as more NGS test results become available.

SNPs are being discovered frequently (in May 2019 there were 160k known SNPs) and their relationships to one another are occasionally revised, so alas there is no single, comprehensive, up-to-date haplotree, and even if there were it would be too cumbersome to replicate legibly. Several haplotrees are relevant to this Study:

  • FTDNA's haplotree (on each tester's personal webpage/account under Y-DNA > Haplotree & SNPs) shows the haplotree relevant to the individual tester. It used to be very outdated, but since spring 2016 it has been much more comprehensive, especially for SNPs that FTDNA have "discovered" from customers' BigY tests. SNPs that have tested positive are shown in green, SNPs that have tested negative are shown in red. Untested SNPs are shown in orange (upstream, presumed positive), grey (upstream, presumed negative) or blue (downstream). Note that on FTDNA's public pages (e.g. https://www.familytreedna.com/public/irwin/default.aspx?section=yresults) (and on the main results table of this Study), haplogroups confirmed by SNP testing are shown in green, haplogroups predicted from STR data are shown in red. FTDNA's haplogroup predictions are very reliable.

  • ISOGG's haplotree (at www.isogg.org/tree) is more comprehensive but less up-to-date and excludes many downstream aka Private SNPs. Like the FTDNA haplotree this is presented as a table, with the oldest SNPs on the left, the younger "sons" and "grandsons" successively indented towards the right.

  • Alex Williamson's excellent Big Tree (at www.ytree.net) is restricted to R-P312 and its downstream SNPs (including L555), but includes few SNPs identified by SNP Pack tests. This haplotree is presented more like a conventional family tree, with the oldest SNPs at the top, and successive "sons" and "grandsons" below.

  • The Clan Irwin haplotree (at LATEST ANALYSIS UPDATE) is edited (in BigTree format) to show only the haplotree branches relevant to the 40+ Genetic families identified in this Study. This includes the Border Irwin L555 SNP, but no details downstream thereof.

  • The Border Irwins L555 haplotree (a downstream amplification of the Clan Irwin haplotree) is now shown in two formats: (1) within the Master Results table in the LATEST RESULTS TABLE (FTDNA format), and (2) in the Border Irwins section of LATEST ANALYSIS UPDATE (BigTree format). The latter is now extended to include testees who can be included because of L555 Pack test results, single SNP test results, STR data, Family Finder connections or genealogical relationships.

3.3 TMRCAs. The number of true generations separating each "son"/"grandson" on a haplotree varies greatly, depending on when relevant mutations occurred. On average one BigY500 SNP mutation occurred every 131 years, or about once every 4 generations, and one BigY700 SNP occurred every 83 years, or about once every 3 generations. However for any individual the mutations of the SNPs they have inherited may be separated by 15 or more generations, or by just one generation (our Study has an example of two brothers, of whom one inherited his very own Private SNP!). TMRCA's calculated for individual branch lines on the basis of average SNP mutation rates are more reliable than those based on average STR mutation rates, but are still neither accurate nor precise, even when averaged over several lineages within a particular surname branch.
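A back-of-envelope version of such a calculation is sketched below: the number of SNPs each lineage carries below the shared branch point is averaged and multiplied by the average years per SNP. The SNP counts are invented, the function name is hypothetical, and the result carries all the uncertainty described above; it is not a substitute for a properly derived estimate with confidence intervals.

```python
# Average years per SNP mutation, as quoted above for each test.
YEARS_PER_SNP = {"BigY500": 131, "BigY700": 83}


def snp_tmrca_years(snp_counts, test="BigY700"):
    """Rough TMRCA in years: the mean number of SNPs each lineage has below
    the shared branch point, multiplied by the average years per SNP."""
    mean_snps = sum(snp_counts) / len(snp_counts)
    return mean_snps * YEARS_PER_SNP[test]


# Two hypothetical lineages with 5 and 7 SNPs below their shared SNP.
years = snp_tmrca_years([5, 7])
print(f"about {years:.0f} years before the testers' births")
```

Because a single lineage may carry far more or far fewer SNPs than the average, averaging over as many lineages as possible is the main way to steady such an estimate.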

3.4 FTDNA's BigY "Matches" data and "Non-matching variants". I personally find these confusing. Their "Block Tree"s are more useful, but sometimes not up-to-date, and I refer L555 testers to their "Haplotree and SNPs" pages and to the L555 genetic family tree that is continuously evolving for this Study (see Latest Analysis Update).

4. Next Generation Sequencing (NGS) e.g. BigY test results

These tests are expensive (though their price has fallen considerably) but they give much more comprehensive results than STR and SNP Pack tests. For details see above and Supplementary Paper 5.

5. Size of FTDNA data base and test kit prefixes

The number of Matches any DNA test kit can have is in part a function of the size of the relevant data base of the testing company concerned. For yDNA tests FTDNA's data base is by far the largest in the world, which is one of the reasons this Study recommends members take a FTDNA yDNA test. FTDNA do not publish the size of their databases, but a quantitative estimate of their direct-to-customer DNA database was attempted by Martin McDowell in February 2020 using his awareness of FTDNA's kit numbering system. He found the following:

  • non-prefixed kits: 925,000

  • (International) kits: 84,000

  • MK (Multi-kit orders, USA): 67,000

  • MI (Multi-kits, international): 54,000

  • AM (Amazon orders): 32,000

  • N (transfers from National Genographic): 271,000

  • B (transfers from other testing companies): 612,000

  • 27 other letter prefixes: 71,000

Some of these test kits have not been used, but he estimated that at that date FTDNA's database exceeded 1,700,000. By 2021 it probably exceeds 2 million. Of these nearly 1 million include yDNA tests.

FTDNA's yDNA Matches pages draw on all their kits, irrespective of any prefix, although for technical reasons Matches of some transfers from other testing companies may be limited.
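For readers who wish to check the arithmetic, the kit counts listed above can simply be totalled, as sketched below. The dictionary keys are shorthand labels for the listed categories, and the difference between the total and the 1,700,000 estimate is only a rough indication of unused kits.

```python
# Kit counts by prefix category, February 2020, as listed above.
kits_by_prefix = {
    "non-prefixed": 925_000,
    "(International)": 84_000,
    "MK": 67_000,
    "MI": 54_000,
    "AM": 32_000,
    "N": 271_000,
    "B": 612_000,
    "27 other letter prefixes": 71_000,
}

total_issued = sum(kits_by_prefix.values())
estimated_in_use = 1_700_000  # the lower bound estimated at that date

print(f"kit numbers issued: {total_issued:,}")
print(f"implied unused kits: at most {total_issued - estimated_in_use:,}")
```

The listed categories sum to rather more than the 1,700,000 estimate, which is consistent with the remark that some kits have not been used.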

6. Further guidance on understanding test results

See www.isogg.org/wiki.