Paternal and Maternal Lines

We men inherit the Y chromosome from our fathers who got it from their fathers. So the Y chromosome can be used to trace your paternal lineage. Different sequences of alleles and mutations can be assigned to haplogroups where a haplogroup signifies common descent on the uniparental line.

According to my 23andme results, I belong to the paternal haplogroup R1a1a. This group is very common in Eastern Europe as well as South Asia. The distribution of R1a1a can be seen in the map below.

Similarly, we all inherit mitochondrial DNA from our mothers. The sequence of alleles and mutations on the mitochondrial DNA (mtDNA) is also organized into a phylogenetic tree.

I can trace my maternal line to Egypt (my great-grandmother) and thus I expected a maternal haplogroup common in the eastern Mediterranean. It turns out I belong to haplogroup H, which everyone and their mother belong to in Europe as can be seen in this map.

According to Wikipedia,

Haplogroup H is the most common mtDNA haplogroup in Europe. About one half of Europeans are of mtDNA haplogroup H. The haplogroup is also common in North Africa and the Middle East. The majority of the European populations have an overall haplogroup H frequency of 40%–50%. Frequencies decrease in the southeast of the continent, reaching 20% in the Near East and Caucasus, 17% in Iran, and <10% in the Persian Gulf, Northern India and Central Asia.

Since 23andme didn’t tell me which subgroup of H I belonged to, I used mthap by James Lick:

Your rCRS differences found:

HVR2: 263G
CR: 750G 1438G 4769G 15326G
HVR1: (16519C)

Best mtDNA Haplogroup Matches:

1) H
2) H26
2) H(16192)
2) H35
2) H24
2) H10
2) H25
2) H(195)
2) H33
3) H19

Amber’s maternal haplogroup is M4a, which is mainly found in South Asia.

You can see the Y-DNA haplogroup tree and the mtDNA tree online.


I have found out I am actually from West Virginia. Ok, I am just joking.

I knew that my family had a history of marriages among relatives. After all, I have only 10 great-great-grandparents instead of the usual 16. With my genome in hand, I set about quantifying the inbreeding.

First, I used David Pike’s Homozygosity tool. It analyzes your genome to find significant runs where the same haplotype is inherited from both parents. Large portions of the human genome are like that. The length of these homozygous regions, however, varies depending on the relation of your parents. If your parents are closely related (first cousins in my case), then you will have longer runs. If your parents are distantly related, then over the generations those genes have had a chance to recombine and so you will have shorter runs that are homozygous.

Overall, 71.767% of my autosomal (i.e. on chromosomes 1-22) SNPs are homozygous, and I have 41 runs of homozygosity (ROH) that are at least 200 SNPs long. Here are some of my longest runs:

  • Chr 1 has a ROH of length 6009 SNPs (30.95 Mb)
  • Chr 8 has a ROH of length 5819 SNPs (33.00 Mb)
  • Chr 9 has a ROH of length 5877 SNPs (57.81 Mb)
  • Chr 9 has a ROH of length 5941 SNPs (24.38 Mb)
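To make the idea of a run concrete, here is a minimal Python sketch of how consecutive homozygous SNPs could be counted in a 23andme raw data file. It is not David Pike's tool or his algorithm. It assumes the file is tab-separated (rsid, chromosome, position, genotype) with `#` comment lines, that rows are already sorted by chromosome and position, and that no-calls (`--`) are simply skipped rather than ending a run. The names `parse_raw`, `find_roh` and `min_snps` are hypothetical.

```python
from itertools import groupby

AUTOSOMES = {str(n) for n in range(1, 23)}


def parse_raw(lines):
    """Yield (chromosome, position, genotype) from 23andme-style raw data lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        _rsid, chrom, pos, genotype = line.split("\t")[:4]
        yield chrom, int(pos), genotype


def find_roh(snps, min_snps=200):
    """Return (chromosome, start, end, n_snps) for each autosomal run of
    at least min_snps consecutive homozygous calls."""
    runs = []
    # groupby only merges adjacent rows, hence the sorted-input assumption.
    for chrom, group in groupby(snps, key=lambda snp: snp[0]):
        if chrom not in AUTOSOMES:
            continue
        start = end = None
        count = 0
        for _chrom, pos, genotype in group:
            if "-" in genotype or len(genotype) != 2:
                continue  # assumption: no-calls neither extend nor break a run
            if genotype[0] == genotype[1]:
                if count == 0:
                    start = pos
                end = pos
                count += 1
            else:
                # A heterozygous call ends the current run.
                if count >= min_snps:
                    runs.append((chrom, start, end, count))
                count = 0
        if count >= min_snps:
            runs.append((chrom, start, end, count))
    return runs
```

With a real file this would be called as something like `find_roh(parse_raw(open("genome.txt")))`, where the filename is a placeholder. The length of a run in Mb is then `(end - start) / 1e6`.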

Let’s look at my homozygosity percentage by chromosome.

Chr 1: 71.734 %
Chr 2: 69.952 %
Chr 3: 65.741 %
Chr 4: 71.563 %
Chr 5: 69.270 %
Chr 6: 76.025 %
Chr 7: 69.445 %
Chr 8: 72.690 %
Chr 9: 93.323 %
Chr 10: 69.765 %
Chr 11: 71.866 %
Chr 12: 68.443 %
Chr 13: 74.184 %
Chr 14: 68.571 %
Chr 15: 73.087 %
Chr 16: 66.541 %
Chr 17: 77.555 %
Chr 18: 67.763 %
Chr 19: 66.267 %
Chr 20: 66.228 %
Chr 21: 79.902 %
Chr 22: 69.896 %
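A per-chromosome table like this one can be approximated with a few lines of Python. This is only a sketch under my own assumptions, not the method the Homozygosity tool uses: it reads tab-separated 23andme-style rows (rsid, chromosome, position, genotype), ignores no-calls, and reports homozygous calls as a percentage of called autosomal SNPs. The function name `homozygosity_by_chromosome` is hypothetical.

```python
from collections import Counter

AUTOSOMES = {str(n) for n in range(1, 23)}


def homozygosity_by_chromosome(lines):
    """Return {chromosome: percent of called SNPs that are homozygous}."""
    called = Counter()
    homozygous = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        _rsid, chrom, _pos, genotype = line.split("\t")[:4]
        # Assumption: only autosomal, fully called two-allele genotypes count.
        if chrom not in AUTOSOMES or len(genotype) != 2 or "-" in genotype:
            continue
        called[chrom] += 1
        if genotype[0] == genotype[1]:
            homozygous[chrom] += 1
    return {chrom: 100.0 * homozygous[chrom] / called[chrom] for chrom in called}
```

Because no-calls are dropped from the denominator here, the numbers may differ slightly from a tool that treats them differently.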

A majority of chromosomes seem to have reasonable percentages while chromosomes 4, 6, 8, 11, 13, 15, 17 and 21 are high. However, chromosome 9 is really weird: It is 93.323% homozygous.

David Pike writes that:

So far the largest ROHs in 23andMe V2 data that I am aware of consist of:

  • 9191 consecutive tested SNPs, corresponding to a DNA segment of length 49.99 Mb.
  • 6129 consecutive tested SNPs, corresponding to a DNA segment of length 39.05 Mb.
  • 5594 consecutive tested SNPs, corresponding to a DNA segment of length 28.95 Mb.
  • 4644 consecutive tested SNPs, corresponding to a DNA segment of length 27.71 Mb.

The highest percentage for overall autosomal homozygosity that I have so far seen from 23andMe V2 data is 71.763%.

As you can see, I am an extreme case.

A number of members at DNA Forums reported their homozygosity percentages. Of all those listed, mine is the 2nd highest.

According to the paper Genomic Runs of Homozygosity Record Population History and Consanguinity:

South/Central Asians and West Asians have more than three times as many ROH in all categories over 4 Mb long than sub-Saharan Africans and other Eurasians. 19% of individuals from these populations have ROH over 16 Mb in length, consistent with the high prevalence of consanguineous marriage (marriage between individuals who are second cousins or closer) in these populations.

My total ROH length (segments > 0.5Mb) is about 282Mb which is about 1.2 standard deviations above the Central/South Asian sample mean in that paper. But I am more than 1.7 standard deviations above the mean for longer segments (>5Mb).

Let’s take a look at a graph from the paper’s supplemental material which plots total ROH length versus number of homozygous segments:

My inbreeding coefficient based on the length of long (>5Mb) runs of homozygosity in my genome (fROH5) is about 0.11 while the average in the Central and South Asian sample for the HGDP dataset is 0.015 (not directly comparable due to different number of SNPs used to calculate).
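As a rough illustration of what an ROH-based coefficient measures, the sketch below divides the total length of long runs by the length of the autosomal genome covered by the SNP set. This is my reading of the general idea and not the paper's exact procedure. The function name `f_roh` and the numbers in the usage line are hypothetical.

```python
def f_roh(roh_lengths_mb, genome_length_mb, min_length_mb=5.0):
    """Fraction of the covered autosomal genome lying in runs of
    homozygosity longer than min_length_mb."""
    if genome_length_mb <= 0:
        raise ValueError("genome_length_mb must be positive")
    long_runs = [length for length in roh_lengths_mb if length > min_length_mb]
    return sum(long_runs) / genome_length_mb


# Hypothetical values: two long runs, one short run, 300 Mb covered.
print(f_roh([10.0, 20.0, 3.0], genome_length_mb=300.0))
```

The denominator depends on which SNPs a chip covers, which is one reason values computed from different datasets are only roughly comparable.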

Finally, I used Plink to calculate my inbreeding coefficient F using all the South Asians from my reference datasets. That coefficient comes out to be 0.1184.
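For the curious, this kind of coefficient compares how many homozygous genotypes you actually have with how many would be expected from the allele frequencies in the reference population: F = (O - E) / (N - E), where O is the observed number of homozygous genotypes, E the expected number and N the number of called SNPs. The sketch below is a simplified version of that idea. Plink's own implementation may differ in details such as small-sample corrections. The function name `inbreeding_f` and its input encoding (0, 1 or 2 copies of an allele, `None` for a no-call) are my assumptions.

```python
def inbreeding_f(genotypes, allele_freqs):
    """Estimate F = (O - E) / (N - E).

    genotypes:    per-SNP allele counts (0, 1 or 2), or None for a no-call
    allele_freqs: per-SNP frequency of the counted allele in the reference set
    """
    observed = 0    # O: observed homozygous genotypes
    expected = 0.0  # E: expected homozygous genotypes under random mating
    n = 0           # N: called SNPs
    for count, p in zip(genotypes, allele_freqs):
        if count is None:
            continue
        n += 1
        if count != 1:
            observed += 1
        expected += 1.0 - 2.0 * p * (1.0 - p)
    if n == expected:
        raise ValueError("no informative SNPs: every allele frequency is 0 or 1")
    return (observed - expected) / (n - expected)
```

A positive value means more homozygosity than random mating in the reference population would predict, while a negative value means less.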

Harappa Project New Site

As several people had asked, I have set up a separate website for the Harappa Ancestry Project at

I am keeping a link to the new site on the top menu bar here titled Harappa DNA.

I might also crosspost some items from the project here.

I have also set up a Facebook page for the Harappa Ancestry Project. Please like it on Facebook so I can get a nice short name for the Facebook page URL.

I have received several samples and will be reporting some analysis results soon. However, I do need lots of participants, so please spread the word.

Cross-posted at Harappa Ancestry Project.

Harappa Ancestry Project

I have recently become interested in (some would say obsessed with) genetics. I wrote about getting my DNA test done and there’s a lot more about my own results that I plan to bore you with.

One fun application of genetic testing is inferring ancestry: Which ancestral group are you descended from? Can we estimate the admixture of the different population groups you are descended from?

Most DNA testing companies provide information about ancestry, and genetic genealogy has taken off. With several genome databases (HapMap, HGDP, etc.) and software (like plink, admixture, Structure) publicly available, the days of the genome bloggers are here. And I am trying to be the latest one.

In starting this project, I have been inspired by the Dodecad Ancestry Project by Dienekes Pontikos and Eurogenes Ancestry Project by David Wesolowski. The catalyst for this project was my friend Razib who I bug whenever I need to talk genetics.

What is Harappa Ancestry Project?
It is a project to analyze (autosomal) genetic data of participants of South Asian origin for the purpose of providing detailed ancestry information. So the focus of the project is on South Asians: Indians, Pakistanis, Bangladeshis and Sri Lankans.

The project will collect 23andme raw genetic data from participants to better understand the ancestry relationships of different South Asian ethnicities.

I have named it after Harappa, an archaeological site of the Indus Valley Civilization in Punjab, Pakistan.

People of South Asian origin, or from neighboring countries, are eligible to participate. The countries of origin I am accepting are as follows:

  • Afghanistan
  • Bangladesh
  • Bhutan
  • Burma
  • India
  • Iran
  • Maldives
  • Nepal
  • Pakistan
  • Sri Lanka
  • Tibet

Right now, I am only accepting raw data samples from people who have tested with 23andme.

Please do not send samples from close relatives. I define close relatives as 2nd cousins or closer. If you have data from yourself and your parents, it might be better to send the samples from your parents (assuming they are not related to each other) and not send your own sample.

If you are unsure whether you are eligible to participate, please send me an email to inquire about it before sending off your raw data.

What to send?
Please send me your All DNA raw data text file (zipped is better) downloaded from 23andme, along with ancestral background information about you and all four of your grandparents. Background information would include where they were born, mother tongue, caste/community to which they belonged, etc. Please provide as much ancestry information as possible and try to be specific. Do especially include information about any ancestry from outside South Asia.

Data Privacy
The raw genetic data and ancestry information that you send me will not be shared with anyone.

Your data will be used only for ancestry analysis. No analysis of physical or health/medical traits will be performed.

The individual ancestry analysis published on this blog will be done using an ID of the form HRPnnnn known to only you and me.

What do you get?
All results of ancestry analysis (individual and group) will be posted on this blog under the Harappa Ancestry Project category. This will include admixture analysis as well as clustering into population groups etc.

I suggest you read about Dienekes’ analysis on South Asians for an idea about what to expect.

You can access all blog posts related to this project from the Harappa Ancestry Project link on the navigation menu on every page of my website. You can also subscribe to the project feed.


I have been neglecting the blog again.

On Sunday night and Monday, we got about 5 inches of snow here in Atlanta. That’s more than I have seen here in my 13 years.

Then the temperature stayed below freezing until today. So it turned to ice. The roads here have been treacherous all week and I have seen cars skidding and turning the wrong way on GA-400.

Also, school has been closed all week and it has been hard trying to keep the 1st grader occupied at home. She has been to school only 3 days out of 26. Maybe we could have gone on a month-long vacation instead of a short one in New York.

Here’s how our backyard looks today.

Personal Genomics: DNA Test

Last year in April, 23andme were having a sale for DNA Day, selling their 550,000 SNP test with ancestry and health information for $99 instead of its regular price of $499 at that time. So I decided to take the plunge and sent my spit from the East coast to the West to be analyzed.

Then, when 23andme had another sale ($99 again, but with the catch of a minimum one-year subscription at $5/month), I got my wife and my sister to do it on 23andme’s new version 3 genotyping chip with more than a million SNPs.

I got my results in May 2010 and have been having fun with them since. So let’s take a look.

There are reports for your genetic risk of a bunch of diseases. Those are interesting and useful in some cases, but there is still a lot of work to be done on the genetic associations of diseases. For now, except for a few important discoveries, family history is probably a better predictor of your disease risk than genetic testing. Oh yeah, and there are a couple of scary-looking numbers in my reports.

The health reports also show carrier status and drug response.

In terms of other traits, it’s mostly information I already knew like:

  • I can taste bitter tastes
  • I have wet earwax
  • My eye color is brown
  • I have curlier than average hair

One thing that was a surprise was that I am likely to be lactose intolerant. It’s possible I am somewhat tolerant due to environmental reasons.

Since I wanted more analysis than the 23andme reports gave, I downloaded Promethease, a free program that uses all the information at SNPedia to create a report about your SNPs and the features, traits and health factors they influence. The report it generates is long and interesting, though not formatted very well.

PS. Yes, this is the sort of topic I alluded to in my return announcement.

While there is more navel-gazing coming (mostly about ancestry and genetics), there are also going to be posts of more general interest. Let me just go ahead and say that the friend Razib mentioned is me.


Happy New Year, everyone! I am back from fun in New York City. We barely missed the blizzard there. It was already snowing and our flight was delayed but fortunately not canceled. I think all later flights to Atlanta were canceled.

Hope 2011 is a good year for all of us, much better and more fun than 2010. It’s also time to activate this blog again. So start visiting again for new content. There are several topics I have in mind, but one important and fascinating topic is the result of a discussion with Razib.