Diversity in Genetic Data: Where is Everyone?

Written by:

Genetic data from diverse populations is key for advancing disease research. A 2025 analysis revealed that 86% of large genetic studies do not have representation from more than one genetic ancestry population, severely limiting the utility of their results to any group other than the studied population. Scientists across the world are aiming to change this.

Photo by Pixabay

Incorporating genetic data into medical research has yielded incredible insight into disease mechanisms, risk factors, and potential treatments, with no slowdown in sight. As the cost of genetic sequencing falls, scientists have increased recruitment of research participants and widened their data collection net to make even more discoveries possible. This has allowed for analyses like genome-wide association studies (GWAS) of numerous diseases and the establishment of biobanks in which large numbers of participants provide highly detailed lifestyle and health information alongside their genetic data. While these advances have made a big difference in genetic research, we need to ask ourselves: who benefits from this data?

Ideally, findings should be as generalizable as possible. Recommendations such as not smoking, eating a healthy diet, and getting enough exercise are applicable to everyone, regardless of demographics like age, sex, race, or ethnicity. Unfortunately, even though the DNA sequence of any two humans is ~99.9% similar, that 0.1% difference can have a notable impact on someone’s disease risk or their response to a treatment. These genetic differences tend to cluster in groups who can trace their ancestry to a given geographical area, a concept referred to as genetic ancestry. 

A visual example of genetic ancestry vs genealogical ancestry. Genealogical ancestry refers to an individual’s family tree, the relatively recent collection of people that someone descends from. Genetic ancestry refers to the migration and mutation events that occurred over extended periods of time, leading to the genetic variants that an individual inherits. Credit to the National Human Genome Research Institute.

To be clear, genetic ancestry is completely separate from the concepts of race and ethnicity, which are social and cultural concepts without biological basis. Human migration events throughout history have had distinct effects on DNA inheritance patterns, which can be used to cluster people with similar patterns into genetic ancestry groups. These groups are, again, 99.9% identical to each other; it isn’t as though one group has a completely different set of genes. Rather, people all have the same set of genes but differ at some individual letters: one group tends to have an A at a given site, while another tends to have a C. Such differences typically lead to slight changes in how much of a gene is expressed, or to a version of a gene that still performs its function, just marginally more or less efficiently.

Again, these groups are extremely similar to one another. When it comes to understanding disease risk factors or reactions to medications, however, it is imperative to capture as much DNA variation as possible so the information is relevant to as many people as possible. Is that actually happening?

Corpas et al. explain that there is still a significant “diversity gap” in research that relies on genetic data. The authors cite a study published in 2022 which found that 86.3% of GWAS (studies that aim to associate DNA sequence with disease risk) published before June 2021 included only individuals of European genetic ancestry, omitting those with East Asian, African, South Asian, and Hispanic/Latino genetic ancestry. Some studies have endeavored to include multiple genetic ancestries, but as of 2021 those made up only 4.8% of published GWAS. Further, Corpas et al. cite a study from 2019 which found that 72% of participants were recruited from just three countries: the United States, the United Kingdom, and Iceland.

This is a problem.

When diversity isn’t built into a genetic study, results are under-informed at best and actively harmful at worst. Corpas et al. share examples of each of these scenarios. On the under-informed side, there are polygenic risk scores: estimates of a person’s risk for developing a specific disease based on their genetics. These are most often calculated using (typically European) GWAS data. When applied to anyone without European genetic ancestry, these scores can under- or over-estimate their risk, poorly informing health decisions. A study by Duncan et al. demonstrates this trend across multiple metabolic, cardiac, and psychiatric traits, each one showing that a European-calculated polygenic risk score is about half as informative when applied to those with African genetic ancestry. On the actively harmful side, some variants that alter how certain medications are metabolized are more common in people with African genetic ancestry than in people with European genetic ancestry. Because the efficacy and dosage of these medications were studied in people with European genetic ancestry, physicians were unprepared for poor breast cancer outcomes when treating African patients with tamoxifen, for ultrarapid metabolism of codeine leading to overdose in Ethiopian patients, and for an excessive anticoagulation effect of warfarin in Sub-Saharan African patients.
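For readers curious about what a polygenic risk score looks like in practice, here is a minimal sketch in Python. In its simplest form, the score is a weighted sum: for each variant, the number of risk alleles a person carries (0, 1, or 2) is multiplied by a weight taken from GWAS results. The variant names, weights, and genotypes below are made up for illustration, and treating an unmeasured variant as zero copies is a simplifying assumption; real scores draw on far more variants and more careful statistics.

```python
def polygenic_risk_score(effect_sizes, genotypes):
    """Return the sum of (GWAS weight x risk allele count) over all variants.

    effect_sizes: dict mapping variant name -> weight estimated in a GWAS
    genotypes:    dict mapping variant name -> risk allele count (0, 1, or 2)
    """
    score = 0.0
    for variant, weight in effect_sizes.items():
        # Assumption for this sketch: a variant missing from the person's
        # data counts as zero copies of the risk allele.
        allele_count = genotypes.get(variant, 0)
        score += weight * allele_count
    return score


# Hypothetical weights, as if estimated in a single-ancestry GWAS.
effect_sizes = {"variant_1": 0.12, "variant_2": -0.05, "variant_3": 0.30}

# Hypothetical genotype for one person.
person = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

score = polygenic_risk_score(effect_sizes, person)
print(f"Polygenic risk score: {score:.2f}")
```

Every weight in the sum comes from the population that was studied. If those weights do not carry over well to people with a different genetic ancestry, the total can be off in either direction, which is the problem described above.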

These harrowing situations can be avoided by increasing the diversity in genetic studies. The authors list the current initiatives in this area: the United States has the All of Us project, which has taken great care to build a highly diverse cohort; Latin America has both the Mexico City Prospective Study and the Peruvian Genome Project, which focus on admixed and Indigenous populations; the Middle East has the Qatar Biobank Cohort Study, which is making strides to increase that region’s representation; and the Human Pangenome Project has made a draft of the full human pangenome that attempts to capture as much genetic variation as possible by sequencing 47 individuals from diverse ancestries.

While these efforts are promising, it will take time and effort to characterize other ancestries, and there will be difficulties to overcome. Many of these groups have been neglected and wronged by scientists throughout history, so there is very little trust between these groups and the scientific community, which has led to less willingness to participate in genetic studies. Those reactions are completely understandable, and we as scientists need to put in work to prove that we do have communities’ best interests at heart. The only way we can do that is by getting involved: demystifying what we do, connecting with people as people and not as data points, and ASKING how we can best serve the community. We need to prove that we care.

If improving human health is the goal, everyone must be included. Genetics is for everyone, science is for everyone, health is for everyone.

Edited by Ethan Honeycutt and JP Flores

