How much junk is in our DNA?

The puffer fish genome is about eight times smaller than the human genome and 330 times smaller than the genome of the African lungfish Protopterus. What "ghosts" live in the "genome graveyard", and how much junk is in our DNA?

The renowned molecular biologist David Penny of the Allan Wilson Centre for Molecular Ecology and Evolution at Massey University in New Zealand once said:

"I would be quite proud to have served on the committee that designed the E. coli genome. There is, however, no way that I would admit to serving on the committee that designed the human genome. Not even a university committee could botch something that badly."

The question of how much junk our DNA contains is one of the most hotly debated in the scientific community, and it regularly sparks fierce arguments among scientists.

📌 A little bit of molecular genetics

Recall that hereditary information is carried by the double-stranded DNA molecule. It is a polymer built from four types of monomers (nucleotides) - adenine (A), thymine (T), cytosine (C) and guanine (G) - and is packaged into chromosomes. A human has 23 pairs of chromosomes in the cell nucleus (22 pairs of autosomes and one pair of sex chromosomes). Together they make up the bulk of our genome (the circular DNA of mitochondria contains another 37 genes). If we took a single human cell, joined the entire diploid (paired) set of chromosomes end to end and stretched it out into a thread, we would get a molecule about two meters long, consisting of six billion base pairs: three billion from dad and three billion from mom.
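The "two meters" figure is easy to check with a back-of-the-envelope calculation. The sketch below assumes a length of about 0.34 nanometers per base pair (the textbook value for the common B-form of DNA); that number and the function name are not from the article and are used only for illustration.

```python
# Rough check of the "two meters" figure from the text.
# Assumption (not from the article): about 0.34 nm of length per base pair,
# the textbook rise for B-form DNA.
BASE_PAIRS_DIPLOID = 6_000_000_000  # six billion base pairs, as stated in the text
RISE_PER_BP_NM = 0.34               # assumed length per base pair, in nanometers


def dna_length_m(base_pairs, rise_nm=RISE_PER_BP_NM):
    """Return the stretched-out length of a DNA molecule in meters."""
    return base_pairs * rise_nm * 1e-9  # convert nanometers to meters


length = dna_length_m(BASE_PAIRS_DIPLOID)
print(f"{length:.2f} m")
```

Under this assumption the diploid genome comes out at roughly two meters, in line with the estimate above.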

The best-studied type of functional DNA sequence is the protein-coding gene. Such a gene is transcribed into an RNA molecule, which then serves as a template for protein synthesis and determines the protein's amino acid sequence. The coding part of the RNA molecule can be divided into nucleotide triplets (codons), each of which either corresponds to a particular amino acid or marks the place where protein synthesis ends (stop codons). The set of rules matching codons to amino acids is called the genetic code. For example, the codon GCC encodes the amino acid alanine.
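Reading a coding sequence codon by codon can be sketched in a few lines. The table below is a deliberately tiny subset of the standard genetic code, and the RNA string is a made-up example; a real translator would need all 64 codons.

```python
# A deliberately tiny subset of the standard genetic code (RNA codons).
# A complete table has 64 entries; this one is for illustration only.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "GCC": "Ala",  # the example from the text
    "GCU": "Ala",
    "UGG": "Trp",
    "AAA": "Lys",
    "UAA": "Stop",
    "UAG": "Stop",
    "UGA": "Stop",
}


def translate(rna):
    """Read an RNA coding sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "???")  # codon missing from the toy table
        if amino_acid == "Stop":
            break  # protein synthesis ends here
        protein.append(amino_acid)
    return protein


# Hypothetical toy sequence: AUG GCC AAA UGG UAA GCC
print(translate("AUGGCCAAAUGGUAAGCC"))
```

Everything after the stop codon UAA is ignored, so the final GCC in the toy sequence does not add an alanine to the protein.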

📌 Shall we compare genes?

It was once thought that an organism as complex as a human must have a great many genes. As the Human Genome Project was drawing to a close, scientists even ran a betting pool on how many genes would be found.

Imagine their surprise when it turned out that humans and the small roundworm Caenorhabditis elegans have roughly the same number of genes. The worm has about 20,000 genes, while we have 20-25 thousand.

For the "crown of creation" this is rather humbling, especially considering that many organisms have a larger genome (the genome of the marbled lungfish, Protopterus aethiopicus, is about 40 times larger than the human one) or a larger number of genes (rice has 32-50 thousand).

In fact, less than 2% of the human genome encodes proteins. What is the other 98% for? Could the secret of our complexity be hidden there? Some non-coding regions of DNA have indeed turned out to be important. These include promoters - nucleotide sequences where the RNA polymerase enzyme binds and where synthesis of the RNA molecule begins; binding sites for transcription factors - proteins that regulate the activity of genes; telomeres, which protect the ends of chromosomes; and centromeres, which are needed for chromosomes to be correctly separated to opposite poles of the cell during division.

Some non-coding DNA encodes regulatory RNA molecules (for example, microRNAs, which block protein synthesis from the messenger RNA, the working copy of the corresponding gene), as well as RNA molecules that are part of important enzymatic complexes - for example, ribosomes, which assemble proteins from individual amino acids as they move along the messenger RNA. There are other examples of important non-coding regions of DNA.

Nevertheless, most of our genome resembles a desert: repetitive sequences; the remains of "dead" viruses that were inserted into the genomes of our ancestors long ago; so-called selfish mobile elements - DNA sequences that can jump from one part of the genome to another; and various pseudogenes - nucleotide sequences that have lost the ability to encode proteins as a result of mutations but still retain some features of genes. And this is far from a complete list of the "ghosts" living in the "genome graveyard".

📌 Minimal mouse

There is a view that most of the human genome is non-functional. In 2004, the journal Nature published an article describing mice from whose genome large fragments of non-coding DNA, 0.8 and even 1.5 million nucleotides long, had been cut out. These mice turned out not to differ from ordinary mice in body structure, development, lifespan or ability to produce offspring. Of course, some differences could have gone unnoticed, but on the whole this was a serious argument for the existence of "junk DNA" that can be removed without any particular consequences.

Of course, it would be interesting to cut out not a couple of million nucleotides but a billion, leaving only the predicted gene sequences and known functional elements. Could such a "minimal mouse" be bred, and would it be able to live normally? Could a human get by with a genome only half a meter long? Perhaps someday we will find out. Meanwhile, another important argument for the existence of junk DNA is that fairly closely related organisms can have very different genome sizes.

The puffer fish genome is about eight times smaller than the human genome (although it contains about the same number of genes) and 330 times smaller than the genome of the lungfish mentioned above. If every nucleotide in a genome were functional, it would be hard to explain why an onion needs a genome five times larger than ours.
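These ratios can be reproduced from approximate genome sizes. The values below are rounded, commonly quoted haploid sizes in billions of base pairs; they are assumptions chosen for this sketch, not figures given in the article, and published estimates vary.

```python
# Approximate haploid genome sizes in billions of base pairs (Gb).
# Rounded, commonly quoted values; assumptions for illustration only.
GENOME_SIZE_GB = {
    "puffer fish": 0.39,
    "human": 3.1,
    "onion": 16.0,
    "marbled lungfish": 130.0,
}


def fold_difference(larger, smaller, sizes=GENOME_SIZE_GB):
    """How many times the genome of `larger` exceeds that of `smaller`."""
    return sizes[larger] / sizes[smaller]


for larger, smaller in [("human", "puffer fish"),
                        ("marbled lungfish", "puffer fish"),
                        ("onion", "human")]:
    ratio = fold_difference(larger, smaller)
    print(f"{larger} / {smaller}: about {ratio:.0f}x")
```

With these inputs the ratios come out close to the "eight times", "330 times" and "five times" quoted above, even though gene counts in these species differ far less than their genome sizes.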

The evolutionary biologist Susumu Ohno drew attention to the colossal differences in genome size between similar organisms. Ohno is generally credited with coining the term "junk DNA". Back in 1972, long before the human genome was sequenced, he put forward plausible estimates of both the number of genes in the human genome and the amount of "junk" in it. In his article "So much 'junk' DNA in our genome", he suggests that the human genome must contain about 30,000 genes. This number, far from obvious at the time, turned out to be surprisingly close to the real one, which became known only decades later. In addition, Ohno estimated the functional share of the genome at 6%, declaring more than 90% of the human genome to be junk.

📌 Find or trash?

The ENCODE project (the Encyclopedia of DNA Elements), whose first results were published in the journal Nature in 2012, challenged the notion that junk DNA exists. Having collected extensive experimental data on which parts of the human genome interact with various proteins, take part in transcription (the synthesis of RNA copies of genes, which are then used for translation, that is, protein synthesis from amino acids on a messenger RNA template) or are involved in other biochemical processes, the authors concluded that more than 80% of the human genome is functional in one way or another. Unsurprisingly, this claim provoked heated debate in the scientific community.

One of the most ironic responses, published in 2013 in the journal Genome Biology and Evolution by Dan Graur, a professor of molecular evolutionary bioinformatics at the University of Houston, and his colleagues, is titled "On the immortality of television sets: 'function' in the human genome according to the evolution-free gospel of ENCODE". Its authors note that members of the ENCODE consortium themselves disagree about what share of the genome is functional. One of them soon clarified on the blog Genomicron that the figure for functional sequences in the genome was not 80% but 40%, while another (in an article in Scientific American) lowered it all the way to 20%, yet continued to insist that the term "junk DNA" should be removed from the lexicon.

According to the authors of "On the immortality of television sets", members of the ENCODE consortium interpret the term "function" too loosely. For example, there are proteins called histones, which bind the DNA molecule and help pack it compactly. Histones can undergo certain chemical modifications. According to ENCODE, the putative function of one of these histone modifications is a "preference to be at the 5'-end of genes" (the 5'-end is the end of a gene from which DNA and RNA polymerase enzymes start moving during DNA copying or transcription). "In much the same way, we can say that the function of the White House is to occupy an area of land at 1600 Pennsylvania Avenue, Washington, DC," the critics remark.

Assigning a function to DNA regions raises another problem. Suppose that a protein important for the cell is able to attach to a certain region of DNA, and ENCODE therefore ascribes a "function" to this region. For example, a certain transcription factor - a protein that initiates the synthesis of messenger RNA - binds to the nucleotide sequence TATAAA.

Consider two identical TATAAA sequences in different parts of the genome. When the transcription factor binds to the first one, synthesis of an RNA molecule begins, and this RNA serves as a template for the synthesis of some important protein. Mutations (substitutions of any of the nucleotides) in this sequence will cause the RNA to be transcribed poorly, the protein will not be synthesized, and this will most likely reduce the organism's chances of survival. Natural selection will therefore preserve the correct TATAAA sequence at this location in the genome, and in this case it is appropriate to speak of its function.

The other TATAAA sequence arose in the genome by chance. Since it is identical to the first, the transcription factor binds to it too. But there is no gene nearby, so the binding leads to nothing. If a mutation occurs here, nothing will change and the organism will not suffer. In this case it makes no sense to speak of a function of the second TATAAA site. It may turn out, however, that a large number of TATAAA sequences located far from genes are needed simply to bind the transcription factor and lower its effective concentration. In that case, selection will regulate the number of such sequences in the genome.
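The difference between the two sites can be illustrated with a toy motif search. In the sketch below, the sequence, the gene start position and the 25-nucleotide distance cutoff are all hypothetical; real promoter annotation is far more involved. The point is only that a plain pattern match finds both sites equally well, and it takes extra information (here, whether a gene lies just downstream) to tell them apart.

```python
import re

MOTIF = "TATAAA"


def find_motif(sequence, motif=MOTIF):
    """Return 0-based start positions of every (possibly overlapping) match."""
    return [m.start() for m in re.finditer(f"(?={motif})", sequence)]


def has_gene_downstream(pos, gene_starts, max_distance=25):
    """True if some annotated gene starts within max_distance after pos."""
    return any(0 < start - pos <= max_distance for start in gene_starts)


# Hypothetical toy sequence: the first TATAAA sits just before a gene
# (its ATG start is at index 11), the second one has no gene after it.
toy = "GGC" + "TATAAA" + "GGATGGCCAAATGA" + "CCGT" + "TATAAA" + "CGCG"
gene_starts = [11]  # hypothetical annotation

for pos in find_motif(toy):
    label = "gene nearby" if has_gene_downstream(pos, gene_starts) else "no gene nearby"
    print(pos, label)
```

Both sites would "bind the transcription factor" in this toy model, but only the first one would be expected to be preserved by selection.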

To prove that a stretch of DNA is functional, it is not enough to show that some biological process (for example, the binding of a protein to DNA) takes place there. Members of the ENCODE consortium write that DNA regions involved in transcription have a function. "But why dwell on the fact that 74.7% of the genome is transcribed, when we can say that 100% of the genome takes part in a reproducible biochemical process - replication!" Graur jokes again.

A good criterion of functionality for a DNA region is that mutations in it are harmful enough that the region changes little from generation to generation. How can such regions be identified? This is where bioinformatics comes to the rescue - a modern science at the junction of biology and mathematics that deals with the analysis of gene and protein sequences.

We can take the human and mouse genomes and find all the similar stretches of DNA in them. It turns out that some parts of the nucleotide sequences are very similar in these two species. For example, the genes for ribosomal proteins are highly conserved: mutations in them are harmful enough that their carriers die out without leaving offspring. Such genes are said to be under negative (purifying) selection, which removes harmful mutations. Other regions of the genomes differ substantially between species, which indicates that mutations in them are most likely harmless, and hence that their functional role is small or does not depend on a specific nucleotide sequence.
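The idea of comparing two genomes region by region can be sketched as a percent-identity calculation over windows of an alignment. The two fragments below are invented and assumed to be already aligned; real studies align whole genomes and use statistical models of substitution rates, so this is only a minimal illustration of the principle.

```python
def percent_identity(a, b):
    """Share of aligned positions (in %) where two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def windowed_identity(a, b, window):
    """Percent identity in consecutive non-overlapping windows."""
    return [percent_identity(a[i:i + window], b[i:i + window])
            for i in range(0, len(a) - window + 1, window)]


# Hypothetical, already aligned toy fragments (not real human or mouse DNA).
human = "ATGGCCAAGTTC" + "GATTACAGGCTA"
mouse = "ATGGCCAAGTTC" + "GTTAACTGGATA"

for index, identity in enumerate(windowed_identity(human, mouse, window=12)):
    print(f"window {index}: {identity:.0f}% identical")
```

In this toy example the first window is identical in both "species", as one would expect for a region under strong negative selection, while the second has accumulated differences.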

A number of studies have estimated the share of human DNA that is under negative selection. It turned out to be only about 6.5-10% of the genome, and non-coding regions are much less subject to negative selection than coding ones. So by evolutionary criteria, less than 10% of the human genome is functional. Notice how close Ohno came to this estimate back in 1972!

📌 Holding on to the junk

But is the remaining 90% of the human genome really junk that we would be better off without? Not quite. There are reasons to think that a large genome can be useful in itself. In bacteria, genome replication is a serious limiting factor that requires significant energy expenditure. Their genomes are therefore usually small, and they get rid of everything superfluous. In large organisms, replicating the DNA of dividing cells usually makes only a small contribution to the body's total energy consumption compared with the costs of running the brain, muscles and excretory organs, maintaining body temperature, and so on.

At the same time, a large genome can be an important source of genetic diversity, increasing the chances that mutations will turn non-functional regions into new functional ones that prove useful in the course of evolution. Mobile elements can carry regulatory elements, creating genetic diversity in gene regulation. In other words, organisms with large genomes can in theory adapt more quickly to environmental conditions, paying for it with the relatively small additional cost of replicating a larger genome. We will not see such an effect in an individual organism, but it can play an important role at the population level.

Having a large genome can also reduce the likelihood that a virus will insert itself into a functional gene (which can break the gene and, in some cases, lead to cancer). In other words, it is possible that natural selection acts not only to maintain specific sequences in the genome, but also to maintain a certain genome size, the nucleotide composition of some of its regions, and so on.

However, although the idea that as much as 80%, or even 20%, of the human genome is functional is controversial, this does not mean that the ENCODE project as a whole deserves criticism. It produced a huge amount of data on how different proteins bind to DNA, on gene regulation, and so on. These data are of great interest to specialists. But we are unlikely to get rid of the "junk" in the genome any time soon - neither the concept nor the unnecessary sequences themselves.

Author: Alexander Panchin, Researcher, Molecular Evolution Sector