Transposition Intermediates

A Field Guide to Eukaryotic Transposable Elements

Transposable elements (TEs) are mobile DNA sequences that can replicate and propagate within genomes. Through various invasion strategies, TEs have come to occupy a significant portion of nearly all eukaryotic genomes, becoming a major source of genetic variation and novelty. This review explores the defining features of each major group of eukaryotic TEs, their evolutionary origins, and their relationships. We discuss how the unique biology of different TEs influences their propagation and distribution within and across genomes. Furthermore, we examine how environmental and genetic factors at the host level modulate the activity, diversification, and ultimate fate of TEs, leading to the dramatic variation in TE content observed across eukaryotes. Cataloging TE diversity and dissecting the idiosyncratic behavior of individual elements is crucial to furthering our understanding of their impact on the biology of genomes and the evolution of species.

INTRODUCTION

Transposable elements (TEs), sometimes called “jumping genes,” are mobile DNA sequences capable of replicating themselves within genomes independently of the host cell DNA. They typically range in length from 100 to 10,000 base pairs, but can sometimes be far larger. Alongside viruses, TEs are among the most complex and intriguing selfish genetic elements, often encoding proteins with multiple biochemical activities and intricate noncoding regulatory sequences that promote their propagation.

The boundary between TEs and other invasive genetic elements like viruses can be ambiguous. We define a TE as a genetic element capable of chromosomal and replicative mobilization in the germ line, thereby increasing in frequency through vertical inheritance. This definition includes non-autonomous elements such as short interspersed nuclear elements (SINEs) and miniature inverted-repeat TEs (MITEs), as well as endogenous retroviruses (ERVs). While inheritance through the germ line is a defining feature, horizontal transfer of TEs between species is also an important factor in their long-term success.

Nearly all eukaryotic genomes examined thus far harbor TEs, with a few exceptions. TE content often correlates strongly with genome size, and in some species, TEs constitute as much as 85% of the genome, with protein-coding regions appearing as islands in a sea of TEs. The fraction of the genome occupied by TEs does not correlate with organismal complexity; both complex multicellular organisms (e.g., conifers and salamanders) and single-celled organisms (e.g., Trichomonas vaginalis and Anncaliia algerae) may contain substantial TE fractions. Thus, TEs are an omnipresent feature of eukaryotic genomes.

Since Barbara McClintock’s pioneering work on “controlling elements,” the profound impact of TEs on eukaryotic evolution has become clear. TEs play a critical role in everything from the size and structure of genomes to the proteins they encode and the regulation thereof. To understand how TEs have impacted the diversification and biology of species, we must first understand the diversity and biology of TEs themselves. In this review, we provide an overview of the classification of eukaryotic TEs, examine their evolutionary origins and relationships, explore the variation of TE content across species, and discuss the factors underlying such variation.

CLASSIFICATION OF EUKARYOTIC TRANSPOSABLE ELEMENTS

The most fundamental division of eukaryotic TEs, introduced by David Finnegan in 1989, distinguishes two major classes based on their transposition intermediates: class I (retrotransposons) and class II (DNA transposons). Class I elements replicate via an RNA intermediate, which is then reverse-transcribed back into a DNA copy and integrated into the genome. Because the original template element remains intact, retrotransposons are commonly referred to as “copy-and-paste” elements. In contrast, the majority of (but not all) class II elements mobilize through a “cut-and-paste” mechanism, in which the transposon itself is excised and moved to a new genomic location. Both classes are further subdivided into subclasses, superfamilies, and families based on replication mechanisms and phylogenetic relationships.

TE families are usually defined using the “80-80-80” rule: insertions are members of the same family if they are longer than 80 base pairs and share at least 80% sequence identity over 80% of their length. These families can then be represented by their majority-rule consensus sequence, approximating the ancestral TE that seeded the family. However, this rule and corresponding consensus sequences do not always reflect the true phylogenetic structure of TE families, requiring more careful analyses.
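The rule and the consensus step can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real annotation pipeline. The function names, the choice to measure coverage against the shorter of the two insertions, and the decision to drop columns where a gap is the majority character are all assumptions made for the example. The inputs are assumed to come from a pairwise or multiple alignment computed elsewhere.

```python
from collections import Counter


def same_family(len_a, len_b, aligned_len, identical_positions,
                min_len=80, min_identity=0.80, min_coverage=0.80):
    """Apply the 80-80-80 rule to two insertions.

    len_a, len_b         -- lengths of the two insertions in base pairs
    aligned_len          -- length of the aligned region between them
    identical_positions  -- identical positions within the aligned region

    Assumption: coverage is measured against the shorter insertion, so
    that truncated copies can still be assigned to a family.
    """
    if len_a <= min_len or len_b <= min_len:
        return False  # both insertions must be longer than 80 bp
    if aligned_len <= 0:
        return False  # nothing aligned, nothing to compare
    identity = identical_positions / aligned_len
    coverage = aligned_len / min(len_a, len_b)
    return identity >= min_identity and coverage >= min_coverage


def majority_consensus(aligned_seqs):
    """Majority-rule consensus of equal-length, pre-aligned sequences.

    Gaps are written as '-'. Columns whose most common character is a
    gap are omitted from the consensus (an assumption of this sketch).
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_seqs):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)
```

For example, two insertions of 300 and 280 bp that align over 250 bp with 220 identical positions pass the rule, whereas the same pair with only 190 identical positions (76% identity) does not. As noted above, passing or failing this threshold says nothing about the underlying phylogenetic structure of the family.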

TEs can also be classified according to whether or not they are able to move autonomously. Autonomous elements encode the enzymatic machinery necessary for their own transposition, while non-autonomous elements are noncoding but capable of mobilization in trans by hijacking the machinery produced by their autonomous counterparts. Some non-autonomous elements originate from deletion derivatives of autonomous elements (e.g., MITEs), while others emerge ‘de novo’ from non-TE sequences (e.g., SINEs).

Class I Retrotransposons

Retrotransposons are divided into three major subclasses according to their mechanism of replication and integration: (i) Long Terminal Repeat (LTR) elements (mobilized by an integrase); (ii) “target-primed” non-LTR elements; and (iii) Tyrosine Recombinase (YR)-mobilized elements.

Non-LTR elements usually contain two open reading frames, ORF1 and ORF2. The function of ORF1 is poorly understood, and it is dispensable in some groups, while ORF2 encodes both the endonuclease (EN) and reverse transcriptase (RT) activities essential for target-primed reverse transcription (TPRT). In L1 elements, TPRT begins when EN makes a single-stranded nick in the host DNA. The exposed host DNA end then hybridizes to the element RNA and primes reverse transcription, after which the cDNA strand is integrated. A hallmark of this process is 5’-truncation of the new copy, which generally prevents further propagation.

The structures, coding capacity, and replication mechanisms of LTR elements resemble those of retroviruses. Autonomous LTR elements contain gag and pol genes, generally expressed as a single polycistronic RNA. Both gag and pol encode polyproteins cleaved into multiple proteins by a pol-encoded protease (PR). Pol also encodes reverse transcriptase (RT), RNaseH, and integrase (IN) activities. The cDNA product is bound by the IN protein, mediating nuclear localization and integration through a process similar to cut-and-paste transposases. The process of retroviral replication and integration is essentially the same, with the only substantive difference linked to the acquisition of fusogenic env genes by retroviruses.

YR retrotransposons represent a third major subclass of class I elements, differing from LTR elements by encoding YR in place of IN. YR elements possess terminal repeat sequences, but their structure varies between superfamilies. A proposed mechanism for DIRS involves reverse transcription of the mRNA template, circularization of the cDNA copy, synthesis of the second cDNA strand, and chromosomal integration mediated by YR.

Finally, Penelope elements are characterized by pseudo-LTRs and a GIY-YIG endonuclease domain. Based on their likely reliance on TPRT for transposition, they may be classified as non-LTR elements, but phylogenetic analyses suggest they define a distinct monophyletic group, potentially a separate subclass of retroelements.

Figure 1. Summary of replication mechanisms and transposition intermediates.

Proposed transposition intermediates and key replication steps for five TE subclasses. YR-retrotransposons and Maverick/Polintons are not shown, but the former are expected to transpose via the same intermediate as Class II YR-transposons (i.e., Cryptons). The mechanism of Maverick/Polintons has not yet been studied, but based on the presence of a protein-primed family-B DNA polymerase (pPolB), they are expected to transpose by direct synthesis of a DNA copy.

Class II DNA Transposons

There are four major groups of DNA transposons: (i) cut-and-paste elements mobilized by DDE transposases, (ii) elements mobilized by YR (Cryptons), (iii) rolling-circle elements (Helitrons), and (iv) “self-synthesizing” transposons (Mavericks or Polintons). DDE transposons and Cryptons are the simplest, typically consisting of a single ORF encoding a recombinase flanked by short terminal inverted repeats (TIRs), resembling prokaryotic insertion sequences. DDE transposons are the most diverse and widespread of all TEs.

The precise mechanism of DDE transposition varies between superfamilies, but generally involves nucleophilic attack near the TIRs, resulting in excision and re-location of the transposon DNA. While the process itself is non-replicative, these elements can increase in copy number through preferential transposition during host DNA synthesis or through homologous recombination repair of double-strand breaks. Non-autonomous elements often lose their coding capacity but retain transposase binding sites, forming extensive families of MITEs.

Helitrons are abundant in many eukaryotic lineages but were largely uncharacterized until the early 2000s. Autonomous Helitrons code for a Rep/Hel protein with a DNA helicase domain fused to a HUH nuclease domain, suggesting a fundamentally different mobilization mechanism than cut-and-paste elements. Functional studies suggest a “peel-and-paste” mechanism in which a covalently linked circular dsDNA intermediate is formed, though some Helitrons are able to directly excise rather than copy.

Mavericks (or Polintons) are exceptional for their size (15–20 kb) and complexity, consisting of up to twenty protein-coding genes flanked by long TIRs. These elements are widespread across eukaryotes but generally present in low copy number. Maverick/Polintons share similarities to disparate groups of double-stranded DNA (dsDNA) viruses, including a protein-primed family-B DNA polymerase (pPolB), suggesting replication via direct synthesis of a DNA copy. They also encode a DDE nuclease related to retroviral IN. Many Maverick and Polinton elements are predicted to encode capsid-like proteins, leading to the proposal that they may represent endogenous viruses or virophages.
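The classification described in this section can be condensed into a small lookup table. The sketch below is one possible encoding in Python, limited to the groups and enzymes named in the text. The `TEGroup` record, its field names, and the two helper functions are hypothetical and exist only for illustration. Penelope elements are listed separately because, as noted above, their placement is unsettled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TEGroup:
    name: str        # group name as used in the text
    te_class: str    # "I" (retrotransposon) or "II" (DNA transposon)
    key_enzyme: str  # defining enzymatic activity mentioned in the text
    mechanism: str   # short description of the proposed mechanism


FIELD_GUIDE = [
    TEGroup("LTR retrotransposon", "I", "RT + integrase (IN)",
            "reverse transcription, IN-mediated integration"),
    TEGroup("non-LTR retrotransposon", "I", "RT + endonuclease (EN)",
            "target-primed reverse transcription (TPRT)"),
    TEGroup("YR retrotransposon", "I", "RT + tyrosine recombinase (YR)",
            "reverse transcription, circular cDNA, YR-mediated integration"),
    TEGroup("Penelope", "I", "GIY-YIG endonuclease",
            "likely TPRT; subclass placement unsettled"),
    TEGroup("DDE transposon", "II", "DDE transposase",
            "cut-and-paste"),
    TEGroup("Crypton", "II", "tyrosine recombinase (YR)",
            "YR-mediated recombination"),
    TEGroup("Helitron", "II", "Rep/Hel (helicase + HUH nuclease)",
            "rolling-circle, peel-and-paste"),
    TEGroup("Maverick/Polinton", "II", "pPolB + DDE nuclease",
            "proposed self-synthesis of a DNA copy"),
]


def groups_in_class(te_class):
    """Names of all groups belonging to class 'I' or class 'II'."""
    return [g.name for g in FIELD_GUIDE if g.te_class == te_class]


def groups_using(enzyme_keyword):
    """Names of all groups whose key enzyme mentions the keyword."""
    return [g.name for g in FIELD_GUIDE if enzyme_keyword in g.key_enzyme]
```

Querying by enzyme rather than by class, for example `groups_using("YR")` or `groups_using("DDE")`, makes it easy to see that the same catalytic domain can appear in both class I and class II elements.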

EVOLUTIONARY ORIGINS OF EUKARYOTIC TRANSPOSABLE ELEMENTS

When and how did the major groups of TEs originate, and how do they relate to each other? A phylogenomic framework, integrating taxonomic distribution with phylogenetic analyses of shared core proteins, is the best approach to these questions. However, its power is limited by rapid TE sequence evolution, frequent horizontal transfers, and lineage loss.

Despite these caveats, several conclusions can be drawn. First, all major subclasses are widely distributed across the eukaryotic tree. Second, phylogenetic topologies of core TE proteins suggest that each of these subclasses already existed early in eukaryotic evolution. Third, TE evolution is highly modular, with recurrent gain and loss of proteins from a shared pool of conserved domains.

Deep Evolutionary Roots of TE Proteins

Despite the diversity in the structure of different elements, the number of distinct protein families involved in replication and transposition is surprisingly small, comprising roughly five defining catalytic families (RT, DDE integrase, YR, Rep/Hel, and pPolB) along with accessory domains such as HUH endonuclease. DDE integrases, HUH endonuclease, and RT all share a deeply conserved structural fold termed the RNA recognition motif, indicating that the core enzymatic machinery of transposition predates the emergence of eukaryotes.

At least six of the main DDE-transposase superfamilies can be phylogenetically clustered with well-defined prokaryotic IS transposases, suggesting that each of these DNA transposon superfamilies arose prior to the split of prokaryotes and eukaryotes. In contrast, none of the remaining eukaryotic TE subclasses have unambiguous prokaryotic homologs. While retroelements do occur in prokaryotes and phylogenies point to a direct affiliation between prokaryotic and eukaryotic RTs, all extant eukaryotic retroelements are very distinct from their prokaryotic relatives.

In the case of rolling-circle replication elements, the HUH endonuclease involved in the transposition of Helitrons is also responsible for the mobilization of prokaryotic IS91 transposons, but it appears likely that prokaryotic and eukaryotic rolling-circle elements emerged independently from viruses or plasmids. Similarly, although transposons mobilized by YR are common in prokaryotes, their enzymes are not directly related to those encoded by eukaryotic YR retrotransposons or class II Cryptons. Thus, most eukaryotic TE subclasses appear to have emerged shortly after the split of prokaryotes and eukaryotes.

Chimeric Elements and Modular Evolution

While phylogenomic analyses reveal the deep relationships between the core transposition enzymes, they offer limited insight into the origin of individual families and superfamilies. TEs, viruses, and plasmids form a densely connected evolutionary web characterized by frequent exchange of protein-coding units. These exchanges involve both the core domains essential for transposition and accessory domains acquired from host genomes, blurring the distinctions between TE classes and other invasive elements. LTR retrotransposons, for example, appear to have evolved a unique transposition mechanism that borrows components from non-LTR elements and cut-and-paste DDE transposons.

SINEs also offer a compelling illustration of how highly successful TE families repeatedly emerge via chimeric assembly. Most SINEs are derived from Pol III-transcribed noncoding RNA trans-mobilized by the machinery of LINEs. Many have evolved complex mosaic structures which further enhance their transposition capacity. For instance, Alu elements arose early in primate evolution by a process involving the fusion of two monomeric 7SL-derived SINEs. Since their appearance, Alus have spawned many subfamilies and new composite elements. Similarly tortuous stories of SINE diversification via fusion and accretion of additional sequences have been described in plants and other animals.

Figure 2. Structure and taxonomy of eukaryotic TEs.

Left panel: unrooted cladograms showing putative relationships between the major TE superfamilies, based on phylogenies of core protein domains for five subclasses. Right panel: genetic structures of representative elements from each subclass.

VARIABLE SUCCESS OF TRANSPOSABLE ELEMENTS ACROSS SPECIES

The TE content of genomes varies greatly between species. Some genomes contain just a few TE families, while others are bloated with a bewildering diversity. Understanding the factors influencing TE accumulation and diversification across species is paramount to characterizing the impact of TEs on genome evolution.

Across broad phylogenetic scales, it has been proposed that the overall TE load is dictated by effective population size, or Ne: the efficiency of selection in removing deleterious insertions declines as Ne decreases, so species with small Ne are expected to accumulate more TEs. However, this cannot account for differences in TE abundance observed between species with comparable Ne. Similarly, it offers little explanation as to why the diversity of TEs should be so variable between species, or why certain TE types seem to be particularly successful in certain taxonomic groups.
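The link between Ne and selection against insertions can be made concrete with Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below is an illustration under stated assumptions, not an analysis from the studies discussed here. It assumes a diploid population whose census size equals Ne, a single new insertion at initial frequency 1/(2Ne), and a hypothetical selection coefficient `s` (negative for a deleterious insertion). The function names are invented for the example.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's fixation probability for a single new mutation.

    P = (1 - exp(-2s)) / (1 - exp(-4 * Ne * s)), assuming a diploid
    population with census size equal to Ne. For s == 0 this reduces
    to the neutral expectation 1 / (2 * Ne).
    """
    if n_e <= 0:
        raise ValueError("n_e must be positive")
    if s == 0:
        return 1.0 / (2.0 * n_e)
    exponent = -4.0 * n_e * s
    if exponent > 700.0:
        # Strongly deleterious in a large population: the probability
        # is far below floating-point range, so report zero.
        return 0.0
    # expm1 keeps precision when s is very small.
    return math.expm1(-2.0 * s) / math.expm1(exponent)


def relative_fixation(s, n_e):
    """Fixation probability relative to a neutral insertion."""
    return fixation_probability(s, n_e) * 2.0 * n_e


if __name__ == "__main__":
    s = -1e-4  # hypothetical cost of one insertion
    for n_e in (100, 1_000, 10_000, 1_000_000):
        print(n_e, relative_fixation(s, n_e))
```

With a fixed, slightly deleterious `s`, the relative fixation probability stays close to 1 when Ne is small, meaning the insertion behaves almost neutrally, and collapses toward zero as Ne grows. This captures the proposed effect of Ne on TE load, but, as the paragraph above points out, it cannot explain differences between species with comparable Ne.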

TE Abundance and the Relationship with Genome Size

Very few eukaryotic species appear to lack TEs altogether. The best-known exceptions are apicomplexan protists such as Plasmodium falciparum, which seem to have successfully purged TEs from their genomes. These species are single-celled, obligate intracellular parasites and are predominantly asexual except for brief periods in their lifecycle. However, several other parasitic unicellular eukaryotes do harbor diverse and active TE communities. At the other end of the size spectrum, many salamanders have undergone extreme genome expansions through the accumulation of LTR retroelements. Plant genomes too often grow very large through the rapid accumulation of LTR elements.

The rate of non-essential DNA removal is also a critical factor shaping TE content and genome size. Genomic gigantism in salamanders is associated with low deletion rates, whereas in rice and Arabidopsis transpositional gain of DNA appears to be buffered by high rates of deletion via ectopic recombination. This phenomenon is also apparent in birds and mammals, and suggests an “accordion model” for genome size evolution.

Genomic TE Diversity

In addition to variation in abundance, there are also differences in TE diversity between species. Many eukaryotes harbor extraordinarily diverse TE repertoires. Zebrafish, for example, are both the most TE-abundant and -diverse vertebrate model organism currently in use, harboring nearly 2000 distinct families with representatives from every subclass and almost every superfamily.

Large genomes might be assumed to be associated with wide TE diversity, but this is not necessarily true. Spruce pine, for example, is a gymnosperm conifer with a 20-Gb genome dominated by a relatively small number of very high-copy number LTR elements. This indicates that whilst TE diversity is low in the spruce pine, elements that do establish in the genome are removed slowly. The opposite is true of most flowering plants, which tend to have smaller genomes but more diverse TE landscapes than gymnosperms.

Figure 3. Distribution of TEs across the eukaryote phylogeny.

Reference genome size varies dramatically across eukaryotes and is loosely correlated with transposable element content. Here, the honey bee TE content is likely an underestimate, as approximately 3% of the genome derives from unusual “large retrotransposon derivatives” (LARDs).

HOW THE BIOLOGY OF TES AFFECTS THEIR SUCCESS

The fate of a TE family is dictated by three dynamic forces: (i) the rate of transposition, (ii) the rate of fixation of new TE insertions, and (iii) the rate at which TE sequences are deleted or eroded. Each of these processes is influenced by factors intrinsic to the TE itself and those intrinsic to the host. Both TE and host factors are in turn shaped by the environment, and the interplay between TE, host, and environmental factors results in the variety of TE landscapes in eukaryotic genomes. We will concentrate on the factors intrinsic to TEs that influence their survival and success within genomes.
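The interplay of these three forces can be pictured with a deliberately simple toy model. The sketch below is not a published model. It assumes that each existing copy produces new insertions at a constant per-generation rate, that a constant fraction of those insertions reach fixation, and that each copy is lost at a constant rate. All parameter names and values are hypothetical.

```python
def simulate_copy_number(n0, transposition_rate, fixation_fraction,
                         deletion_rate, generations):
    """Deterministic toy trajectory of TE family copy number.

    n0                  -- starting number of copies
    transposition_rate  -- new insertions per copy per generation
    fixation_fraction   -- share of new insertions that become fixed
    deletion_rate       -- copies lost (deleted or eroded) per copy
                           per generation
    generations         -- number of generations to simulate

    Returns a list with the copy number at each generation,
    starting with n0.
    """
    if n0 < 0 or generations < 0:
        raise ValueError("n0 and generations must be non-negative")
    copies = float(n0)
    trajectory = [copies]
    for _ in range(generations):
        gained = transposition_rate * fixation_fraction * copies
        lost = deletion_rate * copies
        copies = max(0.0, copies + gained - lost)
        trajectory.append(copies)
    return trajectory
```

In this caricature a family expands only when the product of the transposition rate and the fixation fraction exceeds the deletion rate, and decays otherwise. Real families violate every simplifying assumption here, since all three rates change over time and depend on the TE, host, and environmental factors discussed in this section.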

TE Insertion Preference

A critical determinant of the fate of a TE is where it initially inserts in the genome. Studies have documented three general patterns: (i) TEs with apparently little insertional bias; (ii) TEs favoring insertion in genomic regions that minimize their deleterious effects; (iii) TEs targeting sites that likely facilitate their subsequent propagation.

Mechanistically, insertion location is dictated by the nuclease that catalyzes chromosomal integration. Because all TE-encoded nucleases have some degree of substrate specificity for particular DNA or chromatin attributes, it follows that virtually all TEs show some level of insertion specificity. At the lowest level of specificity are TEs with nucleases that recognize highly degenerate or short sequence motifs, such as L1 elements.

Many TEs show much stronger insertion specificity, and a common theme involves targeting genomic sites where insertions are unlikely to disrupt cell function. A classic example includes several families of LINEs, which precisely target ribosomal RNA gene arrays. Targeting “safe havens” enables TEs to colonize compact genomes with little intergenic space. For example, all TEs in baker’s yeast are LTR elements that have evolved integration strategies to avoid genes.

A wide variety of TEs are known to target the 5’ upstream region of genes, alleviating the likelihood of disrupting coding sequences and placing the newly inserted TE in a chromatin environment promoting further expression. Another mitigating strategy is for TEs to target other TEs.

Features Affecting the Long-Term Retention of TEs

All new TE insertions are subject to natural selection acting at the level of host fitness. The three major factors driving the deleterious effects of TE insertions are: disruption of gene expression, toxic effects of TE transcripts or protein products, and increased frequency of ectopic recombination between copies of the same TE family.

Current data point to ectopic recombination as the predominant factor affecting TE fixation in various species. If correct, then longer TEs should be strongly selected against due to their increased likelihood of initiating recombination. This likely explains why LTR and LINE retroelements tend to accumulate in regions with low recombination rates, while shorter elements such as SINEs and MITEs accumulate in gene-rich regions, which are generally characterized by higher recombination rates.

A second factor driving differential patterns of retention between TE types is their potential effect on gene expression. Since autonomous elements carry their own promoters and regulatory elements, they have a greater likelihood of disrupting expression of nearby genes upon insertion.

Horizontal Transposon Transfer

Sex provides the primary mechanism for the spread of TEs within populations, but horizontal transfer of TEs (HTT) is another important factor in their long-term success, and one which occurs regularly on evolutionary timescales. All major groups of TEs undergo HTT, but it is particularly common for some families. Notably, many DDE-type DNA transposons appear to pass between species with relative ease, whereas HTT events involving non-LTR retroelements are rare in comparison.

One possible explanation is that some DNA transposons have evolved mechanisms that reduce their dependence on specific host factors. The nature of transposition intermediates may also explain why some TEs can propagate horizontally more efficiently than others.

Circumventing Host Defense Systems

Numerous host-encoded systems control TE activity, leading to inventive strategies by TEs to escape repression. One spectacular example is that of I-elements in D. melanogaster oogenesis. I-elements preferentially retrotranspose in the oocyte, but their RNA intermediates are exclusively produced in the nurse cells that surround the developing oocyte, limiting their exposure to piRNA silencing.

CONCLUSIONS

TEs exist in all domains of life, but their abundance and omnipresence in eukaryotes attest to their profound influence on genome architecture and organismal evolution. TEs account for the majority of the cis-regulatory DNA introduced into the human genome during primate evolution and have given birth to numerous proteins coopted for mammalian physiology and development. Their movement, rearrangement, and regulatory activities can also cause a plethora of diseases.

Revolutionary advances in DNA sequencing have triggered a major shift in TE research to ‘genome-wide’ studies where virtually all TEs residing within any genome can be identified, compared, and interrogated for their regulatory activities. While it was quickly realized that most TEs in any given species are inactive relics of past invasions, such genome-wide studies revealed how TEs have fueled genome evolution.

TE research continues to be predominantly concerned with understanding their large-scale effects on genome architecture and function. But it is important not to lose sight of the fact that we can only interpret these effects when armed with an understanding of the mechanisms that promoted the propagation of the elements in the first place.

No two TE families look or behave exactly the same. Consequently, the effects of TEs on their host genomes are as varied as the TEs themselves. It is therefore of paramount importance to continue cataloging and organizing TE diversity in a wide range of species. Detailed studies of the molecular mechanisms and cellular activities of individual elements should also be encouraged, with priority given to TEs from widespread yet poorly characterized groups, such as Helitrons, Maverick/Polintons, or YR elements. While genomes are often dominated by defective and immobile elements, today’s technology offers the ability to revive these elements and reveal the idiosyncratic features that make each TE uniquely fascinating.
