Chapter 9

DNA-Based Information Technologies

Textbook pages 1150–1283 (Lehninger, 8e) · 25 MCQs below · Source: printed chapter text extracted from the PDF

This is a methods chapter, a necessary prelude to much of what comes later in this book. It is organized around just a few straightforward principles:

An organism’s DNA (its genome) is the ultimate source of biological information. Genomic information is a resource of unparalleled importance for investigators studying any aspect of biology. Genomes vary in size but all are large enough to direct all aspects of an organism’s structure and function. To approach them often requires tools to break them into small parts that are experimentally digestible.

Genomic information is accessible. Advances in DNA sequencing (Chapter 8) are being matched by new approaches to understanding how chromosomal information is expressed and regulated on a genomic and cellular scale. Important clues to protein function are embedded in the sequences of the genes that encode them.

Genomic information is malleable. We can not only elucidate cellular genomic information; we can also change it. That capacity provides a path to altering any aspect of cellular metabolism, structure, or function.

The word “genome,” coined by German botanist Hans Winkler in 1920, was derived from the Greek words genesis and soma to describe a body of genes. A genome today is defined as the complete haploid genetic complement of an organism. In essence, a genome is one copy of the hereditary information required to specify the organism. For sexually reproducing organisms, the genome includes one set of autosomes and one of each type of sex chromosome. When cells have organelles that also contain DNA, the genetic content of the organelles is not considered part of the nuclear genome. Mitochondria, found in most eukaryotic cells, and chloroplasts, in the light-harvesting cells of photosynthetic organisms, each have their own distinct genome.
For viruses, which can have genetic material composed of DNA or RNA, the genome is a complete copy of the nucleic acid required to specify the virus.

As objects of study, DNA molecules present a special problem: their size. Chromosomes are far and away the largest biomolecules in any cell. How do researchers find the information they seek when it is just a small part of a chromosome that can include millions or even billions of contiguous base pairs? Decades of advances by thousands of scientists working in genetics, biochemistry, cell biology, and physical chemistry came together in the laboratories of Paul Berg, Herbert Boyer, and Stanley Cohen to yield the first techniques for locating, isolating, preparing, and studying small segments of DNA derived from much larger chromosomes. The science of genomics is dedicated to the study of DNA on a cellular scale. In turn, genomics contributes to systems biology, the study of biochemistry on the scale of whole cells and organisms.

The methods described in this chapter were built on advances in our understanding of DNA and RNA metabolism that are not presented in this text until Part III. Fundamental concepts of DNA replication, RNA transcription, protein synthesis, and gene regulation are intrinsic to an appreciation for how these methods work. Yet all facets of modern biochemistry rely on these same methods to such an extent that a current treatment of any aspect of the discipline becomes very difficult without a proper introduction to them. By presenting these technologies early in the book, we acknowledge that they are inextricably interwoven with both the advances that gave rise to them and the newer discoveries they now make possible. The background we necessarily provide makes the discussion here not just an introduction to technology but also a preview of many of the fundamentals of DNA and RNA biochemistry encountered in later chapters.
We begin by outlining the principles of DNA cloning, then illustrate the range of applications and the potential of many newer technologies that support and accelerate the advance of biochemistry.

9.1 Studying Genes and Their Products

A researcher has isolated a new enzyme that she knows is the key to a human disease. She hopes to isolate large amounts of the protein to crystallize it for structural analysis and to study it. She wants to alter amino acid residues at its active site so that she can understand the reaction it catalyzes. She plans an elaborate research program to elucidate how this enzyme interacts with, and is regulated by, other proteins in the cell. All of this, and much more, becomes possible if she can isolate the gene encoding her enzyme. Unfortunately, that gene consists of just a few thousand base pairs within a human chromosome with a size measured in hundreds of millions of base pairs. How does she isolate the small segment that she needs and then study it? The answer lies in DNA cloning and methods developed to manipulate cloned genes.

Genes Can Be Isolated by DNA Cloning

A clone is an identical copy. This term originally applied to cells of a single type, isolated and allowed to reproduce to create a population of identical cells. When applied to DNA, a clone represents many identical copies of a particular gene segment. In brief, our researcher must separate the gene from the larger chromosome, attach it to a much smaller piece of carrier DNA, and allow microorganisms to make many copies of it. This is the process of DNA cloning. The result is selective amplification of a particular gene or DNA segment so that its genetic information may be studied and utilized.

Classically, the cloning of DNA from any organism entails five general procedures:
1. Obtaining the DNA segment to be cloned.
Enzymes called restriction endonucleases act as precise molecular scissors, recognizing specific sequences in DNA and cleaving genomic DNA into smaller fragments suitable for cloning. Alternatively, genomic DNA can be sheared randomly into fragments of a desired size. Since the sequence of targeted genomic regions is often known (available in databases), DNA segments to be cloned are most often amplified by the polymerase chain reaction (PCR) or are simply synthesized (both methods are described in Chapter 8).
2. Selecting a small molecule of DNA capable of autonomous replication. These small DNAs are called cloning vectors (a vector is a carrier or delivery agent). Most cloning vectors used in the laboratory are modified versions of naturally occurring small DNA molecules found in bacteria or eukaryotes. Viral DNAs may also play this role.
3. Joining two DNA fragments covalently. The enzyme DNA ligase links the cloning vector to the DNA fragment to be cloned. Composite DNA molecules of this type, comprising covalently linked segments from two or more sources, are called recombinant DNAs.
4. Moving recombinant DNA from the test tube to a host organism. The host organism provides the enzymatic machinery for DNA replication.
5. Selecting or identifying host cells that contain recombinant DNA. The cloning vector generally has features that allow the host cells to survive in an environment in which cells lacking the vector would die. Cells containing the vector are thus “selectable” in that environment.

The methods used to accomplish these and related tasks are collectively referred to as recombinant DNA technology or, more informally, genetic engineering.

Much of our initial discussion focuses on DNA cloning in the bacterium Escherichia coli, the first organism used for recombinant DNA work and still the most common host cell. E.
coli has many advantages: its DNA metabolism (like many other of its biochemical processes) is well understood; many naturally occurring cloning vectors associated with E. coli, such as plasmids and bacteriophages (bacterial viruses; also called phages), are readily available; and techniques are available for moving DNA expeditiously from one bacterial cell to another. The principles discussed here are broadly applicable to DNA cloning in other organisms, a topic discussed more fully later in the section.

Restriction Endonucleases and DNA Ligases Yield Recombinant DNA

A set of enzymes (Table 9-1) made available through decades of research on nucleic acid metabolism is indispensable for generating and propagating a recombinant DNA molecule (Fig. 9-1). First, restriction endonucleases (also called restriction enzymes) recognize and cleave DNA at specific sequences (recognition sequences or restriction sites) to generate a set of smaller fragments. Second, the DNA fragment to be cloned is joined to a suitable cloning vector by using DNA ligases to link the DNA molecules together. The recombinant vector is then introduced into a host cell, which amplifies the fragment in the course of many generations of cell division.

TABLE 9-1 Some Enzymes Used in Recombinant DNA Technology
Enzyme(s) | Function
Type II restriction endonucleases | Cleave DNA molecules at specific base sequences
DNA ligase | Joins two DNA molecules or fragments
DNA polymerase I (E. coli) | Fills gaps in duplexes by stepwise addition of nucleotides to 3′ ends
Reverse transcriptase | Makes a DNA copy of an RNA molecule
Polynucleotide kinase | Adds a phosphate to the 5′-OH end of a polynucleotide to label it or to permit ligation
Terminal transferase | Adds homopolymer tails to the 3′-OH ends of a linear duplex
Exonuclease III | Removes nucleotide residues from the 3′ ends of a DNA strand
Bacteriophage λ exonuclease | Removes nucleotides from the 5′ ends of a duplex to expose single-stranded 3′ ends
Alkaline phosphatase | Removes terminal phosphates from the 5′ end or 3′ end (or both)

FIGURE 9-1 Schematic illustration of DNA cloning. A cloning vector and eukaryotic chromosomes are separately cleaved with the same restriction endonuclease. (A single chromosome is shown here for simplicity.) The fragments to be cloned are then ligated to the cloning vector. The resulting recombinant DNA (only one recombinant vector is shown here) is introduced into a host cell, where it can be propagated (cloned). Note that this drawing is not to scale: the size of the E. coli chromosome relative to that of a typical cloning vector (such as a plasmid) is much greater than depicted here.

Restriction endonucleases are found in a wide range of bacterial species. As Werner Arber discovered in the early 1960s, the biological function of restriction endonucleases is to recognize and cleave foreign DNA (the DNA of an infecting virus, for example); such DNA is said to be restricted. In the host cell’s DNA, the sequence that would be recognized by one of its own restriction endonucleases is protected from digestion by methylation of the DNA, catalyzed by a specific DNA methylase. The restriction endonuclease and the corresponding methylase are sometimes referred to as a restriction-modification system.

There are three types of restriction endonucleases, designated I, II, and III. Types I and III are generally large, multisubunit complexes containing both the endonuclease and methylase activities. Type II restriction endonucleases, first isolated by Hamilton Smith in 1970, are simpler, require no ATP, and catalyze the hydrolytic cleavage of particular phosphodiester bonds in the DNA within the recognition sequence itself. The extraordinary utility of this group of restriction endonucleases was demonstrated by Daniel Nathans, who first used them to develop novel methods for mapping and analyzing genes and genomes.
Thousands of type II restriction endonucleases have been discovered in different bacterial species, and more than 100 different DNA sequences are recognized by one or more of these enzymes. The recognition sequences are usually 4 to 6 bp long and are palindromic (see Fig. 8-18). Table 9-2 lists sequences recognized by a few type II restriction endonucleases.

TABLE 9-2 Recognition Sequences for Some Type II Restriction Endonucleases
BamHI, HindIII, ClaI, NotI, EcoRI, PstI, EcoRV, PvuII, HaeIII, Tth111I
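The palindromic property can be checked directly: a recognition sequence is palindromic when it reads the same as its own reverse complement. The following Python sketch illustrates this for the enzymes named in Table 9-2. The recognition sequences in the dictionary are taken from standard enzyme references rather than from the text above, and the function names and the demo sequence are hypothetical, chosen only for illustration.

```python
import re

# Recognition sequences (5'->3', one strand) for the enzymes named in
# Table 9-2. ASSUMPTION: values come from standard enzyme references,
# not from the text above. N denotes any base.
RECOGNITION_SITES = {
    "BamHI": "GGATCC",
    "ClaI": "ATCGAT",
    "EcoRI": "GAATTC",
    "EcoRV": "GATATC",
    "HaeIII": "GGCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "PstI": "CTGCAG",
    "PvuII": "CAGCTG",
    "Tth111I": "GACNNNGTC",
}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (N pairs with N)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(site):
    """A site is palindromic if it equals its own reverse complement."""
    return site == reverse_complement(site)


def find_sites(dna, site):
    """Return 0-based start positions of all matches, overlaps included."""
    pattern = "(?=" + site.replace("N", "[ACGT]") + ")"
    return [m.start() for m in re.finditer(pattern, dna)]


# Hypothetical demo sequence containing two EcoRI sites and one BamHI site.
demo = "TTGAATTCAAGGATCCTTGAATTCGG"

if __name__ == "__main__":
    for name, site in RECOGNITION_SITES.items():
        print(name, site, is_palindromic(site), find_sites(demo, site))
```

Because the sites are palindromic, scanning one strand is enough to locate every site in the duplex.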

Note: Arrows indicate the phosphodiester bonds cleaved by each restriction endonuclease. Asterisks indicate bases that are methylated by the corresponding methylase (where known). N denotes any base. Note that the name of each enzyme consists of a three-letter abbreviation of the bacterial species from which it is derived, sometimes followed by a strain designation and roman numerals to distinguish different restriction endonucleases isolated from the same bacterial species. Thus BamHI is the first (I) restriction endonuclease characterized from Bacillus amyloliquefaciens, strain H.

Some restriction endonucleases make staggered cuts on the two DNA strands, leaving two to four nucleotides of one strand unpaired at each resulting end. These unpaired strands are referred to as sticky ends (Fig. 9-2a) because they can base-pair with each other or with complementary sticky ends of other DNA fragments. Other restriction endonucleases cleave both strands of DNA straight across, at opposing phosphodiester bonds, leaving no unpaired bases on the ends, often called blunt ends (Fig. 9-2b).

FIGURE 9-2 Use of restriction endonucleases in cloning. (a) Restriction endonucleases recognize and cleave only specific sequences, leaving either sticky ends (with protruding single strands) or blunt ends. Fragments can be ligated to other DNAs, such as the cleaved cloning vector (a plasmid) shown here. This reaction is facilitated by the annealing of complementary sticky ends. Ligation is less efficient for DNA fragments with blunt ends than for those with complementary sticky ends, and DNA fragments with different (noncomplementary) sticky ends generally are not ligated. (b) DNA that has been amplified by the polymerase chain reaction (see Fig. 8-33) can be cloned. The primers can include noncomplementary ends that have a site for cleavage by a restriction endonuclease.
Although these parts of the primers do not anneal to the target DNA, the PCR process incorporates them into the DNA that is amplified. Cleavage of the amplified fragments at these sites creates sticky ends, used in ligation of the amplified DNA to a cloning vector. (c) A synthetic DNA fragment with recognition sequences for several restriction endonucleases can be inserted into a plasmid that has been cleaved by a restriction endonuclease. The insert is called a linker; an insert with multiple restriction sites is generally called a multiple cloning site (MCS).

The gene or DNA segment to be cloned is most often generated by the polymerase chain reaction. Careful design of the primers used for PCR (see Fig. 8-33) can alter the amplified segment by the inclusion, at each end, of additional DNA not present in the chromosome that is being targeted. For example, including restriction endonuclease cleavage sites can facilitate the subsequent cloning of the amplified DNA (Fig. 9-2c).

After the target DNA fragment is prepared and digested with the appropriate restriction enzyme, DNA ligase can be used to join it to a vector digested by the same restriction endonuclease; a fragment generated by EcoRI, for example, generally will not link to a fragment generated by BamHI. As described in more detail in Chapter 25 (see Fig. 25-15), DNA ligase catalyzes the formation of new phosphodiester bonds in a reaction that uses ATP or a similar cofactor. The base pairing of complementary sticky ends greatly facilitates the ligation reaction (Fig. 9-2a). Blunt ends can also be ligated, albeit less efficiently.

Researchers can create new DNA sequences for a wide range of purposes by inserting synthetic DNA fragments, called linkers, to bridge the ends that are being ligated. An inserted DNA fragment with multiple recognition sequences for restriction endonucleases (often useful later as points for inserting additional DNA by cleavage and ligation) is called the multiple cloning site (MCS) (Fig.
9-2d).

The effectiveness of sticky ends in selectively joining two DNA fragments was apparent in the earliest recombinant DNA experiments. Before restriction endonucleases were widely available, some investigators found they could generate sticky ends by the combined action of the bacteriophage λ exonuclease and terminal transferase (Table 9-1). The fragments to be joined were given complementary homopolymeric tails. Peter Lobban and Dale Kaiser used this method in 1971 in the first experiments to join naturally occurring DNA fragments. Similar methods were used soon after in Paul Berg’s laboratory to join DNA segments from simian virus 40 (SV40) to DNA derived from bacteriophage λ, thereby creating the first recombinant DNA molecule with DNA segments from different species.

Cloning Vectors Allow Amplification of Inserted DNA Segments

The factors that govern the delivery of recombinant DNA in clonable form to a host cell, and its subsequent amplification in the host, are well illustrated in three popular cloning vectors: plasmids and bacterial artificial chromosomes, used in experiments with E. coli, and a vector used to clone large DNA segments in yeast.

Plasmids

A plasmid is a circular DNA molecule that replicates separately from the host chromosome. Naturally occurring bacterial plasmids range in size from 5,000 to 400,000 bp. Many of the plasmids found in bacterial populations are little more than molecular parasites, similar to viruses but with a more limited capacity to transfer from one cell to another. To survive in the host cell, plasmids incorporate several specialized sequences that enable them to make use of the cell’s resources for their own replication and gene expression. Naturally occurring plasmids usually have a symbiotic role in the cell. They may provide genes that confer resistance to antibiotics or that perform new functions for the cell.
For example, the Ti plasmid of Agrobacterium tumefaciens allows the host bacterium to colonize the cells of a plant and make use of the plant’s resources.

The same properties that enable plasmids to grow and survive in a bacterial or eukaryotic host are useful to molecular biologists who want to engineer a vector for cloning a specific DNA segment. Constructed in 1977, one of the first recombinant vectors, the E. coli plasmid pBR322, illustrates some key features that define a useful cloning vector (Fig. 9-3):
1. The plasmid pBR322 has an origin of replication, or ori, a sequence where replication is initiated by cellular enzymes (see Chapter 25). This sequence is required to propagate the plasmid. An associated regulatory system is present that limits replication to maintain pBR322 at a level of 10 to 20 copies per cell.
2. The plasmid contains genes that confer resistance to the antibiotics ampicillin (AmpR) and tetracycline (TetR), allowing the selection of cells that contain the intact plasmid or a recombinant version of the plasmid (discussed below).
3. Several unique recognition sequences in pBR322 are targets for restriction endonucleases (PstI, EcoRI, BamHI, SalI, and PvuII), providing sites where the plasmid can be cut to insert foreign DNA.
4. The small size of the plasmid (4,361 bp) facilitates its entry into cells and the biochemical manipulation of the DNA. This small size was the result of trimming away many DNA segments from a larger, parent plasmid (sequences that the biochemist does not need).

FIGURE 9-3 The constructed E. coli plasmid pBR322. Notice the location of some important restriction sites, for PstI, EcoRI, BamHI, SalI, and PvuII; genes for ampicillin and tetracycline resistance (AmpR and TetR); and the replication origin (ori). Constructed in 1977, this was one of the early plasmids designed expressly for cloning in E. coli.
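Whether a fragment can be joined to a vector cut at one of these sites depends on the ends each enzyme leaves, which is why an EcoRI fragment generally will not link to a BamHI fragment. The Python sketch below models that rule for palindromic sites. The cut positions are assumptions taken from standard enzyme references, not from the text, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Enzyme:
    name: str
    site: str  # palindromic recognition sequence, 5'->3'
    cut: int   # number of bases left of the cut on the strand shown

    def end(self):
        """Return (kind, overhang) describing the ends this enzyme leaves."""
        n = len(self.site)
        opposite = n - self.cut  # the same cut mirrored onto the other strand
        lo, hi = sorted((self.cut, opposite))
        if lo == hi:
            return ("blunt", "")
        kind = "5' overhang" if self.cut < opposite else "3' overhang"
        return (kind, self.site[lo:hi])


def ends_compatible(a, b):
    """True if ends left by a and b can be ligated.

    Sticky ends must be of the same kind and complementary. Two blunt
    ends are reported as compatible, although the text notes that such
    ligation is less efficient.
    """
    kind_a, over_a = a.end()
    kind_b, over_b = b.end()
    return kind_a == kind_b and over_a == reverse_complement(over_b)


# ASSUMPTION: cut positions below come from standard enzyme references.
ECORI = Enzyme("EcoRI", "GAATTC", 1)
BAMHI = Enzyme("BamHI", "GGATCC", 1)
PSTI = Enzyme("PstI", "CTGCAG", 5)
PVUII = Enzyme("PvuII", "CAGCTG", 3)
ECORV = Enzyme("EcoRV", "GATATC", 3)

if __name__ == "__main__":
    for enzyme in (ECORI, BAMHI, PSTI, PVUII, ECORV):
        print(enzyme.name, enzyme.end())
    print("EcoRI + BamHI:", ends_compatible(ECORI, BAMHI))
```

The model covers only enzymes that cut inside a palindromic site; it says nothing about ligation efficiency or about restoring the recognition sequence after joining.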
The replication origins inserted in common plasmid vectors were originally derived from naturally occurring plasmids. As in pBR322, each of these origins is regulated to maintain a particular plasmid copy number. Depending on the origin used, the plasmid copy number can vary from one to hundreds or thousands per cell, providing many options for investigators. Two different plasmids cannot function in the same cell if they use the same origin of replication, because the regulation of one will interfere with the replication of the other. Such plasmids are said to be incompatible. When a researcher wants to introduce two or more different plasmids into a bacterial cell, each plasmid must have a different replication origin.

In the laboratory, small plasmids can be introduced into bacterial cells by a process called transformation. The cells (often E. coli, but other bacterial species are also used) and plasmid DNA are incubated together at 0°C in a calcium chloride solution, then are subjected to heat shock by rapidly shifting the temperature to between 37°C and 43°C. For reasons not well understood, some of the cells treated in this way take up the plasmid DNA. Some species of bacteria, such as Acinetobacter baylyi, are naturally competent for DNA uptake and do not require the calcium chloride–heat shock treatment. In an alternative method, called electroporation, cells incubated with the plasmid DNA are subjected to a high-voltage pulse, which transiently renders the bacterial membrane permeable to large molecules.

Regardless of the approach, relatively few cells take up the plasmid DNA, so a method is needed to identify those that do. The usual strategy is to utilize one of two types of genes in the plasmid, referred to as selectable and screenable markers. A selectable marker either permits the growth of a cell (positive selection) or kills the cell (negative selection) under a defined set of conditions.
The plasmid pBR322 provides markers for both positive and negative selection (Fig. 9-4). A screenable marker is a gene encoding a protein that causes the cell to produce a colored or fluorescent molecule. Cells are not harmed when the gene is present, and the cells that carry the plasmid are easily identified by the colored or fluorescent colonies they produce.
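One common way to use the two pBR322 resistance genes can be sketched as a small truth table. The text above does not spell out the scheme, so the following rests on an assumption: foreign DNA is inserted at the PstI site, which lies inside the ampicillin-resistance gene and so inactivates it, and colonies are compared on tetracycline plates and on plates containing both antibiotics. The names in this Python sketch are hypothetical.

```python
def grows(resistances, plate):
    """A cell forms a colony only if it resists every antibiotic present."""
    return all(drug in resistances for drug in plate)


# Resistances carried by each kind of cell after transformation.
# ASSUMPTION: the insert disrupts the ampicillin-resistance gene.
CELLS = {
    "no plasmid": set(),
    "pBR322 without insert": {"ampicillin", "tetracycline"},
    "pBR322 with insert": {"tetracycline"},
}


def classify(resistances):
    """Classify a cell from its growth on the two kinds of plate."""
    on_tet = grows(resistances, {"tetracycline"})
    on_tet_amp = grows(resistances, {"tetracycline", "ampicillin"})
    if not on_tet:
        return "no plasmid"
    if on_tet_amp:
        return "plasmid without insert"
    return "recombinant plasmid"


if __name__ == "__main__":
    for label, resistances in CELLS.items():
        print(label, "->", classify(resistances))
```

Under these assumptions, growth on tetracycline shows that a cell took up the plasmid, and failure to grow when ampicillin is also present marks the colonies whose plasmid carries an insert.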

FIGURE 9-4 Use of pBR322 to clone foreign DNA in E. coli and identify cells containing the DNA.

Transformation of typical bacterial cells with purified DNA (never a very efficient process) becomes less successful as plasmid size increases, and it is difficult to clone DNA segments longer than about 15,000 bp when plasmids are used as the vector.

To illustrate the use of a plasmid as a cloning vector, consider the bacterial gene encoding a recombinase called the RecA protein (see Chapter 25). In most bacteria, the gene encoding RecA is one of the thousands of genes on a chromosome millions of base pairs long. The recA gene is just over 1,000 bp long. A plasmid would be a good choice for cloning a gene of this size. As described later, the cloned gene can be altered in a variety of ways, and the gene variants can be expressed at high levels to enable purification of the encoded protein.

Bacterial Artificial Chromosomes

Researchers sometimes want to clone much longer DNA segments than can typically be incorporated into standard plasmid cloning vectors such as pBR322. To meet this need, plasmid vectors have been developed with special features that allow the cloning of very long segments (typically 100,000 to 300,000 bp) of DNA. Once such large segments of cloned DNA have been added, these vectors are large enough to be thought of as chromosomes and are known as bacterial artificial chromosomes, or BACs (Fig. 9-5).

FIGURE 9-5 Bacterial artificial chromosomes (BACs) as cloning vectors. The vector is a relatively simple plasmid, with a replication origin (ori) that directs replication. The par genes assist in the even distribution of plasmids to daughter cells at cell division. This increases the likelihood of each daughter cell carrying one copy of the plasmid, even when few copies are present. The low number of copies is useful in cloning large segments of DNA, because this limits the opportunities for unwanted recombination reactions that can unpredictably alter large cloned DNAs over time. The BAC includes selectable markers. A lacZ gene (required for the production of the enzyme β-galactosidase) is situated in the cloning region such that it is inactivated by cloned DNA inserts. Introduction of recombinant BACs into cells by electroporation is promoted by the use of cells with an altered (more porous) cell wall. Recombinant DNAs are screened for resistance to the antibiotic chloramphenicol (CamR). Plates also contain X-gal, a substrate for β-galactosidase that yields a blue product. Colonies with active β-galactosidase, and hence no DNA insert in the BAC vector, turn blue; colonies without β-galactosidase activity, and thus with the desired DNA inserts, are white.

A BAC vector (without any cloned DNA inserted) is a relatively simple plasmid, generally not much larger than other plasmid vectors. To accommodate very long segments of cloned DNA, BAC vectors have stable origins of replication that maintain the plasmid at one or two copies per cell. The low copy number is useful in cloning large segments of DNA, because it limits the opportunities for unwanted recombination reactions that can unpredictably alter large cloned DNAs over time. BACs also include par genes, derived from a type of plasmid called an F plasmid.
The par genes encode proteins that direct the reliable distribution of the recombinant chromosomes to daughter cells at cell division, thereby increasing the likelihood of each daughter cell carrying one copy, even when few copies are present.

The BAC vector includes both selectable and screenable markers. The BAC vector shown in Figure 9-5 contains a gene that confers resistance to the antibiotic chloramphenicol (CamR). Vector-containing cells can be selected by growing them on agar plates containing this antibiotic, a positive selection, as the cells with the vector survive. A lacZ gene, required for production of the enzyme β-galactosidase, is a screenable marker that can reveal which cells contain plasmids (now chromosomes) that incorporate the cloned DNA segments. The β-galactosidase catalyzes conversion of the colorless molecule 5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside (more simply, X-gal) to a blue product. If the gene is intact and expressed, the colony containing it is blue. If gene expression is disrupted by the introduction of a cloned DNA segment, the colony is white.

Yeast Artificial Chromosomes

As with E. coli, yeast genetics is a well-developed discipline. Research on large genomes and the associated need for high-capacity cloning vectors led to the development of yeast artificial chromosomes, or YACs (Fig. 9-6). As with BACs, YAC vectors can be used to clone very long segments of DNA. In addition, the DNA cloned in a YAC can be altered to study the function of specialized sequences in chromosome metabolism, mechanisms of gene regulation and expression, and many other aspects of eukaryotic molecular biology.
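The combined selection and screen described for the BAC vector reduces to two questions per cell: does it carry the vector, and is the vector's lacZ gene still intact? A minimal Python sketch of that logic follows. It assumes the host strain has no functional lacZ of its own, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    has_bac: bool             # carries the BAC vector, hence CamR
    has_insert: bool = False  # cloned DNA interrupts lacZ in the vector


def colony_color(cell, chloramphenicol=True):
    """Return 'blue', 'white', or None if no colony forms on X-gal plates.

    ASSUMPTION: the host has no functional lacZ of its own, so
    beta-galactosidase activity can come only from an intact vector gene.
    """
    if chloramphenicol and not cell.has_bac:
        return None  # selection: cells lacking the vector do not survive
    lacz_intact = cell.has_bac and not cell.has_insert
    return "blue" if lacz_intact else "white"  # screen: X-gal turns blue


if __name__ == "__main__":
    print(colony_color(Cell(has_bac=False)))
    print(colony_color(Cell(has_bac=True)))
    print(colony_color(Cell(has_bac=True, has_insert=True)))
```

On a plate containing chloramphenicol and X-gal, the white colonies are therefore the ones to pick.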

FIGURE 9-6 Construction of a yeast artificial chromosome (YAC). A YAC vector includes an origin of replication (ori), a centromere (CEN), two telomeres (TEL), and selectable markers (X and Y). Digestion with BamHI and EcoRI generates two separate DNA arms, each with a telomeric end and one selectable marker. A large segment of DNA (e.g., up to 2 × 10⁶ bp from the human genome) is ligated to the two arms to create a yeast artificial chromosome. The YAC transforms yeast cells (prepared by removal of the cell wall to form spheroplasts), and the cells are selected for X and Y; the surviving cells propagate the DNA insert.

The genome of Saccharomyces cerevisiae contains only 14 × 10⁶ bp (less than four times the size of the E. coli chromosome), and its entire sequence is known. Yeast is also very easy to maintain and grow on a large scale in the laboratory. Plasmid vectors have been constructed for insertions into yeast cells, employing the same principles that govern the use of E. coli vectors. Convenient methods for moving DNA into and out of yeast cells permit the study of many aspects of eukaryotic cell biochemistry. Some recombinant plasmids incorporate multiple replication origins and other elements that allow them to be used in more than one species (e.g., in yeast and in E. coli). Plasmids that can be propagated in cells of two or more species are called shuttle vectors.

YAC vectors contain all the elements needed to maintain a eukaryotic chromosome in the yeast nucleus: a yeast origin of replication, two selectable markers, and specialized sequences (derived from the telomeres and centromere) needed for stability and proper segregation of the chromosomes at cell division (see Chapter 24). In preparation for its use in cloning, the vector is propagated as a circular bacterial plasmid and then isolated and purified. Cleavage with a restriction endonuclease (BamHI in Fig.
9-6) removes a length of DNA between two telomere sequences (TEL), leaving the telomeres at the ends of the linearized DNA. Cleavage at another internal site (by EcoRI in Fig. 9-6) divides the vector into two DNA segments, referred to as vector arms, each with a different selectable marker.

Genomic DNA to be cloned is prepared by partial digestion with restriction endonucleases to obtain a suitable fragment size. Genomic fragments are then separated by pulsed field gel electrophoresis, a variation of gel electrophoresis (see Fig. 3-18) that segregates very large DNA segments. DNA fragments of appropriate size (up to about 2 × 10⁶ bp) are mixed with the prepared vector arms and ligated. The ligation mixture is then used to transform yeast cells (pretreated to partially degrade their cell walls) with these very large DNA molecules, which now have the structure and size to be considered yeast chromosomes. Culture on a medium that requires the presence of both selectable marker genes ensures the growth of only those yeast cells that contain an artificial chromosome with a large insert sandwiched between the two vector arms (Fig. 9-6). The stability of YAC clones increases with the length of the cloned DNA segment (up to a point). Those with inserts of more than 150,000 bp are nearly as stable as normal cellular chromosomes, whereas those with inserts of fewer than 100,000 bp are gradually lost during mitosis (so, generally, there are no yeast cell clones carrying only the two vector ends ligated together or vectors with only short inserts). YACs that lack a telomere at either end are rapidly degraded.

Cloned Genes Can Be Expressed to Amplify Protein Production

Frequently, the product of a cloned gene, rather than the gene itself, is of primary interest, particularly when the protein has commercial, therapeutic, or research value. Proteins are encoded by genes in DNA; alter the DNA in a gene, and one can alter the protein product of that gene.
Biochemists use purified proteins for many purposes, including to elucidate protein function, study reaction mechanisms, generate antibodies to the proteins, reconstitute complex cellular activities in the test tube with purified components, and examine protein binding partners. With an increased understanding of the fundamentals of DNA, RNA, and protein metabolism and their regulation in a host organism such as E. coli or yeast, investigators can manipulate cells to express cloned genes in order to study their protein products. The general goal is to alter the sequences around a cloned gene to trick the host organism into producing the protein product of the gene, often at very high levels. This overexpression of a protein can make its subsequent purification much easier.

We’ll use the expression of a eukaryotic protein in a bacterium as an example. Eukaryotic genes have surrounding sequences needed for their transcription and regulation in the cells they are derived from, but these sequences do not function in bacteria. Thus, eukaryotic genes lack the DNA sequence elements required for their controlled expression in bacterial cells: promoters (sequences that instruct RNA polymerase where to bind to initiate mRNA synthesis), ribosome-binding sites (sequences that allow translation of the mRNA to protein), and additional regulatory sequences. Appropriate bacterial regulatory sequences for transcription and translation must be inserted in the vector DNA at the correct positions relative to the eukaryotic gene.

Cloning vectors with the transcription and translation signals needed for the regulated expression of a cloned gene are called expression vectors. The rate of expression of the cloned gene is controlled by replacing the gene’s normal promoter and regulatory sequences with more efficient and convenient versions supplied by the vector.
Generally, a well-characterized promoter and its regulatory elements are positioned near several unique restriction sites for cloning, so that genes inserted at the restriction sites will be expressed from the regulated promoter elements (Fig. 9-7). Some of these vectors incorporate other features, such as a bacterial ribosome-binding site to enhance translation of the mRNA derived from the gene (Chapter 27) or a transcription termination sequence (Chapter 26). In some cases, cloned genes are so efficiently expressed that their protein product represents 10% or more of the cellular protein. At these concentrations, some foreign proteins can kill the host cell (usually E. coli), so expression of the cloned gene must be limited to the few hours before the planned harvesting of the cells.

FIGURE 9-7 DNA sequences in a typical E. coli expression vector. The gene to be expressed is inserted into one of the restriction sites in the MCS, near the promoter (P), with the end of the gene encoding the amino terminus of the protein positioned closest to the promoter. The promoter allows efficient transcription of the inserted gene, and the transcription-termination sequence sometimes improves the amount and stability of the mRNA produced. The operator (O) permits regulation by a repressor that binds to it. The ribosome-binding site provides sequence signals for the efficient translation of the mRNA derived from the gene. The selectable marker allows the selection of cells containing the recombinant DNA.

Many Different Systems Are Used to Express Recombinant Proteins

Every living organism has the capacity to express genes in its genomic DNA; thus, in principle, any organism can serve as a host to express proteins from a different (heterologous) species. Almost every sort of organism has, indeed, been used for this purpose, and each host type has a particular set of advantages and disadvantages.

Bacteria

Bacteria, especially E.
coli, remain the most common hosts for protein expression. The regulatory sequences that govern gene expression in E. coli and many other bacteria are well understood and can be harnessed to express cloned proteins at high levels. Bacteria are easy to store and grow in the laboratory, on inexpensive growth media. Efficient methods also exist to get DNA into bacteria and extract DNA from them. Bacteria can be grown in huge amounts in commercial fermenters, providing a rich source of the cloned protein. Problems do exist, however. When expressed in bacteria, some heterologous proteins do not fold correctly, and many do not undergo the posttranslational modifications or proteolytic cleavage that may be necessary for their activity. Certain features of a gene sequence also can make a particular gene difficult to express in bacteria. For example, intrinsically disordered regions are more common in eukaryotic proteins. When expressed in bacteria, many eukaryotic proteins aggregate into insoluble cellular precipitates called inclusion bodies. For these and many other reasons, some eukaryotic proteins are inactive when purified from bacteria or cannot be expressed at all. To help address some of these problems, researchers are regularly developing new bacterial host strains that include enhancements such as the engineered presence of eukaryotic protein chaperones or enzymes that modify eukaryotic proteins. There are many specialized systems for expressing proteins in bacteria. The well-characterized promoter and regulatory sequences associated with the lactose operon (see Chapter 28) are often fused to the gene of interest to direct transcription. The cloned gene will be transcribed when lactose is added to the growth medium. However, regulation in the lactose system is “leaky”: it is not turned off completely when lactose is absent, a potential problem if the product of the cloned gene is toxic to the host cells.
Transcription from the Lac promoter is also not efficient enough for some applications. An alternative system uses the promoter and RNA polymerase of a bacterial virus called bacteriophage T7. If the cloned gene is fused to a T7 promoter, it is transcribed, not by the E. coli RNA polymerase, but by the T7 RNA polymerase. The gene encoding this polymerase is separately cloned into the same cell in a construct that affords tight regulation (allowing controlled production of the T7 RNA polymerase). The polymerase is also very efficient and directs high levels of expression of most genes fused to the T7 promoter. This system has been used to express the RecA protein in bacterial cells (Fig. 9-8).

FIGURE 9-8 Regulated expression of RecA protein in a bacterial cell. The gene encoding the RecA protein, fused to a bacteriophage T7 promoter, is cloned into an expression vector. Under normal growth conditions (uninduced), no RecA protein appears. When the T7 RNA polymerase is induced in the cell, the recA gene is expressed, and large amounts of RecA protein are produced. The positions of standard molecular weight markers that were run on the same gel are indicated. Yeast Saccharomyces cerevisiae is probably the best understood eukaryotic organism. The principles underlying the expression of a protein in yeast are the same as those for bacteria. Cloned genes must be linked to promoters that can direct high-level expression in yeast cells. For example, the yeast GAL1 and GAL10 genes (encoding enzymes involved in galactose metabolism) are under cellular regulation such that they are expressed when yeast cells are grown in media with galactose but shut down when the cells are grown in glucose. Thus, if a heterologous gene is expressed using these same regulatory sequences, the expression of that gene can be controlled simply by choosing an appropriate medium for cell growth. Some of the same problems that accompany protein expression in bacteria also occur with yeast. Heterologous proteins may not fold properly, yeast may lack the enzymes needed to modify the proteins to their active forms, or certain features of the gene sequence may hinder expression of a protein. However, because S. cerevisiae is a eukaryote, the expression of eukaryotic genes (especially yeast genes) is sometimes more efficient in this host than in bacteria. As yeast possess many of the same protein chaperones and modification systems of higher eukaryotes, protein products may also be folded and modified more accurately than are proteins expressed in bacteria. Insects and Insect Viruses Baculoviruses are insect viruses with double-stranded DNA genomes. 
When baculoviruses infect their insect larval hosts, they act as parasites, killing the larvae and turning them into factories for virus production. Late in the infection process, the viruses produce large amounts of two proteins (p10 and polyhedrin), neither of which is needed for production of viruses in cultured insect cells. The genes for both of these proteins can be replaced with the gene for a heterologous protein. When the resulting recombinant virus is used to infect insect cells or larvae, the heterologous protein is often produced at very high levels, up to 25% of the total protein present at the end of the infection cycle. Autographa californica multicapsid nucleopolyhedrovirus (AcMNPV; A. californica is a moth species that it infects) is the baculovirus most often used for protein expression. It has a large genome (134,000 bp), too large for direct cloning. Virus purification is also cumbersome. These problems have been solved by the creation of bacmids, large circular DNAs that include the entire baculovirus genome along with sequences that allow replication of the bacmid in E. coli (Fig. 9-9). The gene of interest is cloned into a smaller plasmid and combined with the larger plasmid by site-specific recombination in vivo (see Fig. 25-37). The recombinant bacmid is then isolated and transfected into insect cells (the term transfection is used when the DNA used for transformation includes viral sequences and leads to viral replication), followed by recovery of the protein once the infection cycle is finished. A wide range of bacmid systems are available commercially. Baculovirus systems are not successful with all proteins. However, with these systems, insect cells sometimes successfully replicate the protein-modification patterns of higher eukaryotes and produce active, correctly modified eukaryotic proteins.

FIGURE 9-9 Cloning with baculoviruses. (a) Shown here is the construction of a typical vector used for protein expression in baculoviruses. The gene of interest is cloned into a small plasmid (top left) between two sites (att) recognized by a site-specific recombinase, then is introduced into the baculovirus vector by site-specific recombination. This generates a circular DNA product that is used to infect the cells of an insect larva. The gene of interest is expressed during the infection cycle, downstream of a promoter that normally expresses a baculovirus coat protein at very high levels. (b) The photographs show larvae of the cabbage looper moth. The larva on the left is uninfected; the larva on the right was infected with a recombinant baculovirus vector expressing a protein that produces a red color.

Mammalian Cells in Culture

The most convenient way to introduce cloned genes into a mammalian cell is with viruses. This method takes advantage of the natural capacity of a virus to insert its DNA or RNA into a cell, and sometimes into the cellular chromosome. A variety of engineered mammalian viruses are available as vectors, including human adenoviruses and retroviruses. The gene of interest is cloned so that its expression is controlled by a virus promoter. The virus uses its natural infection mechanisms to introduce the recombinant genome into cells, where the cloned protein is expressed. One advantage of these systems is that proteins can be expressed either transiently (if the viral DNA is maintained separately from the host cell genome and eventually degraded) or permanently (if the viral DNA is integrated into the host cell genome). With the correct choice of host cell, the proper posttranslational modification of the protein to its active form can be ensured.
However, the growth of mammalian cells in tissue culture is very expensive, and this technology is generally used to test the function of a protein in vivo rather than to produce a protein in large amounts. Alteration of Cloned Genes Produces Altered Proteins Cloning techniques can be used not only to overproduce proteins but also to produce proteins that are altered, subtly or dramatically, from their native forms. Specific amino acids may be replaced individually by site-directed mutagenesis. This approach has greatly enhanced research on proteins by allowing investigators to make specific changes in the primary structure and examine the effects of these changes on the protein’s folding, three-dimensional structure, and activity. The amino acid sequence of the protein is changed by altering the DNA sequence of the cloned gene. If appropriate restriction sites flank the sequence to be altered, researchers can simply remove a DNA segment and replace it with a synthetic one, identical to the original except for the desired change (Fig. 9-10a). FIGURE 9-10 Two approaches to site-directed mutagenesis. (a) A synthetic DNA segment replaces a fragment removed by a restriction endonuclease. (b) A pair of synthetic and complementary oligonucleotides with a specific sequence change at one position are hybridized to a circular plasmid with a cloned copy of the gene to be altered. The mutated oligonucleotides act as primers for the synthesis of full-length double-stranded (ds) DNA copies of the plasmid that contain the specified sequence change. The blue parental strand was methylated while replicating in its host cell, prior to plasmid isolation. These plasmid copies are then used to transform cells. (c) Results from an automated sequencer (see Fig. 8-35), showing sequences from the wild-type recA gene (top) and an altered recA gene (bottom), with the triplet (codon) at position 72 changed from AAA to CGC, specifying an Arg (R) residue instead of a Lys (K) residue. 
[(c) Information from Elizabeth A. Wood, University of Wisconsin–Madison, Department of Biochemistry.]

When suitably located restriction sites are not present, oligonucleotide-directed mutagenesis can create a specific DNA sequence change (Fig. 9-10b). The cloned gene is denatured, separating the strands. Two short, complementary synthetic DNA strands, each with the desired base change, are annealed to opposite strands of the cloned gene within a suitable circular DNA vector. The mismatch of a single base pair in 30 to 40 bp does not prevent annealing. The two annealed oligonucleotides serve to prime DNA synthesis in both directions around the plasmid vector, creating two complementary strands that contain the mutation. After several cycles of selective amplification by the polymerase chain reaction (PCR; see Fig. 8-33), the mutation-containing DNA predominates in the population and can be used to transform bacteria. Most of the transformed bacteria will have plasmids carrying the mutation. For an example, we go back to the bacterial recA gene. The product of this gene, the RecA protein, has several activities (see Section 25.3) including the hydrolysis of ATP. The Lys residue at position 72 in RecA (a 352 residue polypeptide) is involved in ATP hydrolysis. Changing Lys72 to an Arg creates a variant of RecA protein that will bind, but not hydrolyze, ATP (Fig. 9-10c). The engineering and purification of this variant RecA protein has facilitated research into the roles of ATP hydrolysis in the functioning of this protein.

Changes can be introduced into a gene that involve far more than one base pair. Large parts of a gene can be deleted by cutting out a segment with restriction endonucleases and ligating the remaining portions to form a smaller gene. For example, if a protein has two domains, the gene segment encoding one of the domains can be removed so that the gene now encodes a protein with only one of the original two domains.
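The single-codon change described above for RecA (AAA to CGC, Lys to Arg) can be mimicked in a few lines of Python. This is only an illustrative sketch: the 12-nucleotide `toy_gene` is a made-up sequence, not the real recA gene, and the helper names `translate` and `mutate_codon` are hypothetical. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, written compactly: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding-strand DNA sequence (5'->3'), ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def mutate_codon(dna, position, new_codon):
    """Replace the codon at a 1-based codon position with new_codon."""
    start = 3 * (position - 1)
    if len(new_codon) != 3 or position < 1 or start + 3 > len(dna):
        raise ValueError("codon position or replacement codon is invalid")
    return dna[:start] + new_codon + dna[start + 3:]


# Hypothetical toy open reading frame (NOT the recA sequence): Met-Gly-Lys-Thr.
toy_gene = "ATGGGTAAAACC"
mutant = mutate_codon(toy_gene, 3, "CGC")  # Lys codon AAA -> Arg codon CGC

print(translate(toy_gene))  # MGKT
print(translate(mutant))    # MGRT
```

In a real experiment the same substitution would be designed into the pair of mutagenic oligonucleotides; the sketch only shows how the DNA change maps onto the protein sequence.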
Parts of two different genes can be ligated to create new combinations; the product of such a fused gene is called a fusion protein. Researchers have ingenious methods to bring about virtually any genetic alteration in vitro. After reintroducing the altered DNA into the cell, they can investigate the consequences of the alteration.

Terminal Tags Provide Handles for Affinity Purification

Affinity chromatography is one of the most efficient methods for purifying proteins (see Fig. 3-17c). Unfortunately, many proteins do not bind a ligand that can be conveniently immobilized on a column matrix. However, the gene for almost any protein can be altered to express a fusion protein that can be purified by affinity chromatography. The gene encoding the target protein is fused to a gene encoding a peptide or protein that binds a simple, stable ligand with high affinity and specificity. The peptide or protein used for this purpose is referred to as a tag. Tag sequences can be added to genes such that the resulting proteins have tags at their amino terminus or carboxyl terminus. Table 9-3 lists some of the peptides or proteins commonly used as tags.

TABLE 9-3 Commonly Used Protein Tags

Tag protein/peptide              Molecular mass (kDa)  Immobilized ligand
Protein A                        59                    Fc portion of IgG
(His)6                           0.8                   Ni2+
Glutathione-S-transferase (GST)  26                    Glutathione
Maltose-binding protein          41                    Maltose
β-Galactosidase                  116                   p-Aminophenyl-β-D-thiogalactoside (TPEG)
Chitin-binding domain            5.7                   Chitin

The general procedure can be illustrated by focusing on a system that uses the glutathione-S-transferase (GST) tag (Fig. 9-11). GST is a small enzyme (Mr 26,000) that binds tightly and specifically to glutathione. When the GST gene sequence is fused to a target gene, the fusion protein acquires the capacity to bind glutathione. The fusion protein is expressed in a host organism such as a bacterium, and a crude extract is prepared.
A column is filled with a porous matrix consisting of the ligand (glutathione) immobilized on microscopic beads of a stable polymer such as cross-linked agarose. As the crude extract percolates through this matrix, the fusion protein becomes immobilized by binding the glutathione. The other proteins in the extract are washed through the column and discarded. The interaction between GST and glutathione is tight but noncovalent, allowing the fusion protein to be gently eluted from the column with a solution containing either a higher concentration of salts or free glutathione to compete with the immobilized ligand for GST binding. The fusion protein is often obtained with good yield and high purity. In some commercially available systems, the tag can be entirely or largely removed from the purified fusion protein by a protease that cleaves a sequence near the junction between the target protein and its tag.
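When checking a purified fusion protein on a gel, it helps to have a rough expected size. The sketch below adds the tag masses from Table 9-3 to an estimate of the target protein's mass. It assumes an average residue mass of about 110 Da, which is a common rule of thumb rather than a value given in this section, and it ignores any linker or protease-site residues. The function name `fusion_mass_kda` is hypothetical.

```python
# Tag masses in kDa, taken from Table 9-3.
TAG_MASS_KDA = {
    "Protein A": 59,
    "(His)6": 0.8,
    "GST": 26,
    "Maltose-binding protein": 41,
    "beta-Galactosidase": 116,
    "Chitin-binding domain": 5.7,
}

# Assumption: average mass per amino acid residue, in daltons (rule of thumb).
AVG_RESIDUE_MASS_DA = 110


def fusion_mass_kda(tag, n_residues):
    """Estimate the mass (kDa) of a tag fused to a target protein of n_residues."""
    target_kda = n_residues * AVG_RESIDUE_MASS_DA / 1000
    return TAG_MASS_KDA[tag] + target_kda


# Example: a 352-residue target (the length given for RecA) with two different tags.
print(round(fusion_mass_kda("GST", 352), 2))     # 64.72
print(round(fusion_mass_kda("(His)6", 352), 2))  # 39.52
```

The comparison makes one design trade-off visible: a GST tag adds a large fraction of the fusion's total mass, whereas a His tag changes it very little.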

FIGURE 9-11 Use of tagged proteins in protein purification. (a) Glutathione-S-transferase (GST) is a small enzyme that binds glutathione. (b) The GST tag is fused to the carboxyl terminus of the protein by genetic engineering. The tagged protein is expressed in the cell and is present in the crude extract when the cells are lysed. The extract is subjected to affinity chromatography through a matrix with immobilized glutathione. A shorter tag with widespread application consists of a simple sequence of six or more His residues. These histidine tags, or His tags, bind tightly and specifically to nickel ions. A chromatography matrix with immobilized Ni2+ can be used to quickly separate a His-tagged protein from other proteins in an extract. Some of the larger tags, such as maltose-binding protein, provide added stability and solubility, allowing the purification of cloned proteins that are otherwise inactive due to improper folding or insolubility. Affinity chromatography using terminal tags is powerful and convenient. The tags have been successfully used in thousands of published studies; in many cases, the protein would be impossible to purify and study without the tag. However, even very small tags can affect the properties of the proteins they are attached to, thereby influencing the study results. For example, the tag may adversely affect protein folding. Even if the tag is removed by a protease, one or a few extra amino acid residues can remain behind on the target protein, which may or may not affect the protein’s activity. The types of experiments to be carried out, and the results obtained from them, should always be evaluated with the aid of well-designed controls to assess any effect of a tag on protein function. The Polymerase Chain Reaction Offers Many Options for Cloning Experiments Many adaptations of PCR have increased its utility in cloning. 
For example, sequences in RNA can be amplified if the first PCR cycle uses reverse transcriptase, an enzyme that works like DNA polymerase (see Fig. 8-33) but uses RNA as a template (Fig. 9-12a). After the DNA strand is made from the RNA template, the remaining cycles can be carried out with DNA polymerases, using standard PCR protocols. This reverse transcriptase PCR (RT-PCR) can be used, for example, to detect sequences derived from living cells (which are transcribing their DNA into RNA) as opposed to dead tissues.

FIGURE 9-12 Some applications of PCR. (a) In reverse transcriptase PCR, or RT-PCR, RNA molecules are amplified by using reverse transcriptase in the first two cycles. (b) In quantitative PCR, or qPCR, careful monitoring of the progress of a PCR amplification allows one to determine when a DNA segment has been amplified to a specified threshold level. The amount of PCR product present is determined by measuring the level of a fluorescent probe attached to a reporter oligonucleotide complementary to the DNA segment that is being amplified. Probe fluorescence is not detectable initially, due to a fluorescence quencher attached to the same oligonucleotide. When the reporter oligonucleotide pairs with its complement in a copy of the amplified DNA segment, the fluorophore is separated from the quenching molecule and fluorescence results. As the PCR reaction proceeds, the amount of the targeted DNA segment increases exponentially, and the fluorescent signal also increases exponentially as the oligonucleotide probes anneal to the amplified segments. After many PCR cycles, the signal reaches a plateau as one or more reaction components become exhausted. When a segment is present in greater amounts in one sample than another, its amplification reaches a defined threshold level earlier. The “No template” line follows the slow increase in background signal observed in a control that does not include added sample DNA.
CT is the cycle number at which the threshold is first surpassed. PCR protocols can also be used to estimate the relative copy numbers of particular sequences in a sample, an approach called quantitative PCR (qPCR) or real-time PCR. If a DNA sequence is present in higher than usual amounts in a sample — for example, if certain genes are amplified in tumor cells — qPCR can reveal the increased representation of that sequence. In brief, the PCR is carried out in the presence of a probe that emits a fluorescent signal when the PCR product is present (Fig. 9-12b). If the sequence of interest is present at higher levels than other sequences in the sample, the PCR signal will reach a predetermined threshold faster. Reverse transcriptase PCR and qPCR can be combined to determine the relative concentrations of a particular mRNA molecule in a cell, and thereby monitor gene expression under different environmental conditions. DNA Libraries Are Specialized Catalogs of Genetic Information In some instances, it is useful to clone many genes or genomic segments rather than a particular one. A DNA library is a collection of DNA clones, usually gathered for purposes of gene discovery or the determination of gene or protein function. The library can take a variety of forms, depending on the source of the DNA and the ultimate purpose of the library. An example is a library that includes only the genes that are transcribed into RNA — expressed — in a given organism or even just in certain cells or tissues. Such a library lacks any genomic DNA that is not transcribed. The researcher first extracts mRNA from an organism, or from specific cells of an organism, and then prepares the complementary DNAs (cDNAs). Like RT-PCR, this multistep reaction (Figure 9-13a) relies on reverse transcriptase, which synthesizes DNA from a template RNA. The resulting double-stranded DNA fragments are inserted into a suitable vector and cloned, creating a population of clones called a cDNA library. 
If the library host is a bacterium like E. coli, each cell in the population will carry one particular cloned sequence. The library will encompass many millions of cells with millions of different cloned segments. The presence of a gene for a particular protein in such a library implies that this gene is expressed in the cells and under the conditions used to generate the library.
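The template-directed steps behind cDNA synthesis can be sketched in Python. The example below is a simplification under stated assumptions: the mRNA is a short made-up sequence, the function names are hypothetical, and priming, adapter ligation, and cloning into a vector are all ignored. It shows only that the first strand is the DNA complement of the mRNA and that the second strand restores the mRNA's sequence in DNA form.

```python
# Base-pairing rules: an RNA template read by reverse transcriptase, then a DNA template.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first cDNA strand (5'->3') complementary to an mRNA given 5'->3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna))


def second_strand_cdna(first_strand):
    """Return the second cDNA strand (5'->3') complementary to the first strand."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(first_strand))


# Hypothetical mRNA fragment ending in a short poly(A) stretch.
mrna = "AUGGCUAAAAAA"
first = first_strand_cdna(mrna)
second = second_strand_cdna(first)

print(first)   # TTTTTTAGCCAT
print(second)  # ATGGCTAAAAAA
```

The second strand matches the mRNA with T in place of U, which is why a cDNA clone carries the expressed sequence of the gene.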

FIGURE 9-13 Building a cDNA library from mRNA. A cell’s total mRNA content includes transcripts from thousands of genes, and the cDNAs generated from this mRNA are correspondingly heterogeneous. Reverse transcriptase can synthesize DNA on an RNA or a DNA template. To prime the synthesis of a second DNA strand, oligonucleotides of known sequence are ligated to the 3′ end of the first strand, and the double-stranded cDNA so produced is cloned into a plasmid. Another type of library, called a combinatorial gene library or simply a gene library, focuses on sequence variants within one gene. For example, beginning with the cloned gene of enzyme X, a segment of the gene could be replaced with nearly identical fragments synthesized with a slight imprecision so that each clone had one or two random base pair changes relative to the original. For example, the gene segment of interest could be amplified by PCR using an altered DNA polymerase that was slightly inaccurate. The library of clones would then consist of many cells, many of which harbored a different variant of the gene for enzyme X. Investigators could use the library to select for variants of enzyme X with enhanced catalytic properties or could simply determine which changes were functional and which were not. The possibilities are limited only by the imagination of the researcher. SUMMARY 9.1 Studying Genes and Their Products DNA cloning and genetic engineering involve the cleavage of DNA and assembly of DNA segments in new combinations — recombinant DNA. Cloning entails generating a DNA fragment of interest, inserting the fragment into a suitable cloning vector, transferring the vector with the DNA insert into a host cell for replication, and identifying and selecting cells that contain the DNA fragment. Key enzymes in gene cloning include restriction endonucleases (especially the type II enzymes) and DNA ligase. 
Cloning vectors include plasmids and, for the longest DNA inserts, bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs). Cloned genes can be expressed in a host cell by incorporating them into expression vectors that have the sequence signals needed for transcription and translation. Proteins can be expressed in different types of cells using expression systems with various useful features and advantages. Genetic engineering techniques can alter cloned genes as required by the investigator. Proteins or peptides can be attached to a protein of interest by altering its cloned gene, creating a fusion protein. The additional peptide segments can be used to detect the protein or to purify it, using convenient affinity chromatography methods. The polymerase chain reaction (PCR) permits the amplification of chosen segments of DNA or RNA for cloning and can be adapted to determine gene copy number or to monitor gene expression quantitatively. DNA libraries consist of many clones, encompassing many genomic segments or many variants of a particular gene. 9.2 Exploring Protein Function on the Scale of Cells or Whole Organisms Protein function can be described on three levels. Phenotypic function describes the effects of a protein on the entire organism. For example, loss of the protein may lead to slower growth of the organism, an altered development pattern, or even death. Cellular function is a description of the network of interactions a protein engages in at the cellular level. Identifying interactions with other proteins in the cell can help define the kinds of metabolic processes in which the protein participates. Finally, molecular function refers to the precise biochemical activity of a protein, including details such as the reactions that an enzyme catalyzes or the ligands that a receptor binds. 
In response to the challenge of understanding these functions of the thousands of proteins in a typical cell, scientists have developed a variety of techniques in the broader discipline of genomics. We can apply these techniques to determine when a particular protein is expressed, what other proteins it might be related to, where it is located in the cell, what other cellular components it interacts with, and what happens to the cell when the protein is missing. A variety of interrelated methods broadly probe a cell’s RNA or protein content. The entire complement of transcribed RNAs present at a given moment in a cell is defined as the cellular transcriptome. As introduced in Chapters 1 and 3, the entire complement of proteins present at a given moment in a cell is defined as that cell’s proteome. Studies of transcriptomes and studies of proteomes are referred to as transcriptomics and proteomics, respectively. Changes in these cellular macromolecules that occur when a particular gene or its expression is altered can provide important additional clues about protein function, as we will see. The methods we cover here are summarized in Table 9-4. The list is by no means comprehensive, but it serves to illustrate important approaches.

TABLE 9-4 Methods for Discovering New Proteins and Exploring Their Functions

Clue                                                                        Method to Apply
What is a protein’s function?
What other proteins of known function have similar sequences?               Comparative genomics
What known sequence motifs does the protein possess?                        Comparative genomics
Under what conditions is the gene encoding the protein expressed?           RNA-Seq
How much of the protein is present in the cell under different conditions?  Mass spectrometry
Where is the protein located in the cell?                                   Microscopy with fusion proteins and immunofluorescence
What does the protein interact with?                                        Immunoprecipitation; tandem affinity purification; yeast two-hybrid analysis
What happens to the cell when the protein is missing or altered?            CRISPR/Cas9 or other mutagenic methods
What genes (some unknown) are involved in a process?                        Large-scale screening

Sequence or Structural Relationships Can Suggest Protein Function

One important reason to sequence many genomes is to provide databases that can be used to assign gene functions by genome comparisons, an enterprise referred to as comparative genomics. A genome sequence is simply a very long string of A, G, T, and C residues, all meaningless until interpreted. Genome annotation yields information about the location and function of genes and other critical sequences. Genome annotation converts the sequence into information that any researcher can use, and it is typically focused on genomic DNA encompassing genes that encode RNA and protein, the most common targets of scientific investigation. Every newly sequenced genome includes many genes (often 40% or more of the total) about which little or nothing is known. Using online tools that apply computational power to comparative genomics, scientists can define gene locations and assign tentative gene functions (where possible) based on similarity to genes previously studied in other genomes. The classic BLAST (Basic Local Alignment Search Tool) algorithm allows a rapid search of all genome databases for sequences related to one that a researcher is exploring, and it is especially valuable for investigating the function of a particular gene. BLAST is one of many resources available at the NCBI (National Center for Biotechnology Information) site (www.ncbi.nlm.nih.gov), sponsored by the National Institutes of Health, and the Ensembl site (www.ensembl.org), cosponsored by the EMBL-EBI (European Molecular Biology Laboratory–European Bioinformatics Institute). Comparative genomics is made possible by evolutionary biology.
Sometimes a newly discovered gene is related by sequence homologies to a previously studied gene in another or the same species, and its function can be entirely or partly defined by that relationship. Genes that occur in different species but have a clear sequence and functional relationship to each other are called orthologs. Genes similarly related to each other within a single species are called paralogs. We introduced these terms in Chapter 3 in the context of proteins. As with proteins, information about the function of a gene in one species can be used to at least tentatively assign function to the orthologous gene found in a second species. The correlation is easiest to make when comparing genomes from relatively closely related species, such as mouse and human, although many clearly orthologous genes have been identified in species as distant as bacteria and humans. Sometimes even the order of genes on a chromosome is conserved over large segments of the genomes of closely related species (Fig. 9-14). Conserved gene order, called synteny, provides additional evidence for an orthologous relationship between genes at identical locations within the related segments. FIGURE 9-14 Synteny in the human and mouse genomes. Large segments of the two genomes have closely related genes aligned in the same order on the chromosomes. In these short segments of human chromosome 9 and mouse chromosome 2, the genes show a very high degree of homology, as well as the same gene order. The different lettering schemes for the gene names simply reflect the different naming conventions for the two species. [Information from T. G. Wolfsberg et al., Nature 409:824, 2001, Fig. 1.] Alternatively, certain amino acid sequences associated with particular structural motifs (Chapter 4) may be identified within a protein. 
The presence of a structural motif may help to define molecular function by suggesting that a protein, say, catalyzes ATP hydrolysis, binds to DNA, or forms a complex with zinc ions. These relationships are determined with the aid of sophisticated computer programs, limited only by the current information on gene and protein structure and by our capacity to associate sequences with particular structural motifs. Sequences at an enzyme active site that have been highly conserved during evolution are typically associated with catalytic function, and their identification is often a key step in defining an enzyme’s reaction mechanism. The reaction mechanism, in turn, provides information needed to develop new enzyme inhibitors that can be used as pharmaceutical agents. When and Where a Protein Is Present in a Cell Can Suggest Protein Function If a protein is involved in a reaction or process, it must be present at the location and at the moment that reaction or process occurs. This aspect of protein function can now be explored at multiple levels and with ever-increasing precision. RNA-Seq and Transcriptomics The RNAs that are transcribed from a genome under a given set of conditions can be determined using DNA sequencing methods described in Chapter 8. The approach is called RNA-Seq (Fig. 9-15). RNA is first isolated from a tissue or a population of cells. The RNA is fragmented and converted to double-stranded DNA using reverse transcriptase (see Fig. 9-12a). This DNA is then subjected to deep DNA sequencing, which reveals both the RNAs that are present and the relative abundance of each (if more copies of one RNA are present, they will give rise to more DNA sequencing reads). This method is sensitive enough to apply to single cells, an approach called single cell RNA-Seq, or scRNA-Seq. It allows investigators to catalog the RNAs being transcribed in different parts of a tissue. FIGURE 9-15 RNA-Seq. 
To define a cellular transcriptome, the first step is to isolate cellular RNA. Because many RNAs, particularly mRNAs and rRNAs, are quite long, the RNA is then fragmented to an average size commensurate with the DNA sequencing platform to be used. The RNA is converted to DNA using reverse transcriptase. Hexameric DNA oligonucleotides of random sequence are used to prime the reverse transcriptase if all RNA is to be included in the transcriptome. RNA-DNA hybrids are more stable than DNA-DNA hybrids, so hexameric duplexes are sufficient for this task. If the transcriptome is focused on expression of protein-coding genes, beads coated with poly(dT) can be used to hybridize to the poly(A) tails of eukaryotic mRNAs, allowing their precipitation and enrichment relative to other RNAs. After reverse transcription, the DNA fragments are ligated to duplex adapters that provide a universal priming location for the DNA sequencing as well as sequences that allow annealing to anchors on the sequencing flow cell (see Fig. 8-36). This is followed by deep DNA sequencing and data analysis. The transcriptional state of a human cell or tissue can be diagnostic of conditions ranging from diabetes to cancer. If a particular gene under study is expressed in a certain tissue or under particular metabolic conditions, the result provides a new functional clue. Detailed knowledge about what genes are expressed in a given tumor may eventually help guide treatment options. RNA-Seq can reveal patterns of gene regulation and expression. It has special importance in studying cancerous tumors, in which rapid evolution triggered by genome instability creates a range of cell types. The mRNAs that are present in tumor cells provide a clue to the proteins that may be present, although not all mRNAs are immediately translated into protein. RNA-Seq also reveals the presence of many types of noncoding RNAs (described in Chapter 26) that are now being defined. 
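The link between sequencing reads and relative RNA abundance described above can be sketched in a few lines of Python. This is only an illustration: the gene names, read counts, and transcript lengths are invented, and the length correction shown (transcripts per million, TPM) is one common normalization convention that the text itself does not specify.

```python
def transcripts_per_million(read_counts, lengths_kb):
    """Convert raw read counts to TPM (a hypothetical helper for illustration).

    read_counts: dict mapping gene name -> number of reads assigned to it
    lengths_kb:  dict mapping gene name -> transcript length in kilobases
    """
    # Longer RNAs are fragmented into more pieces, so divide by length first.
    reads_per_kb = {g: read_counts[g] / lengths_kb[g] for g in read_counts}
    total = sum(reads_per_kb.values())
    if total == 0:
        raise ValueError("no reads were assigned to any gene")
    # Scale so that the values for all genes sum to one million.
    return {g: 1e6 * v / total for g, v in reads_per_kb.items()}


# Invented example data: geneA yields three times the reads of geneB,
# but it is also three times as long.
counts = {"geneA": 900, "geneB": 300, "geneC": 0}
lengths = {"geneA": 3.0, "geneB": 1.0, "geneC": 2.0}
tpm = transcripts_per_million(counts, lengths)
```

In this toy case geneA and geneB come out equally abundant once length is taken into account, even though geneA produced more reads, and geneC is reported as not detected.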
Cellular Proteomes and Mass Spectrometry A more direct way to establish protein presence or absence is to assess the cellular proteome. Mass spectrometry (Chapter 3) can accurately catalog and quantify the thousands of proteins present in a typical cell. This approach is a complement to RNA-Seq, as it provides a comprehensive list of the genes that are both transcribed and translated into protein. Mass spectrometry also provides information about how those proteins are modified, in turn allowing an assessment of their regulatory state. Fusion Proteins and Immunofluorescence Often, an important clue to the function of a gene product comes from determining its location within the cell. For example, a protein found exclusively in the nucleus could be involved in processes that are unique to that organelle, such as transcription, replication, or chromatin condensation. Researchers often engineer fusion proteins for the purpose of locating a protein in the cell or organism. Some of the most useful fusions are the attachment of marker proteins that signal the location by direct visualization or by immunofluorescence. A particularly useful marker is the green fluorescent protein (GFP) (Fig. 9-16), discovered by Osamu Shimomura. As subsequently shown by Martin Chalfie, a target gene (encoding the protein of interest) fused to the GFP gene generates a fusion protein that is highly fluorescent (it literally lights up when exposed to blue light) and can be visualized directly in a living cell. GFP is a protein derived from the jellyfish Aequorea victoria (Fig. 9-16a). The protein has a β-barrel structure with a fluorophore (the fluorescent component of the protein) in the center (Fig. 9-16b). The fluorophore is derived from a rearrangement and oxidation of three amino acid residues (Fig. 9-16c). Because this reaction is autocatalytic and requires no proteins or cofactors other than molecular oxygen, GFP is readily cloned in an active form in almost any cell. 
Just a few molecules of this protein can be observed microscopically, allowing the study of its location and movements in a cell. FIGURE 9-16 Green fluorescent protein (GFP). (a) GFP is derived from the jellyfish Aequorea victoria. (b) The protein has a β-barrel structure; the fluorophore is in the center of the barrel. (c) The fluorophore in GFP is derived from a sequence of three amino acids: –Ser65–Tyr66–Gly67–. The fluorophore achieves its mature form through an internal rearrangement, coupled to a multistep oxidation reaction. An abbreviated mechanism is shown here. (d) Variants of GFP are now available in almost any color of the visible spectrum. (e) A GLR1-GFP fusion protein fluoresces bright green in C. elegans, a nematode worm (left). GLR1 is a glutamate receptor of nervous tissue. (In this photograph, autofluorescing fat droplets are false-colored in magenta.) The membranes of E. coli cells (right) are stained with a red fluorescent dye. The cells are expressing a protein that binds to a resident plasmid, fused to GFP. The green spots indicate the locations of plasmids. [(a) Chris Parks/ImageQuest Marine. (b) Data from PDB ID 1GFL, F. Yang et al., Nature Biotechnol. 14:1246, 1996. (c) Information from Roger Tsien, University of California, San Francisco, Department of Pharmacology, and Paul Steinbach. (d) Courtesy of Roger Tsien and Paul Steinbach, University of California, San Diego, Department of Pharmacology. (e) (left) Courtesy Penelope J. Brockie and Andres V. Maricq, Department of Biology, University of Utah; (right) Courtesy Joseph A. Pogliano, from J. Pogliano et al. (2001), Multicopy plasmids are clustered and localized in Escherichia coli, Proc. Natl. Acad. Sci. USA 98:4486–4491.] Careful protein engineering by Roger Tsien, coupled with the isolation of related fluorescent proteins from other marine coelenterates, has made variants of these proteins available in an array of colors (Fig. 9-16d) and other characteristics (brightness, stability). 
If fusion to GFP does not impair the function or properties of a protein one wishes to study, the fusion protein can be used to reveal the protein’s location in the cell under a range of conditions and to detect interactions with other labeled proteins. With this technology, for example, the protein GLR1 (a glutamate receptor of nervous tissue) has been visualized as a GLR1-GFP fusion protein in the nematode Caenorhabditis elegans (Fig. 9-16e). In some cases, the GFP fusion protein may be inactive or may not be expressed at sufficient levels to allow visualization. Immunofluorescence is an alternative approach for visualizing the endogenous (unaltered) protein. This approach requires fixation (and thus death) of the cell. The protein of interest is sometimes expressed as a fusion protein with an epitope tag, a short protein sequence that is bound tightly by a well-characterized, commercially available antibody. Fluorescent molecules (fluorochromes) are attached to this antibody. More commonly, the target protein is unaltered and is bound by an antibody that is specific for the protein. Next, a second antibody is added that binds specifically to the first one, and it is the second antibody that has the attached fluorochrome(s) (Fig. 9-17). A variation of this indirect approach to visualization is to attach biotin molecules to the first antibody, then add streptavidin (a bacterial protein closely related to avidin, a protein that binds biotin; see Table 5-1) complexed with fluorochromes. The interaction between biotin and streptavidin is one of the strongest and most specific known, and the potential to add multiple fluorochromes to each target protein gives this method great sensitivity. In all of these cases, the end product is a microscopic view of a cell in which a spot of light (a focus) reveals the location of the protein. FIGURE 9-17 Indirect immunofluorescence. 
(a) The protein of interest is bound to a primary antibody, and a secondary antibody is added; this second antibody, with one or more attached fluorescent groups, binds to the first. Multiple secondary antibodies can bind the primary antibody, amplifying the signal. If the protein of interest is in the interior of the cell, the cell is fixed and permeabilized, and the two antibodies are added in succession. (b) The end result is an image in which bright spots indicate the location of the protein or proteins of interest in the cell. The images here show a nucleus from a human fibroblast, successively stained with antibodies and fluorescent labels for DNA polymerase ɛ; for PCNA, an important polymerase accessory protein; and for bromo-deoxyuridine (BrdU), a nucleotide analog. The BrdU, added as a brief pulse, identifies regions undergoing active DNA replication. The patterns of staining show that DNA polymerase ɛ and PCNA co-localize to regions of active DNA synthesis (rightmost image); one such region is visible in the white box. [(b) Fuss, J. and Linn, S., 2002, “Human DNA Polymerase ε Colocalizes with Proliferating Cell Nuclear Antigen and DNA Replication Late, but Not Early, in S Phase,” J. Biol. Chem. 277:8658–8666. Courtesy Jill Fuss, University of California, Berkeley.] Knowing What a Protein Interacts with Can Suggest Its Function Another key to defining the function of a particular protein is to determine its biochemical playmates. In the case of protein-protein interactions, the association of a protein of unknown function with one whose function is known can compellingly imply a functional relationship. The techniques used in this effort are quite varied. Purification of Protein Complexes By fusing the gene encoding a protein under study with the gene for an epitope tag, investigators can precipitate the protein product of the fusion gene by complexing it with the antibody that binds the epitope. This process is called immunoprecipitation (Fig. 9-18). 
If the tagged protein is expressed in cells, other proteins that bind to it precipitate with it. Identifying the associated proteins reveals some of the intracellular protein-protein interactions of the tagged protein. There are many variations of this process. For example, a crude extract of cells that express a tagged protein is added to a column containing immobilized antibody (see Fig. 3-17c for a description of affinity chromatography). The tagged protein binds to the antibody, and proteins that interact with the tagged protein are sometimes also retained on the column. The connection between the protein and the tag is cleaved with a specific protease. The protein complexes are eluted from the column, and the proteins in them are identified by mass spectrometry. Researchers can use these methods to define complex networks of interactions within a cell. In principle, the chromatographic approach to analyzing protein-protein interactions can be used with any type of protein tag (His tag, GST, etc.) that can be immobilized on a suitable chromatographic medium. FIGURE 9-18 The use of epitope tags to study protein-protein interactions. The gene of interest is cloned next to a gene for an epitope tag, and the resulting fusion protein is precipitated by antibodies to the epitope. Any other proteins that interact with the tagged protein also precipitate, thereby helping to elucidate protein-protein interactions. The selectivity of this approach can be enhanced with tandem affinity purification (TAP) tags. Two consecutive tags are fused to a target protein, and the fusion protein is expressed in a cell (Fig. 9-19). The first tag is protein A, a protein found at the surface of the bacterium Staphylococcus aureus that binds tightly to mammalian immunoglobulin G (IgG). The second tag is often a calmodulin-binding peptide. A crude extract containing the TAP-tagged fusion protein is passed through a column matrix with attached IgG antibodies that bind protein A. 
Most of the unbound cellular proteins are washed through the column, but proteins that normally interact with the target protein in the cell are retained. The first tag is then cleaved from the fusion protein with a highly specific protease, TEV protease, and the shortened fusion target protein and any proteins associated noncovalently with the target protein are eluted from the column. The eluate is then passed through a second column containing a matrix with attached calmodulin that binds the second tag. Loosely bound proteins are again washed from the column. After the second tag is cleaved, the target protein is eluted from the column with its associated proteins. The two consecutive purification steps eliminate most weakly bound contaminants. False positives are minimized, and protein interactions that persist through both steps are likely to be functionally significant.

FIGURE 9-19 Tandem affinity purification (TAP) tags. A TAP-tagged protein and associated proteins are isolated by two consecutive affinity purifications, as described in the text. Yeast Two-Hybrid Analysis A sophisticated genetic approach to defining protein-protein interactions is based on the properties of the Gal4 protein (Gal4p; see Fig. 28-32), which activates the transcription of GAL genes (encoding the enzymes of galactose metabolism) in yeast. Gal4p has two domains: one that binds a specific DNA sequence, and another that activates RNA polymerase to synthesize mRNA from an adjacent gene. The two domains of Gal4p are stable when separated, but activation of RNA polymerase requires interaction with the activation domain, which in turn requires positioning by the DNA-binding domain. Hence, the domains must be brought together to function correctly. In yeast two-hybrid analysis, the protein-coding regions of the genes to be analyzed are fused to the yeast gene for either the DNA-binding domain or the activation domain of Gal4p, and the resulting genes express a series of fusion proteins (Fig. 9-20). If a protein fused to the DNA-binding domain interacts with a protein fused to the activation domain, transcription is activated. The reporter gene transcribed by this activation is generally one that yields a protein required for growth or an enzyme that catalyzes a reaction with a colored product. Thus, when grown on the proper medium, cells that contain a pair of interacting proteins are easily distinguished from those that do not.

FIGURE 9-20 Yeast two-hybrid analysis. (a) The goal is to bring together the DNA-binding domain and the activation domain of the yeast Gal4 protein (Gal4p) through the interaction of two proteins, X and Y, to which one or other of the domains is fused. This interaction is accompanied by the expression of a reporter gene. (b) The two gene fusions are created in separate yeast strains, which are then mated. The mated mixture is plated on a medium on which the yeast cannot survive unless the reporter gene is expressed. Thus, all surviving colonies have interacting fusion proteins. Sequencing of the fusion proteins in the survivors reveals which proteins are interacting. A library can be set up with a particular yeast strain in which each cell in the library has a gene fused to the Gal4p DNA-binding domain gene, and many such genes are represented in the library. In a second yeast strain, a gene of interest is fused to the gene for the Gal4p activation domain. The yeast strains are mated, and individual diploid cells are grown into colonies. The only cells that grow on the selective medium, or that produce the appropriate color, are those in which the gene of interest is binding to a partner, allowing transcription of the reporter gene. This allows large-scale screening for cellular proteins that interact with the target protein. The interacting protein that is fused to the Gal4p DNA-binding domain present in a particular selected colony can be quickly identified by DNA sequencing of the fusion protein’s gene. Some false positive results occur, due to the formation of multiprotein complexes. The Effect of Deleting or Altering a Protein Can Suggest Its Function

One of the most informative paths to understanding the function of a gene is to change (mutate) the gene or delete it. An investigator can then examine how the genomic alteration affects cell growth or function. The methods available to modify genomes grow more sophisticated every year. The most common approach is to cut the gene of interest at a site that is functionally critical, generating a double-strand break. In eukaryotes, such breaks are most commonly repaired by cellular systems that promote nonhomologous end joining (NHEJ), a process described in Chapter 25. NHEJ seals the double-strand break, but the process is imprecise. Nucleotides are often deleted or added during the repair, inactivating the gene. In bacteria, introduced double-strand breaks are usually repaired more accurately, by homologous recombination systems (Chapter 25), but inactivating mutations can appear. Many traditional approaches to targeting a gene in this way were supplanted by the advent of CRISPR/Cas systems in 2011. CRISPR/Cas Systems “CRISPR” stands for clustered, regularly interspaced short palindromic repeats; as the name suggests, these consist of a series of regularly spaced short repeats in the bacterial genome. A Cas (CRISPR-associated) protein is a nuclease. The CRISPR sequences and Cas protein are components of a kind of immune system that evolved to allow bacteria to survive infection by bacteriophages. CRISPR sequences are embedded in the bacterial genome, surrounding sequences derived from phage pathogens that previously infected the bacterium without killing it. The viral sequences are, in effect, spacer sequences separating the CRISPR sequences. When the same bacteriophage again attacks a bacterium that has the corresponding CRISPR/Cas system, the CRISPR sequence and Cas protein act together to destroy the viral DNA. 
First, the CRISPR sequences are transcribed to RNA, and individual viral spacer sequences are cleaved to form products called guide RNAs (gRNAs), which include some adjacent repeat RNA. A gRNA forms a complex with one or more Cas proteins and, in some cases, with another RNA called a trans-activating CRISPR RNA, or tracrRNA. The resulting complex binds specifically to the invading bacteriophage DNA, cleaving and destroying it through the nuclease activities associated with the Cas proteins. The current technology was made possible by discovery of a relatively simple CRISPR/Cas system in Streptococcus pyogenes. This system requires only a single Cas protein, Cas9, to cleave DNA. Work in many laboratories, particularly those of Jennifer Doudna and Emmanuelle Charpentier, has produced a streamlined CRISPR/Cas9 system composed of just one protein (Cas9) and one associated RNA, consisting of gRNA and tracrRNA fused into a single guide RNA (sgRNA).
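How a guide sequence is matched to a genomic site can be sketched in Python. The example rests on two points stated elsewhere in this chapter or assumed here: the targeting sequence is taken to be 20 nucleotides long (as stated later in this chapter), and candidate sites are required to sit immediately 5′ of an NGG motif, the protospacer-adjacent motif recognized by S. pyogenes Cas9, which the text above does not discuss. The function names and the DNA sequence are hypothetical, and only the strand given is scanned.

```python
import re


def find_candidate_targets(dna, guide_len=20):
    """List (start, target, pam) for each site followed by an NGG motif.

    Only the strand supplied is scanned; a fuller search would also scan
    the reverse complement. Hypothetical helper for illustration.
    """
    dna = dna.upper()
    # A lookahead is used so that overlapping candidate sites are all found.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(dna)]


def guide_rna(target):
    """Guide RNA carrying the same sequence as the DNA target (T becomes U)."""
    return target.replace("T", "U")


# Invented sequence: a 20-nt site followed by TGG and some flanking bases.
sequence = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
hits = find_candidate_targets(sequence)
guides = [guide_rna(target) for _, target, _ in hits]
```

The guide has the same sequence as the listed DNA strand, so it base-pairs with the opposite strand of the target. Changing only these 20 positions in the sgRNA is what redirects Cas9 to a different site.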

The power of the system is embedded in this sgRNA, in which the guide sequence can be altered to specifically and efficiently target almost any genomic sequence (Fig. 9-21). Cas9 has two separate nuclease domains: one domain cleaves the DNA strand paired with the sgRNA, and the other cleaves the opposite DNA strand. Inactivating one domain creates an enzyme that cleaves just one strand, forming a single-strand break, or nick. The sgRNA is needed both to pair with the target sequence in the DNA and to activate the nuclease domains for cleavage. FIGURE 9-21 The CRISPR/Cas9 system for genomic engineering. (a) The genes encoding the Cas9 protein and sgRNA are introduced into a cell in which a targeted genomic change is planned. The sgRNA has a region complementary to the chosen genomic target sequence (purple); this region can be engineered to include any desired sequence. A complex consisting of the CRISPR sgRNA and the Cas9 protein forms within the cell and binds to the chosen target site in the DNA. The structure of the bound complex is shown in (b). In the pathway shown on the left in (a), two nuclease active sites in the Cas9 protein separately cleave each DNA strand in the target, producing a double-strand break. The double-strand break is usually repaired by nonhomologous end joining, which generally deletes or alters the nucleotides at the site where joining occurs. Alternatively, as shown in the pathway on the right, if one nuclease site is inactivated, Cas9 nuclease activity creates a single-strand break in the target sequence. In the presence of a recombination donor DNA fragment, identical to the target sequence but incorporating the desired sequence change (fragment shown in red), homologous DNA recombination will sometimes change the sequence at the site of the break to match that of the donor DNA. [Data from PDB ID 4UN3, C. Anders et al., Nature 513:569, 2014.] 
Plasmids expressing the required protein and RNA components of CRISPR/Cas9 can be introduced into microbial cells by electroporation (p. 306). For mammalian cells, the genes encoding the CRISPR/Cas9 components can be incorporated into engineered viruses that subsequently deliver them to the cell nuclei. For many organisms, the targeted gene is inactivated in a high percentage of the treated cells. If a genomic change (mutation) rather than a simple gene inactivation is required, it can be introduced by recombination when a DNA fragment encompassing the cleavage site and including the desired change enters the cell with the CRISPR/Cas9 plasmids. This recombination is often inefficient, but success can be improved somewhat by introducing a nick rather than a double-strand break at the target site (Fig. 9-21). CRISPR/Cas9 can be combined with other approaches to extract additional information. For example, a particular gene can be inactivated with CRISPR/Cas9. Then, the effect of that gene inactivation on the transcription of other genes can be probed with RNA-Seq at the level of tissues, cell populations, or single cells. New applications for CRISPR/Cas9 are being developed rapidly, both for basic research and for medicine. Genetic screens based on CRISPR are described in the next section. CRISPR is being used to enhance food production, provide new approaches to combat bacterial infections, and eliminate nonnative pest species that can harbor diseases (Box 9-1). New CRISPR-based treatments for genetic diseases are being cautiously advanced to clinical trials for vision loss due to inherited retinal dystrophies, Duchenne muscular dystrophy, β-thalassaemia, and many other conditions. Uncertainties remain, particularly the potential for occasional cleavage at unintended chromosomal sites (off-target cleavage). The impact of CRISPR/Cas9 will continue to grow as problems are overcome, current applications mature, and new applications are imagined and created. 
BOX 9-1 Getting Rid of Pests with Gene Drives Invasive introduced plant and animal species can wreak havoc on any natural environment, and can also spread human disease. Mosquitoes that harbor Zika virus and other diseases in many parts of the world, introduced rats on almost every continent, cane toads and rabbits in Australia, and the kudzu vine in the southern United States represent just a few examples of invasive species that cause human misery and annual financial damage totaling billions of dollars. Traditional methods of control such as poisoning or trapping are often unsuccessful and can have detrimental effects on native species that become unintended targets. The discovery of selfish DNA elements such as homing endonucleases and transposons that can spread through a population gave rise to the concept of gene drives as a new approach to the control of invasive species. The most recent and promising iteration of this idea involves synthetic gene drives based on CRISPR/Cas9. The overall idea is to set up a system that skews the male to female ratio in a target species far away from the favored 1:1, resulting in population collapse. A strategy called X shredder, already proven in the laboratory with mosquitoes, is highlighted in Figure 1. A cassette that includes genes expressing Cas9, as well as several sgRNAs targeted to different unique sites on the X chromosome, is inserted into an intergenic region on the Y chromosome. The cassette is engineered into males, and is controlled by a gene regulatory system that is expressed only during spermatogenesis. Thus, during spermatogenesis, the cassette is expressed so that the X chromosome is cleaved at multiple locations, basically destroying it. This ensures that the only viable sperm have Y chromosomes. All the offspring of any cross with a female are males, and all of them possess Y chromosomes containing the X-shredder cassette. When those male offspring mate with other females later, the same result ensues. 
As these males mate and spread the cassette through the population, a dearth of females occurs and the population collapses. In principle, this same strategy could be applied to rats, cane toads, and many other invasive species. FIGURE 1 The X-shredder gene drive concept. To date, gene drives have been restricted to the laboratory. The potential for a gene drive escaping to species that are not intended targets is not yet clear. Resistance in the target species could occur by mutating the sgRNA target sites, although the use of multiple sites makes this less likely. The gene drive approach is a good illustration of the power and potential of CRISPR. However, once males with a gene drive are released, it would be essentially impossible to call a halt to the effects. Nature has a way of imposing consequences, both unintended and unexpected. The potential positive effects on health and agriculture continue to drive research to improve the technology and address potential problems. Many Proteins Are Still Undiscovered For most biological processes, ranging from intermediary metabolism to neurological function to DNA metabolism, the list of known participating enzymes and proteins is far from complete. Genetic screening for new gene functions has been underway for many decades. The goal is to efficiently interrogate large numbers of genes, sometimes the entire genome, for genes that affect a particular cellular reaction or process. A gene perturbation — a treatment that inactivates a gene or activates its expression — is introduced under conditions in which just one gene is affected in each cell, but most or all of the genes are affected in one or more cells within the population (Fig. 9-22). The population is then subjected to a stress or selection. Cells in which a gene required to respond to the selection is altered may drop out of the population or be enriched in the population, depending upon the goals and design of the screen. 
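The requirement that just one gene be affected in each cell can be made quantitative with a short Python sketch. It assumes that perturbations (for example, viral vectors) are delivered to cells independently, so that the number received per cell follows a Poisson distribution whose mean is the multiplicity of infection (MOI) mentioned in the legend to Fig. 9-23. The Poisson model and the MOI values are assumptions made for illustration, not figures from the text.

```python
import math


def poisson_pmf(k, mean):
    """Probability that a cell receives exactly k perturbations."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def single_hit_fraction(moi):
    """Among cells with at least one perturbation, the fraction with exactly one.

    Hypothetical helper; assumes Poisson-distributed delivery with mean `moi`.
    """
    if moi <= 0:
        raise ValueError("moi must be positive")
    p_none = poisson_pmf(0, moi)
    p_one = poisson_pmf(1, moi)
    return p_one / (1.0 - p_none)


# Compare a low, an intermediate, and a high (all invented) MOI.
fractions = {moi: single_hit_fraction(moi) for moi in (0.1, 0.3, 1.0)}
```

Under this model, lowering the MOI raises the share of perturbed cells that carry a single perturbation, at the cost of leaving more cells unperturbed, which is why screens start with large cell populations.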
FIGURE 9-22 High-throughput genetic screening. A gene perturbation, either inactivation or activation, is introduced to a population of cells such that only one gene in each cell is affected. However, all or most genes are affected in one cell or another within the population. The population is then subjected to a selection that requires a response from some cellular genes. Cells that lack the required genes or that have those genes activated will drop out or be enriched within the population, respectively. In the example shown, two cells drop out. CRISPR-based technologies increasingly play a central role in large-scale screening protocols (Fig. 9-23). Libraries of sgRNAs have been generated to target virtually all genes in a mammalian genome, or specialized subsets of them. The targeting sequence in each sgRNA is 20 bp long. In addition to targeting a particular gene, each targeting sequence acts as a kind of unique bar code identifier that is readily recognized by computer programs after sequencing. The sgRNAs are packaged in a DNA cassette set up to also express the Cas9 protein or a Cas9 variant. The cassettes are incorporated into carefully engineered lentiviral vectors derived from HIV (with genes required for HIV multiplication eliminated). The viral vectors deliver the cassette to the nucleus as a single-stranded RNA, convert it to double-stranded DNA with the viral-encoded reverse transcriptase, and integrate the DNA into a chromosome. The CRISPR/Cas9 components are expressed to perturb the target gene specified by the particular sgRNA delivered to that cell. The effect produced depends upon the Cas9 variant used. The unmodified Cas9 nuclease will create a double-strand break that inactivates the gene. A modified Cas9 that lacks the nuclease activity will simply bind to its target and block transcription. Cas9 fused to a protein transcription inhibitor or activator may more effectively block or activate transcription, respectively (Fig. 9-23b). 
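The use of each 20-nucleotide targeting sequence as a bar code can be illustrated with a minimal Python sketch. It tallies guide sequences in sequencing reads taken before and after a selection, then reports a log2 fold change for each guide. The library, gene names, and reads are invented; the sketch assumes that every read begins with the guide sequence and that only exact matches are counted; and the pseudocount normalization is one simple choice among many rather than a method given in the text.

```python
import math


def count_guides(reads, library, guide_len=20):
    """Count reads whose leading guide_len bases exactly match a library guide."""
    counts = {guide: 0 for guide in library}
    for read in reads:
        guide = read[:guide_len]
        if guide in counts:
            counts[guide] += 1
    return counts


def log2_fold_change(before, after, pseudocount=1.0):
    """log2 of (frequency after selection / frequency before), per guide."""
    n = len(before)
    total_before = sum(before.values()) + pseudocount * n
    total_after = sum(after.values()) + pseudocount * n
    changes = {}
    for guide in before:
        f_before = (before[guide] + pseudocount) / total_before
        f_after = (after[guide] + pseudocount) / total_after
        changes[guide] = math.log2(f_after / f_before)
    return changes


# Hypothetical two-guide library mapping guide sequence -> targeted gene.
library = {"A" * 20: "geneX", "C" * 20: "geneY"}
reads_before = ["A" * 20 + "GTTT"] * 10 + ["C" * 20 + "GTTT"] * 10
reads_after = ["A" * 20 + "GTTT"] * 2 + ["C" * 20 + "GTTT"] * 18
lfc = log2_fold_change(count_guides(reads_before, library),
                       count_guides(reads_after, library))
```

A negative value (here for the guide against the invented geneX) marks cells that dropped out during selection, and a positive value marks cells that were enriched.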
FIGURE 9-23 Use of CRISPR/Cas9 in high-throughput screening. CRISPR/Cas9 provides the gene perturbation in many screening protocols. (a) In a typical screen, a library of sgRNAs is constructed so as to target all known genes in a genome of interest. These are cloned into viral vectors. The vectors infect cells at a multiplicity of infection (MOI) small enough so that most cells will gain only one vector. The vector RNA is converted to DNA and integrated into the genome. When expressed, it will affect one target gene, with most genes affected in one or more cells in the population. After selection, some cells drop out or are enriched in the population, depending on the nature of the screen. (b) Several variations of Cas9 are shown to illustrate a few of the ways genes can be affected. Unaltered Cas9 will cleave the DNA at the target site. (c) If engineered to lack nuclease activity, and fused with a gene repressor or activator, the modified Cas9 will bind to the target site and either decrease or increase gene transcription, respectively. Whatever strategy is used, a different gene is affected in each cell. Once the population has been treated with a stress or selection, cells in which genes required to survive the treatment are inactivated or activated by the CRISPR/Cas9 variant will die or thrive. The decreased or increased presence of the relevant bar code sequences can be detected by deep DNA sequencing, using a universal priming sequence incorporated into the cassette near the sgRNA sequence. The strategies outlined here only hint at the variety of protocols in use, limited only by the imagination of the investigators. SUMMARY 9.2 Exploring Protein Function on the Scale of Cells or Whole Organisms Proteins can be studied at the level of phenotypic, cellular, or molecular function. Comparative genomics can elucidate protein function by identifying structural motifs within the encoded protein and comparing gene sequences from different organisms. 
A determination of when and where a protein appears in a cell can offer functional clues. RNA-Seq provides information on what genes are being expressed in a cell. Mass spectrometry can define cellular proteomes. By fusing a gene of interest with genes that encode green fluorescent protein or epitope tags, researchers can visualize the cellular location of the gene product, either directly or by immunofluorescence. The interactions of a protein with other proteins or RNA can be investigated with epitope tags and immunoprecipitation or affinity chromatography. Yeast two-hybrid analysis probes molecular interactions in vivo. The cellular effects of inactivating a gene can be conveniently explored using the CRISPR/Cas9 programmable nuclease. CRISPR/Cas9 can also be used to alter gene sequences in a targeted manner. Screens for new genes increasingly employ variants of the CRISPR/Cas9 system.
9.3 Genomics and the Human Story
Since the report of the first complete human genomes in 2001, human genome sequencing has become routine. The genomes of tens of thousands of other species have now been sequenced and made publicly available, providing a look at genomic complexity throughout the three domains of living organisms: Bacteria, Archaea, and Eukarya. Whereas many early sequencing projects focused on species commonly used in research laboratories, the projects now include species of practical, medical, agricultural, and evolutionary interest. Genomes from every known bacterial family have been sequenced. Completed eukaryotic genome sequences number in the tens of thousands. Genomes of extinct species such as Homo neanderthalensis and of humans who died in past millennia have also been sequenced. Personal genomes are playing an ever-increasing role in medicine. Each genome sequence becomes an international resource for researchers.
Collectively, the sequences provide a source for broad comparisons that help pinpoint both variable and highly conserved gene segments, and they allow the identification of genes that are unique to a species or group of species. Efforts to map genes, identify new proteins and disease-related genes, elucidate genetic patterns of medical interest, and trace our evolutionary history are among the many initiatives under way.
The Human Genome Contains Many Types of Sequences
The rapidly growing genome databases have the potential not only to fuel advances in all realms of biochemistry but also to change the way we think about ourselves. What does our own genome, and comparisons with those of other organisms, tell us? In some ways, we are not as complicated as we once imagined. Humans have only about 20,000 protein-coding genes, less than twice the number in a fruit fly (13,600 genes), not many more than in a nematode worm (19,700 genes), and fewer than in a rice plant (38,000 genes). In other ways, we are more complex than we previously realized. Many, if not most, eukaryotic genes contain one or more segments of DNA that do not code for the amino acid sequence of a polypeptide product. These nontranslated segments interrupt the otherwise colinear relationship between the gene’s nucleotide sequence and the amino acid sequence of the encoded polypeptide. Such nontranslated DNA segments are called introns, and the coding segments are called exons (Fig. 9-24). Few bacterial genes contain introns. The introns are spliced from a precursor RNA transcript to generate a transcript that can be translated contiguously into a protein product (see Chapter 26). An exon often (but not always) encodes a single domain of a larger, multidomain protein. Humans share many protein domain types with plants, worms, and flies, but the domains in the human genome are mixed and matched in more complex ways, increasing the variety of proteins found in our proteome.
Alternative modes of gene expression and RNA splicing permit alternative combinations of exons, leading to the production of more than one protein from a single gene. Alternative splicing (Chapter 26) is far more common in humans and other vertebrates than in worms or bacteria, allowing greater complexity in the number and kinds of proteins generated.
FIGURE 9-24 Introns and exons. This gene transcript contains five exons and four introns, along with 5′ and 3′ untranslated regions (5′UTR and 3′UTR). Splicing removes the introns to create an mRNA product for translation into protein.
In mammals and some other eukaryotes, the typical gene has a much higher proportion of intron DNA than exon DNA; in most cases, the function of introns is not clear. Less than 1.5% of human DNA is “protein-coding” or exon DNA, carrying information for protein products (Fig. 9-25a). However, when introns are included in the accounting, as much as 30% of the human genome consists of genes that encode proteins. Several efforts are under way to categorize protein-coding genes by type of function (Fig. 9-25b).
FIGURE 9-25 A snapshot of the human genome. (a) This pie chart shows the proportions of various types of sequences in our genome. The classes of transposons that represent nearly half of the total genomic DNA are indicated in shades of gray. LTR retrotransposons are retrotransposons with long terminal repeats (see Fig. 26-33). Long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs) are special classes of particularly common DNA transposons. (b) The approximately 20,000 protein-coding genes in the human genome can be classified by the type of protein encoded. [Information from (a) T. R. Gregory, Nature Rev. Genet. 6:699, 2005; (b) www.pantherdb.org.]
The relative paucity of protein-coding genes in the human genome leaves a lot of DNA unaccounted for.
Much of the DNA that does not encode proteins (exons or introns) is in the form of repeated sequences of several kinds. Perhaps most surprising is that about half the human genome is made up of moderately repeated sequences that are derived from transposons, segments of DNA, ranging from a few hundred to several thousand base pairs long, that can move from one location to another in the genome. Originally discovered in corn by Barbara McClintock, who called them transposable elements, transposons are a kind of molecular parasite. They make their home in the genomes of essentially every organism. Many transposons contain genes encoding the proteins that catalyze the transposition process itself, as described in more detail in Chapters 25 and 26. There are several classes of transposons in the human genome. Many are strictly DNA segments, which have slowly increased in number over the millennia as a result of replication events coupled to the transposition process. Some, called retrotransposons, are closely related to retroviruses, transposing from one genomic location to another through RNA intermediates that are reconverted to DNA by reverse transcription. Some transposons in the human genome are active elements, moving at a low frequency, but most are inactive, evolutionary relics altered by mutations. Transposon movement can lead to the redistribution of other genomic sequences, and this has played a major role in human evolution. Once the protein-coding genes (including exons and introns) and transposons are accounted for, perhaps 25% of the total DNA remains. As a follow-up to the Human Genome Project, the ENCODE initiative was launched by the National Human Genome Research Institute in 2003 to identify functional elements in the human genome. 
The work of the worldwide consortium of research groups engaged in the ENCODE initiative has revealed that the vast majority (>80%, including protein-coding genes, most transposons, and more) of the DNA in the human genome is either transcribed into RNA in at least one type of cell or tissue or is involved in some functional aspect of chromatin structure. Much of the noncoding (nontranscribed) DNA in the remaining 20% contains regulatory elements that affect the expression of the 20,000 protein-coding genes and the many additional genes encoding functional RNAs. Many mutations (SNPs; described below) associated with human genetic diseases lie in this noncoding DNA, probably affecting regulation of one or more genes. As described in Chapters 26 and 27, new classes of functional RNAs are being discovered at a rapid pace. Many of these functional RNAs, now being identified by a variety of screening methods, are produced by RNA-coding genes whose existence was previously unsuspected. About 3% or so of the human genome consists of highly repetitive sequences referred to as simple-sequence repeats (SSRs). Generally less than 10 bp long, an SSR is sometimes repeated millions of times per cell, distributed in short segments of tandem repeats. The most prominent examples of SSR DNA are found in centromeres and telomeres (see Chapter 24). Human telomeres, for example, consist of up to 2,000 contiguous repeats of the sequence GGTTAG. Additional, shorter repeats of simple sequences also occur throughout the genome. These isolated segments of repeated sequences, often containing up to a few dozen tandem repeats of a simple sequence, are called short tandem repeats (STRs). Such sequences are the targets of the technologies used in forensic DNA analysis (see Box 8-1). What does all this information tell us about the similarities and differences among individual humans?
Within the human population there are millions of single-base variations, called single nucleotide polymorphisms, or SNPs (pronounced “snips”). Each person differs from the next by, on average, 1 in every 1,000 bp. Many of these variations are in the form of SNPs, but the human population also has a wide range of larger deletions, insertions, and small rearrangements. From these often subtle genetic differences comes the human variety we are all aware of, such as differences in hair color, stature, foot size, eyesight, allergies to medication, and (to some unknown degree) behavior. The process of genetic recombination during meiosis tends to mix and match these small genetic variations so that different combinations of genes are inherited (see Chapter 25). However, groups of SNPs and other genetic differences that are close together on a chromosome are rarely affected by recombination and are usually inherited together; such a grouping of multiple SNPs is known as a haplotype. Haplotypes provide convenient markers for certain human populations and for individuals within populations. Defining a haplotype requires several steps. First, positions that contain SNPs in the human population are identified in genomic DNA samples from multiple individuals (Fig. 9-26a). Each SNP in a prospective haplotype may be separated from the next SNP by several thousand base pairs and still be regarded as “nearby” in the context of chromosomes that extend for millions of base pairs. Second, a set of SNPs typically inherited together is chosen as a defined haplotype (Fig. 9-26b); each haplotype consists of the particular bases found at the various SNP positions within the defined set. Finally, tag SNPs (a subset of SNPs that define an entire haplotype) are chosen to uniquely identify each haplotype (Fig. 9-26c). By sequencing just these tag positions in genomic samples from human populations, researchers can quickly identify which of the haplotypes are present in each individual.
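The last two steps can be sketched in code. The example below is a toy illustration, assuming each haplotype is already known as a string of bases at a fixed set of SNP positions; it searches by brute force for the smallest set of positions whose bases distinguish every haplotype, then uses those tag positions to identify an individual's haplotype. The function names and the four made-up haplotypes are hypothetical, and real tag-SNP selection relies on population statistics rather than exhaustive search.

```python
from itertools import combinations


def choose_tag_snps(haplotypes):
    """Return the smallest tuple of SNP indices that tells all haplotypes apart.

    `haplotypes` maps a haplotype name to a string of bases, one per SNP
    position. Returns None if two haplotypes are identical at every position.
    """
    names = list(haplotypes)
    n_sites = len(haplotypes[names[0]])
    for size in range(1, n_sites + 1):
        for sites in combinations(range(n_sites), size):
            signatures = {tuple(haplotypes[name][i] for i in sites) for name in names}
            if len(signatures) == len(names):
                return sites
    return None


def identify_haplotype(bases_at_tags, tag_sites, haplotypes):
    """Match the bases observed at the tag positions to a known haplotype."""
    for name, sequence in haplotypes.items():
        if tuple(sequence[i] for i in tag_sites) == tuple(bases_at_tags):
            return name
    return None


# Four made-up haplotypes defined over six SNP positions.
example_haplotypes = {
    "H1": "ACGTAC",
    "H2": "ACGTGC",
    "H3": "GCGTAT",
    "H4": "GCTTGT",
}
tags = choose_tag_snps(example_haplotypes)
print(tags, identify_haplotype(("G", "A"), tags, example_haplotypes))
```

Here two positions suffice to separate four haplotypes, which mirrors the point of the text: sequencing a few tag positions stands in for sequencing every SNP in the set.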
Especially stable haplotypes exist in the mitochondrial genome (which does not undergo meiotic recombination) and on the Y chromosome (only 3% of which is homologous to the X chromosome and thus subject to recombination). As we will see, haplotypes can be used as markers to trace human migrations.
FIGURE 9-26 Haplotype identification. (a) The positions of SNPs in the human genome can be identified in genomic samples. The SNPs can be in any part of the genome, whether or not it is part of a known gene. (b) Groups of SNPs are compiled into a haplotype. The SNPs vary in the overall human population, as in the four fictitious individuals shown here, but the SNPs chosen to define a haplotype are often the same in most individuals of a particular population. (c) A few SNPs are chosen as haplotype-defining (tag SNPs, outlined in red), and these are used to simplify the process of identifying an individual’s haplotype (by sequencing 3 instead of 20 loci). For example, if the positions shown here were sequenced, an A–T–C haplotype might be characteristic of a population native to one location in northern Europe, whereas G–T–C might be the prevailing sequence in a population in Asia. Multiple haplotypes of this kind are used to trace prehistoric human migrations. [Information from International HapMap Consortium, Nature 426:789, 2003, Fig. 1.]
Genome Sequencing Informs Us about Our Humanity
The human genome is very closely related to other mammalian genomes over large segments of every chromosome. However, for a genome measured in billions of base pairs, differences of just a few percent can add up to millions of genetic distinctions. Searching among these, and making use of comparative genomics techniques, researchers can begin to explore the molecular basis of definably human characteristics.
The genome sequences of our closest biological relatives, the chimpanzee (Pan troglodytes) and bonobo (Pan paniscus), offer some important clues, and we can use them to illustrate the comparative process. Human and chimpanzee shared a common ancestor about 7 million years ago. Genomic differences between the species, including SNPs and larger genomic rearrangements such as inversions, deletions, and fusions, can be used to construct a phylogenetic tree (Fig. 9-27a). Over the course of evolution, segments of chromosomes may become inverted as a result of a segmental duplication, transposition of one copy to another arm of the same chromosome, and recombination between them (Fig. 9-27b); such inversions have occurred in the human lineage on chromosomes 1, 12, 15, 16, and 18. Two chromosomes found in other primate lineages have been fused to form human chromosome 2 (Fig. 9-27c). The human lineage thus has 23 chromosome pairs rather than the 24 pairs typical of simians. Once this fusion appeared in the line leading to humans, it would have represented a major barrier to interbreeding with other primates that lacked it.

FIGURE 9-27 Genomic alterations in the human lineage. (a) This evolutionary tree is for the progesterone receptor, which helps regulate many events in reproduction. The gene encoding this protein has undergone more evolutionary alterations than most. Amino acid changes associated uniquely with human, chimpanzee, and bonobo are listed beside each branch (with the residue number). (b) One of the multistep processes that can lead to the inversion of a chromosome segment. A gene or a chromosome segment is duplicated, then moved to another chromosomal location by transposition. Recombination of the two segments may result in inversion of the DNA between them. (c) The genes on chimpanzee chromosomes 2p and 2q are homologous to those on human chromosome 2, implying that two chromosomes fused at some point in the line leading to humans. Homologous regions can be visualized as bands created in metaphase by certain dyes, as shown here. [(a) Information from C. Chen, Mol. Phylogenet. Evol. 47:637, 2008.]
If we look only at base-pair changes, the published human and chimpanzee genomes differ by only 1.23% (compared with the 0.1% variance from one human to another). Some variations are at positions where there is a known polymorphism in either the human population or the chimpanzee population, and these are unlikely to reflect a species-defining evolutionary change. When we ignore these positions, the differences amount to about 1.06%, or about 1 in 100 bp. This small fraction translates into more than 30 million base-pair differences, some of which affect protein function and gene regulation. Humans are approximately as closely related to bonobos as to chimpanzees. The genomic rearrangements that help distinguish chimpanzee and human include 5 million short insertions or deletions involving a few base pairs each, as well as a substantial number of larger insertions, deletions, inversions, and duplications that can involve many thousands of base pairs.
When transposon insertions (a major source of genomic variation) are added to the list, the differences between the human and chimpanzee genomes increase. The chimpanzee genome has two classes of retrotransposons that are not present in the human genome (see Chapter 26). Other types of rearrangements, especially segmental duplications, are also common in primate lineages. Duplications of chromosomal segments can lead to changes in the expression of genes contained in these segments. There are about 90 million bp of such differences between human and chimpanzee, representing another 3% of these genomes. Each species has segments of DNA, constituting 40 million to 45 million bp, that are entirely unique to that particular genome, with larger chromosomal insertions, duplications, and other rearrangements affecting more base pairs than do single-nucleotide changes. Thus, in all, chimpanzee and human differ over about 4% of their genomes. Sorting out which genomic distinctions are relevant to features that are uniquely human is a daunting task. If one assumes a similar rate of evolution in the chimpanzee and human lines after they diverged from their common ancestor, half the changes represent chimpanzee lineage changes and half represent human lineage changes. By comparing both genome sequences with those of more distantly related species referred to as outgroups, we can determine which variant was present in the common ancestor. Consider a locus, X, where there is a difference between the human and chimpanzee genomes (Fig. 9-28a). The lineage of the orangutan, an outgroup, diverged from that of chimpanzee and human prior to the common ancestor of chimpanzee and human (Fig. 9-28b). If the sequence at locus X is identical in orangutan and chimpanzee, this sequence was probably present in the chimpanzee and human ancestor, and the sequence seen in humans is specific to the human lineage.
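The outgroup logic for locus X can be written as a short decision rule. The sketch below assumes three already-aligned sequences of equal length (a simplification, since real genome alignments contain gaps and rearrangements) and assigns each human–chimpanzee difference to a lineage according to which species agrees with the outgroup. The function names are hypothetical.

```python
def assign_lineage(human_base, chimp_base, outgroup_base):
    """Decide in which lineage a human-chimpanzee difference most likely arose."""
    if human_base == chimp_base:
        return "no difference"
    if outgroup_base == chimp_base:
        # Chimpanzee retains the ancestral base, so the change is human-specific.
        return "human lineage"
    if outgroup_base == human_base:
        # Human retains the ancestral base, so the change is chimpanzee-specific.
        return "chimpanzee lineage"
    # The outgroup matches neither species, so this rule cannot decide.
    return "unresolved"


def classify_differences(human_seq, chimp_seq, outgroup_seq):
    """Return {position: lineage} for every aligned site where human and chimp differ."""
    if not (len(human_seq) == len(chimp_seq) == len(outgroup_seq)):
        raise ValueError("sequences must be aligned to the same length")
    calls = {}
    for position, bases in enumerate(zip(human_seq, chimp_seq, outgroup_seq)):
        call = assign_lineage(*bases)
        if call != "no difference":
            calls[position] = call
    return calls
```

Sites labeled "unresolved" are one reason additional outgroups are useful: a second, independent outgroup can break the tie.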
Sequences that are identical in human and orangutan can be eliminated as candidates for human-specific genomic features. The importance of comparisons with closely related outgroups has given rise to new efforts to sequence the genomes of orangutan, macaque, and many other primate species. Comparison of the human and bonobo genomes is refining the analysis of genes and alleles of special significance to humans.
FIGURE 9-28 Determination of sequence alterations unique to one ancestral line. (a) Sequences from the same hypothetical gene in human and chimpanzee are compared. The sequence of this gene in the two species’ last common ancestor is unknown. (b) The orangutan genome is used as an outgroup. Because the sequence of the orangutan gene is identical to that of the chimpanzee gene, the mutation causing the difference between human and chimpanzee almost certainly occurred in the line leading to modern humans, and the common ancestor of human and chimpanzee (and orangutan) had the variant now found in chimpanzees.
The search for the genetic underpinnings of special human characteristics, such as our enhanced brain function, can benefit from two complementary approaches. The first approach searches for genomic regions where extreme changes have occurred, such as genes that have been duplicated many times or large genomic segments not present in other primates. The second approach looks at genes known to be involved in relevant human disease conditions. For brain function, for example, one would examine genes that, when mutated, contribute to cognitive or mental disorders. Notably, analyses of the human lineage have not detected an increased rate of genetic change in protein-coding genes involved in brain development or size. In primates, most genes that function uniquely in the brain are even more highly conserved than genes functioning in other tissues, perhaps due to some special constraints related to brain biochemistry.
However, there are some differences in gene expression patterns between humans and other primates that may affect brain function. For example, the gene encoding the enzyme glutamate dehydrogenase, which plays an important role in neurotransmitter synthesis, has been subjected to gene duplication events, so that there are now multiple copies of it. Genomic regions related to gene regulation have disproportionately high numbers of changes in genes involved in neural development and nutrition. Our brains have become larger as a result, and additional functional effects may eventually be defined. A variety of RNA-coding genes, some with expression concentrated in the brain, also show evidence of accelerated evolution (Fig. 9-29). Many of these are probably involved in regulating the expression of other genes. As we continue to discover many new classes of RNA (see Chapter 26), we are likely to radically change our perspective on how evolution alters the workings of living systems.
FIGURE 9-29 Accelerated evolution in some human genes. The HAR1F locus specifies a noncoding RNA that is highly conserved in vertebrates. The human HAR1F gene has an unusual number of substitutions (highlighted by color shading), providing evidence of accelerated evolution. HAR1F RNA functions in the brain during neurodevelopment. Compensatory substitutions are those that retain complementarity where strand segments are paired. [Information from T. Marques-Bonet, Annu. Rev. Genomics Hum. Genet. 10:355, 2009.]
Genome Comparisons Help Locate Genes Involved in Disease
One of the motivations for the Human Genome Project was its potential for accelerating the discovery of genes underlying genetic diseases. That promise has been fulfilled: more than 6,000 human mutation phenotypes, mostly associated with genetic diseases, have been mapped to particular genes or groups of genes.
For the last two decades, the main approach to gene mapping has been linkage analysis, yet another approach derived from evolutionary biology. In brief, the gene involved in a disease condition is mapped relative to well-characterized genetic polymorphisms that occur throughout the human genome. We can illustrate this approach by describing the search for one gene involved in early-onset Alzheimer disease. About 10% of all cases of Alzheimer disease in the United States result from an inherited predisposition. Several different genes have been discovered that, when mutated, can lead to early onset of the disease. One such gene, PS1, encodes the protein presenilin-1, and its discovery made heavy use of linkage analysis. The search begins with large families having multiple individuals affected by a particular disease (in this case, Alzheimer disease). Two of the many family pedigrees used to search for this gene in the early 1990s are shown in Figure 9-30a. In studies of this type, DNA samples are collected from both affected and unaffected family members. Researchers first localize the region associated with a disease to a specific chromosome by comparing the genotypes of individuals with and without the disease, focusing especially on close family members. The specific points of comparison are sets of well-characterized SNP loci mapped to each chromosome, as identified by the Human Genome Project. By identifying the SNPs that are most often inherited with the disease-causing gene, investigators can gradually localize the responsible gene to a single chromosome. In the case of the PS1 gene, co-inheritance was strongest with markers on chromosome 14 (Fig. 9-30b).
FIGURE 9-30 Linkage analysis in the discovery of disease genes. (a) These pedigrees for two families affected by early-onset Alzheimer disease are based on the data available at the time of the study. To protect family privacy, gender is not indicated. (b) Chromosome 14, with bands created by certain dyes. Chromosome marker positions are shown at the right, with the genetic distance between them in centimorgans, a genetic distance measurement that reflects the frequency of recombination between markers. TCRD (T-cell receptor delta) and PI (AACT; α1-antichymotrypsin), two genes with alterations in the human population, were used along with SNPs as markers in chromosome mapping. (c) By comparing DNA from affected and unaffected family members, researchers eventually defined a region of interest near marker D14S43 that contains 19 expressed genes. The gene labeled S182 (red) encodes presenilin-1. (1 Mb = 10^6 base pairs.) [Information from (a, b) G. D. Schellenberg et al., Science 258:668, 1992; (c) R. Sherrington et al., Nature 375:754, 1995.]
Localizing the gene to one chromosome (in this case, chromosome 14) is only the beginning of narrowing the search for a gene. Chromosomes are very large DNA molecules; each one houses thousands of SNPs and other changes. Simply sequencing the entire chromosome would be unlikely to reveal the SNP or other change associated with the disease. Instead, investigators rely on statistical methods that correlate the inheritance of additional, more closely spaced polymorphisms with the occurrence of the disease, focusing on a denser panel of polymorphisms known to occur on the chromosome of interest. The more closely a marker is located to a disease gene, the more likely it is to be inherited along with that gene. This process can pinpoint a region of the chromosome that contains the gene. However, the region may still encompass many genes. In our example, linkage analysis indicated that the disease-causing gene, PS1, was somewhere near the SNP locus D14S43 (Fig. 9-30c). The final steps in identifying the gene use the human genome databases.
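The statistical step described above, in which markers are ranked by how consistently they are inherited with the disease, can be illustrated with a deliberately simplified calculation. The sketch assumes that for each family member we already know whether they are affected and whether they carry the marker allele that travels with the disease in that family; the fraction of members for whom the two disagree serves as a crude stand-in for the recombination frequency between marker and disease gene. The record layout, function names, and marker labels M1 to M3 are hypothetical, and real linkage studies use likelihood-based statistics such as LOD scores rather than this simple tally.

```python
def discordant_fraction(family, marker):
    """Fraction of family members whose marker status disagrees with disease status.

    A value near 0 suggests the marker lies close to the disease gene;
    a value near 0.5 suggests the two are inherited independently.
    """
    discordant = sum(
        1 for person in family if person["affected"] != person["carries"][marker]
    )
    return discordant / len(family)


def rank_markers(family, markers):
    """Order markers from most to least tightly co-inherited with the disease."""
    return sorted(markers, key=lambda marker: discordant_fraction(family, marker))


# Made-up records: "carries" says whether the person has the disease-associated allele.
example_family = [
    {"affected": True, "carries": {"M1": True, "M2": True, "M3": False}},
    {"affected": True, "carries": {"M1": True, "M2": False, "M3": True}},
    {"affected": False, "carries": {"M1": False, "M2": False, "M3": True}},
    {"affected": False, "carries": {"M1": False, "M2": False, "M3": False}},
]
print(rank_markers(example_family, ["M3", "M2", "M1"]))
```

The marker that ranks first defines the neighborhood in which the candidate genes are then examined.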
The local region containing the gene is examined, and the genes within it are identified. DNA from many individuals, some who have the disease and some who do not, is sequenced over this region. As the DNA in this region is sequenced from increasing numbers of individuals, gene variants that are consistently present in individuals with the disease and absent in unaffected individuals can be identified. An understanding of the function of the genes in the target region can aid the search, because particular metabolic pathways may be more likely than others to produce the disease state. In 1995, the chromosome 14 gene associated with Alzheimer disease was identified as S182. The product of this gene was given the name presenilin-1, and the gene was subsequently renamed PS1. Many human genetic diseases are caused by mutations in a single gene or in sequences involved in regulation of that gene. Several different mutations in a particular gene, all leading to the same or related genetic condition, may be present in the human population. For example, there are several variants of PS1, all giving rise to a much-increased risk of early-onset Alzheimer disease. Another, more extreme example is the several genes encoding different hemoglobins: more than 1,000 known mutational variants are present in the human population. Some of these variants are innocuous; some cause diseases ranging from sickle cell disease to thalassaemias. The inheritance of particular mutant genes may be concentrated in families or in isolated populations. More complex are cases in which a disease condition is caused by mutations in two different genes (neither of which, alone, causes the disease), or in which a particular condition is enhanced by an otherwise innocuous mutation in another gene. Identifying the genes and mutations responsible for these digenic diseases is exceedingly difficult, and sometimes such diseases can be documented only within small, isolated, and highly inbred populations. 
Genome databases provide alternative paths to the identification of disease genes. In many cases, we already have biochemical information about the disease. In the case of early-onset Alzheimer disease, an accumulation of the amyloid-β protein in limbic and association cortices of the brain is at least partly responsible for the symptoms. Defects in presenilin-1 (and in a related protein, presenilin-2, encoded by a gene on chromosome 1) lead to the elevated cortical levels of amyloid-β protein. Focused databases are being developed that catalog such functional information on the protein products of genes and on protein-interaction networks and SNP locations, along with other data. The result is a streamlined path to the identification of candidate genes for a particular disease. If a researcher knows a little about the kinds of enzymes or other proteins likely to contribute to disease symptoms, these databases can quickly generate a list of genes known to encode proteins with relevant functions, a list of additional uncharacterized genes with orthologous or paralogous relationships to these genes, a list of proteins known to interact with the target proteins or orthologs in other organisms, and a map of gene positions. Often, with the aid of data from some selected family pedigrees, a short list of potentially relevant genes can be rapidly determined. These approaches are not limited to human diseases. The same methods can be used to identify the genes involved in diseases (or genes that produce desirable characteristics) in other animals and in plants. Of course, they can also be used to track down genes involved in any observable trait that a researcher might be interested in.
Genome Sequences Inform Us about Our Past and Provide Opportunities for the Future
Anatomically modern humans arose in Africa between 250,000 and 350,000 years ago. About 100,000 to 120,000 years ago, humans in Africa looked out across the Red Sea to Asia.
Perhaps encouraged by some innovation in small boat construction, or driven by conflict or famine, or simply curious, they crossed the water barrier. That initial colonization began a journey that did not stop until humans reached Tierra del Fuego (at the southern tip of South America), many thousands of years later. As Homo sapiens populations moved into more northern parts of Europe and Asia about 45,000 years ago, established populations from previous hominid expansions into Eurasia, including Homo neanderthalensis and a group now called the Denisovans, were displaced. The Neanderthals and Denisovans disappeared, just as other hominid lines had disappeared before them. The story of how modern humans first appeared in Africa a few hundred thousand years ago, and their migrations as they eventually radiated out of Africa, is written in our DNA. Genomic sequences from multiple species have brought both primate and hominid evolution into sharper focus. Using haplotypes present in extant human populations, we can trace the migrations of our intrepid ancestors across the planet (Fig. 9-31a). The Neanderthals were not simply displaced. Some mingling occurred (Fig. 9-31b). Using sensitive PCR-based methods, we now possess multiple complete sequences of the Neanderthal genome (Box 9-2). We know that up to 5% of the genome of most non-African humans is derived from Neanderthals. Anatomically modern human remains up to 45,000 years old have been sequenced, beginning an effort to pinpoint the period of interbreeding. Human populations native to Melanesia and Australia acquired up to 6% of their genomic DNA from the Denisovans. Neanderthal DNA gave humans a more complex immune system, making us more resistant to infection but also a little more susceptible to autoimmune diseases. The story of our past is gradually taking shape as more genomes, of humans alive today and those who lived in past millennia, are being assembled.
FIGURE 9-31 The paths of human migrations. (a) When a small part of a human population migrates away from a larger group, it takes only part of the population’s overall genetic diversity with it. Thus, some haplotypes are present in the migrating group but many are not. At the same time, mutations can create novel haplotypes over time. This map was generated from an analysis of genetic markers (defined haplotypes with M or LLY numbers) on the Y chromosome. The genetic samples were taken from indigenous populations long established at geographic points along the routes shown. The abbreviation kya means “thousand years ago.” (b) Human migrations eventually displaced several closely related hominid groups, but not before some intermingling occurred. This tree illustrates gene-flow events documented from detailed genomic sequences of modern and ancient humans, as well as of Neanderthals and Denisovans. DNA from an unknown group of Neanderthals (A) is recorded in the genomes of all humans with some Eurasian heritage. A transfer of DNA from an unknown ancestor to the Denisovan line (B) contributed to ancestors of present-day individuals native to Australia and Pacific islands (Oceania). [Information from (a) G. Stix, Sci. Am. 299 (July):56, 2008; (b) S. Pääbo, Cell 157:216, 2014.]
BOX 9-2 Getting to Know Humanity’s Next of Kin
Modern humans and Neanderthals coexisted in Europe and Asia as recently as 30,000 years ago. The human and Neanderthal ancestral populations diverged about 370,000 years ago, somewhat before the appearance of anatomically modern humans. Neanderthals used tools, lived in small groups, and buried their dead. Of the known hominid relatives of modern humans, Neanderthals are the closest. For hundreds of millennia, they inhabited large parts of Europe and western Asia (Fig. 1). If the chimpanzee genome can tell us something about what it is to be human, the Neanderthal genome can tell us more. Buried in the bones and other remains taken from burial sites are fragments of Neanderthal genomic DNA.
Technologies developed for use in forensic science (see Box 8-1) and studies of ancient DNA have been combined in the Neanderthal genome project.

FIGURE 1 Neanderthals occupied much of Europe and western Asia until about 30,000 years ago. Major Neanderthal archaeological sites are shown here. (Note that the group was named for the site at Neanderthal in Germany.)

This endeavor is unlike the genome projects aimed at extant species. The Neanderthal DNA is present in small amounts, and it is contaminated with DNA from other animals and bacteria. How does one get at it, and how can one be certain that the sequences really came from Neanderthals? The answers have been revealed by innovative applications of biotechnology. In essence, the small quantities of DNA fragments found in Neanderthal bone or other remains are cloned into a library, and the cloned DNA segments are sequenced at random, contaminants and all. The sequencing results are compared with the existing human genome and chimpanzee genome databases. Segments derived from Neanderthal DNA have sequences closely related to human DNA and chimpanzee DNA and thus are readily distinguished by computerized analysis from segments derived from bacteria or insects. Once these segments are sequenced, they can be used as probes to identify sequence fragments in ancient samples that overlap with these known fragments.

The potential problem of contamination with the closely related modern human DNA can be controlled for by examining mitochondrial DNA. Human populations have readily identifiable haplotypes (distinctive sets of genomic differences; see Fig. 9-26) in their mitochondrial DNA, and analysis of Neanderthal samples has shown that Neanderthals’ mitochondrial DNA has its own distinct haplotypes. The presence in the Neanderthal samples of some base-pair differences that are found in the chimpanzee database but not in the human database is more evidence that nonhuman hominid sequences are being found. Multiple high-quality Neanderthal genomic sequences have been completed.
The data provide evidence that modern humans and the Neanderthals who were the source of this DNA shared a common ancestor about 700,000 years ago (Fig. 2). Analysis of mitochondrial DNA suggests that the two groups continued on the same track, with some gene flow between them, for about 300,000 more years. The lines split with the appearance of anatomically modern humans, although evidence now exists for some intermingling of the lines somewhat later as humans spread through Eurasia.

FIGURE 2 This timeline shows the divergence of human and Neanderthal genome sequences (black lines) and of ancestral human and Neanderthal populations (yellow screen). Genomic data provide evidence for some intermingling of the populations up to about 45,000 years ago. Key events in human evolution are noted. [Information from J. P. Noonan et al., Science 314:1113, 2006.]

Expanded libraries of Neanderthal DNA from different sets of remains are revealing Neanderthal genetic diversity, and may eventually tell us about Neanderthal migrations, providing a fascinating look at our hominid past.

The medical promise of personal genomic sequences grows as sequencing costs continue to decline and more genes underlying inherited diseases are defined. Knowledge of genomic sequences also provides the prospect of altering them. It is now commonplace to engineer the DNA sequences of organisms ranging from bacteria and yeast to plants and mammals, for research and commercial purposes. Efforts to cure inherited human diseases by human gene therapy have not yet lived up to their potential, but technologies for gene delivery are constantly being improved. Few scientific disciplines will affect the future of our species more than modern genomics.

SUMMARY 9.3 Genomics and the Human Story

About 30% of the DNA in the human genome is in the exons and introns of genes that encode proteins. Nearly half of the DNA is derived from parasitic transposons. Much of the rest encodes RNAs of many types. Simple-sequence repeats make up the centromere and telomeres.

The gene alterations that define humanity can be discerned in part through comparative genomics, using other primates.

Comparative genomics is also used to locate the gene alterations that define inherited diseases.

Human genomics can be used to study the evolution and migration of our human ancestors over many millennia.

Chapter Review

KEY TERMS

Terms in bold are defined in the glossary.
genome; genomics; systems biology; cloning vector; recombinant DNA; recombinant DNA technology; genetic engineering; restriction endonucleases; DNA ligases; restriction-modification system; multiple cloning site (MCS); plasmid; bacterial artificial chromosome (BAC); yeast artificial chromosome (YAC); expression vector; baculovirus; bacmid; site-directed mutagenesis; fusion protein; tag; reverse transcriptase PCR (RT-PCR); quantitative PCR (qPCR); DNA library; complementary DNA (cDNA); cDNA library; transcriptome; proteome; transcriptomics; proteomics; comparative genomics; genome annotation; orthologs; paralogs; synteny; RNA-Seq; single cell RNA-Seq (scRNA-Seq); green fluorescent protein (GFP); epitope tag; yeast two-hybrid analysis; CRISPR/Cas; guide RNA (gRNA); trans-activating CRISPR RNA (tracrRNA); single guide RNA (sgRNA); single nucleotide polymorphism (SNP); haplotype

PROBLEMS

1. Engineering Cloned DNA When joining two or more DNA fragments, a researcher can adjust the sequence at the junction in a variety of subtle ways, as seen in these exercises. a. Write the sequence of each end of a linear DNA fragment produced by an EcoRI restriction digest (include those sequences remaining from the EcoRI recognition sequence). b. Write the sequence resulting from the reaction of this end sequence with DNA polymerase I and the four deoxynucleoside triphosphates (see Fig. 8-34). c. Write the sequence produced at the junction that arises if two ends with the structure derived in (b) are ligated (see Fig. 25-15). d. Write the sequence produced if the structure derived in (a) is treated with a nuclease that degrades only single-stranded DNA. e. Write the sequence of the junction produced if an end with structure (b) is ligated to an end with structure (d). f. Write the sequence of the end of a linear DNA fragment that was produced by a PvuII restriction digest (include those sequences remaining from the PvuII recognition sequence). g. Write the sequence of the junction produced if an end with structure (b) is ligated to an end with structure (f). h. Suppose you can synthesize a short duplex DNA fragment with any sequence you desire. With this synthetic fragment and the procedures described in (a) through (g), design a protocol that would remove an EcoRI restriction site from a DNA molecule and incorporate a new BamHI restriction site at approximately the same location. (See Fig. 9-2.) i. Design four different short synthetic double-stranded DNA fragments that would permit ligation of structure (a) with a DNA fragment produced by a PstI restriction digest. In one of these fragments, design the sequence so that the final junction contains the recognition sequences for both EcoRI and PstI. In the second and third fragments, design the sequence so that the junction contains only the EcoRI and only the PstI recognition sequence, respectively. Design the sequence of the fourth fragment so that neither the EcoRI nor the PstI sequence appears in the junction.

2. Selecting for Recombinant Plasmids When cloning a foreign DNA fragment into a plasmid, it is often useful to insert the fragment at a site that interrupts a selectable marker (such as the tetracycline-resistance gene of pBR322). The loss of function of the interrupted gene can be used to identify clones containing recombinant plasmids with foreign DNA. With a yeast artificial chromosome (YAC) vector, a researcher can distinguish vectors that incorporate large foreign DNA fragments from those that do not, without interrupting gene function. How are these recombinant vectors identified?

3. DNA Cloning The restriction endonuclease PstI cleaves the plasmid cloning vector pBR322 (see Fig. 9-3). A researcher ligates an isolated DNA fragment from a eukaryotic genome (also produced by PstI cleavage) to the prepared vector.
She then uses the mixture of ligated DNAs to transform bacteria and selects plasmid-containing bacteria by growth in the presence of tetracycline. a. In addition to the desired recombinant plasmid, what other types of plasmids might be found among the transformed bacteria that are tetracycline-resistant? How can the types be distinguished? b. The cloned DNA fragment is 1,000 bp long and has an EcoRI site 250 bp from one end. The researcher cleaves three different recombinant plasmids with EcoRI and analyzes them by gel electrophoresis, with the results shown in the image. What does each pattern say about the cloned DNA? Note that in pBR322, the PstI and EcoRI restriction sites are about 750 bp apart. The entire plasmid with no cloned insert is 4,361 bp. Size markers in lane 4 have the number of nucleotides noted.

4. Restriction Enzymes The partial sequence of one strand of a double-stranded DNA molecule is

5′ – – – GACGAAGTGCTGCAGAAAGTCCGCGTTATAGGCATGAATTCCTGAGG – – – 3′

The cleavage sites for the restriction enzymes EcoRI and PstI are shown below.

Write the sequence of both strands of the DNA fragment created when this DNA is cleaved with both EcoRI and PstI. The top strand of your duplex DNA fragment should be derived from the strand sequence given.

5. Designing a Diagnostic Test for a Genetic Disease Huntington disease (HD) is an inherited neurodegenerative disorder, characterized by the gradual, irreversible impairment of psychological, motor, and cognitive functions. Symptoms typically appear in middle age, but onset can occur at almost any age. The course of the disease can be 15 to 20 years. Biomedical research is improving our understanding of the molecular basis of the disease. The genetic mutation underlying HD has been traced to a gene encoding a protein (Mr 350,000) of unknown function. The region of the gene that encodes the amino terminus of the protein has a repeated sequence of CAG codons (for glutamine). The length of this simple trinucleotide repeat indicates whether an individual will develop HD, and at approximately what age the first symptoms will occur. The sequence is repeated 6 to 39 times in individuals who will not develop HD, 40 to 55 times in those with adult-onset HD, and more than 70 times in individuals with childhood-onset HD. A small portion of the amino-terminal coding sequence of the 3,143-codon HD gene is shown. The nucleotide sequence of the DNA is given in black, the amino acid sequence corresponding to the gene is given in blue, and the CAG repeat is shaded. Using Figure 27-7 to translate the genetic code, outline a PCR-based test for HD that could be carried out using a blood sample. Assume the PCR primer must be 25 nucleotides long. By convention, unless otherwise specified, a DNA sequence encoding a protein is displayed with the coding strand (the sequence identical to the mRNA transcribed from the gene, except for U replacing T) on top, such that it reads 5′ to 3′, left to right.
Information from The Huntington’s Disease Collaborative Research Group, Cell 72:971, 1993.

6. Using PCR to Detect Circular DNA Molecules In a species of ciliated protist, a segment of genomic DNA is sometimes deleted. The deletion is a genetically programmed reaction associated with cellular mating. A researcher proposes that the DNA is deleted in a type of recombination called site-specific recombination, with the DNA at either end of the segment joined together and the deleted DNA ending up as a circular DNA reaction product. Suggest how the researcher might use the polymerase chain reaction (PCR) to detect the presence of the circular form of the deleted DNA in an extract of the protist.

7. Protein Dynamics within Cells In a bacterial cell, two proteins, X and Y, are thought to have similar functions. Researchers genetically engineered each protein to fuse with a variant of the green fluorescent protein, one that glows red (X) and the other yellow (Y). Controls showed that both fusion proteins retained their activity, and both produced visible spots of light (foci) when expressed. To better understand the biological functions of the two proteins, the researchers expressed the fusion proteins in the same bacterial cell under two different conditions. Under nutrient-rich conditions, distinct red and yellow puncta (well-defined clustering of foci) were distributed throughout the cell. One or two red puncta were typically found within the nucleoid (chromosomal DNA), whereas the multiple yellow puncta were distributed throughout the cell. However, under nutrient starvation, the yellow puncta migrated and co-localized (overlapped) with the red puncta. What might be concluded from these observations?

8. Mapping a Chromosome Segment Researchers isolated a group of overlapping clones, designated A through F, from one region of a chromosome.
They then separately cleaved each of the clones using a restriction enzyme and resolved the pieces by agarose gel electrophoresis. The image shows the electrophoresis results. There are nine different restriction fragments in this chromosomal region, with a subset appearing in each clone. Using this information, deduce the order of the restriction fragments in the chromosome.

9. Immunofluorescence In a common protocol for immunofluorescence detection of cellular proteins, an investigator uses two antibodies. The first binds specifically to the protein of interest. The second is labeled with fluorochromes for easy visualization, and it binds to the first antibody. In principle, one could simply label the first antibody and skip one step. Why use two successive antibodies?

10. Yeast Two-Hybrid Analysis You are a researcher who has just discovered a new protein in a fungus. Design a yeast two-hybrid experiment to identify the other proteins in the fungal cell with which your protein interacts, and explain how this could help you determine the function of your protein.

11. RNA-Seq RNA-Seq is a next-generation sequencing method used to quantitatively profile the cellular transcriptome. Researchers use RNA-Seq to compare the expression of genes under different environmental conditions or between different types of cells. There are three general steps in an RNA-Seq workflow: 1. Generate a cDNA library from cellular RNA. 2. Add oligonucleotide adapters to the fragments of the cDNA library. 3. Use next-generation sequencing to identify transcriptionally active genes from the cDNA library. What is the role of the enzyme reverse transcriptase in an RNA-Seq workflow?

12. Cellular RNA Suppose that an investigative team conducted an RNA-Seq experiment on mouse liver cells.
The team found many sequences that contained no open reading frames (ORFs; Chapter 27), the long stretches of consecutive triplet codons that could be translated into a protein and therefore suggest the presence of a gene. Suggest a reason for this observed lack of ORFs.

13. Use of Outgroups in Comparative Genomics A hypothetical protein found in human, orangutan, and chimpanzee has the following sequences (boldface indicates amino acid residue differences; dashes indicate a deletion, meaning the residues are missing in that sequence):

H: ATSAAGYDEWEGGKVLIHL––KLQNRGALLELDIGAV
Orangutan: ATSAAGWDEWEGGKVLIHLDGKLQNRGALLELDIGAV
Chimpanzee: ATSAAGWDEWEGGKILIHLDGKLQNRGALLELDIGAV

What is the most likely sequence of the protein present in the last common ancestor of human and chimpanzee?

14. Human Migrations I Native American populations in North America and South America have mitochondrial DNA haplotypes that can be traced to populations in northeast Asia. The Aleut and Eskimo populations in the far northern parts of North America possess a subset of the same haplotypes that link other Native Americans to Asia, and the Aleut and Eskimo populations also have several additional haplotypes that can be traced to Asian origins but are not found in native populations in other parts of the Americas. Provide a possible explanation.

15. Human Migrations II DNA (haplotypes) originating from the Denisovans can be found in the genomes of Indigenous Australians and Melanesian Islanders. However, the same DNA markers are not found in the genomes of people native to Africa. Explain.

16. Finding Disease Genes You are a gene hunter, trying to find the genetic basis for a rare inherited disease. Examination of six pedigrees of families affected by the disease provides inconsistent results. For two of the families, the disease is co-inherited with markers on chromosome 7. For the other four families, the disease is co-inherited with markers on chromosome 12.
Explain how this difference might have arisen.

17. RT-PCR Primer Design Investigators can use sequences of transcribed mRNA as a PCR template to produce a corresponding DNA sequence. Reverse transcriptase, an enzyme that works like DNA polymerase, amplifies the mRNA template as DNA in the first PCR cycle. After making the DNA strands from the RNA template, the investigator can carry out the remaining cycles with DNA polymerase, using standard PCR protocols. She can then compare the detected amplified sequences to the genome to analyze transcriptional activity. Thus, reverse transcriptase PCR (RT-PCR) is a powerful experimental technique used to detect RNA from living cells, which transcribe their DNA into RNA, as opposed to dead tissues, which do not. Consider the mRNA transcript shown.

5′–AUAUCGCUCCACGUAACUGAAAGAAAAGUGUGGAGCUAGCAGUCGAGA–3′

Which DNA oligonucleotide pair could serve as a suitable primer in an RT-PCR amplification of this transcript? The oligonucleotides are written in the 5′ to 3′ direction. a. Primer 1: GGAGACCTTGACT; Primer 2: AGTCAAGGTCTCC b. Primer 1: GACTGCTAGCTCC; Primer 2: GTTACGTGGAGCG c. Primer 1: GCCGCGCGCGCGC; Primer 2: CCCCGCCGCGCCG d. Primer 1: CACGATTCAACGTG; Primer 2: TTCGCATTGCCGAA

DATA ANALYSIS PROBLEM

18. HincII: The First Restriction Endonuclease Discovery of the first restriction endonuclease to be of practical use was reported in two papers published in 1970. In the first paper, Smith and Wilcox described the isolation of an enzyme that cleaved double-stranded DNA. They initially demonstrated the enzyme’s nuclease activity by measuring the decrease in viscosity of DNA samples treated with the enzyme. a. Why does treatment with a nuclease decrease the viscosity of a solution of DNA?

The authors determined whether the enzyme was an endonuclease or an exonuclease by treating 32P-labeled DNA with the enzyme, then adding trichloroacetic acid (TCA).
Under the conditions used in their experiment, single nucleotides would be TCA-soluble and oligonucleotides would precipitate. b. No TCA-soluble 32P-labeled material formed upon treatment of the 32P-labeled DNA with the nuclease. Based on this finding, is the enzyme an endonuclease or is it an exonuclease? Explain your reasoning.

When a polynucleotide is cleaved, the phosphate usually is not removed but remains attached to the 5′ or 3′ end of the resulting DNA fragment. Smith and Wilcox determined the location of the phosphate on the fragment formed by the nuclease in three steps: 1. Treat unlabeled DNA with the nuclease. 2. Treat a sample (A) of the product with γ-32P-labeled ATP and polynucleotide kinase (which can attach the γ-phosphate of ATP to a 5′ OH but not to a 5′ phosphate or to a 3′ OH or 3′ phosphate). Measure the amount of 32P incorporated into the DNA. 3. Treat another sample (B) of the product of step 1 with alkaline phosphatase (which removes phosphate groups from free 5′ and 3′ ends), followed by polynucleotide kinase and γ-32P-labeled ATP. Measure the amount of 32P incorporated into the DNA. c. Smith and Wilcox found that sample A had 136 counts/min of 32P; sample B had 3,740 counts/min. Did the nuclease cleavage leave the phosphate on the 5′ end or the 3′ end of the DNA fragments? Explain your reasoning. d. Treatment of bacteriophage T7 DNA with the nuclease gave approximately 40 specific fragments of various lengths. How is this result consistent with the enzyme’s recognizing a specific sequence in the DNA as opposed to making random double-strand breaks?

At this point, there were two possibilities for the site-specific cleavage: the cleavage occurred either (1) at the site of recognition or (2) near the site of recognition but not within the sequence recognized. To address this issue, Kelly and Smith determined the sequence of the 5′ ends of the DNA fragments generated by the nuclease, in five steps: 1. Treat phage T7 DNA with the enzyme. 2. Treat the resulting fragments with alkaline phosphatase to remove the 5′ phosphates. 3. Treat the dephosphorylated fragments with polynucleotide kinase and γ-32P-labeled ATP to label the 5′ ends. 4. Treat the labeled molecules with DNases to break them into a mixture of mono-, di-, and trinucleotides. 5. Determine the sequence of the labeled mono-, di-, and trinucleotides by comparing them with oligonucleotides of known sequence on thin-layer chromatography. The labeled products were identified as follows: mononucleotides: A and G; dinucleotides: (5′)ApA(3′) and (5′)GpA(3′); trinucleotides: (5′)ApApC(3′) and (5′)GpApC(3′). e. Which model of cleavage is consistent with these results? Explain your reasoning.

Kelly and Smith went on to determine the sequence of the 3′ ends of the fragments. They found a mixture of (5′)TpC(3′) and (5′)TpT(3′). They did not determine the sequence of any trinucleotides at the 3′ end. f. Based on these data, what is the recognition sequence for the nuclease, and where in the sequence is the DNA backbone cleaved? Use Table 9-2 as a model for your answer.

References

Kelly, T. J., and H. O. Smith. 1970. A restriction enzyme from Haemophilus influenzae: II. Base sequence of the recognition site. J. Mol. Biol. 51:393–409.

Smith, H. O., and K. W. Wilcox. 1970. A restriction enzyme from Haemophilus influenzae: I. Purification and general properties. J. Mol. Biol. 51:379–391.

Practice
Multiple choice (25 questions)

Stems are from the chapter Problems section; correct choices are drawn from Abbreviated Solutions to Problems (Appendix B) in the same edition.

Practice questions (from chapter Problems & Appendix B)

1. Engineering Cloned DNA When joining two or more DNA fragments, a researcher can adjust the sequence at the junction in a variety of subtle ways, as seen in these exercises. a. Write the sequence of each end of a linear DNA fragment produced by an EcoRI restriction digest (include those sequences remaining from the EcoRI recognition sequence). b. Write the sequence resulting from the reaction of this end sequence with DNA polymerase I and the four deoxynucleoside triphosphates (see Fig. 8-34). c. Write the sequence produced at the junction that arises if two ends with the structure derived in (b) are ligated (see Fig. 25-15). d. Write the sequence produced if the structure derived in (a) is treated with a nuclease that degrades only single-stranded DNA. e. Write the sequence of the junction produced if an end with structure (b) is ligated to an end with structure (d). f. Write the sequence of the end of a linear DNA fragment that was produced by a PvuII restriction digest (include those sequences remaining from the PvuII recognition sequence). g. Write the sequence of the junction produced if an end with structure (b) is ligated to an end with structure (f). h. Suppose you can synthesize a short duplex DNA fragment with any sequence you desire. With this synthetic fragment and the procedures described in (a) through (g), design a protocol that would remove an EcoRI restriction site from a DNA molecule and incorporate a new BamHI restriction site at approximately the same location. (See Fig. 9-2.) i. Design four different short synthetic double-stranded DNA fragments that would permit ligation of structure (a) with a DNA fragment produced by a PstI restriction digest. In one of these fragments, design the sequence so that the final junction contains the recognition sequences for both EcoRI and PstI. In the second and third fragments, design the sequence so that the junction contains only the EcoRI and only the PstI recognition sequence, respectively. 
Design the sequence of the fourth fragment so that neither the EcoRI nor the PstI sequence appears in the junction.

2. Selecting for Recombinant Plasmids When cloning a foreign DNA fragment into a plasmid, it is often useful to insert the fragment at a site that interrupts a selectable marker (such as the tetracycline-resistance gene of pBR322). The loss of function of the interrupted gene can be used to identify clones containing recombinant plasmids with foreign DNA. With a yeast artificial chromosome (YAC) vector, a researcher can distinguish vectors that incorporate large foreign DNA fragments from those that do not, without interrupting gene function. How are these recombinant vectors identified?

3. DNA Cloning The restriction endonuclease PstI cleaves the plasmid cloning vector pBR322 (see Fig. 9-3). A researcher ligates an isolated DNA fragment from a eukaryotic genome (also produced by PstI cleavage) to the prepared vector. She then uses the mixture of ligated DNAs to transform bacteria and selects plasmid-containing bacteria by growth in the presence of tetracycline. a. In addition to the desired recombinant plasmid, what other types of plasmids might be found among the transformed bacteria that are tetracycline-resistant? How can the types be distinguished? b. The cloned DNA fragment is 1,000 bp long and has an EcoRI site 250 bp from one end. The researcher cleaves three different recombinant plasmids with EcoRI and analyzes them by gel electrophoresis, with the results shown in the image. What does each pattern say about the cloned DNA? Note that in pBR322, the PstI and EcoRI restriction sites are about 750 bp apart. The entire plasmid with no cloned insert is 4,361 bp. Size markers in lane 4 have the number of nucleotides noted.
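
The fragment sizes expected in part (b) come down to arithmetic on a circular map. The sketch below is only an illustration under stated assumptions: the coordinate system (insert placed at positions 0 to 1,000 of the recombinant plasmid), the function names, and the constant names are all hypothetical, and the 750 bp spacing is the approximate figure given in the problem, so real bands would only roughly match these numbers.

```python
def circular_fragments(total_len, cut_sites):
    """Fragment sizes (largest first) after cutting a circle at the given positions."""
    sites = sorted(s % total_len for s in cut_sites)
    if not sites:
        return [total_len]  # an uncut circle stays one molecule
    sizes = []
    for i, site in enumerate(sites):
        nxt = sites[(i + 1) % len(sites)]
        # Distance to the next site going around the circle; a single
        # site gives 0 here, which means one full-length linear fragment.
        sizes.append((nxt - site) % total_len or total_len)
    return sorted(sizes, reverse=True)


VECTOR_BP = 4361            # pBR322 with no insert (from the problem)
INSERT_BP = 1000            # cloned fragment (from the problem)
VECTOR_ECORI_TO_PSTI = 750  # approximate spacing (from the problem)
INSERT_ECORI_OFFSET = 250   # EcoRI site 250 bp from one end of the insert


def recombinant_ecori_fragments(flipped):
    """EcoRI fragment sizes for one insert in either orientation (assumed layout)."""
    total = VECTOR_BP + INSERT_BP
    # Assumed layout: the insert spans 0..INSERT_BP, and the vector's EcoRI
    # site lies 750 bp before position 0 when read around the circle.
    vector_site = total - VECTOR_ECORI_TO_PSTI
    insert_site = INSERT_BP - INSERT_ECORI_OFFSET if flipped else INSERT_ECORI_OFFSET
    return circular_fragments(total, [vector_site, insert_site])


print(recombinant_ecori_fragments(flipped=False))
print(recombinant_ecori_fragments(flipped=True))
```

A gel pattern that matches neither orientation calls for a different explanation, which this sketch does not attempt to model.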

4. Restriction Enzymes The partial sequence of one strand of a double-stranded DNA molecule is 5′ – – – GACGAAGTGCTGCAGAAAGTCCGCGTTATAGGCATGAATTCCTGAGG – – – 3′ The cleavage sites for the restriction enzymes EcoRI and PstI are shown below. Write the sequence of both strands of the DNA fragment created when this DNA is cleaved with both EcoRI and PstI. The top strand of your duplex DNA fragment should be derived from the strand sequence given.
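
A first step in a problem like this is locating the recognition sites in the given strand. The sketch below is one way to do that, not a full solution: it assumes the standard cleavage positions for the two enzymes (EcoRI cuts G^AATTC and PstI cuts CTGCA^G on the strand shown), since the figure with the cleavage sites is not reproduced here. The names `ENZYMES` and `top_strand_cuts` are made up for the example.

```python
SEQ = "GACGAAGTGCTGCAGAAAGTCCGCGTTATAGGCATGAATTCCTGAGG"

# Recognition sequence and top-strand cut offset, counted in bases from
# the start of the site (assumed standard positions for these enzymes).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "PstI": ("CTGCAG", 5),   # CTGCA^G
}


def top_strand_cuts(seq, enzymes):
    """Map each enzyme name to the 0-based top-strand cut positions in seq."""
    cuts = {}
    for name, (site, offset) in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + offset)
            start = seq.find(site, start + 1)
        cuts[name] = positions
    return cuts


cuts = top_strand_cuts(SEQ, ENZYMES)
# Each enzyme cuts this sequence exactly once, so there are two cut positions.
left, right = sorted(p for positions in cuts.values() for p in positions)
top_fragment = SEQ[left:right]
print(cuts)
print(top_fragment)
```

This covers only the top strand. Both enzymes make staggered cuts, so the bottom strand of the fragment is not simply the complement of `top_fragment`: its ends must be worked out from where each enzyme cuts the opposite strand.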

5. Designing a Diagnostic Test for a Genetic Disease Huntington disease (HD) is an inherited neurodegenerative disorder, characterized by the gradual, irreversible impairment of psychological, motor, and cognitive functions. Symptoms typically appear in middle age, but onset can occur at almost any age. The course of the disease can be 15 to 20 years. Biomedical research is improving our understanding of the molecular basis of the disease. The genetic mutation underlying HD has been traced to a gene encoding a protein (Mr 350,000) of unknown function. The region of the gene that encodes the amino terminus of the protein has a repeated sequence of CAG codons (for glutamine). The length of this simple trinucleotide repeat indicates whether an individual will develop HD, and at approximately what age the first symptoms will occur. The sequence is repeated 6 to 39 times in individuals who will not develop HD, 40 to 55 times in those with adult-onset HD, and more than 70 times in individuals with childhood-onset HD. A small portion of the amino-terminal coding sequence of the 3,143-codon HD gene is shown. The nucleotide sequence of the DNA is given in black, the amino acid sequence corresponding to the gene is given in blue, and the CAG repeat is shaded. Using Figure 27-7 to translate the genetic code, outline a PCR-based test for HD that could be carried out using a blood sample. Assume the PCR primer must be 25 nucleotides long. By convention, unless otherwise specified, a DNA sequence encoding a protein is displayed with the coding strand (the sequence identical to the mRNA transcribed from the gene, except for U replacing T) on top, such that it reads 5′ to 3′, left to right. Information from The Huntington’s Disease Collaborative Research Group, Cell 72:971, 1993.
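
A PCR test of this kind relies on the product length growing by 3 bp for each additional CAG repeat. The sketch below illustrates that arithmetic together with the repeat ranges stated in the problem. It assumes primers that flank the repeat, and `flank_bp` (the total non-repeat length of the amplicon, primers included) is a hypothetical parameter, because the flanking sequence is not reproduced here. Repeat counts the problem does not classify are reported as unspecified rather than guessed.

```python
def classify_cag_repeats(n_repeats):
    """Classify a CAG repeat count using only the ranges given in the problem."""
    if 6 <= n_repeats <= 39:
        return "not expected to develop HD"
    if 40 <= n_repeats <= 55:
        return "adult-onset HD"
    if n_repeats > 70:
        return "childhood-onset HD"
    return "range not specified in the problem"


def amplicon_length(n_repeats, flank_bp):
    """Length of a PCR product spanning the repeat: flanks plus 3 bp per CAG."""
    return flank_bp + 3 * n_repeats


# Hypothetical flank length, chosen only to make the comparison concrete.
FLANK_BP = 100
for n in (20, 45, 80):
    print(n, classify_cag_repeats(n), amplicon_length(n, FLANK_BP))
```

Whatever the real flank length is, two alleles that differ by k repeats give products that differ by 3k bp, and that difference is what gel electrophoresis of the PCR products would resolve.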

6. Using PCR to Detect Circular DNA Molecules In a species of ciliated protist, a segment of genomic DNA is sometimes deleted. The deletion is a genetically programmed reaction associated with cellular mating. A researcher proposes that the DNA is deleted in a type of recombination called site-specific recombination, with the DNA at either end of the segment joined together and the deleted DNA ending up as a circular DNA reaction product. Suggest how the researcher might use the polymerase chain reaction (PCR) to detect the presence of the circular form of the deleted DNA in an extract of the protist.

7. Protein Dynamics within Cells In a bacterial cell, two proteins, X and Y, are thought to have similar functions. Researchers genetically engineered each protein to fuse with a variant of the green fluorescent protein, one that glows red (X) and the other yellow (Y). Controls showed that both fusion proteins retained their activity, and both produced visible spots of light (foci) when expressed. To better understand the biological functions of the two proteins, the researchers expressed the fusion proteins in the same bacterial cell under two different conditions. Under nutrient-rich conditions, distinct red and yellow puncta (well-defined clustering of foci) were distributed throughout the cell. One or two red puncta were typically found within the nucleoid (chromosomal DNA), whereas the multiple yellow puncta were distributed throughout the cell. However, under nutrient starvation, the yellow puncta migrated and co-localized (overlapped) with the red puncta. What might be concluded from these observations?

8. Mapping a Chromosome Segment Researchers isolated a group of overlapping clones, designated A through F, from one region of a chromosome. They then separately cleaved each of the clones using a restriction enzyme and resolved the pieces by agarose gel electrophoresis. The image shows the electrophoresis results. There are nine different restriction fragments in this chromosomal region, with a subset appearing in each clone. Using this information, deduce the order of the restriction fragments in the chromosome.

9. Immunofluorescence In a common protocol for immunofluorescence detection of cellular proteins, an investigator uses two antibodies. The first binds specifically to the protein of interest. The second is labeled with fluorochromes for easy visualization, and it binds to the first antibody. In principle, one could simply label the first antibody and skip one step. Why use two successive antibodies?

10. Yeast Two-Hybrid Analysis You are a researcher who has just discovered a new protein in a fungus. Design a yeast two-hybrid experiment to identify the other proteins in the fungal cell with which your protein interacts, and explain how this could help you determine the function of your protein.

11. RNA-Seq RNA-Seq is a next-generation sequencing method used to quantitatively profile the cellular transcriptome. Researchers use RNA-Seq to compare the expression of genes under different environmental conditions or between different types of cells. There are three general steps in an RNA-Seq workflow: 1. Generate a cDNA library from cellular RNA. 2. Add oligonucleotide adapters to the fragments of the cDNA library. 3. Use next-generation sequencing to identify transcriptionally active genes from the cDNA library. What is the role of the enzyme reverse transcriptase in an RNA-Seq workflow?

15. Cellular RNA Suppose that an investigative team conducted an RNA-Seq experiment on mouse liver cells. The team found many sequences that contained no open reading frames (Chapter 27) — long stretches of consecutive triplet codons that could be translated into a protein and therefore suggest the presence of a gene. Suggest a reason for this observed lack of ORFs.

16. Use of Outgroups in Comparative Genomics A hypothetical protein found in human, orangutan, and chimpanzee has the following sequences (boldface indicates amino acid residue differences; dashes indicate a deletion, meaning the residues are missing in that sequence):

H:      ATSAAGYDEWEGGKVLIHL--KLQNRGALLELDIGAV
Orangutan:  ATSAAGWDEWEGGKVLIHLDGKLQNRGALLELDIGAV
Chimpanzee: ATSAAGWDEWEGGKILIHLDGKLQNRGALLELDIGAV

What is the most likely sequence of the protein present in the last common ancestor of human and chimpanzee?
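One way to organize the comparison is column by column. Where human and chimpanzee agree, the ancestor most plausibly carried that residue. Where they differ, the residue shared with the outgroup (orangutan) is the more parsimonious choice, because it requires only one change on one lineage. The sketch below applies that rule to the three aligned sequences. The function name `infer_ancestor` is hypothetical, gaps are written as `-`, and simple parsimony is an assumed simplification of real ancestral reconstruction.

```python
# Aligned sequences from the problem; "-" marks a residue missing from that sequence.
human      = "ATSAAGYDEWEGGKVLIHL--KLQNRGALLELDIGAV"
orangutan  = "ATSAAGWDEWEGGKVLIHLDGKLQNRGALLELDIGAV"  # outgroup
chimpanzee = "ATSAAGWDEWEGGKILIHLDGKLQNRGALLELDIGAV"

def infer_ancestor(ingroup_a, ingroup_b, outgroup):
    """Parsimony-style guess at the ancestor of two ingroup sequences."""
    if not (len(ingroup_a) == len(ingroup_b) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    ancestor = []
    for a, b, out in zip(ingroup_a, ingroup_b, outgroup):
        if a == b:
            ancestor.append(a)      # ingroup agrees: keep the shared residue
        elif out in (a, b):
            ancestor.append(out)    # outgroup breaks the tie
        else:
            ancestor.append("?")    # all three differ: cannot decide
    return "".join(ancestor)

ancestor = infer_ancestor(human, chimpanzee, orangutan)

# 1-based alignment positions where human and chimpanzee disagree.
differences = [
    (i + 1, h, c)
    for i, (h, c) in enumerate(zip(human, chimpanzee))
    if h != c
]

print(ancestor)
print(differences)
```

The rule is only as good as the choice of outgroup: it assumes the orangutan lineage diverged before the human and chimpanzee lineages split, and that each difference reflects a single change.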

17. Human Migrations I Native American populations in North America and South America have mitochondrial DNA haplotypes that can be traced to populations in northeast Asia. The Aleut and Eskimo populations in the far northern parts of North America possess a subset of the same haplotypes that link other Native Americans to Asia, and the Aleut and Eskimo populations also have several additional haplotypes that can be traced to Asian origins but are not found in native populations in other parts of the Americas. Provide a possible explanation.

18. Human Migrations II DNA (haplotypes) originating from the Denisovans can be found in the genomes of Indigenous Australians and Melanesian Islanders. However, the same DNA markers are not found in the genomes of people native to Africa. Explain.

19. Finding Disease Genes You are a gene hunter, trying to find the genetic basis for a rare inherited disease. Examination of six pedigrees of families affected by the disease provides inconsistent results. For two of the families, the disease is co-inherited with markers on chromosome 7. For the other four families, the disease is co-inherited with markers on chromosome 12. Explain how this difference might have arisen.

20. RT-PCR Primer Design Investigators can use sequences of transcribed mRNA as a PCR template to produce a corresponding DNA sequence. Reverse transcriptase, an enzyme that works like DNA polymerase, amplifies the mRNA template as DNA in the first PCR cycle. After making the DNA strands from the RNA template, the investigator can carry out the remaining cycles with DNA polymerase, using standard PCR protocols. She can then compare the detected amplified sequences to the genome to analyze transcriptional activity. Thus, reverse transcriptase PCR (RT-PCR) is a powerful experimental technique used to detect RNA from living cells, which transcribe their DNA into RNA, as opposed to dead tissues, which do not. Consider the mRNA transcript shown.

5′–AUAUCGCUCCACGUAACUGAAAGAAAAGUGUGGAGCUAGCAGUCGAGA–3′

Which DNA oligonucleotide pair could serve as suitable primers in an RT-PCR amplification of this transcript? The oligonucleotides are written in the 5′ to 3′ direction.

a. Primer 1: GGAGACCTTGACT; Primer 2: AGTCAAGGTCTCC
b. Primer 1: GACTGCTAGCTCC; Primer 2: GTTACGTGGAGCG
c. Primer 1: GCCGCGCGCGCGC; Primer 2: CCCCGCCGCGCCG
d. Primer 1: CACGATTCAACGTG; Primer 2: TTCGCATTGCCGAA

DATA ANALYSIS PROBLEM

21. HincII: The First Restriction Endonuclease Discovery of the first restriction endonuclease to be of practical use was reported in two papers published in 1970. In the first paper, Smith and Wilcox described the isolation of an enzyme that cleaved double-stranded DNA. They initially demonstrated the enzyme’s nuclease activity by measuring the decrease in viscosity of DNA samples treated with the enzyme.

a. Why does treatment with a nuclease decrease the viscosity of a solution of DNA?

The authors determined whether the enzyme was an endonuclease or an exonuclease by treating 32P-labeled DNA with the enzyme, then adding trichloroacetic acid (TCA). Under the conditions used in their experiment, single nucleotides would be TCA-soluble and oligonucleotides would precipitate.

b. No TCA-soluble 32P-labeled material formed upon treatment of the 32P-labeled DNA with the nuclease. Based on this finding, is the enzyme an endonuclease or is it an exonuclease? Explain your reasoning.

When a polynucleotide is cleaved, the phosphate usually is not removed but remains attached to the 5′ or 3′ end of the resulting DNA fragment. Smith and Wilcox determined the location of the phosphate on the fragment formed by the nuclease in three steps:

    1. Treat unlabeled DNA with the nuclease.

    2. Treat a sample (A) of the product with γ-32P-labeled ATP and polynucleotide kinase (which can attach the γ-phosphate of ATP to a 5′ OH but not to a 5′ phosphate or to a 3′ OH or 3′ phosphate). Measure the amount of 32P incorporated into the DNA.

    3. Treat another sample (B) of the product of step 1 with alkaline phosphatase (which removes phosphate groups from free 5′ and 3′ ends), followed by polynucleotide kinase and γ-32P-labeled ATP. Measure the amount of 32P incorporated into the DNA.

c. Smith and Wilcox found that sample A had 136 counts/min of 32P; sample B had 3,740 counts/min. Did the nuclease cleavage leave the phosphate on the 5′ end or the 3′ end of the DNA fragments? Explain your reasoning.

d. Treatment of bacteriophage T7 DNA with the nuclease gave approximately 40 specific fragments of various lengths. How is this result consistent with the enzyme’s recognizing a specific sequence in the DNA as opposed to making random double-strand breaks?

At this point, there were two possibilities for the site-specific cleavage: the cleavage occurred either (1) at the site of recognition or (2) near the site of recognition but not within the sequence recognized. To address this issue, Kelly and Smith determined the sequence of the 5′ ends of the DNA fragments generated by the nuclease, in five steps:

25. Treat phage T7 DNA with the enzyme.