Saturday, November 12, 2016

HGP Counterfactuals, Part 2: The Forgotten Maps

Yesterday's post explored the concept of alternative histories, or counterfactuals, and laid out why they might be a useful way to think about the value of the Human Genome Project.  In this installment, I'll explore what I will call the forgotten maps: critical, and expensive, elements of the HGP which are all too easily forgotten.  Their prominence has faded as new technologies have come in and subsumed them, or because they were mostly means to the end of a first human sequence, but understanding the project requires understanding these forgotten maps.  I will admittedly cover only a few; I'd invite anyone familiar with the ones I don't illuminate to remedy my failings (a careful reading of the Nature paper on the physical maps wouldn't be bad either).
The usually stated goal for the Human Genome Project was to determine the complete DNA sequence of a human.  Ignoring for a moment the fact that there isn't a single human sequence by any stretch, there is a serious set of issues packed into, and often forgotten within, that statement.  Sequences are very useful, as I elaborated prior to this series.  But that dealt with biology that could be discovered intrinsically in the sequence; there are also whole realms of important biology that simply cannot be computed from a sequence.  Second, and perhaps more importantly, how would you know the sequence was correct?  Especially given that it wasn't realistic to expect even one contiguous sequence per chromosome arm; there would be difficult regions that would be a struggle to sequence.  How do you organize all these islands, and how can you be confident in the end that the order of the islands, and the islands themselves, are mostly correct?

To be usable for many scientists, particularly the human geneticists who were many of the strongest proponents of a human genome sequencing project, the sequence would need to be tied down to existing landmarks.  There were two key types of landmarks which these biologists would require to be surveyed into the sequence: genetic markers and cytogenetic locations. 

Genetic markers are used for performing linkage studies, enabling the identification of genes responsible for Mendelian traits.  At the time the HGP was first being debated, Restriction Fragment Length Polymorphisms (RFLPs) were the dominant technology for human genetic markers.  By the early 1990s, short tandem repeats (STRs) were making huge inroads, as they were easier to type en masse than RFLPs.  Each of these marker types can be converted to a sequence; many of the STR typing techniques used PCR, so the primer sequences were obviously known.  STR typing would end up being a key part of some of the mapping technologies I will describe below, and STRs could be placed directly onto a sequence.
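
To make that marker-to-sequence tie concrete, here is a minimal Python sketch of the "electronic PCR" idea: if you know a marker's two PCR primer sequences, you can place the marker on an assembled sequence by finding both primer sites at a plausible spacing.  All the sequences below are invented toy examples, not real markers.

```python
# A minimal sketch of "electronic PCR": placing an STR/STS marker onto an
# assembled sequence by locating its two primer sites. All sequences here
# are invented toy examples, not real markers.

def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_marker(sequence: str, fwd: str, rev: str, max_product: int = 500):
    """Return (start, end) of a plausible amplicon, or None if absent."""
    i = sequence.find(fwd)
    if i < 0:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we look for its reverse complement downstream of fwd.
    j = sequence.find(revcomp(rev), i + len(fwd))
    if j < 0:
        return None
    end = j + len(rev)
    return (i, end) if end - i <= max_product else None

genome = "TTGACGTACGT" + "CA" * 12 + "GGATCCAAGT" + "A" * 30
print(find_marker(genome, fwd="TTGACGTACGT", rev="ACTTGGATCC"))  # (0, 45)
```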

Cytogenetic maps are another story altogether.  They are the critical component for relating each bit of sequence to a specific spot on a chromosome.  Chromosome images (karyotypes) have limited resolution; each dark or light band can be megabases in size, so there is a resolution mismatch.  But that doesn't mean location isn't important: knowing which genes lie in a deleted, amplified or translocated bit of chromosome could be key to linking biology to regions of the genome sequence.

For reliable mapping of a sequence to a chromosome, one uses In Situ Hybridization (ISH), often of the fluorescent variety (FISH).  Cells are squashed on a microscope slide in just the right way, spreading the chromosomes out.  The chromosomes are stained to reveal the bands, DNA is hybridized to the chromosomes, and the hybridization is then revealed using a stain (ISH) or fluorescence microscopy (FISH).  A skilled cytogeneticist reads the slide, finding good signal amidst badly flattened or stained chromosomes.  So it is a low-throughput process requiring skilled scientists.  Not fast, not cheap.

For actually sequencing the genome, at least it was thought, one needed a tiled set of clones covering each entire chromosome.  This tiling would be tacked down at intervals to the cytogenetic map.  Clones would be pulled out for the tiling by testing them for STRs, which kills two birds with one stone, since tying the tiling to the genetic map comes for "free".

Ideally, the clones would be as large as possible.  If you are having trouble picturing this, imagine building a rail fence in which the rails must overlap at the posts: the longer the rails, the fewer overlaps and the more efficiently you can cover the needed distance.  In the case of clones, fewer clones also means a smaller management headache.  After all, each clone must be stored (in duplicate), prepared for sequencing and sequenced.
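
As a toy illustration of the rail-fence logic, the sketch below greedily picks, at each step, the clone that extends coverage farthest from the current edge.  The clone coordinates are invented and pretend the positions are already known; real maps had to infer positions from overlap evidence, which is the subject of the fingerprinting discussion below.

```python
# A toy sketch of choosing a minimal tiling path: given clones with
# (invented) known (start, end) map coordinates, greedily take the clone
# reaching farthest right from the currently covered edge.

def minimal_tiling_path(clones, region_start, region_end):
    chosen, edge = [], region_start
    remaining = sorted(clones, key=lambda c: c[0])
    while edge < region_end:
        # among clones spanning the covered edge, take the longest reach
        candidates = [c for c in remaining if c[0] <= edge and c[1] > edge]
        if not candidates:
            raise ValueError(f"gap in coverage at {edge}")
        best = max(candidates, key=lambda c: c[1])
        chosen.append(best)
        edge = best[1]
    return chosen

clones = [(0, 150), (100, 260), (120, 300), (250, 420), (380, 500)]
print(minimal_tiling_path(clones, 0, 500))
# -> [(0, 150), (120, 300), (250, 420), (380, 500)]
```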

The initial big push for a clone map followed this logic to an unfortunate conclusion.  Yeast Artificial Chromosomes (YACs) were the available technology that could clone the longest segments, so YACs were chosen first.  YAC maps of the human genome were built and sequencing of YACs began.  That's when a rude surprise was discovered: while YACs got the broad connectivity of the map mostly right, they were often corrupted by the yeast.  Saccharomyces yeast are very adept at recombining repetitive sequences, which the human genome is full of.  So YACs could not be used as sequencing substrates.

The solution to this was to use the YAC map to build a new map with Bacterial Artificial Chromosomes (BACs) and their close cousins, PACs.  These vectors can't carry as much DNA as YACs, but what they do carry is replicated very faithfully.  Many lessons learned from early misfires were applied.  For example, it is convenient to use plasmids with a high copy number, meaning there are many copies of the plasmid per cell rather than just one copy, as with the bacterial chromosome.  That is great for prepping DNA for sequencing or other needs, as there will be a lot of clone and not much host DNA.  But for large inserts it is a mistake: multiple copies can induce the bacteria to recombine similar regions, and high copy number also amplifies any toxicity of the carried DNA.  So BACs and PACs were designed to be single-copy vectors, and were very successful.

So the mapping of the YACs was repeated with the BACs (and PACs; from here on I'll lump them under the term BAC).  However, to find the desired minimal tiling set of clones, finer-scale markers were needed than the known STRs, and something that could be typed more cheaply.  The most widely used solution was restriction fragment fingerprinting: digest each clone with a restriction enzyme and treat the resulting pattern of fragments as a barcode.  Clones sharing significant portions of their barcode overlapped, and from this a map could be built.  Then the minimal tiling set could be chosen.  Choosing a minimal set was considered critical since sequencing was seen as expensive; the less sequencing the better.  Note that choosing a minimal tiling set placed a certain trust in the BACs; a more cautious approach would have required 2X coverage (or higher) of the genome for the sequencing set.
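
As a rough sketch of fingerprint-based overlap calling, the toy Python below counts how many fragment sizes two clones share within a gel-measurement tolerance.  The fragment sizes are invented, and real pipelines (FPC being the best known) used a probabilistic match score (the Sulston score) rather than this simple shared-fraction heuristic.

```python
# A toy sketch of fingerprint overlap detection: two clones whose
# restriction digests share many fragment sizes (within a measurement
# tolerance) probably overlap. Fragment sizes below are invented.

def shared_fragments(a, b, tol=0.01):
    """Count fragments of a matching some unused fragment of b within tol."""
    b_sorted = sorted(b)
    matched, used = 0, set()
    for size in sorted(a):
        for i, other in enumerate(b_sorted):
            if i not in used and abs(size - other) <= tol * max(size, other):
                matched += 1
                used.add(i)
                break
    return matched

clone_a = [512, 1033, 2470, 4100, 6800]   # fragment sizes in bp
clone_b = [509, 1030, 2465, 3300, 6790]
n = shared_fragments(clone_a, clone_b)
print(n / min(len(clone_a), len(clone_b)))  # 0.8 -> likely overlap
```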

Still, it is hard to trust a BAC or YAC map over long distances.  Ideally, there would be regular checks to make sure that the map hadn't somehow gone off the rails.  Try free-handing a map of the world: it's easy to get some parts right, but it's also easy for small errors to trigger later errors that make much of the world unrecognizable.  Cytogenetic ties help, but that map is at a really huge scale.  So what was needed was another physical mapping technology, with a range much longer than individual BACs or YACs but finer than cytogenetic bands.

The dominant technology for that next scale was radiation hybrid (RH) mapping.  It relies on a much earlier observation which had actually been used to build some of the first human cytogenetic maps.  If human cells are forced to fuse with rodent cells (which can be accomplished by a number of chemical tricks), the resulting cells are unhappy because they have too many chromosomes.  The natural solution is to jettison chromosomes, getting down to some approximation of a normal complement.  However, this jettisoning is non-random; the human chromosomes (for reasons I believe are still a mystery) are dumped more frequently.  So if one screens the final cells, it is possible to find cells which are all mouse save for a few human chromosomes (or perhaps just one).

Mapping STRs onto such hybrids would be a way to avoid ISH, but would only have whole-chromosome resolution.  RH technology went a step further: the human cells were subjected to high doses of ionizing radiation (such as gamma rays) prior to fusion.  The radiation fragments the human chromosomes, so that post-fusion loss occurs at the level of chromosome pieces rather than entire chromosomes.  It also works out that the mapping works best if each cell retains bits of multiple chromosomes.  Build a panel of such RH cell lines, extract the DNA and screen it by PCR, and a probabilistic map is created.  Use different levels of irradiation, and the resulting maps will have different resolutions.
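
To give a flavor of the arithmetic, here is a toy two-point RH calculation: estimate the breakage fraction between two markers from how often they are discordantly retained across the panel, then convert that to map distance in centirays.  The retention vectors are fabricated, and the estimator is a simplified moment estimate assuming equal retention of all fragments, not any particular published method.

```python
# A toy two-point radiation hybrid calculation. Each hybrid line is
# scored 1/0 for each marker by PCR; markers co-retained more often than
# chance are close. Retention vectors below are fabricated.

import math

def rh_two_point(ret_a, ret_b):
    """Estimate breakage fraction theta and distance in centirays."""
    n = len(ret_a)
    r = (sum(ret_a) + sum(ret_b)) / (2 * n)        # mean retention rate
    disc = sum(a != b for a, b in zip(ret_a, ret_b)) / n
    # expected discordant fraction is ~2*theta*r*(1-r), so invert it
    theta = min(disc / (2 * r * (1 - r)), 0.99)
    return theta, -100 * math.log(1 - theta)       # centirays

a = [1,0,1,1,0,0,1,0,1,1,0,1,0,0,1,1,0,1,0,1]
b = [1,0,1,0,0,0,1,0,1,1,0,1,0,1,1,1,0,1,0,1]
theta, cr = rh_two_point(a, b)
print(f"theta={theta:.2f}, distance={cr:.1f} cR")
```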

Another class of map generated for the genome was restriction maps.  These provided more levels of confirmation on the map of the genome, though to be honest I'm not sure they were ever used for much more than that.  Even before the genome project was done, restriction mapping was an approach increasingly important only in undergraduate courses.

Long after the public project had started down this multi-layer, careful path, Celera leaped in with a whole-genome shotgun approach.  By cloning their human source DNA (which, as long rumored, turned out to be from Craig Venter) into a few libraries of different insert sizes, simply sequencing these, and applying some very clever informatics strategies (leading lights in the field had declared shotgun assembly of a human genome infeasible), Celera generated a human genome sequence.  Not to take away from that remarkable achievement, but without all those public maps the Celera sequence would have been much less useful and much less believable.

That also applies to nearly all "next generation" sequences of the human genome.  Until the most recent long-read sequences generated with Pacific Biosciences technology and backed up with BioNano Genomics maps, it was simply impossible to generate a de novo genome sequence of any real quality.  Even with the "platinum genomes" now showing up, our confidence in their correctness is backed by the reference genome, which means, ultimately, all those maps.

What happened to those maps?  Once the BAC maps had been completed, the YAC maps were effectively junk; since those clones could not be trusted for sequencing, they really had little further value.  BAC clones, on the other hand, were valuable for manipulation in culture, so with the map in hand you could dial up a BAC covering a region of interest.  RH panels had most of their value for long-range scaffolding of the clone maps.  Still, I found myself trying to track them down back at Infinity, as such mostly-mouse, slightly-human cells had potential value for some lab tricks.

These technologies have become rarer and rarer to see in use, or if used are used differently.  Recombinational capture techniques enable cloning specific regions of human DNA, reducing the value of a large ordered BAC library.  BACs have been used to study segmental duplications, but those studies necessarily require making new libraries from specific individuals.  And as noted above, if one wishes to build a long-range map for an individual or to support a new reference genome, techniques such as BioNano and Dovetail have taken over.  The techniques that supported the HGP required armies of technicians operating in multiple warehouse-sized spaces full of equipment; nowadays the equipment to generate a platinum genome could fit in a low-budget motel room (stripped of the low-budget furniture, of course).

With the benefit of hindsight, could it have been done less expensively?  Yes, but it would have required a gigantic leap into the void.  For example, Celera's shotgun strategy could have been pursued first, with end-sequenced BACs then mapped onto the emerging assembly.  Or BACs could have been sequenced at random initially, with directed hunting of gap-covering BACs as a terminal stage.  In other words, the maps could have trailed the sequence.  But this was not only too big a leap in thinking; it also meant that the clone maps, which had intrinsic value, would have been delivered later.  Indeed, I think more than a few human geneticists saw the BAC map anchored to STR markers as the real prize, with an actual sequence as icing.  So a map-late strategy was both a conceptual leap and a political leap much too far.

In my next installment, I'll look at the strategies that were considered for generating the actual sequence.  Themes seen here will repeat there, with variations.  After all, it was largely the same individuals with the same thinking patterns who were constructing the approaches.
