Debugging Biology

Michael Antonov
18 min read · Jun 14, 2021


To solve aging, we need to understand the molecular, cellular and higher-level processes driving it in the organism, and find ways to interfere with them so as to stop or slow the process, or even reverse aspects of it to a younger state. At a high level, this process is straightforward and not unlike debugging a software system; it can be illustrated by the following diagram.

The steps illustrated are as follows:

  1. To study aging, we first need to observe the differences in phenotype. One can do this by comparing differences between young and old (or similarly the diseased and healthy state) on the cellular, tissue or higher organismal level. There can be differences at each level, so we want to capture them with maximum possible detail.
  2. To better understand the differences, we may want to build a model. There are different types of models; broadly speaking, models are technical representations of a body of knowledge with some, albeit limited, predictive power. As our understanding of biology grows, relying on trained data sets, statistics, simulated pathways and other models will help us better understand cause-and-effect relationships and predict some of the effects changes might have.
  3. Based on the observed phenotype difference, we want to find ways to modify it to a healthier state. Some of these modifications may be “obvious”, such as with mechanical or chemical structure changes, others can be more involved, rely on guesses or be informed by the model analysis.
  4. Once we decide on an intervention, there is a set of tools we can apply — these range from small molecules and environmental changes, to genetic engineering to change gene expression in the organism or cell. After the modification is made we may need to grow the culture or organism, along with controls, so that we can go back to observing the phenotype.
  5. If there is an improvement, which in the case of aging research could be a phenotype closer to the younger state, we have identified a potential candidate intervention. We then need to study it further to determine the underlying mechanisms and wider effects on the organism, and to identify further improvements.
  6. Even if there is no improvement, we can use the data to update our model for a better understanding of the system. We then go back to the model and hypotheses to identify the next thing to try. (For software-minded readers, a toy code version of this loop follows below.)
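Here is a toy, runnable Python sketch of that loop. Every function and number in it is a made-up stand-in for real lab work, models and analysis; it exists only to show the shape of the iteration.

```python
import random

# Toy sketch of the "debug biology" loop above. Every function and number
# is a made-up stand-in for real lab work, models and analysis.

def observe(sample):
    """Step 1: measure a phenotype (here, a noisy made-up 'biological age' score)."""
    return sample["age_score"] + random.gauss(0, 0.3)

def run_experiment(intervention):
    """Step 4: apply the intervention to a treated sample, keep an untreated control."""
    control = {"age_score": 10.0}
    treated = {"age_score": 10.0 - intervention["true_effect"]}
    return treated, control

def debug_biology(candidates):
    model = []                                      # step 2: a very crude stand-in for a model
    for intervention in candidates:                 # step 3: pick the next modification to try
        treated, control = run_experiment(intervention)
        delta = observe(control) - observe(treated)     # step 1: compare phenotypes
        model.append((intervention["name"], delta))     # step 6: update the model either way
        if delta > 1.0:                                 # step 5: phenotype closer to "young"
            print(f"candidate worth studying further: {intervention['name']} ({delta:.2f})")
    return model

debug_biology([{"name": "compound_A", "true_effect": 2.0},
               {"name": "compound_B", "true_effect": 0.1}])
```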

In many ways, this is what scientists and biology researchers have been doing for years. For someone coming from a software field, seeing this flow feels really empowering. It suggests that following and iterating on this process should enable us to understand the biology and ultimately modify it to the desired state, including solving most diseases and aging itself.

In software engineering we are used to being the masters of our destiny, creating anything we want with fast iteration. While we may hit limitations on algorithmic complexity or system constraints, or even get lost in large system complexity — there are no true unknowns. At the end of the day, we can observe all of the computer memory state, know how instructions are executed and what the building blocks are.

Everything is more complex with biology. One fundamental challenge is that there are still true unknowns — things we haven’t yet discovered that can drastically change outcomes and our thinking. Furthermore, our rate of iteration is much slower: weeks, months or years instead of minutes or days in software. A 100x difference matters a lot.

Yet, given our improved understanding of biomolecules, their functions and pathways, exponential progress in sequencing, and the power of AI to handle the data, there are reasons to be optimistic about dramatic improvements in our understanding and the rate of scientific progress. Here, I wanted to highlight some of the tools that give us insight into biology, and potential improvements that can speed up anti-aging discovery.

Challenges with Iteration

The fundamental challenge with biomedical research is that our tooling is limited, resulting in an iteration process that is manual and time-consuming, facing complications on several levels:

  • We don’t have the full set of tools to comprehensively capture biology on each level, so we only get partial and limited information.
  • While we are accumulating increasing knowledge on genes, proteins and pathways, we don’t yet have comprehensive, predictive and accessible models of cells, organs, and all interactions.
  • We don’t have easy, fast and comprehensive automation for all aspects of modification, growth and observation of biology at all levels.
  • Finally, biology and organism growth can be slow — sometimes we simply need to wait for processes to take place or animals to grow up, which may take months or even years.

I hope to unpack these in a series of posts, starting with molecular tools here, so we can better understand the current state and perhaps consider what can be done with each area.

Bird’s-Eye Overview of Biological Molecules

To understand cells and biology at the deepest level, we need to study molecules within the cell and how they interact, as well as higher level organelles and structure. In this writeup, I will focus on the biological molecules, hoping to get to cellular level next time. Biological molecules include:

  • Water, small molecules and inorganic ions,
  • Carbohydrates used for nutrients, energy storage and, in some cases (such as cellulose in plants), cell structure,
  • Lipids used for energy storage, signaling and cellular membranes,
  • Nucleic acids represent information, with DNA encoding genetic information and RNA playing multiple roles. Messenger RNA carries information from DNA to ribosomes, molecular machines partially constructed from RNA themselves, which synthesize proteins by translating nucleotides into amino acids.
  • Proteins are major functional components of the cell, acting as enzymes — molecular machines that catalyze reactions. They also play structural, transport, signaling and storage roles.

A diagram of a mammalian cell, described in the great book Cell Biology by the Numbers, quantifies the order of magnitude of some of these molecules within a cell. In terms of mass, about 70% of the cell is water. Proteins, RNA and lipids make up roughly 55%, 20% and 9% of the remaining dry weight. These, of course, are arranged into many functional structures, including membranes with various channels and receptors, the nucleus which holds DNA in eukaryotic cells, mitochondria for energy production, and so on.

Describing molecular roles further is outside the scope of this writeup; The Molecular Composition of Cells chapter in The Cell: A Molecular Approach provides some great detail if you are interested.

An important point to highlight is that key biological molecules, including DNA, RNA and proteins, are polymers — chemically connected sequences built out of different nucleotides in the case of nucleic acids, or 20 possible amino acids in the case of proteins. This means that to understand a particular molecule, we need to “read” its unique sequence, as well as understand its shape and folding, since in the case of proteins, shape determines enzymatic function.
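As a concrete (if simplified) illustration of sequences as information, here is a small Python sketch that treats DNA as a string, transcribes it to mRNA and translates codons into amino acids. The sequence and the deliberately truncated codon table are made up for illustration only.

```python
# Toy illustration: DNA and proteins as sequences (polymers).
# The DNA string and the truncated codon table are for illustration only.

CODON_TABLE = {                     # subset of the standard genetic code
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly",
    "AAA": "Lys", "UGG": "Trp", "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.replace("T", "U")

def translate(mrna: str) -> list[str]:
    """Read codons (3 bases each) until a stop codon or an unknown codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "???")
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

dna = "ATGTTTGGCAAATGGTAA"           # made-up coding sequence
print(translate(transcribe(dna)))    # ['Met', 'Phe', 'Gly', 'Lys', 'Trp']
```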

Molecular Tools to Capture Biology

There are different ways of studying molecules, including filtering and/or breaking them apart and examining the resulting spectra, detecting their presence with the help of antibodies, or sequencing with more involved techniques. Of course, we also need to understand molecular interactions and the larger complexes they form. The rest of this writeup focuses on the tools we have to study biomolecules, their composition, shape and concentration, as well as some of the trade-offs and the direction technology is heading.

So here they are — the most common approaches used to study biological molecules:

  1. IR, UV-Vis and NMR Spectroscopy — used to study molecular bond structure and, in the case of UV-Vis, for quantitative analysis during chemical reactions.
  2. Mass Spectrometry — Measures mass/charge ratio of ionized molecules, useful for understanding molecular composition of proteins.
  3. X-Ray Crystallography and Cryo-EM — for structural biology, identifying the structure of molecules and proteins, including details such as conformation and the location of atoms.
  4. Antibody-based Methods — Western Blots, ELISAs, and multiplex assays to analyze the presence and quantity of proteins.
  5. qPCR — real-time polymerase chain reaction, which monitors the amplification of DNA with the help of fluorescence. Useful for diagnostics and for quantification of gene expression.
  6. Sequencing — reads the sequence of bases in DNA and RNA.

Let me go over these.

Spectrometry

The UV-Vis, IR and NMR spectrometers measure the spectrum of light absorption, or nuclear magnetic resonance, and are mostly useful for quantifying reactions (UV-Vis) and understanding the structure of chemical compounds. You will typically want to study a pure compound in a standardized medium. By analyzing IR and NMR charts, you can determine the presence and location of individual chemical bonds in a molecule. This is illustrated in the sample ¹H NMR spectrum below (from Wikipedia), where one can rely on the x-axis chemical shift to identify functional groups (CH3, etc.), with the separation of peaks further describing the arrangement of the molecule.

Mass spectrometry (MS) is used to analyze various molecules, in particular biological substances such as proteins. Inside a mass spectrometer, sample molecules are vaporized in a vacuum and ionized. They take different flight paths based on their mass-to-charge (m/z) ratio, producing a detection spectrum used to analyze them. One of the key benefits of MS is that it is truly unbiased, enabling the study of new substances and their molecular components, including proteins and lipids, as well as their bound cofactors or modifications.

Tandem Mass spectrum of a peptide [wiki]; Mass Spectrometer.

For protein MS analysis, samples are first trypsin-digested into fragments and separated by chromatography before flowing into the MS and being ionized. The resulting ions’ mass-to-charge ratios are measured with time-of-flight (TOF) or other approaches to get a spectrum, which can be processed by peptide mass fingerprinting software to identify the protein.

Alternatively, the ions are further fragmented by collision and analyzed by a second MS, as with MS/MS. Peptides typically break at peptide bonds, so the resulting spectrum, including distance between the peaks, allows you to read the amino acid sequence.
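As a rough, simplified illustration of how peak spacing encodes sequence, the sketch below computes a singly charged y-ion ladder for a made-up peptide from standard monoisotopic residue masses; the differences between adjacent peaks equal residue masses, which is (in very simplified form) how software reads the sequence back out of an MS/MS spectrum.

```python
# Simplified sketch: y-ion m/z values (singly charged) for a made-up peptide.
# Adjacent y-ion peaks differ by exactly one residue mass, which is how the
# amino acid sequence can be read off an MS/MS spectrum.

RESIDUE_MASS = {   # monoisotopic residue masses (Da), a small subset
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
    "L": 113.08406, "K": 128.09496, "F": 147.06841,
}
WATER, PROTON = 18.01056, 1.00728

def y_ion_ladder(peptide: str) -> list[float]:
    """m/z of y1..yN ions, assuming charge +1 and no modifications."""
    ladder, running = [], WATER + PROTON
    for residue in reversed(peptide):        # y ions grow from the C-terminus
        running += RESIDUE_MASS[residue]
        ladder.append(round(running, 3))
    return ladder

peaks = y_ion_ladder("GLASVK")               # made-up peptide ending in K (tryptic-like)
print(peaks)
print([round(b - a, 3) for a, b in zip(peaks, peaks[1:])])  # spacings = residue masses
```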

I recommend this great writeup by the Broad Institute on Mass Spectrometry if you are interested in learning more. Also, here is a bit of terminology you might encounter as you look at MS types:

  • MALDI and ESI are MS ionization techniques that enable ionizing larger protein molecules without destroying them,
  • Time of flight (TOF) and Quadrupole are types of mass analyzers for measuring the m/z of an ion so it can be plotted on the spectrum.
  • MS/MS refers to two-stage tandem MS helpful for analyzing proteins, where ions after the first stage are broken down through collision and the resulting fragments are analyzed further. TOF/TOF and Triple Quadrupole are examples of MS/MS.

Overall, MS strengths lie in its ability to study new substances, perform absolute mass measurements and probe molecular structure/modifications. It is not as well suited for absolute quantification of the amount of protein in a sample, and may not have the sensitivity to detect low-abundance proteins within a more complex sample. While newer approaches such as SILAC labeling address some of these quantification challenges, MS instrumentation also comes at a high price, starting at over $100k and going much higher for high-end models.

X-Ray Crystallography and Cryo-EM

X-Ray Crystallography and Cryo-EM are methods for identifying the structure of molecules and proteins, including their conformation and individual atom coordinates. This is important since a protein’s folded shape determines its function.

  • X-Ray crystallography has been the primary method for solving protein structures, with over 150,000 structures in the Protein Data Bank, compared to around 13,000 for NMR and 6,000 for Cryo-EM. It requires the formation of crystals with a regular internal protein lattice. The atomic/molecular structure of the crystal is then determined by analyzing the angles of diffracted X-ray beams, which are processed to compute chemical bonds and the mean positions of atoms.
  • Cryo-EM reconstruction applies electron microscopy to samples cooled to cryogenic temperatures in vitreous water. The technology broke the 3-angstrom resolution barrier in 2015 with advances in direct-electron detectors, and the number of Cryo-EM submissions to the PDB has been accelerating in recent years.
Cryo-EM structure of rhinovirus C 15a, displaying icosahedral symmetry, with spikes and valleys in the surface.

One of the benefits of Cryo-EM is that it avoids crystallization, which can be very time-consuming and difficult to prepare for. It is better suited for studying membrane proteins, which can be hard to isolate for crystallization, and for larger molecules and protein assemblies, typically bigger than 60 kDa, since low-molecular-mass molecules have fewer scattering atoms and contribute less to the signal. X-Ray crystallography tends to be more useful for smaller proteins, also yielding higher resolution and precision in atomic coordinates, aided by the regular crystal structure. Here is an interesting article on X-rays in the Cryo-EM Era.

Antibody-based Methods

Antibody-based detection methods are used extensively in science to detect and quantify proteins and other molecular structures, as well as to stain tissue to reveal analyte location. Protein quantification is important for both research and diagnostics: on the research side, protein measurement aids mechanistic understanding of cell function; on the diagnostic side, it enables disease detection — as an example, levels of C-reactive protein (CRP) rise in response to tissue damage and are therefore used for detection of heart disease.

Protein quantification typically relies on protein-specific antibodies conjugated with a unique DNA strand, a bead or an enzyme that aids detection. In the case of an enzyme, it can catalyze a reaction that triggers a color change, which is observed to detect the protein’s presence.

There are hundreds of companies providing tools and reagents for antibody-based assays, so here I’ll highlight two classical methods still used in research — Western Blot and ELISA, and mention a few more advanced modern techniques:

  • Western Blot is one of the cheapest and most common methods of protein detection; it combines gel separation of proteins based on molecular weight/charge with antibody staining to make them visible as a band. It is semi-quantitative, as you can interpret the size of the band but it is hard to do so precisely; when necessary, you can literally cut a band out of the gel for further analysis, such as with mass spec. If you’ve seen vertical gel images in scientific papers, this is probably it (or the similar northern/southern blot, used to detect RNA/DNA).
  • ELISA, or “Enzyme-Linked Immunosorbent Assay”, is a plate-based method used to detect and quantify the abundance of proteins/substances with the help of antibodies. ELISAs are designed in such a way that when the detection antibody binds to a protein of interest, it stays within the plate well; otherwise it is washed away. Enzymes bound to the remaining antibodies catalyze a color change in a reporter reagent, so the protein is detected.
  • ELISAs are often done using 96-well plates that you can get in pre-made kits for a given protein, such as the Invitrogen IL-8 ELISA Kit. Several samples can be processed on the same plate, with multiple dilution levels each. Due to their good sensitivity and broad dynamic range, ELISAs are seen as the gold standard for protein quantification; a minimal sketch of how a standard curve turns plate readouts into concentrations follows below.
ELISA plate [wiki]; Plate Reader.
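Here is a minimal sketch of how an ELISA plate readout typically becomes a concentration: fit a four-parameter logistic (4PL) standard curve to known calibrators, then invert it for unknown wells. The calibrator concentrations and absorbances below are made up, and real kits come with their own analysis recommendations.

```python
# Sketch: fit a 4-parameter logistic (4PL) standard curve to made-up ELISA
# calibrators, then invert it to estimate the concentration of an unknown well.

import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, a, b, c, d):
    """4PL curve: absorbance as a function of concentration."""
    return d + (a - d) / (1.0 + (x / c) ** b)

conc = np.array([7.8, 15.6, 31.2, 62.5, 125, 250, 500, 1000])      # pg/mL standards (made up)
od   = np.array([0.08, 0.15, 0.27, 0.48, 0.83, 1.31, 1.88, 2.35])  # absorbance readings (made up)

params, _ = curve_fit(four_pl, conc, od, p0=[0.05, 1.0, 200.0, 2.6], maxfev=10000)

def od_to_conc(y, a, b, c, d):
    """Invert the fitted 4PL curve to estimate concentration from absorbance."""
    return c * ((a - d) / (y - d) - 1.0) ** (1.0 / b)

print(f"Sample at OD 0.60 is roughly {od_to_conc(0.60, *params):.0f} pg/mL")
```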

There’s been great progress in Multiplexing methods over the last two decades, allowing one to quantify many, often hundreds of proteins in the same sample at once — useful for analyzing protein expression and smarter diagnostics. Two common approaches are bead-based “suspension arrays” or planar arrays.

  • Bead-based approaches, as with Luminex, use microscopic antibody-coated beads (about 2.5M beads/mL) mixed with a sample. Each bead type carries a unique dye signature, so beads can be separated by their wavelength of light with flow cytometry to quantify each type of protein.
  • Planar arrays (ex: RayBiotech and FullMoonBio) place antibodies in a predetermined pattern on a surface, used to capture molecules and display them as an expression map.
  • Alternative technologies, such as Olink® Explore 1536/384, make use of antibody pairing and DNA hybridization to convert protein binding “signal” into DNA that can be sequenced. You get good sensitivity and much higher multiplexing to handle many hundreds of samples per run, yet limited quantification capabilities due to reliance on sequencer reads.
  • The highest-sensitivity approaches today, such as Quanterix Simoa, surpass ELISA and are able to detect and quantify proteins at the femtogram level, which essentially requires single-molecule capture. For now, these panels are still limited to roughly 10-plex per sample and involve expensive equipment.

One of the biggest challenges with protein quantification is the high dynamic range of protein concentrations, which can vary by a factor of 10^12, making it hard to detect rare low-concentration molecules crowded out by higher-concentration ones. The other is the cost and availability of specific antibodies to protein targets: while we have thousands of antibodies, they do not cover the entire human proteome. There are time and ease-of-use considerations as well, as reagents or steps sometimes require hours of preparation and processing.

Nevertheless, this area of proteomics is advancing very fast. A lot of money is going into developing higher-plexity and higher-sensitivity methods, which, coupled with AI, will help us diagnose a wide range of diseases early. Because of this, I would expect more high-sensitivity, low-cost multiplex methods to become available over the next few years.

qPCR

qPCR — Quantitative Polymerase Chain Reaction is commonly used to quantify gene expression, i.e. the level of a given RNA in the sample. This might be useful in an experiment to compare gene expression after a certain gene knockout or pathway modification, or in diagnostics to detect the presence of a certain virus.

PCR utilizes the power of the polymerase enzyme to amplify a DNA sequence of interest. The method makes use of a thermocycler that repeatedly denatures DNA, separating its strands and allowing the enzyme to copy a specific region defined by the primer sequences. As a result, that specific sequence of interest is amplified over a billion times and can thus be detected. When quantifying RNA (RT-qPCR), the RNA in the sample is first converted into DNA using a reverse transcriptase enzyme, and then regular PCR is run on the resulting so-called cDNA. Its abundance is detected using fluorescence, which corresponds to the level of expression and can be visualized with amplification curves.

Gene expression results of qPCR might be reported in a scientific paper with an illustration similar to the one above, where RNA_1 expression was knocked down, potentially with the help of a viral vector or short hairpin RNA, while a different RNA_2 is not affected. A common way to compute such relative expression from the raw qPCR output is sketched below.
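One widely used analysis is relative quantification with the 2^-ΔΔCt (Livak) method, normalizing the target gene to a stable reference gene and comparing treated vs. control samples. The Ct values below are made up, and the sketch assumes roughly 100% amplification efficiency.

```python
# Sketch of the widely used 2^-ddCt (Livak) method for relative gene
# expression from qPCR cycle-threshold (Ct) values. Ct numbers are made up.

def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression of the target gene in treated vs. control samples,
    normalized to a reference (housekeeping) gene."""
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2 ** (-delta_delta_ct)    # assumes ~100% amplification efficiency

# Example: the target crosses threshold ~3 cycles later after a knockdown,
# while the reference gene is unchanged -> roughly 8-fold lower expression.
print(f"fold change = {fold_change(28.1, 17.0, 25.1, 17.0):.3f}")
```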

Sequencing

DNA and RNA sequencing tools read the chain of nucleic acids, decoding them into sequences of the four bases, Cytosine, Guanine, Adenine and Thymine (or Uracil in RNA), each representable in just 2 bits. These sequences are important because they contain the genes that guide the creation of proteins, their regulation and other machinery of life. A switch in one base could replace an amino acid in a protein, breaking its functionality and introducing disease; alternatively, it could also provide an evolutionary advantage. The whole human genome contains around 3.1 billion bases (or 6.2 billion if counting both copies of each chromosome), and a major scientific achievement of the last twenty years was first determining the sequence of these bases, and then dropping the cost of doing so to just a few hundred dollars today.
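To make the 2-bits-per-base framing concrete: four bases fit into two bits, so 3.1 billion bases pack into roughly 775 MB before compression. A toy packing sketch, with a made-up sequence:

```python
# Toy 2-bit packing of DNA bases, to make the "2 bits per base" point concrete.

BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    """Pack 4 bases per byte (ignores real-world cases like ambiguous 'N' bases)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)

print(len(pack("ACGT" * 8)), "bytes for 32 bases")               # 8 bytes
print(f"~{3_100_000_000 * 2 / 8 / 1e6:.0f} MB for 3.1B bases")   # ~775 MB
```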

DNA sequencing has huge value in research as it allows us to study genes, including their differences between biological species, and, with the help of bioinformatics software, attempt to map them to function and phenotypes. RNA sequencing allows us to understand transcription, or which genes are expressed in different cell types. Beyond this, sequencing is also used indirectly to study methylation and chromatin binding, or even to perform protein assays with the help of custom DNA adapters attached to antibodies. Overall, these tools help us better understand cellular state and function.

There are many different technical approaches to sequencing, such as sequencing by synthesis, used by Illumina, where a fluorescent signal is emitted and read every time a nucleotide is added; or detecting ionic current changes as a DNA strand passes through a protein channel in a device membrane, as in Oxford Nanopore. Scanning a DNA strand this way generates a “read”, which depending on the technology provider is either short (~150–200 bases, Illumina) or long (1,500+ bases). Scientists pick the approach based on their goals, as well as cost and error rate:

  • Short reads (Illumina, Ion Torrent, BGI Genomics) are often cheaper and have low error rates, often under 0.5% per base. They are great for detecting SNPs/variations which could cause a disease, but make it harder to build the whole genome without reference.
  • Long-read approaches (PacBio, Oxford Nanopore) have higher error rates but are generally better for de novo assembly of a genome. This can make them better for metagenomic analysis, which involves sequencing many organisms at once to understand a given microbiome or to study the population of organisms in a lake.

Ultimately, tens of thousands to millions of such read segments can be generated per run, producing gigabytes of data; alignment software tools such as GSNAP or Bowtie 2 are then used to map them to a reference genome. Once reads are mapped, you can check whether there is a difference/mutation in the nucleotides of a given gene and perform other analyses.
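Real aligners like Bowtie 2 and GSNAP use indexed, error-tolerant algorithms to place billions of reads, but a toy exact-seed version against a made-up reference shows the basic idea of mapping reads and spotting a mismatch (a candidate variant):

```python
# Toy read "alignment" by exact seed matching against a tiny made-up reference,
# followed by a naive per-position mismatch check. Real aligners (Bowtie 2,
# GSNAP) use indexed, error-tolerant algorithms on billions of reads.

reference = "ACGTACGGATTACAGGCCTTAACGTGGC"           # made-up reference sequence
reads = ["GGATTACAGG", "CCTTAACGTG", "GGATTGCAGG"]   # the last read carries a mismatch

def align(read: str, ref: str, seed_len: int = 5):
    """Place the read by exact-matching its first `seed_len` bases, then
    report any mismatches in the rest (no indels, single hit assumed)."""
    pos = ref.find(read[:seed_len])
    if pos < 0:
        return None
    mismatches = [(pos + i, ref[pos + i], base)
                  for i, base in enumerate(read)
                  if ref[pos + i] != base]
    return pos, mismatches

for read in reads:
    print(read, "->", align(read, reference))
```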

Recent technical progress in sequencing is driven not only by the research needs, but also by its applicability in diagnostics, such as for liquid biopsy tests. Beyond genetic screening, sequencing coupled with AI has potential to detect cancer and other diseases, greatly increasing its market size. This means we can expect lower costs and higher accuracy going forward.

Challenges and Trends in Molecular methods

As you can see, today we have tools that help us understand the structure of molecules, as well as quantify them in a sample — often with high precision. We can also understand the shapes of proteins and protein complexes, including the possible molecular bindings and interactions. Yet there are many challenges left to overcome:

  • First, solutions at the edge of molecular research are technically complex. Whether we are talking about Cryo-EM, high-end mass spectrometers, or the details involved in next-gen sequencing, the engineering know-how and costs are significant. It will take a lot more work to make them cheaper and more widely available.
  • We don’t have the whole picture of what is going on within a cell, tissue or organism. Given today’s tools, we strive to piece together an understanding from incomplete data at different levels of detail, plus potentially unknown gaps we can’t yet measure. Aspects of aging can be even harder to tease out than basic biochemical functions due to the longer timescales involved.
  • Finally, with sample data we only get snapshots in time. Information about how protein concentrations change over time, and how proteins interact with each other at different timescales, needs to be teased out with experiments and data analysis.

The good thing is that there is an increase in the rate of progress, and we are getting a lot more data. While data itself doesn’t automatically give insights, with proper analysis it helps build understanding and ultimately identify biological relationships.

I wanted to show two charts to demonstrate recent progress. On the left is an image from X-rays in the Cryo-EM Era that illustrates the number of structures submitted to the PDB per year. While solving these protein structures is still hard, the number of structures solved per year is increasing.

On the right is the rate of growth of major databases, as illustrated in a chart by Kanehisa Laboratories. The Y-axis here is logarithmic, so the rate of growth is more impressive than it might seem. As an example, the number of “KEGG genes” represented by the purple line (these are genes annotated in the KEGG orthology) went up around 10x in the last decade.

At this point we have accumulated a meaningful amount of knowledge on the proteome of humans and key model animals, yet a lot of information still needs to be gained, in particular on the function of proteins and their numerous modifications.

Of the technologies described earlier, areas of sequencing and proteomics are arguably making the fastest progress. This is driven by the value of diagnostics, with proteomics increasingly seen as a tool for detecting diseases early and tracking the state of individual health. Given potential to benefit each individual and the resulting revenue that scales with the population, it is a huge market opportunity. This also means that sequencing and protein panels will be accessible at increasingly low cost to both researchers and the general public.

Improving tools allow us to explore molecular concentrations, pathways and phenotypes at a much wider scale and lower cost than before, making the “Debug Biology Cycle” more accessible and enabling unbiased research. This means we are more empowered than ever to see and understand age-related differences in biology, and to explore potential approaches to age reversal. We can also develop better biomarkers of aging — measurements that can be used, at least in theory, to evaluate the effectiveness of age-related therapies.

Ultimately, many of these tools can be brought together, along with higher level instrumental and software techniques to better understand the cellular state, as well as overall interactions of cells in the tissues. This is an area I’d like to explore further next time.


Michael Antonov

Investor @ formic.vc, software architect, serial entrepreneur and co-founder of Oculus. Interested in biotechnology and hardest challenges facing humanity.