
The attention that studies of the microbiota receive from the scientific community, and their popularity among the general public, are the result of a long history of research in different areas that remains very active today. Microbial populations are incredibly complex, and studying them requires specific software and robust hardware able to handle enormous amounts of data, as well as a per-analysis cost modest enough for this kind of research to be financially feasible. Over the last few decades, DNA sequencing techniques have therefore evolved and improved, allowing for significant cost reductions. In turn, computational capacity has increased enough to make it possible to use ad hoc software to study the complex bacterial world.
The starting point for microbial analysis (of the intestinal microbiota as well as microbiota from other body sites) is DNA sequencing, which determines the order of the nitrogenous bases adenine (A), guanine (G), thymine (T), and cytosine (C) in the 16S rRNA gene. The raw sequencing data are then analyzed and processed with the appropriate software to determine the community's composition and taxonomy. Finally, a statistical analysis specific to this type of data is conducted to establish the scientific results.
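In outline, the workflow can be reduced to three stages: sequencing, bioinformatic processing, and statistics. The following minimal Python sketch shows the shape of that flow; the reads are invented toy data, and each real step is of course far more involved:

```python
from collections import Counter

# (1) Sequencing output: reads are strings over the bases A, C, G, T
# (invented toy reads; real runs produce millions of them)
reads = ["GTGCCAGCA", "GTGCCAGCA", "TTACCGCGG", "GTGTCAGCC", "TTACCGCGG"]

# (2) Bioinformatic processing: here, simply group identical reads;
# real pipelines filter, denoise, and assign taxonomy against a database
counts = Counter(reads)

# (3) Statistics: e.g., the relative abundance of each unique sequence
total = sum(counts.values())
for seq, n in counts.most_common():
    print(f"{seq}: {n}/{total} reads ({n / total:.0%})")
```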
The molecular identification of DNA was the subject of intense scientific research in the second half of the 20th century until the publication of the Sanger sequencing method, developed by the English biochemist Frederick Sanger (winner, with W. Gilbert, of the Nobel Prize in Chemistry in 1980 for this work) and A.R. Coulson. This technique was more efficient, faster, and more accurate than previous methods, but its fame is mainly due to the commercialization in 1986 of the first automatic sequencer to use the Sanger method, the Applied Biosystems 370, and to the use of these sequencers, namely the ABI PRISM 3700, in the Human Genome Project (1990-2003). Despite its success and scientific potential, this kind of analysis was still beyond the reach of most laboratories due to the timelines and high operating costs involved: it took more than a decade to sequence the human genome, at a cost of 2.047 billion euros. The need to lower costs and shorten timelines drove the rise of massive sequencing techniques (next-generation sequencing, NGS), and between 1990 and 2005 the cost of sequencing a nucleotide dropped from $10 to $0.01, definitively opening the doors to DNA analysis. The most frequently used sequencing technology today is the one developed by Illumina, since it offers platforms (MiniSeq and HiSeq X Ten, among others) able to produce robust results at reasonable prices. These devices sequence short fragments in parallel, producing millions of reads at once, which lowers costs and reduces the time needed to obtain results.
The best target for phylogenetic and bacterial taxonomy studies is the 16S gene. It codes for a component of the ribosomal small subunit that is highly conserved across bacteria and contains hypervariable regions interspersed among the conserved regions of its sequence. These hypervariable regions differ between bacterial species, allowing for their classification, or taxonomy. The conserved regions, on the other hand, allow for the design of universal primers that anneal to sequences shared by the majority of bacteria. Sequencing a short read of approximately 200-400 base pairs can thus cover one or several adjacent hypervariable regions, using specific primers that align with the conserved regions flanking them.
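To make the idea of a universal primer concrete: such primers contain degenerate IUPAC bases (e.g., Y for C or T, M for A or C) so that a single primer can anneal to slightly different conserved sequences across taxa. The sketch below uses the sequence of the widely used 515F primer, which targets the conserved region flanking the V4 hypervariable region; the target fragment is invented for illustration, and expanding the degenerate code into a regular expression is just one simple way to model this kind of matching:

```python
import re

# IUPAC degenerate nucleotide codes -> the bases each one can match
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def primer_to_regex(primer: str) -> re.Pattern:
    """Translate a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer))

# Standard 515F primer sequence (targets a conserved region near V4)
primer_515f = "GTGYCAGCMGCCGCGGTAA"

# Invented fragment standing in for part of a 16S rRNA gene
template = "TTACCGCGGCTGCTGGCACGTGCCAGCAGCCGCGGTAATACGGAGGGTGC"

match = primer_to_regex(primer_515f).search(template)
if match:
    print(f"Priming site found at position {match.start()}: {match.group()}")
else:
    print("No priming site found")
```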
One of the main limitations of massive sequencing of the 16S gene is the complexity and volume of the data generated. Before massive sequencing, most studies of bacterial communities were slow and expensive, but they produced long reads (above 500 base pairs), which facilitated identification. Massive sequencing, by contrast, is fast and inexpensive, but it generates millions of short reads that require a careful bioinformatics analysis plan.
Since the start of the Human Microbiome Project (HMP) in 2008, scientists have been dealing with the issue of how to handle huge amounts of information: big data. Different computational tools have gradually appeared to carry out the cleaning, filtering, and taxonomic assignment steps. They are generally not user-friendly programs, and they require some computer skills. The most frequently used include Mothur and QIIME, which were constantly updated over the last decade to include statistical methods developed specifically for the study of microbial populations. In January 2018, QIIME support and development were discontinued and a new program, QIIME 2, replaced it. QIIME 2 is constantly updated to fix bugs and incorporate the latest methods for the study of microbiota, such as DADA2, which processes and corrects the massive sequencing data produced by Illumina. More expert users can turn to the statistical environment R, which is a more versatile option but also the most complex of these programs. R is a programming language focused on statistics and, unlike QIIME 2 or Mothur, it is not specific to microbiota, but it can be used for such studies by installing the necessary packages: DADA2, microbiome, ALDEx2, etc.
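To give a feel for what the cleaning step involves, the sketch below implements a deliberately naive quality filter in plain Python: it trims each read at the first base whose Phred quality score falls below a threshold and discards reads that end up too short. The FASTQ record, threshold, and minimum length are all illustrative choices; real tools such as DADA2 rely on full error models rather than a simple cutoff:

```python
# Minimal sketch of a read quality filter; real amplicon pipelines
# use far more sophisticated denoising than a fixed threshold.

MIN_QUALITY = 20   # Phred score threshold (illustrative choice)
MIN_LENGTH = 100   # discard reads shorter than this after trimming

def parse_fastq(lines):
    """Yield (header, sequence, quality scores) from FASTQ text lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                       # '+' separator line
        qual = next(it).strip()
        # Phred+33 encoding: score = ASCII code - 33
        scores = [ord(c) - 33 for c in qual]
        yield header.strip(), seq, scores

def quality_trim(seq, scores):
    """Truncate the read at the first low-quality base."""
    for i, s in enumerate(scores):
        if s < MIN_QUALITY:
            return seq[:i]
    return seq

# Toy FASTQ record (invented data)
fastq_text = [
    "@read_1",
    "GTGCCAGCAGCCGCGGTAAT",
    "+",
    "IIIIIIIIIIIIIIII##II",
]

for header, seq, scores in parse_fastq(fastq_text):
    trimmed = quality_trim(seq, scores)
    status = "kept" if len(trimmed) >= MIN_LENGTH else "discarded"
    print(f"{header}: {len(seq)} bp -> {len(trimmed)} bp ({status})")
```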
Regardless of the sequencing method, the final results are represented as operational taxonomic units (OTUs), groups of similar sequences that generally identify organisms down to the genus or species level. To assign these OTUs taxonomically, different techniques can be used to cross-reference them against a database (e.g., Greengenes or Silva). The OTU frequency table is also used to characterize populations, a process that consists of calculating alpha and beta diversity indices. The diversity indices and the frequency table are used in parallel to compare populations and draw conclusions about the microbiota.
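As a concrete illustration, the sketch below computes two of the most common indices from a toy OTU frequency table: the Shannon index (an alpha diversity measure of within-sample richness and evenness) and the Bray-Curtis dissimilarity (a beta diversity measure of the difference between two samples). The OTU names and counts are invented; real analyses operate on tables with thousands of OTUs produced by tools such as QIIME 2 or Mothur:

```python
import math

# Toy OTU frequency table: counts per OTU for two samples (invented data)
otu_table = {
    "sample_A": {"OTU_1": 120, "OTU_2": 45, "OTU_3": 10},
    "sample_B": {"OTU_1": 30, "OTU_2": 90, "OTU_4": 60},
}

def shannon(counts):
    """Alpha diversity: Shannon index H = -sum(p_i * ln p_i)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total)
                for n in counts.values() if n > 0)

def bray_curtis(a, b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples."""
    otus = set(a) | set(b)
    num = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    den = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return num / den

for name, counts in otu_table.items():
    print(f"{name}: Shannon H = {shannon(counts):.3f}")

bc = bray_curtis(otu_table["sample_A"], otu_table["sample_B"])
print(f"Bray-Curtis(A, B) = {bc:.3f}")
```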
A significant limitation of massive (second-generation) sequencing is the length of the reads. In short, to maintain read quality, long DNA molecules must be divided into small segments; otherwise, due to random errors, DNA synthesis across the amplified DNA strands would gradually fall out of synchronization. Computational efforts to reassemble the segments are frequently based on approximate statistics that cannot guarantee exact structures. To overcome these limitations, third-generation sequencing (also known as long-read sequencing) is being actively developed. It aims to read the nucleotide sequence at the level of a single molecule, without breaking long strands of DNA into small segments and then inferring the sequence through amplification and synthesis. However, critical engineering challenges remain before these instruments can rival the accuracy and cost of short-read platforms in routine use.
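A toy calculation illustrates why the synchronization problem limits read length. Suppose (an invented figure, purely for illustration) that each DNA copy in an amplified cluster independently falls out of phase with a small probability on every synthesis cycle; the fraction of copies still in phase, and hence the clarity of the signal, then decays geometrically with the number of cycles:

```python
# Toy model of cluster dephasing in sequencing by synthesis.
# Assumption (illustrative only): each molecule independently falls
# out of phase with probability p per cycle, so the fraction still
# in phase after n cycles is (1 - p) ** n.

P_PHASE_ERROR = 0.005  # per-cycle phasing error rate (invented value)

for cycles in (50, 100, 250, 500, 1000):
    in_phase = (1 - P_PHASE_ERROR) ** cycles
    print(f"after {cycles:4d} cycles: {in_phase:.1%} of cluster still in phase")
```

Under this assumption the usable signal has largely vanished by a thousand cycles, which is one intuitive way to see why second-generation reads must stay short.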
Regardless of the limitations of sequencing and statistical approximation techniques, the microbiota has been shown to be related to a number of chronic illnesses such as obesity, diabetes, and even autism, and the scientific community is confident that it will yield markers for preventive healthcare and new treatment methods for patients. In the future, technological advances will allow microbial analyses to be carried out easily and routinely – akin to the way blood and urine are currently analyzed – making more information available to prevent illness and/or design individualized treatments using specific probiotics or fecal transplants.
Carlo Bressa is a professor of Medical Specialties in the Doble Grado en Farmacia y Biotecnología (double degree in Pharmacy and Biotechnology).