Sergei L. Kosakovsky Pond, PhD

The new T2T human genome, hundreds of high-quality mammalian and other genomes from the VGP and Zoonomia projects, and tens of millions of SARS-CoV-2 genomes collected during the pandemic are high-profile examples of exponential rates of data generation. Increasingly complex methods are being developed to handle the lifecycle of these data: generation, assembly, quality control, inference, and interpretation. However, several key methods based on maximum likelihood inference, and developed as far back as the 1960s, remain the workhorses of comparative genomics. I will describe how new volumes and types of data drove the identification of what George Box called "importantly wrong" aspects of these methods and models, and motivated their refinement and improvement. I will also discuss recent developments in algorithmic and computational efficiency needed to handle very large datasets, drawing on many examples from SARS-CoV-2 genomics.