Using machine learning as an alternative to phylogenetic bootstrap and for quantifying MSAs

This was part of Contemporary Challenges in Large-Scale Sequence Alignments and Phylogenies

Tal Pupko, Tel-Aviv University

Tuesday, August 12, 2025

Abstract: Accurate estimation of branch support values is critical for robust phylogenetic inference. Traditional approaches—such as Felsenstein’s bootstrap, parametric tests, and their approximations—often struggle to balance accuracy, speed, and interpretability. Here, we introduce a data-driven method that estimates branch support values with a clear probabilistic interpretation. We simulated thousands of realistic phylogenetic trees along with their corresponding multiple sequence alignments (MSAs). Each alignment was analyzed using state-of-the-art phylogenetic inference tools, and the resulting trees were compared against the known true trees. Using this extensive dataset, we trained machine learning models to predict support values for each bipartition in maximum-likelihood trees. Our models consistently outperform standard methods in both accuracy and computational efficiency. We further demonstrate the practical utility of our approach on empirical datasets. In addition, we explore the application of machine learning for evaluating MSAs. While most current methods optimize heuristic functions like the sum-of-pairs score, few studies have assessed whether such objectives truly yield the most accurate alignments. We show that machine-learned scores correlate more strongly with true MSA accuracy than traditional metrics, enabling more reliable selection among alternative alignments. Together, these findings highlight the potential of machine learning to improve both tree inference and MSA evaluation in phylogenetic analyses.
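
As a rough illustration of the labeling step mentioned in the abstract (comparing each inferred tree against the known true tree), the minimal sketch below marks every bipartition of an inferred tree as present or absent in the true tree. The nested-tuple tree encoding and the function names (leaves, bipartitions, label_branches) are assumptions made for this example only; the abstract does not specify the actual pipeline, features, or models.

```python
# Hypothetical sketch: trees are nested tuples of taxon names, e.g.
# (("A", "B"), ("C", "D"), "E"). This encoding is an assumption for
# illustration, not the format used in the work described above.

def leaves(tree):
    """Return the set of taxon names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of an unrooted tree.

    Each split is stored as the side that contains the alphabetically
    smallest taxon, so the same split always has the same representation.
    """
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_taxa - clade
        # Skip trivial splits (a single leaf, or the whole taxon set).
        if len(clade) > 1 and len(other) > 1:
            splits.add(clade if reference in clade else other)
        return clade

    visit(tree)
    return splits


def label_branches(inferred_tree, true_tree):
    """Map each split of the inferred tree to 1 (in true tree) or 0."""
    true_splits = bipartitions(true_tree)
    return {
        split: int(split in true_splits)
        for split in bipartitions(inferred_tree)
    }


# Toy example with five taxa: the inferred tree recovers AB|CDE
# but wrongly groups C with E.
true_tree = (("A", "B"), ("C", "D"), "E")
inferred_tree = (("A", "B"), ("C", "E"), "D")
labels = label_branches(inferred_tree, true_tree)
```

Labels of this kind, paired with per-branch features, could serve as training targets for a classifier whose predicted probability is read as a branch support value. Which features and model families are used is not stated in the abstract.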