pangoLEARN is an algorithm for lineage assignment, implemented as of pangolin 2.0.
This new algorithm, which relies on machine learning, offers fast lineage assignment.
The current version of pangoLEARN uses a random forest. The previous versions of pangoLEARN models are still available for download at the previous tagged release on GitHub.
The model was trained using SARS-CoV-2 sequences from GISAID, acknowledgements here, with their lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pango-designation.
Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 30 minutes on our hardware (may change with different hardware). This model was built using the standard sci-kit learn implementation of a decision tree learning algorithm. The code for this process is available in the cov-lineages/pangoLEARN repository.
While in standard regression a line of best fit is found for a set of training data, which represents a linear relationship between variables of interest, a logistic regression fits a sigmoid function to the training data, in order to tell two different classes apart. A multinomial logistic regression is an extension of a standard logistic regression in that it can be used to classify more than two classes. Each potential assignment (i.e. lineage) is modeled as a set of n-1 independent binary choices (sigmoid functions), where n is the number of classes.
Multinomial logistic regression is an extremely commonly used model as it is able to simply and intuitively assign probabilities to class assignments. However, it does not incorporate any hierarchical structure. We are currently developing new models that do incorporate hierarchical structure. However, given the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracies for smaller lineages, the logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.
Decision trees are especially intuitive models. A series of if-then-else decision rules are learned from the input data and assembled into a hierarchical structure. The internal nodes represent a decision between a number of paths, and the leaf nodes represent a final classification. By learning this tree structure from data which itself comes from a phylogenic tree, we are representing this underlying phylogenic tree structure in a way which can be more easily harnessed for sequence classification.
Decision trees can sometimes be thought of as brittle as small changes in the training data can potentially result in large changes in the tree structure. However, several similar model types were tested in preparation for this data release, including various Random Forest models, which are generally thought to be more robust than decision trees. The simpler decision tree performed the best both in terms of accuracy and training time. Because the older pieces of the underlying phylogenic tree structure will remain stable, this seems to be a case where the normal stability concerns associated with this model are less relevant.
A random forest is an ensemble of decision trees, each built using a subset of the training samples. Classifications are made when each individual tree ‘votes’ on the final decision.