pangolin outputs a CSV file with the taxon name and the assigned lineage, one line for each sequence in the FASTA file provided. The following descriptions relate to pangolin 3.0 onwards.
CSV column headers
taxon: The name of an input query sequence. Note that spaces and commas in sequence names are replaced by underscores (it is best to avoid these characters in sequence names in general).
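The renaming described above can be mimicked when matching output rows back to the original FASTA headers. This is a minimal sketch, assuming each space or comma becomes exactly one underscore; `sanitize_name` is a hypothetical helper, not part of pangolin, and pangolin's own handling may differ in detail.

```python
import re


def sanitize_name(name: str) -> str:
    """Replace every space or comma in a sequence name with an underscore.

    Assumption: one underscore per replaced character.
    """
    return re.sub(r"[ ,]", "_", name)


print(sanitize_name("sample 1,batch A"))  # sample_1_batch_A
```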
lineage: The most likely lineage assigned to a given sequence based on the inference engine used and the SARS-CoV-2 diversity designated. This assignment may be sensitive to missing data at key sites.
conflict: In the pangoLEARN decision tree model, a given sequence gets assigned to the most likely category based on known diversity. If a sequence can fit into more than one category, the conflict score will be greater than 0 and reflect the number of categories the sequence could fit into. If the conflict score is 0, this means that within the current decision tree there is only one category that the sequence could be assigned to.
ambiguity_score: This score is a function of the quantity of missing data in a sequence. It represents the proportion of relevant sites in a sequence which were imputed to the reference values. A score of 1 indicates that no sites were imputed, while a score of 0 indicates that more sites were imputed than were not imputed. This score only includes sites which are used by the decision tree to classify a sequence.
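One way to use these two scores is to flag assignments that deserve a second look. The sketch below is illustrative only: the column names `taxon`, `conflict` and `ambiguity_score` are assumed to match the CSV header, the sample rows are made up, the 0.9 threshold is arbitrary, and empty fields are treated as "no conflict" and "nothing imputed", which is also an assumption.

```python
import csv
import io


def flag_uncertain(rows, min_ambiguity_score=0.9):
    """Return taxon names whose assignment looks uncertain.

    A row is flagged if its conflict score is above 0 or its
    ambiguity score falls below the (arbitrary) threshold.
    """
    flagged = []
    for row in rows:
        # Assumption: an empty field means the score was not computed.
        conflict = float(row["conflict"] or 0)
        ambiguity = float(row["ambiguity_score"] or 1)
        if conflict > 0 or ambiguity < min_ambiguity_score:
            flagged.append(row["taxon"])
    return flagged


# Made-up rows standing in for a pangolin output file.
sample = io.StringIO(
    "taxon,lineage,conflict,ambiguity_score\n"
    "seqA,B.1.1.7,0.0,1.0\n"
    "seqB,B.1,0.5,0.8\n"
    "seqC,A,,\n"
)
rows = list(csv.DictReader(sample))
print(flag_uncertain(rows))  # ['seqB']
```

For a real run, `sample` would be replaced by an open file handle on the CSV that pangolin wrote.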
scorpio_call: If a query is assigned a constellation by scorpio, this call is output in this column. The full set of constellations searched by default can be found at the constellations repository.
scorpio_support: The support score is the proportion of defining variants which have the alternative allele in the sequence.
scorpio_conflict: The conflict score is the proportion of defining variants which have the reference allele in the sequence. Ambiguous/other non-ref/alt bases at each of the variant positions contribute only to the denominators of these scores.
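The two proportions can be written out as a small calculation. This is a sketch of the definitions as stated above, not scorpio's own code; `scorpio_scores` and its parameter names are hypothetical, and the counts used (16 alt, 3 ref, 4 ambiguous) are illustrative.

```python
def scorpio_scores(alt: int, ref: int, amb: int, other: int = 0):
    """Return (support, conflict) for one constellation call.

    Ambiguous and other non-ref/alt bases count only in the denominator.
    """
    total = alt + ref + amb + other
    if total == 0:
        raise ValueError("no defining variants to score")
    support = alt / total   # proportion with the alternative allele
    conflict = ref / total  # proportion with the reference allele
    return support, conflict


support, conflict = scorpio_scores(alt=16, ref=3, amb=4)
print(f"support={support:.4f} conflict={conflict:.4f}")
# support=0.6957 conflict=0.1304
```

Because ambiguous sites inflate only the denominator, the two scores do not have to sum to 1.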
version: A version number that represents both the pango-designation number and the inference engine used to assign the lineage. For example:
- PANGO-1.2 indicates that an identical sequence has previously been designated this lineage, and so has gone through manual curation. The number 1.2 indicates the version of pango-designation that this assignment is based on.
- PLEARN-1.2 indicates that this sequence is different from any previously designated sequence and that the pangoLEARN model was used as an inference engine to predict the most likely lineage based on the given version of pango-designation.
- PUSHER-1.2 indicates that this sequence is different from any previously designated sequence and that UShER was used as an inference engine with fast tree placement and parsimony-based lineage assignment, based on a guide tree (protobuf) file built from the data in a given pango-designation release version.
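The prefix and number in these strings can be separated programmatically. The following is a minimal sketch covering only the three prefixes listed above; `parse_version` and the `ENGINES` mapping are hypothetical names introduced for illustration, and other prefixes are rejected rather than guessed at.

```python
# Prefixes described in the list above, mapped to a short label.
ENGINES = {
    "PANGO": "designation hash (manually curated)",
    "PLEARN": "pangoLEARN",
    "PUSHER": "UShER",
}


def parse_version(version: str):
    """Split a string such as 'PLEARN-1.2' into (engine, designation_version)."""
    prefix, sep, number = version.partition("-")
    if not sep or prefix not in ENGINES:
        raise ValueError(f"unrecognised version string: {version!r}")
    return ENGINES[prefix], number


print(parse_version("PUSHER-1.2"))  # ('UShER', '1.2')
```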
pangolin_version: The version of pangolin software running.
pangoLEARN_version: The dated version of the pangoLEARN model installed.
pango_version: The version of pango-designation lineages that this assignment is based on.
status: Indicates whether the sequence passed the QC thresholds for minimum length and maximum N content.
note: If there are any conflicts from the decision tree, this field will output the alternative assignments. If the sequence failed QC, this field will describe why. If the sequence met the SNP thresholds for scorpio to call a constellation, it will describe the exact SNP counts of Alt, Ref and Amb (alternative, reference and ambiguous) alleles for that call.
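When the note carries scorpio allele counts, they can be pulled out with a regular expression. This sketch assumes the counts appear as "Alt alleles N; Ref alleles N; Amb alleles N", as in the example on this page; any surrounding text in a real note is ignored, and `parse_allele_counts` is a hypothetical helper.

```python
import re

# Matches e.g. "Alt alleles 16" and captures the label and the count.
COUNT_RE = re.compile(r"(Alt|Ref|Amb) alleles (\d+)")


def parse_allele_counts(note: str) -> dict:
    """Return the allele counts found in a note, keyed by 'alt', 'ref', 'amb'.

    Returns an empty dict when the note contains no counts.
    """
    return {label.lower(): int(count) for label, count in COUNT_RE.findall(note)}


print(parse_allele_counts("Alt alleles 16; Ref alleles 3; Amb alleles 4"))
# {'alt': 16, 'ref': 3, 'amb': 4}
```

A note such as "Assigned using designation hash." contains no counts, so the function returns an empty dictionary for it.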
|Virus1||B.1.617.1||PANGO-1.2||2.4.2||2021-05-10||1.2||passed_qc||Assigned using designation hash.|
Alt alleles 16; Ref alleles 3; Amb alleles 4
|Virus3||A||PANGO-1.2||2.4.2||2021-05-10||1.2||passed_qc||Assigned using designation hash.|