pangolin assigns each sequence the most likely lineage out of all currently designated lineages, and uses scorpio to sanity-check lineages that correspond to variants by checking for the presence of their constellations.
All input sequences are aligned against an early, anonymised reference SARS-CoV-2 sequence. pangolin maps the sequences with minimap2, then uses gofasta to convert that mapping into a FASTA alignment with the non-coding regions masked out with N's.
This alignment is rationalised to unique sequences using a hash system.
Each sequence is run through the designation cache to see if it has previously been designated a lineage.
The sequences are run through a sequence QC check that reports the proportion of ambiguous bases in each sequence. Any sequence that fails this check is not assigned a lineage.
Each sequence is run through scorpio to check for variants of concern or variants of interest, as defined in cov-lineages constellations.
The output of each of these steps is amalgamated into a preprocessing report CSV file.
The final products of the preprocessing pipeline are the alignment file and the merged report.
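
The de-duplication and QC steps above can be illustrated with a minimal Python sketch. This is not pangolin's code: the function names, the choice of MD5 as the hash, the treatment of gaps as ambiguous, and the 0.3 threshold are all assumptions made for illustration.

```python
import hashlib


def sequence_hash(seq):
    """Hash an aligned sequence. MD5 is an assumed choice for this sketch."""
    return hashlib.md5(seq.upper().encode("ascii")).hexdigest()


def collapse_to_unique(records):
    """Collapse (name, sequence) pairs to unique sequences.

    Returns two dicts: hash -> sequence, and hash -> list of the
    names that share that sequence.
    """
    unique = {}
    names_by_hash = {}
    for name, seq in records:
        h = sequence_hash(seq)
        unique.setdefault(h, seq.upper())
        names_by_hash.setdefault(h, []).append(name)
    return unique, names_by_hash


def proportion_ambiguous(seq):
    """Fraction of positions that are not A, C, G or T.

    In this sketch N's, gaps and IUPAC ambiguity codes all count as ambiguous.
    """
    if not seq:
        return 1.0
    unambiguous = sum(seq.upper().count(base) for base in "ACGT")
    return 1 - unambiguous / len(seq)


def passes_qc(seq, max_ambiguity=0.3):
    """QC check; the default threshold is a placeholder for illustration."""
    return proportion_ambiguous(seq) <= max_ambiguity
```

Working on hashes rather than names means each distinct sequence only needs to be looked up in the designation cache, and later assigned a lineage, once; the result can then be copied back to every name recorded under that hash.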
The inference pipelines each run the inference step (either pangoLEARN or UShER).
The output of each inference step is processed into a standardised format.
The output of both sets of pipelines (pre-processing and inference) is integrated to produce the final lineage report.
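
As a rough illustration of this final integration, the sketch below joins preprocessing rows with standardised inference output. The column names (`hash`, `qc_status`, `designated_lineage`, `lineage`), the `Unassigned` placeholder, and the rule that a cached designation takes precedence over inference are assumptions for this example, not pangolin's actual report schema or logic.

```python
def merge_reports(preprocessing_rows, inference_by_hash):
    """Combine preprocessing rows with inference results into final rows.

    preprocessing_rows: list of dicts, one per input sequence
        (hypothetical columns: name, hash, qc_status, designated_lineage).
    inference_by_hash: dict mapping sequence hash -> inferred lineage.
    """
    final = []
    for row in preprocessing_rows:
        merged = dict(row)  # copy so the input rows are left untouched
        if row["qc_status"] != "pass":
            # Sequences that fail QC are not assigned a lineage.
            merged["lineage"] = "Unassigned"
        elif row.get("designated_lineage"):
            # Assumed precedence: a designation-cache hit wins.
            merged["lineage"] = row["designated_lineage"]
        else:
            merged["lineage"] = inference_by_hash.get(row["hash"], "Unassigned")
        final.append(merged)
    return final
```

Keying the inference results by hash lets sequences that were collapsed as duplicates during preprocessing all receive the same inferred lineage.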