How it works

Changes in pangolin 4.0

  1. Changes to pangolin dependencies: pangolin-data is now a dependency, which removes the need for pangoLEARN and pango-designation.
  2. Change to version number reporting. See the output documentation for more information.
  3. Change to the output csv format.
  4. Change to how sequences that fail to get assigned a lineage are reported: they are now reported as `Unassigned` instead of `None`.
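
Downstream scripts that filtered on `None` may need updating after the change in item 4. The snippet below is a minimal sketch of one way to count unassigned sequences across reports from either version. The column names `taxon` and `lineage` and the helper names are assumptions made for illustration; check them against the output documentation for your pangolin version.

```python
import csv
import io

# Labels used for sequences without a lineage: `None` before pangolin 4.0,
# `Unassigned` from 4.0 onwards (as described in the list above).
UNASSIGNED_LABELS = {"None", "Unassigned"}


def is_unassigned(lineage):
    """Return True if the lineage value marks a sequence with no lineage."""
    return lineage.strip() in UNASSIGNED_LABELS


def count_unassigned(report_csv, lineage_column="lineage"):
    """Count unassigned rows in a report given as csv text.

    `lineage_column` is an assumed column name, not taken from this page.
    """
    reader = csv.DictReader(io.StringIO(report_csv))
    return sum(1 for row in reader if is_unassigned(row[lineage_column]))


# Hypothetical report content, for illustration only.
example_report = "taxon,lineage\nseq1,B.1.1.7\nseq2,Unassigned\nseq3,None\n"
unassigned_total = count_unassigned(example_report)
```

Treating both labels as equivalent lets the same script read reports produced before and after the upgrade.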

pangolin 4.0 pipeline schema

pangolin 4.0 pipeline description

pangolin assigns the most likely lineage out of all currently designated lineages, and uses scorpio to sanity-check specific lineages that correspond to variants by checking for the presence of their constellations.

  1. All input sequences are aligned against an early, anonymised SARS-CoV-2 reference sequence. pangolin creates the alignment by mapping with minimap2 and then using gofasta to generate a fasta file from that mapping, with the non-coding regions masked out with N's.

    This alignment is collapsed to its set of unique sequences using a hash system.

    Each sequence is run through the designation cache to see whether it has previously been designated a lineage.

    The sequences are run through a sequence QC check that reports the proportion of ambiguous bases in each sequence. Any sequences that fail this check will not get assigned a lineage.

    Each sequence is run through scorpio to check for variants of concern or variants of interest, as defined in cov-lineages constellations.

    The output of each of these steps is amalgamated into a preprocessing report csv file.

    The final products of the preprocessing pipeline are the alignment file and the merged report.

  2. Each inference pipeline runs its inference step (either pangoLEARN or UShER).

    The output of each inference step is processed into a standardised format.

  3. The output of both sets of pipelines (pre-processing and inference) is integrated to produce the final lineage report.
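
Two of the preprocessing steps in item 1, collapsing the alignment to unique sequences by hash and the ambiguity QC check, can be sketched in a few lines of Python. This is an illustrative sketch only, not pangolin's own code: the function names, the choice of SHA-256 as the hash, and the 0.3 ambiguity threshold are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def collapse_to_unique(aligned):
    """Group identical aligned sequences under one hash key.

    `aligned` maps sequence name -> aligned sequence string.
    Returns (hash -> sequence, hash -> list of names sharing that sequence).
    SHA-256 is an arbitrary choice for this sketch.
    """
    unique_seqs = {}
    names_by_hash = defaultdict(list)
    for name, seq in aligned.items():
        normalised = seq.upper()
        key = hashlib.sha256(normalised.encode("ascii")).hexdigest()
        unique_seqs.setdefault(key, normalised)
        names_by_hash[key].append(name)
    return unique_seqs, dict(names_by_hash)


def proportion_ambiguous(seq):
    """Fraction of characters that are not A, C, G or T (N and gaps included)."""
    if not seq:
        return 1.0
    upper = seq.upper()
    unambiguous = sum(upper.count(base) for base in "ACGT")
    return 1 - unambiguous / len(upper)


def passes_qc(seq, max_ambiguity=0.3):
    """QC check; the 0.3 threshold is an assumed, illustrative value."""
    return proportion_ambiguous(seq) <= max_ambiguity


# Hypothetical toy alignment: "a" and "b" are identical after upper-casing.
toy_alignment = {"a": "ACGTACGTAC", "b": "acgtacgtac", "c": "ACGTNNNNNN"}
unique, names = collapse_to_unique(toy_alignment)
qc_passed = {key: passes_qc(seq) for key, seq in unique.items()}
```

Working on unique sequences means each distinct sequence only needs to go through the later steps once, and the name lists let the result be copied back to every input record that shares it.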