As of pangolin 4.0, pangolin will run lineage assignment by default in accurate (UShER) mode.
Note: Accurate mode is significantly slower on larger datasets, for which we still recommend fast mode or a high thread count (--threads).
To run in fast mode (e.g. for larger datasets), specify --analysis-mode fast on the command line; this runs pangoLEARN model inference instead.
Pangolin includes multiple analysis engines: UShER and pangoLEARN.
Scorpio is used in conjunction with UShER/pangoLEARN to curate variant of concern (VOC)-related lineage calls.
In pangolin 4.0, UShER is the default and is selected using option "usher" or "accurate". In this mode, pangolin runs a parsimony-based lineage assignment using UShER as the inference engine.
Run pangolin <query>, where <query> is the name of your input (FASTA) file.
pangoLEARN can alternatively be selected using "pangolearn" or "fast". As of v4.0, pangoLEARN mode uses a random forest machine learning approach as the inference engine.
Run pangolin --analysis-mode fast <query>, where <query> is the name of your input (FASTA) file.
Finally, it is possible to skip the UShER/pangoLEARN step by selecting "scorpio" mode, but in this case only VOC-related lineages will be assigned. The output version number (e.g. `SCORPIO_v0.1.4`) corresponds to the version of constellations used in the scorpio assignment.
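The mode selection described above can be sketched as a small Python helper that builds (but does not run) the pangolin command line. This is an illustration only: the function name `pangolin_command` and the file name `sequences.fasta` are hypothetical, and passing "scorpio" as an --analysis-mode value is inferred from the description above.

```python
import shlex

# Mode names as given in the text above. "scorpio" is assumed to be passed
# through --analysis-mode in the same way as the other modes.
MODES = {"usher", "accurate", "pangolearn", "fast", "scorpio"}


def pangolin_command(query, mode=None, threads=None):
    """Build (but do not run) an argument list for pangolin.

    query   -- path to the input FASTA file
    mode    -- one of MODES, or None to use pangolin's default (UShER)
    threads -- optional thread count, passed via --threads
    """
    cmd = ["pangolin"]
    if mode is not None:
        if mode not in MODES:
            raise ValueError(f"unknown analysis mode: {mode!r}")
        cmd += ["--analysis-mode", mode]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd.append(query)
    return cmd


# Default (accurate/UShER) mode, then fast (pangoLEARN) mode with 8 threads.
print(shlex.join(pangolin_command("sequences.fasta")))
print(shlex.join(pangolin_command("sequences.fasta", mode="fast", threads=8)))
```

The resulting list can be handed to `subprocess.run` on a machine where pangolin is installed.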
In the process of lineage assignment, pangolin creates an alignment: it uses minimap2 to map each query against an early, anonymised SARS-CoV-2 reference sequence, then uses gofasta to generate a FASTA file from that mapping, with the non-coding regions masked out with N's.
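To make the masking step concrete, here is a toy Python sketch of replacing everything outside a kept region with N's. It does not reproduce what minimap2 or gofasta actually do; the function name `mask_outside` and the coordinates are hypothetical and are not the values pangolin uses.

```python
def mask_outside(seq, start, end, mask_char="N"):
    """Replace every base outside the half-open interval [start, end) with mask_char."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("need 0 <= start <= end <= len(seq)")
    return mask_char * start + seq[start:end] + mask_char * (len(seq) - end)


# Toy 10-base "aligned" sequence; keep positions 2..7 and mask both ends.
aligned = "ACGTACGTAC"
print(mask_outside(aligned, 2, 8))  # NNGTACGTNN
```

The masked sequence keeps its original length, so every sequence in the alignment stays in the same coordinate system.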
For convenience (I know I certainly find it very useful for quickly generating a SARS-CoV-2 alignment), pangolin has a flag, --alignment, that writes this alignment alongside the lineage report instead of to the temp directory. The exact minimap2 and gofasta parameters can be found in the pangolin source code.
By default the output alignment file is called alignment.fasta, but as of pangolin 4.0 you can now specify the name of this file with the --alignment-file flag.
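As a rough sketch, the alignment options can be assembled like this in Python. The helper name `alignment_command` and the file names are hypothetical; the sketch always passes --alignment together with --alignment-file, on the assumption that the file name only matters when alignment output is requested.

```python
import shlex


def alignment_command(query, alignment_file=None, outdir=None):
    """Build (but do not run) a pangolin command that also writes the alignment.

    alignment_file -- optional name for the alignment (pangolin 4.0 and later);
                      when omitted, pangolin's default name alignment.fasta applies
    outdir         -- optional output directory, passed via --outdir
    """
    cmd = ["pangolin", "--alignment"]
    if alignment_file is not None:
        cmd += ["--alignment-file", alignment_file]
    if outdir is not None:
        cmd += ["--outdir", outdir]
    cmd.append(query)
    return cmd


print(shlex.join(alignment_command("sequences.fasta")))
print(shlex.join(alignment_command("sequences.fasta", alignment_file="my_alignment.fasta")))
```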
    usage: pangolin <query> [options]

    pangolin: Phylogenetic Assignment of Named Global Outbreak LINeages

    optional arguments:
      -h, --help            show this help message and exit

    Input-Output options:
      query                 Query fasta file of sequences to analyse.
      -o OUTDIR, --outdir OUTDIR
                            Output directory. Default: current working directory
      --outfile OUTFILE     Optional output file name. Default: lineage_report.csv
      --tempdir TEMPDIR     Specify where you want the temp stuff to go. Default: $TMPDIR
      --no-temp             Output all intermediate files, for dev purposes.
      --alignment           Output multiple sequence alignment.
      --alignment-file ALIGNMENT_FILE
                            Multiple sequence alignment file name.

    Analysis options:
      --analysis-mode ANALYSIS_MODE
                            Specify which inference engine to use. Options: accurate (UShER),
                            fast (pangoLEARN), pangolearn, usher. Default: UShER inference.
      --skip-designation-cache
                            Developer option - do not use designation cache to assign lineages.
      --max-ambig MAXAMBIG  Maximum proportion of Ns allowed for pangolin to attempt assignment.
                            Default: 0.3
      --min-length MINLEN   Minimum query length allowed for pangolin to attempt assignment.
                            Default: 25000

    Data options:
      --update              Automatically updates to latest release of pangolin, pangolin-data,
                            scorpio and constellations, then exits.
      --update-data         Automatically updates to latest release of constellations and
                            pangolin-data, including the pangoLEARN model, UShER tree file and
                            alias file then exits.
      -d DATADIR, --datadir DATADIR
                            Data directory minimally containing the pangoLEARN model, header
                            files, UShER tree and alias file. Default: Installed pangolin-data
                            package.
      --usher-tree USHER_PROTOBUF
                            UShER Mutation Annotated Tree protobuf file to use instead of
                            --usher default from pangolin-data repository or --datadir.

    Misc options:
      --aliases             Print Pango alias_key.json and exit.
      -v, --version         show program's version number and exit
      -pv, --pangolin-data-version
                            show version number of pangolin data files (UShER tree and
                            pangoLEARN model files) and exit.
      --all-versions        Print all tool, dependency, and data versions then exit.
      --verbose             Print lots of stuff to screen
      -t THREADS, --threads THREADS
                            Number of threads
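The --max-ambig and --min-length defaults listed in the help text can be approximated with a quick pre-check before a run. This Python sketch is not pangolin's own implementation: the function name `passes_filters` is hypothetical, and counting N's over the raw (unaligned) sequence length and treating the thresholds as inclusive are both assumptions.

```python
def passes_filters(seq, max_ambig=0.3, min_length=25000):
    """Return True if seq would plausibly clear the length and ambiguity thresholds.

    Assumptions: the proportion of N's is taken over the raw sequence length,
    and a sequence exactly at either threshold is accepted.
    """
    length = len(seq)
    if length == 0 or length < min_length:
        return False
    n_proportion = seq.upper().count("N") / length
    return n_proportion <= max_ambig


# Toy sequences: one clean, one too short, one with half of its bases as N.
clean = "A" * 30000
short = "A" * 20000
ambiguous = "A" * 15000 + "N" * 15000

for name, seq in [("clean", clean), ("short", short), ("ambiguous", ambiguous)]:
    print(name, passes_filters(seq))
```

A check like this only helps predict which records may go unassigned; pangolin's own report remains the authority on what was actually filtered.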