
How to: lineage designation
So you think you've identified a new lineage? The following is a step-by-step guide to adding your new lineage to the growing list of lineages we maintain, so that it can then be assigned using pangolin.
Refer to the Pango lineage guide to check if your cluster fits the new lineage guidelines.
Go to the github.com/cov-lineages/pango-designation repository. Notice the lineage_notes.txt and lineages.csv files. These include the latest set of manually curated lineage designations.
It may well be that others have also noticed this new lineage and have already requested it. It's a good idea to check the issues list and the latest tagged releases to ensure your new lineage doesn't already exist.
Use the example issue to create a new issue describing why your cluster should be a new lineage and presenting evidence to support the lineage designation, potentially in the form of a phylogenetic tree. Note that your sequences need to be on GISAID to be added to the lineage scheme, and you must provide a list of sequence names or GISAID IDs that we can match to the database. Ideally, identify the sequences in a format consistent with the lineages.csv file.
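As a rough illustration of preparing such a list, the sketch below writes one row per sequence name alongside a proposed lineage label. The two-column layout, the `taxon` and `lineage` header names, the helper `format_designations`, and the example sequence names are all assumptions made for illustration; check the current lineages.csv in the repository for the actual format before submitting.

```python
import csv
import io


def format_designations(sequence_names, lineage):
    """Return CSV text with one (sequence name, lineage) row per sequence.

    The header names below are assumed, not taken from this guide.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["taxon", "lineage"])  # assumed column names
    for name in sequence_names:
        # Strip stray whitespace so names match the database more reliably.
        writer.writerow([name.strip(), lineage])
    return buffer.getvalue()


# Hypothetical sequence names and a placeholder lineage label.
names = ["England/EXAMPLE-0001/2020", "Wales/EXAMPLE-0002/2020"]
print(format_designations(names, "PROPOSED"))
```

The resulting text can be pasted into the issue or attached as a file, so that the names can be matched without manual reformatting.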
If your lineage proposal fits in with the lineage scheme, the following steps will take place:
A lineage name will be assigned to your proposed lineage (or confirmed if you've suggested one). This name and your description will be added to the lineage_notes.txt file, and the sequence names you provide, together with the new lineage designation, will be appended to the lineages.csv file.
A new release will be tagged at github.com/cov-lineages/pango-designation. Small lineage updates will get a minor release tag, whereas larger-scale designations will get a major tag.
This new set of designations will be input into the pangoLEARN training pipeline and run on the latest GISAID data. The information input is the sequence_name and the lineage designation. These will be matched up to the sequences on GISAID, so it is important for the correct names to be provided.
One issue we've encountered previously is inconsistent treatment of spaces in sequence names. For example, GISAID may output the sequence name `South Africa/XXXXX/2020`; because spaces are not tolerated in fasta headers, our standard is to replace each space with an underscore, giving `South_Africa/XXXXX/2020`. We have some checks in place in the pipeline for these edge cases, but please take care to avoid systematic changes to names that may unexpectedly interfere with our ability to match names to sequences.
When the model completes training, the updated files are pushed to the pangoLEARN repository and a new pangoLEARN release is tagged. With these new model files, genomes similar to those in your new lineage can be assigned that lineage designation. More information about how the pangoLEARN model training and assignment works can be found in the pangolin documentation.
The information on the cov-lineages.org website is updated once daily, always with the latest designations and assignments. Within 24 hours you should be able to find your new lineage reflected on the website.
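The space-handling convention described above for sequence names can be checked before you submit. The sketch below replaces spaces with underscores and reports which submitted names have no counterpart in a reference list of names. The helpers `normalise_name` and `match_names` are hypothetical and are not part of pangolin or the pangoLEARN pipeline; the actual pipeline checks may differ.

```python
def normalise_name(name):
    """Replace spaces with underscores, since fasta headers cannot contain spaces."""
    return name.strip().replace(" ", "_")


def match_names(submitted, database_names):
    """Split submitted names into those found in database_names and those not.

    Both sides are normalised first, so 'South Africa/...' and
    'South_Africa/...' are treated as the same sequence.
    """
    known = {normalise_name(n) for n in database_names}
    matched, unmatched = [], []
    for name in submitted:
        clean = normalise_name(name)
        if clean in known:
            matched.append(clean)
        else:
            unmatched.append(clean)
    return matched, unmatched


# Hypothetical inputs: one name differs only in its use of a space.
submitted = ["South Africa/XXXXX/2020", "England/YYYYY/2020"]
database = ["South_Africa/XXXXX/2020"]
matched, unmatched = match_names(submitted, database)
print("matched:", matched)
print("unmatched:", unmatched)
```

Any names reported as unmatched are worth checking by hand for typos or systematic renaming before the list goes into an issue.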