Getting Started
Installation
Warning
🔌 Right now, VOCAL has been tested on Linux systems only 💻
Clone this repository:
You can easily install all dependencies with conda:
Running VOCAL Manual mode (MOC information)
... in three steps.
Step1: Annotate mutations in the Spike protein
By default, this creates a variant_table.tsv file containing all mutations.
✴️ Note
When VOCAL is run without options, it realigns each query sequence to the Wuhan reference sequence NC_045512 using the pairwise alignment function in the Biopython library.
Tip: 🐌 Slow?
The alignment option in VOCAL uses a Biopython pairwise aligner and can be relatively slow. We therefore recommend generating an alignment file for all sequences before running the VOCAL mutation annotation.
The alignment file (in PSL format) can be created with the tool pblat, which can be downloaded separately or simply installed through our provided conda environment.
👀 To speed up the alignment step by using precomputed whole-genome alignments of the FASTA file, supplied as a PSL file (--PSL option), see the section below; otherwise, continue to Step2.
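To give a feel for what a PSL file contains, here is a minimal Python sketch that reads alignment rows and reports how much of each query sequence is covered by its alignment. It relies only on the standard 21-column PSL layout; the names `read_psl` and `query_coverage`, the example row, and the sample name are made up for this illustration and are not part of VOCAL.

```python
import io
from typing import Iterator, NamedTuple


class PslHit(NamedTuple):
    matches: int
    mismatches: int
    strand: str
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_start: int
    t_end: int


def read_psl(handle) -> Iterator[PslHit]:
    """Yield one PslHit per alignment row, skipping optional header lines."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Alignment rows have 21 tab-separated columns and start with an
        # integer (the match count); header and separator lines do not.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        yield PslHit(
            matches=int(fields[0]),
            mismatches=int(fields[1]),
            strand=fields[8],
            q_name=fields[9],
            q_size=int(fields[10]),
            q_start=int(fields[11]),
            q_end=int(fields[12]),
            t_name=fields[13],
            t_start=int(fields[15]),
            t_end=int(fields[16]),
        )


def query_coverage(hit: PslHit) -> float:
    """Fraction of the query sequence spanned by the alignment."""
    return (hit.q_end - hit.q_start) / hit.q_size


# Made-up example row (not real data): one query aligned to the reference.
EXAMPLE_PSL = "\t".join([
    "29800", "20", "0", "0", "0", "0", "0", "0", "+",
    "sample1", "29850", "10", "29830",
    "NC_045512.2", "29903", "50", "29870",
    "1", "29820,", "10,", "50,",
]) + "\n"

if __name__ == "__main__":
    for hit in read_psl(io.StringIO(EXAMPLE_PSL)):
        print(hit.q_name, hit.t_name, f"{query_coverage(hit):.3f}")
```

A check like this can help spot query sequences that are missing from the PSL file or only partially aligned before they are passed on to the annotation step.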
To generate a PSL file with alignments
Example command to generate a PSL file:
To run VOCAL with a PSL file:
Step2: Annotate mutation phenotypes
python vocal/Mutations2Function.py -i results/variant_table.tsv -a data/table_cov2_mutations_annotation.tsv -o results/variants_with_phenotypes.tsv
Step3: Detect/Alert emerging variants
Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-o results/
To include a metadata file, use the -a option:
Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-a test-data/meta.tsv \
-o results/
Attention
⚠️ Note: the metadata file must contain the following information:
- ID column (must match the sample IDs in the FASTA file)
- LINEAGE column (e.g., B.1.1.7, BA.1)
- SAMPLING DATE column (the date the sample was collected, in YYYY-mm-dd format)
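The requirements above can be checked before running Step3 with a small script. The following is a minimal sketch, not part of VOCAL: it assumes the metadata file is tab-separated and that the headers are spelled exactly `ID`, `LINEAGE`, and `SAMPLING_DATE`. Those header spellings and the function names are assumptions for illustration; adjust them to match your own file.

```python
import csv
import io
from datetime import datetime

# Assumed header names (hypothetical spelling); change to match your file.
REQUIRED_COLUMNS = ("ID", "LINEAGE", "SAMPLING_DATE")
DATE_FORMAT = "%Y-%m-%d"


def is_iso_date(value):
    """True only for strictly zero-padded YYYY-mm-dd dates."""
    try:
        parsed = datetime.strptime(value, DATE_FORMAT)
    except (TypeError, ValueError):
        return False
    # strptime accepts "2021-1-5"; the round trip rejects such values.
    return parsed.strftime(DATE_FORMAT) == value


def check_metadata(handle, fasta_ids=None):
    """Return a list of problems found in a tab-separated metadata file."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = row.get("ID") or ""
        if not sample_id:
            problems.append(f"line {line_no}: empty ID")
        elif fasta_ids is not None and sample_id not in fasta_ids:
            problems.append(f"line {line_no}: ID {sample_id!r} not found in FASTA")
        if not is_iso_date(row.get("SAMPLING_DATE")):
            problems.append(
                f"line {line_no}: bad date {row.get('SAMPLING_DATE')!r}"
            )
    return problems


if __name__ == "__main__":
    example = "ID\tLINEAGE\tSAMPLING_DATE\nsample1\tB.1.1.7\t2021-01-15\n"
    print(check_metadata(io.StringIO(example), fasta_ids={"sample1"}))
```

Passing the set of FASTA sample IDs as `fasta_ids` is optional; when given, each metadata ID is also checked against it.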
Finally, we can generate an HTML report at the end of the analysis:
python vocal/Reporter.py \
-s results/vocal-alerts-samples-all.csv \
-c results/vocal-alerts-clusters-summaries-all.csv \
-o results/vocal-report.html
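To run Step2, Step3, and the report in one go, the commands shown above can be chained from Python. This is only a sketch built from the commands in this section: the helper names `manual_mode_commands` and `run_all` are hypothetical, Step1 is assumed to have already produced `variant_table.tsv`, and the Step3 output file names are assumed to stay the same when a different results directory is used.

```python
import subprocess


def manual_mode_commands(results_dir="results", metadata=None):
    """Build the Step2, Step3 and report commands as argument lists."""
    results = results_dir.rstrip("/")
    phenotypes = f"{results}/variants_with_phenotypes.tsv"

    step2 = [
        "python", "vocal/Mutations2Function.py",
        "-i", f"{results}/variant_table.tsv",
        "-a", "data/table_cov2_mutations_annotation.tsv",
        "-o", phenotypes,
    ]

    step3 = ["Rscript", "--vanilla", "vocal/Script_VOCAL_unified.R", "-f", phenotypes]
    if metadata:
        # Optional metadata file, passed with -a as described above.
        step3 += ["-a", metadata]
    step3 += ["-o", f"{results}/"]

    report = [
        "python", "vocal/Reporter.py",
        "-s", f"{results}/vocal-alerts-samples-all.csv",
        "-c", f"{results}/vocal-alerts-clusters-summaries-all.csv",
        "-o", f"{results}/vocal-report.html",
    ]
    return [step2, step3, report]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; call run_all(...) from the
    # repository root to actually execute the steps.
    for cmd in manual_mode_commands(metadata="test-data/meta.tsv"):
        print(" ".join(cmd))
```

Using argument lists with `check=True` avoids shell quoting issues and stops the chain as soon as one step fails, so later steps never run on incomplete output.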
Running VOCAL with PS
PS = Positive Selection
Step1: Annotate mutations in the Spike protein.
This step is similar to Step1 of the manual mode above.
Step2: Annotate mutation phenotypes with PS
We will use positive selection to annotate mutation sites.
python vocal/Mutations2Function.py -i results/variant_table.tsv -a data/table_cov2_mutations_annotation.PS.tsv -o results/variants_with_phenotypes.PS.tsv
Step3: Detect/Alert emerging variants with PS
This time we will use Script_VOCAL_unified.PS.R to make a prediction.
Example:
Rscript --vanilla "vocal/Script_VOCAL_unified.PS.R" -n PS \
-f results/variants_with_phenotypes.PS.tsv \
-o results/
Or, to include a metadata file, use the -a option:
Rscript --vanilla "vocal/Script_VOCAL_unified.PS.R" -n PS \
-f results/variants_with_phenotypes.PS.tsv \
-a test-data/meta.tsv \
-o results/
✴️ Did you know?
We extended the VOCAL scoring scheme and integrated it with a machine learning model. You can try our "VOCAL-AutoMode".