Getting Started

Installation

Warning

🔌 VOCAL has so far been tested on Linux systems only 💻

Clone this repository:

git clone https://github.com/rki-mf1/vocal.git

You can easily install all dependencies with conda:

cd vocal
conda env create -n vocal -f env/environment.yml
conda activate vocal
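
Before moving on, it can help to confirm that the executables called in this guide are visible on the PATH of the activated environment. The snippet below is only a minimal sketch and not part of VOCAL: the helper name `missing_tools` and the tool list are assumptions taken from the commands shown in this guide.

```python
import shutil

# Executables invoked by the commands in this guide. This list is an
# assumption made for illustration; VOCAL itself does not define it.
REQUIRED_TOOLS = ("python", "Rscript", "pblat")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

pblat is only needed if you plan to use precomputed PSL alignments, so a missing pblat does not block the default workflow.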

Running VOCAL Manual mode (MOC information)

... in three steps.

Step1: Annotate mutations in the Spike protein

python vocal/vocal.py -i test-data/sample-test.fasta -o results/variant_table.tsv

This creates a variant_table.tsv file that lists all mutations.

✴️ Note

When VOCAL is run without any options, it realigns each query sequence to the Wuhan reference sequence NC_045512 using the pairwise alignment function in the Biopython library.

Tip: 🐌 Slow?

The alignment option in VOCAL uses a Biopython pairwise aligner and can be relatively slow. We therefore recommend generating an alignment file for all sequences before running the VOCAL mutation annotation. The alignment file (in PSL format) can be created with the tool pblat, which can be downloaded here or simply installed through our provided conda environment.

👀 To speed up alignment by using a precomputed whole-genome alignment of the FASTA file in PSL format (--PSL option), see the section below; otherwise, continue to Step2.

To generate a PSL file with alignments

Example command to generate a file in PSL format:

pblat test-data/ref.fna test-data/sample-test.fasta -threads=4 results/output.psl

To run VOCAL with a PSL file:

python vocal/vocal.py -i test-data/sample-test.fasta --PSL results/output.psl -o results/variant_table.tsv
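
If VOCAL reports fewer samples than expected when run with a PSL file, it may be worth checking which query sequences actually received an alignment. The following is a minimal sketch, assuming the standard 21-column PSL layout; the function names `read_psl` and `aligned_queries` are hypothetical helpers and do not exist in VOCAL.

```python
import csv

# Column names of the standard 21-field PSL format.
PSL_FIELDS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
)


def read_psl(lines):
    """Yield one dict per alignment row, skipping any PSL header lines."""
    for row in csv.reader(lines, delimiter="\t"):
        # Alignment rows have 21 fields and start with an integer match
        # count; header lines (version, column titles, dashes) do not.
        if len(row) != len(PSL_FIELDS) or not row[0].strip().isdigit():
            continue
        yield dict(zip(PSL_FIELDS, row))


def aligned_queries(lines):
    """Return the names of query sequences with at least one alignment."""
    return {record["qName"] for record in read_psl(lines)}
```

For example, `with open("results/output.psl") as handle: print(sorted(aligned_queries(handle)))` lists the aligned sample IDs, which can then be compared with the record names in the FASTA file.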

Step2: Annotate mutation phenotypes

python vocal/Mutations2Function.py -i results/variant_table.tsv -a data/table_cov2_mutations_annotation.tsv -o results/variants_with_phenotypes.tsv

By default, this step will create the consolidated table (variants_with_phenotypes.tsv file) of mutations with phenotype annotation.

Step3: Detect/Alert emerging variants

Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-o results/

To include a metadata file, add the -a option:

Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-a test-data/meta.tsv \
-o results/

Attention

⚠️Note: the metadata file must contain the following information:

ID column (must match the sample ID in the FASTA file)
LINEAGE column (e.g., B.1.1.7, BA.1)
SAMPLING DATE column (the date that a sample was collected) (format YYYY-mm-dd)
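
A quick check of the metadata file before Step3 can catch missing columns or malformed dates early. This is a minimal sketch under stated assumptions: the header names `ID`, `LINEAGE` and `SAMPLING_DATE` are guesses based on the list above and may differ from what VOCAL expects, and `check_metadata` is a hypothetical helper, not part of VOCAL.

```python
import csv
from datetime import datetime

# Assumed header names; adjust them to match your metadata file.
REQUIRED_COLUMNS = ("ID", "LINEAGE", "SAMPLING_DATE")


def check_metadata(lines):
    """Return a list of problems found in a tab-separated metadata table."""
    reader = csv.DictReader(lines, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    # Data rows start on line 2, after the header line.
    for line_number, row in enumerate(reader, start=2):
        value = row["SAMPLING_DATE"]
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except (TypeError, ValueError):
            problems.append(
                f"line {line_number}: bad date {value!r}, expected YYYY-mm-dd"
            )
    return problems
```

Calling `check_metadata(open("test-data/meta.tsv"))` returns an empty list when no problem is found. The sketch does not verify that the IDs match the FASTA records.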

Finally, we can generate a report in HTML format at the end of the analysis.

python  vocal/Reporter.py  \
        -s results/vocal-alerts-samples-all.csv \
        -c results/vocal-alerts-clusters-summaries-all.csv \
        -o results/vocal-report.html
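
The manual-mode commands above can also be chained from a small script so that a failure in one step stops the run. This is a minimal sketch, not something shipped with VOCAL: the function names `manual_mode_commands` and `run_manual_mode` are hypothetical, and the script and file paths are copied from the commands in this guide, so it assumes it is run from the repository root.

```python
import subprocess


def manual_mode_commands(fasta, outdir="results", metadata=None):
    """Build the manual-mode command lines as argument lists."""
    variants = f"{outdir}/variant_table.tsv"
    phenotypes = f"{outdir}/variants_with_phenotypes.tsv"

    detect = ["Rscript", "--vanilla", "vocal/Script_VOCAL_unified.R",
              "-f", phenotypes]
    if metadata is not None:
        detect += ["-a", metadata]  # optional metadata table
    detect += ["-o", f"{outdir}/"]

    return [
        # Step1: annotate mutations in the Spike protein
        ["python", "vocal/vocal.py", "-i", fasta, "-o", variants],
        # Step2: annotate mutation phenotypes
        ["python", "vocal/Mutations2Function.py", "-i", variants,
         "-a", "data/table_cov2_mutations_annotation.tsv",
         "-o", phenotypes],
        # Step3: detect/alert emerging variants
        detect,
        # HTML report
        ["python", "vocal/Reporter.py",
         "-s", f"{outdir}/vocal-alerts-samples-all.csv",
         "-c", f"{outdir}/vocal-alerts-clusters-summaries-all.csv",
         "-o", f"{outdir}/vocal-report.html"],
    ]


def run_manual_mode(fasta, outdir="results", metadata=None):
    """Run each step in order, stopping at the first one that fails."""
    for command in manual_mode_commands(fasta, outdir, metadata):
        subprocess.run(command, check=True)
```

Passing the commands as argument lists avoids shell quoting issues, and `check=True` raises an error as soon as a step exits with a non-zero status.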

Running VOCAL with PS

PS = Positive Selection

Step1: Annotate mutations in the Spike protein.

This step is similar to Step1 of the manual mode above.

Step2: Annotate mutation phenotypes with PS

We will use positive selection to annotate mutation sites.

python vocal/Mutations2Function.py -i results/variant_table.tsv -a data/table_cov2_mutations_annotation.PS.tsv -o results/variants_with_phenotypes.PS.tsv

Step3: Detect/Alert emerging variants with PS

This time we will use Script_VOCAL_unified.PS.R to make a prediction.

Example:

Rscript --vanilla "vocal/Script_VOCAL_unified.PS.R" -n PS \
-f results/variants_with_phenotypes.PS.tsv \
-o results/

Or, to include a metadata file, add the -a option:

Rscript --vanilla "vocal/Script_VOCAL_unified.PS.R" -n PS \
-f results/variants_with_phenotypes.PS.tsv \
-a test-data/meta.tsv \
-o results/

✴️ Did you know?

We extended the VOCAL scoring scheme and integrated it with a machine learning model. You can try our "VOCAL-AutoMode".