Variant annotation

Variant annotation is the process of assigning functional information to genetic variants. Labelling genetic variants with further information is often a key step in the analysis of genetic data. We might, for example, wish to know whether any of the variants we are studying results in a change to the amino acid sequence of a protein product. This requires annotating our variants with functional information, which is typically done using a computational tool.
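As a minimal sketch of the amino acid example above, the snippet below classifies a single-nucleotide substitution in a coding sequence by comparing the reference and altered codons under the standard genetic code. The function name `classify_coding_snv`, the 0-based position convention and the toy sequence are assumptions made for illustration; real annotation tools also handle transcripts, strand, splicing and indels, none of which is covered here.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(cds, position, alt_base):
    """Classify a substitution in an in-frame coding sequence.

    `position` is a 0-based index into `cds` (an assumed convention).
    """
    codon_start = position - position % 3
    offset = position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]

    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # a stop codon is gained
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


# Toy coding sequence: ATG GAA GTG TAA (Met, Glu, Val, stop).
toy_cds = "ATGGAAGTGTAA"
print(classify_coding_snv(toy_cds, 4, "T"))  # GAA -> GTA, Glu -> Val
print(classify_coding_snv(toy_cds, 5, "G"))  # GAA -> GAG, Glu -> Glu
```

The two calls illustrate that substitutions in the same codon can have different consequences, which is why the codon context, and not just the position, has to be considered.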

The process of variant annotation depends on other biological databases. For the problem above, in which we are interested in a change in coding sequence, we first need to determine whether our variants of interest lie in protein-coding genes. Elsewhere on this site you will see that different databases define genes differently, so the annotation we give to a variant may differ depending on which database we used to define genes (%cite19, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4062061/).
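The dependence on gene definitions can be illustrated with a toy example. The two "databases" below, the gene name `GENE1` and all coordinates are invented for illustration and do not correspond to any real resource; the point is only that the same variant position can receive a different annotation depending on which set of coding intervals is used.

```python
# Invented coding intervals (inclusive start and end) for one hypothetical
# gene as defined by two hypothetical databases. The second database
# annotates a shorter final coding exon.
DATABASE_A = {"GENE1": [(100, 200), (300, 400)]}
DATABASE_B = {"GENE1": [(100, 200), (300, 350)]}


def annotate_position(position, gene_models):
    """Return a coding label if the position falls in any coding interval."""
    for gene, coding_intervals in gene_models.items():
        for start, end in coding_intervals:
            if start <= position <= end:
                return f"coding ({gene})"
    return "non-coding"


variant_position = 375
print(annotate_position(variant_position, DATABASE_A))  # coding (GENE1)
print(annotate_position(variant_position, DATABASE_B))  # non-coding
```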

The functional effect of a genetic variant is most conclusively demonstrated via experimentation in an appropriate model. For a single genetic variant this is often no mean feat. Computational prediction of the functional effect of a genetic variant can therefore be of great value. **An example of when computational tools have predicted a variant is likely to be interesting which is later validated experimentally**. A huge number of tools have been developed but the underlying methods for defining interesting variants can be generalised:

Epidemiologically interesting variants - is the variant associated with the phenotype of interest?
Variants in conserved regions - DNA changes with time; if a region is very well preserved, this suggests it is intolerant of variation and therefore important for survival
Structurally significant variants - these variants change a structure that is thought to be important. This is often the structure of a protein, which is what we are most comfortable with, but other genomic regions such as enhancers, promoters, methylation sites etc. could be affected
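To make the conservation idea concrete, here is a deliberately simple sketch that scores each column of a multiple sequence alignment by the fraction of sequences sharing the most common residue. The function name `column_conservation` and the toy alignment are assumptions for illustration; published conservation-based tools use considerably more sophisticated models than this.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per alignment column, the frequency of the most common residue.

    `alignment` is a list of equal-length aligned sequences.
    A score of 1.0 means the column is identical in every sequence.
    """
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n_sequences)
    return scores


# Toy alignment of four invented homologous protein fragments.
toy_alignment = ["MKTA", "MKSA", "MRTA", "MKTG"]
print(column_conservation(toy_alignment))  # [1.0, 0.75, 0.75, 0.75]
```

Under this toy measure, a variant in the first column would fall in a perfectly conserved position, while variants in the other columns fall in positions that already tolerate some change.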

It seems relatively unlikely that one of these techniques is always more valuable than the others, and it is likely that the biological context is important. To be clear - the following is all speculation. In oligogenic or polygenic disease architectures, epidemiological methods may be less valuable, as there is often no easy way to determine the prevalence of a group of variants unless a dedicated association study is undertaken. Highly conserved regions of DNA may have been protected from damage for other reasons, such as a very stable structure or better-than-average repair. Furthermore, if the phenotype of interest is largely unique to humans, we might not expect variants of concern to lie in such 'old' regions of the genome. Predicting the structural consequences of variants can be challenging, particularly with respect to 3D structure and factors related to expression. These methods are also potentially biased towards loss of function rather than gain of function mechanisms.

When choosing a tool for variant annotation, in addition to the above, it is essential to consider the data that were used to construct the predictor. Significant biases may occur. The level of annotation across human genes varies drastically. Some genes have been extensively studied for decades while others have little more than a name and a predicted protein sequence. Variants in well annotated genes are, for certain tools, more likely to be prioritised as functionally interesting than those in less annotated genes. Furthermore, some tools are metapredictors. These metapredictors combine the output of other variant annotation tools to derive a conglomerate annotation. This potentially has the advantage of leveraging the power of different methods, but if the underlying tools were developed on the same datasets this can lead to overfitting, with subsequent overestimation of the tool's accuracy and possibly poorer performance when predicting the function of novel variants.
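The idea of a metapredictor can be sketched as a weighted average of component scores. Everything in the snippet is an assumption for illustration: the tool names `tool_a`, `tool_b` and `tool_c` are hypothetical, the scores are invented, and each score is assumed to be already rescaled to the range 0 to 1 with higher meaning more likely damaging. Real metapredictors typically train a model on the component scores rather than averaging them.

```python
def meta_score(component_scores, weights=None):
    """Combine per-tool scores into one score by weighted averaging.

    `component_scores` maps a tool name to a score in [0, 1].
    `weights` optionally maps the same tool names to non-negative weights;
    equal weights are used when it is omitted.
    """
    if weights is None:
        weights = {name: 1.0 for name in component_scores}
    total_weight = sum(weights[name] for name in component_scores)
    weighted_sum = sum(
        score * weights[name] for name, score in component_scores.items()
    )
    return weighted_sum / total_weight


# Invented scores for one variant from three hypothetical component tools.
scores = {"tool_a": 0.9, "tool_b": 0.6, "tool_c": 0.3}
print(meta_score(scores))
print(meta_score(scores, weights={"tool_a": 2.0, "tool_b": 1.0, "tool_c": 1.0}))
```

If the component tools were themselves built on overlapping training data, their scores are not independent, and combining them this way does not add as much information as the number of tools might suggest.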

Some tools only predict the outcome of protein coding variants and some are only available with a paid license.

Tool	Methods	Metapredictor
Alamut Batch
ANNOVAR
CADD
DeepSequence	Conservation
PolyPhen
PolyPhen-2
REVEL
SIFT	Conservation
SnpEff
VEP