Important, for an improved readability of the manual, we suggest that you (temporarily) hide the Galaxy history and Galaxy tool panels of VINYL. You can do so by hitting the "arrow-like" buttons at the right end and left end corner of the page. To get the panels back, just hit the buttons again
Manual for the prioritazion of Genetic variants with VINYL
- 4.1 Overview and Description of the Algorithm
- 4.2 Parameters and configuration files
- 4.3 Custom configuration files
- 5.1 What am I optimizing after all? The optimality criterion and beyond
- 5.2 Brief description of the input parameters and their values
- 5.3 Brief discussion of the output
- 7.1 What is Survival analysis and why do I need it?
- 7.2 Can you explain the output format. What is the ideal cutoff?
- 7.3 How do I filter the output file?
- 7.4 Post processing and visualization of the results
2. Control population VCF files
3.Variant Annotation with Annovar
4. Algorithm for the computation of the pathogenicity score
5. Using the VINYL optimizer to find the optimal scoring system
6. Using VINYL to score your variants
7. Post processing: using Survival to find the optimal cut-off
8. A brief tutorial
Please be aware that the present user guide is focused on the description of the functionalities of VINYL. Users are encouraged to refer to the (excellent) training materials by the Galaxy development team for a more detailed guide to the usage of Galaxy
1.1 For impatient people.
To run a quick and dirty analysis on your data you can use VINYL with default settings. Although optimization (see optimizer) is highly recommended. Two input VCFs are needed, one containing genetic variants from a cohort of affected individuals, and one from a population of unaffected controls. If the latter is not available to you, you can take advantage of one of the several VCF files of genetic variants associated with geographic human populations that are available in VINYL at this link . Please be aware that ideally you should select the population that is more closely related to your cohort of patients. If you have performed a targeted resequencing study, please see the manual for instruction on how to pre-process the VCF files included in VINYL. Once you have scored your variants with VINYL can you use the "survival tool" to identify the best score cut-off for pathogenic variants. The ready-made publicly available VINYL workflow can do that for you. If you are new to VINYL anyway you still want to have a look to the Post processing: using Survival to find the optimal cut-off section, in order to understand what to do with the output.
1.2 What VINYL can do for youVINYL is a software designed to assist in variant prioritization in medium-large cohort of patients. The program computes an aggregate score, which is based on an extensive collection of publicly available annotations, in order to identify/prioritize variants that are likely to be pathogenic or have a clinical significance. In order to derive an optimal cut off score for the prioritization of variants, VINYL uses a strategy based on "survival analysis", where the pathogenicity score distribution of the affected individuals is compared with a matched cohort of unaffected individuals. To facilitate the usage of the software, VINYL is provided in the form of a public Galaxy instance, based on the Laniakea suite. To ensure the maximum level of security, VINYL uses Encrypted data volumes for the storage of the data.
Please be aware that VINYL is not a tool for the control of the quality of your VCF files, and does not apply any filter or metric for the identification of poorly supported or low quality variants. Please make sure that your VCF files were filtered and pre-processed appropriately before uploading them to VINYL. If you don't know what all this means, we strongly advise you to perform variant calling analysis on your data using our tool CoVaCS which provides a highly accurate and completely automated system for this type of analysis.
Please remeber to cite: Chiara et al 2020 if you find VINYL useful in your work!
1.3 What you can do for VINYL
Although VINYL is under constant and careful development, inconvenients might always happen. Therefore if you encounter any problem, issues or unwanted behaviour, or if you feel like some feature are missing in VINYL, which should really be added to the tool/Galaxy instance, feel free to contact us.
2 Control populations and VCF filesTo run VINYL you will need 2 files in VCF format containing genetic profiles of a "control" cohort of unaffected individuals and a cohort of affected"subjects. Ideally these 2 cohorts should be composed by a similar number of individuals, selected from matched or related human geographic populations. If for any reason you do not have a VCF file with matched controls, you can take advantage of VCF from publicly available studies of human genetic variants. A collection of such files is already available in VINYL (see below). If for some strange reasons VCF files that you upload to VINYL are not recognized as VCF and hence are not available for the analysis (which probably means that Galaxy was not able to identify the format of your file correctly), please be aware that you will be required to set the correct format for your file manually. This operation is performed by hitting the name of the file in your history tab, then the "Datatypes" menu and finally by selecting the correct format from the dropdown menu that will appear. Remember to hit the "change datatype" button when done. See this link for a very detailed reference of files datatype in Galaxy.
2.1 Reference VCF files available in VINYLVINYL provides a collection of refence VCF files which were obtained from the Ensembl genome browser repository . Please have a look at at this link to navigate through VCF files that are currently available in VINYL. These files can be easily imported to your history without the need to download and or upload them to VINYL. If you are aware of additional repositories that are missing and you really feel that should be incorporated in VINYL, please contact us
2.2 Selecting the correct control populationFor getting the best results out of VINYL, you should selected the control population that matches as closely as possible your cohort of affected individuals. In case of doubts, you can understand which control population is better for you by comparing your VCF file with VCF files available in VINYL by the means of the VCFtools-compare utility. As outlined in Figure 1, vcf compare is pretty straighforward to use. The output consists in basic metrics of similarity/dissimilarities between the 2 files. The file that displays the highest levels of similarity should be selected.
Figure 1: Comparison of VCF files
2.3 Using bedtools-intersect to match your TGSIf you have performed a targeted resequencing experiment (TGS), VCF files of control populations available in VINYL are not suitable for you as these files were obtained by WGS/or whole genome hybridization arrays, and are not limited to particular regions of the genome. In order to minimize possible ascertainment biases we strongly suggest that instead of using the complete files, you should subset these VCF to match your experimental design as closely as possible. This operation can be performed by the bedtools intersect utility, which is included in the VINYL Galaxy instance. As illustrated in the figure below all you need do is to upload a bed file of the target regions to VINYL, and then use the bedtools-intersect utility to extract the variants that are enclosed in your "target" regions from the VCF file. This process is illustrated by Figure 2:
Figure 2: Subsetting of a VCF file
3. Variant Annotation with Annovar
VINYL is designed to work on VCF files annotated by the means of the Annovar software. Currently Annovar is one of the most popular tool for the annotation of genetic variants. Annovar incorporates a large collection of databases and publicly available resources for the annotation of genetic variants, see this link for a complete list of such databases. To facilitate the annotation of your VCF files, VINYL incorporates a ready made-preinstalled version of Annovar. This version comes with a wealth of databases and resources for the annotation, which should be more than enough for most use-cases. See Table1 in the supplementary materials of the paper for a complete list of database and resources. As usual if you feel that our collection is incomplete, feel free contact us to ask the incorporation of additional resources/databases. To annotate you VCF files in VINYL, you are STRONGLY encouraged to use the instance of Annovar incorporated in VINYL. As outlined in Figure 3 you can select (ad de-select) annotation tracks simply by typing their names in the forms. To have a complete list of the annotation tracks that are currently available, you can hit the "Select All" button.
Figure 3: Using annovar in VINYL
4. Algorithm for the computation of the pathogenicity scoreVINYL computes its pathogenicity score by aggregating different annotations from publicly available resources (as provided by Annovar). A brief overview of the algorithm for the computation of the pathogenicity score is provided in the following section.
4.1 Overview and Description of the AlgorithmVINYL derives a score for the evaluation of genetic variants from the integration of different types of evidence/annotations. Each type these annotation that contribute to the calculation of the score in VINYL are informally referred to as a compoment. Currently VINYL score is formed by 12 distinct components, each reflecting a different type of functional evidence. The final scole in VINYL is nothing but a simple summation of these components (see paper). Since, however each different component/type of annotation might or might not be relevant in different use cases/for the evaluation of different pathological conditions, VINYL is designed to optimize the way in which the score is computed on a case by case basis. This is achieved by assigning a weight or importance coefficient to every component of the score. Depending on the relevance/importance of each component for the specific case, the corresponding coefficient is adjusted (increased or decreased) resulting in a different scoring system. This process is called: Score Optimization. A brief description of each component of the score, and of the criteria used its the calculation follows. See below for more details on the optimization process: Optimization.
- Score_DB:Annotation in Clinvar The aggregate score is incremented by this value if variants are annotated as Pathogenic or Likely Pathogenic according to the Clinvar database. Subtracted for variants annotated as Benign or Likely Benign. Only annotations in Clinvar that match the name of the pathological condition and/or the symptoms of the disease are considered.
- Score_RV:Allele frequency This value is addeded to the total pathogenicity score if the MAF of the variant according to public databases of human genetic variants is lower or equal to the value indicated by the AF cutoff parameter.
- Score_FE:Predicted functional effect For variants that have a predicted functional effect on the protein/mRNA, this score is added to the global pathogenicity score. Functional effects are indicated according to the Annovar annotation system. List of relevant functional effects are provided by the configuration file described below
- Score_NS:Disruptive non synonymous variants This value is summed to the global score if a non-synonymous variant is predicted to have a deleterious effect on the protein according to one or more tools for the prediction of the effect of NS variants (default CADD and metaSVM). The list of methods that should be considered for the evaluation of the effect of NS variants is specified by the keywords file. When multiple tools are specified in the keywords configuration file (see below), Score_NS/N (where N is the number of tools specified) is added on a tool base (independently for every tool that reports a deleterious effects).
- Score_OR:Over-representation in the cohort This value is added to the pathogenicity score if a variant that has a MAF<=0.01 according to public databases of human genetic variation is observed in N or more individuals in the dataset. The value of N is specified by the number of individuals parameter. As a rule of thumb N should be set to 5-10% of the size of your cohort of individuals. Defaults to 10% of the size of the cohort.
- Score_eQ:Association with eQTLs The global score is incremented by this value if a genetic variant is associated with expression Quantitative Trait Loci (eQTL) according to the GTEx database, in one or more tissues. A list of (valid) tissues, according to the GTEx notation, can be provided by the QTL file parameter
- Score_AD:Genes implicated in/or related to the pathological condition This value is added to the aggregate pathogenicity score if variants are associated with genes that are known to be implicated in the diseases. Users can (and are highly encouraged to) provide their list of disease associated genes by the means of a file in the format described below
- Score_R:Regulatory regions This value is added to the score if genetic variants are associated with regulatory regions ( like for example Enhancers or Promoters) according to the Ensembl regulatory build or to the ORegAnno database, or are associated with genomic region showing a NCER score in the top 10% (nCER90) or in the top 5% (nCER95) of the distribution. When multiple resources are specified in the keywords configuration file (see below), Score_T/N (where N is the number of resources specified) is added for every positive entry
- Score_T:TFBS This value is added to the score if genetic variants are associated with Transcription Factor Binding Sites (TFBS) according to the Ensembl regulatory build or to the a ORegAnno database. When multiple resources are specified in the keywords configuration file (see below), Score_T/N (where N is the number of resources specified) is added for every positive entry
- Score_M:miRNA This value is added to the score if genetic variants are associated with miRNA binding sites according to the Ensembl regulatory build or to the ORegAnno database. When multiple resources are specified in the keywords configuration file (see below), Score_M/N (where N is the number of resources specified) is added for every positive entry.
- Score_G:GWAS This value is added to the score for genetic variants associated to a phenotypic trait relevant for the disease according to a Genome Wide Association Study (GWAS). Similar to score_DB only annotations that match the name of the pathological condition and/or the symptoms of the disease are considered. Symptoms can be specified via a plain text configuration file. See below
- Score_SP:Splicing This value is added to the score for genetic variants predicted to have a disruptive effect on splice sites according to the dbscSNV database, version 1.1. Both the AdaBoost and Random Forest predictors are available.
Figure 4: Parameters for the computation of the score
4.2 Parameters and configuration filesBeyond coefficients for the weight of each different type of scores, VINYL requires several configuration files that instruct the tool in the calculation of its pathogenicity score. Although some of these files are not mandatory, it is highly suggested to incorporate into VINYL as much information as possible in order to obtain the best results. An example of each of these files can be found at the following link. Detailed description of all the configuration files and input parameters of VINYL are provided in the following section.
- Annovar keywords used for the computation of the score This is a configuration file that specifies the keywords that are used by VINYL for the extraction of relevant annorations from VCF files and for the computation of the pathogenicity score. Names of these keywords need to match exactly those used by Annovar for the annotation of your VCF. If you are unsure of what this might possibly mean, please refer to the VCF file format specification at specification. Details and specifications of the annotation provided by Annovar are detailed in the header of your VCF file in the Meta-Inofrmation lines. Annotations provided by Annovar are preceded by the tag #INFO. An example can be observed in Figure 5:
- AF keywords: specify the databases/resources that are used to obtain Allele Frequency annotations. When multiple resources are provided the maximum value is considered
- NStool keywords: specify the tools for the evaluatuion of the functional effects of Non Synonymous variants that are considered by VINYL for the computation of the Disruptive non synonymous variants score. More than one tool can be specified. Defaults are CADD and metaSVM. However VINYL supports the complete collection of more than 25 tools as available from the dbNSFP database. Please refer to dbNSFP for the complete list/collection of tools that are available.
- Effect keywords: these keywords set the columns that are used by VINYL to extract the predicted annotation of the functional effects of the variants. If values associated with these keys match any of the deleteriuos effects as listed in the Functional effects file the pathogenicity score is incremented
- Splice keywords: specify the databases/resources that are used to evaluate the effects of genetic variants on splice site. In the current implementation of VINYL these correspond with the dbscSNV_ADA_SCORE and the dbscSNV_RF_SCORE. Both scores are derived from the dbscSNV database
- Reg keywords: specify the databases to be considered for the annotation of regulatory regions. Defaults to ORegAnno_REGULATORY_R and ENSEMBLReg. The nCER90 and nCER95 databases can be specified as well.
- tfbs keywords: specify the databases to be considered for the annotation of Trascription Factor Binding sites regions. Defaults to ORegAnno_REGULATORY_TFBS and ENSEMBLTFBS
- mirna keywords: specify the databases to be considered for the annotation of miRNA binding sites. Defaults to ENSmiRNA and ORegmiRNA
- gwas keywords: these keywords are used to indicate the resources to be used for the annotation of SNPs implicated with phenotypic traits of interest (seebelow) according to a Genome Wide Association Study. At the time being, the annovar database incorporated in VINYL includes only one resource for this type of annotation, the GWAS reference database, as obtained from the The NHGRI-EBI Catalog of published genome-wide association studies
- List of deleterious functional effects This configuration file specifies the predicted functional effects that are to be considered highly deleterious by VINYL. The file has a very simple format: each functional effect is provided in a separate line. The convention used to indicates these effects is the same as that applied by Annovar. A default file is already incorporated in VINYL, which provides an example of the file format. This is a simple text file with one "functional effect" per line. See the Annovar Documentation for thorough explanation of the annotation of predicted functional effects by Annovar.
- List of symptoms/phenotypic conditions related to the disease This file provides a list of symptoms or related keywords that are used by VINYL to screen Clinvar Annotations and identify variants that have been implicated in similar pathologies or phenotypes. The file is provided in a simple text format, where each of the keyword needs to be reported on a separate line. See here for an example of a valid file. Users are encouraged to check the controlled vocabulary of names and symptoms associated with the pathological condition of interest in OMIM .
- Disease name Name or functional description of the pathological condition. This parameter is used to perform a soft check of the annotation in Clinvar and to identify variants that have been previously implicated in the disease. As outlined above, Clinvar annotations that match the value provided by this parameter and/or the symptoms of the disease are considered for the computation of the pathogenicity score. This parameter is highly recommended. Users are encouraged to check the controlled vocabulary of names and symptoms associated with the pathological condition of interest in OMIM. If the patients are affected by different pathological conditions, multiple names can be provided, sepatared by a "#". For example: "cardiomiopathy#dilated" specifies the keywords "caridiomyopathy" and "dilated".
- Disease genes This file provides a list of genes that have been previously implicated in the disease or in similar pathological conditions. The score for Genes implicated in/or related to the pathological conditionis computed based on the list of genes indicated in this file. The file is in simple text format, with each gene on a separate line. An example is provided here
- eQTL file This configuration file provides a list of tissues that are used by VINYL for the annotation of eQTL and the scoring of variants associated with eQTLs in those tissue. Names of tissues need match names used in the GTEx project. The format is analogous to several other configuration files in VINYL, a simple text file with the name of the tissues in a separate line. See here for an example.
- Over-representation score cut off Cut off value for the over-reprentation score. The increment value specified by Over Representation score is addeded to the pathogenicity score only for variants that have an allele count in the cohort equal to or greater than this cut off. As a rule of thumb, this should be set to approximately 5-10% of the size of your cohort of individuals
- Allele frequency cutoff Cut off value for the Allele frequency score. The value specified by scoreAF is addeded to the pathogenicity score only for variants that have an allele frequency lower or equal to this cut-off value
- Hereditary model of the disease parameters Currently VINYL supports 4 (mutually exclusive) types of disease hereditary model: Autosomic Dominant, Autosomic Recessive and X-linked Dominant and X-linked recessive. Disease models can be specified by the means of the AD (Autosomic Dominant),the XLD (X-Linked Dominant) and by the XLR (X-Linked Recessive) parameters. These parameters accept logical values: T=TRUE, F=FALSE. When AD is set TRUE (Default) variants are scored by using an Autosomic Dominant model of the disease, otherwise when AD is set to FALSE the disease model is set to Autosomic Recessive and only homozygous variants are considered. A similar logic applies for XLD (X-linked dominant) and XLR (X-linked recessive). Please notice that the models are mutually exclusive, and that VINYL will raise an error if more than one model is set to TRUE. In VINYL variants that are not inherited in a manner consistent with the disease model, for example variants that not located on the X-chromosome when the model is X-linked dominant, or heterozygous variants when the model is AD are penalized by a 40% linear reduction of their score.
Figure 5A file with default values that should match most of use cases is already incorporated in VINYL. This file however can by modified simply by opening it in any text editor. The format is very straightforward: a 2 column tab delineated file where the first column indicates the name of the keyword (as defined in the header) and the second column the type. See below for an example of how to provide custom values. Currently VINYL accepts 8 different types of "keywords":
Figure 6: Configuration files
4.3 Custom configuration filesAll the configuration files incorporated in VINYL can be personalized/adjusted to the specific needs of the users. Simply by opening/editing them in a text editor. While the editing/revision of the file that contains the symptom list, or the file that provides a list of gene already associated with the pathological condition of interest is very straighforward, and indeed you just need to add edit or gene names or symptoms, one per line. The modification of the file with the keywords and/or the associated file with the list of functional effects of genetic variants, might be less obvious. To illustrate a special case of the editing of these files we will take of different types of annotations are included in the VCF files provided by the DDD project to illustrate how novel features could/should be added to your configuration file. As illustrated in the previous section the types of annotations and information that are already included in your VCF file should be already listed in the header of the file itsef. Please refer to the VCF file format specification at specification to understand which type of information and annotations are available to you. From the file, it should be clear that in general 2 distinct types of features/annotations can be incorporated in VCF files:
- Qualifier values or keys: keys are a type of annotation which is provided in the form of a qualifier (or key) followed by a value (can be a string, a number etc...) in vcf annotation lines, the key and the values are separated by a = symbol. VINYL stores these values in the same way as they are represented in the file. Basically an associative array is created, where keys and values are stored
- Bolean values or flags: flags are labels that can be attached to the annotation but do not store any value per se. These annotations can be seen as a simple string that might or might not be present in the annotation line. The relevant information in this case is presence/absence of the flag itself. In VINYL, for uniformity flag values are still stored in a associative array. To circumvent the fact that flags are not associated with a value, contrary to keys, in VINYL we treat flags as special forms of keys where the key and the associated value are identical. So, bolean values are processed by VINYL as a special form of qualifiers with an identical key=>value pair.
A real case exampleTo illustrate a couple of examples of how novel keys and values can be introduced in the computation of the score calcuated by VINYL, we will take advantage of a couple of features, included in the annotated VCF files provided by the DDD consortium, which are not normally incorporated in Annovar. These include:
- Annotations of allele frequencies. Type: Key, name: DDD_AF, Description: "DDD internal allele frequency."
- Annotations of de-novo variants according to Denovo gear. These are specified by 2 distinct flags: DENOVO-SNP and DENOVO-INDEL
Adding the DDD_AF annotation to the set of annotations that are used for allele frequency is pretty straightforward: all you need to do is open your keyword configuration file, and add a line with the name DDD_AF and type AF (allele frequency) of the novel annotation. The process is illustrated in figure 7 A. To show a different use case, in this example we will add the DENOVO-SNP and DENOVO-INDEL keywords to the component of score associated with functional effects. In this case, 2 files need to be edited: the keyword file and the functional effect file. Editing of the keyword file is straightforward: you just need to enter a couple of lines with the name of the nover features, followed by a tabulation and by the keyword Effect. (Figure 7B). At this point all you need to do is to update the effects configuration file accordingly, by entering the names/annotations of the novel effects that could/should be considered deleterious. Since DENOVO-SNP and DENOVO-INDEL are bolean flags (see above), as explained in the previous section, in VINYL the values associated with these annotations are identical to the name (keys) of the annotation themselves. So basically all we need to do is to add DENOVO-SNP and DENOVO-INDEL to our effects configuration file as well. Importantly this applies to all the Flag types of annotation. Please be aware than instead when qualifier values are added to this component of the score users should take complete responsibility and control which type of values are/can be associated with each qualifier
Figure 7: Configuration files
5. Using the VINYL optimizer to find the optimal scoring systemVINYL includes a small but powerful utility, the VINYL-optimizer that uses genetic algorithms to perform a thorough exploration of the parameter space, in order to derive the optimal weights for every component of the pathogenicity score. The VINYL optimizer is available as a stand-alone tool in the VINYL Galaxy instance. The required input for this tool is two VCF files, one from a control population and one from ta cohort of affected individuals (but you should be already familiar with that concept at this point). We strongly reccommend that the optimizer should be executed every time that you analyse new data. Indeed, different scoring systems and weights are usually derived/associated with different types of pathological conditions. See the following sections for a description of the principles behind the optimizer and advices on its usage.
5.1 What am I optimizing after all? The optimality criterion and beyondSeveral studies (Cirulli et al 2015, Lee et al 2014, Moutsianas et al 2015, Guo et al 2016, Li et al 2008) have reported that populations of affected individuals are expected to harbor an excess of deleterious or slightly deleteriuos variants at disease associated loci, with respect to a matched population of unaffected controls. VINYL uses this principle to optimize the computation of its pathogenicity score. As illustrated above, the pathogenicity score computed by VINYL is the result of the aggregation of different types of annotations that are scored using a different weight/importance (or if you prefer numeric values). To optimize the balance between these different sources of information the VINYL optimizer uses genetic algorithms to explote the parameter space in order to find the combination of values that maximizes the number of potentially pathogenic variants identified in the population of affected individuals, while at the same time minimizing the number of potentially pathogenic variants identified in the control population. In simple words, what the VINYL optimizer does is to test several different combinations of values for the calculation of the score. The optimal scoring system and the corresponding threshold for the identification of potentially pathogenic variants are established by an iterative survival analysis based on the Wang Allison method. For every scoring system, a range of possible cut-off values spanning from the maximum score to the minimum score with an interval of 0.5 are evaluated, and the number of predicted potentially pathogenic variants identified in the populations of affected individuals and in the controls are recorded. A Fisher’s Exact test is used to test the significance of the over-representation of likely pathogenic variants in the population of affected individuals. Finally, the scoring system (and the corresponding threshold value) that maximizes the difference between the number of potentially pathogenic variants identified in the population of affected individuals with respect to the control population, and that, at the same time, minimizes the number of potentially pathogenic variants identified in the control population is selected
5.2 Brief description of the input parameters and their valuesThe parameters used by the VINYL optimizer are the same as the parameters required by VINIL. Since hovewer optimization is performed by a search in the parameters space (that is by trying different combinations of parameters), users in this case are required to provide an interval instead of a single value for the numeric parameters. This is done by specifying a minimum and a maximum value for every parameter, as illustrated in the Figure below. The optimizer will then compute and test all the possible combinations for all the parameters. Please be aware that this operation can require a few hours, typically 2-3 hours (depending on the size of your VCF files) For a thorough discussion of the parameters and their meaning, please see 4.2 Parameters and configuration files. See figure 6 for an overview of the interface of the VINYL optimizer.
Figure 8: Running the optimizer
Discarding parameters/setting different maximum valuesAlthough, by default VINYL is configured to allow the same maximum and minimum value for every score component weight, thus giving the same importance to different types of annotations, the optimizer can also be configured to disregard/minimize some of the score components. As you can see from Figure 8 (above), the minimum and maximum importance of each parameter is specified by a range (from 1 to 20) of numeric values. These represent the parameters search space. If you want that for some specific reasons one or more of the parameters are fixed to a maximum (or minimum) level of importance during the optimization process, all you have to do is to restrict the respective search space by modifying the minimum and maximum value. If you want to provide fixed values you can simply provide the same threshold both for the minimum and the maximum values. Ad exemplified in figure 9.
Figure 9: Personalizing the scoreIn this case as you can see, we have decided to keep the component of the score associated with databases and resources for the annotation of variants associated with a disease (score_DB) at the maximum value of 20. Thus giving to this score the maximum importance. Conversely, the importance of scores associated with allele frequencies has been minimized (score_RV is set to 1 both for the minimum and maximum value). Finally the importance of the scores associated with the predicted functional effects (score_FE) is limited between 5 and 10, implying that this component needs to be at least 5 times more important than the component associated with allele frequency, but at best it should weight half the importance (10 compared to 20) of the component of the score associated with known disease causing variants. By setting the limits of each score component arbitrarily, based on their specific needs, users can create highly customized scoring system, and are completely free to maximize or minimize the relevance of different types of functional annotations on a case by case basis.
5.3 Brief description of the outputThe output of the VINYL optimizer tool consists in a large table which ricapitulates all the values (weights) used for the calculation of the score components, and (last 4 columns) a p-value for the enrichment in likely pathogenic variants in the affected individuals,the number of total variants and the number of likely pathogenic variants identified in the population of affected individuals and in the controls, respectively. The output of the optimizer is already sorted (in descending order, by p-value), so the "optimal" combination of weight values of weights combination, is always reported at top, as Illustrated by Figure 7
Figure 10: Output of the optimizerAs, reading the relative importance of each component directly from this table file might not be exactly straightforward, VINYL incorporates a small utility to allow a visual comparison of the features of the best X scoring models. This utility, which is called barplot and can be found under the VINYL menue of the Galaxy implementation of our tool, provides a nice and immediate means to compare if/how/where different scoring systems derived by VINYL give different importance to distinct features or annotations. The barplot utility, requires as its sole input an output file in the same tabular format produced by the VINYL optimizer. The number of models to be compared can be specified by the tobecompared option. A minimum of 10 to 30 models can be selected in the current implementation. An example of how to run this tool and of the type of output that can be obtained is presented in figure 11
Figure 11: Score composition
6. Using VINYL to score your variants
6.1 I have my optimal scoring system. What do I do next?At this point you are almost done. All you need to do is to run VINYL twice, using the scoring system suggested by the optimizer. Be aware that you will need to run the tool twice: first on you cohort of affected individuals and then on the control population. It is crucial that both files are analysed using identical values for all the parameters. For this reason we highly recommend to submit this as analysis as a "multiple dataset" in Galaxy, as shown in Figure 8. Once you have done this you are ready to run the VINYL survival tool.
Figure 12: Running VINYL with optimized parametersPlease notice that the output of vinyl will consist in 2 files:
- a vcf file with the suffix scored.vcf that is equivalent to the input file, but has the VINYL score incorporated in the information field. Meaning that the file is now annotated with the VINYL patogenicity score (Qualifier VINYL_score)
- a tabular file with the suffix scored.summary. This file provides a detailed table, in tsv (tab-separated) format with the breakdown of the calculation of the score for every variant,as well as the final score as computed by VINYL. An example of this table is provided in Figure 13. The first 4 columns indicate the coordinates. Columns 5 to 8 report the number of individuals carrying the variant, as well as the associated gene (if any). Finally columns 9 to 21 reports the values of the different score components,and column 22 the final score
Figure 13: Breakdown of VINYL scoresThis tabular file can opened/inspected with any spreadsheet editor to visualize/compare the scores associated with differen variants, or alternatively (see below) can be used in conjuction with the heatmap utility, as described below to understand the patterns of scores and functional effecs associated with different genetic variants in your dataset.
7. Post processing: using Survival to find the optimal cut-offIf you have scored your pair of vcf file (affected and controls) with VINYL the last step of the analysis is the the execution of the "survival" analysis tool for the identification of the ideal score cut-off.
7.1 What is Survival analysis and why do I need it?Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. In VINYL principles derived from Survival analysis are used to verify whether (as should be expected) your population of affected individuals is enriched in "High scoring-likely pathogenic-variants" with respect to the population of matched controls. This analysis is performed in a very straightforward manner: a survival function, in our case a simple threshold is applied to both set of variants, and the number of variants that "survive" (i.e are above the cut-off) is counted in both populations. A Fisher test statistics is used to evaluate if the number of surviving variants (i.e high scoring variants with a score above the cut-off) is significantly increasead in the population of affected individuals. Ideally the ideal value for the identification of the cut-off score is value minimizes the P-value of this test, that is the threshold that finds the highest enrichment of high scoring variants (surviving variants) in the population of affected individuals. Importantly since in the survival analysis tool reports also the number of variant with a score above the cut-off threshold also in the population of matched controls you can use this value as a reference for the expected ratio on False Positives (FP) prioritized variants. Indeed the idea here is that unaffected individuals should not have in their genomes genetic variants potentially associated with a pathological condition.
7.2 Can you explain the output format. What is the ideal cutoff?The output format consists in a simple table, where for each cut-off value(colum1) you obtain the number of variants with a score above that cut-off in the "affected" population (colum2), the equivalent number for the control population (column3), the p-value according to a Fisher test for the over-representation of "surviving or high scoring variants" in the population of affected individuals (column4) and the associated fold change (i.e the relative enrichment of high scoring variants-if any- in the cohort of affected individuals). Ideally the optimal cut-off value is the value that maximizes the fold change and minimizes the Fisher-test p-value. Anyway any cut-off value associated with a relatively low (5 in 1000 for example) number of high scoring variants in the control population (these should be your false positives) and a relatively high (150 in 1000 for example) number of such variants in the affected population is a good choice. Indeed, as explained above while the number of variants above the threshold identified in your population of matched controls, are likely to represent false positives depending on the condition/disease under study, it might be that the diagnosis of the pathological condition is not so straightforward, and/or that affected but undiagnosed individuals might be included also in this set. So we strongly reccomend that the cut-off score identified by VINYL should not be considered as a stringent and absolute cut-off value, but rather as an indication or a starting point to direct the attention of investigators/clinicians towards the most relevant genetic variants. In this respect, we strongly encourage users to inspect the VINYL output file to identify/refine their analys Once you have established a reasonable cut-off, see the following section for the post processing of your files
Figure 14: Output of the VINYL survival analysis/
7.3 How do I filter the output file?To extract all the variants above a user defined cut-off from one of the output files of VINYL you can take advantage of the "Filter and Sort" utility in Galaxy. For example from the VINYL tabular output file, it is pretty simple to extract all the variants that have a score above a pre-defined cut-off. All you need to do is to filter on the last column (18) by using this expression: c18>"CutOffValue". A sketch of this procedure is provided in Figure 10 At this point you are ready to analyse your candidate variants.
Figure 15: Filtering VINYL output
7.4 Post processing and visualization of the resultsVINYL incoportates several helper utilities that can be used to perform post processing of the results and to identify interesting patterns in your dataset. These tools take in input different types of output files produced by VINYL, and can be used to execute complex analyses or visualize patterns in your data. The Heatmap tool, is a helper utility in VINYL that can be used to obtain a graphical visualization of the main features and functional annotations. This tool takes in input the detailed tabular file produced by VINYL (see above). A score cut-off is the second parameter, as plotting all the variants included in your files would be impractical/difficult to visualize. Ideally this can be the same cut-off value (or very close) to that was used for the selection/prioritization of variants. An example of the execution of this tool, which can be found under the heatmap menue in VINYL is provided in Figure 16. In this case a cut-off threshold value of 30 was used.
Figure 16: Filtering VINYL outputFor every variant (on the rows) dark (purple) values in the figure indicate the most relevant functional annotations/components of the score associated with that particular variant PCA Principal Component Analysis, or PCA is a techinque used in statistics to visualize in a more coeherent and simple manner the main features of large datasets composed of several observations. In VINYL PCA analyses of genetic variants can be used to observe/identify groups of patients/individuals with similar mutations or similar genetic profiles. This type of analysis can be performed by a dedicated tool, called PCA. As illustrated in Figure 17, the input to this utility is a pair of VCF files scored by VINYL. In particular the vcf file of the population of affected individuals and the equivalent vcf file for the population of matched controls.
Figure 17: Filtering VINYL outputAs it can be clearly observed from the figure, the PCA analysis on the Sorrentino et al dataset clearly delineates 2 groups of patients. Corresponding to 2 different clusters of orange points. Both clusters are well separated from the un-affected individuals, indicated by purple dots. Burden analysis is a special type of analysis, used in clinical studies to verify if a particular gene/genes accumulates deleterious mutations in a population. VINYL incorporates a dedicated tool, called boxplot that is used to compare distributions of VINYL scores of you population of affected and unaffected individuals in different genes. As it can be imagined from the name, distributions of scores are compared by means of boxplots (or whisker) plots. A Wilcoxon Sum and rank test is used to compare the distributions, and derive p-values for the significance (if any) of the increase in VINYL score distributions associated with the population of affected individuals. A separate boxplot (affected and not affected) is plotted for every gene, and the p-value for the comparison is reported on the top of the figure. If this p-value is smaller than 0.05 the cut-off that is normally used to establish statistical significance, the gene is condisered to accumulate an excess of deleterious or likely deleterious mutations in the population of affected individuals. As depicted in Figure 18, inputs to this tool consist in
- A Tabular files of VINYL scorers for the population of affected individuals
- A Tabular files of VINYL scorers for the population of matched controls
- A cut-off value used to identify likely deleterious variants. All variants will be considered. This is used only to plot the dotted blue and red lines that you see in the figure
Figure 18: Filtering VINYL outputAs it can be clearly observed from the figure, although no single genes seem to be statistically significantly increased in deleterious variants in the Forleo et al dataset, the FKRP gene for example shows a marginally significant, but evident increase in VINYL scores in the population of affected individuals.
8 A brief tutorialIf at this point you feel that you are ready to use VINYL, but you are still a little bit confused, this section will guide you into a brief (estimated time about 15 minutes) analysis of a reference dataset used for the testing and development of the VINYL (see Chiara et al 2019 and Forleo et al 2017 for more details)
8.1 Where do I get some files to play with?The following link (or See under Shared Data -> histories) provides a dataset which can be used to test the main functionalities of VINYL. The dataset is composed by c.a. 30 files, which include inputs, outputs and configuration files for running VINYL. As shown in Figure 19, to import these data, all you need to do is to hit the Import history button at the top of the page. All the files will be imported to your current history automatically.
Figure 19: Importing the test dataset into your work-spaceAs mentioned above, this dataset is composed by about 30 files, which can be divided into 3 main categories: Input files, Configuration files and Output files. A brief description of these files will be provided in the following section.
Input filesTwo input files are provided which are numbered from 1 to 2. These are vcf files which contain genetic profiles of individuals affected by cardiomyopathies, and of course of a population of healthy controls. All the files have already been annotated using the Annovar tool incorporated in VINYL. See File8 and File11 for the output. Please notice that the file of the control population (TSI, Toscani in Italia, see the paper) has already been subset (by the means of bedtools intersect tool,see above) to the genomic regions included in the TGS resequencing panel designed by Forleo et al. All the files are ready to be used with VINIL.
- 1 Sorrentino_etal2017.vcf: this file contains the complete list of 1035 genetic variants, identified in a population of 38 individuals affected by cardiomyopahties in the Forleo et al paper.
- 2 TSI_cardioH.vcf: this is the VCF file for the control population. The file is already annotated and trimmed, to contain only variants included in the target regions
Configuration FilesFile 3 to File 7, are some examples of configuration files for VINYL. Please refer to this section to have a more complete overview of VINYL's configuration file. A copy of these configuration files can also be obtained from
- 3 disease_related_genes_example_file : example of a configuration file used to provide a list of disease related genes
- 4 functional_effects_example_file : example of a configuration file used to provide a list of Functional effects which should be considered deleterious
- 5 keywords_configuration_example_file: configuration file containing a list of keyword, used to specify which databases, resources and annotation should be considered by VINYL for the computation of the pathogenicity score
- 6 symptoms_configuration_example_file : example of a configuration file used to provide a list of symptoms of the disease
- 7 eQTL_configuration_example_file : example of a file used to provide a list of tissue to be used for the annotation/scoring of variants associated with eQTLs
Output FilesThis toy dataset includes also a series of output files obtained by running VINYL on the files described in the previous section. In particular:
- File15 optimizer : contains the results of the VINYL optimizer for this dataset. See below for how to reproduce there results
- File17 to File22 : contain the results of the execution of VINYL on File 1 and File 2. VINYL generates 3 output files for every input file received: a file in tabular format which contains a list of the variants, the main annotations and the VINYL score, a VCF where the VINYL score annotation is incorporated as and additional field and finally a log file that contains possible errors or problems encountered during the execution of VINYL
- File23 survival on Sorrentino et al vs TSI: : demonstrates the final results of a survival analysis comparing File20- that contains pathogenicity scores computed on the population of "affected" individuals- and File17 which contains the scores for the population of healthy individuals. See the section above for more details on the interpretation of this output file. From this file it should be pretty evident that the optimal cut-off values in this case are 30.4 or 29.9. See below for how to reproduce these results