Detection of de novo variants


Description

Identification of de novo variants in a trio (child + both parents). The output includes posterior de novo probabilities as well as the percentage of reads with the alternative allele.

Prerequisites

To apply the de novo algorithm in FILTUS, the following requirements should be met:

NOTE: If your variant files don't meet the above requirements, there may still be hope! In many cases you can identify potential de novo variants using a family based gene sharing with a dominant model. For instance, if all you have are individual variant files for a child and both parents, this would be the thing to do. The downside is that you wouldn't get posterior probabilities - or any other measure of classification strength - and probably quite a few more false positives.

Dialog


To open the de novo dialog, choose De novo variant detection in the Analysis menu. Any filters (e.g. PASS) should be applied before opening the dialog. The entry points of the dialog are as follows:

Trio samples
Indicate the sample numbers, corresponding to the sample order in the Loaded samples window in the main FILTUS area. Alternatively, you can use the syntax ID <string> where <string> is a unique identifier for the sample name (as given in the variant file). For example, if the sample name of the child is "Trio1_child" you can write ID child. If the string is not unique (i.e. if there are multiple loaded sample names containing "child") you will be warned. If you don't know who's who in the trio, you can find out using the Quality Control Plots (see Tip 3 on that page).

Finally, you should also indicate the gender of the child, to ensure correct handling of X-linked variants.
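The ID <string> lookup can be pictured as a substring match that must hit exactly one loaded sample name. The following Python sketch only illustrates that idea; it is not FILTUS code. The function name resolve_sample, the case-sensitive matching and the sample names are assumptions made for the example.

```python
def resolve_sample(query, sample_names):
    """Return the 1-based sample number whose name contains `query`.

    Hypothetical helper: raises ValueError unless exactly one name matches.
    """
    hits = [i for i, name in enumerate(sample_names, start=1) if query in name]
    if len(hits) != 1:
        raise ValueError(f"'{query}' matches {len(hits)} sample names; expected exactly 1")
    return hits[0]


# Made-up sample names, listed in the order they were loaded
samples = ["Trio1_child", "Trio1_father", "Trio1_mother"]
print(resolve_sample("child", samples))
```

A query such as "Trio1" would match all three names here and is rejected, which mirrors the warning described above.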

Mutation rate
This is used in the algorithm as the prior probability of a mutation at a given position in a single meiosis. Default value: 1e-8. The algorithm is usually not very sensitive to this parameter, but the posterior probabilities will be affected if you change it radically.

Allele frequencies
Indicate a column containing frequencies for the ALT alleles. The Missing entry value is substituted whenever the column does not contain a number, or if the variant is multiallelic. If your variant file does not have frequency data, the Missing entry value will be used for all variants. Without correct frequencies, the program will still identify the same potential de novo variants, but the posterior probabilities may be less accurate.
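The substitution rule for the Missing entry value can be sketched in a few lines of Python. This is an illustration of the rule as described above, not FILTUS code; the function name alt_frequency and the value 0.1 used for the missing entry are placeholders chosen for the example.

```python
import math


def alt_frequency(entry, n_alt_alleles, missing_value):
    """Return the ALT allele frequency to use for one variant.

    Falls back to `missing_value` if the variant is multiallelic or the
    column entry is not a finite number.
    """
    if n_alt_alleles > 1:
        return missing_value
    try:
        freq = float(entry)
    except (TypeError, ValueError):
        return missing_value
    return freq if math.isfinite(freq) else missing_value


# (column entry, number of ALT alleles); 0.1 is an arbitrary placeholder
for entry, n_alt in [("0.012", 1), (".", 1), ("", 1), ("0.2", 2)]:
    print(repr(entry), n_alt, alt_frequency(entry, n_alt, missing_value=0.1))
```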

Output filters
The purpose of these filters is to reduce the number of false positives in the output. A true heterozygous de novo variant is expected to present with an ALT percentage close to 50% in the child, and 0% in the parents. In practice some slack is recommended, e.g. child > 30% and parents < 5%.

One may experiment with these filters for other purposes too. For example, to look for de novo mosaic variants in the child one could try a very loose cutoff for the child, e.g. child > 10%, while requiring parents = 0%. Conversely, for variants inherited from a mosaic parent, something like parents < 25% and child > 40% would be sensible, without resulting in too many false positives. Of course, these are merely suggestions whose validity depends heavily on the actual contents of the variant file (e.g. the quality of the variants and the parameters of the variant calling).
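To make the thresholds concrete, here is a small Python sketch of an ALT percentage filter. It assumes that the ALT percentage is derived from a VCF-style AD entry ("ref,alt" read counts); how FILTUS itself computes the percentage is not stated here, and the function names are made up for the example.

```python
def alt_percent(ad):
    """ALT read percentage from an AD string such as '12,10' (ref,alt counts).

    Returns None if the entry is missing, malformed or has zero depth.
    """
    try:
        ref, alt = (int(x) for x in ad.split(",")[:2])
    except (AttributeError, ValueError):
        return None
    total = ref + alt
    return 100.0 * alt / total if total > 0 else None


def passes_output_filter(child_ad, father_ad, mother_ad,
                         child_min=30.0, parent_max=5.0):
    """True if child ALT% > child_min and both parents' ALT% < parent_max."""
    child, father, mother = (alt_percent(x) for x in (child_ad, father_ad, mother_ad))
    if None in (child, father, mother):
        return False
    return child > child_min and father < parent_max and mother < parent_max


# Typical heterozygous de novo pattern (made-up read counts)
print(passes_output_filter("12,10", "30,0", "25,1"))
# Low ALT fraction in the child: rejected by default, kept with a looser cutoff
print(passes_output_filter("40,5", "30,0", "25,0"))
print(passes_output_filter("40,5", "30,0", "25,0", child_min=10.0))
```

The looser child cutoff in the last call corresponds to the mosaic search suggested above.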

Summary
A summary of the findings is printed here. The identified variants are shown in the main FILTUS window; to inspect them you must close the de novo dialog.

Practical tips

Tip 1: To save the results, first close the de novo dialog, and then select Save main window content in the File menu.

Tip 2: When browsing variants in the main FILTUS window, you can right click on any particular variant to see details about that variant for all the samples.

Example

Among the test files included in the FILTUS package is the file trioHG002_22X.vcf, which contains variants on chromosomes 22 and X from the exome sequencing of a trio. The variants are annotated with Annovar. (The trio is the Jewish trio from the Genome in a Bottle Consortium, described here. The complete exome sequencing variant set is publicly available from this ftp site.)

The following steps show a typical de novo analysis in FILTUS.

  1. Load the trioHG002_22X.vcf file in the testfiles folder. Make sure the "Keep 0/0" box is checked.
  2. To remove the worst noise, apply column filters "VCF_FILTER - equal to - PASS" and "DP - greater than - 9".
  3. Find out who's who in the trio: Open the QC plot dialog (in the Analysis menu) and just press the first "Plot!" button. The Private variants plot should convince you that sample 1 is the child, which in combination with the Gender plot implies that the sample order is 1=boy, 2=father, 3=mother.
  4. Close the plot dialog and open the de novo dialog. Enter the sample order and child gender as deduced above, and choose "1000g2014oct_all" as the allele frequency column.
  5. Press Compute.
If you didn't set any output thresholds, the summary display should say that 4 variants were found. To examine them, first close the dialog. By right clicking any individual variant you'll see genotype details from all loaded samples about that variant.

A few words about interpreting the findings: 1) Posterior probabilities should be taken with a grain of salt, and 2) if you have access to the BAM files, a closer look at the actual reads (e.g. in IGV or similar software) is always a good idea.

Algorithm

FILTUS uses the GT field of the genotype columns to recognize the following de novo genotype patterns (father + mother = child):

A variant is treated as X-linked in this context only if it is located outside of the pseudoautosomal regions PAR1 and PAR2 on the X chromosome. Multiallelic generalizations of the above patterns are also caught.

However, combinations with any of the following properties are treated as benign and discarded from further analyses:

For each variant where a de novo genotype combination is identified, FILTUS reports the posterior de novo probability. This is computed using a Bayesian formula involving the population allele frequencies, the genotype likelihoods reported by the variant caller (in the PL fields) and the prior mutation rate specified by the user.
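To give a feeling for how allele frequencies, PL values and the mutation rate can combine into a posterior probability, the sketch below implements a generic textbook-style trio model in Python. It is not the formula used by FILTUS, which is not spelled out here. The assumptions are: a biallelic autosomal variant, Hardy-Weinberg genotype priors for the parents, PL values given in the order 0/0, 0/1, 1/1, and a mutation that flips the transmitted allele with probability mu in each meiosis. All function names and PL values are made up for the example.

```python
def pl_to_likelihoods(pl):
    """Convert Phred-scaled PL values (0/0, 0/1, 1/1) to relative likelihoods."""
    return [10 ** (-p / 10) for p in pl]


def hardy_weinberg(q):
    """Prior probabilities of carrying 0, 1 or 2 ALT alleles at ALT frequency q."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def transmissions(g, mu):
    """Yield (allele passed on, mutated?, probability) for a parent with g ALT alleles."""
    for allele, p_pick in ((0, 1 - g / 2), (1, g / 2)):
        if p_pick == 0:
            continue
        yield allele, False, p_pick * (1 - mu)   # transmitted unchanged
        yield 1 - allele, True, p_pick * mu      # flipped by a new mutation


def posterior_de_novo(pl_father, pl_mother, pl_child, q, mu=1e-8):
    """Posterior probability that at least one meiosis carried a new mutation."""
    lf, lm, lc = (pl_to_likelihoods(pl) for pl in (pl_father, pl_mother, pl_child))
    prior = hardy_weinberg(q)
    total = de_novo = 0.0
    for gf in range(3):
        for gm in range(3):
            parents = prior[gf] * prior[gm] * lf[gf] * lm[gm]
            for a_f, mut_f, p_f in transmissions(gf, mu):
                for a_m, mut_m, p_m in transmissions(gm, mu):
                    weight = parents * p_f * p_m * lc[a_f + a_m]
                    total += weight
                    if mut_f or mut_m:
                        de_novo += weight
    return de_novo / total


# Confident 0/0 parents and a confident 0/1 child (made-up PL values)
print(posterior_de_novo([0, 90, 900], [0, 90, 900], [300, 0, 600], q=0.001))
# Father is confidently 0/1, so inheritance explains the child's genotype
print(posterior_de_novo([60, 0, 600], [0, 60, 600], [60, 0, 600], q=0.001))
```

In this toy model the first call gives a posterior close to 1 and the second a value close to 0. The model also shows why the mutation rate matters mostly when the genotype likelihoods are weak: with less decisive PL values, the tiny prior mu is no longer overwhelmed by the data. X-linked and multiallelic variants are not handled in this sketch.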