Detection of de novo variants
Description
Prerequisites
Dialog
Practical tips
Example
Algorithm
Description
Identification of de novo variants in a trio (child + both parents). The output includes posterior de novo probabilities
as well as the percentage of reads with the alternative allele.
Prerequisites
To apply the de novo
algorithm in FILTUS, the following requirements should be met:
- The variants of the trio should be contained in a single file containing all three samples. Joint variant
calling is highly recommended.
- The variant file must have VCF-like genotype
columns, with the correct FORMAT column indicated in the input settings dialog.
- The FORMAT column (the one with entries like
GT:AD:DP:GQ:PL) must contain the fields GT, AD and PL, which
are all used in the algorithm.
- The variant file must be loaded with the keep 0/0 option
checked in the input
settings dialog.
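Before loading a file, it can be handy to check the FORMAT requirement outside FILTUS. The following is a minimal sketch, assuming a standard tab-separated VCF data line in which FORMAT is the 9th column; the function name `missing_format_fields` is hypothetical and not part of FILTUS.

```python
# Fields that the de novo algorithm needs in the FORMAT column.
REQUIRED_FIELDS = {"GT", "AD", "PL"}


def missing_format_fields(vcf_line):
    """Return the required FORMAT fields that are absent from one VCF data line.

    Assumes a standard tab-separated VCF layout, where FORMAT is column 9.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    format_fields = columns[8].split(":")
    return REQUIRED_FIELDS - set(format_fields)


# Usage: a made-up trio line with the FORMAT entry GT:AD:DP:GQ:PL.
example = "\t".join([
    "22", "100", ".", "A", "G", "50", "PASS", ".", "GT:AD:DP:GQ:PL",
    "0/1:10,9:19:99:200,0,200",
    "0/0:20,0:20:60:0,60,600",
    "0/0:18,0:18:54:0,54,540",
])
print(missing_format_fields(example))  # an empty set means all three fields are present
```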
NOTE: If your variant files don't meet the above requirements, there may still be hope! In many cases you can identify potential de novo variants using a
family based gene sharing with a dominant model. For instance, if all you have are individual variant files for a child and both parents,
this would be the thing to do. The downside is that you wouldn't get posterior probabilities - or any other measure of classification strength - and probably quite a few more false positives.
Dialog
To open the de novo
dialog, choose De novo
variant detection in the Analysis menu. Any
filters (e.g. PASS) should be applied before opening the dialog. The entry points of the dialog are as follows:
- Trio samples:
Indicate the sample numbers (corresponding to the sample
order in the Loaded
samples window in the main FILTUS area). Alternatively,
you can use the syntax ID <string>, where
<string> is a unique identifier for the sample name (as
given in the variant file). For example, if the sample name of the
child is "Trio1_child" you can write ID child. If the string is not
unique (i.e. if there are multiple loaded sample names containing
"child") you will be warned. If you don't know who's who in the trio,
you can find out using the Quality Control Plots (see Tip 3 on that page).
Finally, you should also indicate the gender of the child, to ensure correct handling of X-linked variants.
- Mutation rate:
This is used in the algorithm as the prior probability of a mutation at
a given position in a single meiosis. Default value: 1e-8. The
algorithm is usually not very sensitive to this parameter, but the
posterior probabilities will be affected if you change it radically.
- Allele frequencies:
Indicate a column containing frequencies for
the ALT alleles. The Missing
entry value is substituted whenever the column does not
contain a number, or if the variant is multiallelic. If your variant file does not have frequency data,
the Missing entry value
will be used for all variants. Without correct frequencies, the program
will still identify the same potential de novo variants,
but the posterior probabilities may be less accurate.
- Output filters:
The purpose of these filters is to reduce the
number of false positives in the output. A true heterozygous de novo
variant is expected to present with an ALT percentage close to 50% in the
child, and 0% in the parents. In practice some slack is recommended,
e.g. child > 30% and parents < 5%.
One may experiment
with these filters for other purposes too. For example, to look for de novo
mosaic variants in the child one could try a very loose cutoff for
the child, e.g. child > 10%, while requiring parents = 0%. Conversely,
for variants inherited from a mosaic parent, something like
parents < 25% and child > 40% would be sensible, without
resulting in too many false positives. Of course, these are merely
suggestions whose validity depends heavily on the actual
contents of the variant file (e.g. the quality of the
variants and the parameters of the variant calling).
- Summary:
A summary of the findings is printed here. The identified variants are shown in the main FILTUS window; to inspect them you must close the de novo dialog.
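The ALT percentage thresholds of the output filters can be illustrated with a small sketch. It assumes a biallelic variant whose AD field has the form "REF,ALT", and it takes the ALT percentage to be ALT reads divided by REF + ALT reads; how FILTUS computes the percentage internally (for instance at multiallelic sites) is not specified here. The function names and default cutoffs (child > 30%, parents < 5%) are illustrative only.

```python
def alt_percentage(ad):
    """Percentage of ALT reads from a biallelic AD string such as '20,18'.

    Returns None if the field is missing or no reads are reported.
    """
    if ad in (".", ""):
        return None
    ref_reads, alt_reads = (int(x) for x in ad.split(","))
    total = ref_reads + alt_reads
    if total == 0:
        return None
    return 100.0 * alt_reads / total


def passes_output_filters(child_ad, father_ad, mother_ad,
                          child_min=30.0, parent_max=5.0):
    """True if child ALT% > child_min and both parents' ALT% < parent_max."""
    child, father, mother = (alt_percentage(a)
                             for a in (child_ad, father_ad, mother_ad))
    if child is None or father is None or mother is None:
        return False
    return child > child_min and father < parent_max and mother < parent_max


# Usage: child has 18 of 38 ALT reads, parents have 0 of 40 and 1 of 36.
print(passes_output_filters("20,18", "40,0", "35,1"))
```

Looser or stricter cutoffs, such as those suggested for mosaic variants, can be tried by changing `child_min` and `parent_max`. Because the comparisons are strict, a requirement like parents = 0% would need its own equality check.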
Practical tips
Tip 1:
To save the results, first close the de novo dialog, and then select Save main window content in the File menu.
Tip 2:
When browsing variants in the main FILTUS window, you can right click on any particular variant to see details about that variant for all the samples.
Example
Among the testfiles included in the FILTUS package is the file trioHG002_22X.vcf,
which contains variants on chromosomes 22 and X from the exome sequencing of a trio. The variants are annotated with Annovar.
(The trio is the Jewish trio from the Genome in a Bottle Consortium, described here.
The complete exome sequencing variant set is publicly available from this
ftp site.)
The following steps show a typical de novo analysis in FILTUS.
- Load the trioHG002_22X.vcf file in the testfiles folder. Make sure the "Keep 0/0" box is checked.
- To remove the worst noise, apply column filters "VCF_FILTER - equal to - PASS" and "DP - greater than - 9".
- Find out who's who in the trio: Open the QC plot dialog (in the Analysis menu) and just press the first "Plot!" button.
The Private variants plot should convince you that sample 1 is the child, which in combination with the Gender plot implies that the sample order is
1=boy, 2=father, 3=mother.
- Close the plot dialog and open the de novo dialog. Enter the sample order and child gender as deduced above,
and choose "1000g2014oct_all" as the allele frequency column.
- Press Compute.
If you didn't set any output thresholds, the summary display should say that 4 variants were found. To examine them, first close the dialog.
By right clicking any individual variant you'll see genotype details from all loaded samples about that variant.
A few words about interpreting the findings:
- The top two variants are listed with posterior probability 1. However, they both have lower ALT percentage (ca 38%)
than you would expect from a heterozygous de novo variant. The fact that they are only 52 base pairs apart is a bit suspicious,
and we see from the genomicSuperDups column that they are in a region of segmental duplication.
All in all, it is quite likely that both of these are false positives.
- The third variant has posterior probability 0, and is almost certainly a false positive. This is a tri-allelic variant,
where the TAA > TA change is reported to be de novo. If you right click on the variant you will see that both parents actually have
plenty of reads with the alleged de novo allele.
(Note about the alleles: Don't look at the "Ref" and "Alt" columns, which are added by Annovar and do not preserve multiple alternative alleles in this file.
The correct allele columns are "VCF_REF" and "VCF_ALT" much further to the right.)
- The ALT percentages of the fourth variant look very convincing, so why does it have posterior probability 0?
The problem is the father: According to the details, he has 365 REF reads and 0 ALT, but for some reason the genotype quality GQ is 0.
This could be a mapping problem (perhaps related to the fact that the variant is a fairly long insertion - or segmental duplication), or something else.
Either way, it would be worth investigating more closely.
In summary: 1) Posterior probabilities should be taken with a grain of salt, and
2) If you have access to the BAM files, a closer look at the actual reads (e.g. in IGV or similar software) is always a good idea.
Algorithm
FILTUS uses the GT field of the genotype columns to recognize the following de novo genotype patterns (father + mother = child):
- Autosomal:
  - 0/0 + 0/0 = 0/1
  - 0/0 + 0/0 = 1/1
  - 0/0 + 0/1 = 1/1
  - 0/1 + 0/0 = 1/1
- X-linked, child is boy:
- X-linked, child is girl:
  - 0 + 0/0 = 0/1
  - 0 + 0/0 = 1/1
  - 0 + 0/1 = 1/1
A variant is treated as X-linked in this context only if it is located outside of the pseudoautosomal regions PAR1 and PAR2 on the X chromosome.
Multiallelic generalizations of the above patterns are also caught.
However, combinations with any of the following properties are treated as benign and discarded from further analyses:
- The de novo allele is 0 (= REF). Example: 1/1 + 1/1 = 0/1.
- Child genotype equals either of the parents. Example: 0/0 + 1/1 = 1/1.
- Missing genotype in any trio member.
- A male trio member is reported as heterozygous for an X-linked variant.
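The autosomal patterns and the discard rules above can be sketched in a few lines of Python. This is only one possible reading of the rules, not the FILTUS source code: it treats a child allele as de novo when it cannot be explained by the parent it is assigned to, picks the assignment that needs the fewest new alleles, and ignores X-linked variants entirely. The function names are hypothetical.

```python
def parse_gt(gt):
    """Turn a GT string like '0/1' or '1|0' into a sorted tuple of ints.

    Returns None if any allele is missing ('.').
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return tuple(sorted(int(a) for a in alleles))


def autosomal_de_novo_alleles(father_gt, mother_gt, child_gt):
    """Return the set of apparent de novo alleles, or an empty set if the
    genotype combination is Mendelian consistent or falls under a discard rule."""
    f, m, c = parse_gt(father_gt), parse_gt(mother_gt), parse_gt(child_gt)
    if f is None or m is None or c is None:
        return set()                      # missing genotype in a trio member
    if len(f) != 2 or len(m) != 2 or len(c) != 2:
        return set()                      # this sketch handles diploid calls only
    if c == f or c == m:
        return set()                      # child genotype equals either parent
    best = None
    # Try both ways of assigning the child's alleles to father and mother.
    for paternal, maternal in ((c[0], c[1]), (c[1], c[0])):
        new = [a for a, parent in ((paternal, f), (maternal, m))
               if a not in parent]
        if best is None or len(new) < len(best):
            best = new
    if not best:
        return set()                      # Mendelian consistent
    if 0 in best:
        return set()                      # the de novo allele is REF
    return set(best)


# Usage: father 0/0, mother 0/1, child 1/1 (one of the listed patterns).
print(autosomal_de_novo_alleles("0/0", "0/1", "1/1"))
```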
For each variant where a de novo genotype combination is identified, FILTUS reports the posterior de novo probability. This is computed using a Bayesian formula involving the
population allele frequencies, the genotype likelihoods reported by the variant caller (in the PL fields) and the prior mutation rate specified by the user.
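To give a feel for how these three ingredients can interact, here is a simplified Bayesian sketch. It is not the formula used by FILTUS, which is not spelled out on this page. The assumptions are: a biallelic autosomal variant, PL values ordered 0/0, 0/1, 1/1, Hardy-Weinberg genotype priors for the parents, and a symmetric mutation model in which each transmitted allele flips with probability equal to the mutation rate. All function names are hypothetical.

```python
from itertools import product


def pl_to_likelihoods(pl):
    """Convert phred-scaled PL values to relative genotype likelihoods."""
    return [10 ** (-p / 10) for p in pl]


def hardy_weinberg(alt_freq):
    """Prior probabilities of carrying 0, 1 or 2 ALT alleles."""
    q = alt_freq
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def posterior_de_novo(pl_father, pl_mother, pl_child, alt_freq, mut_rate=1e-8):
    """Posterior probability that at least one transmitted allele mutated,
    under the simplified model described in the text above."""
    lik_f, lik_m, lik_c = (pl_to_likelihoods(pl)
                           for pl in (pl_father, pl_mother, pl_child))
    prior = hardy_weinberg(alt_freq)
    total = 0.0
    with_mutation = 0.0
    # gf, gm: number of ALT alleles carried by father and mother.
    for gf, gm in product(range(3), repeat=2):
        w_parents = prior[gf] * lik_f[gf] * prior[gm] * lik_m[gm]
        # tf, tm: transmitted allele (1 = ALT); zf, zm: mutation indicators.
        for tf, tm, zf, zm in product((0, 1), repeat=4):
            p_tf = gf / 2 if tf else 1 - gf / 2
            p_tm = gm / 2 if tm else 1 - gm / 2
            p_zf = mut_rate if zf else 1 - mut_rate
            p_zm = mut_rate if zm else 1 - mut_rate
            child = (tf ^ zf) + (tm ^ zm)   # ALT count in child after mutation
            w = w_parents * p_tf * p_tm * p_zf * p_zm * lik_c[child]
            total += w
            if zf or zm:
                with_mutation += w
    return with_mutation / total


# Usage: confident 0/0 parents, confident 0/1 child, rare ALT allele.
print(posterior_de_novo([0, 90, 900], [0, 90, 900], [300, 0, 300], alt_freq=1e-4))
```

In this toy model, a parent whose PL values do not separate 0/0 from 0/1 (genotype quality 0) pulls the posterior towards zero even when no ALT reads are seen in that parent, because inheritance from a heterozygous parent becomes far more plausible a priori than a new mutation.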