Using VCF files

This vignette documents the functions in the BinaryDosage package that convert VCF files to binary dosage files.

Note: The examples below use functions to access information in binary dosage files. Information about these functions can be found in the vignette Using Binary Dosage Files. Data returned by the function getbdinfo contains information about a binary dosage file. Information on the data return by getbdinfo can be found in the vignette Genetic File Information.

Introduction

VCF files are a useful way to store genetic data. They have a well defined format and can be easily parsed. The output files returned from the imputation software minimac are returned in this format. The minimac software also returns an information file that is supported by the BinaryDosage package. The Michigan Imputation Server using minimac for imputation and returns VCF and information. Functions in the BinaryDosage package have default settings to use the files return from minimac.

Uncompressed VCF files are text files and can be very large, 100s of GB. Because of this they are quite often compressed. Files returned from the Michigan Imputation Server are compressed using gzip. This makes the files much smaller but greatly increases the read time. The BinaryDosage package supports reading of gzip compressed VCF files.

The BinaryDosage package was originally designed for use with files return from the Michigan Imputation Server. It was quickly learned that if any manipulation was done to these files by various tools such as vcftools, The conversion routine in the BinaryDosage package would not work. The routine was modified to support additional VCF file formats.

The BinaryDosage package has a routine to convert VCF files into a binary format that maintains the dosage, genetic probabilities, and imputation statistics. This results in a file about 10-15% the size of the uncompressed VCF file with much faster, 200-300x, read times. In comparison, Using gzip to compress the VCF file reduces the size of the file to about 5% its original size but makes run times slower.

Routines were written to help debug the conversion routine. It was discovered these routines were quite useful for accessing data in the VCF file and are now included in the package. This document contains instructions on how to use these routines.

Example files

There are several sample files included with the BinaryDosage package. The file names will be found by using the system.file command in the examples. This will be used many times in the examples.

The binary dosage files created will be temporary files. They will be created using the tempfile command. This will also be used many times in the examples. All output files will use the default format of 4. For information on other formats see the vignette Binary Dosage Formats.

Converting a VCF file to the Binary Dosage format

The vcftobd routine converts VCF files to the binary dosage format. Many different formats of VCF files by the BinaryDosage package. The following sections show how to convert VCF files in various formats to the binary dosage format.

Minimac files

Since the binary dosage format was initially created for use with files produced by minimac, the default options for calling the routine to convert a VCF file to a binary dosage format are for this type of file.

Uncompressed VCF files

Uncompressed VCF files are the easiest to convert to the binary dosage format since the default values for vcftobd are set for this format.

No imputation information file

An imputation information file is not required to convert a VCF file into the binary dosage format. In this case, the parameter value for vcffiles is set to the VCF file name.

The following commands convert the VCF file set1a.vcf into the binary dosage format.

Using the imputation information file

The minimac program returns an imputation information file that can be passed to the vcftobd routine. This is done by setting the parameter vcffiles to a vector of characters containing the VCF and the imputation information file names.

The following commands convert the VCF file set1a.vcf into the binary dosage format using the information file set1a.info.

The differences between the two binary dosage datasets can be checked by running the getbdinfo routine on both files. The value of snpinfo in the list returned from getbdinfo will be an empty for the first file and will contain the imputation information for the second file..

The following commands show the first file does not contain any imputation information and the the second one does. The imputation information for the second file is converted to a table for easier displaying.

bdinfo1a_winfo$snpinfo
aaf maf avgcall rsq
0.347 0.347 0.701 0.956
0.016 0.016 0.879 0.434
0.265 0.265 0.714 0.933
0.297 0.297 0.723 0.952
0.204 0.204 0.871 0.970
0.524 0.476 0.751 0.955
0.425 0.425 0.722 0.951
0.443 0.443 0.879 0.948
0.265 0.265 0.820 0.927
0.245 0.245 0.708 0.964

Compressed VCF files

VCF files can be quite large, 100s of GB. Because of this they are often compressed. The funciton vcftobd supports VCF files compressed using gunzip by adding the option gz = TRUE to the function call. The compressed file can be converted using an imputation information. The imputation file must NOT be compressed.

The following code reads in the data from the compressed VCF file set1a.vcf.gz. This the set1a.vcf file after it has been compressed using gunzip. The file will be read in twice, once without the imputation information file and once with the imputation information fie.

Checking the files

The four binary dosage files created above should all have the same dosage and genotype probabilities in them. The following code calculates the alternate allele frequencies for each of the binary dosage files using the bdapply function. The results are then displayed in a table showing the alternate allele frequencies are the same for each file. The value for SNPID was taken from the list return from getbdinfo.

Alternate Allele Frequencies
SNPID aaf1a_woinfo aaf1a_winfo aaf1a_woinfo_gz aaf1a_winfo_gz
1:10000:C:A 0.3468 0.3468 0.3468 0.3468
1:11000:T:C 0.0159 0.0159 0.0159 0.0159
1:12000:T:C 0.2651 0.2651 0.2651 0.2651
1:13000:T:C 0.2966 0.2966 0.2966 0.2966
1:14000:G:C 0.2040 0.2040 0.2040 0.2040
1:15000:A:C 0.5236 0.5236 0.5236 0.5236
1:16000:G:A 0.4246 0.4246 0.4246 0.4246
1:17000:C:A 0.4434 0.4434 0.4434 0.4434
1:18000:C:G 0.2655 0.2655 0.2655 0.2655
1:19000:T:G 0.2446 0.2446 0.2446 0.2446

Other VCF file formats

The vcftobd function can support VCF files in formats other than those returned from minimac. This is done by examining the value of FORMAT for each SNP in the VCF file. The routine looks for the values “DS” and “GP”, dosage and genotype probabilities, in the FORMAT column. If one or both of these values are found, the appropriate information is written to the binary dosage file.

The file set2a.vcf contains only the dosage values. The following code converts it to a binary dosage file. The getsnp function is then used to extract the first SNP and display the values for the first 10 subjects.

Dosage and Genotype Probabilities
SubjectID dosage p0 p1 p2
I1 2.000 NA NA NA
I2 1.002 NA NA NA
I3 1.000 NA NA NA
I4 0.122 NA NA NA
I5 1.000 NA NA NA
I6 0.000 NA NA NA
I7 1.004 NA NA NA
I8 2.000 NA NA NA
I9 0.000 NA NA NA
I10 0.918 NA NA NA

Other vcftobd options

There are other options for vcftobd. These options effect how the information is written to the binary dosage file.

format and subformat options

The format and subformat options determine the format of the binary dosage files. These format are documented in the Binary Dosage Formats.

snpidformat

The snpidformat options specifies how the SNP ID is written to the binary dosage file. The default value is 0. This tells the code to use the SNP IDs that are in the VCF file. Other values that can be supplied to the function creates a SNP ID from the chromosome, location, reference, and alternate allele values.

When the snpidformat is set to 1, the SNP ID is written in the format

Chromosome:Location

When the snpidformat is set to 2, the SNP ID is written in the format

Chromosome:Location:Reference Allele:Alternate Allele

When the snpidformat is set to -1, the SNP ID is not written to the binary dosage file. When the binary dosage file is read, the SNP ID is generated using the format for snpidformat equal to 2. This reduces the size of the binary dosage file.

Note: If the SNP IDs in the VCF are in snpidformat 1 or 2, the code recognizes this and writes the smaller binary dosage files.

Note: If the SNP IDs in the VCF are in snpidformat 2, and the snpidformat option is set to 1, an error will be returned. This is because of possible information loss and the binary dosage file does not change.

SNP Names by Format
format0 format1 format2 formatm1
1:10000:C:A 1:10000 1:10000:C:A 1:10000:C:A
1:11000:T:C 1:11000 1:11000:T:C 1:11000:T:C
1:12000:T:C 1:12000 1:12000:T:C 1:12000:T:C
1:13000:T:C 1:13000 1:13000:T:C 1:13000:T:C
1:14000:G:C 1:14000 1:14000:G:C 1:14000:G:C
1:15000:A:C 1:15000 1:15000:A:C 1:15000:A:C
1:16000:G:A 1:16000 1:16000:G:A 1:16000:G:A
1:17000:C:A 1:17000 1:17000:C:A 1:17000:C:A
1:18000:C:G 1:18000 1:18000:C:G 1:18000:C:G
1:19000:T:G 1:19000 1:19000:T:G 1:19000:T:G
rs12345 1:20000 1:20000:A:G 1:20000:A:G

bdoptions

When using binary dosage format 4.x it is possible to store additional information about the SNPs in the file. This is information consists of the following values

  • Alternate allele frequency
  • Minor allele frequency
  • Average call
  • Imputation r-squared

These values are normally provided in the imputation information file. However, it is possible to calculate the alternate and minor allele frequency without the imputation information file. This can be useful when a subset of subjects is extracted from the VCF file that returned from minimac. It is also possible to estimate the imputation r-squared. See the vignette Estimating Imputed R-squares for more information on the r-squared estimate.

The value for bdoptions is a vector of character values that can be “aaf”, “maf”, “rsq”, or and combination of these values. The values indicate to calculate alternate allele frequency, minor allele frequency, and imputation r-squared respectively.

The following code converts the set1a.vcf file into a binary dosage file and calculates the alternate allele frequency and the minor allele frequency, and estimates the imputed r-squared. These values are then compared to the values in the binary dosage file that was generated by using the set1a.info information file.

Information vs Calculated Information
aaf_info aaf_calc maf_info maf_calc rsq_info rsq_calc
0.347 0.347 0.347 0.347 0.956 0.956
0.016 0.016 0.016 0.016 0.434 0.434
0.265 0.265 0.265 0.265 0.933 0.933
0.297 0.297 0.297 0.297 0.952 0.952
0.204 0.204 0.204 0.204 0.970 0.970
0.524 0.524 0.476 0.476 0.955 0.955
0.425 0.425 0.425 0.425 0.951 0.951
0.443 0.443 0.443 0.443 0.948 0.948
0.265 0.265 0.265 0.265 0.927 0.927
0.245 0.245 0.245 0.245 0.964 0.964

Additional routines

The following routines are available for accessing information contained in VCF files

getvcfinfo

The getvcfinfo routine returns information about a VCF file. For more information about the data returned see Genetic File Information. This information needs to be passed to vcfapply so it can efficiently read the VCF file.

The parameters passed to getvcfinfo are

filenames is a character vector that can contain up to two values. The first value is the name of the VCF file. The second value is optional and is the name of the imputation information file.

gz is a logical value that indicates if the VCF file has been compressed using gzip. This only applies to the VCF file. Compression of the imputation information file is not supported.

index is a logical value that indicates if the VCF file should be indexed. Indexing the VCF file takes time but greatly reduces the time needed to read the file. Indexing can only be done on uncompressed VCF files.

snpidformat is an integer value from 0 to 2 that indicates how the SNP ID should be formatted. The value indicates which of the follow formats to use.

vcfapply

The vcfapply routine applies a function to all SNPs in the VCF file. The routine returns a list with length equal to the number of SNPs in the VCF file. Each element in the list is the value returned by the user supplied function. The routine takes the following parameters.

The user supplied function must have the following parameters.

The user supplied function can have other parameters. These parameters need to passed to the vcfapply routine.

There is a function in the BinaryDosage package named getaaf that calculates the alternate allele frequencies and is the format needed by vcfapply routine. The following uses getaaf to calculate the alternate allele frequency for each SNP in the set1a.vcf file using the vcfapply routine and compares it to the aaf in the information file.

Information vs Calculated aaf
SNP aafinfo aafcalc
1:10000:C:A 0.347 0.347
1:11000:T:C 0.016 0.016
1:12000:T:C 0.265 0.265
1:13000:T:C 0.297 0.297
1:14000:G:C 0.204 0.204
1:15000:A:C 0.524 0.524
1:16000:G:A 0.425 0.425
1:17000:C:A 0.443 0.443
1:18000:C:G 0.265 0.265
1:19000:T:G 0.245 0.245