Tool to identify concepts in the OMOP Genomic Vocabulary from VCF and other files, as well as HGVS notations
KOIOS is an open-source tool developed and supported by the OHDSI Oncology WG. It lets users combine their
variant data with the OMOP Genomic Vocabulary to generate a set
of genomic standard concept IDs from raw patient-level genomic data.
KOIOS can presently be installed directly from GitHub:
# install.packages("devtools")
devtools::install_github("odyOSG/KOIOS")
The file userScript.R may be loaded as a default workflow in which only
the reference genome and the VCF file (or directory of VCF files) need to
be specified.
Users must provide at least one valid VCF file in either .vcf or .vcf.gz
format. This may be in the form of a single file, or a directory
containing a set of .vcf or .vcf.gz files.
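For directory input, it can help to check beforehand which files are present. The following is a minimal base R sketch that does not call KOIOS; the helper name `listVCFFiles` is hypothetical, and it assumes files are recognised purely by their .vcf or .vcf.gz extension.

```r
# Hypothetical helper (not part of KOIOS): list .vcf and .vcf.gz files in a directory
listVCFFiles <- function(dir) {
  list.files(dir, pattern = "\\.vcf(\\.gz)?$", full.names = TRUE, ignore.case = TRUE)
}

# Example: a temporary directory holding two VCF files and one unrelated file
demo_dir <- file.path(tempdir(), "vcf_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, c("a.vcf", "b.vcf.gz", "notes.txt"))))

vcf_files <- listVCFFiles(demo_dir)
print(basename(vcf_files))
```

Only the two VCF files are returned; notes.txt is ignored.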
Users may run KOIOS with the following pipeline:
library(KOIOS)
#Load the OMOP Genomic Vocabulary into R
concepts <- loadConcepts()
#Specify input file or directory
vcf <- loadVCF(userVCF = "Input.vcf")
#Specify and load human reference genome, if known
ref <- "hg19"
ref.df <- loadReference(ref)
#Process VCF and generate all relevant HGVSG identifiers for input records
vcf.df <- processVCF(vcf)
vcf.df <- generateHGVSG(vcf = vcf.df, ref = ref.df)
vcf.df <- processClinGen(vcf.df, ref = ref, progressBar = FALSE)
#Combine this output data with the OMOP Genomic vocab to produce a DF containing a list of concept codes
vcf.df <- addConcepts(vcf.df, concepts, returnAll = TRUE)
If the user is unaware of the reference genome used to generate a given
VCF file, they may run the following commands, which check the VCF
variants against known ClinGen variants.
vcf <- loadVCF(userVCF = "Input VCF")
ref <- "auto"
ref <- findReference(vcf)
ref.df <- loadReference(ref)
Multiple VCF files within a single directory may be submitted
simultaneously within a single command:
#Load the VCF directory
vcf <- loadVCF(userVCF = "SomeDirectory/")
#Set ref to hg19
ref <- "hg19"
#concepts is the object returned by loadConcepts(); generateTranscripts must be defined beforehand
concepts.df <- multiVCFPipeline(vcf, ref, generateTranscripts, concepts)
While it is possible to use the automatic reference finder for multiple
files, it is not recommended due to the long runtime.
It is also possible to run KOIOS on VCF-like data formats, with examples
detailed below. An appropriate reference is required, as with VCF data.
mutations <- read.csv("data_mutations.txt", sep = "\t")
#reference information is likely stored in mutations$NCBI_Build
mut_vcf <- processcBioPortal(mutations)
mut_vcf <- processClinGen(mut_vcf, ref = ref, progressBar = FALSE)
mut_vcf <- addConcepts(mut_vcf, concepts)
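The comment above notes that the build is probably stored in mutations$NCBI_Build. One way to turn that column into a `ref` value is sketched below. The helper `buildToRef` is hypothetical (not part of KOIOS). It assumes the column holds labels such as "GRCh37" or "GRCh38", and that KOIOS accepts "hg38" in the same way as the "hg19" value used earlier.

```r
# Hypothetical helper (not part of KOIOS): map an NCBI build label to a UCSC-style name
buildToRef <- function(build) {
  build <- unique(as.character(build[!is.na(build)]))
  if (length(build) != 1) {
    stop("Expected exactly one genome build, found: ", paste(build, collapse = ", "))
  }
  # Assumed label set; extend if your data uses other spellings
  lookup <- c("GRCh37" = "hg19", "37" = "hg19", "GRCh38" = "hg38", "38" = "hg38")
  if (!build %in% names(lookup)) {
    stop("Unrecognised genome build: ", build)
  }
  unname(lookup[build])
}

# Toy stand-in for the cBioPortal mutations table
mutations_demo <- data.frame(NCBI_Build = c("GRCh37", "GRCh37"))
ref_demo <- buildToRef(mutations_demo$NCBI_Build)
print(ref_demo)
```

Stopping on mixed or unknown builds is deliberate: a silently wrong reference would produce incorrect genomic coordinates downstream.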
HGVSg data can be directly read into KOIOS and submitted via the
processClinGen function. A minimal HGVSg dataframe input requires a
column named “hgvsg”.
hgvsg <- read.csv("hgvsg.csv", sep = "\t")
hgvsg <- processClinGen(hgvsg, ref = ref)
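As an illustration of the minimal input described above, the sketch below builds a one-row data frame with the required "hgvsg" column and checks that the column is present. The HGVSg string is an illustrative value on chromosome 7 of GRCh37, and the object name `hgvsg_demo` is hypothetical.

```r
# Illustrative input only: a single genomic HGVS string on GRCh37 chromosome 7
hgvsg_demo <- data.frame(
  hgvsg = "NC_000007.13:g.140453136A>T",
  stringsAsFactors = FALSE
)

# Check that the required column exists before passing the data on
stopifnot("hgvsg" %in% colnames(hgvsg_demo))
print(hgvsg_demo)
```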
Data already formatted as transcript (HGVSc) or protein (HGVSp)
notation, such as cBioPortal input data (as below), may also be
submitted to KOIOS.
These data are simply matched directly with the extended concepts
object, derived from the OMOP Genomic vocabulary.
transcript_data <- read.csv("data_transcripts.txt", sep = "\t")
transcript_merge <- merge(transcript_data, concepts_ext, by.x = "hgvsc", by.y = "concept_synonym_name")
#The following is an optional step to remove version information from input transcript HGVSc.
#This allows a wide range of older data to be matched against the vocabulary, but has a small chance of generating false positive matches.
#The dot is escaped so that only a literal ".<version>" before the colon is removed.
#transcript_data$match_hgvs <- gsub("\\.[0-9]+:", ":", transcript_data$HGVSc)
#concepts_ext$match_hgvs <- gsub("\\.[0-9]+:", ":", concepts_ext$concept_synonym_name)
#transcript_merge <- merge(transcript_data, concepts_ext, by = "match_hgvs")
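The difference between exact and version-agnostic matching can be seen on toy data. The sketch below uses base R only; the data frames, the `stripVersion` helper and the concept IDs are hypothetical placeholders, not real vocabulary content.

```r
# Toy data; column names mirror those in the text, values are illustrative only
transcripts_demo <- data.frame(
  HGVSc = c("NM_004333.4:c.1799T>A", "NM_000546.5:c.524G>A"),
  stringsAsFactors = FALSE
)
concepts_demo <- data.frame(
  concept_id = c(1L, 2L),  # placeholder IDs, not real OMOP concept_ids
  concept_synonym_name = c("NM_004333.6:c.1799T>A", "NM_000546.5:c.524G>A"),
  stringsAsFactors = FALSE
)

# Exact matching: a record matches only if the transcript version is identical
exact <- merge(transcripts_demo, concepts_demo,
               by.x = "HGVSc", by.y = "concept_synonym_name")

# Version-agnostic matching: drop the literal ".<digits>" that precedes the colon
stripVersion <- function(x) gsub("\\.[0-9]+:", ":", x)
transcripts_demo$match_hgvs <- stripVersion(transcripts_demo$HGVSc)
concepts_demo$match_hgvs <- stripVersion(concepts_demo$concept_synonym_name)
loose <- merge(transcripts_demo, concepts_demo, by = "match_hgvs")

print(nrow(exact))
print(nrow(loose))
```

Here exact matching finds one record and version-agnostic matching finds two. The extra match pairs version .4 of a transcript with version .6, which is the kind of pairing that can produce a false positive if the transcript changed between versions.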
KOIOS may also be used to match gene fusion data with the relevant
concept_ids, such as cBioPortal gene fusion data (as below).
concepts_fusion <- loadConcepts_fusions()
fusions_data <- read.csv("data_sv.txt", sep = "\t")
fusions_data <- generateFusions_cBioPortal(fusions_data,concepts_fusion)
If you encounter a clear bug, please file an issue with a minimal
reproducible example at the GitHub
issues page.