The pharmacoepigenomics informatics pipeline defines a pathway of novel and known warfarin pharmacogenomics variants

Aim: ‘Pharmacoepigenomics’ methods informed by omics datasets and pre-existing knowledge have yielded discoveries in neuropsychiatric pharmacogenomics. Now we evaluate the generality of these methods by discovering an extended warfarin pharmacogenomics pathway. Materials & methods: We developed the pharmacoepigenomics informatics pipeline, a scalable multi-omics variant screening pipeline for pharmacogenomics, and conducted an experiment in the genomics of warfarin. Results: We discovered known and novel pharmacogenomics variants and genes, both coding and regulatory, for warfarin response, including adverse events. Such genes and variants cluster in a warfarin response pathway consolidating known and novel warfarin response variants and genes. Conclusion: These results can inform a new warfarin test. The pharmacoepigenomics informatics pipeline may be able to discover new pharmacogenomics markers in other drug-disease systems.


Background
Pharmacogenomics is poised for transformation by the epigenome Pharmacogenomics variant discovery has prioritized the search for protein coding variants (CVs) with genetic association and biochemical methods [1]. With the sequencing of the human genome, the initial hope for immediate discovery of highly penetrant CVs for many phenotypes [2,3] did not bear out, both in pharmacogenomics and in many other fields [4]. And with the advent of genome-wide association studies (GWAS), pharmacogenomics phenotypes began to be investigated with this powerful new tool [5]. Although most available pharmacogenomics tests were and are based on highly penetrant CVs, GWAS studies have revealed that the bulk of genetic variation in drug response, like many other phenotypes, comes from noncoding regulatory variants which are cooperative and combinatoric [6].
The development of the broader field of genomics and genomic regulation has been extremely fruitful in the last decade, but the deployment of GWAS in pharmacogenomics variant discovery has lagged deployment in other disciplines [5], epigenomic interpretation of GWAS results has been underutilized [7], and the translation of GWAS results into clinical tests has been slower still in most areas [8][9][10]. Over the last decade, genome-wide methods have produced increasingly deep and broad atlases of both the genome and epigenome. Genomic atlases have included HapMap [11] and 1000 Genomes [12], while epigenome atlases have included ENCODE Phases I [13] and II [14], the Roadmap Epigenome Mapping Consortium [15], the International Human Epigenome Consortium [16] and the upcoming Human Cell Atlas [17]. Along with this, potent new techniques for probing the spatial and temporal genome have emerged, including time series omics analysis and spatial genome methods like Hi-C [18,19,20].
Analysis of data from these atlases and targeted experiments has produced a new paradigm of regulation in which the spatial genome and epigenome take a prominent role [15,18,21]. In this paradigm, gene transcription is the result of the convergence of a series of spatial and functional factors. It takes place in chromatin loops at the spatial union of topologically associating domains (TADs) containing genic promoters with TADs containing controlling enhancers and co-regulated genes in a transcription factory [18]. Such factories assemble prior to the initiation of transcription [22,23], at the edges of chromosome territories [18], in TADs localized to interior of the cell nucleus [21]. This process occurs with the aid of activating transcription factors and epigenomic marks [15], in a dynamic [24,25], cell type specific [18] manner, with different enhancers active in different cell types and tissue microenvironments, while some act in many or all cell types and tissues.
This model, which has developed along with increasingly powerful epigenomic measurement techniques [15] and imaging modalities [26], is yielding increasing predictive power and new insight in many areas. Genome-wide, multiomics epigenomes (i.e., those which combine a number of complementary genome-wide epigenomic measurements) are now available on a genome-wide basis in many cell lines and tissues [15], and even in multiple physiological conditions in many cases. As a result of this, the epigenome is increasingly regarded like the reference genome: as a resource to be consulted for systems and loci of interest [17], rather than an unknown quantity to be queried experimentally in specific contexts.
This transition is rapidly revolutionizing variant discovery with the emergence of integrated multi-omics pipelines for variant discovery. These pipelines, which include HaploReg [27] and RegulomeDB [6], screen association regions from the genome for variants bearing important marks of epigenomic regulatory variants. These methods have allowed investigators to begin investigating at scale the over 90% of associated variants which are noncoding regulatory variants [6], and to get new and increasing value out of GWAS studies whose results were initially enigmatic [28][29][30].
The pharmacoepigenomics informatics pipeline: multi-omics variant discovery for pharmacogenomics However, in pharmacogenomics, these insights have not as yet been broadly applied. Despite recent work showing that an epigenome, 4D nucleome-based variant discovery and validation approach can yield value in excess of traditional methods [31][32][33], this new era of 'pharmacoepigenomics' [34,35] is still in its infancy. Indeed, much pharmacogenomics variant discovery still proceeds along traditional lines involving the search for CVs to be designated as star (*) alleles, with a particular emphasis on pharmacokinetic (PK) genes for absorption, distribution, metabolism and excretion (ADME). Such genes have formed the focus for test development for response, dosing and adverse drug events (ADEs) and adverse drug reactions.
To aid the wider dissemination of these advanced methods in pharmacogenomics, we have created the pharmacoepigenomics informatics pipeline (PIP), an integrative multi-omics variant discovery pipeline designed specifically for next-generation pharmacogenomics and pharmacoepigenomics. It combines omics datasets (including epigenome compendia and phenotypically driven GWAS) with domain-specific knowledge (including key tissues and known significant genes). The PIP's variant discovery strategy is based on two separate workflows: a CV workflow based on traditional methods and an expression regulatory variant (ERV) workflow that integrates bioinformatics algorithms and omics datasets to screen and organize regulatory variants. These variants and their target genes are organized together at the end of the analysis to provide the input for downstream statistical analysis, gene-set enrichment and pathway analysis to clarify the genotype-phenotype relationships.
The ERV workflow is carried out with an overall model for combining multiple omics with pre-existing knowledge about drugs and disease (i.e., relevant tissues and candidate variants) to discover variants and pathways that have a causal influence in chosen phenotypes. It begins with lead SNPs from association studies and expands them in a population-specific manner with linkage mapping. After this, tissue-specific omics datasets are used to evaluate the regulatory function of the genomic regions around the variants, the dependence of that function on the status of the variants and the identity of likely target genes. Finally, variant-gene pairs passing all these tests are filtered and organized with pathway mapping and gene-set enrichment.
Nascent implementations of the ERV workflow have been used in several pharmacogenomics variant discovery experiments [31][32][33], in a number of drug-disease systems including lithium for bipolar disorder, valproic acid for traumatic brain injury and citalopram and ketamine for major depression [ga higgins et al. unpublished data] . The PIP is designed to introduce additional capabilities and further automation into such experiments.

The need for improved warfarin pharmacogenomics
Warfarin is an anticoagulant used for the prevention and treatment of venous thromboembolism in cardiac disease, postoperative recovery and other contexts involving the need for coagulation control [36]. An inhibitor of the vitamin K epoxide reductase enzyme complex (including VKORC1), it exhibits wide interindividual response variation [36].
Novel & known variants for warfarin response define a pathway Research Article A total of 23 associations from 22 genome-wide association studies (GWAS)  met the criteria for inclusion in the warfarin experiment, and their significant associations were rendered into the MVF and organized into phenotypic classes, along with 23 additional variants annotated by the Pharmacogenomics Knowledge Base (PharmGKB ® ) [61]. The 23 additional variants included all variants annotated by the PharmGKB as of September 2016, with a significant association with warfarin response, and were not already included in the list of variants identified from GWAS. Population descriptions are taken from the summary annotations in the original papers and the GWAS Catalog [62] , but in the PIP input files they are represented in 1000 Genomes format to replicate the detailed descriptions in the original papers as precisely as possible. Although they represent a variety of ancestries, European ancestry predominated among 17 out of 22 (77%) of the study populations. Dose requirements to achieve therapeutic international normalized ratios (INR) vary as much as tenfold among patients [36]. Despite the recent availability of other oral anticoagulants, warfarin is still a commonly prescribed anticoagulant, and until 2015 this venerable drug was the most popular prescribed oral anticoagulant with about seven million office visit interactions resulting in a prescription [37]. Warfarin has complicated PK and pharmacodynamics (PD), with its influences on the action of a number of clotting factors each exhibiting a different half-life [36]. It is widely acknowledged that warfarin dosing requirements and other phenotypes exhibit a strong patient-specific genetic element. Factor-of-ten differences in typical dosing requirements based on alleles in CYP2C9, the primary metabolic enzyme of warfarin, are present on the package insert label [36], and data on association between warfarin requirements and other genetic loci have proliferated (Table 1 & Table 2). There have been multiple attempts to provide clinicians with genotype-guided dosing algorithms based typically on genotypes of VKORC1 and CYP2C9 [38], and in some cases CYP4F2 and GGCX. Despite the strength of variation in these key genes, there is a 'missing heritability' problem: identified loci in these genes account for only 30-50% of the predictive power implied by heritability estimates [38]. Current-generation tests use only variants in these genes, and almost all use the same three SNPs: rs1799853, rs1057910 and rs9923231 (a promoter SNP for VKORC1). Partially because of this, these tests have achieved only limited success despite intensive development and validation effort. Despite hopes that two large trials (EU-PACT [64] and COAG [65]) would report good results, their appearance in the same issue of the NEJM did not bear out such hopes. COAG reported no difference in time within the therapeutic range, and EU-PACT reported a difference, but only compared with fixed starting doses with subsequent adjustment, not to initial dosing methods based on clinical indications that represent the real-world alternative to genetic dosing. Neither trial was powered to report a difference in bleeding and embolism events. Subsequently, a published meta-analysis of nine randomized controlled trials of warfarin pharmacogenomics dosing algorithms versus manual dose determination showed that current-generation tests using VKORC1 and CYP2C9 offer no improvement in time within the therapeutic range, percentage of patients with high INR or bleeding and coagulation events [66].
Since the conclusion of these trials, expert comment has indicated a consensus that such algorithms do not add clinical value relative to dosing on the basis of clinical indications, despite their predictive power [67,68]. Although the recent GIFT randomized controlled trial of genomic warfarin dosing found that genome-guided dosing provided a significant benefit in the composite end point of major bleeding, INR of four or greater, venous thromboembolism or death [69], the generalizability of the study's results is limited. GIFT had narrow inclusion criteria: patients aged 65 years or older initiating warfarin for elective hip or knee arthroplasty. Those patients are at higher risk, and thus the generalizability of these results to larger patient populations and indications is debatable.
Despite their predictive power, such tests have failed to deploy in mainstream clinical practice. This experience suggests that new lines of research should be opened, such as the application of advanced genomic methods, to discover new variants, recover missing heritability and develop a new generation of tests which can add value relative to dosing by clinical indications.

The pharmacoepigenomics informatics pipeline
The PIP uses lead variants from GWAS and candidate gene studies to find genetically linked permissive candidate variants (PCVs), using data from the 1000 Genomes Project for populations matched to the source studies. These variants are evaluated by two separate workflows: the ERV workflow for regulatory variants and the CV workflow for CVs. The ERV workflow evaluates the PCVs in disease-relevant tissues for DNA methylation, transcription factor binding, histone marks, DNase I hypersensitivity, chromatin state, quantitative trait loci (QTLs) and transcription factor binding site disruption using tissue-specific omics datasets. The CV workflow finds common nonsynonymous CVs within the pool of PCVs. Both sets of variants are mapped back to their host genes using RefSeq [70], and screened for expression in relevant tissues. The final output variants and their host genes are subjected to pathway analysis in Ingenuity ® Pathway Analysis (IPA ® ; Qiagen GMBH) [71], offering an additional level of screening along with mechanistic insight for subsequent pharmacogenomics test development.
The methods of the PIP are shown in Figure 1. The PIP is based on the methods of our previous paper on lithium pharmacogenomics [32]. We performed two experiments with this pipeline. First, a reproduction of the lithium experiment with a restricted PIP feature set to replicate these methods to validate the initial pipeline by reproducing our lithium pathway. Second, a new experiment in the pharmacogenomics of warfarin using the full feature set to investigate whether the PIP may add value to a well-studied area of pharmacogenomics.

Construction of input files for lithium & warfarin PIP experiments
A PIP experiment begins with the input of associated lead variants in the form of a master variant file (MVF) which contains the refseq ID of the variant, the populations in which the original association was derived using 1kG Phase III population codes, and the PubMed ID of the study in which the association was derived.
PIP experiments were undertaken for lithium/bipolar disorder (BPD) and warfarin/anticoagulation, beginning with the construction of MVFs. For the warfarin experiment, a literature review in September 2016 identified 23 GWAS on warfarin response and other pharmacological phenotypes of warfarin, venous thromboembolism risk and baseline anticoagulant protein levels in healthy patients. Joint analysis of studies on related phenotypes, and overlap between disease risk and pharmacogenomics variants, are known phenomena from previous work. In addition, we added 23 variants annotated by PharmGKB ® , yielding a total of 204 SNPs. The input studies for the warfarin experiment are summarized in Table 1.
The warfarin experiment began with input data from populations all over the world, including European, east Asian, south Asian, African and American cohorts. Population-specific associations from these groups were interpreted in the context of other population-specific input data, from the worldwide genome catalog of 1000 Genomes. Population descriptions are taken from the summary annotations in the original papers and the GWAS Catalog [62], Novel & known variants for warfarin response define a pathway Research Article but in the PIP input files are represented in 1000 Genomes format to replicate the detailed descriptions in the original papers as precisely as possible.
The various consortium data (ENCODE, REMC) and ChromHMM [72] chromatin states are imported for relevant tissues, with the relevant tissues being supplied in the format information for each consortium separately, for a unified set of tissues, in a master tissue file (MTF). The MTF is designed for each experiment to contain the set of tissues which are relevant to the particular drug-disease system under investigation, so that tissue-specific omics data may be used in the experiment. Information on the relevant chromatin marks and transcription factors was imported in a series of transcription factor (TF) and chromatin mark whitelist files containing metadata codes in the formats of each of the consortia.
MTFs were constructed for both experiments. The warfarin MTF included the human liver (site of warfarin metabolism), vasculature (site of action) and small intestine (site of absorption, a known variable in response and metabolism). These were represented by a mixture of cell line and tissue samples in the ENCODE, REMC and GTEx datasets, but a complete epigenome with methylation, DNase accessibility, all core histone marks and many TF-binding tracks was available in all tissues.
With the MVFs and MTFs generated for the lithium and warfarin experiments, the PIP was used to conduct its analysis.

Permissive candidate generation & variant annotation
Generation and annotation of PCVs occurs in the following manner: • Linkage disequilibrium mapping is performed in the PLINK software package [73], version 1.9 [74], with the 2 May 2013 (latest) release of 1000 Genomes consortium Phase III data. For each of the SNPs in the MVF, PLINK computes LD coefficients and outputs a set of all cis and trans variants that achieve R^2 ≥ 0.8 linkage with the input variant. These variants constitute the set of PCVs.
• Chromatin state annotation is performed with the version of ChromHMM results available in the Washington University (WUSTL) epigenome browser [75] as of April 2016. Chromatin mark information on a number of indicative marks along with DNase accessibility measurements, TF binding and methylation (currently not used in scoring) are assayed using ENCODE and REMC data in the relevant marks and the relevant tissues as rendered in the MTF and whitelist files.
• TFBS disruption data are assessed using TFM-scan [76] on 23-bp-per-side hg19 sequences on reference and alternate alleles from the source SNPs, using integer-converted data from the PWM library used in HaploReg [27], and requiring a threshold crossing or 10-point difference in score between reference and alternate alleles for a qualifying position weight matrix (PWM), as a measure of significance.
• Gene expression data were retrieved from the latest build of GTEx [79], and from the Allen Brain Atlas [63].
• The CV workflow located all nonsynonymous CVs among the PCVs using dbSNP data, then screened them for minor allele frequency with 1000 Genomes data on all human populations, using a MAF cutoff of 0.01.

Variant scoring, expression testing & pathway analysis
Scoring proceeded according to Figure 1, with SNPs being required to pass all of the separate thresholds in each of the individual portions of either one of the two workflows (ERV and CV) to proceed. Then, their host genes must pass gene expression testing in relevant tissues with a twofold coverage threshold in order to reach the final gene list. All of these steps are undertaken on the Flux computing cluster at the University of Michigan. Finally, the gene list is analyzed with IPA.
Lithium PIP experiment: (mostly) same methods as original experiment The two experiments were conducted with slightly different feature sets. The warfarin experiment used the full feature set, while the lithium experiment used a restricted feature set designed to match the semi-automated analysis from our 2015 paper, including no use of gene bodies as candidate region inputs, no CV workflow, expression analysis future science group Research Article Allyn-Feuer, Ade, Luzum, Higgins & Athey performed manually with Allen Brain Atlas in situ hybridization data on brain tissue and the use of biochronicity analysis to screen genes. The lithium MVF was constructed to contain precisely the set of input variants originally used in the lithium experiment previously published, a total of 108 SNPs. The lithium MTF included the same tissues used in the original lithium experiment; BPD-relevant brain regions plus the human liver. The lithium MTF does not feature GTEx data because the restricted feature set for this experiment used manual gene expression analysis with Allen Brain Atlas in situ data in order to replicate the original lithium experiment. The tissue files for both experiments are summarized in Table 3.
Despite this effort, the lithium experiment feature set of the PIP does not fully match the original experiment [32]. Several obstacles were encountered in which subtle differences in methods were necessitated by database deprecation, ambiguities in the documentation of available literature methods, and ambiguities in versioning of online resources. These obstacles included: • LD mapping methods: in the original paper, LD mapping was performed in HaploReg v4.1 [27], which uses the Phase I cohort of the 1000 Genomes project [12] as its LD mapping cohort. However, this tool was developed after The tissue file coding used in the lithium and warfarin PIP experiments, respectively. The first column contains natural language names for the tissues, while subsequent columns contain colon-delimited consortium ontology codes for related tissues and cell lines. The lithium experiment does not have tissue codes for GTEx because annotation of gene expression mapping was done manually with Allen Brain Atlas in situ hybridization data [63] in this experiment, as described herein. PIP: Pharmacoepigenomics informatics pipeline.  • Switch in QTL databases: the GeneVar [83] QTL database we used in the original analysis is no longer available in its full form. We have contacted the original authors and they have not responded. Because of this, we have been forced to seek supplemental databases in addition to the partial GeneVar database still available, as described below.

PWM integer approximation
The PIP uses the same database of PWMs that was used by HaploReg [27] during the initial lithium analysis, and scores them against the flanking sequences using the same software, TFM-Scan [76]. However, the PWM matrices available for bulk download are in probability format, and TFM-Scan presents warnings about accuracy when using probability matrices. It expects frequency matrices. We do not know if HaploReg used the probability matrices and (possibly) suffered inaccuracies, or if they have copies of the matrices in natively integer format which they have not made available, or if they converted the probability matrices to integer. We have elected to convert the probability matrices to frequency matrices, but because of rounding, this is imperfect and may introduce small errors.

Results
Lithium replication experiment: more significant pathway, more genes & variants To demonstrate the ability of the PIP to replicate the results of previous integrative omics methods in pharmacoepigenomics variant discovery, we first undertook a PIP experiment to replicate the findings of our 2015 paper on lithium [32], as a positive control. The feature set of the PIP used for this experiment was restricted to match that of the previous experiment, as described in the methods. The lithium experiment yielded a total of 1727 PCVs, of which a total of 78 passed the entire set of filters. These results differ from and expand on the results of the original lithium analysis. Although the original experiment identified 19 SNPs in 10 genes, the new experiment identified 17 out of 19 original SNPs, and all 10 original genes plus more SNPs in these and 2 more genes for a total of 78 SNPs in 12 genes. These are the ten genes from the original lithium paper, with the addition of HTR1A and TNIK, both of which appeared in the glutamatergic lithium pathway called by IPA in the original lithium paper.
Because of the factors mentioned in the methods, as well as imprecisions in the manual steps in the original lithium experiment, the lithium results differ subtly from those of the original experiment. This underscores the importance of using versioned, locally stored versions of all databases in bioinformatics experiments to allow reproducibility. In addition, however, the automated analysis recovers additional genes from the same pathway in the lithium experiment, emphasizing the utility of automated workflows with advanced functionality.
Thus, we reproduce the glutamatergic pathway for lithium response with the PIP, as reflected in Figures 2 & 3. IPA calls the same pathway for both the 10 genes discovered in the original experiment and the expanded 12-gene set. The pathway also includes 22 additional genes not discovered in either experiment but used to connect them in the pathway mapping functionality of IPA. The two pathways differ only in IPA's inclusion of a group of miRNAs in the old pathway and not in the new, which is the result of an IPA database change. The PIP recovers the same pathway as the original experiment.
In addition to this, in the period of time between the execution of the 2015 lithium experiment and the publication of this manuscript, other groups have reported connections between lithium response and absorption, distribution, metabolism and excretion and some of the genes we reported in this network [96,107].
Using substantially the same input data as the original lithium experiment [32], the PIP pipeline successfully identified the same glutamatergic pathway in the human brain and identified additional genes which are part of the same pathway. This constitutes a validation that the PIP replicates the discerning power of our previous semi-automated methods, and may function as a foundation for future development.

Warfarin experiment: known & novel variants form an enhanced & unified pathway
With the methods thus validated, we proceed to the results of the warfarin experiment in the PIP. The input variants from the warfarin study yielded a total of 4492 PCVs, which were analyzed for each of the criteria in the ERV and CV workflows. Of these PCVs, there were significant associations between the sets of SNPs passing some related modules in the PIP, and some modules were, in this experiment, significantly stricter than others. However, all the modules filtered out a meaningful number of SNPs as part of the overall process. This process is summarized in Figure 4.
The ERV and CV workflows identify a total of 223 SNPs, which were then mapped to their host and target genes, which were subjected to the expression test with GTEx data on tissues from the MTF. Of these 223 SNPs, 87 were located in 41 genes passing the expression test. Notable among them is the presence of classic and previously known warfarin response genes including CYP2C9, clotting factors and VKORC1.  Novel & known variants for warfarin response define a pathway Research Article Next, we performed pathway analysis in IPA on this set of 41 genes. There are only four genes whose variants are used in current-generation warfarin pharmacogenomics tests: VKORC1, CYP2C9, CYP4F2 and GGCX. Despite the predictive power and clinical utility of these variants and the tests constructed with them, IPA does not connect them into a pathway, and previous warfarin literature has not conceptualized warfarin response in pathway terms. Despite this, and despite the typical pattern wherein IPA discovers many pathways of questionable relevance, this analysis yielded only two pathways, one of them with 31 genes and a particularly striking p-value of 1E-65 (as calculated by IPA). The gene list for this pathway is shown in Table 3, while the pathway network diagram is shown in Figure 5. It densely unites previously known warfarin response genes (including those not previously used in test design) with new genes.
This pathway may be considered the 'canonical' warfarin pathway based on the published literature base. It consolidates the bulk of discovered genes, particularly all but one of those previously known for warfarin response, including VKORC1, warfarin-metabolizing CYP enzymes and a collection of clotting factors. This includes the FGA and FGG fibrinogen subunits. Despite the fact that they operate as a complex and cannot function separately, FGG had been previously identified for warfarin response but FGA had not, whereas the PIP identifies both genes, which are located in the same TAD and are co-regulated.
Among previously known genes in this pathway, over 75% are highest expressed in liver, while 57% of newly discovered genes are highest expressed in the intestines or vasculature. Notably, a mixture of coding and regulatory SNPs is identified for both known and new genes.
The second pathway contains only one previously known warfarin gene (and that one, BCKDK, of disputed  Figure 6. Warfarin and coagulation-related terms are prominent among the results, and many of the genes contributing to these identifications were not previously identified as warfarin related.
The PIP has identified many known genes for warfarin phenotypes, and added both new coding and regulatory SNPs for known genes, as well as new genes.
In addition to this, we investigated the warfarin input and output variants with DeltaSVM [108][109][110], evaluating their propensity to differentially induce DNase accessibility in the HepG2 liver cell line. Although the result SNPs contained a number of strong candidates by DeltaSVM, these candidates were not enriched relative to the candidates among the input set, nor were the mean and variance of the DeltaSVM scores significantly different between inputs and outputs. This implies that our existing variant dependence algorithms (PWM alteration) are mostly independent of DeltaSVM's predictive power, and that adding DeltaSVM or similar methods to the variant dependence portion of the pipeline (with appropriate changes to the scoring system) would add value to the pipeline.

The PIP is scalable: runtime & computational performance
For the warfarin experiment, total runtime and its division into the various components of the workflow are summarized in Figure 7. The total experimental runtime on the University of Michigan Flux cluster was about 14,300 node hours (each node using two six-core Intel Xeon X5650 processors running at 2.67 Ghz, with 48GB of RAM). Of this, the bulk was devoted to the initial LD calculations, QTL analysis and the retrieval of epigenome data from REMC datasets.
Although we anticipate substantial runtime improvements from code optimization in subsequent versions of the PIP, it is clear that this version is already computationally scalable for use in pharmacogenomics variant discovery in many systems. When sufficient cluster resources are available, it can finish running after about 6 hours of computation. Although REMC and QTLs are the largest contributors to total runtime, PLINK is a large contributor to  Novel & known variants for warfarin response define a pathway Research Article parallel runtime because it is parallel on the level of input SNPs, not on the level of PCVs. With a sufficient number of nodes PLINK accounts for a third of total runtime, though not total computation.

Warfarin pharmacogenomics
The discovery of a number of previously unknown candidate genes and SNPs for warfarin response, dosing and ADEs, including a unifying pathway with known variants, is a potential direction for a significant improvement in warfarin pharmacogenomics. This includes both additional variants controlling known warfarin response genes, and also putative new genes with associated variants. These variants should be included in the construction of next-generation predictive models for warfarin pharmacogenomics, and the mechanistic investigation of the newly discovered pathway may also be valuable.
Among the SNPs, regulatory SNPs are identified in blue, while coding SNPs are shown in green.
Among the SNPs, regulatory SNPs are identified in blue, while coding SNPs are shown in green.  although it appears not to have been explicitly described previously. Thus, the PIP has systematized and extended the previous literature into a more integrated picture. Because of the extensive pleiotropy observed with pharmacogenomics genes and variants, systematizing evidence is very important for integrative test design [111].
It is noteworthy that in the warfarin experiment, the PIP was successful in rediscovering both key PK and PD genes.
Although pharmacogenomics has traditionally used more variants in PK genes, as discussed in the introduction, PD variants have proved to be extremely potent in the instances where they do find clinical use, including VKORC1 promoter variants used in current-generation warfarin genetic tests, which are the most potent variants used in such tests. Thus, it is heartening to see extensive discovery of PD genes in the PIP, including extensive discovery of regulatory variants for PD genes. Discovery methods like the PIP may be able to aid the transition to make more use of PD genes in pharmacogenomics, as suggested previously [31,32,34].
PK genes discovered for warfarin include CYP2C9 and other key metabolic enzymes for warfarin. However, for PK genes, the PIP mostly discovered only CVs, and not regulatory variants. We suspect this may be the case for two reasons. First, the CYP loci have a complicated structure of homology with many similar genes in close proximity, which tends to confound both hybridization and linkage and sequence alignment [112][113][114][115], all of which are essential steps in the PIP and the omics assays which provide its source information. Second, however, the current version of the PIP locates target genes by sequence proximity, rendering it unable to find distal regulatory elements like those which control many facultative metabolic enzymes. It is anticipated, therefore, that subsequent PIP enhancements will enable better discovery of regulatory variants for PK genes.
Warfarin is a major anticoagulant, but in recent years a host of other anticoagulants including other vitamin K analog agonists as well as novel oral anticoagulants have been gaining in acceptance [37]. Many of these medications share known mechanistic elements with warfarin, and also have known differences in mechanisms [116,117]. Various drugs  Novel & known variants for warfarin response define a pathway Research Article that modulate the coagulation system are used for other purposes [117,118], including clopidogrel, which is routinely prescribed on a chronic basis to reduce the risks of stroke and heart disease. And some SNPs discovered in the PIP experiment for warfarin have also appeared in association experiments for other anticoagulants, especially vitamin K antagonists. For example, rs2108622, a coding SNP in CYP4F2, which the PIP identifies as a warfarin response SNP, is also a response SNP for acenocoumarol [106]. A joint analysis of the pharmacoepigenomics properties of this broader collection of coagulation system-modifying drugs, integrating and comparing pathways and mechanisms, could be of real benefit to extend the breadth of future anticoagulation pharmacogenomics testing. More broadly, these results indicate that the epigenome-and 4D nucleome-informed paradigm of pharmacogenomics that we described in our 2015 papers [31,32,34] may generalize beyond neuropsychiatry into cardiovascular pharmacogenomics and other domains. The genes and variants discovered with the PIP should be the subject of follow-up research in both pharmacogenomics test design and future anticoagulant drug development. Automated pharmacoepigenomics-based variant discovery and investigation methods like the PIP should continue to be developed, with potential future application in pharmacogenomics over the coming years.   Our successful reproduction of the results of the lithium experiment with the PIP demonstrates that reproducibility in bioinformatics experiments depends closely on adherence to important reproducibility principles. Reproduction of previous lithium results was complicated by elements of the original experiment like the inclusion of manual steps, consulting resources from external databases without retaining them or documenting precisely which tracks were consulted and the use of databases whose contents change in a unversioned manner. These lessons have been incorporated into the PIP with design features like the use of versioned, locally stored databases, the selection of resources and records in an algorithmic and reproducible manner and the retention of intermediate results and code. This ensures that PIP experiments are reproducible. Remaining challenges, including the use of the unversioned IPA database for pathway analysis, should be rectified in subsequent PIP refinements. Conversely, however, it is clear from our results that the automated pipeline offers better functionality than the semi-automated pipeline it was designed to reproduce. It is possible to make pervasive use of machine learning in gauging the variant dependence of enhancer function. The current pipeline gauges variant dependence only with PWM measurements, but a number of machine learning algorithms have recently been published for gauging variant function, including DeltaSVM [108][109][110], methods of Nishizaki [7] and recently GKM-DNN [119][120][121][122][123]. We have performed testing with DeltaSVM indicating that its predictive power is 'orthogonal' to that of PWMs, and would add value, both by allowing rejection of variants that pass the current scoring system but do not show variant dependence with DeltaSVM, and rescue of variants that barely miss the current thresholds but exhibit strong variant dependence in DeltaSVM. The integration of some of these published methods would strengthen the PIP.
In addition, methods like the PIP may help alleviate a vexing issue in current pharmacogenomics and other genetic testing: inapplicability of some test results across populations. Sometimes genetic associations discovered with association testing may hold in the population from which the study populations were drawn (predominantly European, although increasingly Asian populations are being used in research conducted in Asia), but not in other populations [124][125][126][127]. In this case, only 3 of 23 studies included African populations. This has been an issue of comment both in the literature [128][129][130][131] and in the lay press [132133]. Strategies for surmounting it have comprised including population-specific lead SNPs in test design, designing separate tests for different populations, including populationlinked SNPs in test design, and various combinations of these methods [134,135]. Warfarin response and coagulation Novel & known variants for warfarin response define a pathway Research Article phenotypes in particular are an area in which certain populations, including African-Americans, have exhibited different response profiles and different GWAS lead SNPs from European populations [54,136].
The PIP and similar methods may help to resolve this problem. One mechanism by which such an effect may arise is when a GWAS lead SNP is only a tagging variant for an underlying effector SNP, and the LD relationship between these SNPs is population dependent. In such cases, the PIP, by evaluating the population of population-specific LD partners of the lead SNP, and finding effector SNPs within this population will sometimes enable tests to be designed on the basis of population-specific GWAS which will function predictively in a population-agnostic manner.
Methods like these may generalize in other drug-disease systems. PIP-style methods may identify trans-ethnic causal regulatory variants and recover missing heritability by interpreting association studies in the light of disease mechanisms and massive cohort omics. These high-quality novel variants may help to enable the development of new pharmacogenomics tests with greater predictive power and clinical benefit.

Summary points
• Pharmacoepigenomics methods, informed by omics datasets and published literature, have yielded discoveries in neuropsychiatric pharmacogenomics. • Pharmacogenomics is poised to be rapidly revolutionized by regulatory variant discovery driven by integrated multi-omics pipelines. • However, much pharmacogenomics variant discovery still proceeds along traditional lines involving the search for coding variants to be designated as star (*) alleles. • We have developed a pharmacoepigenomics informatics pipeline (PIP), an integrative multi-omics variant discovery pipeline designed specifically for next-generation pharmacogenomics and pharmacoepigenomics, and validated it using a positive control in lithium pharmacogenomics. • The PIP is based on 4D genome architecture and integrates bioinformatics algorithms and omics datasets along with pre-existing domain-specific knowledge derived from the published literature to screen and organize pharmacogenomics regulatory variants and a separate module to find nonsynonymous coding variants. • Despite large genetic influence on dosing, pharmacogenomics tests in warfarin have failed to find wide use because the three commonly used SNPs in two key genes account for ∼30% of the heritability of response. • Using the PIP to analyze the results of 23 genome-wide association studies on warfarin-related phenotypes, we identified known and novel putative pharmacogenomics variants and genes for warfarin response, including some adverse event markers. • Such genes and variants appear to cluster in a warfarin response pathway containing known warfarin response genes, adding additional variants altering the protein sequence of or putatively regulating these genes, and also adding new genes and variants. • Multi-omics pipelines for pharmacogenomics screening may further enhance our understanding of other drugdisease systems, and provide direction for the creation of new pharmacogenomics tests with improved patient outcomes.