HM Medical Clinic

H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Open Access
GSVA: gene set variation analysis formicroarray and RNA-Seq dataSonja H¨anzelmann1,2, Robert Castelo1,2* and Justin Guinney3* Background: Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene
expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis
include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling
experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can
model pathway activity within highly heterogeneous data sets.
Results: To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates
variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of
GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of
its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data
from both microarray and RNA-seq experiments.
Conclusions: GSVA provides increased power to detect subtle pathway activity changes over a sample population in
comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic
analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to
the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part
of the Bioconductor project and can be downloaded at
is collectively exerted and may vary by environmental The ability to measure mRNA abundance at a genomic stimuli, genetic modifications, or disease state. Thus, scale has led to many efforts to catalog the diverse organizing genes into gene sets provides a more intuitive molecular patterns underlying biological processes. To and stable context for assessing biological activity.
facilitate the interpretation and organization of long lists Many methodological variations of GSE methods have of genes resulting from microarray experiments, gene set been proposed including non-parametric enrich- enrichment (GSE) methods have been introduced. They ment statistics , battery testing , and focused systematically measure and annotate molecular profiles gene set testing . Battery testing methods aim that are inherently noisy and difficult to interpret. GSE at identifying gene sets standing out from a large collec- analyses begin by obtaining a ranked gene list, typically tion of annotated pathways and gene signatures. Focused derived from a microarray experiment that studies gene gene set testing methods try to carefully evaluate a few expression changes between two groups. The genes are gene sets that are relevant to the experiment being ana- then mapped into predefined gene sets and their gene lyzed . GSE methods have been successfully applied expression statistic is summarized into a single enrich- in many experimental conditions to interpret the pathway ment score for each gene set. A significant benefit of these architecture of biological states including cancer pathway-based methods is interpretability: gene function metabolic disease and development For a recentreview on GSE methods the reader may consult An important distinction among many of the GSE meth- 1Research Program on Biomedical Informatics (GRIB), Hospital del Mar Medical ods is the definition of the null hypothesis that is tested Research Institute (IMIM), Barcelona, Catalonia, Spain 3Sage Bionetworks, 1100 Fairview Ave N., Seattle, Washington, 98109 USA . The null hypothesis of a competitive test declares Full list of author information is available at the end of the article 2013 H¨anzelmann et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the CreativeCommons Attribution License (, which permits unrestricted use, distribution, andreproduction in any medium, provided the original work is properly cited.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
that there are no differences between genes inside and out- over the samples independently of any class label. Concep- side the gene set (e.g., ). A self-contained test defines tually, this methodology can be understood as a change in its null hypothesis only in terms of the genes inside the coordinate systems for gene expression data, from genes gene set being tested (e.g., ). More concretely, for a to gene sets. This transformation facilitates post-hoc con- self-contained test on a gene set, the differential expres- struction of pathway-centric models, such as differential sion of just one of its genes allows one to reject the null pathway activity identification or survival prediction. Fur- hypothesis of no differential expression for that gene set.
ther, we demonstrate the flexibility of GSVA by applying it It follows, that self-contained tests provide higher power to RNA-seq data.
than competitive tests to detect subtle changes of expres-sion in a gene set. But they may not be useful to single out a few gene sets in a battery testing setting because of the A schematic overview of the GSVA method is provided potentially large number of reported results.
in Figure which shows the two main required inputs: Finally, many GSE methods assume two classes (e.g.
a matrix X = {xij}p×n of normalized expression values case/control) and evaluate enrichment within this context (see for details on the preprocessing steps) for p The limits imposed by this assumption become genes by n samples, where typically p  n, and a collec- evident with the rise of large genomic studies, such tion of gene sets  = {γ1, . . , γm}. We shall denote by xi as The Cancer Genome Atlas project (TCGA - the expression profile of the i-th gene, by xij the specific , an ambitious project with the expression value of the i-th gene in the j-th sample, and by goal to identify the molecular determinants of multiple γk the subset of row indices in X such that γk ⊂ {1, . . p} cancer types. In contrast to case-control studies with small defines a set of genes forming a pathway or some other sample sizes, the TCGA project has large patient cohorts functional unit. Let γk be the number of genes in γk.
with multiple phenotypes, structured with hierarchical, GSVA starts by evaluating whether a gene i is highly or multi-class, and censored data. Hence, GSE methods are lowly expressed in sample j in the context of the sample needed that can assess pathway variation across large, het- population distribution. Probe effects can alter hybridiza- erogeneous populations with complex phenotypic traits.
tion intensities in microarray data such that expression To address these challenges, we present a non- values can greatly differ between two non-expressed genes parametric, unsupervised method called Gene Set Varia- . Analogous gene-specific biases, such as GC content tion Analysis (GSVA). GSVA calculates sample-wise gene or gene length have been described in RNA-seq data set enrichment scores as a function of genes inside and To bring distinct expression profiles to a common outside the gene set, analogously to a competitive gene set scale, an expression-level statistic is calculated as follows.
test. Further, it estimates variation of gene set enrichment For each gene expression profile xi = {xi1, . . , xin}, a Figure 1 GSVA methods outline. The input for the GSVA algorithm are a gene expression matrix in the form of log2 microarray expression values
or RNA-seq counts and a database of gene sets. 1. Kernel estimation of the cumulative density function (kcdf). The two plots show two simulated
expression profiles mimicking 6 samples from microarray and RNA-seq data. The x-axis corresponds to expression values where each gene is lowly
expressed in the four samples with lower values and highly expressed in the other two. The scale of the kcdf is on the left y-axis and the scale of the
Gaussian and Poisson kernels is on the right y-axis. 2. The expression-level statistic is rank ordered for each sample. 3. For every gene set, the
Kolmogorov-Smirnov-like rank statistic is calculated. The plot illustrates a gene set consisting of 3 genes out of a total number of 10 with the
sample-wise calculation of genes inside and outside of the gene set. 4. The GSVA enrichment score is either the maximum deviation from zero (top)
or the difference between the two sums (bottom). The two plots show two simulations of the resulting scores under the null hypothesis of no gene
expression change (see main text). The output of the algorithm is a matrix containing pathway enrichment scores for each gene set and sample.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
non-parametric kernel estimation of its cumulative den- and a normalized ES. The first ES is the maxi- sity function is performed. In the case of microarray data, mum deviation from zero of the random walk of the j-th a Gaussian kernel pg. 148) is used: sample with respect to the k-th gene set: ESmax = ν   xijxik jk [ arg max et22 dt , For each gene set k, this approach produces a distribu- tion of enrichment scores that is bimodal (Figure step where hi is the gene-specific bandwidth parameter that 4, top panel, Additional file Figure S1). This is an intrin- controls the resolution of the kernel estimation, which is sic property of the KS like random walk, which generates set to hi = si/4, where si is the sample standard deviation non-zero maximum deviations under the null distribu- of the i-th gene (Figure step 1). In the case of RNA-seq tion. In GSEA it is also observed that the empirical data, a discrete Poisson kernel is employed: null distribution obtained by permuting sample labels isbimodal and, for this reason, significance is determined independently using the positive and negative sides of the ik + r)y r (xij) = 1 null distribution. In our case, we would like to provide k=1 y=0 a standard Gaussian distribution of enrichment scoresunder the null hypothesis of no change in pathway activ- where r = 0.5 in order to set the mode of the Poisson ker- ity throughout the sample population. For this purpose nel at each xik, because the mode of a Poisson distribution we propose a second, alternative score that produces an with an integer mean λ occurs at λ and λ − 1, and at the ES distribution approximating this requirement (Figure largest integer smaller than λ when λ is continuous.
step 4, bottom panel, Additional file Figure S1): Let zij denote the previous expression-level statistic ij), or ˆ Fr(xij), depending on whether xij are con- ESdiff =ES+−ES− = max (0, ν (0, ν jk ())− min tinuous microarray, or discrete count RNA-seq values, respectively. The following step condenses expression- level statistics into gene sets by calculating sample-wise where ES+ and ES− are the largest positive and negative enrichment scores. To reduce the influence of potential outliers, we first convert z random walk deviations from zero, respectively, for sam- ij to ranks z(i)j for each sample j and normalize further r ple j and gene set k. This statistic may be compared to ij = p/2 − z(i)j to make the ranks symmetric around zero (Figure step 2). This is done the Kuiper test statistic , which sums the maximum to up-weight the two tails of the rank distribution when and minimum deviations to make the test statistic more computing the final enrichment score.
sensitive in the tails. In contrast, our test statistic penal- We assess the enrichment score similar to the GSEA and izes deviations that are large in both tails, and provides ASSESS methods using the Kolmogorov-Smirnov a "normalization" of the enrichment score by subtracting (KS) like random walk statistic (Figure step 3): potential noise. There is a clear biological interpretationof this statistic, it emphasizes genes in pathways that are concordantly activated in one direction only, either ij τ I(g(i) γk) i=1 I(g(i) ∈ γk ) over-expressed or under-expressed relative to the overall jk () =  p γ ij τ I(g(i) γk) population. For pathways containing genes strongly acting in both directions, the deviations will cancel each otherout and show little or no enrichment. Because this statis- where τ is a parameter describing the weight of the tail tic is unimodal and approximately normal (as observed in the random walk (default τ = 1), γk is the k-th gene via simulation, see below), downstream analyses which set, I(g(i) γk) is the indicator function on whether may impose distributional assumptions on the data are the i-th gene (the gene corresponding to the i-th ranked thus possible. In certain cases, the characteristics of this expression-level statistic) belongs to gene set γk, γk is the statistic may be undesirable, especially if the relevant gene number of genes in the k-th gene set, and p is the number sets are not explicitly separated into "up" and "down" of genes in the data set. Conceptually, Eq. produces a dis- behavior (as the MSigDB provides for many gene sets). In tribution over the genes to assess if the genes in the gene such circumstances, the statistic defined by Eq. should set are more likely to be found at either tail of the rank distribution (see for a more detailed description).
Figure step 4 and Additional file Figure S1 show We offer two approaches for turning the KS like random a simple simulation where standard Gaussian deviates walk statistic into an enrichment statistic (ES) (also called are independently sampled from p = 20, 000 genes and GSVA score), the classical maximum deviation method n = 30 samples, thus mimicking a null distribution of H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
no change in gene expression. One hundred gene sets are right-singular vector of the singular value decomposition uniformly sampled at random from the p genes with sizes of the gene set pg. 9). The combined z-score method ranging from 10 to 100 genes. Using these two inputs, we standardizes first, as PLAGE, each gene expression calculate the maximum deviation ES and the normalized profile into z-scores but the pathway activity profile is then ES. The resulting distributions are depicted in Figure obtained by combining the individual gene z-scores per step 4 and Additional file Figure S1.
sample Figure one). Both, PLAGE and the combined Although the GSVA algorithm itself does not evaluate z-score are parametric and assume that gene expression statistical significance for the enrichment of gene sets, profiles are jointly normally distributed. The combined z- significance with respect to a phenotype can be easily score additionally assumes that genes act independently evaluated using conventional statistical models. Likewise, within each gene set. The ssGSEA method from Barbie false discovery rates can be estimated by permuting the et al. uses the difference in empirical cumulative dis- sample labels . We make no general prescrip- tribution functions of gene expression ranks inside and tion for thresholds of significance or false discovery, as outside the gene set to calculate an enrichment statistic these choices are highly context dependent and may vary per sample which is further normalized by the range of according to each experiment. Examples of these tech- values taken throughout all gene sets and samples.
niques are provided in the following section.
Comparison of methods on simulated data
GSVA is unsupervised and yields single sample enrich- Review of other methods
ment scores. Therefore, we can directly compare the per- Methods for gene set enrichment can be generally parti- formance of GSVA to the combined z-score, single sample tioned according to the criteria of supervised vs unsuper- GSEA and PLAGE However, in contrast to the vised, and population vs single sample assessments. Most other methods, GSVA calculates first an expression statis- GSE methods, such as GSEA , are supervised and pop- tic with the kernel estimation of the ECDF over the sam- ulation based, in that they compute an enrichment score ples, which should help in protecting the method against per gene set to describe the entire data set, modeled on a systematic gene specific effects, such as probe effects, and phenotype (discrete, such as case-control, or continuous).
therewith increase its sensitivity. To verify this hypothesis The simplest of this genre is described by Tian et al. we have performed the following three simulation studies.
evaluated as the mean differential expression (e.g. case vs In the first study, we simulated microarray data from control) of a set of genes, compared to those genes not in a linear additive model with sample and probe effects the gene set. One of the major drawbacks of this method for p = 1, 000 genes and two groups of samples (see is that gene correlations are not taken into account, which . Using this model we have generated data sets of might lead to an increased number of false-positive gene increasing sample size and defined two gene sets formed sets with respect to GSEA . Many other supervised, by 30 genes each, where one gene set is differentially population based approaches have also been described expressed (DE) and the other is not. For the DE gene set we considered strong and weak signal-to-noise ratios A supervised, single sample based approach was intro- and two different fractions of DE genes (50% and 80%) duced in the ASSESS method After dichotomizing resulting in four different simulation scenarios. Using the the samples based on phenotypic classes, the ASSESS simulated data from each scenario, we have calculated method computes density estimates for each gene/class pathway activity profiles with the four sample-wise GSE followed by the evaluation of an enrichment score for each methods (GSVA, ssGSEA, PLAGE and the combined z- sample/gene set. This method is well-suited for assessing score) and applied a t-test on the DE and non-DE gene gene set variation across a dichotomous phenotype. GSVA sets between the two groups of samples. Using the DE also utilizes density estimates for evaluating sample-wise gene set and a significance threshold of α = 0.05, we have enrichment, but by omitting phenotypic information, it estimated the statistical power of each method as func- enables more general downstream analyses and therefore tion of the sample size. On the same data, but using the non-DE gene set, we have estimated the empirical type-I Three unsupervised, single sample enrichment meth- error rate at α = 0.05. The results of this simulation in ods have been developed, Pathway Level analysis of Gene Figure show that GSVA attains higher statistical power Expression (PLAGE), single sample GSEA (ssGSEA) and than the other three methods in each of the four simulated the combined z-score These methods compute scenarios while providing similar control of the type-I an enrichment score for each gene set and individual sample. PLAGE standardizes each gene expression profile In the second simulation study, we compared the accu- over the samples and then estimates the pathway activ- racy of each GSE method to identify differential pathway ity profiles for each gene set as the coefficients of the first activity by calling DE gene sets. For this, we used the H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Weak signal-to-noise-ratio Statistcal Power 0.2 Statistcal Power 0.2 Empirical Type−I Error 0.00 Empirical Type−I Error Statistcal Power 0.2 Statistcal Power 0.2 Empirical Type−I Error Empirical Type−I Error Figure 2 Comparison of statistical power and type-I error rate between GSVA, PLAGE, single sample GSEA (ssGSEA) and combined
z-score (zscore).
The averaged results of 1,000 simulations are depicted as function of the sample size on the x-axis, for each of the GSE methods.
On the y-axis either the statistical power (A, C, E, G) or the empirical type-I error rate (B, D, F, H) is shown. Data were simulated from a linear additive
model with sample and probe effects (see for p = 1, 000 genes. GSE scores were calculated with each method with respect to two gene
sets, one of them differentially expressed (DE) and the other one not. Statistical power and empirical type-I error rates were estimated by performing
a t-test on the DE and non-DE gene sets, respectively, at a significance level of α = 0.05. These simulations were carried out under the following
four different scenarios for the DE gene set: (A,B) weak signal-to-noise ratio, 50% of DE genes in the DE gene set; (C,D) strong signal-to-noise ratio,
50% of DE genes in the DE gene set; (E, F) weak signal-to-noise ratio, 80% of DE genes in the DE gene set; (G, H) strong signal-to-noise ratio, 80% of
DE genes in the DE gene set.
previously defined four simulation scenarios as well as the and formed by sampling uniformly at random among the linear additive model with a fixed sample size of n = 60 other 990 genes. We used a fixed configuration on the and p = 10, 000 genes to simulate data of more realis- magnitude of differential expression (strong) and on the tic dimensions. We set the first 2,000 genes as DE and fraction of DE genes in the DE gene set (50%). In a similar simulated 1,000 gene sets of which we defined 500 as DE way to the survival simulation by Bair and Tibshirani (see . For each simulated gene expression data we generated survival times and censoring status for each set, GSE scores were calculated and a two-sample t-test observation with different parameters for each group of was employed to call DE gene sets at 5% FDR. The per- samples (see This setting was generated twice formance of each GSE method was measured by the area to have independent training and test data sets.
under the ROC curve (AUC) across 100 independent sim- GSE scores were calculated separately on the training ulations (see . AUC values were calculated from and test data. A Cox proportional hazards model (Cox the binary vector of DE calls to compare the ability of each PHM) was fitted to each GSE score profile in the train- method to identify DE gene sets at a genome-wide signifi- ing data. The model with the lowest p-value provided by cance level. The results are shown in Figure This figure the Wald test was used to predict risk on the test data. As shows that GSVA attains significantly higher mean AUC baseline comparison, we also fitted a Cox PHM to each values than the other GSE methods (P < 0.05) in all but gene expression profile on the training data and selected two of the twelve pairwise comparisons. This improve- the 10 genes, corresponding to the gene set size across all ment in performance of GSVA over the other methods gene sets, with lowest p-values given by the Wald test to is also observed at a more stringent FDR cutoff of 1% also predict risk on the test data.
(Additional file Figure S2).
The performance of each gene set and gene-level model Finally, we carried out a third simulation study in the (using 10 genes) on the test data was assessed by the con- context of survival analysis. We used again the former cordance index. This simulation was repeated 100 times linear additive model to simulate microarray data with and four entire runs were performed on increasing sam- p = 1, 000 genes and two groups of samples. This time, ple sizes n = {25, 50, 75, 100} of the simulated data. In however, we performed a cross-validation study to assess Figure the distribution of concordance index values is predictive power using 50 gene sets, each consisting of reported separately for each method and sample size.
10 genes. One of the gene sets was set as DE between GSVA provides higher mean and median concordance the two sample groups while the other 49 were not DE index values than the other methods at every of the four H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Weak signal−to−noise ratio
Strong signal−to−noise ratio
50% DE genes in gene set
50% DE genes in gene set
Weak signal−to−noise ratio
Strong signal−to−noise ratio
80% DE genes in gene set
80% DE genes in gene set
Figure 3 Comparison of differential pathway activity identification of GSVA, PLAGE, single sample GSEA (ssGSEA) and combined z-score
Each panel shows the area under the ROC curve (AUC) on the y-axis for differentially expressed genes predicted by each method at 5%
FDR over 100 simulations (see On top of each boxplot the p-value of the t-test for no difference in means between GSVA and the
corresponding method is reported. The two panels on top correspond to simulations where 50% of the genes in DE gene sets were DE while the
two at the bottom contained 80% of DE genes on those DE gene sets. The two panels on the left correspond to a weak signal-to-noise ratio in the
DE magnitude while the two on the right correspond to a strong one. Diamonds indicate mean values in boxplots.
sample sizes and the difference in means is significant signature of the phenotype ALL vs MLL within different (P < 0.05) when n ≥ 50.
scenarios of magnitude of expression change.
We began by ranking all genes by fold change. Then, Lymphoblastic Leukemia: ALL vs MLL
we partitioned this ranking into three equally sized frac- A canonical use of pathway-centric methods is the study tions depicted in red, violet and blue in the volcano plot of how pathway or gene set variation reveals the underly- in Figure panel A. We used each tercile of genes with ing biological structure with respect to a given phenotype.
increasing fold changes and bootstrap 10 samples from An example of this type of analysis was demonstrated each class 1000 times. We applied the four GSE meth- in Verhaak, et al , where they showed how murine- ods to the bootstrapped data together with the canonical derived neuronal gene sets revealed a corresponding Broad C2 collection of gene sets Subsequently, we per- structure for glioblastoma subtypes in a large human formed differential expression analysis on the enrichment cohort. To assess the higher power of GSVA to detect dif- scores using limma From each ranking of adjusted ferentially expressed gene sets relevant to a phenotype of p-values we selected the top 5 gene sets and used their interest in real data, we have used a human leukemia data enrichment scores to make a hierarchical clustering of set. The data set consists of 37 different individuals with the samples. We finally partitioned the samples into two leukemia, of which 20 correspond to acute lymphoblastic groups using the two main branches of the hierarchy and leukemia (ALL) and 17 to mixed-lineage leukemia (MLL) calculated the adjusted rand index (ARI) with respect We assessed the performance of the four sample- to the corresponding sample label to assess the robustness wise GSE methods by evaluating their ability to produce a of the clustering.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Figure 4 Comparison of the predictive power for survival analysis of gene-level, GSVA, PLAGE, single sample GSEA (ssGSEA) and combined
z-score (zscore) on simulated data.
Each panel corresponds to a different sample size of the simulated data. The y-axis shows the concordance
index values of predicting survival risk on test data from 100 independent simulations. On top of each boxplot the p-value of the t-test for no
difference in means between GSVA and the corresponding method is reported. The method gene refers to a simple gene-level survival model built
from the top 10 genes with lowest p-values reported by the Wald test performed on the training data. Diamonds indicate mean values in boxplots.
As Figure shows, ARI values depend on the tercile of TCGA to obtain pathway enrichment scores for each fold change magnitude considered. Except in the case of of the canonical gene sets (C2) in MSigDB, and compared the genes belonging to the tercile with largest fold changes the four GSE methods. We performed a five-fold cross- (panel D), GSVA produced enrichment scores that led validation and calculated GSE scores separately on each to significantly higher ARI values (t-test for difference in training and testing partition of the data with each of the means p-value < 2e − 16) than ssGSEA, PLAGE or the four compared methods. We also considered the original combined z-score approaches, demonstrating the larger expression data for a simple gene-level model. On each power of GSVA to produce signatures capable of detecting of the training data sets, we fitted a Cox PHM for each subtle gene expression changes. Sample-wise enrichment gene set, and each gene, in the gene-level model. Then, scores easily enable extending this kind of analysis to we selected those five gene-sets, or genes in the gene- a more complex phenotype with three or more sample level model, with the lowest p-value of the Wald test for groups. Such an example using adrenocortical carcinoma no effect on survival. Using the selected gene-sets, we fit- data can be found in Additional file Figure S3 and ted again a Cox PHM on the training data and used it to predict risk on the training and test data sets of GSEscores. We repeated this for the gene-level model. Finally, Survival analysis in ovarian carcinoma
we assessed the predictive performance of those models, We next examined pathway models for predicting patient each of them representing a different method, by calcu- survival in ovarian serous cystadenocarcinoma (OV). We lating the concordance index of the predicted risk. As used a large gene expression experiment (n = 588) from Figure shows, except for the training data set using the H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Adjusted Rand Inde log2 ALL− log2 MLL Adjusted Rand Inde Adjusted Rand Inde Figure 5 Comparison of differential pathway activity identification of GSVA, PLAGE, single sample GSEA (ssGSEA) and combined z-score
(zscore) on a leukemia data set.
(A) Volcano plot of gene expression changes in the Leukemia data set. Genes highlighted in red form the first
tercile of largest absolute fold changes, violet indicates the second tercile and blue the third tercile. (B-D) Adjusted rand index (ARI) indicating the
accuracy of classifying the two groups of samples by hierarchical clustering of the enrichment scores produced by each of the compared methods
at the top-5 differentially activated gene sets. The distribution of ARI values is formed by bootstrapping 1,000 times 10 samples from each sample
group. Colors match the key given for genes in the volcano plot of (A) and show that, as expected, genes with larger fold changes lead to larger ARI
values. However, when fold changes are small (B-C) and the underlying signature becomes extremely subtle, GSVA produces enrichment scores
that lead to differentially activated gene sets which classify the two sample groups substantially better than using ssGSEA, zscore or PLAGE.
gene-level model, GSVA attains higher mean and median DNA repair and modulation of the innate and adaptive concordance index values than the other methods in both, immunity, respectively. Further inspection of the top sig- training and testing data sets.
nificant gene sets (P < 10−3) show that many of them One of the main benefits of pathway-centric approaches are involved in wound and immune response. Interest- is the interpretability they provide in understanding the ingly, the 3rd and 13th ranked gene sets are derived mechanisms of disease. In Table we list the top gene from response signatures to tretinoin treatment, an all- sets associated to survival as identified by GSVA (a com- trans retinoic acid drug that has been shown to suppress plete list is available in Additional file Table S2). False growth in ovarian cancer cell lines Finally, among discovery rates (FDR) are re-estimated using a permu- the top 20 gene sets we note the presence of several tation based approach by randomly ordering the sample EGF and RAS related pathways. While EGFR and RAS labels (patient survival times) 100 times, resulting in FDR mutants are not commonly observed in ovarian cancer estimates of 0.05 and 0.2 for p-value thresholds of 10−4 activation of these well-studied oncogenes may still and 5 · 10−3, respectively. The first and second ranked play an important role in progression and survival in gene sets suggest two important survival mechanisms: ovarian cancer.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
GSVA ssGSEA zscore GSVA ssGSEA zscore Figure 6 Survival analysis in a TCGA ovarian cancer data set. Predictive performance in the survival analysis of a TCGA ovarian cancer microarray
data set of n = 588 samples, measured by the concordance index obtained from a 5-fold cross-validation from (A) the training data and (B) the test
data. Diamonds indicate means in boxplots. Except in the training data using the gene-level model, GSVA provides higher mean and median
concordance index values than the other compared methods in both training and testing cross-validated data sets.
GSVA for RNA-seq data
derived from RNA-seq data. But GOseq ignores genes The application of high-throughput sequencing to that are not considered as differentially expressed and interrogate RNA concentration in biological samples, removes them from the analysis, hence ignoring genes popularly known as RNA-seq, is steadily becoming the with subtle changes. Also, rank-based methods ignore rel- technology of choice to profile gene expression The ative changes of genes in a pathway resulting in equal resulting sequence-based measurements take the form of treatment of the genes, although they might have different discrete count data and yield a larger dynamic range and fold changes Hence, these methods may be under- unbiased power than microarray technology to survey the powered to detect subtle changes in pathway activity.
cellular state of entire transcriptomes. The nature of these Here, we show how to apply GSVA to RNA-seq data.
data, however, often requires specific statistical models We provide pathway activity profiles analogous to the and bioinformatic methods for their analysis, as in the ones obtained from microarray data by using samples of case of differential expression analysis This is also lymphoblastoid cell lines (LCL) from HapMap individ- the case of many GSE methods developed for microarray uals which have been profiled using both technologies data which make distributional assumptions that preclude Microarray and RNA-seq data were processed to their direct application to RNA-seq count data obtain gene expression data matrices with matching gene To our knowledge, no attempt has been made to con- and sample identifiers . The RNA-seq data con- dense gene-level RNA-seq expression profiles into gene sists of two tables of counts derived from reads obtained sets to capture subtle changes in gene expression. GSE at two different sequencing centers, denoted by Argonne methods exist that either work with closed lists of differ- and Yale; see . We calculated Spearman correlations entially expressed genes (e.g. topGO , GOseq ), or for all genes and gene sets from both technologies. The rankings of some differential expression statistic, such as resulting distributions of correlation values are shown in GSEA and the mean-rank gene set enrichment method Figure panels A and B, using the Argonne RNA-seq . GOseq is specifically designed to address gene data (see Additional file Figure S4 for analogous results length biases in lists of differentially expressed genes for the Yale RNA-seq data). We show that GSVA enrich-ment scores correlate similarly to gene expression levelsproduced by both profiling technologies.
Table 1 Top 5 pathways predictive of survival in ovarian
We also examined two gene sets containing gender- specific genes in detail: genes that escape X-inactivation Cox P-value
in female samples and genes that are located on SIMBULAN UV RESPONSE NORMAL DN the male-specific region of the Y chrosomome Figure illustrates that microarray and RNA-seq enrich- BIOCARTA VIP PATHWAY ment scores correlate very well in these gene sets, with ZIRN TRETINOIN RESPONSE WT1 UP ρ = 0.82 for the male-specific gene set and ρ = 0.78 DASU IL6 SIGNALING SCAR UP for the female-specific gene set. Male and female samples WANG HCP PROSTATE CANCER show higher GSVA enrichment scores in their corre- FDR evaluated as .05 with p-value threshold of 10−4.
sponding gene sets. This demonstrates the flexibility of H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
Figure 7 GSVA for RNA-seq (Argonne). A. Distribution of Spearman correlation values between gene expression profiles of RNA-seq and
microarray data. B. Distribution of Spearman correlation values between GSVA enrichment scores of gene sets calculated from RNA-seq and
microarray data. C and D. Comparison of GSVA enrichment scores obtained from microarray and RNA-seq data for two gene sets containing genes
with sex-specific expression: MSY formed by genes of the male-specific region of the Y chromosome (male-specific), and XiE formed by genes that
escape X-inactivation in females (female-specific). Red and blue points represent female and male samples, respectively. In both cases GSVA scores
show very high correlation between the two profiling technologies where female samples show higher enrichment scores in the female-specific
gene set and male samples show higher enrichment scores in the male-specific gene set.
GSVA to enable analogous unsupervised and single sam- of differentially expressed genes in the gene set (50% and ple GSE analyses in data coming from both microarray 80%) and the signal-to-noise ratio expressed as the mag- and RNA-seq technologies.
nitude of the mean sample effect in DE genes for one ofthe sample groups (weak and strong signal-to-noise ratio).
For non-DE genes μ1 = μ2 = 0 with σ1 = σ2 = 1 while for DE genes μ2 = 0.5 for the weak effect, μ2 = 1 for The simulation studies were carried out using the fol- the strong effect and σ2 = 0.5. Using the model in Eq. lowing linear additive model for mimicking normalized with these parameters, we simulated 1,000 independent microarray data on p genes and n samples divided in two data sets. For each of the four GSE methods we obtained a groups representing a case-control scenario: GSE score matrix for two gene sets (DE and non-DE) by nsamples. On each GSE score matrix, we performed a two- yij = αi + βj + ij , sample t-test on the two gene sets for a difference in mean where αi N (μ = 0, σ = 1) is a gene-specific effect, between the two groups of samples (H0 : μ1 −μ2 = 0) at a such as a probe-effect, with i = 1, . . , p, βj N (μj, σj) is significance level α = 0.05. The statistical power was then a sample-effect with j = 1, 2 and eij N (μ = 0, σ = 1) estimated as 1 minus the fraction of non-rejections of the corresponds to random noise.
DE gene set and the empirical type-I error was estimated When assessing statistical power and type-I error in as the fraction of rejections of the non-DE gene set, across Figure we set p = 1, 000 genes, out of which the first the 1,000 simulations.
30 were considered to form a DE gene set and the next In the second simulation study we considered p = 30 a non-DE gene set. We considered four different sam- 10, 000 genes out of which 2,000 were set as DE and ple sizes n = {10, 20, 40, 60} and two varying conditions from which 1,000 gene sets were built, 500 of them being leading to four different simulation scenarios: the fraction DE. DE genes and gene sets were simulated using the H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
previously described parameters and simulation scenar- Chips whose processing batch was confounded with ios. Non-DE gene sets were simulated by sampling from the outcome of interest are not considered in the anal- the p = 10, 000 genes uniformly at random while DE gene ysis. Each remaining Affymetrix chip was background sets were simulated by sampling among DE and non-DE adjusted, normalized and log2 transformed using the genes in the proportions (50% or 80% of DE genes) defined Robust Multi-array Average (RMA) algorithm .
by the corresponding scenario. For each scenario, we sam- Genes that are not expressed over the detection level of pled the data this way 100 times and calculated GSE scores the microarray or whose expression values have a limited using the four GSE methods for every resulting data set.
variability through the samples do not provide discrimi- Using those GSE scores we performed a two-sample t-test natory power and may compromise the statistical power for each gene set and called DE those meeting FDR cutoffs of subsequent analyses. For this reason, we removed 50% of 5% and 1%. Performance was assessed by calculating of the genes with lower variability as measured by the ROC curves and AUC values using the R package ROCR interquartile range (IQR) across the samples except in the LCL microarray data.
The simulation study assessing the predictive power of GSE scores for survival in Figure was performed using RNA-seq data processing
linear additive model in Eq. where μ2 = 1 was fixed The RNA-seq data from Pickrell et al. (2010) were for DE genes in one of the sample groups. Survival times produced at two sequencing centers, Argonne and Yale, were generated for each sample group from two normal and preprocessed by the authors into two separate tables distributions N (μ = 6, σ = 2) and N (μ = 10, σ = 2).
of counts of 41,466 Ensembl genes by 80 and 81 sam- Censoring times were generated from a normal distribu- ples, respectively. We use these two tables of counts, and tion N (μ = 10, σ = 3). A sample was considered to be refer the reader for details on read mapping and sum- censored when the censoring time was smaller than the marization into gene-level counts to the methods of the survival time.
publication Some of the samples (11 from Argonneand 12 from Yale) were prepared and sequenced twicewithin each sequencing center. In these cases we kept the sample of deeper coverage obtaining a final number of 69 Data for differential expression analysis was obtained samples on each table. We further filtered genes with low from the following sources: Leukemia expression by discarding those with a mean of less than and Adrenocortical Carcinoma 0.5 counts per million calculated in log GSE10927). Data for 2 scale resulting in tables of counts with 17,607 genes (Argonne) and 17,843 the ovarian analysis was downloaded from TCGA on April genes (Yale) by 69 samples and we kept genes present in 2011. At the time of analysis, 389 samples were avail- both tables (17,324). Next, we normalized these two tables able that had clinical data, gene expression (Affy U133A), of counts adjusting for gene length and G+C content using and CNV (Affy SNP 6.0). In all cases, TCGA Level 3 the Bioconductor package cqn . The corresponding data was used. Gene expression data was batch cor- gene length and G+C content information was extracted rected using ComBat . RNA-seq data corresponded from data deposited at the same site from where the tables to HapMap lymphoblastoid cell lines (LCL) of of counts were downloaded.
Yoruba individuals and the processed tables of counts In order to proceed with the comparison of GSVA were downloaded from enrichment scores between microarray and RNA-seq Matching microarray samples form part data, we further filtered these two normalized tables of of a larger study by Huang and co-workers counts in order to match the genes and samples obtained after processing the LCL microarray data from Huangand co-workers . This step required first to trans- Microarray data processing
late Ensembl gene identifiers into Entrez gene identifiers Data analysis was performed using the R and Bio- and second to match gene and sample identifiers between conductor software. We selected chips which passed microarray and RNA-seq data. After these two steps we quality control using affyPLM . AffyPLM fits mod- obtained the two final tables of counts analyzed in this els on probe set level to identify chips of lower quality.
paper of 11,508 Entrez genes by 36 samples from which 23 Relative Log Expression (RLE) values (comparing probe correspond to female and 13 to male individuals.
expression on each array against the median expres-sion across all arrays) and Normalized Unscaled Standard Gene sets database
Errors (NUSE) (standard error estimates obtained for each In all experiments, we used the gene sets database from gene and standardized across arrays) are calculated and the Molecular Signature Database version 3 (MSigDB) cut-offs applied to remove low-quality samples.
C2 collection (curated pathways) with 833 canonical H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
pathways and 2392 chemical and gene perturbations, unless otherwise stated. After mapping genes from an We have presented a method for assaying the variation experiment to the gene set database, we ignore all gene of gene set enrichment over a sample population. The sets with fewer than 10 genes or more than 500 genes.
method is freely available as a Bioconductor package forR under the name GSVA at FDR and multiple hypothesis correction
The increasing availability of large data sets with mul- In most experiments, we use a permutation approach to tiple assays and complex phenotypes has motivated our estimate an empirical FDR at a specified p-value thresh- work because the study of these data sets within the con- old. In several cases we report multiple hypothesis correc- text of pathways will be critical to their understanding.
tion based on the Benjamini-Hochberg (B.H.) approach The GSVA method is both non-parametric and unsuper- to obtain corrected p-values. In general, multiple vised, and bypasses the conventional approach of explic- hypothesis correction on gene sets is problematic, as many itly modeling phenotypes within the enrichment scoring gene sets are highly overlapping and therefore not merely algorithm. We have also shown how GSVA can be easily correlated, but essentially duplicated. Our use of B.H. is adapted to the analysis of RNA-seq data producing results likely a conservative estimate of FDR and therefore used analogous to its microarray counterpart. In the Additional primarily as a demonstration of statistical power.
file two other examples of GSVA applications can befound including differential pathway analysis in a multi- class adrenocortical carcinoma data set (Additional file The analyses conducted on simulated and real data Figure S3 and Table S1), and correlation analysis of path- demonstrate that GSVA outperforms competing meth- ways and copy-number alterations in ovarian carcinoma ods for modeling pathway variation across samples in the (Additional file Figure S5).
context of identification of differential pathway activity For future directions, we believe GSVA may be used in and survival analysis. However, given the large number genetical genomics strategies analogous to eQTL mapping of GSE methods published and available to the bioinfor- to study, what we might call, pathway-QTL to identify matic community, GSVA may not be the optimal tool DNA polymorphisms that impact pathway activity for every expression data set. We recommend GSVA as This could be extended further to support causal inference an intermediate universal tool, providing summaries of , where pathways replace genes in modeling the causal pathway activity for more open-ended biological analy- chain of genotype → gene expression → phenotype.
sis. For specific applications, highly specialized algorithmsoptimized for addressing domain specific problems may Availability and requirements
outperform GSVA. The user should also be aware that the • Project name: GSVA
non-parametric density estimation within the GSVA algo- • Project home page:
rithm requires a sufficient number of observations which, according to our analysis of statistical power in Figure • Operating system(s): Platform independent
should be larger than n = 10.
Programming language: R, C
Non-specific filtering of genes in high-throughput • Other requirements: R (>= 2.15.0), the R package
experiments has been shown to increase the statistical methods, and the Bioconductor package GSEABase power to detect significant changes in gene expression lev- (>= 1.18.0) els and this observation is likely to hold at gene set • License: GPL (>= 2)
level. We have used a simple non-specific filtering strat- • Any restrictions to use by non-academics: no
egy of a minimum and maximum cutoff on the size of a gene set after gene identifiers have been matched betweengene expression data and gene sets. However, other strate-gies based on expected features of biologically relevant gene sets could potentially be more helpful. For instance,genes that are part of the same gene set or pathway are Supplementary material including figures S1 to S5
more likely to be expressed coordinately and are expected and tables S1 and S2.
to exhibit some degree of correlation. Gene sets con-taining correlated genes are more coherent and provide Competing interests
The authors declare no conflict of interest.
a higher biological signal than incoherent, uncorrelatedgene sets Hence, removing functionally incoher- ent pathways could constitute an appealing non-specific JG conceived and designed the GSVA algorithm. JG and RC implemented thesoftware. SH and JG conceived and designed the applications of GSVA. SH, RC filtering strategy to improve detection power at gene and JG analyzed the data and wrote the paper. All authors read and approved the final manuscript.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
12. Wu D, Lim E, Vaillant F, Asselin-Labat ML, Visvader JE, Smyth GK: ROAST:
We thank the following individuals for their helpful comments: Ingo Vogt, Alba rotation gene set tests for complex microarray experiments.
Jene, Gunes Gundem, Sonja Althammer and Josh Millstein. S.H. and R.C.
Bioinformatics (Oxford, England) 2010, 26(17):2176–2182.
acknowledge support from an ISCIII COMBIOMED grant [RD07/0067/0001] . [PMID: 20610611] and a Spanish MINECO grant [TIN2011-22826]. J.G. is supported in part by the 13. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, National Cancer Institute Integrative Cancer Biology Program, grant Zahnow CA, Patterson N, Golub TR, Ewen ME: A mechanism of cyclin D1
action encoded in the patterns of gene expression in human cancer.
Cell 2003, 114(3):323–334.
1 Research Program on Biomedical Informatics (GRIB), Hospital del Mar Medical 14. Shepard JL, Amatruda JF, Stern HM, Subramanian A, Finkelstein D, Ziai J, Research Institute (IMIM), Barcelona, Catalonia, Spain. 2Department of Finley KR, Pfaff KL, Hersey C, Zhou Y, Barut B, Freedman M, Lee C, Experimental and Health Sciences, Universitat Pompeu Fabra, Barcelona, Spitsbergen J, Neuberg D, Weber G, Golub TR, Glickman JN, Kutok JL, Catalonia, Spain. 3Sage Bionetworks, 1100 Fairview Ave N., Seattle, Aster JC, Zon LI: A zebrafish bmyb mutation causes genome
Washington, 98109 USA.
instability and increased cancer susceptibility. Proc Natl Acad Sci USA
2005, 102(37):13194–13199.
Received: 22 May 2012 Accepted: 21 December 2012 15. Segr e AV, Groop L, Mootha VK, Daly MJ, Altshuler D, Consortium D, Published: 16 January 2013 investigators M: Common inherited variation in Mitochondrial genes
is not enriched for associations with Type 2 diabetes or related

glycemic Traits. PLoS Genet 2010, 6(8):e1001058.
Goeman JJ, Geer SAvd, Kort Fd, Houwelingen HCv: A global test for
groups of genes: testing association with a clinical outcome.
16. Pece S, Tosoni D, Confalonieri S, Mazzarol G, Vecchi M, Ronzoni S, Bernard Bioinformatics 2004, 20:93–99.
L, Viale G, Pelicci PG, Fiore PPD: Biological and molecular heterogeneity
of breast cancers correlates with their cancer stem cell content. Cell
Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, 2010, 140:62–73.
Puigserver P, Carlsson E, Ridderstr˚ale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, 17. Hung JH, Yang TH, Hu Z, Weng Z, DeLisi C: Gene set enrichment
Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes
analysis: performance evaluation and usage guidelines. Brief
involved in oxidative phosphorylation are coordinately
Bioinformatics 2012, 13(3):281–291.
downregulated in human diabetes. Nature Genet 2003, 34(3):267–273.
. [PMID: 21900207] 18. Goeman JJ, B ¨uhlmann P: Analyzing gene expression data in terms of
Sweet-Cordero A, Mukherjee S, Subramanian A, You H, Roix JJ, gene sets: methodological issues. Bioinformatics (Oxford, England)
Ladd-Acosta C, Mesirov J, Golub TR, Jacks T: An oncogenic KRAS2
2007, 23(8):980–987.
expression signature identified by cross-species gene-expression
analysis. Nature Gen 2005, 37:48–55.
19. Kim SY, Volsky DJ: PAGE: Parametric analysis of gene set enrichment.
BMC Bioinformatics 2005, 6:144. [PMID: 15941488 PMCID: 1183189]
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, 20. Tenenbaum JD, Walker MG, Utz PJ, Butte AJ: Expression-based Pathway
Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set
Signature Analysis (EPSA): Mining publicly available microarray
enrichment analysis: A knowledge-based approach for interpreting
data for insight into human disease. BMC Med Genomics 2008, 1:51.
genome-wide expression profiles. Proc Natl Acad Sci USA 2005,
21. Creighton CJ: Multiple oncogenic pathway signatures show
coordinate expression patterns in human prostate tumors. PLoS One
Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, Schinzel AC, 2008, 3(3):e1816.
Sandy P, Meylan E, Scholl C, Fr ¨ohling S, Chan EM, Sos ML, Michel K, 22. Lee E, Chuang HY, Kim JW, Ideker T, Lee D: Inferring pathway activity
Mermel C, Silver SJ, Weir BA, Reiling JH, Sheng Q, Gupta PB, Wadlow RC, toward precise disease classification. PLoS Comput Biol 2008,
Le H, Hoersch S, Wittner BS, Ramaswamy S, Livingston DM, Sabatini DM, Meyerson M, Thomas RK, Lander ES, Mesirov JP, Root DE, Gilliland DG, 23. Zilliox MJ, Irizarry RA: A gene expression bar code for microarray data.
Jacks T, Hahn WC: Systematic RNA interference reveals that
Nat Meth 2007, 4(11):911–913.
oncogenic KRAS-driven cancers require TBK1. Nature 2009,
24. Hansen KD, Irizarry RA, Wu Z: Removing technical variability in
RNA-seq data using conditional quantile normalization. Biostatistics
Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling
25. Silverman BW: Density Estimation for Statistics and Data Analysis. London: studies. Proc Natl Acad Sci USA 2005, 102(38):13544–13549.
Chapman and Hall; 1986. . [ISBN 9780412246203] Barry WT, Nobel AB, Wright FA: Significance analysis of functional
26. Canale A, Dunson DB: Bayesian kernel mixtures for counts. J Am Stat
categories in gene expression studies: a structured permutation
approach. Bioinformatics 2005, 21(9):1943–1949.
27. Edelman E, Porrello A, Guinney J, Balakumaran B, Bild A, Febbo PG, Mukherjee S: Analysis of sample set enrichment scores: assaying the
Efron B, Tibshirani R: On testing the significance of sets of genes. Ann
enrichment of sets of genes for individual samples in genome-wide
Appl Stat 2006, 1(1):107–129.
expression profiles. Bioinformatics 2006, 22(14):e108–e116.
Dørum G, Snipen L, Solheim M, Sæbø S: Rotation testing in gene set
enrichment analysis for small direct comparison experiments. Stat
28. Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller Apps Gen Mol Bio 2009, 8:.
CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O'Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson 10. Irizarry RA, Wang C, Zhou Y, Speed TP: Gene set enrichment analysis
JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, made simple. Stat Methods Med Res 2009, 18(6):565–575.
Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN: Integrated
genomic analysis identifies clinically relevant subtypes of
11. Jiang Z, Gentleman R: Extensions to gene set enrichment.
glioblastoma characterized by abnormalities in PDGFRA, IDH1,
Bioinformatics 2007, 23(3):306–313.
EGFR, and NF1. Cancer Cell 2010, 17:98–110.
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
29. Pearson E: Comparison of tests for randomness of points on a line.
structure. Bioinformatics 2006, 22(13):1600–1607.
Biometrika 1963, 50:315–325.
30. Tamayo P, Steinhardt G, Liberzon A, Mesirov JP: Gene set enrichment
48. Young MD, Wakefield MJ, Smyth GK, Oshlack A: Gene ontology analysis
analysis made right. arXiv:1110.4128 2011.
for RNA-seq: accounting for selection bias. Genome Biol 2010,
31. Khatri P, Dr˘aghici S: Ontological analysis of gene expression data:
current tools, limitations, and open problems. Bioinformatics 2005,
49. Michaud J, Simpson KM, Escher R, Buchet-Poyau K, Beissbarth T, Carmichael C, Ritchie ME, Sch ¨utz F, Cannon P, Liu M, Shen X, Ito Y, Raskind WH, Horwitz MS, Osato M, Turner DR, Speed TP, Kavallaris M, Smyth GK, 32. Nam D, Kim SY: Gene-set approach for expression pattern analysis.
Scott HS: Integrative analysis of RUNX1 downstream pathways and
Brief Bioinformatics 2008, 9(3):189–197.
target genes. BMC Genomics 2008, 9:363.
33. Huang DW, Sherman BT, Lempicki RA: Bioinformatics enrichment
50. Khatri P, Sirota M, Butte AJ: Ten years of pathway analysis: current
tools: paths toward the comprehensive functional analysis of large
approaches and outstanding challenges. PLoS Comput Biol 2012,
gene lists. Nucleic Acids Res 2009, 37:1–13.
. [PMID: 19033363 PMCID: PMC2615629] 51. Huang RS, Duan S, Bleibel WK, Kistner EO, Zhang W, Clark TA, Chen TX, 34. Jung K, Becker B, Brunner E, Beißbarth T: Comparison of global tests for
Schweitzer AC, Blume JE, Cox NJ, Dolan ME: A genome-wide approach
functional gene sets in two-group designs and selection of
to identify genetic variants that contribute to etoposide-induced
potentially effect-causing genes. Bioinformatics 2011,
cytotoxicity. Proc Natl Acad Sci USA 2007, 104(23):9758–9763.
. [PMID: 17537913] 52. Pickrell JK, Marioni JC, Pai AA, Degner JF, Engelhardt BE, Nkadori E, 35. Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression
Veyrieras JB, Stephens M, Gilad Y, Pritchard JK: Understanding
using singular value decomposition. BMC Bioinformatics 2005, 6:225.
mechanisms underlying human gene expression variation with RNA
sequencing. Nature 2010, 464(7289):768–772.
16156896 PMCID: PMC1261155] 36. Bair E, Tibshirani R: Semi-supervised methods to predict patient
53. Carrel L, Willard HF: X-inactivation profile reveals extensive variability
survival from gene expression data. PLoS Biol 2004, 2(4).
in X-linked gene expression in females. Nature 2005,
. [PMID: 15094809 PMCID: 37. Armstrong SA, Staunton JE, Silverman LB, Pieters R, Boer MLd, Minden 54. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations
Repping S, Pyntikova T, Ali J, Bieri T, Chinwalla A, Delehaunty A, specify a distinct gene expression profile that distinguishes a
Delehaunty K, Du H, Fewell G, Fulton L, Fulton R, Graves T, Hou SF, unique leukemia. Nature Gen 2002, 30:41–47.
Latrielle P, Leonard S, Mardis E, Maupin R, McPherson J, Miner T, Nash W, Nguyen C, Ozersky P, Pepin K, Rock S, Rohlfing T, Scott K, Schultz B, Strong 38. Smyth GK: Linear models and empirical Bayes Methods for assessing
C, Tin-Wollam A, Yang SP, Waterston RH, Wilson RK, Rozen S, Page DC: differential expression in microarray experiments. Stat Appl Gen Mol
The male-specific region of the human Y chromosome is a mosaic of
Biol 2004, 3:.
discrete sequence classes. Nature 2003, 423(6942):825–837.
39. Hubert L, Arabie P: Comparing partitions. J Classif 1985, 2:193–218.
55. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing
40. Network TCGAR: Integrated genomic analyses of ovarian carcinoma.
classifier performance in R. Bioinformatics 2005, 21(20):3940–3941.
Nature 2011, 474(7353):609–615.
56. Giordano TJ, Kuick R, Else T, Gauger PG, Vinco M, Bauersfeld J, Sanders D, . [PMID: 21720365] Thomas DG, Doherty G, Hammer G: Molecular classification and
41. Soprano KJ, Purev E, Vuocolo S, Soprano DR: Rb2/p130 and protein
prognostication of adrenocortical tumors by transcriptome
phosphatase 2A: key mediators of ovarian carcinoma cell growth
profiling. Clin Cancer Res: Official J Am Assoc Cancer Res 2009,
suppression by all-trans retinoic acid. Oncogene 2006,
57. Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray
42. Um SJ, Lee SY, Kim EJ, Han HS, Koh YM, Hong KJ, Sin HS, Park JS: expression data using empirical Bayes methods. Biostatistics 2007,
Antiproliferative mechanism of retinoid derivatives in ovarian
cancer cells. Cancer Letters 2001, 174(2):127–134.
58. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, 43. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA: Wang J, Wang W, Yu J, Zhang B, Zhang Q, Zhao H, et al.: A second
COSMIC: mining complete cancer genomes in the Catalogue of
generation human haplotype map of over 3.1 million SNPs. Nature
Somatic Mutations in Cancer. Nucleic Acids Res 2010,
2007, 449(7164):851–861.
59. Team RDC: R: A Language and Environment for Statistical Computing.
44. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and
Vienna: R Foundation for Statistical Computing; 2010. quantifying mammalian transcriptomes by RNA-Seq. Nat Meth 2008,
. [ISBN 3-900051-07-0] 60. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, 45. Robinson MD, McCarthy DJ, Smyth GK: edgeR: a Bioconductor package
Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, for differential expression analysis of digital gene expression data.
Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Bioinformatics 2010, 26:139–140.
Tierney L, Yang JYH, Zhang J: Bioconductor: open software
development for computational biology and bioinformatics.
46. Wu D, Lim E, Vaillant F, Asselin-Labat ML, Visvader JE, Smyth GK: ROAST:
Genome Biol 2004, 5(10):R80.
rotation gene set tests for complex microarray experiments.
. [PMID: 15461798] Bioinformatics (Oxford, England) 2010, 26(17):2176–2182.
61. Bolstad BM: Low-level analysis of high-density oligonucleotide array
. [PMID: 20610611] data: background, normalization and summarization. PhD thesis,
47. Alexa A, Rahnenf ¨uhrer J, Lengauer T: Improved scoring of functional
University of Waikato 2004. groups from gene expression data by decorrelating GO graph
H¨anzelmann et al. BMC Bioinformatics 2013, 14:7
62. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries
of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003,
63. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a
practical and powerful approach to multiple testing. J R Stat Soc 1995,
64. Bourgon R, Gentleman R, Huber W: Independent filtering increases
detection power for high-throughput experiments. Proc Natl Acad Sci
2010, 107(21):9546–9551.
65. Levine DM, Haynor DR, Castle JC, Stepaniants SB, Pellegrini M, Mao M, Johnson JM: Pathway and gene-set activation measurement from
mRNA expression data: the tissue distribution of human pathways.
Genome Biol 2006, 7(10):R93. [PMID: 17044931]
66. Parts L, Stegle O, Winn J, Durbin R: Joint genetic analysis of gene
expression data with inferred cellular phenotypes. PLoS Genet 2011,
67. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R,
Metzger JM, Yang L, Castle J, Zhu H, Kash SF, Drake TA, Sachs A, Lusis AJ:
An integrative genomics approach to infer causal associations
between gene expression and disease.
Nature Genet 2005, 37:710–717.
Cite this article as: H¨anzelmann et al.: GSVA: gene set variation analysis for
microarray and RNA-Seq data. BMC Bioinformatics 2013 14:7.
Submit your next manuscript to BioMed Central
and take full advantage of:

• Convenient online submission
• Thorough peer review
• No space constraints or color figure charges
• Immediate publication on acceptance
• Inclusion in PubMed, CAS, Scopus and Google Scholar
• Research which is freely available for redistribution
Submit your manuscript at


Int. J. Pharm. Med. & Bio. Sc. 2013 Kumar Amit et al., 2013 ISSN 2278 – 5221 Vol. 2, No. 4, October 2013 © 2013 IJPMBS. All Rights Reserved A COMPARATIVE STUDY OF EFFICACY OF TERBINAFINE AND FLUCONAZOLE IN PATIENTS OF TINEA CORPORIS