Learning from Heterogeneous Sources via Gradient Boosting Consensus

Xiaoxiao Shi*, Jean-Francois Paiement†, David Grangier†, Philip S. Yu*

*Computer Science Department, University of Illinois at Chicago, USA. {xshi9, psyu}@uic.edu.
†AT&T Labs, USA. {jpaiement, grangier}@research.att.com.
Abstract

Multiple data sources containing different types of features may be available for a given task. For instance, users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical behaviors and social networks to infer users' interests in related products. We argue that it is desirable to collectively use any available multiple heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping features, (ii) non-overlapping instances, and (iii) multiple networks (i.e., graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning model based on gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) there should be consensus among the predictions of overlapping instances (if any) from different data sources; (2) connected instances in graph datasets should have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources and deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition and terrorist attack detection tasks. We observe that the proposed model can improve the out-of-sample error rate by as much as 80%.

Introduction

Given a target task, multiple related data sources can be used to build prediction models. Each of the related data sources may have a distinct set of features and instances, and the combination of all data sources may yield better prediction results. An example is illustrated in Fig. 1. The task is to predict movie ratings in the Internet Movie Database (IMDB), which has been used in movie recommendation. For example, in Fig. 1(a), given that we observe that the rating of "The Godfather" is 9.2 (out of 10) and that of "The Giant Spider Invasion" is 2.8, what are the ratings of "Apocalypse Now" and "Monster a-Go Go"? Note that in this task, there are multiple available databases that record various information about movies. For instance, there is a genre database (Fig. 1(b)), a sound technique database (Fig. 1(c)), a running times database (Fig. 1(d)), an actor graph database that links two movies together if the same actor/actress performs in both movies (Fig. 1(e)), and a director graph database that links two movies if they are directed by the same director (Fig. 1(f)). Note that these multiple data sources have the following properties:

• Firstly, each data source can have its own feature space. For example, the running times database (Fig. 1(d)) has numerical features; the sound technique database (Fig. 1(c)) has nominal features; and the actor graph database (Fig. 1(e)) provides graph relational features.

• Secondly, each data source can have its own set of instances. For example, the genre database does not have a record for "Monster a-Go Go", and the running times database does not have any record of "Apocalypse Now".

Note that it is difficult to build an accurate prediction model by using only one of the five databases, since the information in each of them is incomplete. However, if we consider the five data sources collectively, we are able to infer that the rating of "Apocalypse Now" (ground truth: 8.6) may be close to that of "The Godfather", since they are similar in genre and they are connected in the actor graph. Similarly, one can infer that the rating of "Monster a-Go Go" (ground truth: 1.5) is similar to that of "The Giant Spider Invasion".

Figure 1: Combining different sources to infer movie ratings. Panels: (a) movie rating prediction, (b) genre database, (c) sound technique database, (d) running times, (e) actor graph, (f) director graph, which does not have a record for "Apocalypse Now". The true rating for "Apocalypse Now" is 8.6, while the rating for "Monster a-Go Go" is 1.5.

In the past, multi-view learning was proposed to study a related problem where each instance can have different views. However, it usually does not consider graph data with relational features, especially when there are multiple graphs and each graph may only contain a subset of the relational features.
Hence, we study a more general learning scenario called heterogeneous learning, where the data can come from multiple sources. Specifically, the data sources can (1) have non-overlapping features (i.e., new features in certain data sources), (2) have some non-overlapping instances (i.e., new objects/instances in certain data sources), and (3) contain multiple network (i.e., weighted graph) datasets. Furthermore, some of the data sources may contain substantial noise or low-quality data. Our aim is to utilize all data sources collectively and judiciously, in order to improve the learning performance.

A general objective function is proposed to make good use of the information from these multiple data sources. The intuition is to learn a prediction function from each data source to minimize the empirical loss with two constraints. First, if there are overlapping instances, the predictions of the same instance should be similar even when learning from different data sources. Second, the predictions of connected data (i.e., instances connected in any of the graphs) should be similar. Finally, the prediction models are judiciously combined (with different weights) to generate a global prediction model. In order to solve the objective function, we borrow ideas from gradient boosting decision trees (GBDT), an iterated algorithm that generates a sequence of decision trees, where each tree fits the gradient residual of the objective function. We call our proposed algorithm Gradient Boosting Consensus (GBC), because each data source generates a set of trees, and the consensus of the decision trees makes the final prediction. Moreover, GBC has the following properties:

• Deep-ensemble. Recall that the traditional boosting tree model is an iterated algorithm that builds new trees based on the previous iterations (residuals). Usually, these new trees are generated based on the residual of only one data source. However, as shown in Fig. 2, GBC generates new trees collectively from all data sources (horizontally) in each iteration (vertically). We call it "deep ensemble" since it ensembles models both horizontally and vertically to make the final prediction.

• Network-friendly. Unlike traditional boosting trees, GBC can take advantage of multiple graph datasets to improve learning. In other words, it can take advantage of traditional vector-based features and graph relational features simultaneously.

• Robust. Some data sources may contain substantial noise. A weighting strategy is incorporated into GBC to emphasize informative data sources and deemphasize the noisy ones. This weighting strategy is further proven to have a tighter error bound in both inductive and transductive settings.

Figure 2: Gradient Boosting Consensus. Each data source maintains its own set of trees, and new trees for all sources are generated jointly at every iteration.

We conducted three sets of experiments, including IMDB movie rating prediction, UCI number recognition, and terrorist attack detection, and each task has a set of data sources with heterogeneous features. For example, in the IMDB movie rating prediction task, we have data sources about the plots of the movies (text data), the technologies used by the movies (nominal features), the running times of the movies (numerical features), and several movie graphs (such as the director graph and the actor graph). All these mixed types of data sources were used collectively to build a prediction model. Since there is no previous model that can handle the problem directly, we have constructed a straightforward baseline which first appends all data sources together into a single database, and then uses traditional learning models to make predictions. Experiments show that the proposed GBC model consistently outperforms our baseline, and can decrease the error rate by as much as 80%.

Related Work

There are several areas of related work upon which our proposed model is built. First, multi-view learning is proposed to learn from instances which have multiple views in different feature spaces. For example, one line of work proposes a framework to reconcile the clustering results from different views, and another introduces the notion of consensus learning, whose general idea is to perform learning on each heterogeneous feature space independently and then summarize the results via an ensemble. Recently, a recommendation model (collaborative filtering) has been proposed that can combine information from different contexts. It finds a latent factor that connects all data sources, and propagates information through the latent factor. There are mainly two differences between our work and these previous approaches. First, most of the previous works do not consider vector-based features and relational features simultaneously. Second and foremost, most of the previous works require the data sources to have records of all instances in order to enable the mapping, while the proposed GBC model does not have this constraint.

Another area of related work is collective classification, which aims at predicting class labels from a network. Its key idea is to combine the supervision knowledge from traditional vector-based feature vectors with the linkage information from the network. It has been applied to various applications, such as part-of-speech tagging and the classification of hypertext documents using hyperlinks. However, these works study the case where there is only one vector-based feature space and only one relational feature space, and the focus is how to combine the two. Different from the traditional collective classification framework, we consider multiple vector-based features and multiple relational features simultaneously. Closer to our setting, one approach combines multiple graphs to improve learning; its basic idea is to average the predictions during training. There are three differences between these previous works and the current model. Firstly, we allow different data sources to have non-overlapping instances. Secondly, we introduce a weight learning process to filter out noisy data sources. Thirdly, we consider multiple vector-based sources and multiple graphs at the same time. Hence, none of the aforementioned methods can effectively learn from the datasets described in the experiments section, as these datasets all contain multiple vector-based data sources and relational graphs.
Problem Formulation

In this section, we formally define the problem of heterogeneous learning, and then introduce a general learning objective. In heterogeneous learning, data can be described in heterogeneous feature spaces from multiple sources. Traditional vector-based features are denoted with column vectors x_i^(j), corresponding to the i-th data point in the j-th source (or the j-th feature space), whose dimension is d_j. In matrix form, X^(j) ∈ R^{d_j × m} is the dataset in the j-th feature space, where m is the sample size. Different from vector-based features, graph relational features describe the relationships between instances. In other words, they are graphs representing the connectivity/similarity of the data. Specifically, we denote G_g = <V_g, E_g> as the g-th graph, where V_g is the set of nodes and E_g ⊆ V_g × V_g is the set of edges.
Table 1: Symbol definition
x_i^(j): The i-th data point (column vector) in the j-th source (the j-th feature space).
G_g: The g-th relational graph.
U_i: The set of unlabeled data in the i-th data source.
f_i: The prediction model built from the i-th data source.
G(f, w): Graph connectivity constraint.
T: Set of labeled data.
We assume that the features from the same data source are from the same feature space, and hence each data source has a corresponding feature space. Furthermore, different data sources may provide different sets of instances. In other words, some instances exist in some data sources, but are missing in the others. In summary, heterogeneous learning is a machine learning scenario where we consider data from different sources, but they may (1) have different sets of instances, (2) have different feature spaces, and (3) have multiple network-based (graph) datasets. Hence, we have p data sources providing vector-based features X^(1), · · · , X^(p) and q data sources providing relational networks G_1, · · · , G_q. The aim is to derive learning models (classification, regression or clustering) by collectively and judiciously using the p + q data sources. A set of important symbols used in the remainder of the paper is summarized in Table 1.
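To make the setting concrete, the following is a minimal sketch of how the inputs of heterogeneous learning could be organized in code. The container name, the field names, and the toy movie data are illustrative choices of ours, not part of the paper.

```python
# Hypothetical container for the heterogeneous-learning inputs described above.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class HeterogeneousData:
    # p vector-based sources: source index -> {instance id -> feature vector};
    # each source has its own feature space and its own subset of instances.
    sources: Dict[int, Dict[str, np.ndarray]] = field(default_factory=dict)
    # q relational sources: graph index -> list of (instance id, instance id) edges.
    graphs: Dict[int, List[Tuple[str, str]]] = field(default_factory=dict)
    # labeled set T: instance id -> label (rating or class).
    labels: Dict[str, float] = field(default_factory=dict)

    def sources_containing(self, x_id: str) -> List[int]:
        """Indices {i : x is recorded in source i}, used when averaging predictions."""
        return [i for i, data in self.sources.items() if x_id in data]


# Toy instance mirroring Fig. 1: two vector sources with non-overlapping movies
# plus one actor graph.
data = HeterogeneousData(
    sources={
        0: {"Godfather": np.array([175.0]), "Spider Invasion": np.array([84.0])},        # running times
        1: {"Godfather": np.array([1.0, 0.0]), "Apocalypse Now": np.array([1.0, 0.0])},  # genre (one-hot)
    },
    graphs={0: [("Godfather", "Apocalypse Now")]},
    labels={"Godfather": 9.2, "Spider Invasion": 2.8},
)
print(data.sources_containing("Apocalypse Now"))  # -> [1]
```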
Gradient Boosting Consensus

In this section, we describe the general framework of the proposed GBC model and its theoretical foundations.

The GBC framework. In order to use multiple data sources, the objective function aims at minimizing the overall empirical loss in all data sources, with two more constraints. First, the overlapping instances should have similar predictions from the models trained on different data sources, and we call this the principle of consensus. Second, when graph relational data is provided, the connected data should have similar predictions, and we call this the principle of connectivity similarity. In summary, the objective function can be written as follows:

min_{f, w}  Σ_i w_i Σ_{(x,y)∈T} L(f_i(x), y)   subject to   C(f, w) = 0  and  G(f, w) = 0,

where L(f_i(x), y) is the empirical loss on the set of training data T, and w_i is the weight of importance of the i-th data source, which is discussed in the Weight Learning subsection. Furthermore, C(f, w) = 0 is the constraint derived from the principle of consensus, defined as follows:

C(f, w) = Σ_i w_i Σ_{x∈U_i} L(f_i(x), E[f(x)]).

It first calculates the expected prediction E[f(x)] of a given unlabeled instance x, by summarizing the current predictions from the multiple data sources. This expectation is computed only from the data sources that contain x; in other words, it is computed from the data sources whose indices are in the set {i | x ∈ U_i}, where U_i is the set of unlabeled instances in the i-th data source. Hence, if the j-th data source does not have a record of x, it will not be used to calculate the expected prediction. This strategy enables GBC to handle non-overlapping instances in multiple data sources, and to use overlapping instances to improve the consensus. The constraint forces the predictions of x (e.g., f_1(x), f_2(x), · · · ) to be close to E[f(x)].

Furthermore, according to the principle of connectivity similarity, we introduce another constraint G(f, w) as follows:

G(f, w) = Σ_g ŵ_g Σ_i w_i Σ_x L(f_i(x), E_g[f(x)]).

The above constraint encourages connected data to have similar predictions. It works by calculating the graph-based expected prediction E_g[f(x)] of x, i.e., the average prediction of all its connected neighbors (the z's) in graph G_g. If there are multiple graphs, all the graph-based expected predictions are summarized with the graph weights ŵ_g.

We use the method of Lagrange multipliers to solve the constrained optimization above. The objective function becomes

O(f, w) = Σ_i w_i Σ_{(x,y)∈T} L(f_i(x), y) + λ0 C(f, w) + λ1 G(f, w),

where the two constraints C(f, w) and G(f, w) are regularized by the Lagrange multipliers λ0 and λ1. These parameters are determined by cross-validation, as detailed in the experiments section. Note that in this objective the weights w_i and ŵ_g (i, g = 1, 2, · · · ) are essential. On one hand, the w_i's are introduced to assign different weights to different vector-based data sources. Intuitively, if the t-th data source is more informative, w_t should be large. On the other hand, the ŵ_g's are the weights for the graph relational data sources. Similarly, the aim is to give high weights to important graph data sources, while deemphasizing the noisy ones. We define different weight learning strategies for the data sources with vector-based features (w_i) and graph relational features (ŵ_g). The values of the weights are automatically learned and updated during the training process, as discussed in the Weight Learning subsection.
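The following is a small numerical sketch of the objective above with the L-2 loss. It assumes that the consensus prediction E[f(x)] is a weighted average of the per-source predictions over the sources that contain x, and that the connectivity penalty compares each prediction with the average prediction of the instance's neighbors; the function names, data layout, and that specific averaging scheme are our illustrative choices rather than the paper's exact definitions.

```python
import numpy as np


def expected_prediction(x_id, predictions, weights):
    """Weighted average of current predictions over the sources that contain x (E[f(x)])."""
    idx = [i for i, p in predictions.items() if x_id in p]
    if not idx:
        return None
    w = np.array([weights[i] for i in idx])
    f = np.array([predictions[i][x_id] for i in idx])
    return float(np.dot(w, f) / w.sum())


def gbc_objective(predictions, weights, graph_weights, labels, unlabeled, graphs,
                  lam0=1.0, lam1=1.0):
    """Empirical loss + lam0 * consensus term C + lam1 * connectivity term G (L-2 loss)."""
    # empirical loss on the labeled set T
    emp = sum(weights[i] * (p[x] - y) ** 2
              for i, p in predictions.items()
              for x, y in labels.items() if x in p)
    # consensus: predictions of the same (unlabeled) instance should agree
    cons = sum(weights[i] * (p[x] - expected_prediction(x, predictions, weights)) ** 2
               for i, p in predictions.items()
               for x in unlabeled if x in p)
    # connectivity: each prediction should be close to the average prediction of its neighbors
    conn = 0.0
    for g, edges in graphs.items():
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, []).append(b)
            nbrs.setdefault(b, []).append(a)
        for i, p in predictions.items():
            for x in p:
                zs = [expected_prediction(z, predictions, weights) for z in nbrs.get(x, [])]
                zs = [z for z in zs if z is not None]
                if zs:
                    conn += graph_weights[g] * weights[i] * (p[x] - float(np.mean(zs))) ** 2
    return emp + lam0 * cons + lam1 * conn


# Tiny usage example with two sources, one labeled movie, and one graph edge.
preds = {0: {"a": 8.0, "b": 3.0}, 1: {"a": 7.0, "c": 4.0}}
print(gbc_objective(preds, {0: 0.5, 1: 0.5}, {0: 1.0},
                    labels={"a": 9.0}, unlabeled=["b", "c"], graphs={0: [("a", "c")]}))
```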
Model training of GBC. We use stochastic gradient descent to solve the optimization problem above. In general, it is an iterated algorithm that updates the prediction functions f(x) in the following way:

f(x) ← f(x) − ρ ∂O/∂f(x).

The functions are updated iteratively until a convergence condition is satisfied. Specifically, inspired by gradient boosting decision trees (GBDT), a regression tree is built to fit the gradient ∂O/∂f, and the best step size ρ is found via line search. Note that the calculation of ∂O/∂f depends on the loss function L(f, y) used in the objective. In the following, we use the L-2 loss (for regression problems) and the binary logistic loss (for binary classification problems) as examples.

GBC with L-2 Loss: In order to update the prediction function of the i-th data source, we follow the gradient descent formula

f_i(x) ← f_i(x) − ρ ∂O/∂f_i(x),

where, with the L-2 loss, the partial derivative consists of the empirical residual (f_i(x) − y) on the labeled data, plus the λ0-weighted consensus residual and the λ1-weighted connectivity residual. The L-2 loss is a straightforward loss function for the GBC model, and it is used to perform the regression tasks in the experiments.

GBC with Logistic Loss: With the logistic loss, the empirical part of the partial derivative becomes the usual logistic gradient −y/(1 + exp(y f_i(x))), while the consensus and connectivity terms are unchanged. Note that this formula uses the binary logistic loss where y = −1 or y = 1, but one can easily extend the model to tackle multi-class problems by using the one-against-others strategy. In the experiments we adopt this strategy to handle multi-class problems.

With the updating rule, we can build the GBC model as described in Algorithm 1. It first finds the initial prediction models for all data sources (Step 1). Then, it goes into the main iteration that generates a series of decision trees. The basic idea is to follow the updating rule above and build a decision tree g_i(x^(i)) to fit the partial derivative of the loss. Furthermore, following GBDT, we let the number of iterations N be set by users; in the experiments, it is determined by cross-validation. Then, given a new data point x, the predicted output is obtained by applying a prediction generation function P to the consensus of the per-source predictions, where P(y) = y in regression problems, and P(y) = 1 iff y > 0 (P(y) = −1 otherwise) in binary classification problems.

Algorithm 1: Gradient Boosting Consensus
Input: Data from different sources X^(i), i = 1, · · · , p; expected outputs (labels or regression values) of a subset of data, Y; number of iterations N.
Output: The GBC prediction model.
1. Initialize each f̂_i(x) to be a constant, for i = 1, 2, · · · , p.
2. Initialize the weights w_i = 1.
3. For t = 1 to N:
4.   For all x^(i), compute the negative gradient z_i with respect to f(x^(i)).
5.   Fit a regression model g_i(x^(i)) that predicts the z_i's from the x^(i)'s.
6.   Line search for the optimal step size: ρ_i = arg min_ρ O(f̂_i(x) + ρ g_i(x^(i)), w).
7.   Update the estimate: f̂_i(x) ← f̂_i(x) + ρ_i g_i(x^(i)).
8.   Update the weights w and ŵ as described in the Weight Learning subsection.

Weight Learning. In the objective function described above, one important element is the set of weights (w_i and ŵ_g) for the data sources. Ideally, informative data sources will have high weights, and noisy data sources will have low weights. As such, the proposed GBC model can judiciously filter out the data sources that are noisy. To this aim, we design the weights by looking at the empirical loss of the model trained from each data source. Specifically, if a data source induces a large loss, its weight should be low. Following this intuition, we design the weight as

w_i = exp(−Σ_{(x,y)∈T} L(f_i(x), y)) / z,

where L(f_i(x), y) is the empirical loss of the model trained from the i-th data source, and z is a normalization constant ensuring that the w_i sum to one. The definition of the weight w_i is adapted from the weighting matrix in normalized cut. The exponential effectively penalizes large losses: w_i is large if the empirical loss of the i-th data source is small, and becomes small if the loss is large. It is proven in Theorem 4.1 that this updating rule for the weights results in a smaller error bound. Similarly, we define the weights for the graph data sources as

ŵ_g = exp(−Σ_{(x_a,x_b)∈E_g} Σ_i w_i L(f_i(x_a), f_i(x_b))) / ẑ,

where L(f_i(x_a), f_i(x_b)) is the pairwise loss that evaluates the difference between the two predictions f_i(x_a) and f_i(x_b), and ẑ is again a normalization constant. The idea behind this weight is to evaluate whether a graph links similar instances together: if most of the connected instances have similar predictions, the graph is considered informative. Note that both sets of weights are updated at each iteration. By substituting them into the objective, one can observe that the objective function of the GBC model is adaptively updated at each iteration. In other words, at the initial step, each data source is given an equal weight; but after several iterations, informative data sources have higher weights, and the objective function "trusts" the informative data sources more.
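The following is a runnable sketch of a training loop in the spirit of Algorithm 1 for the L-2 loss, restricted to the uniform setting (all sources record the same instances) and to vector-based sources only. The specific gradient expression, the grid-based line search, the tree depth, and the exponential weight update are our illustrative choices under the assumptions above; scikit-learn's DecisionTreeRegressor stands in for the regression trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 200
sources = [rng.randn(n, 5), rng.randn(n, 3)]          # two vector-based sources
y = sources[0][:, 0] + 0.5 * sources[1][:, 1] + 0.1 * rng.randn(n)
labeled = np.arange(n) < 100                          # first half is the labeled set T

p = len(sources)
F = [np.full(n, y[labeled].mean()) for _ in range(p)] # step 1: constant initial models
w = np.ones(p) / p                                    # step 2: equal source weights
lam0, rhos = 1.0, np.linspace(0.01, 1.0, 20)

for t in range(50):                                   # main boosting iterations
    Ef = np.average(np.vstack(F), axis=0, weights=w)  # consensus prediction E[f(x)]
    for i in range(p):
        # step 4: negative gradient w.r.t. f_i (empirical residual on T,
        # consensus residual on the unlabeled part)
        z = np.where(labeled, y - F[i], 0.0) + lam0 * np.where(~labeled, Ef - F[i], 0.0)
        # step 5: fit a regression tree to the negative gradient
        g = DecisionTreeRegressor(max_depth=3, random_state=0).fit(sources[i], z).predict(sources[i])
        # step 6: line search over a fixed grid of step sizes
        def loss(rho):
            f = F[i] + rho * g
            return (np.mean((f[labeled] - y[labeled]) ** 2)
                    + lam0 * np.mean((f[~labeled] - Ef[~labeled]) ** 2))
        rho = rhos[int(np.argmin([loss(r) for r in rhos]))]
        F[i] = F[i] + rho * g                          # step 7: update the estimate
    # step 8: re-weight sources by exponentiated negative empirical loss
    emp = np.array([np.mean((F[i][labeled] - y[labeled]) ** 2) for i in range(p)])
    w = np.exp(-emp) / np.exp(-emp).sum()

pred = np.average(np.vstack(F), axis=0, weights=w)     # final consensus prediction
print("train RMSE: %.3f" % np.sqrt(np.mean((pred[labeled] - y[labeled]) ** 2)))
```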
Generalization bounds. In this section, we consider the incompatibility framework of multi-view learning to explain the proposed GBC model. Specifically, we show that the weight learning process described above can help reduce an error bound. For the sake of simplicity, we consider the case where we have two data sources X_1 and X_2; the case with more data sources can be analyzed with similar logic. The goal is to learn a pair of predictors (f_1, f_2), where f_1 : X_1 → Ŷ and f_2 : X_2 → Ŷ, with Ŷ the prediction space. Further denote by F_1 and F_2 the hypothesis classes of interest, consisting of functions from X_1 (respectively X_2) to the prediction space Ŷ. Let L(f_1) denote the expected loss of f_1, with L(f_2) defined similarly, and let a Bayes optimal predictor with respect to the loss L be denoted f*. We now apply the incompatibility framework for the multi-view setting to study GBC. For an incompatibility function χ : F_1 × F_2 → R≥0 and some t ≥ 0, we first define the set of pairs of functions that are compatible to the tune of t:

Cχ(t) = {(f_1, f_2) : f_1 ∈ F_1, f_2 ∈ F_2 and E[χ(f_1, f_2)] ≤ t}.

Intuitively, Cχ(t) captures the set of functions f_1 and f_2 that are compatible with respect to a "maximal expected difference" t. It is proven that there exists a symmetric function d : F_1 × F_2 → R and a monotonically increasing non-negative function Φ on the reals such that, for all f,

E[d(f_1(x), f_2(x))] ≤ Φ(L(f_1) − L(f_2)).

With these functions at hand, we can derive the following error bound.

Theorem 4.1. Let L(f_1) − L(f*) < ε_1 and L(f_2) − L(f*) < ε_2. Then, for the incompatibility class Cχ(t),

L_GBC(f_1, f_2) ≤ L(f*) + ε_bayes + inf_{(f_1,f_2)∈Cχ(t)} (· · · ),

where the last term depends on t, on ε_1 and ε_2, and on a constant C_d determined by the function d.

Proof. Note that L(f_1) − L(f*) < ε_1 and L(f_2) − L(f*) < ε_2, and that the proposed GBC model adopts a weighting strategy that is linear in the expected loss, so that L_GBC(f_1, f_2) is approximately the loss of the corresponding weighted combination of f_1 and f_2. According to Lemma 8 of the incompatibility framework, E[χ(f_1, f_2)] is bounded, and therefore

L_GBC(f_1, f_2) ≤ L_GBC(f*_1, f*_2) + ε_bayes.

With Lemma 7 of the same framework, we then get

L_GBC(f_1, f_2) ≤ L(f*) + ε_bayes + inf_{(f_1,f_2)∈Cχ(t)} (· · · ).

Hence, the weighting strategy induces a tighter bound than the equal-weighting strategy, whose corresponding last term is larger. It is important to note that if the predictions of different data sources vary significantly (i.e., the gap between ε_1 and ε_2 is large), the proposed weighting strategy has a much tighter bound than the equal-weighting strategy. In other words, if there are some noisy data sources that potentially lead to a large error rate, GBC can effectively reduce their effect. This is an important property of GBC for handling noisy data sources, and it is evaluated empirically in the next section.
Similarly, we can derive the error bound of GBC in a transductive setting.

Theorem 4.2. Consider the transductive formulation of the objective with regularization parameter λ > 0, and denote by L_λ(f) the expected loss with the regularization parameter λ. If we set λ_c = (· · · ), then for the pair of functions (f_1, f_2) ∈ F_1 × F_2 returned by the transductive learning algorithm, with probability at least 1 − δ over the labeled samples,

L_λ(f_1, f_2) ≤ L_λ(f*) + (· · · ),

where n is the number of labeled examples, C_Lip is the Lipschitz constant of the loss, and the remaining term is bounded by the number of unlabeled examples and the bound of the losses.

Note that Theorems 4.1 and 4.2 give the error bounds of GBC in the inductive and transductive settings, respectively. In effect, the weighting strategy reduces the last term of each error bound, as compared to equal weighting.

Experiments

In this section, we report three sets of experiments that were conducted in order to evaluate the proposed GBC model applied to multiple data sources. Specifically, the experiments aim to answer the following questions:

• Can GBC make good use of multiple data sources? Can it beat other more straightforward strategies?

• What is the performance of GBC if there exist non-overlapping instances in different data sources?

Datasets. The aim of the first set of experiments is to predict movie ratings from the IMDB database. Note that there are 10 data sources in this task. For example, there is a data source about the plots of the movies, and a data source about the techniques used in the movies (e.g., 3D IMAX). Furthermore, there are several data sources providing different graph relational data about the movies; for example, in the director graph, two movies are connected if they have the same director. A summary of the different data sources can be found in Table 2. It is important to note that each of the data sources may provide certain useful information for predicting the ratings of the movies. For instance, the genre database may reflect that certain types of movies are likely to have high ratings (e.g., Fantasy); the director graph database implicitly infers movie ratings from similar movies by the same director (e.g., Steven Spielberg has many high-rating movies). Thus, it is desirable to incorporate different types of data sources to give a more accurate movie rating prediction. This is an essential task for online TV/movie recommendation, such as the famous $1,000,000 Netflix prize.
Table 2: IMDB Movie Rating Prediction data sources, including the Technology Database, the Sound Technology Database, and the Running Time Database, among others.
The second set of experiments is about handwritten number recognition. The dataset contains 2000 handwritten numerals ("0"–"9") extracted from a collection of Dutch utility maps. The handwritten numbers are scanned and digitized as binary images. They are represented in terms of the following seven data sources: (1) 76 Fourier coefficients of the character shapes, (2) 216 profile correlations, (3) 64 Karhunen-Loève coefficients, (4) 240 pixel averages in 2 × 3 windows, (5) 47 Zernike moments, (6) a graph dataset constructed from morphological similarity (i.e., two objects are connected if they have a similar morphological appearance), and (7) a graph generated with the same method as (6), but with random Gaussian noise imposed on the morphological similarity. This last data source is included to test the performance of GBC on noisy data. The aim is to classify a given object into one of the ten classes ("0"–"9"). The statistics of the dataset are summarized in Table 3.
The third set of datasets is downloaded from the UMD collective classification database. The database consists of 1293 different attacks, each belonging to one of six labels indicating the type of the attack (e.g., kidnapping, NBCR attack, weapon attack, and other attack). Each attack is described by a binary-valued vector of attributes whose entries indicate the absence or presence of a feature. There are a total of 106 distinct vector-based features, along with three sets of relational features. One set connects the attacks together if they happened in the same location; another connects the attacks if they were planned by the same organization. In order to perform a robust evaluation of the proposed GBC model, we add another data source based on the vector-based dataset, but with random Gaussian noise imposed. Again, this is to test the capability of the proposed model to handle noise.
Comparison Methods and Evaluations. It is important to emphasize again that there is no previous model that can handle the same problem directly, i.e., building a learning model from multiple graphs and multiple vector-based datasets with some non-overlapping instances. Furthermore, as far as we know, there are no state-of-the-art approaches that use the benchmark datasets described above in the same way. For instance, for the movie prediction dataset, we crawl the 10 data sources directly from IMDB and use them collectively in learning. In the case of the number recognition dataset, we have two graph data sources, which is different from previous approaches that only look at the vector-based features (e.g., for clustering or feature selection). In order to evaluate the proposed GBC model, we design a straightforward comparison strategy, which is to directly join all features together. In other words, given the sources with vector-based features X^(1), · · · , X^(p) and the adjacency matrices of the graphs M^(1), · · · , M^(q), the joined features can be represented as follows:

X = [X^(1)T, · · · , X^(p)T, M^(1)T, · · · , M^(q)T]T.

Since there is only one set of joined features, traditional learning algorithms can be applied on it to give predictions (each row is an instance; each column is a feature from a specific source). We include support vector machines (SVM) in the experiments, as they are widely used in practice. Note that in GBC, the consensus term and the graph similarity term can use unlabeled data to improve the learning. Hence, we also compare it with semi-supervised learning models. Specifically, a semi-supervised SVM (Semi-SVM) with a self-learning technique is used as the second comparison model. Note that we have three tasks in the experiments, one of which (the movie rating prediction task) is a regression task; for this task, regression SVM is used to give predictions. Additionally, since the proposed model is derived from gradient boosting decision trees, GBDT is used as the third comparison model, and its semi-supervised version is included as well. It is important to note that, in order to use the joined features defined above, these comparison models require that there are no non-overlapping instances. In other words, all data sources should have records of all instances; otherwise, the joined features will have many missing values, since some data sources may not have records of the corresponding instances. To evaluate GBC more comprehensively, we thus conducted the experiments in two settings:

• Uniform setting: the first setting forces all data sources to contain records of all instances. We only look at the instances that have records in all data sources. Table 3 presents the statistics of the datasets in this setting. In this case, we can easily join the features from different sources as defined above.

• Non-overlapping setting: the second setting allows different data sources to have some non-overlapping instances. Thus, an instance described in one data source may not appear in other data sources. This setting is more realistic. The proposed GBC model is able to handle this case, since it allows non-overlapping instances. However, for the comparison methods, there will be many missing values in the joined features, as discussed above. In this case, we replaced the missing values with the average values of the corresponding features. In this setting, 30% of the instances do not have records in half of the data sources.

We conducted experiments in the above two settings. During each run, we randomly selected a certain portion of examples as training data, keeping the others as test data. For the same training set size, we randomly selected the set of training data 10 times, the rest being used as test data, and the results were averaged over the 10 runs. The experiment results are reported with different training set sizes. Note that the proposed GBC model can be used for both classification and regression. We used the error rate to evaluate the results for classification tasks, and the root mean square error (RMSE) for regression tasks.
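As a concrete illustration of the concatenation baseline above, the sketch below builds the joined feature matrix from synthetic data and trains a standard SVM on it. The data, dimensions, train/test split, and the linear kernel are arbitrary choices of ours, not the paper's experimental setup.

```python
# Feature-concatenation baseline: stack every vector-based source and every
# graph adjacency matrix, then train a single standard learner on the result.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
n = 120
X1, X2 = rng.randn(n, 10), rng.randn(n, 4)        # two vector-based sources
A = (rng.rand(n, n) < 0.05).astype(float)         # adjacency matrix of one graph
A = np.maximum(A, A.T)                            # make the relation symmetric
y = (X1[:, 0] + X2[:, 1] > 0).astype(int)

# X = [X(1)^T, ..., X(p)^T, M(1)^T, ..., M(q)^T]^T in the paper's column-vector
# notation; with row-instance matrices this is a horizontal concatenation.
X = np.hstack([X1, X2, A])

clf = SVC(kernel="linear").fit(X[:80], y[:80])
print("baseline accuracy:", clf.score(X[80:], y[80:]))
```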
Analysis of the Experiments. Our aim is to study the performance of the proposed GBC model in the two settings described above: the uniform and non-overlapping settings. The experiment results are summarized in Fig. 3 and Fig. 4, respectively. The x-axes record different percentages of training data (while the remainder of the data is used for evaluation), and the y-axes report the errors of the corresponding learning models.

Table 3: Data descriptions.
Task | # of data sources | Task type
Movie Rating Prediction | 10 (4 graph and 6 others) | Regression
Number Recognition | 7 (2 graph and 5 others) | Classification (10 labels)
Terrorist Attack Classification | 4 (2 graph and 2 others) | Classification (6 labels)

Figure 3: All data sources record the same set of objects (overlapping objects with different features). Panels: (a) movie rating prediction, (b) number recognition, (c) terrorist attack detection; the x-axes show the percentage of training data.

Figure 4: Each data source is independent (with 30% of overall non-overlapping instances). Panels: (a) movie rating prediction, (b) number recognition, (c) terrorist attack detection.

We observe two major phenomena in the experiments. Firstly, the proposed GBC model effectively reduces the error rate as compared to the other learning models in both settings. It is especially obvious in the movie rating prediction dataset, where 10 data sources are used to build the model. In this dataset, GBC reduces the error rate by as much as 80% in the first setting (when 90% of the instances are used for training), and 60% in the second setting (when 10% of the instances are used for training). This shows that GBC is especially advantageous when a large number of data sources are available. We further analyze this phenomenon below with Table 4. On the other hand, the comparison models have to deal with a longer and noisier feature vector; GBC beats the four comparison approaches by judiciously reducing the noise. Secondly, we can observe that GBC outperforms the other approaches significantly and substantially in the second setting (Fig. 4), where some instances do not have records in all data sources. As analyzed in the previous section, this is one of the advantages of GBC over the comparison models, which have to deal with missing values.

In the remainder of this section, we would like to answer the following questions:

• To which extent does GBC help to integrate the knowledge from multiple sources, compared to learning from each source independently? Specifically, how do the principles of consensus and connectivity similarity help?

• Is the weight learning algorithm necessary?

• Do we need multiple data sources? How does the number of data sources affect the performance?

In GBC, both λ0 and λ1 are determined by cross-validation. In the following set of experiments, we tune the values of λ0 and λ1 in order to study the specific effects of the consensus and connectivity terms. We compare the proposed GBC model with three algorithms on the number recognition dataset in the uniform setting (i.e., all data sources contain all instances):

• The first comparison model sets λ0 to zero and determines λ1 by cross-validation. In this case, the consensus term is removed; in other words, the algorithm can only apply the connectivity similarity principle. We denote this model as "GBC without consensus".

• The second comparison model sets λ1 to zero and lets λ0 be determined by cross-validation. In other words, the connectivity similarity term is removed, and the algorithm only depends on the empirical loss and the consensus term. We denote this model as "GBC without graph".

• The third comparison model sets both λ0 and λ1 to 0. Hence, the remaining term is the empirical loss, and the model is identical to traditional GBDT.

The empirical results are presented in Fig. 5. It can be observed that all three models outperform traditional GBDT. We draw two conclusions from this experiment. First, as was already observed in the previous experiments, learning from multiple sources is advantageous. Specifically, the GBDT model builds classifiers for each data source independently and averages the predictions at the last step; however, it does not let the prediction models "communicate" during training. As a result, it has the worst performance in Fig. 5. Second, both the principle of consensus and the principle of connectivity similarity improve the performance. Furthermore, the connectivity similarity term helps more when the amount of training data is limited. For example, when there are only 10% training instances, the error rate of GBC with only the connectivity term (i.e., GBC without consensus) is less than 9%, while that of GBC with only the consensus term (i.e., GBC without graph) is around 13%. This is because, when the number of labeled training examples is limited, the graph connectivity serves as a more important source of information by connecting unlabeled data with the limited labeled data.

Figure 5: How do the consensus principle and the connectivity similarity principle help? (Compared models: "GBC without Consensus" and "GBC without Graph".)

Furthermore, it is important to note that in the GBC model, the weights of the different data sources are adjusted at each iteration. The aim is to assign higher weights to the data sources that contain useful information, and to filter out the noisy sources. This step is analyzed as an important one in Theorem 4.1, since it can help reduce the upper bound of the error rate. We specifically evaluate this strategy on the terrorist detection task. Note that, in order to perform a robust test, one of the vector-based sources contains Gaussian noise, as described in the datasets subsection. The empirical results are presented in Fig. 6. It can be clearly observed that the weighting strategy reduces the error rate by as much as 70%. Hence, an appropriate weighting strategy is an important step when dealing with multiple data sources of unknown quality.

Figure 6: Why is the weighting strategy necessary? (Compared model: "GBC without Weight Learning".)

It is also interesting to evaluate to which extent GBC performance improves as the number of data sources increases. For this purpose, the movie rating prediction dataset is used as an example. We first study the case where there is only one data source: to do so, we run GBC on each of the data sources independently, and then on 2 and 4 data sources. In the experiments with 2 data sources, we randomly selected 2 sources from the pool (Table 2) as inputs to GBC; these random selections of data sources were performed 10 times, and the average error is reported in Table 4. A similar strategy was used to conduct the experiment with 4 data sources. In Table 4, the results are reported for different percentages of training data, and the best performances are highlighted in bold. It can be observed that the performance with only one data source is the worst, with high root mean square error and high variance. With more data sources available, the performance of GBC tends to improve. This is because each data source provides complementary information useful to build a comprehensive model of the whole dataset.

Table 4: Effect of different numbers of sources. Reported results are RMSE, with variance in parentheses.
Conclusions

This paper studies the problem of building a learning model from heterogeneous data sources. Each source can contain traditional vector-based features or graph relational features, with potentially non-overlapping sets of instances. As far as we know, there is no previous model that can be directly applied to solve this problem. We propose a general framework derived from gradient boosting, called Gradient Boosting Consensus (GBC). The basic idea is to solve an optimization problem that (1) minimizes the empirical loss, (2) encourages the predictions from different data sources to be similar, and (3) encourages the predictions of connected data to be similar. The objective function is solved by stochastic gradient boosting, with an incorporated weighting strategy that adjusts the importance of different data sources according to their usefulness. Three sets of experiments were conducted, including movie rating prediction, number recognition, and terrorist detection. We show that the proposed GBC model substantially reduces the prediction error rate, by as much as 80%. Finally, several extended experiments were conducted to study specific properties of the proposed algorithm and its robustness.

As future work, we will explore better methods to determine the algorithm parameters automatically. Furthermore, we will improve the model to handle large-scale datasets. We will also explore other approaches to handle heterogeneous learning.
Acknowledgments

Part of the work was done when Xiaoxiao Shi was a summer intern at AT&T Labs. It is also supported in part by NSF through grants IIS-0905215, DBI-0960443, CNS-1115234, IIS-0914934, OISE-1129076, and OIA-0963278, and the Google Mobile 2014 Program.

References

[1] P. Melville, R. J. Mooney, and R. Nagarajan, "Content-boosted collaborative filtering for improved recommendations," in AAAI/IAAI, pp. 187–192, 2002.
[2] A. Blum and T. M. Mitchell, "Combining labeled and unlabeled data with co-training," in COLT, pp. 92–100.
[3] S. Oba, M. Kawanabe, K. Müller, and S. Ishii, "Heterogeneous component analysis," in NIPS, 2007.
[4] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training," in CIKM, pp. 86–93.
[5] B. Long, P. S. Yu, and Z. Zhang, "A general model for multiple view unsupervised learning," in SDM, pp. 822–833, 2008.
[6] J. Gao, W. Fan, Y. Sun, and J. Han, "Heterogeneous source consensus learning via decision propagation and negotiation," in KDD, pp. 339–348, 2009.
[7] D. Agarwal, B. Chen, and B. Long, "Localized factor models for multi-context recommendation," in KDD, pp. 609–617, 2011.
[8] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–106.
[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning, 2001.
[10] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, 2002.
[11] H. Eldardiry and J. Neville, "Across-model collective ensemble classification," in AAAI, 2011.
[12] D. P. Bertsekas, Nonlinear Programming (Second ed.). Cambridge, MA: Athena Scientific, 1999.
[13] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[14] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[15] M. Balcan and A. Blum, "A PAC-style model for learning from labeled and unlabeled data," in COLT, pp. 111–126, 2005.
[16] K. Sridharan and S. M. Kakade, "An information theoretic framework for multi-view learning," in COLT, pp. 403–414, 2008.
[17] Y. Koren, R. M. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
[18] M. van Breukelen and R. Duin, "Neural network initialization by combined classifiers," in ICPR, pp. 16–.
[19] X. Z. Fern and C. Brodley, "Cluster ensembles for high dimensional clustering: An empirical study," Journal of Machine Learning Research.
[20] X. He and P. Niyogi, "Locality preserving projections," in NIPS, 2003.
[21] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2009.
[22] P. Laskov, "An improved decomposition algorithm for regression support vector machines," in NIPS, pp. 484–.