Learning from Heterogeneous Sources via Gradient Boosting Consensus

Xiaoxiao Shi*, Jean-Francois Paiement†, David Grangier†, Philip S. Yu*

*Computer Science Department, University of Illinois at Chicago, USA. {xshi9, psyu}@uic.edu.
†AT&T Labs, USA. {jpaiement, grangier}@research.att.com.
Abstract

Multiple data sources containing different types of features may be available for a given task. For instance, users' profiles can be used to build recommendation systems. In addition, a model can also use users' historical behaviors and social networks to infer users' interests in related products. We argue that it is desirable to collectively use any available multiple heterogeneous data sources in order to build effective learning models. We call this framework heterogeneous learning. In our proposed setting, data sources can include (i) non-overlapping features, (ii) non-overlapping instances, and (iii) multiple networks (i.e., graphs) that connect instances. In this paper, we propose a general optimization framework for heterogeneous learning, and devise a corresponding learning model based on gradient boosting. The idea is to minimize the empirical loss with two constraints: (1) there should be consensus among the predictions of overlapping instances (if any) from different data sources; (2) connected instances in graph datasets should have similar predictions. The objective function is solved by stochastic gradient boosting trees. Furthermore, a weighting strategy is designed to emphasize informative data sources and deemphasize the noisy ones. We formally prove that the proposed strategy leads to a tighter error bound. This approach consistently outperforms a standard concatenation of data sources on movie rating prediction, number recognition and terrorist attack detection tasks. We observe that the proposed model can improve the out-of-sample error rate by as much as 80%.

Introduction

Given a target task, multiple related data sources can be used to build prediction models. Each of the related data sources may have a distinct set of features and instances, and the combination of all data sources may yield better prediction results. An example is illustrated in Fig. 1. The task is to predict movie ratings in the Internet Movie Database (IMDB), which has been used in movie recommendation. For example, in Fig. 1(a), given that we observe that the rating of "The Godfather" is 9.2 (out of 10) and that of "The Giant Spider Invasion" is 2.8, what are the ratings of "Apocalypse Now" and "Monster a-Go Go"? Note that in this task, there are multiple available databases that record various information about movies. For instance, there is a genre database (Fig. 1(b)), a sound technique database (Fig. 1(c)), a running times database (Fig. 1(d)), an actor graph database that links two movies together if the same actor/actress performs in both movies (Fig. 1(e)), and a director graph database that links two movies if they are directed by the same director (Fig. 1(f)). Note that these multiple data sources have the following properties:

• Firstly, each data source can have its own feature space. For example, the running times database (Fig. 1(d)) has numerical features; the sound technique database (Fig. 1(c)) has nominal features; and the actor graph database (Fig. 1(e)) provides graph relational features.

• Secondly, each data source can have its own set of instances. For example, the genre database does not have a record for "Monster a-Go Go", and the running times database does not have any record of "Apocalypse Now".

Note that it is difficult to build an accurate prediction model by using only one of the five databases, since the information in each of them is incomplete. However, if we consider the five data sources collectively, we are able to infer that the rating of "Apocalypse Now" (ground truth: 8.6) may be close to that of "The Godfather", since they are similar in genre and they are connected in the actor graph. Similarly, one can infer that the rating of "Monster a-Go Go" (ground truth: 1.5) is similar to that of "The Giant Spider Invasion".

Figure 1: Combining different sources to infer movie ratings. Panels: (a) movie rating prediction, (b) genre database, (c) sound technique database, (d) running times, (e) actor graph, (f) director graph, which does not have a record for "Apocalypse Now". The true rating for "Apocalypse Now" is 8.6, while the rating for "Monster a-Go Go" is 1.5.

In the past, multi-view learning was proposed to study a related problem where each instance can have different views. However, it usually does not consider graph data with relational features, especially when there are multiple graphs and each graph may only contain a subset of the relational features.
Hence, we study a more general learning scenario called heterogeneous learning, where the data can come from multiple sources. Specifically, the data sources can (1) have non-overlapping features (i.e., new features in certain data sources), (2) have some non-overlapping instances (i.e., new objects/instances in certain data sources), and (3) contain multiple network (i.e., weighted graph) datasets. Furthermore, some of the data sources may contain substantial noise or low-quality data. Our aim is to utilize all data sources collectively and judiciously, in order to improve the learning performance.

A general objective function is proposed to make good use of the information from these multiple data sources. The intuition is to learn a prediction function from each data source to minimize the empirical loss with two constraints. First, if there are overlapping instances, the predictions of the same instance should be similar even when learning from different data sources. Second, the predictions of connected data (i.e., instances connected in any of the graphs) should be similar. Finally, the prediction models are judiciously combined (with different weights) to generate a global prediction model. In order to solve the objective function, we borrow ideas from gradient boosting decision trees (GBDT), an iterated algorithm that generates a sequence of decision trees, where each tree fits the gradient residual of the objective function. We call our proposed algorithm Gradient Boosting Consensus (GBC), because each data source generates a set of trees, and the consensus of the decision trees makes the final prediction. Moreover, GBC has the following properties:

• Deep-ensemble. Recall that the traditional boosting tree model is an iterated algorithm that builds new trees based on the previous iterations (residuals). Usually, these new trees are generated based on the residual of only one data source. However, as shown in Fig. 2, GBC generates new trees collectively from all data sources (horizontally) in each iteration (vertically). We call it "deep ensemble" since it ensembles models both horizontally and vertically to make the final prediction.

• Network-friendly. Unlike traditional boosting trees, GBC can take advantage of multiple graph datasets to improve learning. In other words, it can take advantage of traditional vector-based features and graph relational features simultaneously.

• Robust. Some data sources may contain substantial noise. A weighting strategy is incorporated into GBC to emphasize informative data sources and deemphasize the noisy ones. This weighting strategy is further proven to have a tighter error bound in both inductive and transductive settings.

Figure 2: Gradient Boosting Consensus. Each data source maintains its own set of trees, and new trees for all sources are generated jointly at every iteration.

We conducted three sets of experiments, including IMDB movie rating prediction, UCI number recognition, and terrorist attack detection, and each task has a set of data sources with heterogeneous features. For example, in the IMDB movie rating prediction task, we have data sources about the plots of the movies (text data), the technologies used by the movies (nominal features), the running times of the movies (numerical features), and several movie graphs (such as the director graph and the actor graph). All these mixed types of data sources were used collectively to build a prediction model. Since there is no previous model that can handle the problem directly, we have constructed a straightforward baseline which first appends all data sources together into a single database, and then uses traditional learning models to make predictions. Experiments show that the proposed GBC model consistently outperforms our baseline, and can decrease the error rate by as much as 80%.

Related Work

There are several areas of related work upon which our proposed model is built. First, multi-view learning is proposed to learn from instances which have multiple views in different feature spaces. For example, one line of work proposes a framework to reconcile the clustering results from different views, and another introduces the notion of consensus learning, whose general idea is to perform learning on each heterogeneous feature space independently and then summarize the results via an ensemble. Recently, a recommendation model (collaborative filtering) has been proposed that can combine information from different contexts. It finds a latent factor that connects all data sources, and propagates information through the latent factor. There are mainly two differences between our work and these previous approaches. First, most of the previous works do not consider vector-based features and relational features simultaneously. Second and foremost, most of the previous works require the data sources to have records of all instances in order to enable the mapping, while the proposed GBC model does not have this constraint.

Another area of related work is collective classification, which aims at predicting class labels from a network. Its key idea is to combine the supervision knowledge from traditional vector-based feature vectors with the linkage information from the network. It has been applied to various applications, such as part-of-speech tagging and the classification of hypertext documents using hyperlinks. However, these works study the case where there is only one vector-based feature space and only one relational feature space, and the focus is how to combine the two. Different from the traditional collective classification framework, we consider multiple vector-based features and multiple relational features simultaneously. Closer to our setting, one approach combines multiple graphs to improve learning; its basic idea is to average the predictions during training. There are three differences between these previous works and the current model. Firstly, we allow different data sources to have non-overlapping instances. Secondly, we introduce a weight learning process to filter out noisy data sources. Thirdly, we consider multiple vector-based sources and multiple graphs at the same time. Hence, none of the aforementioned methods can effectively learn from the datasets described in the experiments section, as these datasets all contain multiple vector-based data sources and relational graphs.
Problem Formulation

In this section, we formally define the problem of heterogeneous learning, and then introduce a general learning objective. In heterogeneous learning, data can be described in heterogeneous feature spaces from multiple sources. Traditional vector-based features are denoted with column vectors x_i^(j), corresponding to the i-th data point in the j-th source (or the j-th feature space), whose dimension is d_j. In matrix form, X^(j) ∈ R^{d_j × m} is the dataset in the j-th feature space, where m is the sample size. Different from vector-based features, graph relational features describe the relationships between instances. In other words, they are graphs representing the connectivity/similarity of the data. Specifically, we denote G_g = <V_g, E_g> as the g-th graph, where V_g is the set of nodes and E_g ⊆ V_g × V_g is the set of edges.
Table 1: Symbol definition
x_i^(j): The i-th data point (column vector) in the j-th source (the j-th feature space).
G_g: The g-th relational graph.
U_i: The set of unlabeled data in the i-th data source.
f_i: The prediction model built from the i-th data source.
G(f, w): Graph connectivity constraint.
T: Set of labeled data.
We assume that the features from the same data source are from the same feature space, and hence each data source has a corresponding feature space. Furthermore, different data sources may provide different sets of instances. In other words, some instances exist in some data sources, but are missing in the others. In summary, heterogeneous learning is a machine learning scenario where we consider data from different sources, but they may (1) have different sets of instances, (2) have different feature spaces, and (3) have multiple network-based (graph) datasets. Hence, we have p data sources providing vector-based features X^(1), · · · , X^(p) and q data sources providing relational networks G_1, · · · , G_q. The aim is to derive learning models (classification, regression or clustering) by collectively and judiciously using the p + q data sources. A set of important symbols used in the remainder of the paper is summarized in Table 1.
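To make the setting concrete, the following is a minimal sketch of how the inputs of heterogeneous learning could be organized in code. The container name, the field names, and the toy movie data are illustrative choices of ours, not part of the paper.

```python
# Hypothetical container for the heterogeneous-learning inputs described above.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class HeterogeneousData:
    # p vector-based sources: source index -> {instance id -> feature vector};
    # each source has its own feature space and its own subset of instances.
    sources: Dict[int, Dict[str, np.ndarray]] = field(default_factory=dict)
    # q relational sources: graph index -> list of (instance id, instance id) edges.
    graphs: Dict[int, List[Tuple[str, str]]] = field(default_factory=dict)
    # labeled set T: instance id -> label (rating or class).
    labels: Dict[str, float] = field(default_factory=dict)

    def sources_containing(self, x_id: str) -> List[int]:
        """Indices {i : x is recorded in source i}, used when averaging predictions."""
        return [i for i, data in self.sources.items() if x_id in data]


# Toy instance mirroring Fig. 1: two vector sources with non-overlapping movies
# plus one actor graph.
data = HeterogeneousData(
    sources={
        0: {"Godfather": np.array([175.0]), "Spider Invasion": np.array([84.0])},        # running times
        1: {"Godfather": np.array([1.0, 0.0]), "Apocalypse Now": np.array([1.0, 0.0])},  # genre (one-hot)
    },
    graphs={0: [("Godfather", "Apocalypse Now")]},
    labels={"Godfather": 9.2, "Spider Invasion": 2.8},
)
print(data.sources_containing("Apocalypse Now"))  # -> [1]
```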
Gradient Boosting Consensus

In this section, we describe the general framework of the proposed GBC model and its theoretical foundations.

The GBC framework. In order to use multiple data sources, the objective function aims at minimizing the overall empirical loss in all data sources, with two more constraints. First, the overlapping instances should have similar predictions from the models trained on different data sources, and we call this the principle of consensus. Second, when graph relational data is provided, the connected data should have similar predictions, and we call this the principle of connectivity similarity. In summary, the objective function can be written as follows:

min_{f, w}  Σ_i w_i Σ_{(x,y)∈T} L(f_i(x), y)   subject to   C(f, w) = 0  and  G(f, w) = 0,

where L(f_i(x), y) is the empirical loss on the set of training data T, and w_i is the weight of importance of the i-th data source, which is discussed in the Weight Learning subsection. Furthermore, C(f, w) = 0 is the constraint derived from the principle of consensus, defined as follows:

C(f, w) = Σ_i w_i Σ_{x∈U_i} L(f_i(x), E[f(x)]).

It first calculates the expected prediction E[f(x)] of a given unlabeled instance x, by summarizing the current predictions from the multiple data sources. This expectation is computed only from the data sources that contain x; in other words, it is computed from the data sources whose indices are in the set {i | x ∈ U_i}, where U_i is the set of unlabeled instances in the i-th data source. Hence, if the j-th data source does not have a record of x, it will not be used to calculate the expected prediction. This strategy enables GBC to handle non-overlapping instances in multiple data sources, and to use overlapping instances to improve the consensus. The constraint forces the predictions of x (e.g., f_1(x), f_2(x), · · · ) to be close to E[f(x)].

Furthermore, according to the principle of connectivity similarity, we introduce another constraint G(f, w) as follows:

G(f, w) = Σ_g ŵ_g Σ_i w_i Σ_x L(f_i(x), E_g[f(x)]).

The above constraint encourages connected data to have similar predictions. It works by calculating the graph-based expected prediction E_g[f(x)] of x, i.e., the average prediction of all its connected neighbors (the z's) in graph G_g. If there are multiple graphs, all the graph-based expected predictions are summarized with the graph weights ŵ_g.

We use the method of Lagrange multipliers to solve the constrained optimization above. The objective function becomes

O(f, w) = Σ_i w_i Σ_{(x,y)∈T} L(f_i(x), y) + λ0 C(f, w) + λ1 G(f, w),

where the two constraints C(f, w) and G(f, w) are regularized by the Lagrange multipliers λ0 and λ1. These parameters are determined by cross-validation, as detailed in the experiments section. Note that in this objective the weights w_i and ŵ_g (i, g = 1, 2, · · · ) are essential. On one hand, the w_i's are introduced to assign different weights to different vector-based data sources. Intuitively, if the t-th data source is more informative, w_t should be large. On the other hand, the ŵ_g's are the weights for the graph relational data sources. Similarly, the aim is to give high weights to important graph data sources, while deemphasizing the noisy ones. We define different weight learning strategies for the data sources with vector-based features (w_i) and graph relational features (ŵ_g). The values of the weights are automatically learned and updated during the training process, as discussed in the Weight Learning subsection.
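The following is a small numerical sketch of the objective above with the L-2 loss. It assumes that the consensus prediction E[f(x)] is a weighted average of the per-source predictions over the sources that contain x, and that the connectivity penalty compares each prediction with the average prediction of the instance's neighbors; the function names, data layout, and that specific averaging scheme are our illustrative choices rather than the paper's exact definitions.

```python
import numpy as np


def expected_prediction(x_id, predictions, weights):
    """Weighted average of current predictions over the sources that contain x (E[f(x)])."""
    idx = [i for i, p in predictions.items() if x_id in p]
    if not idx:
        return None
    w = np.array([weights[i] for i in idx])
    f = np.array([predictions[i][x_id] for i in idx])
    return float(np.dot(w, f) / w.sum())


def gbc_objective(predictions, weights, graph_weights, labels, unlabeled, graphs,
                  lam0=1.0, lam1=1.0):
    """Empirical loss + lam0 * consensus term C + lam1 * connectivity term G (L-2 loss)."""
    # empirical loss on the labeled set T
    emp = sum(weights[i] * (p[x] - y) ** 2
              for i, p in predictions.items()
              for x, y in labels.items() if x in p)
    # consensus: predictions of the same (unlabeled) instance should agree
    cons = sum(weights[i] * (p[x] - expected_prediction(x, predictions, weights)) ** 2
               for i, p in predictions.items()
               for x in unlabeled if x in p)
    # connectivity: each prediction should be close to the average prediction of its neighbors
    conn = 0.0
    for g, edges in graphs.items():
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, []).append(b)
            nbrs.setdefault(b, []).append(a)
        for i, p in predictions.items():
            for x in p:
                zs = [expected_prediction(z, predictions, weights) for z in nbrs.get(x, [])]
                zs = [z for z in zs if z is not None]
                if zs:
                    conn += graph_weights[g] * weights[i] * (p[x] - float(np.mean(zs))) ** 2
    return emp + lam0 * cons + lam1 * conn


# Tiny usage example with two sources, one labeled movie, and one graph edge.
preds = {0: {"a": 8.0, "b": 3.0}, 1: {"a": 7.0, "c": 4.0}}
print(gbc_objective(preds, {0: 0.5, 1: 0.5}, {0: 1.0},
                    labels={"a": 9.0}, unlabeled=["b", "c"], graphs={0: [("a", "c")]}))
```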
Model training of GBC. We use stochastic gradient descent to solve the optimization problem above. In general, it is an iterated algorithm that updates the prediction functions f(x) in the following way:

f(x) ← f(x) − ρ ∂O/∂f(x).

The functions are updated iteratively until a convergence condition is satisfied. Specifically, inspired by gradient boosting decision trees (GBDT), a regression tree is built to fit the gradient ∂O/∂f, and the best step size ρ is found via line search. Note that the calculation of ∂O/∂f depends on the loss function L(f, y) used in the objective. In the following, we use the L-2 loss (for regression problems) and the binary logistic loss (for binary classification problems) as examples.

GBC with L-2 Loss: In order to update the prediction function of the i-th data source, we follow the gradient descent formula

f_i(x) ← f_i(x) − ρ ∂O/∂f_i(x),

where, with the L-2 loss, the partial derivative consists of the empirical residual (f_i(x) − y) on the labeled data, plus the λ0-weighted consensus residual and the λ1-weighted connectivity residual. The L-2 loss is a straightforward loss function for the GBC model, and it is used to perform the regression tasks in the experiments.

GBC with Logistic Loss: With the logistic loss, the empirical part of the partial derivative becomes the usual logistic gradient −y/(1 + exp(y f_i(x))), while the consensus and connectivity terms are unchanged. Note that this formula uses the binary logistic loss where y = −1 or y = 1, but one can easily extend the model to tackle multi-class problems by using the one-against-others strategy. In the experiments we adopt this strategy to handle multi-class problems.

With the updating rule, we can build the GBC model as described in Algorithm 1. It first finds the initial prediction models for all data sources (Step 1). Then, it goes into the main iteration that generates a series of decision trees. The basic idea is to follow the updating rule above and build a decision tree g_i(x^(i)) to fit the partial derivative of the loss. Furthermore, following GBDT, we let the number of iterations N be set by users; in the experiments, it is determined by cross-validation. Then, given a new data point x, the predicted output is obtained by applying a prediction generation function P to the consensus of the per-source predictions, where P(y) = y in regression problems, and P(y) = 1 iff y > 0 (P(y) = −1 otherwise) in binary classification problems.

Algorithm 1: Gradient Boosting Consensus
Input: Data from different sources X^(i), i = 1, · · · , p; expected outputs (labels or regression values) of a subset of data, Y; number of iterations N.
Output: The GBC prediction model.
1. Initialize each f̂_i(x) to be a constant, for i = 1, 2, · · · , p.
2. Initialize the weights w_i = 1.
3. For t = 1 to N:
4.   For all x^(i), compute the negative gradient z_i with respect to f(x^(i)).
5.   Fit a regression model g_i(x^(i)) that predicts the z_i's from the x^(i)'s.
6.   Line search for the optimal step size: ρ_i = arg min_ρ O(f̂_i(x) + ρ g_i(x^(i)), w).
7.   Update the estimate: f̂_i(x) ← f̂_i(x) + ρ_i g_i(x^(i)).
8.   Update the weights w and ŵ as described in the Weight Learning subsection.

Weight Learning. In the objective function described above, one important element is the set of weights (w_i and ŵ_g) for the data sources. Ideally, informative data sources will have high weights, and noisy data sources will have low weights. As such, the proposed GBC model can judiciously filter out the data sources that are noisy. To this aim, we design the weights by looking at the empirical loss of the model trained from each data source. Specifically, if a data source induces a large loss, its weight should be low. Following this intuition, we design the weight as

w_i = exp(−Σ_{(x,y)∈T} L(f_i(x), y)) / z,

where L(f_i(x), y) is the empirical loss of the model trained from the i-th data source, and z is a normalization constant ensuring that the w_i sum to one. The definition of the weight w_i is adapted from the weighting matrix in normalized cut. The exponential effectively penalizes large losses: w_i is large if the empirical loss of the i-th data source is small, and becomes small if the loss is large. It is proven in Theorem 4.1 that this updating rule for the weights results in a smaller error bound. Similarly, we define the weights for the graph data sources as

ŵ_g = exp(−Σ_{(x_a,x_b)∈E_g} Σ_i w_i L(f_i(x_a), f_i(x_b))) / ẑ,

where L(f_i(x_a), f_i(x_b)) is the pairwise loss that evaluates the difference between the two predictions f_i(x_a) and f_i(x_b), and ẑ is again a normalization constant. The idea behind this weight is to evaluate whether a graph links similar instances together: if most of the connected instances have similar predictions, the graph is considered informative. Note that both sets of weights are updated at each iteration. By substituting them into the objective, one can observe that the objective function of the GBC model is adaptively updated at each iteration. In other words, at the initial step, each data source is given an equal weight; but after several iterations, informative data sources have higher weights, and the objective function "trusts" the informative data sources more.
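The following is a runnable sketch of a training loop in the spirit of Algorithm 1 for the L-2 loss, restricted to the uniform setting (all sources record the same instances) and to vector-based sources only. The specific gradient expression, the grid-based line search, the tree depth, and the exponential weight update are our illustrative choices under the assumptions above; scikit-learn's DecisionTreeRegressor stands in for the regression trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 200
sources = [rng.randn(n, 5), rng.randn(n, 3)]          # two vector-based sources
y = sources[0][:, 0] + 0.5 * sources[1][:, 1] + 0.1 * rng.randn(n)
labeled = np.arange(n) < 100                          # first half is the labeled set T

p = len(sources)
F = [np.full(n, y[labeled].mean()) for _ in range(p)] # step 1: constant initial models
w = np.ones(p) / p                                    # step 2: equal source weights
lam0, rhos = 1.0, np.linspace(0.01, 1.0, 20)

for t in range(50):                                   # main boosting iterations
    Ef = np.average(np.vstack(F), axis=0, weights=w)  # consensus prediction E[f(x)]
    for i in range(p):
        # step 4: negative gradient w.r.t. f_i (empirical residual on T,
        # consensus residual on the unlabeled part)
        z = np.where(labeled, y - F[i], 0.0) + lam0 * np.where(~labeled, Ef - F[i], 0.0)
        # step 5: fit a regression tree to the negative gradient
        g = DecisionTreeRegressor(max_depth=3, random_state=0).fit(sources[i], z).predict(sources[i])
        # step 6: line search over a fixed grid of step sizes
        def loss(rho):
            f = F[i] + rho * g
            return (np.mean((f[labeled] - y[labeled]) ** 2)
                    + lam0 * np.mean((f[~labeled] - Ef[~labeled]) ** 2))
        rho = rhos[int(np.argmin([loss(r) for r in rhos]))]
        F[i] = F[i] + rho * g                          # step 7: update the estimate
    # step 8: re-weight sources by exponentiated negative empirical loss
    emp = np.array([np.mean((F[i][labeled] - y[labeled]) ** 2) for i in range(p)])
    w = np.exp(-emp) / np.exp(-emp).sum()

pred = np.average(np.vstack(F), axis=0, weights=w)     # final consensus prediction
print("train RMSE: %.3f" % np.sqrt(np.mean((pred[labeled] - y[labeled]) ** 2)))
```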
Generalization bounds. In this section, we consider the incompatibility framework of multi-view learning to explain the proposed GBC model. Specifically, we show that the weight learning process described above can help reduce an error bound. For the sake of simplicity, we consider the case where we have two data sources X_1 and X_2; the case with more data sources can be analyzed with similar logic. The goal is to learn a pair of predictors (f_1, f_2), where f_1 : X_1 → Ŷ and f_2 : X_2 → Ŷ, with Ŷ the prediction space. Further denote by F_1 and F_2 the hypothesis classes of interest, consisting of functions from X_1 (respectively X_2) to the prediction space Ŷ. Let L(f_1) denote the expected loss of f_1, with L(f_2) defined similarly, and let a Bayes optimal predictor with respect to the loss L be denoted f*. We now apply the incompatibility framework for the multi-view setting to study GBC. For an incompatibility function χ : F_1 × F_2 → R≥0 and some t ≥ 0, we first define the set of pairs of functions that are compatible to the tune of t:

Cχ(t) = {(f_1, f_2) : f_1 ∈ F_1, f_2 ∈ F_2 and E[χ(f_1, f_2)] ≤ t}.

Intuitively, Cχ(t) captures the set of functions f_1 and f_2 that are compatible with respect to a "maximal expected difference" t. It is proven that there exists a symmetric function d : F_1 × F_2 → R and a monotonically increasing non-negative function Φ on the reals such that, for all f,

E[d(f_1(x), f_2(x))] ≤ Φ(L(f_1) − L(f_2)).

With these functions at hand, we can derive the following error bound.

Theorem 4.1. Let L(f_1) − L(f*) < ε_1 and L(f_2) − L(f*) < ε_2. Then, for the incompatibility class Cχ(t),

L_GBC(f_1, f_2) ≤ L(f*) + ε_bayes + inf_{(f_1,f_2)∈Cχ(t)} (· · · ),

where the last term depends on t, on ε_1 and ε_2, and on a constant C_d determined by the function d.

Proof. Note that L(f_1) − L(f*) < ε_1 and L(f_2) − L(f*) < ε_2, and that the proposed GBC model adopts a weighting strategy that is linear in the expected loss, so that L_GBC(f_1, f_2) is approximately the loss of the corresponding weighted combination of f_1 and f_2. According to Lemma 8 of the incompatibility framework, E[χ(f_1, f_2)] is bounded, and therefore

L_GBC(f_1, f_2) ≤ L_GBC(f*_1, f*_2) + ε_bayes.

With Lemma 7 of the same framework, we then get

L_GBC(f_1, f_2) ≤ L(f*) + ε_bayes + inf_{(f_1,f_2)∈Cχ(t)} (· · · ).

Hence, the weighting strategy induces a tighter bound than the equal-weighting strategy, whose corresponding last term is larger. It is important to note that if the predictions of different data sources vary significantly (i.e., the gap between ε_1 and ε_2 is large), the proposed weighting strategy has a much tighter bound than the equal-weighting strategy. In other words, if there are some noisy data sources that potentially lead to a large error rate, GBC can effectively reduce their effect. This is an important property of GBC for handling noisy data sources, and it is evaluated empirically in the next section.
Similarly, we can derive the error bound of GBC in a transductive setting.

Theorem 4.2. Consider the transductive formulation of the objective with regularization parameter λ > 0, and denote by L_λ(f) the expected loss with the regularization parameter λ. If we set λ_c = (· · · ), then for the pair of functions (f_1, f_2) ∈ F_1 × F_2 returned by the transductive learning algorithm, with probability at least 1 − δ over the labeled samples,

L_λ(f_1, f_2) ≤ L_λ(f*) + (· · · ),

where n is the number of labeled examples, C_Lip is the Lipschitz constant of the loss, and the remaining term is bounded by the number of unlabeled examples and the bound of the losses.

Note that Theorems 4.1 and 4.2 give the error bounds of GBC in the inductive and transductive settings, respectively. In effect, the weighting strategy reduces the last term of each error bound, as compared to equal weighting.

Experiments

In this section, we report three sets of experiments that were conducted in order to evaluate the proposed GBC model applied to multiple data sources. Specifically, the experiments aim to answer the following questions:

• Can GBC make good use of multiple data sources? Can it beat other more straightforward strategies?

• What is the performance of GBC if there exist non-overlapping instances in different data sources?

Datasets. The aim of the first set of experiments is to predict movie ratings from the IMDB database. Note that there are 10 data sources in this task. For example, there is a data source about the plots of the movies, and a data source about the techniques used in the movies (e.g., 3D IMAX). Furthermore, there are several data sources providing different graph relational data about the movies; for example, in the director graph, two movies are connected if they have the same director. A summary of the different data sources can be found in Table 2. It is important to note that each of the data sources may provide certain useful information for predicting the ratings of the movies. For instance, the genre database may reflect that certain types of movies are likely to have high ratings (e.g., Fantasy); the director graph database implicitly infers movie ratings from similar movies by the same director (e.g., Steven Spielberg has many high-rating movies). Thus, it is desirable to incorporate different types of data sources to give a more accurate movie rating prediction. This is an essential task for online TV/movie recommendation, such as the famous $1,000,000 Netflix prize.
Table 2: IMDB Movie Rating Prediction data sources, including the Technology Database, the Sound Technology Database, and the Running Time Database, among others.
The second set of experiments is about handwritten number recognition. The dataset contains 2000 handwritten numerals ("0"–"9") extracted from a collection of Dutch utility maps. The handwritten numbers are scanned and digitized as binary images. They are represented in terms of the following seven data sources: (1) 76 Fourier coefficients of the character shapes, (2) 216 profile correlations, (3) 64 Karhunen-Loève coefficients, (4) 240 pixel averages in 2 × 3 windows, (5) 47 Zernike moments, (6) a graph dataset constructed from morphological similarity (i.e., two objects are connected if they have a similar morphological appearance), and (7) a graph generated with the same method as (6), but with random Gaussian noise imposed on the morphological similarity. This last data source is included to test the performance of GBC on noisy data. The aim is to classify a given object into one of the ten classes ("0"–"9"). The statistics of the dataset are summarized in Table 3.
The third set of datasets is downloaded from the UMD collective classification database. The database consists of 1293 different attacks, each belonging to one of six labels indicating the type of the attack (e.g., kidnapping, NBCR attack, weapon attack, and other attack). Each attack is described by a binary-valued vector of attributes whose entries indicate the absence or presence of a feature. There are a total of 106 distinct vector-based features, along with three sets of relational features. One set connects the attacks together if they happened in the same location; another connects the attacks if they were planned by the same organization. In order to perform a robust evaluation of the proposed GBC model, we add another data source based on the vector-based dataset, but with random Gaussian noise imposed. Again, this is to test the capability of the proposed model to handle noise.
Comparison Methods and Evaluations. It is important to emphasize again that there is no previous model that can handle the same problem directly, i.e., building a learning model from multiple graphs and multiple vector-based datasets with some non-overlapping instances. Furthermore, as far as we know, there are no state-of-the-art approaches that use the benchmark datasets described above in the same way. For instance, for the movie prediction dataset, we crawl the 10 data sources directly from IMDB and use them collectively in learning. In the case of the number recognition dataset, we have two graph data sources, which is different from previous approaches that only look at the vector-based features (e.g., for clustering or feature selection). In order to evaluate the proposed GBC model, we design a straightforward comparison strategy, which is to directly join all features together. In other words, given the sources with vector-based features X^(1), · · · , X^(p) and the adjacency matrices of the graphs M^(1), · · · , M^(q), the joined features can be represented as follows:

X = [X^(1)T, · · · , X^(p)T, M^(1)T, · · · , M^(q)T]T.

Since there is only one set of joined features, traditional learning algorithms can be applied on it to give predictions (each row is an instance; each column is a feature from a specific source). We include support vector machines (SVM) in the experiments, as they are widely used in practice. Note that in GBC, the consensus term and the graph similarity term can use unlabeled data to improve the learning. Hence, we also compare it with semi-supervised learning models. Specifically, a semi-supervised SVM (Semi-SVM) with a self-learning technique is used as the second comparison model. Note that we have three tasks in the experiments, one of which (the movie rating prediction task) is a regression task; for this task, regression SVM is used to give predictions. Additionally, since the proposed model is derived from gradient boosting decision trees, GBDT is used as the third comparison model, and its semi-supervised version is included as well. It is important to note that, in order to use the joined features defined above, these comparison models require that there are no non-overlapping instances. In other words, all data sources should have records of all instances; otherwise, the joined features will have many missing values, since some data sources may not have records of the corresponding instances. To evaluate GBC more comprehensively, we thus conducted the experiments in two settings:

• Uniform setting: the first setting forces all data sources to contain records of all instances. We only look at the instances that have records in all data sources. Table 3 presents the statistics of the datasets in this setting. In this case, we can easily join the features from different sources as defined above.

• Non-overlapping setting: the second setting allows different data sources to have some non-overlapping instances. Thus, an instance described in one data source may not appear in other data sources. This setting is more realistic. The proposed GBC model is able to handle this case, since it allows non-overlapping instances. However, for the comparison methods, there will be many missing values in the joined features, as discussed above. In this case, we replaced the missing values with the average values of the corresponding features. In this setting, 30% of the instances do not have records in half of the data sources.

We conducted experiments in the above two settings. During each run, we randomly selected a certain portion of examples as training data, keeping the others as test data. For the same training set size, we randomly selected the set of training data 10 times, the rest being used as test data, and the results were averaged over the 10 runs. The experiment results are reported with different training set sizes. Note that the proposed GBC model can be used for both classification and regression. We used the error rate to evaluate the results for classification tasks, and the root mean square error (RMSE) for regression tasks.
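As a concrete illustration of the concatenation baseline above, the sketch below builds the joined feature matrix from synthetic data and trains a standard SVM on it. The data, dimensions, train/test split, and the linear kernel are arbitrary choices of ours, not the paper's experimental setup.

```python
# Feature-concatenation baseline: stack every vector-based source and every
# graph adjacency matrix, then train a single standard learner on the result.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
n = 120
X1, X2 = rng.randn(n, 10), rng.randn(n, 4)        # two vector-based sources
A = (rng.rand(n, n) < 0.05).astype(float)         # adjacency matrix of one graph
A = np.maximum(A, A.T)                            # make the relation symmetric
y = (X1[:, 0] + X2[:, 1] > 0).astype(int)

# X = [X(1)^T, ..., X(p)^T, M(1)^T, ..., M(q)^T]^T in the paper's column-vector
# notation; with row-instance matrices this is a horizontal concatenation.
X = np.hstack([X1, X2, A])

clf = SVC(kernel="linear").fit(X[:80], y[:80])
print("baseline accuracy:", clf.score(X[80:], y[80:]))
```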
Analysis of the Experiments. Our aim is to study the performance of the proposed GBC model in the two settings described above: the uniform and non-overlapping settings. The experiment results are summarized in Fig. 3 and Fig. 4, respectively. The x-axes record different percentages of training data (while the remainder of the data is used for evaluation), and the y-axes report the errors of the corresponding learning models.

Table 3: Data descriptions.
Task | # of data sources | Task type
Movie Rating Prediction | 10 (4 graph and 6 others) | Regression
Number Recognition | 7 (2 graph and 5 others) | Classification (10 labels)
Terrorist Attack Classification | 4 (2 graph and 2 others) | Classification (6 labels)

Figure 3: All data sources record the same set of objects (overlapping objects with different features). Panels: (a) movie rating prediction, (b) number recognition, (c) terrorist attack detection; the x-axes show the percentage of training data.

Figure 4: Each data source is independent (with 30% of overall non-overlapping instances). Panels: (a) movie rating prediction, (b) number recognition, (c) terrorist attack detection.

We observe two major phenomena in the experiments. Firstly, the proposed GBC model effectively reduces the error rate as compared to the other learning models in both settings. It is especially obvious in the movie rating prediction dataset, where 10 data sources are used to build the model. In this dataset, GBC reduces the error rate by as much as 80% in the first setting (when 90% of the instances are used for training), and 60% in the second setting (when 10% of the instances are used for training). This shows that GBC is especially advantageous when a large number of data sources are available. We further analyze this phenomenon below with Table 4. On the other hand, the comparison models have to deal with a longer and noisier feature vector; GBC beats the four comparison approaches by judiciously reducing the noise. Secondly, we can observe that GBC outperforms the other approaches significantly and substantially in the second setting (Fig. 4), where some instances do not have records in all data sources. As analyzed in the previous section, this is one of the advantages of GBC over the comparison models, which have to deal with missing values.

In the remainder of this section, we would like to answer the following questions:

• To which extent does GBC help to integrate the knowledge from multiple sources, compared to learning from each source independently? Specifically, how do the principles of consensus and connectivity similarity help?

• Is the weight learning algorithm necessary?

• Do we need multiple data sources? How does the number of data sources affect the performance?

In GBC, both λ0 and λ1 are determined by cross-validation. In the following set of experiments, we tune the values of λ0 and λ1 in order to study the specific effects of the consensus and connectivity terms. We compare the proposed GBC model with three algorithms on the number recognition dataset in the uniform setting (i.e., all data sources contain all instances):

• The first comparison model sets λ0 to zero and determines λ1 by cross-validation. In this case, the consensus term is removed; in other words, the algorithm can only apply the connectivity similarity principle. We denote this model as "GBC without consensus".

• The second comparison model sets λ1 to zero and lets λ0 be determined by cross-validation. In other words, the connectivity similarity term is removed, and the algorithm only depends on the empirical loss and the consensus term. We denote this model as "GBC without graph".

• The third comparison model sets both λ0 and λ1 to 0. Hence, the remaining term is the empirical loss, and the model is identical to traditional GBDT.

The empirical results are presented in Fig. 5. It can be observed that all three models outperform traditional GBDT. We draw two conclusions from this experiment. First, as was already observed in the previous experiments, learning from multiple sources is advantageous. Specifically, the GBDT model builds classifiers for each data source independently and averages the predictions at the last step; however, it does not let the prediction models "communicate" during training. As a result, it has the worst performance in Fig. 5. Second, both the principle of consensus and the principle of connectivity similarity improve the performance. Furthermore, the connectivity similarity term helps more when the amount of training data is limited. For example, when there are only 10% training instances, the error rate of GBC with only the connectivity term (i.e., GBC without consensus) is less than 9%, while that of GBC with only the consensus term (i.e., GBC without graph) is around 13%. This is because, when the number of labeled training examples is limited, the graph connectivity serves as a more important source of information by connecting unlabeled data with the limited labeled data.

Figure 5: How do the consensus principle and the connectivity similarity principle help? (Compared models: "GBC without Consensus" and "GBC without Graph".)

Furthermore, it is important to note that in the GBC model, the weights of the different data sources are adjusted at each iteration. The aim is to assign higher weights to the data sources that contain useful information, and to filter out the noisy sources. This step is analyzed as an important one in Theorem 4.1, since it can help reduce the upper bound of the error rate. We specifically evaluate this strategy on the terrorist detection task. Note that, in order to perform a robust test, one of the vector-based sources contains Gaussian noise, as described in the datasets subsection. The empirical results are presented in Fig. 6. It can be clearly observed that the weighting strategy reduces the error rate by as much as 70%. Hence, an appropriate weighting strategy is an important step when dealing with multiple data sources of unknown quality.

Figure 6: Why is the weighting strategy necessary? (Compared model: "GBC without Weight Learning".)

It is also interesting to evaluate to which extent GBC performance improves as the number of data sources increases. For this purpose, the movie rating prediction dataset is used as an example. We first study the case where there is only one data source: to do so, we run GBC on each of the data sources independently, and then on 2 and 4 data sources. In the experiments with 2 data sources, we randomly selected 2 sources from the pool (Table 2) as inputs to GBC; these random selections of data sources were performed 10 times, and the average error is reported in Table 4. A similar strategy was used to conduct the experiment with 4 data sources. In Table 4, the results are reported for different percentages of training data, and the best performances are highlighted in bold. It can be observed that the performance with only one data source is the worst, with high root mean square error and high variance. With more data sources available, the performance of GBC tends to improve. This is because each data source provides complementary information useful to build a comprehensive model of the whole dataset.

Table 4: Effect of different numbers of sources. Reported results are RMSE, with variance in parentheses.
Conclusions

This paper studies the problem of building a learning model from heterogeneous data sources. Each source can contain traditional vector-based features or graph relational features, with potentially non-overlapping sets of instances. As far as we know, there is no previous model that can be directly applied to solve this problem. We propose a general framework derived from gradient boosting, called Gradient Boosting Consensus (GBC). The basic idea is to solve an optimization problem that (1) minimizes the empirical loss, (2) encourages the predictions from different data sources to be similar, and (3) encourages the predictions of connected data to be similar. The objective function is solved by stochastic gradient boosting, with an incorporated weighting strategy that adjusts the importance of different data sources according to their usefulness. Three sets of experiments were conducted, including movie rating prediction, number recognition, and terrorist detection. We show that the proposed GBC model substantially reduces the prediction error rate, by as much as 80%. Finally, several extended experiments were conducted to study specific properties of the proposed algorithm and its robustness.

As future work, we will explore better methods to determine the algorithm parameters automatically. Furthermore, we will improve the model to handle large-scale datasets. We will also explore other approaches to handle heterogeneous learning.
Acknowledgments

Part of the work was done when Xiaoxiao Shi was a summer intern at AT&T Labs. It is also supported in part by NSF through grants IIS-0905215, DBI-0960443, CNS-1115234, IIS-0914934, OISE-1129076, and OIA-0963278, and the Google Mobile 2014 Program.

References

[1] P. Melville, R. J. Mooney, and R. Nagarajan, "Content-boosted collaborative filtering for improved recommendations," in AAAI/IAAI, pp. 187–192, 2002.
[2] A. Blum and T. M. Mitchell, "Combining labeled and unlabeled data with co-training," in COLT, pp. 92–100.
[3] S. Oba, M. Kawanabe, K. Müller, and S. Ishii, "Heterogeneous component analysis," in NIPS, 2007.
[4] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training," in CIKM, pp. 86–93.
[5] B. Long, P. S. Yu, and Z. Zhang, "A general model for multiple view unsupervised learning," in SDM, pp. 822–833, 2008.
[6] J. Gao, W. Fan, Y. Sun, and J. Han, "Heterogeneous source consensus learning via decision propagation and negotiation," in KDD, pp. 339–348, 2009.
[7] D. Agarwal, B. Chen, and B. Long, "Localized factor models for multi-context recommendation," in KDD, pp. 609–617, 2011.
[8] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, pp. 93–106.
[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the International Conference on Machine Learning, 2001.
[10] B. Taskar, P. Abbeel, and D. Koller, "Discriminative probabilistic models for relational data," in Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence, 2002.
[11] H. Eldardiry and J. Neville, "Across-model collective ensemble classification," in AAAI, 2011.
[12] D. P. Bertsekas, Nonlinear Programming (Second ed.). Cambridge, MA: Athena Scientific, 1999.
[13] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[14] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
[15] M. Balcan and A. Blum, "A PAC-style model for learning from labeled and unlabeled data," in COLT, pp. 111–126, 2005.
[16] K. Sridharan and S. M. Kakade, "An information theoretic framework for multi-view learning," in COLT, pp. 403–414, 2008.
[17] Y. Koren, R. M. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
[18] M. van Breukelen and R. Duin, "Neural network initialization by combined classifiers," in ICPR, pp. 16–.
[19] X. Z. Fern and C. Brodley, "Cluster ensembles for high dimensional clustering: An empirical study," Journal of Machine Learning Research.
[20] X. He and P. Niyogi, "Locality preserving projections," in NIPS, 2003.
[21] X. Zhu and A. B. Goldberg, Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, 2009.
[22] P. Laskov, "An improved decomposition algorithm for regression support vector machines," in NIPS, pp. 484–.