
Automatic Translation of Nominal Compounds from English to Hindi

Prashant Mathur, Soma Paul
International Conference on Natural Language Processing
Report No: IIIT/TR/2009/219
Centre for Language Technologies Research Centre, International Institute of Information Technology, Hyderabad - 500 032, INDIA

Automatic Translation of English Nominal Compound in Hindi
Language Technology Research Centre, IIIT Hyderabad

Proceedings of ICON-2009: 7th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from

Abstract

English nominal compounds can be variously translated into Hindi. This paper presents an automatic translation system for translating English bigram nominal compounds into Hindi. The method comprises the following steps: (1) Translation template generation (2) Extraction of nominal compounds from an English corpus (3) Finding the appropriate sense of the components of the compound using a WSD tool (4) Lexical substitution of the component nouns using a Bi-Lingual Dictionary (5) Corpus Search using translation templates and Ranking of possible candidates. We show that correct sense selection of the component nouns of a given nominal compound during the analysis stage significantly improves the performance of the system and makes the present work distinct from all previous work on automatic bilingual translation of nominal compounds.

1.0 Introduction

Nominal compounds are a frequently occurring expression in English [1]. A two word nominal compound (henceforth NC) is a construct of two nouns, the rightmost noun being the head and the preceding noun the modifier, as found in 'cow milk', 'road condition', 'machine translation' and so on [2]. Rackow et al. (1992) have rightly observed that the two main issues in translating the source language NC correctly in the target language involve a) correctness in the choice of the appropriate target lexeme during lexical substitution and b) correctness in the selection of the right target construct type. The issue stated in (b) becomes apparent when we examine the parallel corpus of English and Hindi that we have used for the present work. We have found that English nominal compounds can be translated in Hindi in the following ways:

a. As Nominal Compound
'Hindu texts' → hindU SastroM, 'milk production' → dugdha utpAdana

b. As Genitive Construction
'rice husk' → cAval kI bhUsI, 'room temperature' → kamare ke tApamAn

c. As Adjective Noun Construction
'nature cure' → prAkrtik cikitsA, 'hill camel' → pahARI UMT
The words prAkrtik and pahARI are adjectives derived from prakriti and pAhAR respectively.

d. As other syntactic phrase
wax work → mom par ciwroM 'work [...]'
body pain → SarIr meM dard 'pain in body'
Cow dung → gobar
Hand luggage → haat meM le jaaye jaane vaale saamaan 'luggage to be carried by hand'

[1] Tanaka and Baldwin (2004) report that the BNC corpus (84 million words: Burnard (2000)) has 2.6% and the Reuters corpus (108M words: Rose et al. (2002)) has 3.9% of bigram nominal compounds.
[2] A nominal compound may be constituted of a more complex structure, as in 'customer satisfaction indices', 'social service department' and so on.

However, no definite clue is available in the data that helps one in selecting the right construction type of Hindi for translating a given English NC. Tanaka and Baldwin (2004) observe that a translator or MT system attempting to translate a corpus will run across NCs with high frequency, but that each individual NN compound will occur only a few times (with around 45-60% occurring only once). The upshot of this for MT systems and translators is that NN compounds are too varied to be pre-compiled in an exhaustive list of translated NN compounds. The system must be able to deal with novel NN compounds on the fly. Building an automatic translation system for nominal compounds from the source language (SL) English to the target language (TL) Hindi thus becomes a very challenging task in NLP.

With Google translator we could achieve an accuracy of 45% with the same test data that we have used to evaluate our model. It could give a correct translation in 29% of cases when a nominal compound remains a nominal compound in Hindi. When an NC is translated as a genitive construction in Hindi, the translator could return the correct result in 10% of cases. For other cases, such as when an NC is translated as an Adjective noun pair or as a single word, the performance of Google translator is poor.

This paper presents the architecture of a "Nominal Compound Translator" system that has been able to give an accuracy of 57% when tested on unseen gold standard test data. We limit our discussion to English two word nominal compounds in this paper. The approach adopted to build the system has a close resemblance to the approaches described in Bungum and Oepen (2009) for Norwegian to English nominal compound translation and Tanaka and Baldwin (2004) (English to Japanese nominal compound translation and vice versa). All these works, including the one described in this paper, follow a template based corpus search approach. However, the present system distinctly differs from the aforementioned works in the analysis stage. Our system, unlike the others, attempts to select the correct sense of the nominal components by running a WSD system on the SL data. As a result, the number of possible translation candidates to be searched in the target language corpus is significantly reduced. Translation of a nominal compound combines the following subtasks: (1) Selection from target language Hindi (2) Extraction of NCs from English corpus (3) Finding the relevant sense of the components of NCs (4) Component Translation to Hindi using a Bi-Lingual Dictionary (5) Corpus Search using templates and Ranking of possible candidates.

The next section describes the data in some detail. In section 3, we review earlier works that have followed approaches similar to the present work. Our approach is described in section 4. Finally, the results and analysis are discussed in section 5.

At the time of taking up the present project we made a preliminary study of NCs in English-Hindi parallel corpora in order to identify the distribution of the various construct types which English NCs are aligned to. We took a parallel corpus of around 50,000 sentences in which we got 9246 sentences (i.e. 21% cases of the whole corpus) that have a nominal compound. The percentage of various translations is given in Table 1.

We have also come across some cases where an NC corresponds to a paraphrase construct, for which we have not given a count in this table. There are 0.08% cases (see table 1) when an English NC becomes a single word form in Hindi. The single word form can either be a simple word, as in 'cattle dung' → gobar, or a compounded word such as 'blood pressure' → raktacApa, 'transition plan' → parivartana-yojanA.

[Table body not recoverable; surviving header: Construction Type]
Table 1: Distribution of translations of English NC from an English-Hindi parallel corpus

The above table records the major translation types. There are 1208 cases (approximately 13%) where the English nominal compound is not translated but transliterated in Hindi. They are mostly technical terms, names of chemicals and so on.

The figures given in Table 1 report the empirical study performed on English-Hindi parallel corpora. We prepare a set of translation templates that represent the construct types of Hindi (as in table 1). In section 4, we will discuss how these templates are used for searching possible translations in a Hindi raw corpus. From table 1, we come to know that the frequency of the nominal compound construction in Hindi is the highest. The second highest construction is the genitive construct. In parallel, we have performed a study with Hindi informants to find out in how many cases an English nominal compound can legitimately be translated into a syntactic genitive construct even when it can have another, more accurate translation. Our experiment shows that a nominal compound is well accepted as a genitive construct in Hindi in 59% of cases. This is an interesting finding which we have used in designing the heuristics of the present task.

3.0 Related Works

While working on the automatic translation of English nominal compounds to Hindi, we came across works on two different approaches: a) the transfer based approach (Rackow et al. (1992)) and b) the corpus search based probabilistic approach (Bungum and Oepen (2009) (henceforth B&O), Tanaka and Baldwin (2004) (henceforth T&B)). Rackow et al. tried to set a mapping between the head noun of the source language and the target language in terms of some grammatical and semantic features, which helped them in selecting the right lexical item for the target language. The strategy adopted by both B&O and T&B has close similarity to ours as far as the template generation and the procedure of corpus search are concerned. First, they generate templates which represent various construct types of the target language and then search for these templates in a huge corpus. The two works differ in the strategy used for ranking the possible translation candidates that are found in the corpus. We have adopted the T&B proposal for ranking. T&B suggest ranking candidate translations based essentially on target language corpus frequency. They develop a measure called "interpolated CTQ (Corpus-based translation quality) metric" which extracts frequency counts from the target language corpus.

While working on the source language side, both B&O and T&B disregard local contexts and do not attempt to identify the sense of the nominal compound in the given context. They have, on the other hand, taken into account all possible translations of the component nouns while performing the corpus search. In this way the number of search candidates becomes large. We will discuss in section 4 that, while translating a nominal compound, we have tried to consider the meaning of that compound in the given context, that is, the sentence in which it has occurred. In this regard, our work is distinct from the other works referred to in this section.

4.0 Preparation of Data and Approach

This section describes our procedure in detail. The system comprises the following stages: a) Preparation of data and template generation, b) Determining the sense of the component nouns in the given context, c) lexical substitution using a bilingual dictionary, d) corpus search using translation templates and e) Ranking of the possible candidates.

4.1 Preparation of Source Language Data

Two sets of language data are prepared for the work. The first set is a parallel corpus of around 50,000 sentences in which 9246 sentences have a nominal compound. The source language sentences have been manually examined for nominal compounds and their corresponding translation is identified in the Hindi target language [3]. The second set of data consists of 7000 raw sentences of English on which we have run Tree-tagger [4], which is a POS tagger. The tagger not only gives the part of speech of the words but also outputs the lemma for each word. The lemma is required in the later stage for searching the word in the wordnet. Sentences with nominal compounds are extracted from the tagged data and the nominal compounds are strictly restricted to the two consecutive noun construction type. We obtain 1584 sentences with distinct nominal compounds, out of which 1000 are taken for processing. These sentences are manually translated into Hindi; half of them are used as development data and half as gold standard test data.

[3] In order to execute this task we have used a JAVA based interface "Sanchay" that has been developed in-house. Using an interface to do this task helped us to maintain consistency in [...]
[4] We used Tree-Tagger (POS-Tagger) for tagging the corpus of 1.7M words. It gave an accuracy of 94%.

4.2 Generation of Translation Templates

One of the most important subtasks in this work is the generation of templates. Each template is a possible translated construct type of English NC in Hindi. The parallel corpus data are inspected and generalized into translation templates. As shown in section 2, the two templates <E1 E2> → <H1 H2> and <E1 E2> → <H1 kA H2> are the most frequent ones [5] [6]. The other interesting candidate is the Adjective noun phrase in Hindi. Hindi has a rich derivational system for adjective formation. In this work we have identified 44 templates so far.

[5] E stands for English and H stands for Hindi.
[6] kA is a genitive marker in Hindi. It has variants kI and ke. Therefore <H1 kA H2>, <H1 ke H2> and <H1 kI H2> form three translation templates.

4.3 Sense Selection for Source Language

The context determines the sense of a given English NC in a corpus. When the component nouns are taken independently, they might represent more than one sense. For each sense the English word might be translated into more than one Hindi equivalent word using an English to Hindi bilingual dictionary. We illustrate the complexity of lexical substitution with data from the corpus. We came across the following sentences in the test data:

a. 'Millions of people in the border area need to feel safe again'
b. 'Road safety aims to reduce the harm (deaths, injuries, and property damage) resulting from crashes of road vehicles'

The nominal compounds identified in sentences (a) and (b) are 'border area' and 'road safety' respectively. All four words can be used in more than one sense, as given in the 2nd column of table 2.

[Table body not recoverable; surviving header: No. of senses from (wordnet)]
Table 2: Number of Senses Listed in Wordnet

For each sense there exists a synset which consists of one or more semantically equivalent words in the wordnet. If we consider all words for all senses of the component nouns and attempt to translate all of them using a bilingual dictionary, the number of translation candidates will be huge. Moreover, we will be searching for candidates that are not relevant for the English NC in the given context. In order to avoid this proliferation of data, we have chosen to use a WSD tool. We ran WordNet-SenseRelate (Patwardhan et al.) on our data for the purpose. This tool specifies the wordnet sense id for each noun component within the NC as shown in table 3:

Word      Sense id   Synset
border    [...]      <'border', 'borderline', 'delimitation', ...>
area      [...]      <'area', 'region'>
road      [...]      <'road', 'route'>
safety    [...]      <'safety', 'refuge'>
Table 3: Output of WSD tool

The third column of table 3 presents the synset associated with the sense selected by the WSD tool. Once the synsets are acquired in this process, the translation for each word in the synset is obtained from a bilingual dictionary. When we look into a bilingual dictionary, we may again come across many equivalents of a word which do not match the sense id selected for that word. For example, the word 'border' (a member of the synset of 'border') has one equivalent jhaalar in the bilingual dictionary that is used in the domain of 'decoration' and not 'location'. We would like to discard such equivalents. Otherwise the whole attempt of using a WSD tool on the source language side will be lost. The ideal situation would have been to have a mapping from the synset id of a word in the English wordnet to the corresponding Hindi synset id in the Hindi wordnet. Since that was not available to us, we have maintained the following strategy. We first acquire all possible translations for all the words within a synset from all possible dictionary resources. Then we take out those Hindi words which are common translations of all English words of a synset, if there are any. For example, we got the following translations for the two members of the synset <'road', 'route'> from bilingual dictionaries:

path, maarg, saDak, raastaa
maarg, saDak, raastaa
Table 4: Translation using a bilingual dictionary

From table 4, we find that maarg, saDak, raastaa are common translations for 'road' and 'route'. Once the Hindi equivalents are obtained, they are used to frame the translation candidates which are searched in the corpus for a match. When no common equivalent is found for all member words of a synset, we try for the maximum number of member words for which a common translation is available. The worst case is when we do not find any common translation, and that was rare in our experiment. For example, for the synset members of 'border' as well as 'safety' we have not come across any common Hindi equivalents. For such cases, we try out translations of all synset members one by one for generating the translation templates.

Translation Candidates

We have performed the corpus search on a Hindi indexed corpus of 28 million words. For ranking, a reference ranking based on the frequency of occurrence of the translation candidates in full in the TL corpora is taken as the baseline. To improve on the baseline, a stronger ranking measure is borrowed from Baldwin and Tanaka (2004). It rates a given translation candidate according to corpus evidence for both the fully specified translation and its parts in the context of the translation template in question. The measure is called the interpolated CTQ metric; it extracts the frequency counts from the target language corpus in the following way:

CTQ(w_1^H, w_2^H, t) = \alpha\, p(w_1^H, w_2^H, t) + \beta\, p(w_1^H, t)\, p(w_2^H, t)\, p(t)

where \alpha\, p(w_1^H, w_2^H, t) is the probability of occurrence of template t with w1 and w2 as its instances, and \beta\, p(w_1^H, t)\, p(w_2^H, t)\, p(t) is the probability of occurrence of translation template t with w1 as its instance at one time, multiplied by the probability of occurrence of translation template t with w2 as its instance at another time, multiplied by the occurrence of translation template t. Naturally the first term will be given higher priority than the second term. The results presented in the next section will show that the incorporation of the frequency of occurrence of \beta\, p(w_1^H, t)\, p(w_2^H, t) has distinctly improved the recall of our system.

5.0 Result and Analysis

This section presents the results of the various experiments performed as part of automatically translating English NCs to Hindi. The results show a distinct improvement in performance as we go from the baseline ranking method to the CTQ method of ranking. We have used three methods of lexical substitution of the components of nominal compounds into Hindi equivalents, and the result obtained for each method is presented in table 5 and table 6. As part of the first method we have not done any word sense disambiguation of the component words of the source language NC; instead we have directly translated the English NC components into all possible Hindi equivalents. For the second method, the first sense in wordnet for the components of the given English NC has been selected as the default sense and all the members of the synset of the first sense have been substituted using a bilingual dictionary. The motivation for this approach is twofold: a) a word occurs mostly in its default sense, which is listed as the first sense in any lexicon; b) if the input word is not available in a bilingual dictionary for substitution, a synset gives us other equivalent words. This increases the robustness of the system. The third method is the one we have adopted for the present task: using a WSD tool on the source language NC and selecting the appropriate sense of the given word in that context. The purpose of trying out various methods for lexical substitution is to examine whether the usage of a WSD tool brings any improvement to the overall performance of the translator tool. The table below shows that it does. The pre-processed input that has been used for lexical substitution is not manually analyzed data but is obtained as the output of Tree-Tagger, which gives 94% accuracy, and the WSD tool WordNet-SenseRelate, which has produced 80% accurate cases for nominal compound disambiguation [7]. The results of the corpus search of the translation candidates are given in the following two tables. The baseline frequency model performs as follows:

Lexical substitution                        Recall    [...]
[...]                                       [...]     [...]
Wordnet 1st sense + Bilingual dictionary    24%       [...]
Sense selection by WSD tool                 24.63%    53.68%
Table 5: Ranking using Baseline Frequency

[7] It is interesting to note that the accuracy reported for the WordNet-SenseRelate output on general data is 58%. When we tested the tool for nominal compounds, it gave an accuracy of around 80% for the same.

With the use of the CTQ metric, the results improved as shown in the following table:

Lexical substitution                        Recall    [...]
[...]                                       [...]     [...]
Wordnet 1st sense + Bilingual dictionary    28%       [...]
Sense selection by WSD tool                 28.50%    62.1%
Table 6: Ranking Using CTQ (Corpus-based Translation Quality)

The recall of this experiment was very low. In order to increase the coverage of translation, we have done the following study. We involved two informants to verify on the development data whether the compounds which were not found during corpus search can legitimately be translated as a genitive construct. We found that the heuristic works for 59% of cases. Therefore we incorporated this as a default translation case for our system. Whenever a corpus search for a translation candidate fails, we assign a genitive translation to that nominal compound. This results in a steep improvement in recall although the precision falls a little. We ran the experiment on the output of the 1st and 3rd lexical substitution methods. The result is reported in the following table:

[Table body not recoverable]
Table 7: Ranking after inclusion of default translation (X kA Y, X kI Y, X ke Y as templates)

6.0 Conclusion and Future Work

This paper describes the architecture of a template based translation system for translating English nominal compounds into Hindi. We have observed that English nominal compounds can variously be translated into Hindi. However, no clue is available to determine which type of Hindi construct a given compound would be translated into. We have, therefore, adopted a corpus search approach that performs the search of candidate templates in a Hindi indexed corpus. While generating templates, we found that adjectival templates are hard to generate because adjective formation from a noun is a complex derivational process in Hindi. It does not only involve attaching an adjectival suffix to the noun but also, many a time, requires a change in the vowel of the stem. In the present work, we have performed poorly for adjective noun translation templates. Future work includes the correct generation of the adjectival form from the modifier nouns so that correct templates for the 'Adjective Noun' construct can be obtained. One advantage of this approach is that a translation, if it exists in the corpus, will never be missed. Therefore the accuracy of translation will depend largely on the amount of target language data searched for the translation candidates.

7. References

George A. Miller. 1994. WORDNET: A Lexical Database for English. HLT.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing. Manchester, UK.

Lou Burnard. 2000. User Reference Guide for the British National Corpus. Technical report.

Pierrette Bouillon, Katharina Boesefeldt, and Graham Russell. 1994. Compound nouns in a unification-based MT system. In Proc. of the 4th Conference on Applied Natural Language Processing (ANLP), Stuttgart, Germany.

Siddharth Patwardhan, Satanjeev Banerjee and Ted Pedersen. 2005. SenseRelate::TargetWord - A Generalized Framework for Word Sense Disambiguation. Proceedings of the ACL Interactive Poster and Demonstration Sessions, Ann Arbor, MI.

Sparck Jones, K. 1983. "So what about parsing compound nouns?," in Automatic Natural Language Processing, K. Sparck Jones and Y. A. Wilks, eds., Ellis Horwood, Chichester, 164-168.

Su Nam Kim, Timothy Baldwin. 2005. Automatic Interpretation of Noun Compounds Using WordNet Similarity. IJCNLP 2005: 945-956.

Timothy W. Finin. 1980. The semantic interpretation of nominal compounds. In Proc. of the 1st Conference on Artificial Intelligence (AAAI-80).

Timothy Baldwin and Takaaki Tanaka. 2004. Translation by Machine of Complex Nominals: Getting it right. In Proceedings of the ACL04 Workshop on Multiword Expressions: Barcelona, Spain.

Tanaka, Takaaki and Timothy Baldwin. 2003b. Translation Selection for Japanese-English Noun-Noun Compounds. In Proceedings of Machine Translation Summit IX, New Orleans, LA, USA.

Ulrike Rackow, Ido Dagan, Ulrike Schwall. 1992. Automatic Translation of Noun Compounds. COLING 1992, 1249-1253.

Zouhair Maalej. 1994. English-Arabic Machine Translation of Nominal Compounds. In Proceedings of the Workshop on Compound Nouns: Multilingual Aspects of Nominal Composition. Geneva: ISSCO, pp. 135-146.
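The lexical substitution strategy of section 4.3 (keep the Hindi equivalents shared by all members of the selected synset, else by the largest subset of members, else fall back to every member's translations) and the template instantiation of section 4.2 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `DICTIONARY` contents, the assignment of the two Table 4 rows to 'road' and 'route', and the Hindi equivalents given for 'safety' and 'refuge' are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical bilingual dictionary (illustrative entries only).
DICTIONARY = {
    "road": ["maarg", "saDak", "raastaa"],
    "route": ["path", "maarg", "saDak", "raastaa"],
    "safety": ["surakSA"],   # assumed equivalent, for illustration
    "refuge": ["SaraN"],     # assumed equivalent, for illustration
}

# The compound template plus the three genitive variants (kA, kI, ke).
TEMPLATES = ["{h1} {h2}", "{h1} kA {h2}", "{h1} kI {h2}", "{h1} ke {h2}"]


def common_translations(synset, dictionary):
    """Return Hindi equivalents shared by as many synset members as possible."""
    # Keep only the members the dictionary actually covers.
    entries = {w: set(dictionary.get(w, ())) for w in synset}
    entries = {w: t for w, t in entries.items() if t}
    members = sorted(entries)
    # Try the whole synset first, then progressively smaller subsets.
    # If several subsets of the same size qualify, the first one wins.
    for size in range(len(members), 1, -1):
        for subset in combinations(members, size):
            shared = set.intersection(*(entries[w] for w in subset))
            if shared:
                return sorted(shared)
    # Worst case: no common equivalent, so use every member's translations.
    return sorted(set().union(*entries.values())) if entries else []


def candidates(modifier_equivs, head_equivs, templates=TEMPLATES):
    """Instantiate every template with every modifier/head pair."""
    return [
        t.format(h1=m, h2=h)
        for m in modifier_equivs
        for h in head_equivs
        for t in templates
    ]


if __name__ == "__main__":
    mods = common_translations(["road", "route"], DICTIONARY)
    heads = common_translations(["safety", "refuge"], DICTIONARY)
    for cand in candidates(mods, heads):
        print(cand)
```

With only four of the 44 templates and two synsets, the sketch already produces 24 candidate strings for 'road safety', which shows why pruning the equivalents before the corpus search matters.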

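The ranking step (the interpolated CTQ metric, followed by the default genitive translation of section 5 when the corpus search finds nothing) can be sketched in the same spirit. The paper does not give the weights, the layout of the corpus counts, or the rule for choosing among kA, kI and ke, so the values `alpha=0.9` and `beta=0.1`, the `counts` dictionary, the toy frequencies and the choice of `H1 kA H2` as the default are all assumptions made for this example.

```python
from collections import Counter


def ctq(cand, counts, total, alpha=0.9, beta=0.1):
    """Interpolated CTQ score for cand = (w1, w2, template).

    alpha and beta are assumed values; the paper only says that the
    first term is given higher priority than the second.
    """
    w1, w2, t = cand
    p_full = counts["full"][(w1, w2, t)] / total   # p(w1, w2, t)
    p_w1 = counts["first"][(w1, t)] / total        # p(w1, t)
    p_w2 = counts["second"][(w2, t)] / total       # p(w2, t)
    p_t = counts["template"][t] / total            # p(t)
    return alpha * p_full + beta * p_w1 * p_w2 * p_t


def best_translation(cands, counts, total, default):
    """Return the top-scoring candidate, or the default if none has evidence."""
    scored = [(ctq(c, counts, total), c) for c in cands]
    scored = [sc for sc in scored if sc[0] > 0]
    if not scored:
        return default  # corpus search failed: fall back to a genitive
    return max(scored, key=lambda sc: sc[0])[1]


# Toy corpus statistics (hypothetical numbers, not from the paper).
COUNTS = {
    "full": Counter({("saDak", "surakSA", "H1 H2"): 3}),
    "first": Counter({("saDak", "H1 H2"): 10, ("maarg", "H1 kA H2"): 4}),
    "second": Counter({("surakSA", "H1 H2"): 8, ("surakSA", "H1 kA H2"): 5}),
    "template": Counter({"H1 H2": 600, "H1 kA H2": 400}),
}
TOTAL = 1000

if __name__ == "__main__":
    cands = [("saDak", "surakSA", "H1 H2"), ("maarg", "surakSA", "H1 kA H2")]
    default = ("saDak", "surakSA", "H1 kA H2")
    print(best_translation(cands, COUNTS, TOTAL, default))
```

The second candidate never occurs in full in the toy corpus, yet it still receives a small non-zero score from the second term; this is the effect the paper credits for the improved recall.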

PHARMA-BRIEF SPECIAL
At the expense of the poor? An examination of the corporate conduct of Boehringer Ingelheim, Bayer and Baxter in Brazil
Member of Health Action International
I Brazil: a country of contrasts
The Brazilian health system
The pharmaceutical market in Brazil
The companies examined
II Study methods
III Study results


Sanofi and Lilly announce licensing agreement for Cialis® (tadalafil) OTC

- Companies anticipate providing over-the-counter (OTC) product to treat erectile dysfunction after expiration of certain patents -

PARIS, France, and INDIANAPOLIS, May 28, 2014 - Sanofi (EURONEXT: SAN and NYSE: SNY) and Eli Lilly and Company (NYSE: LLY) today announced an agreement to pursue regulatory approval of nonprescription Cialis (tadalafil). Cialis is currently available by prescription only worldwide for the treatment of men with erectile dysfunction (ED).

Under the terms of the agreement, Sanofi acquires the exclusive rights to apply for approval of Cialis OTC in the United States, Europe, Canada and Australia. Sanofi also holds exclusive rights to market Cialis OTC following Sanofi's receipt of all necessary regulatory approvals. If approved, Sanofi anticipates providing Cialis OTC after expiration of certain patents. Terms of the licensing agreement were not disclosed.

"This agreement provides us with an opportunity to work with Lilly, a leader in men's health, to transform how this important medicine is offered to millions of men throughout the world," said Vincent Warnery, senior vice president, Global Consumer Healthcare Division, Sanofi. "The opportunity to forge an industry-leading partnership that adds to Sanofi Consumer Healthcare's leading portfolio and successful track record of over-the-counter switches reinforces consumer health care as a major growth platform for Sanofi."

"Millions of men worldwide trust Cialis to treat ED. We are pleased to work with Sanofi to pursue a path that could allow more men who suffer from ED to obtain convenient access to a safe and reliable product without a prescription," said David Ricks, senior vice president, Lilly, and president, Lilly Bio-Medicines. "Switching a medicine to over-the-counter is a highly regulated process that is data-driven and scientifically rigorous. Together with Sanofi, we look forward to working closely with regulatory authorities to define the proper actions and necessary precautions to help patients use over-the-counter Cialis appropriately."

Cialis was first approved by the European Medicines Agency in 2002, then by the U.S. Food and Drug Administration in 2003, for the treatment of erectile dysfunction. Ultimately, Cialis has received approval in more than 120 countries for indications that vary by country, including erectile dysfunction and erectile dysfunction and the signs and symptoms of benign prostatic hyperplasia (BPH). Cialis reached $2.16 billion USD (€1.58 billion) in worldwide sales in 2013 and has recorded total global sales of more than $14 billion USD (€10.2 billion) since launch. To date, more than 45 million men worldwide have been treated with Cialis.

About Cialis

Currently only available with a prescription, Cialis is a tablet taken to treat erectile dysfunction (ED), the signs and symptoms of benign prostatic hyperplasia (BPH), and both ED and the