FAO Animal Production and Health 162, ISSN 1014-1200. Use of Antimicrobials in Food-Producing Animals: impact of the development of resistance on public health. J. O. Errecalde, Facultad de Ciencias Veterinarias, Universidad Nacional de La Plata, Argentina
Seeking nonsense, looking for trouble: efficient promotional-infection detection through semantic inconsistency search
2016 IEEE Symposium on Security and Privacy

Xiaojing Liao¹, Kan Yuan², XiaoFeng Wang², Zhongyu Pei³, Hao Yang³, Jianjun Chen³, Haixin Duan³, Kun Du³, Eihal Alowaisheq², Sumayah Alrwais², Luyi Xing², and Raheem Beyah¹
¹Georgia Institute of Technology ²Indiana University ³Tsinghua University

2375-1207/16 $31.00 © 2016 IEEE. © 2016, Xiaojing Liao. Under license to IEEE. DOI 10.1109/SP.2016.48

Abstract. Promotional infection is an attack in which the adversary exploits a website's weakness to inject illicit advertising content. Detection of such an infection is challenging due to its similarity to legitimate advertising activities. An interesting observation we make in our research is that such an attack almost always incurs a great semantic gap between the infected domain (e.g., a university site) and the content it promotes (e.g., selling cheap viagra). Exploiting this gap, we developed a semantic-based technique, called Semantic Inconsistency Search (SEISE), for efficient and accurate detection of the promotional injections on sponsored top-level domains (sTLD) with explicit semantic meanings. Our approach utilizes Natural Language Processing (NLP) to identify the bad terms (those related to illicit activities like fake drug selling, etc.) most irrelevant to an sTLD's semantics. These terms, which we call irrelevant bad terms (IBTs), are used to query search engines under the sTLD for suspicious domains. Through a semantic analysis on the results page returned by the search engines, SEISE is able to detect those truly infected sites and automatically collect new IBTs from the titles/URLs/snippets of their search result items for finding new infections. Running on 403 sTLDs with an initial 30 seed IBTs, SEISE analyzed 100K fully qualified domain names (FQDN), and along the way automatically gathered nearly 600 IBTs. In the end, our approach detected 11K infected FQDNs with a false detection rate of 1.5% and over 90% coverage. Our study shows that by effective detection of infected sTLDs, the bar to promotional infections can be substantially raised, since other non-sTLD vulnerable domains typically have much lower Alexa ranks and are therefore much less attractive for underground advertising. Our findings further bring to light the stunning impacts of such promotional attacks, which compromise FQDNs under 3% of .edu, .gov domains and over one thousand gov.cn domains, including those of leading universities such as stanford.edu, mit.edu, princeton.edu, harvard.edu and government institutes such as nsf.gov and nih.gov. We further demonstrate the potential to extend our current technique to protect generic domains such as .com and .org.

I. INTRODUCTION

Imagine that you google the following search term: site:stanford.edu pharmacy. Figure 1 shows what we got on October 9, 2015. Under the domain of Stanford University are advertisements (ads) for selling cheap viagra! Using various search terms, we also found the ads for prescription-free viagra and other drugs under nidcr.nih.gov (National Institute of Dental and Craniofacial Research), counterfeit luxury handbags under dap.dau.mil (Defense Acquisition Portal), and replica Rolex under nv.gov, the domain of the Nevada state government. Clearly, all those FQDNs have been changed without authorization for promoting counterfeit or illicit products. This type of attack (exploiting a legitimate domain for underground advertising) is called promotional infection in our research.

Fig. 1: Search findings of promotional injections in stanford.edu. Search engine result is organized as title, URL and snippet.

Promotional infection is an attack exploiting the weakness of a website to promote content. It has been used to serve various malicious online activities (e.g., black-hat search engine optimization (SEO), site defacement, fake antivirus (AV) promotion, Phishing) through various exploit channels (e.g., SQL injection, URL redirection attack and blog/forum Spam). Unlike the attacks hiding malicious payloads (e.g., malware) from the search engine crawler, such as a drive-by download campaign, the promotional attacks never shy away from search engines. Instead, their purpose sometimes is to leverage the compromised domain's reputation to boost the rank of the promoted content (either what is directly displayed under the domain or the doorway page pointed to by the domain) in the search results returned to the user when content-related terms are included in her query. Such infections can inflict significant harm on the compromised websites through loss in reputation, search engine penalty, traffic hijacking and may even have legal ramifications. They are also pervasive: as an example, a study shows that over 80% of the doorway pages involved in black-hat SEO are from injected domains.

Catching promotional infections: challenges. Even with the prevalence of the promotional infections, they are surprisingly elusive and difficult to catch. Those attacks often do not cause automatic download of malware and therefore may not be detected by virus scanners like VirusTotal and Microsoft Forefront. Even the content injected into a compromised website can appear perfectly normal, no different from the legitimate ads promoting similar products (e.g., drugs, red wine, etc.), ideological and religious messages (e.g., cult theory promotion) and others, unless its semantics has been carefully examined under the context of the compromised site (e.g., selling red wine is unusual on a government's website). So far, detection of the promotional infections mostly relies on the community effort, based upon the discoveries made by human visitors (e.g., PhishTank) or the integrity checks that a compromised website's owner performs. Although attempts have been made to detect such attacks automatically, e.g., through a long term monitoring of changes in a website's DOM structure to identify anomalies or through computer vision techniques to recognize a web page's visual change, existing approaches are often inefficient (requiring long term monitoring or analyzing the website's visual effects) and less effective, due to the complexity of the infections, which, for example, can introduce a redirection URL indistinguishable from a legitimate link or make injected content only visible to the search engine.

Semantic inconsistency search. As mentioned earlier, fundamentally, promotional infections can only be captured by analyzing the semantic meaning of web content and the context in which it appears. To meet the demand for a large-scale online scan, such a semantic analysis should also be fully automated and highly efficient. Techniques of this type, however, have never been studied before, possibly due to the concern that a semantic-based approach tends to be complicated and less accurate. In this paper, we report a design that makes a big step forward in this direction, demonstrating that it is entirely possible to incorporate Natural Language Processing (NLP) techniques into a lightweight security analysis for efficient and accurate detection of promotional infections. A key observation here is that for the attacks in Figure 1, inappropriate content shows up in the domains with specific meanings: no one expects that a .gov or .edu site promotes prohibited drugs, counterfeit luxury handbags, replica watches, etc. Such inconsistency can be immediately identified and located from the itemized search result on a returned search result page, which includes the title, URL and snippet for each result (as marked out in Figure 1). This approach, which detects a compromised domain (e.g., stanford.edu) based upon the inconsistency between the domain's semantics and the content of its result snippet reported by a search engine with regard to some search terms, is called semantic inconsistency search or simply SEISE. Our current design of SEISE focuses on sponsored top-level domains (sTLD) like .gov, .edu, .mil, etc.; each has a sponsor (e.g., US General Service Administration, EDUCAUSE, DoD Network Information Center), represents a narrow community and carries designated semantics (Section III-A). Later we show that the technique has the potential to be extended to generic TLDs (gTLD, see Section V-B).

SEISE is designed to search for a set of strategically selected irrelevant terms under an sTLD (e.g., .edu) to find out the suspicious FQDNs (e.g., stanford.edu) associated with the terms, and then further search under the domains and inspect the snippets of the results before flagging them as compromised. To make this approach work, a few technical issues need to be addressed: (1) how to identify semantic inconsistency between injected pages and the main content of a domain; (2) how to control the false positives caused by the legitimate content including the terms, e.g., a health center site at Stanford University (containing the irrelevant term "pharmacy"); (3) how to gather the search terms related to diverse promotional content. For the first issue, our approach starts with a small set of manually selected terms popular in illicit activities (e.g., gambling, drug and adult) and runs a word embedding based tool to calculate the semantic distance between these terms and a set of keywords extracted from the sTLD's search content, which describe the sTLD's semantics. Those most irrelevant are utilized for detection (Section III-B). To suppress false positives, our approach leverages the observation that similar promotional content always appears on many different pages under a compromised domain for the purpose of improving the rank of the attack website pointed to by the content. As a result, a search of the irrelevant term under the domain will yield a result page on which many highly frequent terms (such as "no prescription", "low price" in the promotional content) turn out to rarely occur across the generic content under the same domain (e.g., stanford.edu). This is very different from the situation, for example, when a research article mentions viagra, since the article will not be scattered across many pages under the site and tends to contain the terms also showing up in the generic content under the Stanford domain, such as "study", "finding", etc. (Section III-B). Finally, using the terms extracted from the result snippets of the sites detected, SEISE further automatically expands the list of the search terms for finding other attacks (Section III-C).

We implemented SEISE and evaluated its efficacy in our research (Section IV). Using 30 seed terms and 403 sTLDs (across 141 countries and 89 languages), our system automatically analyzed 100K FQDNs and along the way, expanded the keyword list to 597 terms. In the end, it reported 11K infected FQDNs, which have been confirmed to be compromised¹ through random sampling and manual validation. With its low false detection rate (1.5%), SEISE also achieved over 90% detection rate. Moving beyond sTLD, we further explore the potential extension of the technique to gTLDs such as .com (Section V-B). A preliminary design analyzes .com domains using their site tag labeled by SimilarSites, which is found to be pretty effective: achieving a false detection rate (FDR) of 9% when long keywords gathered from compromised sTLDs are used.

¹ Note that in line with the prior research, the term "compromise" here refers to not only direct intrusion of a web domain, which was found to be the most common case in our research (80%, see Section VI), but also posting of illicit advertising content onto the domain through exploiting its weak (or lack of) input sanitization: e.g., blog/forum Spam and link Spam (using exposed server-side scripts to dynamically generate promotion pages under the legitimate domain).

Our findings. Looking into the promotional infections detected by SEISE, we were surprised by what we found: for example, about 3% (175) of .gov domains and 3% (246) of .edu domains are injected; also around 2% of the 62,667 Chinese government domains (.gov.cn) are contaminated with ads, defacement content, Phishing, etc. Of particular interest is a huge gambling campaign we discovered (Section V-C), which covers about 800 sTLDs and 3000 gTLDs across 12 countries and regions (US, China, Taiwan, Hong Kong, Singapore and others). Among the victims are 20 US academic institutes such as nyu.edu, ucsd.edu, 5 government agencies like va.gov, makinghomeaffordable.gov, together with 188 Chinese universities and 510 Chinese government agencies. We even recovered the attack toolkit used in the campaign, which supports automatic site vulnerability scan, shell acquisition, SEO page generation, etc. Also under California government's domain ca.gov, over one thousand promotion pages were found, all pointing to the same online casino site. Another campaign involves 102 US universities (mit.edu, princeton.edu, stanford.edu, etc.), advertising "buy cheap essay". The scope of these attacks goes beyond commercial advertising: we found that 12 Chinese government and university sites were vandalized with the content for promoting Falun Gong. Given the large number of compromised sites discovered, we first reported the most high-impact findings to related parties (particularly universities and government agencies) and will continue to do so (Section VI).

Further, our measurement study shows that some sTLDs such as .edu, .edu.cn and .gov.cn are less protected than the .com domains with similar Alexa ranks, and therefore become soft targets for promotional infections (Section V-B). By effectively detecting the attacks on these sTLDs, SEISE raises the bar for the adversary, who has to resort to less guarded gTLDs, which typically have much lower Alexa ranks, making the attacks, SEO in particular, less effective.

Contributions. The contributions of the paper are outlined as follows:

• Efficient semantics-based detection of promotional infections. We developed a novel technique that exploits the semantic gap between domains (sTLDs in particular) and unauthorized content they host to detect the compromised websites that serve underground advertising. Our technique is highly effective, incurring low false positives and negatives. Also importantly, it is simple and efficient: often a compromised domain can be detected by querying Google no more than 3 times. This indicates that the technique can be easily scaled, with the help of search providers.

• Measurement study and new findings. We performed a large-scale measurement study on promotional infections, the first of this kind. Our research brings to light several high-impact, ongoing underground promotion campaigns, affecting leading educational institutions and government agencies, and the unique techniques the perpetrator employs. Further we demonstrate the impacts of our innovation, which significantly raises the bar to promotional infections and can potentially be extended to protect generic domains.

Roadmap. The rest of the paper is organized as follows: Section II provides background information for our study; Section III elaborates on the design of SEISE; Section IV reports the implementation details and evaluation of our technique; Section V elaborates on our measurement study and new findings; Section VI discusses the limitations of our current design and potential future research; Section VII reviews related prior research and Section VIII concludes the paper.

II. BACKGROUND

In this section, we lay out the background information of our research, including the promotional infection, sTLD, NLP and the assumptions we made.

Promotional infection. As mentioned earlier, promotional infection is caused by exploiting the weakness of a website to advertise some content. A typical form of such an attack is black-hat SEO, a technique that improves the rank of certain content on the results page by taking advantage of the way search engines work, regardless of the guidelines they provide. Such activities can happen on a dedicated host, for example, through stuffing the pages with the popular search terms that may not be related to the advertised content, for the purpose of enhancing the chance for the user to find the pages. In other cases, the perpetrator compromises a high-rank website to post an ad pointing to the site hosting promoted content, in an attempt to utilize the compromised site's reputation to make the content more visible to the user. This can also be done when the site does not check the content uploaded there, such as visitors' comments, which causes its display of blog or forum Spam. Such SEO approaches, the direct compromise and the uploading of Spam ads, are considered to be promotional infections. Different from the SEO on a dedicated host, these approaches leverage a legitimate site and also provide their ad-related keywords to the search engine crawler, to attract visitors.

The promotional infection can be used for multiple goals such as malware distribution, phishing, black-hat SEO or political agenda promotion. Black-hat SEO is often used to advertise counterfeit or unauthorized products. The same promotional tricks have also been played to get other malicious content to the audience at which the adversary aims. Prominent examples are Phishing websites that try to defraud the visitors of their private information (user names, passwords, credit-card numbers, etc.) and fake AV sites that cheat the user into installing malware.

Sponsored top-level domains. A sponsored top-level domain (sTLD) is a specialized top-level domain that has private agencies or organizations as its sponsors that establish and enforce rules restricting the eligibility to use the domain based on community theme concepts. For example, .aero is sponsored by SITA, which limits registrations to members of the air-transport industry. Compared to an unsponsored top-level domain (gTLD), an sTLD typically carries designated semantics from its sponsors. For example, as a sponsored TLD, .edu, which is sponsored by EDUCAUSE, indicates that the corresponding site belongs to a post-secondary institution accredited by an agency recognized by the U.S. Department of Education. Note that sTLDs for different countries are also associated with specific semantic meanings as stated in ICANN, e.g., edu.cn for Chinese educational institutions. In our research, we collected sTLDs for different countries according to the 10 categories provided by ICANN (.aero, .edu, .int, .jobs, .mil, .museum, .post, .gov, .travel, .xxx) and the public suffix list maintained by the Mozilla Foundation.
Fig. 2: Overview of the SEISE infrastructure.
All together, we got 403 sTLDs from 141 countries.
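The paper does not say how the ICANN categories and the public suffix list were combined. As a rough illustration only, the sketch below assumes one plausible rule: keep a suffix if it is a category name itself or a category name followed by a two-letter country code (e.g., edu.cn). The function name `candidate_stlds` and the sample entries are hypothetical; the comment and wildcard handling follows the published format of the public suffix list.

```python
# Hypothetical sketch: narrow public-suffix entries down to sTLD candidates.
# The matching rule (category, or category + 2-letter country code) is an
# assumption made for illustration, not the procedure used in the paper.
STLD_CATEGORIES = {"aero", "edu", "int", "jobs", "mil",
                   "museum", "post", "gov", "travel", "xxx"}


def candidate_stlds(suffix_lines):
    """Return sorted suffixes that look like sTLDs under the assumed rule."""
    found = set()
    for raw in suffix_lines:
        line = raw.strip().lower()
        # Skip blank lines, '//' comments, and wildcard ('*') or exception ('!') rules.
        if not line or line.startswith("//") or line[0] in "*!":
            continue
        labels = line.split(".")
        if labels[0] not in STLD_CATEGORIES:
            continue
        # Accept 'edu' as well as 'edu.cn' (category + two-letter country code).
        if len(labels) == 1 or (len(labels) == 2 and len(labels[1]) == 2):
            found.add(line)
    return sorted(found)


sample = ["// comment", "edu", "edu.cn", "gov.cn", "com",
          "co.uk", "*.ck", "gov", "mil.kr", "com.cn"]
print(candidate_stlds(sample))
```

A real collection step would still need manual review, since a label such as "post" or "int" under a country code does not always carry the sponsored meaning.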
Natural language processing. The semantics information SEISE relies on is automatically extracted from web content using Natural Language Processing. Technical advances in the area have already made effective keyword identification and sentence processing a reality. Below we briefly introduce the key NLP techniques used in our research.

• Word embedding (skip-gram model). A word embedding $W: \text{words} \to V^n$ is a parameterized function mapping words to high-dimensional vectors (200 to 500 dimensions), e.g., W('education') = (0.2, -0.4, 0.7, ...), to represent the word's relation with other words. Such a mapping can be done in different ways, e.g., using the continuous bag-of-words model and the skip-gram technique to analyze the context in which the words show up. Such a vector representation ensures that synonyms are given similar vectors and antonyms are mapped to dissimilar vectors. Also interestingly, the vector representations fit well with our intuition about the semantic relations between words: e.g., the vectors for the words 'queen', 'king', 'man' and 'woman' have the following relation: $v_{\text{queen}} - v_{\text{woman}} + v_{\text{man}} \approx v_{\text{king}}$. In our research, we utilized the vectors to compare the semantic meanings of different words, by measuring the cosine distance between the vectors. For example, using Wikipedia pages as a training set (for the context of individual words), our approach automatically identified the words semantically close to 'casino', such as 'gambling' (with a cosine distance 0.35), 'vegas' (0.46) and 'blackjack' (0.48).

• Parts-of-speech (POS) tagging and phrase parsing. POS tagging is a procedure of labeling a word in the text (corpus) as corresponding to a particular part of speech as well as its context (such as nouns and verbs). POS tagging accepts the text as input and outputs the words labeled with POS such as noun, verb, adjective, etc. Phrase parsing is the technique to divide sentences into phrases that logically belong together. Phrase parsing accepts texts as input and outputs a series of phrases in the texts. The state-of-the-art POS tagging and phrase parsing techniques can achieve over 90% accuracy. POS tagging and phrase parsing can be used in the content term extraction, i.e., determining important terms within a given piece of text. Specifically, after parsing phrases from the given content, the POS tagger helps to tag the terminological candidates, such as syntactically plausible terminological noun phrases. Then, the terminological candidates are further analyzed using statistical approaches (e.g., point-wise mutual information) to determine important terms.

Adversary model. In our research, we consider the adversary who tries to exploit legitimate websites for promoting unauthorized content. Examples of such content include unlicensed online pharmacies, fake AV, counterfeit, political agenda or Phishing sites. For this purpose, the adversary could inject ads or other content into the target sites to boost the search rank of the content he promotes or use sTLD sites as redirectors to monetize traffic.

III. SEISE: DESIGN

As mentioned earlier, promotional infections often do not propagate malicious payloads (e.g., malware) directly and instead only post ads or other content that legitimate websites may also contain. This makes detection of such attacks extremely difficult. In our research, we look at the problem from a unique perspective, the inconsistency between the malicious advertising content and the semantics of the website, particularly, what is associated with different sTLDs. More specifically, underlying SEISE is a suite of techniques that search sTLDs (.edu, .gov, etc.) using irrelevant bad terms (IBT) (the search terms unrelated to the sTLDs but heavily involved in malicious activities like Spam, Phishing) to find potentially infected FQDNs, analyze the context of the IBTs under those FQDNs to remove false positives and leverage detected infections to identify new search terms, automatically expanding the IBT list. Below we elaborate on this design.

A. Overview

Architecture. Figure 2 illustrates the architecture of SEISE, which includes Semantics Finder, Inconsistency Searcher, Context Analyzer and IBT Collector. Semantics Finder takes as its input a set of sTLDs, automatically identifying the keywords that represent their semantics. These keywords are compared with a seed set of IBTs to find the most irrelevant terms. Such selected terms are then utilized by Inconsistency Searcher to search related sTLDs for the FQDNs carrying these terms. Under each detected FQDN, Context Analyzer further evaluates the context of discovered IBTs through a differential analysis to determine whether, after removing stop words (i.e., the most common words like 'the') from the context, frequently-used terms identified there (e.g., the search result of site:stanford.edu pharmacy) become rare across the generic content of the FQDN (e.g., the search result of site:stanford.edu), which indicates that the FQDN has indeed been compromised. Such FQDNs are reported by SEISE and their snippets are used by IBT Collector to extract keywords. Those with the largest semantic distance from the sTLDs are added to the IBT list for detecting other infected FQDNs.

Example. To explain how SEISE works, let us take a look at the example at the beginning of the paper (Figure 1). For the sTLD .edu, SEISE first runs Semantics Finder to automatically extract keywords to profile the sTLD, e.g., "education", "United States" and "student". In the meantime, a seed set of IBTs, including "casino", "pharmacy" and others, are converted into vectors using the word-embedding technique. Their semantic gap with the .edu sTLD is measured by calculating the cosine distances between individual terms (like "pharmacy") and the sTLD keywords (such as "education", "United States" and "student"). It turns out that terms like "pharmacy" are among the most irrelevant (i.e., with a large distance from .edu). The term is then used to search Google under .edu, which shows the FQDN stanford.edu hosting the content with the search term. Under this FQDN, SEISE again searches for "pharmacy". The results page is presented in Figure 1. As we can see, many search result items (for different URLs) contain the same topic words, similar snippets and even URL patterns, which are typically caused by mass injection of unauthorized advertising materials. These items form the context for the IBT "pharmacy".

Our approach then converts the context (the result items) found into a high-dimensional vector, with the frequency of each word (except those common stop words like 'she', 'does', etc.) as an element of the vector. The vector, considered to be a representative of the context, then goes through a differential analysis: it is compared with the vector of a reference, the search results page of site:stanford.edu that describes the generic content under the FQDN. The purpose is to find out whether the context is compatible with the theme of the FQDN. If the distance between them is large, then we know that this FQDN hosts a large amount of similar text semantically incompatible with its theme (i.e., most of the highly frequent words in the suspicious text, such as "viagra", rarely appear in the common content of the FQDN). Also given the fact that such text is the context for the search terms irrelevant to the sTLD of the current FQDN but popular in promotional infections, we conclude that the FQDN stanford.edu is indeed compromised.

Once an infection is detected, the terms extracted from the context of "pharmacy" are then analyzed and those most irrelevant to the semantics of .edu are added to the IBT list for finding other compromised FQDNs. Examples of the terms include "viagra", "cialis", and "tadalafil". In addition to the words, the URL pattern of the infection is then generalized to detect other advertising targets (e.g., red wine) not included in the initial IBT list (e.g., those for promoting illegal drugs). The same technique can also be applied to find out compromised gTLDs like the .com FQDNs involved in the same campaign.

B. Semantics-based Detection

In this section, we present the technical details for Semantics Finder, Inconsistency Searcher and Context Analyzer.

Finding semantics for sTLDs. The first step of our approach is to automatically build a semantic profile for an sTLD. Such a profile is represented as a set of terms, which serve as an input to the Inconsistency Searcher for choosing the right IBTs. For example, the semantic representation of the sTLD .edu.cn could be "Chinese university", "education", "business school", etc. SEISE automatically identifies these terms from different sources using a term extraction technique. Specifically, the following two sources are currently utilized by our prototype:

• Wikipedia: the Wikipedia pages for sTLDs provide a comprehensive summary of different sTLDs. For example, https://en.wikipedia.org/wiki/.mil profiles the sTLD .mil, including its sponsor ("DoD Information System Agency"), intended use ("military entities"), registration restrictions ("tightly restricted to eligible agencies"), etc. In our research, we ran a crawler that collected the wiki pages for 80 sTLDs.

• Search results: the search results page for an sTLD query (e.g., site:gov) lists high-profile websites under the sTLD. As mentioned earlier, each search result includes a snippet of a website, which offers a concise but high-quality description of the website. Since the websites under the sTLD carry the semantic information of the sTLD, such descriptions can be used as another semantic source of the sTLD. Therefore, our approach collected the search result pages of all 403 sTLDs using automatically-generated queries in the form of "site:sTLD", such as site:edu. From each result page, the top 100 search results are picked up for constructing the related sTLD's semantic profile.

From such sTLD semantics sources, the Semantics Finder runs a content term extraction tool to automatically gather keywords from the sources. These keywords are supposed to best summarize the topic of each source and therefore represent the semantics of an sTLD. In our implementation, we utilized an open-source tool topia.termextract for this purpose. For each keyword extracted, our approach further calculates its frequency, which is assigned to the keyword as its weight. All together, the top 20 keywords are chosen for each sTLD as its semantics profile.

Fig. 3: Differential analysis of an injected site and a non-injected site.

(a) Differential analysis of an injected site. Cosine distance = 0.97
Query: site:mysau3.arbor.edu "casino"
  "title":"Online Casino by DewaCasino.com: Live Casino Online .", "snippet":"DewaCasino is a promoter casino best online with live dealers reliable, Fair and is one of the largest in Asia today. Join!"
  "title":"iGamble247.com :: Live Casino Online - Casino Agent", "snippet":"Igamble247 is a promoter casino best online with live dealers reliable, Fair and is one of the largest in Asia today. Join!"
  Term frequencies: bookmarkportlet:10, viewhandler:10, online:8, promoter:6, dealers:6, gambling:5, slot:5, roulette:5, ics:0, student:0, university:0, graduate:0, alumni:0, department:0, association:0, credit:0, center:0
Query: site:mysau3.arbor.edu
  "title":"Students - MySAU - Spring Arbor University", "snippet":"To print a certificate (proof) of enrollment or order a transcript, go to the National Student Clearinghouse site."
  "title":"Default Page - MySAU - Spring Arbor University" (…Default_Page.jnz), "snippet":"The Spring Arbor University Alumni Association exists to serve the University and its graduates by providing alumni with a continuing link among themselves and…"
  Term frequencies: bookmarkportlet:0, viewhandler:0, online:0, promoter:0, dealers:0, gambling:0, slot:0, roulette:0, ics:4, student:3, university:3, graduate:3, alumni:2, department:2, association:2, credit:2, center:2

(b) Differential analysis of a non-injected site. Cosine distance = 0.14
Query: site:www.unlv.edu "casino"
  "title":"Online Courses International Gaming Institute University", "snippet":"New online casino management classes are currently being developed by the Center for Professional & Leadership Studies at UNLV (PLuS Center). Please visit ."
  "title":"Casino Marketing for Industry Professionals International", "snippet":"Accreditation. You can earn Continuing Education Units (CEUs) upon successful completion of any of our online casino management courses. Please contact."
  Term frequencies: class:4, education:3, course:3, management:3, center:2, professional:2, unit:2, university:2, snack:0, amentity:0
Query: site:www.unlv.edu
  "title":"School of Social Work University of Nevada, Las Vegas", "snippet":"Behavioral Health Workforce Education and Training Program for Professionals. The UNLV School of Social Work, Masters Program has been awarded the…"
  "title":"Student Union University of Nevada, Las Vegas", "snippet":"Welcome. The Student Union offers conveniences and amenities for everyone, whether you need to grab a snack, hold a meeting, or just have some fun."
  Term frequencies: education:4, program:3, university:3, student:3, course:2, school:2, training:2, center:2, social:2

A problem is that among all 403 sTLDs, 71 of them are non-English ones, which include Chinese, Russian, French, Arabic, etc., 89 languages altogether. Analyzing these sTLDs in their native languages is complicated, due to the challenges in processing these languages: for example, segmenting Chinese characters into words is known to be hard. To solve this problem, we utilized Google Translate to convert the search page of a non-English sTLD query into English and then extract their English keywords. The approach was found to
work effectively, capturing non-English promotional infections (see Section V).

Searching for inconsistency. The Inconsistency Searcher is designed to find the IBTs with great semantic gaps with a given sTLD, and to use these terms to search the sTLD for suspicious (potentially compromised) FQDNs. To this end, we first selected a small set of seed IBTs as an input to the system. These IBTs were collected from spam trigger word lists and an SEO competitive word list, which contain popular terms used in counterfeit medicine selling, online gambling and Phishing. From those terms, the most irrelevant ones are picked for analyzing a given sTLD. Such terms are found by comparing them with the semantics profile of the FQDN, that is, the set of keywords output by the Semantics Finder.

Specifically, such a semantic comparison is performed by SEISE using a word-embedding tool called word2vec, a neural network that builds a vector representation for each term by learning from the context in which the term occurs. In our research, we utilized the English Wikipedia pages as the context for each term to compute its vector and measure the distance between two words using their vectors. In this way, the IBTs irrelevant to a given sTLD can be found and used to search under the FQDN for detecting the suspicious ones. The approach works as follows:
• We downloaded all 30 GB of Wikipedia pages and ran a program to preprocess those pages by removing tables and images while preserving their captions. Individual sentences on the pages were further tokenized into terms using a phrase parser.
• Given an input term (an IBT or a keyword in the sTLD's semantics profile), our approach runs word2vec to train a skip-gram model, which maps the term into a high-dimensional vector $\langle d_1, d_2, \ldots, d_i, \ldots \rangle$ that describes the term's semantics. This vector is generated from all the sentences involving the term, with individual elements describing the term's relations with other terms in the same sentence across all such sentences in the Wikipedia dataset.
• Given the vectors of an IBT and an sTLD keyword, our approach measures the semantic distance between them by calculating the cosine distance between their vectors. For each IBT, its average distance to all the keywords is used to determine its effectiveness in detecting promotional infections.

In our research, we found that when the distance is 0.6 or more (at least 20 terms within our seed set still satisfy this), almost no compromised site is missed (see Figure 5(a) in Section V). The IBTs selected according to such a threshold are then sent to the search engine together with the sTLD through the query site:sTLD+IBT (e.g., site:edu casino). From the search result page, the top 100 items (URLs) are further inspected by the Context Analyzer to determine whether the related FQDNs are indeed compromised, as detailed below.

As an example, again, let us look at Figure 3: in this case, the IBT "casino" has a distance of 0.72 with regard to the semantics of .edu and therefore was run under the sTLD; from the search pages, the top FQDNs, including mysau3.arbor.edu and www.unlv.edu, were examined to detect compromised FQDNs.

Fig. 4: IBT SET Extension. The process to find IBTs in a new category consists of five steps: injected URLs are collected to find the injected directory path (1). Then, the injected directory path is used as the search keyword, i.e., site:www.lgma.ca.gov/play, to list more search result items (2). After fetching search result snippets (3), critical terms are extracted (4), and those that show semantic irrelevance are filtered for clustering (5). Once a new cluster is formed, we manually check and label it with its semantics.

Analyzing IBT context. As mentioned earlier, even the terms most irrelevant to an sTLD could show up on some of its pages for a legitimate reason. For example, the word 'casino' has a significant semantic distance from the sTLD .edu, which does not mean, however, that .edu sites cannot carry a poster about one's travel to Las Vegas or a research article about a study on the gambling industry. Actually, a direct search of the term site:edu casino yields a result page with some of the items being legitimate. To identify the compromised FQDNs, the Context Analyzer automatically examines the individual FQDNs on the result page, using a differential analysis (Figure 2) to detect those truly compromised.

More specifically, the differential analysis involves two independent queries, one on the suspicious FQDN together with the IBT (e.g., site:life.sunysb.edu casino) and the other on the FQDN alone (e.g., site:life.sunysb.edu), whose results page serves as the reference. The idea is based on the observation that in a promotional infection, the adversary has to post similar text on many different pages (sometimes pointing to the same site) for promoting similar products or content. This
is necessary because the target site's rank needs multiple highly-ranked pages on the compromised site to promote. The problem for such an attack is that the irrelevant content, which is supposed to rarely appear under the FQDN, becomes anomalously homogeneous and pervasive under a specific IBT. As a result, when we look at the search results of the IBT under the FQDN, their URLs and snippets tend to carry the words rarely showing up across the generic content (i.e., the reference) with much higher frequencies than their accidental occurrences under the FQDN. On the other hand, in the case of legitimate content including the IBT, the search results (for the IBT under the FQDN) will be much more diverse, and the words involved in the IBT's context often appear on the reference and are compatible with the generic content of the site; even for the irrelevant terms in the context, their frequencies tend to be much lower than those in the malicious context. This is because it is unlikely that a term irrelevant to the theme of the site accidentally appears in similar context across many pages, which introduces an additional set of highly-frequent irrelevant terms. As an example, let us look at Figure 3(a), which shows a compromised FQDN, and Figure 3(b), which illustrates a legitimate FQDN. The highly-frequent words extracted from the former under the IBT 'casino', such as 'bookmarkportlet', 'dealers' and 'slot', never show up across the URLs and snippets of the reference that represents the generic content of the FQDN (the result of the query site:mysau3.arbor.edu). In contrast, a query of the legitimate FQDN using the same IBT yields a list of results whose URLs and snippets have highly diverse content, with some of their words also included in the generic content, such as 'class', 'education' and 'university', and others (except the IBT itself) occurring infrequently.

To compare the two search result pages for identifying the truly compromised site, the Context Analyzer picks the top 10 search results from each query and converts them into a high-dimensional vector. Specifically, our approach focuses on the URL and the content snippet of each result item. We segment them into words using delimiters such as space, comma, dash, etc., and remove stop words (extremely common words like 'she', 'do', etc.) using a stop word list. In this way, each search item is tokenized and the frequency of each token across all 10 results is calculated to form a vector $V = \langle w_0, w_1, \ldots, w_i, \ldots \rangle$, where $w_i$ is the frequency of the word corresponding to that position. For the two vectors $V_b$ (the search page under the IBT) and $V_g$ (the reference, that is, the search page of the FQDN without the IBT), SEISE calculates their Cosine distance: $1 - \frac{V_b \cdot V_g}{\lVert V_b \rVert \, \lVert V_g \rVert}$. In Figure 3(a), the distance of the vector for the IBT 'casino' from the reference vector is 0.97. In Figure 3(b), where the FQDN is not compromised, we see that the vector under the IBT 'casino' is much closer to that of the reference, with a distance of 0.14. In our research, we chose 0.9 as a threshold to parameterize our system: whenever the Cosine distance between the results of querying an FQDN under an IBT and the reference of the FQDN goes above the threshold, the Context Analyzer flags it as infected. This approach turns out to be very effective, incurring almost no false positives, as elaborated in Section IV.

Discussion. SEISE is carefully designed to work on search result pages instead of the full content of individual FQDNs. This is important because the design helps achieve not only high performance but also high accuracy. Specifically, a semantic analysis on a small amount of context information (title, URL and snippet of a search result) is certainly much more lightweight than one on the content of each web page. Also interestingly, focusing on such context helps avoid the noise introduced by the generic page content, since the snippet of each search result is exactly the text surrounding an IBT, the part of the web page most useful for analyzing the suspicious content it contains. In other words, our approach leverages the search engine to zoom in on the context of the IBT, ignoring most unrelated content on the same web page.

C. IBT SET Extension
A critical issue for the semantic-based detection is how to obtain high-quality IBTs. Those terms need to be malicious and irrelevant to the semantics of an sTLD. Also importantly, they should be diverse, covering not only different keywords the adversary may use in a specific category of promotional infections, like unlicensed pharmacy, but also those associated with the promotional activities in different categories, such as gambling, fake product advertising, academic cheating, etc. Such diversity is essential for the detection coverage SEISE is capable of achieving, since a specific type of promotional attack (e.g., fake medicine) cannot be captured by a wrong IBT (e.g., 'gambling').

As mentioned earlier, the seed IBT set used in our research includes 30 terms, which were collected from several sources, including spam trigger word lists and an SEO competitive word list. These IBTs are associated with attacks such as blackhat SEO, fake AV and Phishing. To increase the diversity of the set, SEISE expands it in a largely automated way, both within one category and across different categories. More specifically, our approach leverages NLP techniques to gather new IBTs from the search items reported to contain malicious content, and further clusters these IBTs to discover new categories. Here we elaborate on this design.

Finding IBTs within a category. Once a compromised FQDN has been identified using an IBT, the search results that lead to the detection (for the query "site:FQDN+IBT") can then be used to find more terms within the IBT's category. This is because the result items are the context of the IBT, and therefore include other bad terms related to the IBT. Specifically, similar to the Semantics Finder, the IBT Collector runs the term extraction tool on each result item, including its title, URL and snippet, to gather the terms deemed important to the context of the IBT. Such terms are further inspected, automatically, against the semantics of an sTLD by measuring their average distances from the keywords of the FQDN (that is, converting each of them into a vector using word2vec and then calculating the Cosine distance between two vectors). Those sufficiently far from the FQDN's semantics (with a distance above the aforementioned threshold) are selected as IBTs.

Finding new categories. Extracting keywords from the context of an IBT can only provide us with new terms in the same category. To detect the infections in other categories, we have to extend the IBT set to include the terms in other types of illicit promotions. The question is how to capture new keywords such as 'prescription-free antibiotic' that are distinct from the IBTs in a known category, such as 'gambling', 'casino', etc. A key observation we leveraged in our study is that the adversary sometimes compromises an FQDN to perform multiple types of advertising: depending on the search terms the user enters, an infected website may provide different kinds of promotional content, for drug, alcohol, gambling and others. Further, the ads serving such a purpose are often deposited under the same directory, along the same path under a compromised FQDN. This enables us to exploit the URL included in a contaminated result item (as detected by SEISE) to find the promotional materials unrelated to the context of the IBT in use.

Specifically, from each flagged FQDN, the IBT Collector first picks all the URLs leading to malicious content, and from them, identifies the most commonly shared path under the FQDN. For example, from the URLs www.lgma.ca.gov/play/popular/1*.html, www.lgma.ca.gov/play/home/2*.html and www.lgma.ca.gov/play/club/3*.html (detected using the IBT 'casino'), the shared path under the FQDN is www.lgma.ca.gov/play. Using this path, our approach queries Google again with 'site:FQDN+path': e.g., site:www.lgma.ca.gov/play. From the results page of the query, critical terms are extracted by analyzing snippets under individual result items. These terms are further compared with the semantics of the current sTLD: those most irrelevant (with a cosine distance above the threshold 0.9) are kept. Finally, the vectors of these terms are clustered using the classic k-Nearest-Neighbor (k-NN) algorithm (with k = 10) together with all existing IBTs. Once a new cluster is formed in this way, we manually look at the cluster and label it with its semantics (gambling, drug selling, academic cheating, etc.). Note that this manual step is just for labeling, not for adjusting the clustering outcomes, which were found to be very accurate in our research (Section IV-C).

In the above example, as illustrated in Figure 4, the query site:www.lgma.ca.gov/play leads to the search results page. From the items on the page, the IBT Collector automatically recovers a set of critical terms, including 'goldslot', 'payday loan', 'cheap essay' and others. After clustering these terms, some of them are classified into existing categories such as gambling, drug, etc., while the rest are grouped into a new cluster, containing 'cheap essay' and 'free term paper' along with 15 other terms. This new cluster is found to be indeed a new attack category, and is labeled as 'academia cheating'. In our research, we ran the approach to extend our IBT set from 30 terms to 597 effective terms, and from 3 categories (gambling, drug, etc.) to 10 large categories (financial, cheating, politics, etc.). Our manual validation shows that the results are mostly correct.

IV. IMPLEMENTATION AND EVALUATION
In this section, we report our implementation of SEISE and the evaluation of its efficacy. Our study shows that the simple semantics-based approach works well in practice: it automatically discovered IBTs, achieved a low false detection rate (1.5%) at over 90% coverage and also captured 75% infected domains never reported before (Section IV-C).

A. Implementing SEISE
The design of SEISE (Section III) was implemented into a prototype system, on top of a set of building blocks. Here we briefly describe these nuts and bolts and then show how they are assembled into the system.

Nuts and bolts. Our prototype system was built upon three key functional components: term extractor, static crawler and semantic comparator. Those components are extensively reused across the whole system, as illustrated in Figure 2. They were implemented as follows:
• Term extractor accepts text as its input, from which it automatically identifies critical terms. The component was implemented in Python using an open-source tool, topia.termextract.
• Static crawler accepts query terms, looks for the terms through search engines and returns results with a pre-determined number of items. In our implementation, the crawler was developed in Python and utilized the Google Web Search API and the Bing Search API to get search results.
• Semantic comparator accepts a set of terms and compares them with the keywords of an input sTLD. It can return the average distance of each term from those keywords or the terms whose distances are above a given threshold. This component was implemented as a Python program that integrates the open-source tool word2vec. As mentioned earlier, we trained the language model used by word2vec with the whole Wikipedia dataset, from which our implementation automatically collected the context for each term before converting it to a high-dimensional vector.
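The two cosine-distance computations used by these components (the semantic comparator's average distance to the sTLD keywords, and the Context Analyzer's differential analysis from Section III) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the prototype's code: the function names, the tiny stop-word list, the toy embeddings passed in by the caller, and the choice to tokenize only the path part of each result URL are all hypothetical. The thresholds 0.6 and 0.9 are the values reported in the text.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Tiny illustrative stop-word list (assumption; the paper uses a published list).
STOP_WORDS = {"a", "an", "and", "at", "do", "for", "in", "is", "of", "she", "the", "to"}


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for sparse vectors given as {key: weight} mappings."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def average_distance(term, keywords, embeddings):
    """Average cosine distance from one term to all sTLD keywords.

    `embeddings` maps a word to a dense list of floats (e.g. word2vec output).
    """
    term_vec = dict(enumerate(embeddings[term]))
    dists = [cosine_distance(term_vec, dict(enumerate(embeddings[k]))) for k in keywords]
    return sum(dists) / len(dists)


def select_ibts(terms, keywords, embeddings, threshold=0.6):
    """Keep the terms whose average distance to the sTLD keywords is >= threshold."""
    return [t for t in terms if average_distance(t, keywords, embeddings) >= threshold]


def tokenize(text):
    """Split on non-alphanumeric delimiters and drop stop words."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t and t not in STOP_WORDS]


def frequency_vector(result_items, top_n=10):
    """Token frequencies over the URL path and snippet of the top-N result items."""
    counts = Counter()
    for item in result_items[:top_n]:
        # Assumption: the host is dropped because it is identical in both queries.
        path = urlparse(item["url"]).path
        counts.update(tokenize(path + " " + item["snippet"]))
    return counts


def is_infected(ibt_results, reference_results, threshold=0.9):
    """Flag the FQDN when the IBT results are far from the reference results."""
    v_b = frequency_vector(ibt_results)        # results of site:FQDN IBT
    v_g = frequency_vector(reference_results)  # results of site:FQDN
    return cosine_distance(v_b, v_g) > threshold
```

Each result item is assumed to be a dictionary with "url" and "snippet" keys; a real crawler would fill these from the search API response.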
System building. Using these building blocks, we constructed the whole system as illustrated in Figure 2. Specifically, the Semantic Finder was developed to run the static crawler to gather the content under an sTLD and then call the term extractor to identify the keywords for the domain. The Inconsistency Searcher invokes the semantic comparator to determine the most irrelevant IBTs before using the crawler to search for the terms. The Context Analyzer includes a differential analyzer component implemented with around 300 lines of Python code. For each suspicious FQDN, the analyzer calls the crawler to query the search engine twice, once under an IBT and once for getting the reference (the generic content). It reports the domains considered to be compromised. Finally, the IBT Collector uses the crawler to search for the selected URL path under the detected domain, then the extractor to get critical terms from the search results and the semantic comparator to find new IBTs. Over these IBTs, we further integrated the k-NN module provided by the scikit-learn open source machine learning library to cluster them and discover new bad-term categories.

B. Experiment Setting
Data collection. To evaluate SEISE, we ran our prototype on three datasets: the labeled bad set and good set, and the unknown set including 100K FQDNs collected from search engines, using 597 search terms, as explicated below.
• Bad set. We collected the FQDNs confirmed to have promotional infections from CleanMX, a blacklist of compromised URLs. A problem here is that these URLs are associated with different kinds of malicious activities and it is less clear whether they are promotional infections. What we did was to collect all the sTLD URLs from the CleanMX feed from 2015/07 to 2015/08, and further manually inspect all these URLs. Specifically, whenever we saw advertising, Phishing or defacement content showing up in the search results of a URL, it was considered to be exploited for promotional infections. We further classified these URLs into different categories and also manually identified related IBTs. In this way, we built a bad set with 300 FQDNs (together with 15 IBTs in three categories).
• Good set. Using the IBTs collected from the bad set, we further searched under the sTLDs for the FQDNs ("site:sTLD+IBT") that contained those terms but were not compromised. These domains were used to understand the false detections that could be introduced by SEISE. Altogether, we collected a good set of 300 FQDNs related to 15 IBTs and three categories.
• Unknown set. As mentioned in Section II, we gathered 403 sTLDs and manually selected 30 IBTs in three categories. Running these IBT seeds on these sTLDs, we crawled Google and Bing over three months, collecting 100K FQDNs. This dataset was used as the unknown set for discovering new infections.

Fig. 5: Evaluation results on good set and bad set. (a) False detection rate in different semantics distances; the color bar shows the coverage rate. (b) False positive rate in different semantics distances; the color bar shows the coverage rate.

Resources and validation. In all our experiments, our prototype system was run within Amazon EC2 C4.8xlarge instances equipped with Intel Xeon E5-2666 36 vCPU and 60GiB of memory. To collect the data for the unknown set, we deployed 20 crawlers within virtual machines with different IP settings. These crawlers utilized the APIs provided by Google and Bing to dump the outcomes of the queries, from 2015/08 to 2015/10.

To validate the findings made on the unknown set, we employed a methodology that combined anti-virus (AV) scanning, blacklist checking and manual analysis. Specifically, for the FQDNs reported by our system, we first scanned their URLs with VirusTotal and considered that the URLs were indeed suspicious when at least two scanners flagged the domain. Then, all such suspicious URLs were cross-checked against the blacklist of CleanMX. For those confirmed by both VirusTotal and CleanMX, their FQDNs were automatically labeled as compromised. For other domains also detected by SEISE, we randomly sampled 20% of them and manually checked whether they were indeed compromised.

C. Evaluation Results
Over the aforementioned datasets, we thoroughly evaluated our prototype. Our study shows that SEISE is highly effective: it achieved near zero False Detection Rate (FDR, i.e., FP/(FP+TP)) and over 90% coverage (i.e., TP/(TP+FN)), or below 4.7% FDR, 4.4% False Positive Rate (FPR, i.e., FP/(FP+TN)) and nearly 100% coverage on the labeled sets (the bad and good set); with the threshold chosen to balance FDR and FPR, we further ran SEISE over the unknown set, which reported over 11K compromised sites, with an FDR of 1.5% and a coverage over 90%. Also importantly, 75% of the infections discovered from the unknown set have likely never been reported before, including 3 large-scale campaigns, on which we elaborate in Section V. All these findings were made in a highly efficient and scalable way: on average, only 2.3 queries were made for finding a new compromised FQDN, and the delay caused by analyzing the query results and other computing resources consumed for this purpose were exceedingly low.

Accuracy and coverage. We evaluated the accuracy and the coverage of SEISE under a given set of IBTs. In this case, what can be achieved depends entirely on the Context Analyzer, which ultimately decides whether to flag an FQDN as compromised. In our research, we first studied our system over the labeled good set and bad set, and then put it to test over the unknown set. Figure 5(a) and 5(b) illustrate the results over the labeled sets, in response to different thresholds for semantic distances (between the reference and the query of an IBT). As we can see here, when the threshold goes up, the FDR goes down and so does the coverage. On the other hand, loosening the threshold, which means that the IBT is becoming less irrelevant to the semantics of the sTLD, improves the coverage, at the cost of the FDR. Overall, the results show that SEISE is highly accurate: by setting the threshold to 0.9, we observe almost no false detection (FDR: 0.5% and FPR: 0.4%) with 92% coverage; alternatively, if we can tolerate 4.7% FDR (FPR: 4.4%), the coverage becomes close to 100%. In our research, the threshold 0.9 was then utilized to analyze the unknown set.

On the unknown set, we ran SEISE to query 597 IBTs under 403 sTLDs. Our prototype inspected 100K FQDNs in total. 11,473 of them were flagged as compromised, about 11% of the whole unknown set. Table II and Table III summarize our findings, which are further discussed in Section V. Among all that were detected, 3% were confirmed by both VirusTotal and CleanMX, 22% were found by at least one of these two AV systems and further validated manually, and 1000 of the remaining were inspected manually. Altogether, the FDR measured from the unknown set is as low as 1.5%. We further randomly sampled 500 result pages related to 10 categories of IBTs and found that our prototype reported 53 infections and missed 5, which indicates a coverage of about 90%. Also, note that over 75% of the infections have never been reported (missed by both VirusTotal and CleanMX). We have reported the most prominent ones among them to related organizations and are helping them fix the problem, and will continue to work on other cases.

IBT expansion. The effectiveness of SEISE also relies on its capability to discover new IBTs and find new attack instances across different categories. As discussed before, our prototype starts with a small set of seed IBTs, 30 terms in three categories. After searching for all these terms under all the sTLDs, a set of compromised FQDNs is detected, which are further used by the IBT Collector to extract new terms for searching all 403 sTLDs again. In our research, we repeated such an iteration 20 times, expanding the IBT set to 597 terms and 10 categories. All the terms and categories were manually confirmed to be correct. Table I presents the numbers for the terms and the categories, together with examples of new terms detected, after the 1st, 5th, 10th, 15th and 20th iterations. As we can see here, the number of categories and the number of IBTs increase quickly (with an increase rate of 60% and 180%, respectively) in the first 10 iterations, which indicates that our IBT expansion method is efficient for both in-category and cross-category expansion. Also, Table III illustrates the total categories of IBTs flagged by SEISE after these iterations.

TABLE I: Number of IBTs in each round (# of IBTs per category).

Performance. We further evaluated the performance of our prototype, in an attempt to understand the scalability of our design. We found that except for the delay caused by receiving the results from Google, the overhead for analyzing search results and detecting compromised sites is exceedingly low: by running 10000 randomly selected queries (50 IBTs over 200 sTLDs), we observed that the average time for analyzing 1K result items, excluding the waiting time for the search engine, was 1ms, and the memory and CPU usages each stayed below 5%. The main hurdle here is the delay caused by the search engine: for Google, it ranged from 5ms to 8ms per one thousand queries. The design of SEISE already limits the number of queries that need to be made for detecting infected FQDNs: in the experiments, we found that on average, a compromised FQDN was detected after 2.3 term queries. We believe that by working with the search provider (Google, Bing, etc.), SEISE can be easily scaled with a quick turnaround of the search results.

Based upon what was detected by SEISE, we performed a measurement study to understand the promotional infections on sTLDs, particularly the semantic inconsistency these attacks introduce. Our study brings to light the pervasiveness of the attacks and their significant impacts, affecting the websites of leading academic institutions and government agencies around the world. Further discovered are a set of surprising findings and their insights, which have never been known before. For example, apparently sTLDs are soft targets for promotional infections, highly ranked and also easier to compromise compared with gTLD sites of similar ranks; as a result, by mitigating the threats to the sTLD domains, we raise the bar for the adversary, depriving him of easy access to the resources highly valuable to the promotional attacks, which rely on the compromised site's rank to boost the rating of malicious content. As another example, we show that semantic inconsistency can also be observed in the promotional infections on gTLDs such as .com, .net, etc., even though these domains tend to have a much more diverse semantic meaning. Based upon this observation, a preliminary exploration highlights the potential of extending our approach to protect gTLD sites, indicating that a semantic model can also be built for some websites under the gTLD domains to capture the promotional attacks on them. Finally, we elaborate on a study of some prominent attack cases discovered in our research, which, from the semantic perspective, analyzes the techniques the adversary employs in the promotional infections.

A. Landscape

TABLE II: Top 10 sTLDs with most injected domains.
FQDN: 1,840; URL: 172,244
FQDN: 312; URL: 22,543
FQDN: 250; URL: 29,580
FQDN: 403; URL: 34,308
FQDN: 223; URL: 21,563
FQDN: 253; URL: 23,022
FQDN: 178; URL: 15,720
FQDN: 163; URL: 14,572
FQDN: 172; URL: 12,034
FQDN: 144; URL: 11,056

TABLE III: Categories of IBTs (example terms per category).
casino, slot machine
cheap xanax, no prescription
nike air max, green coffee bean
fake driving permit, cheap essay
payday loan, quick loan
cheap airfare, hotel deal
cheap gucci, discounted channel
free download, system app
islamic state, falun gong

Fig. 6: Cumulative distribution of injected sTLD sites' Alexa rank and Top 20 injected sTLD sites with highest Alexa rank. Labeled sites include ca.gov (Alexa: 649), princeton.edu (Alexa: 3558), nih.gov (Alexa: 196), gmu.edu (Alexa: 8058) and tsinghua.edu.cn (Alexa: 6717).

Scope and magnitude. Our study reveals that the promotional infections are spread across the world, compromising websites
in all kinds of sTLDs. Altogether, SEISE detected around 1 million URLs leading to malicious content on 11,473 infected FQDNs under 9,734 sTLD domains. The results are summarized in Table II and Table III.

To understand the magnitude of the threat towards individual sTLDs, we studied the ratio of compromised FQDNs under each domain category. For this purpose, we first tried to get some idea about how many FQDNs are under each sTLD, using the passive DNS dataset from DNSDB. The dataset includes the records of individual DNS RRsets as well as first-seen, last-seen timestamps for each domain and the DNS bailiwick from Farsight Security's Security Information Exchange and the authoritative DNS data. The number of FQDNs under an sTLD was estimated from those under the sTLD queried between 2014/01 and 2015/08, as reported by the passive DNS records. The results were further cross-validated by comparing them with the estimated domain counts given by DomainTools for each TLD.

Table II illustrates the top-10 sTLD with the largest number of infected domains, together with the number of domains we monitored and the total number of domains we estimated for each sTLD. According to our findings, gov.cn is the least protected sTLD with a significant portion of the FQDNs compromised (12%), which is followed by edu.vn 3% and edu.cn 3%. The top-3 sponsoring registrars with the most infected gov.cn sites are sfn.cn, alibaba.com, xinnet.com. On the other hand, .mil sites apparently are better protected than others. Among the 456 .mil domains we monitored, only 8 domains are injected.

Figure 7 describes the distributions of the compromised sTLD sites across 141 countries, as determined by their geolocation. Based upon the number of infected domains, countries are colored with different shades of blue. As we can see here, most of infected sites are found in China (15%), followed by United States (6%) and Poland (5%).

Impacts of the infections. We further looked into the Alexa ranks of injected sTLD websites, which are presented in Figure 6. Across different sTLDs, highly ranked websites were found to be exploited, getting involved in various types of malicious activities, SEO, Phishing, fake drug selling, academic cheating, etc. Figure 6 illustrates the cumulative distributions of the ranks: a significant portion of the infections (75%) actually happen to those among the top 1M. Figure 6 further shows the top-20 websites with the highest Alexa ranks. Among them, 12 are under .edu, including the websites of leading institutions like mit.edu (Alexa:789), harvard.edu (Alexa:1034), stanford.edu (Alexa:1050) and berkeley.edu (Alexa:1452), and 7 under .gov, such as nih.gov (Alexa:196), state.gov (Alexa:719) and noaa.gov (Alexa:1126). In general, China is the country that hosts most injected sTLD sites; however, when it comes to top ranked sites (Alexa rank < 10K), 67% of them are in the United States and Australia.
Also interesting is the types of malicious activities in which those domains are involved. Table III shows the number of the Fig. 8: The distribution of the infection time.
domains utilized for promoting each type of content (acrossall 10 categories). As we can see here, most of the injectedsTLD sites (19%) are in the Gambling category, which is between the semantics of the promoted content and that of an followed by those related to Drug (15%) and General Product infected domain's generic content: in our labeled bad set (the (14%) such as shoes and healthcare products. When we look collection of compromised domains reported by CleanMX; see at the top-20 domains, many of them are infected to promote Section IV-B), all sTLD-related infections contain the malicious Drug. Also, many .edu domains advertise unlicensed pharmacy, content inconsistent with the semantics of their hosting websites.
while .gov are mainly compromised to promote gambling and The implication of this observation is that by exploiting this fake AV. Interestingly, the injected domains associated with feature, a weakness of the sTLD-based promotional infections, different countries tend to serve different types of content. For a semantic-based approach, like SEISE, can effectively suppress example, the most common promotions on Chinese domains such a threat to sTLDs. This is signiﬁcant, since our study, are gambling (which is illegal in that country), while most as elaborated below, shows that sTLDs are valuable to the injected US domains are linked to unlicensed online pharmacy.
adversary because they are less protected and highly ranked.
Since the infected country code sTLDs (e.g., .cn) can make the Further, even for gTLDs, which tends to have highly diverse content they promote more visible to the audience in related and less speciﬁed semantics, the malicious content uploaded countries (e.g., boosting the ranks of malicious sites in the there also tends to be incompatible with the compromised results of country-related searches), it is likely that promotional websites' themes. This indicates that our approach can be infections target speciﬁc groups of Internet users, just like applied beyond sTLDs. Following we report our ﬁndings.
sTLD as a soft target. To understand the importance of Our study further shows that many of such infections have sTLDs to the adversary, we compared the compromised sTLD been there for a while. Figure 8 shows the distribution of the sites with those under the gTLDs, within the same attack infection time for the injected page in sTLD sites. We estimated campaign. A campaign here includes a set of websites infected the durations of their infections by continuously crawling the for promoting unauthorized or malicious content and those sites 20K injected pages (which were detected in 2015/08) every share a set of common features, speciﬁcally, they all pointing two days from 2015/08 to 2015/11 to ﬁnd out whether they to the same target site being advertised, their malicious URLs were still alive. As we can see from the ﬁgure, most infections having the same features (such as same afﬁliate ID as URL last 10-20 days, while some of them have indeed been there parameter) and they all share the same redirection chain. In our for a while, at least 1 months. A prominent example is the research, we discovered a campaign through infected websites' injection on ca.gov, whose infection starts no later than 60 "link-farm" structure, i.e., a compromised site pointing to another one. Following the links on the compromised sTLDsites enabled us to reach a set of infected gTLD sites, mainly B. Implications of Semantics Inconsistency under .com. We then compared the features of those sites with Our study shows that promotional infections, particularly those of sTLD domains, in terms of Alexa rank, pagerank for those under sTLDs, are characterized by the inconsistency (PR) and lifetime, in an attempt to ﬁnd out what type of TLD domains are more valuable to promotional infections.
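The campaign definition above (same advertised target, same affiliate ID, same redirection chain) can be illustrated with a small grouping routine. This is only a minimal sketch under stated assumptions, not the procedure used in the study: the record layout, the `aff` parameter name, the helper names `campaign_key` and `group_campaigns`, and all hostnames (drawn from the reserved `.example` TLD) are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs


def campaign_key(record, affiliate_param="aff"):
    """Build a grouping key from the three shared features named in the text.

    `record` is a hypothetical dict with:
      infected_site  - hostname of the compromised site
      malicious_url  - the injected URL found on that site
      redirect_chain - list of URLs visited after following malicious_url
    """
    chain_hosts = tuple(urlparse(hop).hostname for hop in record["redirect_chain"])
    target_host = chain_hosts[-1]  # last hop is taken as the advertised site
    query = parse_qs(urlparse(record["malicious_url"]).query)
    affiliate_id = query.get(affiliate_param, [None])[0]
    return (target_host, affiliate_id, chain_hosts)


def group_campaigns(records):
    """Map each campaign key to the set of infected sites that share it."""
    campaigns = defaultdict(set)
    for record in records:
        campaigns[campaign_key(record)].add(record["infected_site"])
    return dict(campaigns)


# Illustrative records only; none of these hosts come from the paper.
records = [
    {
        "infected_site": "cs.university.example",
        "malicious_url": "http://cs.university.example/p.php?aff=1001",
        "redirect_chain": ["http://hop.tracker.example/r", "http://shop.store.example/"],
    },
    {
        "infected_site": "agency.government.example",
        "malicious_url": "http://agency.government.example/x.asp?aff=1001",
        "redirect_chain": ["http://hop.tracker.example/r", "http://shop.store.example/"],
    },
    {
        "infected_site": "blog.company.example",
        "malicious_url": "http://blog.company.example/y.php?aff=2002",
        "redirect_chain": ["http://other.tracker.example/r", "http://pills.store.example/"],
    },
]

campaigns = group_campaigns(records)
for key, sites in campaigns.items():
    print(key[0], key[1], sorted(sites))
```

Requiring all three features to match is the strictest reading of the definition; a looser grouping (for example, target site only) would merge more sites per campaign at the cost of more accidental matches.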
Table IV presents the top-3 campaigns (all organized as link farms) discovered in our study. The largest one covers about 872 sTLDs and 3426 gTLDs across 12 countries and regions (US, China, Taiwan, Hong Kong, Singapore and others). Among the victims are 20 US academic institutions such as nyu.edu and ucsd.edu, 5 government agencies like va.gov and makinghomeaffordable.gov, together with 188 Chinese universities and 510 Chinese government agencies. Also among the victims are 1507 .com sites. Figure 9(a) and Figure 9(b) compare the Alexa global ranks and the page rank (PR) of those gTLD and sTLD websites. As we can see from the figures, 50%-75% of sTLD sites are ranked within the Alexa top 1M, while only 10%-30% of gTLD sites are at this level. Actually, more than 40% of the gTLD sites have an Alexa rank outside the top 5M. By comparison, less than 20% of sTLDs have ranks outside the top 5M. In terms of PR, more than 30% of the sTLD sites have PR from 4 to 6, while less than 5% of gTLD sites are PR4-PR6. Also, more than half of gTLD sites have a PR of 0 and thus weaker SEO effectiveness than those with high PR. This indicates that the majority of sTLD sites have a stronger effect on the promoted sites than gTLD sites with no or low PR.

Fig. 7: The geolocation distributions of the compromised sTLD sites across 141 countries.

TABLE IV: Top 3 link-farm campaigns with most injected sTLD

We further compared the durations of the infections for these two types of domains. Again, we continuously crawled the compromised pages (identified in 2015/08-2015/09) every two days from 2015/09 to 2015/11 to check whether the infections were still there. Figure 9(c) illustrates the distributions of the sTLD sites' life spans and those of gTLD sites. As can be seen from the figure, gTLD sites were cleaned up more quickly than the sTLD sites. Over 25% of the gTLD sites were cleaned within 10 days, while 12% of the sTLD sites were cleaned within 10 days.

Fig. 9: Alexa global rank, PR and life span of sites in three campaigns, and cumulative distribution of semantics distance per monitored sites. Panels: (a) Alexa global ranks per site in 3 campaigns; (b) Alexa bounce rate per site in 3 campaigns; (c) distribution of the infection time for the injected pages in sTLD and gTLD sites; (d) cumulative distribution of semantics distance per monitored sites.

Our study demonstrates that the sTLDs are ranked higher than the gTLD sites and much more effective in elevating the ranks of promoted content, and are thereby more valuable to promotional infections. In the meantime, they are less protected than the gTLDs: once compromised, the infections will stay there for a longer period of time. This indicates that, indeed, the sTLDs are valuable assets to the adversary, and effective protection of these sites, as SEISE provides, makes the promotional attacks less effective.

Extension to gTLDs. Compared with sTLDs, gTLDs (e.g., .com, .net and .org) do not have fixed semantic meanings. However, we found that the malicious content injected here still tends to be incompatible with the semantics of the sites, which can be captured by the search engine results. Figure 10 presents an example of search engine results for an injected gTLD site, iceriversprings.com, which is the website of the Ice River Green brand of bottled water. However, the injected page shows semantically inconsistent content for "payday loan".

Fig. 10: Example of search engine results of an injected gTLD

Then, we measure the semantics inconsistency on the 3,000 gTLD sites, which are randomly sampled from the aforementioned campaigns. Specifically, we use the Context Analyzer component in SEISE to calculate the semantic distance between the generic content of those known injected sites (the reference, e.g., the search result of the query site:iceriversprings.com) and the results of querying IBTs on these sites, which mostly contain injected malicious content (e.g., site:iceriversprings.com "payday loan"). However, we also found that some compromised gTLD sites show semantics consistent with the promotional content. For example, the online drug library druglibrary.org (in Campaign 3) was injected to promote "cheap xanax". Hence, to identify those suspicious sites (before they are checked with the Context Analyzer), we utilized the similarsites website query API to fetch the site tags (e.g., "recycling" and "water" for site:iceriversprings.com) to determine a gTLD site's semantics, and only use the gTLD sites showing semantic inconsistency with the IBT (i.e., the site's tags are semantically distant from the IBT) as the suspicious candidates for the input of the Context Analyzer. This filtering step (for the purpose of increasing the "toxicity level" of the inputs) is built as the Semantic comparator, which accepts the threshold for the IBT semantics distance (Section III-B) and outputs the candidate gTLD sites that have great semantic distances from the IBT used for the query. For example, iceriversprings.com, whose site tags "recycling" and "water" show semantic inconsistency (determined by the Semantic comparator, Figure 2) with the IBT "payday loan", will be regarded as a suspicious FQDN and become the input of the Context Analyzer.

Figure 9(d) shows the semantic distances between the reference and the search results of querying an IBT with and without the Semantic comparator. We observe that the Context Analyzer can still identify the semantics inconsistency, particularly with the help of the Semantic comparator that selects sites with great semantic distances from the IBT: 97% of the injected sites have a semantic distance larger than 0.8 when the threshold of the Semantic comparator is set to 0.9; by comparison, 85% of the injected sites have a semantic distance larger than 0.8 in the absence of the Semantic comparator.

Further, we measure the semantic inconsistency of unknown injected gTLD sites. This is nontrivial because simply searching site:.com "payday loan" will return mostly legitimate search results. Even though we could validate these FQDNs one by one through the Semantic comparator and the Context Analyzer, the cost of finding truly compromised sites becomes overwhelming. As mentioned earlier, with a similar PR, gTLD sites are better protected than sTLD sites. Hence, when searching gTLDs under the IBT (e.g., site:.com "payday loan"), high-PR gTLD sites, which are actually less likely to be compromised, tend to appear on top of the search results. For example, when searching "payday loan", many high-PR sites will show up within the top-100 search results. None of them appear to be compromised. To address this challenge and identify the sites likely to be compromised (which will be further determined by the Context Analyzer), we utilized long IBTs (word length larger than 4) to feed the search engine to obtain suspicious FQDNs. Generally, longer query keywords have less search competition, i.e., websites with lower PRs are more likely to appear in the search results. For example, when searching for "payday loan no credit check" under .com, the bottled water website iceriversprings.com and the ATM company website carolinaatm.com are within the top-10 search results.

In our experiments, we utilized 1000 long IBTs in 10 individual categories to do the search, and 23,098 gTLD FQDNs were collected for the semantic inconsistency analysis. We set the threshold of the Context Analyzer to 0.9, and 7,430 of the gTLD FQDNs were reported to have promotional infections. We further randomly sampled 400 results (200 injected and 200 not-injected) and manually checked the findings. We confirmed that 182 were indeed infections and 196 were not injected, which gives us an FDR of 9% and an FPR of 8.4%. Despite this encouraging outcome, how to detect compromised gTLDs through semantics-based approaches remains an open question. Particularly, new techniques need to be developed to further suppress the FDR and improve coverage. Also, query terms for detection should be automatically discovered.

C. Case Studies

Perhaps the most surprising finding of our study is the discovery of several large-scale attacks, infecting many leading organizations around the world. In addition to the aforementioned gambling campaign, we also found infections promoting counterfeit products, fake essays and political materials on university and government sites. Here we present the studies on two cases as examples to provide additional information about what techniques the adversary uses and how the attacks are organized.

Exploit kit discovered. We found an exploit toolkit used in multiple gambling campaigns, for example, Campaign 1. The toolkit, called xise, was discovered on a cloud drive. By analyzing its code, we found that xise has the functionalities for automatic site collection, shell acquisition, customized injected page generation and a series of evasion techniques such as redirection cloaking and code obfuscation. More specifically, it automatically discovers the domains of high-profile websites from Google and other search engines, and also scans the websites for vulnerabilities within components such as phpmyadmin, kindeditor, ueditor, alipay and fckeditor. Further, it lets its user provide the promoted site's URL and keywords and automatically generates the pages to be injected into the compromised websites along a specific path (e.g., filemanager/browser/default/images/icons). The tool also uploads a configuration file to the compromised web server to perform redirection cloaking: i.e., it will redirect visitors based on their HTTP referers to protect the compromised site. Also, to guarantee that the malicious content is indexed by search engines, xise uploads scripts that keep generating pages to maintain SEO effectiveness. Note that adding and changing pages is a freshness factor for high search engine ranking. In our research, we manually generated signatures for xise as listed in Table V. 1037 of the sTLD sites we detected are related to xise, with an average semantic distance of 0.87 to their sTLDs.

Academic cheating infections. Our research also discovered many infections promoting academic cheating sites. Those sites provide online services for preparing any kind of homework at the high school and college levels, and even taking online tests for students. We found that such attacks mainly aim at .edu domains, and the examples of the IBTs involved include "free essay", "cheap term paper" and others. These terms were found to be very effective at finding such malicious activities. SEISE detected 428 compromised sites, including high-profile .edu domains such as mit.edu, princeton.edu, harvard.edu, etc. Table VI compares the compromised .edu sites in different keyword categories. We observe that such malicious activities have apparently already become a global industry. 119 education TLDs in 109 countries have 428 infected domains promoting academic cheating sites. The top 3 education TLDs with the most infected sites are edu (23%), edu.mn (11%) and edu.cn (7%).
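The FDR and FPR reported for the gTLD experiment in Section V-B can be checked with a few lines of arithmetic. The sketch below assumes the standard definitions FDR = FP / (FP + TP) and FPR = FP / (FP + TN), and assumes that the 200 sampled "injected" results are the flagged sites and the 200 "not-injected" results are the unflagged ones; the function and variable names are illustrative and do not come from the paper.

```python
def false_discovery_rate(true_pos, false_pos):
    """Fraction of flagged sites that turned out not to be infected."""
    return false_pos / (true_pos + false_pos)


def false_positive_rate(false_pos, true_neg):
    """Fraction of truly clean sites that were flagged as infected."""
    return false_pos / (false_pos + true_neg)


# Numbers taken from the manual validation described in the text.
flagged_sampled = 200       # sampled results reported as injected
flagged_confirmed = 182     # of those, confirmed as real infections
unflagged_confirmed = 196   # sampled not-injected results confirmed clean

true_pos = flagged_confirmed
false_pos = flagged_sampled - flagged_confirmed   # 18 wrongly flagged
true_neg = unflagged_confirmed

fdr = false_discovery_rate(true_pos, false_pos)
fpr = false_positive_rate(false_pos, true_neg)

print(f"FDR = {fdr:.1%}")   # 18 / 200
print(f"FPR = {fpr:.1%}")   # 18 / (18 + 196)
```

Under these assumptions the computed values match the 9% FDR and 8.4% FPR stated in the text.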
TABLE V: Example of signatures.

<img width="20" height="20" border="0" hspace="0" vspace="0" src="http://count51.51yes.com/count1.gif">

<iframe marginwidth="0" marginheight="0" hspace="0" vspace="0" frameborder="0" scrolling="no" src="" height="0" width="0">

TABLE VI: Comparison of injected education TLDs sites in different keyword categories.

Our research shows that semantics-inconsistency search offers a highly effective solution to the promotional-infection threat. In this section, we discuss the tricks the adversary can play to evade our detection, limitations of our technique and future research, together with the lessons learnt from our study and our communication with the victims.

Evasion. The current implementation of SEISE is based upon the search results returned from Google and Bing. While both are mainstream search engines targeted by promotional infections, the data we crawled are limited to the sites that are indexed by Google and Bing. Hence, to evade SEISE, the adversary, who has full control of a compromised website, may set robots.txt to prevent part of its content from being scanned. Such evasion techniques, however, will cause the promotion pages to lose the visitors from the search engines and also the high-profile links to the sites being promoted. This defeats the purpose of the promotional infections, which are meant to advertise malicious content through the search engines and therefore should aggressively expose their content (promotional pages) to the search engines, instead of hiding it from them. Other issues related to search results include the delay introduced by page indexing and page expiration. Again, although our approach is not designed to capture a promotional infection before it is indexed by the search engines, the impact of the infection is also limited at that time, simply because its whole purpose is to advertise some malicious materials, which is not well served without the infected pages being discovered by the search engine. For page expiration, we need to consider the fact that as long as the URLs of the promoted content are still alive, the attack is still in effect, since letting people find the URLs is the very purpose of the attack. Whether the URLs are still there can be confirmed by crawling the links. Further, the snippets of the search results, even for the pages that are already expired, can still be utilized to find new keywords.

The adversary may play other evasion tricks, by adding more relevant keywords to the infected page to make the content look more consistent with the website's theme, or hiding the inconsistent content by embedding it within images. However, even in the presence of relevant content, the malicious keywords can still be recovered and cause an observable semantic deviation from the theme of the original website, as long as the keywords are sufficiently frequent to be picked up by the search engine and contribute to the change of the malicious content's rank in search results. Hiding content in images results in neglect of malicious content in the search results, which is not what the adversary wants. Fundamentally, no matter what the adversary does, the fact remains that any attempt to cover the content being advertised will inevitably undermine the effectiveness of the promotional effort. Another evasion strategy is to just compromise websites with compatible semantics. This approach will significantly limit the attack targets the adversary can have. Particularly, it is less clear how this can be done for sTLDs. Note that even selling medicine on a health institution's site can be captured, as shown by the infections of the NIH pages at the beginning of the paper.

Limitations. As mentioned earlier, our current design is focused on detecting the infections of sTLD sites, since they have well-defined semantic meanings and are a soft target for the adversary. In the meantime, gTLDs are also known to be extensively compromised for promotion purposes. A natural follow-up step is to develop the semantic technologies for protecting those domains. This is completely feasible, as demonstrated in our preliminary study (Section V-B): by leveraging the Alexa categories, the semantics of even those more generic domains can also be identified and compared with that of the content they host.

Moreover, our semantic-based detection technique does not differentiate between server injected domains, blog/forum Spam and URL redirection (e.g., posting ads on a .edu forum or utilizing the server-side script of a .gov domain to dynamically create a page under the domain with promotion content, see Section I). In our research, we randomly sampled 100 detected pages and found that about 20% of them are Spam, which is also considered illicit advertising. A follow-up step is to develop automatic technologies to identify those cases, so we can respond to them in a different way (e.g., through input sanitization). For example, a comment page oftentimes can be detected from keywords such as "comment" or "redirect" in its link; such a page, once found to promote malicious content, can be further analyzed to determine whether the content is link Spam or caused by an infection.

Also, the use of search engines has a performance implication. Search service providers often have limits on the crawling frequency one can have, which causes delay in detecting malicious content and affects the scalability of our technique. On the other hand, given the effectiveness of SEISE in catching promotional infections, we believe that a collaboration with the search provider to detect Internet-wide infections is completely

Lesson learnt. Our study shows that sTLD sites are often under-protected. Particularly for universities and other research institutions, their IT infrastructures tend to be open and loosely controlled. As a prominent example, in a university, individual servers are often protected at the department level while the university-level IT often only takes care of network-level protection (e.g., intrusion detection). The problem is that, oftentimes, the hosts are administrated by less experienced people and include out-dated and vulnerable software, while given the nature of the promotional infections, they are less conspicuous in the network traffic, compared with other intrusions (e.g., setting up a campus botnet). We believe that SEISE, particularly its Context Analyzer, can help the web administrators of these organizations detect the problems with those less-protected hosts. Of course, a more fundamental solution is to have better centralized control, at least in terms of discovering the security risks at the host level and urging the administrators of these hosts to keep their software up-to-date.

Responsible disclosure. Since the discovery of infected domains, we have been in active communication with the parties affected. So far, we have reported over 120 FQDNs to CERT in the US and 136 FQDNs to CCERT (responsible for .edu.cn) in China, the two countries hosting most infected domains. By now, CCERT has confirmed our report and notified all related organizations, of which 27 responded and fixed their problems. However, it is difficult for us to directly contact the victims to get more details (like log access) from the infected servers. On the other hand, given the scale of the attacks we discovered, the whole reporting process will take time.

Detection of injected sites. How to detect injection of malicious content has long been studied. Techniques have been developed to analyze web content, redirection chains and URL patterns. Examples of content-based detection include a DOM-based clustering system for monitoring Scam websites, and a system monitoring the evolution of web content, called Delta, which keeps track of the content and structure modifications across different versions of a website, and identifies an infection using signatures generated from such modifications. More recently, Soska et al. work on detecting new attack trends instead of the attacks themselves. Their proposed system leverages features from web traffic, file system and page content, and is able to predict whether currently benign websites will be compromised in the near future. Borgolte et al. introduce Meerkat, a computer vision approach to website defacement detection. The technique is capable of identifying malicious content changes from screenshots of the website. Other studies focus on malicious redirectors and attack infrastructures. Examples include JsRED, which uses a differential analysis to automatically detect malicious redirect scripts, and Shady Path, which captures a malicious web page by looking at its redirection graph. Compared with those techniques, our approach is different in that it automatically analyzes the semantics of web content and looks for its inconsistency with the theme of the hosting website. We believe that the semantics-based approach is the most effective solution to promotional infections, which can be easily detected by checking the semantics of infected sites but are hard to identify by just looking at the syntactic elements of the sites: e.g., both legitimate and malicious ads can appear on a website, using the same techniques like redirections, iframe, etc. Further, we do not look into web content or infrastructure at all, and instead leverage the search results to detect infections. Our study shows that this treatment is sufficient for finding promotional infections and much more efficient than content and infrastructure-based approaches.

Similar to our work, Evilseed also uses search results for malicious website detection. However, the approach is only based upon searching the URL patterns extracted from the malicious links and never touches the semantics of search results. Our study shows that focusing only on syntactic features such as URL patterns is insufficient for accurate detection of promotional infections. Indeed, Evilseed reports a huge false detection rate, above 90%, and can only serve as a pre-filtering system. On the other hand, our technique inspects all the snippets of search results (not just URLs), automatically discovering and analyzing their semantics. This turns out to be much more effective when it comes to malicious promotional content: SEISE achieves a low FDR (1.5%) at a detection coverage over 90%.

Study on blackhat SEO. Among the malicious activities performed by a promotional infection is blackhat SEO (also referred to as webspam), which has also been intensively studied. For instance, Wang et al. investigated the longitudinal operations of SEO campaigns by infiltrating an SEO botnet. Leontiadis et al. conducted a long-term study using 5 million search results covering nearly 4 years to investigate the evolution of search engine poisoning. Also, Wang et al. examined the effectiveness of the interventions against the SEO abuse for counterfeit luxury goods. Moore et al. studied the trending terms used in search-engine manipulation. Also, Leontiadis et al. observed .edu sites that were compromised for search redirection attacks in the illicit online prescription drug trade, and briefly discussed their lifetime and volume. In our paper, we conduct a more comprehensive measurement on 403 sTLDs, and multiple illicit practices beside drug trade were

In this paper, we report our study on promotional infections, which introduce a large semantic gap between the infected sTLD and the illicit promotional content injected. Exploiting this gap, our semantic-based approach, SEISE, utilizes NLP techniques to automatically choose IBTs and analyze search result pages to find those truly compromised. Our study shows that SEISE introduces a low false detection rate (about 1.5%) with over 90% coverage. It is also capable of automatically expanding its IBT list to include not only new terms but also terms from new IBT categories. Running on 100K FQDNs, SEISE automatically detects 11K infected FQDNs, which brings to light the significant impact of the promotional infections: among those infected are the domains belonging to leading educational institutions, government agencies, even the military, with 3% of .edu and .gov, and over one thousand domains of .gov.cn falling prey to illicit advertising campaigns. Our research further demonstrates the importance of sTLDs to the adversary and the bar our technique raises for the attacks. Moving forward, we believe that there is a great potential to extend the technique for protecting gTLDs, as indicated by our preliminary study. Further, we are exploring the possibility of providing a public service for detecting such infections.

IX. ACKNOWLEDGMENT

This work was supported by the National Science Foundation (grants CNS-1223477, CNS-1223495 and CNS-1527141) and the Natural Science Foundation of China (grant 61472215). We thank our anonymous reviewers for their useful comments.

REFERENCES

"Bing search api." https://datamarket.azure.com/dataset/bing/search.

"Farsight security information exchange," https://api.dnsdb.info/.

"Google web search api." https://developers.google.com/web-search/?hl=

"Phishtank," https://www.phishtank.com.

"Public suffix list," https://publicsuffix.org/list/.

"scikit-learn, machine learning in python." http://scikit-learn.org/stable/.

"Sponsored top level domain (stld)," http://icannwiki.com/index.php/

"Stopword lists," http://www.ranks.nl/stopwords.

"word2vec, tool for computing continuous distributed representations of

"Email spam filter trigger words to avoid in your e-campaigns," http:

"50 of the most competitive seo keywords!" https://moz.com/ugc/

K. Borgolte, C. Kruegel, and G. Vigna, "Delta: automatic identification of unknown web-based infection campaigns," in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 109–120.

K. Borgolte, C. Kruegel, and G. Vigna, "Meerkat: detecting website defacements through image-based object recognition," in Proceedings of the 24th USENIX Conference on Security Symposium. USENIX Association, 2015, pp. 595–610.

CleanMX, "Viruswatch – viruswatch watching adress changes of malware

M. F. Der, L. K. Saul, S. Savage, and G. M. Voelker, "Knock it off: Profiling the online storefronts of counterfeit merchandise," in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 1759–1768.

R. Garside and N. Smith, "A hybrid grammatical tagger: Claws4," Corpus annotation: Linguistic information from computer text corpora, pp. 102–

L. Invernizzi, P. M. Comparetti, S. Benvenuti, C. Kruegel, M. Cova, and G. Vigna, "Evilseed: A guided approach to finding malicious web pages," in Security and Privacy (SP), 2012 IEEE Symposium on, pp. 428–442.

N. Leontiadis, T. Moore, and N. Christin, "Measuring and analyzing search-redirection attacks in the illicit online prescription drug trade," in USENIX Security Symposium, 2011.

N. Leontiadis, T. Moore, and N. Christin, "A nearly four-year longitudinal study of search-engine poisoning," in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2014, pp. 930–941.

Z. Li, S. Alrwais, X. Wang, and E. Alowaisheq, "Hunting the red fox online: Understanding and detection of mass redirect-script injections," in Security and Privacy (SP), 2014 IEEE Symposium on.

T. Moore, N. Leontiadis, and N. Christin, "Fashion crimes: trending-term exploitation on the web," in Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011, pp. 455–466.

H. Schmid, "Probabilistic part-of-speech tagging using decision trees," in Proceedings of the international conference on new methods in language processing, vol. 12. Citeseer, 1994, pp. 44–49.

B. Skiera, J. Eckert, and O. Hinz, "An analysis of the importance of the long tail in search engine marketing," Electronic Commerce Research and Applications, vol. 9, no. 6, pp. 488–494, 2010.

K. Soska and N. Christin, "Automatically detecting vulnerable websites before they turn malicious," in Proc. USENIX Security, 2014.

R. Stephan and F. Russ, "topia.termextract 1.1.0," https://pypi.python.

G. Stringhini, C. Kruegel, and G. Vigna, "Shady paths: Leveraging surfing crowds to detect malicious web pages," in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 133–144.

K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, "Feature-rich part-of-speech tagging with a cyclic dependency network," in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003, pp. 173–180.

D. Y. Wang, M. Der, M. Karami, L. Saul, D. McCoy, S. Savage, and G. M. Voelker, "Search+ seizure: The effectiveness of interventions on seo campaigns," in Proceedings of the 2014 Conference on Internet Measurement Conference. ACM, 2014, pp. 359–372.

D. Y. Wang, S. Savage, and G. M. Voelker, "Juice: A longitudinal study of an seo botnet," in NDSS, 2013.

N. Xue et al., "Chinese word segmentation as character tagging," Computational Linguistics and Chinese Language Processing, vol. 8, no. 1, pp. 29–48, 2003.
Lessons from Pfizer's Disputes Over its Viagra Trademark in China
Daniel Chow

Recommended Citation: Daniel Chow, Lessons from Pfizer's Disputes Over its Viagra Trademark in China, 27 Md. J. Int'l L. 82 (2012). Available at: http://digitalcommons.law.umaryland.edu/mjil/vol27/iss1/9