2016 IEEE Symposium on Security and Privacy
Seeking Nonsense, Looking for Trouble: Efficient Promotional-Infection Detection through Semantic Inconsistency Search
Xiaojing Liao¹, Kan Yuan², XiaoFeng Wang², Zhongyu Pei³, Hao Yang³, Jianjun Chen³, Haixin Duan³, Kun Du³, Eihal Alowaisheq², Sumayah Alrwais², Luyi Xing², and Raheem Beyah¹
¹Georgia Institute of Technology ²Indiana University ³Tsinghua University
2375-1207/16 $31.00 © 2016 IEEE
© 2016, Xiaojing Liao. Under license to IEEE.
DOI 10.1109/SP.2016.48

Abstract—Promotional infection is an attack in which the adversary exploits a website's weakness to inject illicit advertising content. Detecting such an infection is challenging because of its similarity to legitimate advertising activities. An interesting observation we make in our research is that such an attack almost always incurs a great semantic gap between the infected domain (e.g., a university site) and the content it promotes (e.g., selling cheap viagra). Exploiting this gap, we developed a semantic-based technique, called Semantic Inconsistency Search (SEISE), for efficient and accurate detection of promotional injections on sponsored top-level domains (sTLDs) with explicit semantic meanings. Our approach utilizes Natural Language Processing (NLP) to identify the bad terms (those related to illicit activities such as fake drug selling) most irrelevant to an sTLD's semantics. These terms, which we call irrelevant bad terms (IBTs), are used to query search engines under the sTLD for suspicious domains. Through a semantic analysis of the results page returned by the search engines, SEISE is able to detect the truly infected sites and automatically collect new IBTs from the titles/URLs/snippets of their search result items for finding new infections. Running on 403 sTLDs with an initial 30 seed IBTs, SEISE analyzed 100K fully qualified domain names (FQDNs), and along the way automatically gathered nearly 600 IBTs. In the end, our approach detected 11K infected FQDNs with a false detection rate of 1.5% and over 90% coverage. Our study shows that by effective detection of infected sTLDs, the bar to promotional infections can be substantially raised, since other non-sTLD vulnerable domains typically have much lower Alexa ranks and are therefore much less attractive for underground advertising. Our findings further bring to light the stunning impacts of such promotional attacks, which compromise FQDNs under 3% of .edu, .gov domains and over one thousand gov.cn domains, including those of leading universities such as stanford.edu, mit.edu, princeton.edu, harvard.edu and government institutes such as nsf.gov and nih.gov. We further demonstrate the potential to extend our current technique to protect generic domains such as .com and .org.

I. INTRODUCTION

Imagine that you google the following search term: site:stanford.edu pharmacy. Figure 1 shows what we got on October 9, 2015. Under the domain of Stanford University are advertisements (ads) for selling cheap viagra! Using various search terms, we also found ads for prescription-free viagra and other drugs under nidcr.nih.gov (National Institute of Dental and Craniofacial Research), counterfeit luxury handbags under dap.dau.mil (Defense Acquisition Portal), and replica Rolex watches under nv.gov, the domain of the Nevada state government. Clearly, all those FQDNs have been changed without authorization to promote counterfeit or illicit products. This type of attack (exploiting a legitimate domain for underground advertising) is called promotional infection in our research.

Promotional infection is an attack exploiting the weakness of a website to promote content. It has been used to serve various malicious online activities (e.g., black-hat search engine optimization (SEO), site defacement, fake antivirus (AV) promotion, phishing) through various exploit channels (e.g., SQL injection, URL redirection attacks and blog/forum spam). Unlike attacks that hide malicious payloads (e.g., malware) from the search engine crawler, such as a drive-by download campaign, promotional attacks never shy away from search engines. Instead, their purpose sometimes is to leverage the compromised domain's reputation to boost the rank of the promoted content (either what is directly displayed under the domain or the doorway page pointed to by the domain) in the search results returned to the user when content-related terms are included in her query. Such infections can inflict significant harm on the compromised websites through loss of reputation, search engine penalties and traffic hijacking, and may even have legal ramifications. They are also pervasive: as an example, a study shows that over 80% of the doorway pages involved in black-hat SEO are from injected domains [28].

Catching promotional infections: challenges. Even with the prevalence of promotional infections, they are surprisingly elusive and difficult to catch. Those attacks often do not cause automatic download of malware and therefore may not be detected by virus scanners like VirusTotal and Microsoft Forefront. Even the content injected into a compromised website can appear perfectly normal, no different from legitimate ads promoting similar products (e.g., drugs, red wine, etc.), ideological and religious messages (e.g., cult theory promotion) and others, unless its semantics has been carefully examined under the context of the compromised site (e.g.,
selling red wine is unusual on a government's website). So far, detection of promotional infections mostly relies on community effort, based upon the discoveries made by human visitors (e.g., PhishTank [5]) or the integrity checks that a compromised website's owner performs. Although attempts have been made to detect such attacks automatically, e.g., through long-term monitoring of changes in a website's DOM structure to identify anomalies [16] or through computer vision techniques to recognize a web page's visual change [17], existing approaches are often inefficient (requiring long-term monitoring or analysis of the website's visual effects) and less effective, due to the complexity of the infections, which, for example, can introduce a redirection URL indistinguishable from a legitimate link or make injected content visible only to the search engine.

Fig. 1: Search findings of promotional injections in stanford.edu. Each search engine result is organized as title, URL and snippet.

Semantic inconsistency search. As mentioned earlier, fundamentally, promotional infections can only be captured by analyzing the semantic meaning of web content and the context in which it appears. To meet the demand for a large-scale online scan, such a semantic analysis should also be fully automated and highly efficient. Techniques of this type, however, have never been studied before, possibly due to the concern that a semantic-based approach tends to be complicated and less accurate. In this paper, we report a design that makes a big step forward in this direction, demonstrating that it is completely possible to incorporate Natural Language Processing (NLP) techniques into a lightweight security analysis for efficient and accurate detection of promotional infections. A key observation here is that for the attacks in Figure 1, inappropriate content shows up in domains with specific meanings: no one expects that a .gov or .edu site promotes prohibited drugs, counterfeit luxury handbags, replica watches, etc. Such inconsistency can be immediately identified and located from the itemized search results on a returned search result page, which includes the title, URL and snippet for each result (as marked out in Figure 1). This approach, which detects a compromised domain (e.g., stanford.edu) based upon the inconsistency between the domain's semantics and the content of its result snippets reported by a search engine with regard to some search terms, is called semantic inconsistency search or simply SEISE. Our current design of SEISE focuses on sponsored top-level domains (sTLDs) like .gov, .edu, .mil, etc., each of which has a sponsor (e.g., US General Service Administration, EDUCAUSE, DoD Network Information Center), represents a narrow community and carries designated semantics (Section III-A). Later we show that the technique has the potential to be extended to generic TLDs (gTLD, see Section V-B).

SEISE is designed to search for a set of strategically selected irrelevant terms under an sTLD (e.g., .edu) to find the suspicious FQDNs (e.g., stanford.edu) associated with the terms, and then further search under the domains and inspect the snippets of the results before flagging them as compromised. To make this approach work, a few technical issues need to be addressed: (1) how to identify semantic inconsistency between injected pages and the main content of a domain; (2) how to control the false positives caused by legitimate content that includes the terms, e.g., a health center site at Stanford University (containing the irrelevant term "pharmacy"); (3) how to gather the search terms related to diverse promotional content. For the first issue, our approach starts with a small set of manually selected terms popular in illicit activities (e.g., gambling, drug and adult) and runs a word-embedding-based tool to calculate the semantic distance between these terms and a set of keywords extracted from the sTLD's search content, which describe the sTLD's semantics. Those most irrelevant are utilized for detection (Section III-B). To suppress false positives, our approach leverages the observation that similar promotional content always appears on many different pages under a compromised domain for the purpose of improving the rank of the attack website pointed to by the content. As a result, a search of the irrelevant term under the domain will yield a result page on which many highly frequent terms (such as "no prescription", "low price" in the promotional content) turn out to rarely occur across the generic content under the same domain (e.g., stanford.edu). This is very different from the situation, for example, when a research article mentions viagra, since the article will not be scattered across many pages under the site and tends to contain the terms also showing up in the generic content under the Stanford domain, such as "study", "finding", etc. (Section III-B). Finally, using the terms extracted from the result snippets of the sites detected, SEISE further automatically expands the list of the search terms for finding other attacks (Section III-C).

We implemented SEISE and evaluated its efficacy in our research (Section IV). Using 30 seed terms and 403 sTLDs (across 141 countries and 89 languages), our system automatically analyzed 100K FQDNs and along the way, expanded the keyword list to 597 terms. In the end, it reported 11K infected FQDNs, which have been confirmed to be compromised¹ through random sampling and manual validation. With its low false detection rate (1.5%), SEISE also achieved over 90% detection rate.

¹Note that in line with the prior research [22], the term "compromise" here refers to not only direct intrusion of a web domain, which was found to be the most common case in our research (80%, see Section VI), but also posting of illicit advertising content onto the domain through exploiting its weak (or lack of) input sanitization: e.g., blog/forum spam and link spam (using exposed server-side scripts to dynamically generate promotion pages under the legitimate domain).

Moving beyond sTLD, we further explore the
potential extension of the technique to gTLDs such as .com (Section V-B). A preliminary design analyzes .com domains using their site tag labeled by SimilarSites [8], which is found to be quite effective: achieving a false detection rate (FDR) of 9% when long keywords gathered from compromised sTLDs

Our findings. Looking into the promotional infections detected by SEISE, we were surprised by what we found: for example, about 3% (175) of .gov domains and 3% (246) of .edu domains are injected; also around 2% of the 62,667 Chinese government domains (.gov.cn) are contaminated with ads, defacement content, phishing, etc. Of particular interest is a huge gambling campaign we discovered (Section V-C), which covers about 800 sTLDs and 3000 gTLDs across 12 countries and regions (US, China, Taiwan, Hong Kong, Singapore and others). Among the victims are 20 US academic institutions such as nyu.edu and ucsd.edu, 5 government agencies like va.gov and makinghomeaffordable.gov, together with 188 Chinese universities and 510 Chinese government agencies. We even recovered the attack toolkit used in the campaign, which supports automatic site vulnerability scan, shell acquisition, SEO page generation, etc. Also under the California government's domain ca.gov, over one thousand promotion pages were found, all pointing to the same online casino site. Another campaign involves 102 US universities (mit.edu, princeton.edu, stanford.edu, etc.), advertising "buy cheap essay". The scope of these attacks goes beyond commercial advertising: we found that 12 Chinese government and university sites were vandalized with content promoting Falun Gong. Given the large number of compromised sites discovered, we first reported the most high-impact findings to related parties (particularly universities and government agencies) and will continue to do so (Section VI).

Further, our measurement study shows that some sTLDs such as .edu, .edu.cn and .gov.cn are less protected than the .com domains with similar Alexa ranks, and therefore become soft targets for promotional infections (Section V-B). By effectively detecting the attacks on these sTLDs, SEISE raises the bar for the adversary, who has to resort to less guarded gTLDs, which typically have much lower Alexa ranks, making the attacks, SEO in particular, less effective.

Contributions. The contributions of the paper are outlined as follows:

• Efficient semantics-based detection of promotional infections. We developed a novel technique that exploits the semantic gap between domains (sTLDs in particular) and the unauthorized content they host to detect the compromised websites that serve underground advertising. Our technique is highly effective, incurring low false positives and negatives. Also importantly, it is simple and efficient: often a compromised domain can be detected by querying Google no more than 3 times. This indicates that the technique can be easily scaled, with the help of search providers.

• Measurement study and new findings. We performed a large-scale measurement study on promotional infections, the first of its kind. Our research brings to light several high-impact, ongoing underground promotion campaigns, affecting leading educational institutions and government agencies, and the unique techniques the perpetrator employs. Further we demonstrate the impacts of our innovation, which significantly raises the bar to promotional infections and can potentially be extended to protect generic domains.

Roadmap. The rest of the paper is organized as follows: Section II provides background information for our study; Section III elaborates on the design of SEISE; Section IV reports the implementation details and evaluation of our technique; Section V elaborates on our measurement study and new findings; Section VI discusses the limitations of our current design and potential future research; Section VII reviews related prior research and Section VIII concludes the paper.

II. BACKGROUND

In this section, we lay out the background information of our research, including the promotional infection, sTLD, NLP and the assumptions we made.

Promotional infection. As mentioned earlier, promotional infection is caused by exploiting the weakness of a website to advertise some content. A typical form of such an attack is black-hat SEO, a technique that improves the rank of certain content on the results page by taking advantage of the way search engines work, regardless of the guidelines they provide. Such activities can happen on a dedicated host, for example, through stuffing the pages with popular search terms that may not be related to the advertised content, for the purpose of enhancing the chance for the user to find the pages. In other cases, the perpetrator compromises a high-rank website to post an ad pointing to the site hosting promoted content, in an attempt to utilize the compromised site's reputation to make the content more visible to the user. This can also be done when the site does not check the content uploaded there, such as visitors' comments, which causes it to display blog or forum spam. Such SEO approaches, the direct compromise and the uploading of spam ads, are considered to be promotional infections. Different from SEO on a dedicated host, these approaches leverage a legitimate site and also provide their ad-related keywords to the search engine crawler, to attract targeted visitors.

The promotional infection can be used for multiple goals such as malware distribution, phishing, black-hat SEO or political agenda promotion. Black-hat SEO is often used to advertise counterfeit or unauthorized products. The same promotional tricks have also been played to get other malicious content to the audience at which the adversary aims. Prominent examples are phishing websites that try to defraud the visitors of their private information (user names, passwords, credit-card numbers, etc.) and fake AV sites that cheat the user into

Sponsored top-level domains. A sponsored top-level domain (sTLD) is a specialized top-level domain that has private agencies or organizations as its sponsors that establish and enforce rules restricting the eligibility to use the domain based on community theme concepts. For example, .aero is sponsored by SITA, which limits registrations to members of the air-transport industry. Compared to unsponsored top-level domain
(gTLD), an sTLD typically carries designated semantics from its sponsors. For example, as a sponsored TLD, .edu, which is sponsored by EDUCAUSE, indicates that the corresponding site belongs to a post-secondary institution accredited by an agency recognized by the U.S. Department of Education. Note that sTLDs for different countries are also associated with specific semantic meanings as stated in ICANN, e.g., edu.cn for Chinese

In our research, we collected sTLDs for different countries according to the 10 categories provided by ICANN [9]: .aero, .edu, .int, .jobs, .mil, .museum, .post, .gov, .travel, .xxx and the public suffix list maintained by the Mozilla Foundation [6]. All together, we got 403 sTLDs from 141 countries.

Natural language processing. The semantic information SEISE relies on is automatically extracted from web content using Natural Language Processing. Technical advances in the area have already made effective keyword identification and sentence processing a reality. Below we briefly introduce the key NLP techniques used in our research.

• Word embedding (skip-gram model). A word embedding $W: \text{words} \rightarrow V^n$ is a parameterized function mapping words to high-dimensional vectors (200 to 500 dimensions), e.g., $W(\text{`education'}) = (0.2, -0.4, 0.7, \ldots)$, to represent the word's relation with other words. Such a mapping can be done in different ways, e.g., using the continuous bag-of-words model and the skip-gram technique to analyze the context in which the words show up. Such a vector representation ensures that synonyms are given similar vectors and antonyms are mapped to dissimilar vectors. Also interestingly, the vector representations fit well with our intuition about the semantic relations between words: e.g., the vectors for the words 'queen', 'king', 'man' and 'woman' have the following relation: $v_{\text{queen}} - v_{\text{woman}} + v_{\text{man}} \approx v_{\text{king}}$. In our research, we utilized the vectors to compare the semantic meanings of different words, by measuring the cosine distance between the vectors. For example, using Wikipedia pages as a training set (for the context of individual words), our approach automatically identified the words semantically close to 'casino', such as 'gambling' (with a cosine distance of 0.35), 'vegas' (0.46) and 'blackjack' (0.48).

• Parts-of-speech (POS) tagging and phrase parsing. POS tagging is a procedure of labeling a word in the text (corpus) as corresponding to a particular part of speech as well as its context (such as nouns and verbs). POS tagging accepts the text as input and outputs the words labeled with POS such as noun, verb, adjective, etc. Phrase parsing is the technique of dividing sentences into phrases that logically belong together. Phrase parsing accepts texts as input and outputs a series of phrases in the texts. State-of-the-art POS tagging and phrase parsing techniques can achieve over 90% accuracy [20], [32], [26]. POS tagging and phrase parsing can be used in content term extraction, i.e., determining important terms within a given piece of text. Specifically, after parsing phrases from the given content, the POS tagger helps to tag the terminological candidates, such as syntactically plausible terminological noun phrases. Then, the terminological candidates are further analyzed using statistical approaches (e.g., point-wise mutual information) to determine important terms.

Adversary model. In our research, we consider the adversary who tries to exploit legitimate websites for promoting unauthorized content. Examples of such content include unlicensed online pharmacies, fake AV, counterfeit goods, political agendas or phishing sites. For this purpose, the adversary could inject ads or other content into the target sites to boost the search rank of the content he promotes or use sTLD sites as redirectors to monetize traffic.

III. SEISE: DESIGN

As mentioned earlier, promotional infections often do not propagate malicious payloads (e.g., malware) directly and instead only post ads or other content that legitimate websites may also contain. This makes detection of such attacks extremely difficult. In our research, we look at the problem from a unique perspective, the inconsistency between the malicious advertising content and the semantics of the website, particularly, what is associated with different sTLDs. More specifically, underlying SEISE is a suite of techniques that search sTLDs (.edu, .gov, etc.) using irrelevant bad terms (IBTs) (the search terms unrelated to the sTLDs but heavily involved in malicious activities like spam and phishing) to find potentially infected FQDNs, analyze the context of the IBTs under those FQDNs to remove false positives and leverage detected infections to identify new search terms, automatically expanding the IBT list. Below we elaborate on this design.

Fig. 2: Overview of the SEISE infrastructure.

A. Overview

Architecture. Figure 2 illustrates the architecture of SEISE, which includes Semantics Finder, Inconsistency Searcher, Context Analyzer and IBT Collector. Semantics Finder takes as its input a set of sTLDs, automatically identifying the keywords that represent their semantics. These keywords are compared with a seed set of IBTs to find the most irrelevant terms. Such selected terms are then utilized by Inconsistency Searcher to search related sTLDs for the FQDNs carrying these terms. Under each detected FQDN, Context Analyzer
further evaluates the context of discovered IBTs through a differential analysis to determine whether, after removing stop words (i.e., the most common words like 'the') from the context, frequently used terms identified there (e.g., the search result of site:stanford.edu pharmacy) become rare across the generic content of the FQDN (e.g., the search result of site:stanford.edu), which indicates that the FQDN has indeed been compromised. Such FQDNs are reported by SEISE and their snippets are used by IBT Collector to extract keywords. Those with the largest semantic distance from the sTLDs are added to the IBT list for detecting other infected FQDNs.

Example. To explain how SEISE works, let us take a look at the example at the beginning of the paper (Figure 1). For the sTLD .edu, SEISE first runs Semantics Finder to automatically extract keywords to profile the sTLD, e.g., "education", "United States" and "student". In the meantime, a seed set of IBTs, including "casino", "pharmacy" and others, are converted into vectors using the word-embedding technique. Their semantic gap with the .edu sTLD is measured by calculating the cosine distances between individual terms (like "pharmacy") and the sTLD keywords (such as "education", "United States" and "student"). It turns out that terms like "pharmacy" are among the most irrelevant (i.e., with a large distance from .edu). Such a term is then used to search Google under .edu, which shows the FQDN stanford.edu hosting the content with the search term. Under this FQDN, SEISE again searches for "pharmacy". The results page is presented in Figure 1. As we can see, many search result items (for different URLs) contain the same topic words, similar snippets and even URL patterns, which are typically caused by mass injection of unauthorized advertising materials. These items form the context for the IBT "pharmacy".

Our approach then converts the context (the result items) found into a high-dimensional vector, with the frequency of each word (except common stop words like 'she', 'does', etc.) as an element of the vector. The vector, considered to be a representative of the context, then goes through a differential analysis: it is compared with the vector of a reference, the search results page of site:stanford.edu that describes the generic content under the FQDN. The purpose is to find out whether the context is compatible with the theme of the FQDN. If the distance between them is large, then we know that this FQDN hosts a large amount of similar text semantically incompatible with its theme (i.e., most of the high-frequency words in the suspicious text, such as "viagra", rarely appear in the common content of the FQDN). Also given the fact that such text is the context for the search terms irrelevant to the sTLD of the current FQDN but popular in promotional infections, we conclude that the FQDN stanford.edu is indeed compromised.

Once an infection is detected, the terms extracted from the context of "pharmacy" are then analyzed and those most irrelevant to the semantics of .edu are added to the IBT list for finding other compromised FQDNs. Examples of the terms include "viagra", "cialis", and "tadalafil". In addition to the words, the URL pattern of the infection is then generalized to detect other advertising targets (e.g., red wine) not included in the initial IBT list (e.g., those for promoting illegal drugs). The same technique can also be applied to find compromised gTLDs like the .com FQDNs involved in the same campaign.

B. Semantics-based Detection

In this section, we present the technical details for Semantics Finder, Inconsistency Searcher and Context Analyzer.

Finding semantics for sTLDs. The first step of our approach is to automatically build a semantic profile for an sTLD. Such a profile is represented as a set of terms, which serve as an input to the Inconsistency Searcher for choosing the right IBTs. For example, the semantic representation of the sTLD .edu.cn could be "Chinese university", "education", "business school", etc. SEISE automatically identifies these terms from different sources using a term extraction technique. Specifically, the following two sources are currently utilized by our prototype:

• Wikipedia: the Wikipedia pages for sTLDs provide a comprehensive summary of different sTLDs. For example, https://en.wikipedia.org/wiki/.mil profiles the sTLD .mil, including its sponsor ("DoD Information System Agency"), intended use ("military entities"), registration restrictions ("tightly restricted to eligible agencies"), etc. In our research, we ran a crawler that collected the wiki pages for 80 sTLDs.

• Search results: the search results page for an sTLD query (e.g., site:gov) lists high-profile websites under the sTLD. As mentioned earlier, each search result includes a snippet of a website, which offers a concise but high-quality description of the website. Since the websites under the sTLD carry the semantic information of the sTLD, such descriptions can be used as another semantic source of the sTLD. Therefore, our approach collected the search result pages of all 403 sTLDs using automatically generated queries in the form of "site:sTLD", such as site:edu. From each result page, the top 100 search results are picked for constructing the related sTLD's semantic profile.

From such sTLD semantics sources, the Semantics Finder runs a content term extraction tool to automatically gather keywords from the sources. These keywords are supposed to best summarize the topic of each source and therefore represent the semantics of an sTLD. In our implementation, we utilized an open-source tool, topia.termextract [30], for this purpose. For each keyword extracted, our approach further calculates its frequency, which is assigned to the keyword as its weight. All together, the top 20 keywords are chosen for each sTLD as its semantics profile.

A problem is that among all 403 sTLDs, 71 of them are non-English ones, which include Chinese, Russian, French, Arabic, etc., 89 languages altogether. Analyzing these sTLDs in their native languages is complicated, due to the challenges in processing these languages: for example, segmenting Chinese characters into words is known to be hard [35]. To solve this problem, we utilized Google Translate to convert the search page of a non-English sTLD query into English and then extract their English keywords. The approach was found to
Query — site:mysau3.arbor.edu "casino"
Query — site:www.unlv.edu "casino"
"title":"Online Courses International Gaming Institute University
bookmarkportlet:10, viewhandler:10,
"title":"Online Casino by DewaCasino.com: Live Casino Online .",
(a) Differential analysis of an injected site (site:mysau3.arbor.edu). Cosine distance = 0.97
(b) Differential analysis of a non-injected site (site:www.unlv.edu). Cosine distance = 0.14
Fig. 3: Differential analysis of an injected site and a non-injected site.
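As a rough illustration of the comparison summarized in Figure 3, the sketch below builds word-frequency vectors from the URLs and snippets of two result lists (one for the query with the IBT, one for the reference query) and compares them by cosine distance. It is a minimal sketch, not the SEISE implementation: the function names and the tiny stop-word set are assumptions made for this example, while the use of the top 10 results and the 0.9 threshold follow the values reported in this section.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word set; the paper relies on an
# external stop word list [10] that is not reproduced here.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "in", "she", "do"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric delimiters, drop stop words."""
    tokens = re.split(r"[^0-9a-z]+", text.lower())
    return [t for t in tokens if t and t not in STOP_WORDS]


def frequency_vector(items, top_n=10):
    """Count tokens over the URL and snippet of the first top_n result items.

    items is a list of (url, snippet) pairs.
    """
    counts = Counter()
    for url, snippet in items[:top_n]:
        counts.update(tokenize(url))
        counts.update(tokenize(snippet))
    return counts


def cosine_distance(vb, vg):
    """Return 1 - cos(vb, vg) for two sparse frequency vectors (Counters)."""
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm_g = math.sqrt(sum(c * c for c in vg.values()))
    if norm_b == 0 or norm_g == 0:
        raise ValueError("cosine distance is undefined for an empty vector")
    dot = sum(vb[t] * vg[t] for t in vb.keys() & vg.keys())
    return 1.0 - dot / (norm_b * norm_g)


def is_flagged(ibt_items, reference_items, threshold=0.9):
    """Flag an FQDN when the IBT results diverge from the reference results."""
    vb = frequency_vector(ibt_items)
    vg = frequency_vector(reference_items)
    if not vb or not vg:
        return False  # assumption: nothing to compare, so nothing is flagged
    return cosine_distance(vb, vg) > threshold
```

Because the URL tokens (for instance, the host name) are shared by both result lists, they pull the two vectors together; the distance only approaches 1 when the snippets under the IBT are dominated by words that are absent from the reference.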
work effectively, capturing non-English promotional infections
(see Section V).
Searching for inconsistency. The Inconsistency Searcher is
designed to find out the IBTs with great semantic gaps with
a given sTLD, and use the terms to search the sTLD for
suspicious (potentially compromised) FQDNs. To this end, we
first selected a small set of seed IBTs as an input to the system.
These IBTs were collected from spam trigger word lists [13],
[14] and SEO competitive word list [15], which are popular
terms used in counterfeit medicine selling, online gambling
and Phishing. From those terms, the most irrelevant ones are
picked up for analyzing a given sTLD. Such terms are found
by comparing them with the semantics profile of the FQDN,
that is, the set of keywords output by the Semantics Finder.
Specifically, such a semantic comparison is performed by
SEISE using a word-embedding tool called word2vec [12],
a neural network that builds a vector representation for each
term by learning from the context in which the term occurs. In
our research, we utilized the English Wikipedia pages as the
context for each term to compute its vector and measure the
distance between two words using their vectors. In this way,
the IBTs irrelevant to a given sTLD can be found and used to
search under the FQDN for detecting the suspicious ones. The
approach works as follows:
• We downloaded all 30 GB of Wikipedia pages and ran a program
to preprocess those pages by removing tables and images while
preserving their captions. Individual sentences on the pages
were further tokenized into terms using a phrase parser.
• Given an input term (an IBT or a keyword in the sTLD's
semantics profile), our approach runs word2vec to train a
skip-gram model, which maps the term into a high-dimensional
vector <d1, d2, ..., di, ...> to describe the term's semantics. This
vector is generated from all the sentences involving the term,
with individual elements describing the term's relations with
other terms in the same sentence across all such sentences in
the Wikipedia dataset.
• Given the vectors of an IBT and an sTLD keyword, our
approach measures the semantic distance between them by
calculating the cosine distance between their vectors. For
each IBT, its average distance to all the keywords is used to
determine its effectiveness in detecting promotional infections.
In our research, we found that when the distance becomes
0.6 (at least 20 terms are still there within our seed set) or
more, almost no compromised site is missing (see Figure 5(a) in
Section V). The IBTs selected according to such a threshold are
then sent to the search engine together with the sTLD through
the query site:sTLD+IBT (e.g., site:edu casino). From the search
result page, top 100 items (URLs) are further inspected by
the Context Analyzer to determine whether related FQDNs
are indeed compromised, which is detailed below.
As an example, again, let us look at Figure 3: in this case,
the IBT 'casino' has a distance of 0.72 with regard to the
semantics of .edu and therefore was run under the sTLD; from
the search pages, top FQDNs, including mysau3.arbor.edu,
www.unlv.edu, were examined to detect compromised FQDNs.
Analyzing IBT context. As mentioned earlier, even the terms
most irrelevant to an sTLD could show up on some of its pages
for a legitimate reason. For example, the word 'casino' has a
significant semantic distance with the sTLD .edu, which does
not mean, however, that the .edu sites cannot carry a poster
about one's travel to Las Vegas or a research article about a
study on the gambling industry. Actually, a direct search of the
term site:edu casino yields a result page with some of the items
being legitimate. To identify those compromised FQDNs, the
Context Analyzer automatically examines the individual FQDN
on the result page, using a differential analysis (Figure 2) to
detect those truly compromised.
More specifically, the differential analysis involves two
independent queries, one on the suspicious FQDN together
with the IBT (e.g., site:life.sunysb.edu casino) and the other on
the FQDN alone (e.g., site:life.sunysb.edu) whose results page
serves as the reference. The idea is based on the observation
that in a promotional infection, the adversary has to post
similar text on many different pages (sometimes pointing to
the same site) for promoting similar products or content. This
Fig. 4: IBT SET Extension. The process to find IBTs in new category consists of five steps: Injected URLs are collected to find the injected directory path. Then, the injected directory path is used as search keyword, i.e., site:www.lgma.ca.gov/play, to list more search result items. After fetching search result snippets, critical terms are extracted, and those that show semantics irrelevance are filtered for clustering. Once a new cluster is formed, we manually check and label it with its semantics.
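The first two steps in Figure 4 (recovering the injected directory path from flagged URLs and turning it into a follow-up query) can be sketched as follows. This is only an illustration under stated assumptions: the paper speaks of the "most commonly shared path", which is simplified here to the longest directory prefix common to all given URLs, and the helper names and the sample file names are hypothetical.

```python
from urllib.parse import urlsplit


def directory_segments(url):
    """Return (host, list of directory names) for a URL, dropping the file name."""
    # urlsplit only recognizes the host when a scheme or a leading '//' is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    segments = [s for s in parts.path.split("/") if s]
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]  # the last segment is treated as a file name
    return parts.netloc, segments


def shared_path(urls):
    """Longest directory prefix common to all URLs of a single FQDN."""
    if not urls:
        raise ValueError("at least one URL is required")
    hosts = set()
    common = None
    for url in urls:
        host, dirs = directory_segments(url)
        hosts.add(host)
        if common is None:
            common = dirs
            continue
        keep = 0
        for a, b in zip(common, dirs):
            if a != b:
                break
            keep += 1
        common = common[:keep]
    if len(hosts) != 1:
        raise ValueError("all URLs must belong to the same FQDN")
    return "/".join([hosts.pop()] + common)


def path_query(urls):
    """Build the 'site:FQDN+path' query used to list more injected pages."""
    return "site:" + shared_path(urls)


# Hypothetical file names standing in for the flagged pages under /play.
flagged = [
    "www.lgma.ca.gov/play/popular/1.html",
    "www.lgma.ca.gov/play/home/2.html",
    "www.lgma.ca.gov/play/club/3.html",
]
print(path_query(flagged))  # site:www.lgma.ca.gov/play
```

A majority-based variant (keeping the prefix shared by most, rather than all, flagged URLs) would be more tolerant of a stray URL outside the injected directory.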
is necessary because the target site's rank needs multiple
highly-ranked pages on the compromised site to promote.
The problem for such an attack is that the irrelevant content,
which is supposed to rarely appear under the FQDN, becomes
anomalously homogenous and pervasive under a specific IBT.
As a result, when we look at the search results of the IBT
under the FQDN, their URLs and snippets tend to carry the
words rarely showing up across the generic content (i.e., the
reference) with much higher frequencies than their accidental
occurrences under the FQDN. On the other hand, in the case of
legitimate content including the IBT, the search results (for the
IBT under the FQDN) will be much more diverse and the words
involved in the IBT's context often appear on the reference
and are compatible with the generic content of the site; even
for the irrelevant terms in the context, their frequencies tend
to be much lower than those in the malicious context. This is
because it is unlikely that the term irrelevant to the theme of
the site accidentally appears in similar context across many
pages, which introduces an additional set of highly-frequent
irrelevant terms. As an example, let us look at Figure 3(a) that
shows a compromised FQDN and Figure 3(b) that illustrates a
legitimate FQDN. The highly-frequent words extracted from
the former under the IBT 'casino', such as 'bookmarkportlet',
'dealers', 'slot', never show up across the URLs and snippets of
the reference that represents the generic content of the FQDN
(the result of the query site:mysau3.arbor.edu). In contrast, a
query of the legitimate FQDN using the same IBT yields a
list of results whose URLs and snippets have highly diverse
content, with some of their words also included in the generic
content, such as 'class', 'education' and 'university', and most
others (except the IBT itself) occurring infrequently.
To compare the two search result pages for identifying the
truly compromised site, the Context Analyzer picks up top
10 search results from each query and converts them into a
high dimensional vector. Specifically, our approach focuses
on the URL and the content snippet for each result item.
We segment them into words using delimiters such as space,
comma, dash, etc., and remove stop words (those extremely
common words like 'she', 'do', etc.) using a stop word list [10].
In this way, each search item is tokenized and the frequency
of each token across all 10 results is calculated to form a
vector V = <w0, w1, ..., wi, ...>, where wi is the frequency
of a word corresponding to that position. For the two vectors
Vb (the search page under the IBT) and Vg (the reference, that
is, the search page of the FQDN without the IBT), SEISE
calculates their Cosine distance: $1 - \frac{V_b \cdot V_g}{\|V_b\| \, \|V_g\|}$.
In Figure 3(a), the distance of the vector for the IBT 'casino'
with the reference vector is 0.97. In Figure 3(b), where the
FQDN is not compromised, we see that the vector under the
IBT 'casino' is much closer to that of the reference, with a
distance of 0.14. In our research, we chose 0.9 as a threshold
to parameterize our system: whenever the Cosine distance
between the results of querying an FQDN under an IBT and
the reference of the FQDN goes above the threshold, the
Context Analyzer flags it as infected. This approach turns out
to be very effective, incurring almost no false positives, as
elaborated in Section IV.
Discussion. SEISE is carefully designed to work on search
result pages instead of the full content of individual FQDNs.
This is important because the design helps achieve not only high
performance but also high accuracy. Specifically, a semantic
analysis on a small amount of context information (title,
URL and snippet of a search result) is certainly much more
lightweight than that on the content of each web page. Also
interestingly, focusing on such context helps avoid the noise
introduced by the generic page content, since the snippet of
each search result is exactly the text surrounding an IBT, the
part of the web page most useful for analyzing the suspicious
content it contains. In other words, our approach leverages the
search engine to zoom in on the context of the IBT, ignoring
unrelated content on the same web page.
C. IBT SET Extension
A critical issue for the semantic-based detection is how to
obtain high-quality IBTs. Those terms need to be malicious
and irrelevant to the semantics of an sTLD. Also importantly,
they should be diverse, covering not only different keywords
the adversary may use in a specific category of promotional
infections, like unlicensed pharmacy, but also those associated
with the promotional activities in different categories, such
as gambling, fake product advertising, academic cheating, etc.
Such diversity is essential for the detection coverage SEISE
is capable of achieving, since a specific type of promotional
attack (e.g., fake medicine) cannot be captured by a wrong
IBT (e.g., 'gambling').
As mentioned earlier, the seed IBT set used in our research
includes 30 terms, which were collected from several sources,
including spam trigger word lists [13], [14] and SEO competitive
word list [15]. These IBTs are associated with the attacks
such as blackhat SEO, fake AV and Phishing. To increase the
diversity of the set, SEISE expands it in a largely automated
way, both within one category and across different categories.
More specifically, our approach leverages NLP techniques to
gather new IBTs from the search items reported to contain
malicious content, and further cluster these IBTs to discover
new categories. Here we elaborate on this design.
Finding IBTs within a category. Once a compromised FQDN
has been identified using an IBT, the search results that lead
to the detection (for the query "site:FQDN+IBT") can then
be used to find more terms within the IBT's category. This
is because the result items are the context of the IBT, and
therefore include other bad terms related to the IBT. Specifically,
similar to the Semantics Finder, the IBT Collector runs the term
extraction tool on each result item, including its title, URL and
snippet, to gather the terms deemed important to the context of
the IBT. Such terms are further inspected, automatically, against
the semantics of an sTLD by measuring their average distances
with the keywords of the FQDN (that is, converting each of
them into a vector using word2vec and then calculating the Cosine distance between two vectors). Those sufficiently
away from the FQDN's semantics (with a distance above the
aforementioned threshold) are selected as IBTs.
Finding new categories. Extracting keywords from the context
of an IBT can only provide us with new terms in the same
category. To detect the infections in other categories, we have to
extend the IBT set to include the terms in other types of illicit
promotions. The question is how to capture new keywords such as 'prescription-free antibiotic' that are distinguished from the
IBTs in the known category such as 'gambling', 'casino', etc. A
key observation we leveraged in our study is that the adversary
sometimes compromises an FQDN to perform multiple types
of advertising: depending on the search terms the user enters,
an infected website may provide different kinds of promotional
content, for drug, alcohol, gambling and others. Further, the
ads serving such a purpose are often deposited under the same
directory, along the same path under a compromised FQDN.
This enables us to exploit the URL included in a contaminated
result item (as detected by SEISE) to find the promotional
materials unrelated to the context of the IBT in use.
Specifically, from each flagged FQDN, the IBT Collector
first picks up all the URLs leading to malicious content, and
from them, identifies the most commonly shared path under
the FQDN. For example, from the URLs
www.lgma.ca.gov/play/popular/1*.html, www.lgma.ca.gov/play/home/2*.html
and www.lgma.ca.gov/play/club/3*.html (detected using the
IBT 'casino'), the shared path under the FQDN is
www.lgma.ca.gov/play. Using this path, our approach queries Google again
with 'site:FQDN+path': e.g., site:www.lgma.ca.gov/play. From
the results page of the query, critical terms are extracted by
analyzing snippets under individual result items. These terms
are further compared with the semantics of the current sTLD:
those most irrelevant (with a cosine distance above the threshold
0.9) are kept. Finally, the vectors of these terms are clustered
using the classic k-Nearest-Neighbor (k-NN) algorithm (with
k = 10) together with all existing IBTs. Once a new cluster
is formed in this way, we manually look at the cluster and
label it with its semantics (gambling, drug selling, academic
cheating, etc.). Note that this manual step is just for labeling,
not for adjusting the clustering outcomes, which were found
to be very accurate in our research (Section IV-C).
In the above example as illustrated in Figure 4, the query
site:www.lgma.ca.gov/play leads to the search results page. From
the items on the page, the IBT Collector automatically recovers
a set of critical terms, including 'goldslot', 'payday loan',
'cheap essay' and others. Clustering these terms, some of them
are classified into existing categories such as gambling, drug,
etc., while the rest are grouped into a new cluster, containing
'cheap essay', 'free term paper' along with 15 other terms.
This new cluster is found to be indeed a new attack category,
and labeled as 'academia cheating'. In our research, we ran
the approach to extend our IBT set, from 30 terms to 597
effective terms, from 3 categories (gambling, drug, etc.) to 10
large categories (financial, cheating, politics, etc.). Our manual
validation shows that the results are mostly correct.
IV. IMPLEMENTATION AND EVALUATION
In this section, we report our implementation of SEISE
and evaluation of its efficacy. Our study shows that the
simple semantics-based approach works well in practice: it
automatically discovered IBTs, achieved a low false detection
rate (1.5%) at over 90% of coverage and also captured 75%
infected domains never reported before (Section IV-C).
A. Implementing SEISE
The design of SEISE (Section III) was implemented into a
prototype system, on top of a set of building blocks. Here we
briefly describe these nuts and bolts and then show how they
are assembled into the system.
Nuts and bolts. Our prototype system was built upon three
key functional components, term extractor, static crawler and
semantic comparator. Those components are extensively reused
across the whole system, as illustrated in Figure 2. They were
implemented as follows:
• Term extractor accepts text as its input, from which it automatically
identifies critical terms. The component was implemented
in Python using an open-source tool topia.termextract.
• Static crawler accepts query terms, looks for the terms through
search engines and returns results with a pre-determined number
of items. In our implementation, the crawler was developed in
Python and utilized the Google Web Search API [4] and the
Bing Search API [1] to get search results.
• Semantic comparator accepts a set of terms and compares
them with the keywords of an input sTLD. It can return the
average distance of each term with those keywords or the terms
whose distances are above a given threshold. This component was implemented as a Python program that integrates the open-source
tool word2vec. As mentioned earlier, we trained the language model used by word2vec with the whole Wikipedia dataset, from which our implementation automatically collected the context for each term before converting it to a high-dimensional vector.
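The two outputs of the semantic comparator described above (the average distance of a term to the sTLD keywords, and the terms whose distance exceeds a threshold) can be sketched as below. This is a minimal sketch rather than the authors' code: a plain dictionary of hand-made two-dimensional vectors stands in for the vectors a trained word2vec model would provide, and all function names and toy values are assumptions made for illustration. The 0.6 threshold is the value reported in Section III for selecting IBTs.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_distance(term, keywords, vectors):
    """Average cosine distance between a term and every sTLD keyword."""
    distances = [cosine_distance(vectors[term], vectors[k]) for k in keywords]
    return sum(distances) / len(distances)


def terms_above(terms, keywords, vectors, threshold):
    """Keep the terms whose average distance reaches the threshold."""
    return [t for t in terms
            if average_distance(t, keywords, vectors) >= threshold]


# Toy vectors (hypothetical values); a real run would look these up in a
# word2vec model trained on the Wikipedia corpus.
TOY_VECTORS = {
    "casino": [1.0, 0.0],
    "course": [0.2, 1.0],
    "university": [0.0, 1.0],
    "student": [0.1, 1.0],
}

edu_keywords = ["university", "student"]
print(terms_above(["casino", "course"], edu_keywords, TOY_VECTORS, 0.6))
# ['casino']
```

With these toy values 'casino' lies far from both keywords and is kept as an IBT candidate, while 'course' is nearly parallel to them and is filtered out.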
(a) False detection rate in different semantics distances. Color bar shows the coverage rate.
(b) False positive rate in different semantics distances. Color bar shows the coverage rate.
Fig. 5: Evaluation results on good set and bad set.
System building. Using these building blocks, we constructed the whole system as illustrated in Figure 2. Specifically, the
Semantics Finder was developed to run the static crawler
to gather the content under an sTLD and then call the term extractor to identify the keywords for the domain. The
Inconsistency Searcher invokes the semantic comparator to determine the most irrelevant IBTs before using the crawler
to search for the terms. The Context Analyzer includes a
differential analyzer component implemented with around 300
lines of Python code. For each suspicious FQDN, the analyzer
calls the crawler to query the search engine twice, one under an
IBT and the other for getting the reference (the generic content).
It reports the domain considered to be compromised. Finally,
the IBT Collector uses the crawler to search for the selected
URL path under the detected domain, then the extractor to
get critical terms from the search results and the semantic
comparator to find out new IBTs. Over these IBTs, we further
integrated the k-NN module provided by the scikit-learn open
source machine learning library [7] to cluster them and discover
new bad-term categories.
B. Experiment Setting
Data collection. To evaluate SEISE, we ran our prototype
on three datasets: the labeled bad set and good set, and the
unknown set including 100K FQDNs collected from search
engines, using 597 search terms, as explicated below.
• Bad set. We collected the FQDNs confirmed to have
promotional infections from CleanMX [18], a blacklist of compromised URLs. A problem here is that these URLs are
associated with different kinds of malicious activities and it is
less clear whether they are promotional infections. What we did
is to collect all the sTLD URLs from the CleanMX feed from
2015/07 to 2015/08, and further manually inspected all these
URLs. Specifically, whenever we saw advertising, Phishing or
defacement content showing up in the search results of a URL,
it is considered to be exploited for promotional infections. We
further classified these URLs into different categories and also
manually identified related IBTs. In this way, we built a bad set
with 300 FQDNs (together with 15 IBTs in three categories).
• Good set. Using the IBTs collected from the bad set, we
further searched under the sTLDs for the FQDNs ("site:sTLD+IBT")
that contained those terms but were not compromised.
These domains were used to understand the false detections
that could be introduced by SEISE. Altogether, we collected a
good set of 300 FQDNs related to 15 IBTs and three categories.
• Unknown set. As mentioned in Section II, we gathered 403
sTLDs and manually selected 30 IBTs in three categories.
Running these IBT seeds on these sTLDs, we crawled Google
and Bing over three months, collecting 100K FQDNs. This
dataset was used as the unknown set for discovering new
Resources and validation. In all our experiments, our prototype
system was run within Amazon EC2 C4.8xlarge instances
equipped with Intel Xeon E5-2666 36 vCPU and 60GiB of
memory. To collect the data for the unknown set, we deployed
20 crawlers within virtual machines with different IP settings.
These crawlers utilized the APIs provided by Google and Bing
to dump the outcomes of the queries, from 2015/08 to 2015/10.
To validate the findings made on the unknown set, we employed
a methodology that combined anti-virus (AV) scanning,
blacklist checking and manual analysis. Specifically, for the
FQDNs reported by our system, we first scanned their URLs
with VirusTotal and considered that the URLs were indeed
suspicious when at least two scanners flagged the domain.
Then, all such suspicious URLs were cross-checked against the
blacklist of CleanMX. For those confirmed by both VirusTotal
and CleanMX, their FQDNs were automatically labeled as
compromised. For other domains also detected by SEISE, we
randomly sampled 20% of them and manually checked whether
they were indeed compromised.
C. Evaluation Results
Over the aforementioned datasets, we thoroughly evaluated
our prototype. Our study shows that SEISE is highly
effective: it achieved near zero False Detection Rate (FDR,
i.e., FP/(FP+TP)) and over 90% coverage (i.e., TP/(TP+FN))
or below 4.7% FDR, 4.4% False Positive Rate (FPR, i.e.,
FP/(FP+TN)) and nearly 100% coverage on the labeled sets
(the bad and good set); with the threshold chosen to balance
FDR and FPR, we further ran SEISE over the unknown set,
which reported over 11K compromised sites, with an FDR
of 1.5% and a coverage over 90%. Also importantly, 75% of
infections discovered from the unknown set are likely never
reported before, including 3 large-scale campaigns, on which
we elaborate in Section V. All these findings were made in
a highly efficient and scalable way: on average, only 2.3
queries were made for finding a new compromised FQDN
and the delay caused by analyzing the query results and
other computing resources consumed for this purpose were
Accuracy and coverage. We evaluated the accuracy and
the coverage of SEISE under a given set of IBTs. In this
case, what can be achieved are all dependent on the Context
Analyzer, which ultimately decides whether to flag an FQDN
as compromised. In our research, we first studied our system
over the labeled good set and bad set, and then put it to test
over the unknown set. Figure 5(a) and 5(b) illustrate the results
over the labeled sets, in response to different thresholds for
semantic distances (between the reference and the query of an IBT). As we can see here, when the threshold goes up, the
FDR goes down and so does the coverage. On the other hand,
loosening the threshold, which means that the IBT is becoming
less irrelevant to the semantics of the sTLD, improves the
coverage, at the cost of the FDR. Overall, the results show
that SEISE is highly accurate: by setting the threshold to 0.9,
we observe almost no false detection (FDR: 0.5% and FPR:
0.4%) with 92% coverage; alternatively, if we can tolerate
4.7% FDR (FPR: 4.4%), the coverage becomes close to 100%.
In our research, the threshold 0.9 was then utilized to analyze
the unknown set.
On the unknown set, we ran SEISE to query 597 IBTs under
403 sTLDs. Our prototype inspected 100K FQDNs in total.
11,473 of them were flagged as compromised, about 11% of
the whole unknown set. Table II and Table III summarize our
findings, which are further discussed in Section V. Among all
that were detected, 3% were confirmed by both VirusTotal [11]
and CleanMX [18], 22% were found by at least one of these
two AV systems and further validated manually, and 1000 of
the remaining were inspected manually. All together, the FDR
measured from the unknown set is as low as 1.5%. We further
randomly sampled 500 result pages related to 10 categories
of IBTs and found that our prototype reported 53 infections
and missed 5, which indicates a coverage of about 90%. Also,
note that over 75% of the infections have never been reported
(missed by both VirusTotal and CleanMX). We have reported
the most prominent ones among them to related organizations
and are helping them fix the problem, and will continue to
work on other cases.
IBT expansion. The effectiveness of SEISE also relies on its
capability to discover new IBTs and find new attack instances
across different categories. As discussed before, our prototype
starts with a small set of seed IBTs, 30 terms in three categories.
After searching for all these terms under all the sTLDs, a set
of compromised FQDNs are detected, which are further used
by the IBT Collector to extract new terms for searching all 403
sTLDs again. In our research, we repeated such iteration 20
times, expanding the IBT set to 597 terms and 10 categories.
All the terms and categories were manually confirmed to be
correct. Table I presents the numbers for the terms and the
categories, together with examples of new terms detected, after
the 1st, 5th, 10th, 15th and 20th iterations. As we can see here,
the number of categories and number of IBTs increase quickly
(with an increase rate of 60% and 180%, respectively) in the first
10 iterations, which indicates that our IBT expansion method
is efficient for both in-category and cross-category expansion.
Also, Table III illustrates the total categories of IBTs flagged
by SEISE after these iterations.
TABLE I: Number of IBTs in each round.
# of IBTs per category
Performance. We further evaluated the performance of our
prototype, in an attempt to understand the scalability of our
design. We found that except the delay caused by receiving
the results from Google, the overhead for analyzing search
results and detecting compromised sites is exceedingly low:
by running 10000 randomly selected queries (50 IBTs over 200
sTLDs), we observed that the average time for analyzing 1K
result items, excluding the waiting time for the search engine,
was 1ms, and also the memory and CPU usages stayed below
5% respectively. The main hurdle here is the delay caused
by the search engine: for Google, it ranged from 5ms to 8ms
per one thousand queries. The design of SEISE already limits
the number of queries that needed to be made for detecting
infected FQDNs: in the experiments, we found that on average,
a compromised FQDN was detected after 2.3 term queries. We
believe that by working with the search provider (Google, Bing
etc.), SEISE can be easily scaled with a quick turnaround of
the search results.
Based upon what was detected by SEISE, we performed a
measurement study to understand the promotional infections
on sTLDs, particularly the semantic inconsistency these attacks
introduce. Our study brings to light the pervasiveness of the
attacks and their significant impacts, affecting the websites of
leading academic institutions and government agencies around
the world. Further discovered are a set of surprising findings
and their insights, which have never been known before. For
example, apparently sTLDs are soft targets for promotional
infections, highly ranked and also easier to compromise
compared with gTLD sites of similar ranks; as a result, by
mitigating the threats to the sTLD domains, we raise the bar
for the adversary, depriving him of easy access to the resources
highly valuable to the promotional attacks, which rely on the
compromised site's rank to boost the rating of malicious content.
As another example, we show that semantic inconsistency can
also be observed in the promotional infections on gTLDs
such as .com, .net, etc., even though these domains tend to
have a much more diverse semantic meaning. Based upon this
observation, a preliminary exploration highlights the potential
of extending our approach to protect gTLD sites, indicating
that a semantic model can also be built for some websites under
the gTLD domains to capture the promotional attacks on them.
Finally, we elaborate on a study on some prominent attack
cases discovered in our research, which, from the semantic
perspectives, analyzes the techniques the adversary employs in
the promotional infections.
A. Landscape
Scope and magnitude. Our study reveals that the promotional
infections are spread across the world, compromising websites
TABLE II: Top 10 sTLDs with most injected domains.
FQDN: 1,840 URL: 172,244
FQDN: 312 URL: 22,543
FQDN: 250 URL: 29,580
FQDN: 403 URL: 34,308
FQDN: 223 URL: 21,563
FQDN: 253 URL: 23,022
FQDN: 178 URL: 15,720
FQDN: 163 URL: 14,572
FQDN: 172 URL: 12,034
FQDN: 144 URL: 11,056
TABLE III: Categories of IBTs.
casino, slot machine
ca.gov (Alexa: 649)
cheap xanax, no prescription
princeton.edu (Alexa: 3558)
nike air max, green coffee bean
nih.gov (Alexa: 196)
fake driving permit, cheap essay
payday loan, quick loan
cheap airfare, hotel deal
gmu.edu (Alexa: 8058)
cheap gucci, discounted channel
tsinghua.edu.cn (Alexa: 6717)
free download, system app
islamic state, falun gong
Fig. 6: Cumulative distribution of injected sTLD sites' Alexa rank and Top 20 injected sTLD sites with highest Alexa rank.
in all kinds of sTLDs. Altogether, SEISE detected around
1 million URLs leading to malicious content on 11,473
infected FQDNs under 9,734 sTLD domains. The results are
summarized in Table II and Table III.
To understand the magnitude of the threat towards individual
sTLDs, we studied the ratio of compromised FQDNs under each
domain category. For this purpose, we first tried to get some
idea about how many FQDNs are under each sTLD, using the
passive DNS dataset from DNSDB [3]. The dataset includes
the records of individual DNS RRsets as well as first-seen,
last-seen timestamps for each domain and the DNS bailiwick
from Farsight Security's Security Information Exchange and the
authoritative DNS data. The number of FQDNs under an sTLD
was estimated from those under the sTLD queried between
2014/01 and 2015/08, as reported by the passive DNS records.
The results were further cross-validated by comparing them
with the estimated domain counts given by DomainTools [2]
for each TLD.
Table II illustrates the top-10 sTLDs with the largest number
of infected domains, together with the number of domains we
monitored and the total number of domains we estimated for
each sTLD. According to our findings, gov.cn is the least
protected sTLD with a significant portion of the FQDNs
compromised (12%), which is followed by edu.vn (3%) and
edu.cn (3%). The top-3 sponsoring registrars with the most
infected gov.cn sites are sfn.cn, alibaba.com, xinnet.com. On
the other hand, .mil sites apparently are better protected than
others. Among the 456 .mil domains we monitored, only 8
domains are injected.
Figure 7 describes the distributions of the compromised
sTLD sites across 141 countries, as determined by their
geolocation. Based upon the number of infected domains,
countries are colored with different shades of blue. As we
can see here, most of the infected sites are found in China (15%),
followed by the United States (6%) and Poland (5%).
Impacts of the infections. We further looked into the Alexa
ranks of injected sTLD websites, which are presented in
Figure 6. Across different sTLDs, highly ranked websites were
found to be exploited, getting involved in various types of
malicious activities, SEO, Phishing, fake drug selling, academic
cheating, etc. Figure 6 illustrates the cumulative distributions of
the ranks: a significant portion of the infections (75%) actually
happen to those among the top 1M. Figure 6 further shows
the top-20 websites with the highest Alexa ranks. Among
them, 12 are under .edu, including the websites of leading institutions like mit.edu (Alexa:789), harvard.edu (Alexa:1034), stanford.edu (Alexa:1050) and berkeley.edu (Alexa:1452), and 7 under .gov, such as nih.gov (Alexa:196), state.gov (Alexa:719) and noaa.gov (Alexa:1126). In general, China is the country that hosts most injected sTLD sites; however, when it comes to top ranked sites (Alexa rank < 10K), 67% of them are in the United States and Australia.
Also interesting are the types of malicious activities in which
those domains are involved. Table III shows the number of the
Fig. 8: The distribution of the infection time.
domains utilized for promoting each type of content (across all 10 categories). As we can see here, the largest share of the injected sTLD sites (19%) is in the Gambling category, which is
between the semantics of the promoted content and that of an
followed by those related to Drug (15%) and General Product
infected domain's generic content: in our labeled bad set (the
(14%) such as shoes and healthcare products. When we look
collection of compromised domains reported by CleanMX; see
at the top-20 domains, many of them are infected to promote
Section IV-B), all sTLD-related infections contain the malicious
Drug. Also, many .edu domains advertise unlicensed pharmacy,
content inconsistent with the semantics of their hosting websites.
while .gov domains are mainly compromised to promote gambling and
The implication of this observation is that by exploiting this
fake AV. Interestingly, the injected domains associated with
feature, a weakness of the sTLD-based promotional infections,
different countries tend to serve different types of content. For
a semantic-based approach, like SEISE, can effectively suppress
example, the most common promotions on Chinese domains
such a threat to sTLDs. This is significant, since our study,
are gambling (which is illegal in that country), while most
as elaborated below, shows that sTLDs are valuable to the
injected US domains are linked to unlicensed online pharmacy.
adversary because they are less protected and highly ranked.
Since the infected country code sTLDs (e.g., .cn) can make the
Further, even for gTLDs, which tend to have highly diverse
content they promote more visible to the audience in related
and less specified semantics, the malicious content uploaded
countries (e.g., boosting the ranks of malicious sites in the
there also tends to be incompatible with the compromised
results of country-related searches), it is likely that promotional
websites' themes. This indicates that our approach can be
infections target specific groups of Internet users, just like
applied beyond sTLDs. Below we report our findings.
sTLD as a soft target. To understand the importance of
Our study further shows that many of such infections have
sTLDs to the adversary, we compared the compromised sTLD
been there for a while. Figure 8 shows the distribution of the
sites with those under the gTLDs, within the same attack
infection time for the injected page in sTLD sites. We estimated
campaign. A campaign here includes a set of websites infected
the durations of their infections by continuously crawling the
for promoting unauthorized or malicious content and those sites
20K injected pages (which were detected in 2015/08) every
share a set of common features: specifically, they all point
two days from 2015/08 to 2015/11 to find out whether they
to the same target site being advertised, their malicious URLs
were still alive. As we can see from the figure, most infections
have the same features (such as the same affiliate ID as a URL
last 10-20 days, while some of them have indeed been there
parameter) and they all share the same redirection chain. In our
for a while, at least 1 month. A prominent example is the
research, we discovered a campaign through infected websites'
injection on ca.gov, whose infection starts no later than 60
"link-farm" structure, i.e., a compromised site pointing to
another one. Following the links on the compromised sTLD sites enabled us to reach a set of infected gTLD sites, mainly
B. Implications of Semantics Inconsistency
under .com. We then compared the features of those sites with
Our study shows that promotional infections, particularly
those of sTLD domains, in terms of Alexa rank, pagerank
for those under sTLDs, are characterized by the inconsistency
(PR) and lifetime, in an attempt to find out what type of TLD
domains are more valuable to promotional infections.
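The campaign definition above (sites that point to the same promoted target, carry the same affiliate ID as a URL parameter, and share the same redirection chain) can be sketched as a simple grouping step. This is only an illustration under stated assumptions, not the implementation used in the study: the record layout, the `aff` parameter name and all example hostnames are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse, parse_qs


def campaign_key(malicious_url, redirect_chain, affiliate_param="aff"):
    """Build a grouping key from the three shared features named in the text."""
    parsed = urlparse(malicious_url)
    # Affiliate ID carried as a URL parameter (the parameter name is an assumption).
    affiliate = parse_qs(parsed.query).get(affiliate_param, [""])[0]
    # The promoted target is taken to be the last hop of the redirection chain.
    if redirect_chain:
        target = urlparse(redirect_chain[-1]).hostname
    else:
        target = parsed.hostname
    # Chains are compared by the sequence of hostnames they pass through.
    chain = tuple(urlparse(hop).hostname for hop in redirect_chain)
    return (target, affiliate, chain)


def group_campaigns(records):
    """records: iterable of (infected_fqdn, malicious_url, redirect_chain)."""
    campaigns = defaultdict(set)
    for fqdn, url, chain in records:
        campaigns[campaign_key(url, chain)].add(fqdn)
    return dict(campaigns)


# Hypothetical observations: two sites share target, affiliate ID and chain.
records = [
    ("a.example.edu", "http://a.example.edu/p?aff=17",
     ["http://hop.example/", "http://shop.example/"]),
    ("b.example.com", "http://b.example.com/q?aff=17",
     ["http://hop.example/", "http://shop.example/"]),
    ("c.example.gov", "http://c.example.gov/r?aff=99",
     ["http://other.example/"]),
]
campaigns = group_campaigns(records)
```

Requiring all three features to match is a conservative choice made for this sketch; a real pipeline would probably need fuzzier matching on the chain.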
Table IV presents the top-3 campaigns (all organized as link
farms) discovered in our study. The largest one covers about 872 sTLDs and 3426 gTLDs across 12 countries and regions (US, China, Taiwan, Hong Kong, Singapore and others). Among the victims are 20 US academic institutions such as nyu.edu and ucsd.edu, and 5 government agencies like va.gov and makinghomeaffordable.gov, together with 188 Chinese universities and 510 Chinese government agencies. Also among the victims are 1507 .com sites. Figure 9(a) and Figure 9(b) compare the Alexa global ranks and the page rank (PR) of those gTLD and sTLD websites.
Fig. 7: The geolocation distributions of the compromised sTLD
As we can see from the figures, 50%-75% of sTLD sites are
sites across 141 countries.
TABLE IV: Top 3 link-farm campaigns with most injected sTLD
ranked within the Alexa top 1M, while only 10%-30% of gTLD sites are at this level. Actually, more than 40% of the gTLD sites have Alexa rank outside the top 5M. By comparison, less
Fig. 10: Example of search engine results of an injected gTLD
than 20% of sTLDs have ranks outside the top 5M. In terms
of PR, more than 30% of the sTLD sites have PR from 4 to 6,
while less than 5% of gTLD sites are PR4-PR6. Also, more
than half of gTLD sites have a PR of 0 and thus weaker
also found that some compromised gTLD sites show semantics
SEO effectiveness than those with high PR. This indicates
consistent with the promotional content. For example, online
that the majority of sTLD sites have a stronger effect on the
drug library druglibrary.org (in Campaign 3) was injected to
promoted sites than gTLD sites with no or low PR.
promote "cheap xanax". Hence, to identify those suspicious
We further compared the durations of the infections for these
sites (before they are checked with the Context Analyzer), we
two types of domains. Again, we continuously crawled the
utilized the similarsites website query API [8] to fetch the site
compromised pages (identified in 2015/08-2015/09) every two
tags (e.g., "recycling" and "water" for site:iceriversprings.com)
days from 2015/09 to 2015/11 to check whether the infections
to determine a gTLD site's semantics, and only use the gTLD
were still there. Figure 9(c) illustrates the distributions of the
sites showing semantic inconsistency with the IBT (i.e., the
sTLD sites' life spans and those of gTLD sites. As can be seen
site's tags are semantically distant from the IBT) as the
from the figure, gTLD sites were cleaned up more quickly than
suspicious candidates for the input of the Context Analyzer.
the sTLD sites. Over 25% of the gTLD sites were cleaned
This filtering step (for the purpose of increasing the "toxicity
within 10 days, while 12% of the sTLD sites were cleaned
level" [21] of the inputs) is built as the Semantic comparator,
within 10 days.
which accepts the threshold for the IBT semantics distance
Our study demonstrates that the sTLDs are ranked higher
(Section III-B) and outputs the candidate gTLD sites that
than the gTLD sites and much more effective in elevating
have great semantic distances with the IBT used for the
the ranks of promoted content, thereby more valuable to
query. For example, iceriversprings.com, which has the site
promotional infections. In the meantime, they are less protected
tags "recycling" and "water", which show semantic inconsistency
than the gTLDs: once compromised, the infections will stay
(determined by the Semantic comparator; see Figure 2) with the IBT
there for a longer period of time. This indicates that, indeed,
"payday loan", will be regarded as a suspicious FQDN and
the sTLDs are valuable assets to the adversary and effective
become the input of the Context Analyzer.
protection of the site, as SEISE does, indeed makes the
Figure 9(d) shows the semantic distances between the
promotional attacks less effective.
reference and the search results of querying an IBT with
Extension to gTLDs. Compared with sTLDs, gTLDs (e.g.,
and without the Semantic comparator. We observe that the
.com, .net and .org) do not have fixed semantic meanings.
Context Analyzer can still identify the semantics inconsistency,
However, we still found that the malicious content injected
particularly with the help of the Semantic comparator that
here tends to be incompatible with the semantics of the sites,
selects sites with great semantic distances with the IBT: 97%
which can be captured by the search engine results. Figure 10
of the injected sites have semantic distance larger than 0.8
presents an example of search engine results for an injected
when the threshold of Semantic comparator is set to 0.9; by
gTLD site iceriversprings.com, which is the website of Ice
comparison, 85% of the injected sites have semantic distance
River Green brand of bottled water. However, the injected page
larger than 0.8 in the absence of the Semantic comparator.
shows semantically inconsistent content for "payday loan"
Further, we measure the semantic inconsistency of unknown
injected gTLD sites. This is nontrivial because simply searching
Then, we measure the semantics inconsistency on the
site:.com "payday loan" will return mostly legitimate search
3,000 gTLD sites, which are randomly sampled from the
results. Even though we could validate these FQDNs one by one
aforementioned campaigns. Specifically, we use the Context
through the Semantic comparator and the Context Analyzer, the
Analyzer component in SEISE to calculate the semantic
cost for finding truly compromised sites becomes overwhelming.
distance between the generic content of those known injected
As mentioned earlier, with a similar PR, gTLD sites are
sites (the reference, e.g., the search result of the query
better protected than sTLD sites. Hence, when searching
site:iceriversprings.com) and the results of querying IBTs on
gTLDs under the IBT (e.g., site:.com "payday loan"), high-
these sites, which mostly contain injected malicious content
PR gTLD sites tend to appear on top of the search results,
(e.g., site:iceriversprings.com "payday loan"). However, we
which are actually less likely to be compromised. For example,
Fig. 9: Alexa global rank, PR and life span of sites in three campaigns, and cumulative distribution of semantics distance per monitored sites. Panels: (a) Alexa global ranks per site in 3 campaigns; (b) Alexa bounce rate per site in 3 campaigns; (c) distribution of the infection time for the injected pages in sTLD sites and gTLD sites; (d) cumulative distribution of semantics distance per monitored sites.
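The Semantic comparator filtering step (keep only the gTLD sites whose site tags are semantically far from the IBT) can be sketched as follows. This is a minimal illustration under assumptions: the distance is taken to be cosine distance over averaged word vectors, the three-dimensional vectors are made-up toy values rather than a trained embedding, `lender.example` is a hypothetical site, and the threshold of 0.8 is chosen only to suit the toy data.

```python
import math

# Toy word vectors; hypothetical values standing in for a trained embedding.
VECTORS = {
    "recycling": (0.9, 0.1, 0.0),
    "water": (0.8, 0.2, 0.1),
    "payday": (0.0, 0.1, 0.9),
    "loan": (0.1, 0.0, 0.9),
    "credit": (0.1, 0.1, 0.8),
}


def mean_vector(words):
    """Average the vectors of the words that have an embedding."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        raise ValueError("none of the words has a vector")
    return tuple(sum(dim) / len(known) for dim in zip(*known))


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def semantic_comparator(site_tags, ibt, threshold):
    """Return the sites whose tags lie farther than `threshold` from the IBT."""
    ibt_vec = mean_vector(ibt.split())
    return [
        site
        for site, tags in site_tags.items()
        if cosine_distance(mean_vector(tags), ibt_vec) > threshold
    ]


site_tags = {
    "iceriversprings.com": ["recycling", "water"],  # tags quoted in the text
    "lender.example": ["loan", "credit"],           # hypothetical consistent site
}
candidates = semantic_comparator(site_tags, "payday loan", threshold=0.8)
```

With these toy vectors only the bottled-water site is passed on as a suspicious candidate, while the site whose tags already match the IBT is filtered out.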
when searching "payday loan", many high-PR sites such as
toolkit, called xise, was discovered on a cloud drive. By
analyzing its code, we found that xise has the functionalities
will show up within the top-100 search results. None of them
for automatic site collection, shell acquisition, customized
appear to be compromised. To address this challenge and
injected page generation and a series of evasion techniques such
identify the sites likely to be compromised (which will be
as redirection cloaking and code obfuscation. More specifically,
further determined by the Context Analyzer), we utilized long
it automatically discovers the domains of high-profile websites
IBTs (word length larger than 4) to feed the search engine to
from Google and other search engines, and also scans the
obtain suspicious FQDNs. Generally, longer query keywords
websites for the vulnerabilities within the components such
have less search competition [27], i.e., websites with lower PRs
as phpmyadmin, kindeditor, ueditor, alipay and
are more likely to appear in the search results. For example,
fckeditor. Further, it lets its user provide the promoted
when searching for "payday loan no credit check" under .com,
site's URL and keywords and automatically generates the pages
bottled water website iceriversprings.com and ATM company
to be injected to the compromised websites along a specific path
website carolinaatm.com are within the top-10 search results.
(e.g., filemanager/browser/default/images/icons). The tool also
In our experiments, we utilized 1000 long IBTs in 10
uploads a configuration file to the compromised web server to
individual categories to do the search, and 23,098 gTLD
perform redirection cloaking: i.e., it will redirect visitors based
FQDNs were collected for the semantic inconsistency analysis.
on their HTTP referers to protect the compromised site. Also,
We set the threshold of the Context Analyzer to 0.9, and 7,430 of
to ensure that the malicious content is indexed by search
the gTLD FQDNs were reported to have promotional infections.
engines, xise also uploads scripts to keep generating pages
We further randomly sampled 400 results (200 injected and 200
to guarantee SEO effectiveness. Note that adding and changing pages
not-injected) and manually checked the findings. We confirmed
is a freshness factor for high search engine ranking. In our
that 182 were indeed infections and 196 were not injected,
research, we manually generated signatures for xise as listed
which gives us an FDR of 9% and FPR of 8.4%. With this
in Table V. 1037 of the sTLD sites we detected are related to
encouraging outcome, how to detect compromised gTLDs
xise, with an average semantic distance of 0.87 to their sTLDs.
through semantics-based approaches remains an open
Academic cheating infections. Our research also discovered
question. Particularly, new techniques need to be developed to
many infections promoting academic cheating sites. Those sites
further suppress FDR and improve its coverage. Also, query
provide online services for preparing any kind of homework
terms for detection should be automatically discovered.
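The FDR and FPR reported for the manual check on gTLD sites can be reproduced from the sample counts. The sketch below assumes the usual definitions, FDR = FP / (FP + TP) and FPR = FP / (FP + TN), computed over the 400 sampled results; the function and argument names are illustrative only.

```python
def detection_rates(flagged, flagged_confirmed, unflagged, unflagged_confirmed_clean):
    """Derive confusion-matrix counts and rates from a manually checked sample."""
    tp = flagged_confirmed                        # flagged and truly infected
    fp = flagged - flagged_confirmed              # flagged but actually clean
    tn = unflagged_confirmed_clean                # not flagged and truly clean
    fn = unflagged - unflagged_confirmed_clean    # not flagged but infected
    return {
        "tp": tp,
        "fp": fp,
        "tn": tn,
        "fn": fn,
        "fdr": fp / (fp + tp),
        "fpr": fp / (fp + tn),
    }


# 200 sampled as injected (182 confirmed), 200 sampled as not injected (196 confirmed).
rates = detection_rates(200, 182, 200, 196)
fdr_percent = round(rates["fdr"] * 100, 1)
fpr_percent = round(rates["fpr"] * 100, 1)
```

Here 18 of the 200 flagged sites are false positives, so the FDR is 18/200 and the FPR is 18/(18 + 196), which matches the 9% and 8.4% in the text. Note that these rates describe the balanced sample, not the full set of 23,098 FQDNs.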
at the high school and college levels, and even taking online tests for students. We found that such attacks mainly aim at
C. Case Studies
.edu domains and the examples of the IBTs involved include
Perhaps the most surprising finding of our study is the
"free essay", "cheap term paper" and others. These terms were
discovery of several large-scale attacks, infecting many leading
found to be very effective at finding such malicious activities.
organizations around the world. In addition to the afore-
SEISE detected 428 compromised sites, including high-profile
mentioned gambling campaign, we also found the infections
.edu domains such as mit.edu, princeton.edu, harvard.edu, etc.
for promoting counterfeit products, fake essays and political
Table VI compares the compromised .edu sites in different
materials on university and government sites. Here we present
keyword categories. We observe that such malicious activities
the studies on two cases as examples to provide additional
have apparently already become a global industry. 119 edu-
information about what techniques the adversary uses and how
cation TLDs in 109 countries have 428 infected domains to
the attacks are organized.
promote academic cheating sites. The Top 3 education TLDs
Exploit kit discovered. We found an exploit toolkit used in
with most infected sites are edu (23%), edu.mn (11%) and
multiple gambling campaigns, for example, Campaign 1. The
edu.cn (7%).
TABLE V: Example of signatures.
<img width="20" height="20" border="0" hspace="0" vspace="0" src="http://count51.51yes.com/count1.gif">
<iframe marginwidth="0" marginheight="0" hspace="0" vspace="0" frameborder="0" scrolling="no" src="" height="0" width="0">
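A rough sketch of how signatures like the two Table V examples could be checked against a crawled page is given below. It is an illustration under assumptions, not the procedure used in the study: matching on a subset of tag attributes rather than on the exact string is a choice made here, and the class and function names are hypothetical. Zero-size iframes also occur on benign pages, so a hit is only a hint.

```python
from html.parser import HTMLParser

# Attribute values copied from the two Table V signatures.
IMG_SIGNATURE = {"width": "20", "height": "20", "border": "0"}
IMG_SRC_MARKER = "count51.51yes.com"
IFRAME_SIGNATURE = {"height": "0", "width": "0", "frameborder": "0", "scrolling": "no"}


class SignatureScanner(HTMLParser):
    """Collect the names of the signatures seen while parsing a page."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; valueless attributes are None.
        attributes = {name: (value or "") for name, value in attrs}
        if tag == "img":
            src = attributes.get("src", "").replace(" ", "")
            if IMG_SRC_MARKER in src and all(
                attributes.get(k) == v for k, v in IMG_SIGNATURE.items()
            ):
                self.hits.append("counter-img")
        elif tag == "iframe":
            if all(attributes.get(k) == v for k, v in IFRAME_SIGNATURE.items()):
                self.hits.append("hidden-iframe")


def match_signatures(html):
    """Return the list of signature names found in the given HTML text."""
    scanner = SignatureScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.hits


injected_page = (
    '<html><body><img width="20" height="20" border="0" hspace="0" vspace="0" '
    'src="http://count51.51yes.com/count1.gif"><p>text</p></body></html>'
)
benign_page = '<html><body><img width="200" height="100" src="/logo.png"></body></html>'
injected_hits = match_signatures(injected_page)
benign_hits = match_signatures(benign_page)
```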
TABLE VI: Comparison of injected education TLDs sites in
hiding the inconsistent content by embedding it within images.
different keyword categories.
However, even in the presence of relevant content, the malicious
keywords can still be recovered and cause an observable
semantic deviation from the theme of the original website, as
long as the keywords are sufficiently frequent to be picked up by
the search engine and contribute to the change of the malicious
content's rank in search results. Hiding content in images results
in neglect of malicious content in the search results, which
is not what the adversary wants. Fundamentally, no matter
what the adversary does, the fact remains that any attempt to
cover the content being advertised will inevitably undermine the effectiveness of the promotional effort. Another evasion
strategy is to just compromise the website with compatible
Our research shows that semantics-inconsistency search
semantics. This approach will significantly limit the attack
offers a highly-effective solution to the promotional-infection
targets the adversary can have. Particularly, it is less clear how
threat. In this section, we discuss the tricks the adversary can
this can be done for sTLDs. Note that even selling medicine
play to evade our detection, limitations of our technique and
on a health institution's site can be captured, as the infections
future research, together with the lesson learnt from our study
of the NIH pages shown at the beginning of the paper.
and our communication with the victims.
Limitations. As mentioned earlier, our current design is
Evasion. The current implementation of SEISE is based upon
focused on detecting the infections of sTLD sites, since they
the search results returned from Google and Bing. While
have well-defined semantic meanings and are a soft target
both are mainstream search engines targeted by promotional
for the adversary. In the meantime, gTLDs are also known
infections, the data we crawled are limited to the sites that
to be extensively compromised for promotion purposes. A
are indexed by Google and Bing. Hence, to evade SEISE, the
natural follow-up step is to develop the semantic technologies
adversary, who has full control of a compromised website,
for protecting those domains. This is completely feasible,
may set robots.txt to prevent part of its content from being
as demonstrated in our preliminary study (Section V-B): by
scanned. Such evasion techniques, however, will cause the
leveraging the Alexa categories, the semantics of even those
promotion pages to lose the visitors from the search engines
more generic domains can also be identified and compared
and also the high-profile links to the sites being promoted.
with that of the content it hosts.
This defeats the purpose of the promotional infections, which
are meant to advertise malicious content through the search
Moreover, our semantic-based detection technique does not
engines and therefore should aggressively expose its content
differentiate between server injected domains, blog/forum Spam
(promotional pages) to the search engines, instead of hiding it
and URL redirection [22] (e.g., posting ads on a .edu forum or
from them. Other issues related to search results include the
utilizing the server-side script of a .gov domain to dynamically
delay introduced by page indexing and page expiration. Again,
create a page under the domain with promotion content, see
although our approach is not designed to capture a promotional
Section I). In our research, we randomly sampled 100 detected
infection before it is indexed by the search engines, the impact
pages and found that about 20% of them are Spam, which
of the infection is also limited at that time, simply because its
are also considered illicit advertising [22]. A follow-up step
whole purpose is to advertise some malicious materials, which
is to develop automatic technologies to identify those cases,
is not well served without the infected pages being discovered
so we can respond to them in a different way (e.g., through
by the search engine. For page expiration, we need to consider
input sanitization). For example, a comment page oftentimes
the fact that as long as the URLs of the promoted content are
can be detected from the keywords such as "comment" or
still alive, the attack is still in effect, since letting people find
"redirect" involved in its link; such a page, once found to
the URLs is the very purpose of the attack. Whether the URLs
promote malicious content, can be further analyzed to determine
are still there can be confirmed by crawling the links. Further,
whether the content is link Spam or caused by an infection.
the snippet of the search results, even for the pages that are
Also, the use of search engines has a performance implica-
already expired, can still be utilized to find new keywords.
tion. Search service providers often have limits on the crawling
The adversary may play other evasion tricks, by adding
frequency one can have, which causes delay in detecting
more relevant keywords to the infected page to make the
malicious content and affects the scalability of our technique.
content look more consistent with the website's theme, or
On the other hand, given the effectiveness of SEISE in catching
promotional infections, we believe that a collaboration with the
detect malicious redirect scripts, and Shady Path [31] that
search provider to detect Internet-wide infections is completely
captures a malicious web page by looking at its redirection
graph. Compared with those techniques, our approach is different in that it automatically analyzes the semantics of web
Lesson learnt. Our study shows that sTLD sites are often
content and looks for its inconsistency with the theme of the
under-protected. Particularly for universities and other research
hosting website. We believe that the semantics-based approach
institutions, their IT infrastructures tend to be open and loosely
is the most effective solution to promotional infections, which
controlled. As a prominent example, in a university, individual
can be easily detected by checking the semantics of infected
servers are often protected at the department levels while
sites but hard to identify by just looking at the syntactic
the university-level IT often only takes care of network-level
elements of the sites: e.g., both legitimate and malicious ads
protection (e.g., intrusion detection). The problem is that,
can appear on a website, using the same techniques like
oftentimes, the hosts are administrated by less experienced
redirections, iframe, etc. Further, we do not look into web
people and include out-dated and vulnerable software, while
content or infrastructure at all, and instead, leverage the search
given the nature of the promotional infections, they are less
results to detect infections. Our study shows that this treatment
conspicuous in the network traffic, compared with other
is sufficient for finding promotional infections and much more
intrusions (e.g., setting up a campus botnet). We believe
efficient than content and infrastructure-based approaches.
that SEISE, particularly its Context Analyzer, can play the
Similar to our work, Evilseed [21] also uses search results
role of helping the web administrators of these organizations
for malicious website detection. However, the approach is only
detect the problems with those less-protected hosts. Of course,
based upon searching the URL patterns extracted from the
a more fundamental solution is to have a better centralized
malicious links and never touches the semantics of search
control, at least in terms of discovering the security risks at
results. Our study shows that focusing only on the syntactic
the host level and urging the administrators of these hosts to
features such as URL patterns is insufficient for accurate
keep their software up-to-date.
detection of promotional infections. Indeed, Evilseed reports
Responsible disclosure. Since the discovery of infected do-
a huge false detection rate, above 90%, and can only serve
mains, we have been in active communication with the parties
as a pre-filtering system. On the other hand, our technique
affected. So far, we have reported over 120 FQDNs to CERT
inspects all the snippets of search results (not just URLs),
in US and 136 FQDNs to CCERT (responsible for .edu.cn)
automatically discovering and analyzing their semantics. This
in China, the two countries hosting most infected domains.
turns out to be much more effective when it comes to malicious
By now, CCERT has confirmed our report, and notified all
promotional content: SEISE achieves low FDR (1.5%) at a
related organizations, in which 27 responded and fixed their
detection coverage over 90%.
problems. However, it is difficult for us to directly contact the
Study on blackhat SEO. Among the malicious activities
victims to get more details (like log access) from the infected
performed by a promotional infection is blackhat SEO (also
servers. On the other hand, given the scale of the attacks we
referred to as webspam), which has also been intensively studied.
discovered, the whole reporting process will take time.
For instance, Wang et al. investigated the longitudinal oper-
ations of SEO campaigns by infiltrating an SEO botnet [34].
Leontiadis et al. conducted a long-term study using 5 million
Detection of injected sites. How to detect injection of
search results covering nearly 4 years to investigate the
malicious content has long been studied. Techniques have
evolution of search engine poisoning [23]. Also, Wang et al.
been developed to analyze web content, redirection chains
examined the effectiveness of the interventions against the SEO
and URL patterns. Examples of the content-based detection
abuse for counterfeit luxury goods [33]. Moore et al. studied the
include a DOM-based clustering system for monitoring Scam
trending terms used in search-engine manipulation [25]. Also,
websites [19], and a system monitoring the evolution of
Leontiadis et al. observed .edu sites that were compromised
web content, called Delta [16], which keeps track of the
for search redirection attack in illicit online prescription drug
content and structure modifications across different versions
trade, and briefly discussed their lifetime and volume [22]. In
of a website, and identifies an infection using signatures
our paper, we conduct a more comprehensive measurement on
generated from such modifications. More recently, Soska et
403 sTLDs, and multiple illicit practices besides drug trade were
al. work on detecting new attack trends instead of the attacks
themselves [29]. Their proposed system leverages the features from web traffic, file system and page content, and is able to
predict whether currently benign websites will be compromised
In this paper, we report our study on promotional infections,
in the near future. Borgolte et al. introduce Meerkat [17], a
which introduce a large semantic gap between the infected
computer vision approach to website defacement detection. The
sTLD and the illicit promotional content injected. Exploiting
technique is capable of identifying malicious content changes
this gap, our semantic-based approach, SEISE, utilizes NLP
from screenshots of the website. Other studies focus on mali-
techniques to automatically choose IBTs and analyze search
cious redirectors and attack infrastructures. Examples include
result pages to find those truly compromised. Our study shows
JsRED [24] that uses a differential analysis to automatically
that SEISE achieves a low false detection rate (about 1.5%)
with over 90% coverage. It is also capable of automatically
[18] CleanMX, "Viruswatch – viruswatch watching adress changes of malware
expanding its IBT list to not only include new terms but also
[19] M. F. Der, L. K. Saul, S. Savage, and G. M. Voelker, "Knock it
terms from new IBT categories. Running on 100K FQDNs,
off: Profiling the online storefronts of counterfeit merchandise," in
SEISE automatically detects 11K infected FQDNs, which brings
Proceedings of the 20th ACM SIGKDD international conference on
to light the significant impact of the promotional infections:
Knowledge discovery and data mining.
ACM, 2014, pp. 1759–1768.
[20] R. Garside and N. Smith, "A hybrid grammatical tagger: Claws4," Corpus
among those infected are the domains belonging to leading
annotation: Linguistic information from computer text corpora, pp. 102–
educational institutions, government agencies, even the military,
with 3% of .edu and .gov, and over one thousand domains
[21] L. Invernizzi, P. M. Comparetti, S. Benvenuti, C. Kruegel, M. Cova, and
G. Vigna, "Evilseed: A guided approach to finding malicious web pages,"
of .gov.cn falling prey to illicit advertising campaigns. Our
in Security and Privacy (SP), 2012 IEEE Symposium on.
research further demonstrates the importance of sTLDs to the
pp. 428–442.
[22] N. Leontiadis, T. Moore, and N. Christin, "Measuring and analyzing
adversary and the bar our technique raises for the attacks.
search-redirection attacks in the illicit online prescription drug trade." in
Moving forward, we believe that there is a great potential to
USENIX Security Symposium, 2011.
extend the technique for protecting gTLDs, as indicated by our
[23] N. Leontiadis, T. Moore, and N. Christin, "A nearly four-year longitudinal
study of search-engine poisoning," in Proceedings of the 2014 ACM
preliminary study. Further, we are exploring the possibility to
SIGSAC Conference on Computer and Communications Security. ACM,
provide a public service for detecting such infections.
2014, pp. 930–941.
[24] Z. Li, S. Alrwais, X. Wang, and E. Alowaisheq, "Hunting the red fox
IX. ACKNOWLEDGMENT
online: Understanding and detection of mass redirect-script injections,"
This work was supported by the National Science Foundation
in Security and Privacy (SP), 2014 IEEE Symposium on.
(grants CNS-1223477, CNS-1223495 and CNS-1527141);
[25] T. Moore, N. Leontiadis, and N. Christin, "Fashion crimes: trending-term
Natural Science Foundation of China (grant 61472215). We
exploitation on the web," in Proceedings of the 18th ACM conference
thank our anonymous reviewers for their useful comments.
on Computer and communications security.
ACM, 2011, pp. 455–466.
[26] H. Schmid, "Probabilistic part-of-speech tagging using decision trees," in
Proceedings of the international conference on new methods in language processing, vol. 12.
Citeseer, 1994, pp. 44–49.
[1] "Bing search api." https://datamarket.azure.com/dataset/bing/search.
[27] B. Skiera, J. Eckert, and O. Hinz, "An analysis of the importance of the
long tail in search engine marketing," Electronic Commerce Research
[3] "Farsight security information exchange," https://api.dnsdb.info/.
and Applications, vol. 9, no. 6, pp. 488–494, 2010.
[4] "Google web search api." https://developers.google.com/web-search/?hl=
[5] "Phishtank," https://www.phishtank.com.
[6] "Public suffix list," https://publicsuffix.org/list/.
[29] K. Soska and N. Christin, "Automatically detecting vulnerable websites
[7] "scikit-learn, machine learning in python." http://scikit-learn.org/stable/.
before they turn malicious," in Proc. USENIX Security, 2014.
[30] R. Stephan and F. Russ, "topia.termextract 1.1.0," https://pypi.python.
[9] "Sponsored top level domain (stld)," http://icannwiki.com/index.php/
[31] G. Stringhini, C. Kruegel, and G. Vigna, "Shady paths: Leveraging
surfing crowds to detect malicious web pages," in Proceedings of the
[10] "Stopword lists," http://www.ranks.nl/stopwords.
2013 ACM SIGSAC conference on Computer & communications security.
ACM, 2013, pp. 133–144.
[12] "word2vec, tool for computing continuous distributed representations of
[32] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer, "Feature-rich part-
of-speech tagging with a cyclic dependency network," in Proceedings of
the 2003 Conference of the North American Chapter of the Association
for Computational Linguistics on Human Language Technology-Volume
[14] "Email spam filter trigger words to avoid in your e-campaigns," http:
Association for Computational Linguistics, 2003, pp. 173–180.
[33] D. Y. Wang, M. Der, M. Karami, L. Saul, D. McCoy, S. Savage, and
[15] "50 of the most competitive seo keywords!" https://moz.com/ugc/
G. M. Voelker, "Search+ seizure: The effectiveness of interventions on
seo campaigns," in Proceedings of the 2014 Conference on Internet
[16] K. Borgolte, C. Kruegel, and G. Vigna, "Delta: automatic identification
ACM, 2014, pp. 359–372.
of unknown web-based infection campaigns," in Proceedings of the
[34] D. Y. Wang, S. Savage, and G. M. Voelker, "Juice: A longitudinal study
2013 ACM SIGSAC conference on Computer & communications security.
of an seo botnet." in NDSS, 2013.
ACM, 2013, pp. 109–120.
[35] N. Xue et al., "Chinese word segmentation as character tagging,"
[17] K. Borgolte, C. Kruegel, and G. Vigna, "Meerkat: detecting website
Computational Linguistics and Chinese Language Processing, vol. 8,
defacements through image-based object recognition," in Proceedings
no. 1, pp. 29–48, 2003.
of the 24th USENIX Conference on Security Symposium.
Association, 2015, pp. 595–610.
Source: http://netsec.ccert.edu.cn/duanhx/files/2010/12/2016-SP-Semantic.pdf