We identified malicious web servers with the high interaction client honeypot Capture-HPC. Capture-HPC is an open source client honeypot developed by Victoria University of Wellington in conjunction with the New Zealand Honeynet Project. This high interaction client honeypot monitors the system at various levels:

* registry modifications to detect modification of the Windows registry, like new or modified keys
* file system modifications to detect changes to the file system, like files created or deleted files, and
* creation/destruction of processes to detect changes in the process structure.

Client honeypot instances are run within a VMware virtual machine. If unauthorized state changes are detected - in other words when a malicious server is encountered - the event is recorded and the virtual machine is reset to a clean state before interacting with the next server. (Appendix A contains download details as well as a detailed description of the tool’s functionality and underlying technical aspects).

With our focus on unauthorized state changes to identify a malicious server, we are narrowing our view to a particular type of malicious server: the ones that can alter the state of the client without user consent or interaction, which usually means that the server is able to control and install malware on the client machine without the user noticing such actions. These attacks are commonly referred to as drive-by-downloads. Many more types of malicious web servers do exist, which we excluded from this initial study. Some examples are phishing servers that try to obtain user sensitive data by imitating a legitimate site or transparent proxy, web servers that specialize in obtaining sensitive files, such as the browser history, from the client machines, and web servers that simply host malicious executables (that need to be explicitly downloaded and executed by a user). Capture-HPC is not suitable for detecting these types of malicious web servers. Rather, we concentrate on powerful malicious servers that take control of the client machine.

We collected our data in the first half of May 2007 using twelve virtual machine instances of Capture-HPC. These instances were connected to the security lab at Victoria University of Wellington, NZ, from which they had direct access to the Internet. No content filtering or firewalling was performed between the security lab and the Internet. Twelve instances of Windows XP SP2 with Capture-HPC V1.1 and Internet Explorer 6 SP2 were installed to interact with a list of web servers. The client honeypot instances were configured to ignore authorized state changes, such as write activity within the browser cache (the complete list of authorized events is included in our downloadable data set which is available at Besides the Softperfect Personal Firewall, which was configured to block any non-web related traffic and prevent potential malware from spreading, we did not install any additional patches, software or plug-ins, nor did we modify the configuration of the operating system or browser in any way to make it less secure, but rather we used the default settings from a routine installation of Windows XP SP2. We suspected that this configuration would solicit a high number of attacks, since many remote execution vulnerabilities, which enable a server to execute code on the client machine, are known and the corresponding exploits are publicly available. The intent was to gather a comprehensive data set of malicious URLs, some of which we would analyze in-depth.

We were not only interested in analyzing attacks in-depth, but also in gaining a general understanding about the risk of accessing the web with such a configuration, and learning whether there are areas of the Internet that are more risky than others. As a result, we categorized the web servers’ URLs prior to inspection by our client honeypot. The following categories were used:

* Content areas: These URLs point to content of a specific type or topic. They were obtained by issuing keywords of the specific content area to the Yahoo! Search Engine. Five different content areas were inspected:
o Adult – pages that contain adult entertainment/pornographic material
o Music – pages that contain information about popular artists and bands
o News – pages that contain current news items or news stories in sports, politics, business, technology, and entertainment
o User content – pages that contain user-generated content, such as forums and blogs
o Warez – pages that contain hacking information, including exploits, cracks, serial numbers, etc.
* Previously defaced/vulnerable web servers: These URLs point to pages of servers that have been previously defaced or run vulnerable web applications. We obtained the previously defaced sites from . A list of vulnerable web servers was generated by issuing Google Hack keywords to search engines.
* Sponsored links: These URLs are a list of sponsored links encountered on the Google search engine results page. It was generated by collecting the links from the site for various keywords.
* Typo squatter URLs: These URLs are common typing mistakes of the 500 most popular sites from (for example instead of The typo URLs were generated using Microsoft’s StriderURLTracer.
* Spam: These URLs were mined from several months’ worth of Spam messages of the public Spam archive at

The list of URLs as well as the keywords used to obtain the URLs are included in our downloadable data set at

We inspected a little over 300,000 URLs from approximately 150,000 hosts in these categories (approximately 130,000 hosts when accounting for the overlap of hosts between categories. No significant overlap between URLs existed). Table 1 and Figure 3 show the detailed breakdown for the different categories and content areas. With each URL that was inspected by the client honeypot, we recorded the classification of the client honeypot (malicious or benign) as well as any unauthorized state changes that occurred in case an attack by a server was encountered (an example can be found as part of our in-depth analysis of the Keith Jarrett fan site). Unfortunately, we could not determine which vulnerability was actually exploited because analysis tools are still immature and not suitable for an automated analysis.

Category  Inspected Hosts Inspected URLs 
Adult  16,375 33,999 
Music  13,106 49,269 
News  21,188 47,224 
User Content  24,331 45,835 
Warez  23,530 44,870 
Defacement/Vuln  4,844  5,151 
Sponsored Links  17,179 42,092 
Typo  22,902 22,912 
Spam  5,481 11,460 
Total  148,936  302,812 

Table 1 - Input URLs/ hosts by category


Figure 3 - Input URLs by category

Using these input URLs, we identified a total of 306 malicious URLs from 194 hosts (No significant overlap of hosts or URLs exited). Simply retrieving any one of these URLs with the vulnerable browser caused an unauthorized state change that indicated the attack was successful, and that the server effectively gained control of the client machine. At that point, the server would be able to install malware or obtain confidential information without the user’s consent or notice.

The percentage of malicious URLs within each category ranged from 0.0002% for previously defaced/vulnerable web servers to 0.5735% for URLs in the adult category. Table 2 shows the breakdown of the various categories. As is clear from these results, all categories contained malicious URLs. This is an important discovery as it means that anybody accessing the web is at risk regardless of the type of content they browse for or the way the content is accessed. Adjusting browsing behavior is not sufficient to entirely mitigate such risk. Even if a user makes it a policy to only type in URLs rather than following hyperlinks, they are still at risk from typo-squatter URLs. Likewise if a user only accesses hyperlinks or accesses links that are served up by search engines, they are still at risk. Nevertheless, there seem to be some categories that are riskier than others. A Chi-Square test (p<0.01) shows statistical significance between the adult category and Spam; also between Spam and any other category. In other words, there exists an elevated risk of attack when accessing adult content or links in Spam messages.

Table 2 – Identified malicious URLs/ hosts by category

Category  Malicious Hosts  Malicious URLs  % Malicious URLs 
Adult  102  195  0.5735 
Spam  17  19  0.1658 
Warez  19  27  0.0602 
Typo  13  13  0.0567 
News  15  20  0.0424 
User Content  12  13  0.0284 
Music  10  11  0.0223 
Sponsored Links  0.0166 
Defacement/Vuln  0.0002 


Figure 4 - Identified malicious URLs by category

We interacted with the 306 malicious URLs repeatedly over a three-week period and made several interesting observations. First, malicious URLs might not be consistently malicious. While one might expect that this is a result of a malicious advertisement that is only displayed occasionally on a page, we did observe some URLs that behaved sporadically that did not contain advertisements, but rather contained a static link to the exploit code. A malicious web page might solicit malicious behavior only once upon repeated interactions, but it also might just behave benign for a few times before it solicits malicious behavior again. We suspect that such behavior is intentionally employed on the malicious servers in order to evade detection, such as detection by our client honeypot. Furthermore, there are several attack kits that serve malicious content only once per visiting IP address. This is mainly done to disguise the attack. Second, over the three-week period we observed a decline in the number of servers that behaved maliciously from the original set of 306 URLs. It declined to 180 after two weeks and to 109 after the third week. This is not likely to be indicative of a declining trend of malicious URLs, but rather points to the dynamic nature of the malicious server landscape. Exploits are disappearing from URLs, but we suspect that they are reappearing on another page at the same time. Again, this measure is likely to be employed method of evading detection, but also to ensure that attacks are consistently being executed. Such a moving target is more difficult to hit.