Identifying spammers by their resource usage patterns

Kevin S. Xu, Mark Kliger, and Alfred O. Hero III

Abstract

Most studies on spam thus far have focused on its content or source. These types of studies, however, reveal little about the behavioral characteristics of spammers. In addition, privacy issues may prevent wide access to email content. In this paper, we try to identify spammers by investigating their resource usage patterns. Specifically, we look at usage patterns of harvesters, the bots that are used to acquire email addresses, and spam servers, the email servers being used to send the spam emails. We perform spectral biclustering on both harvesters and servers to reveal groups of resources that are used together, which we believe correspond to individual spammers or groups of spammers. We make several interesting discoveries including a division into phishing and non-phishing spammers and a group of harvesters with highly correlated behavior that have IP addresses belonging to a rogue Internet service provider.

Paper

K. S. Xu, M. Kliger, A. O. Hero III, “Identifying spammers by their resource usage patterns,” Collaboration, Electronic Messaging, Anti-Abuse and Spam Conf. (CEAS), 2010. (.pdf)

Biclustering results

The Cytoscape visualization shows the bicluster interaction network. Each bicluster is represented by two vertices: a circular vertex corresponding to a cluster of harvesters and a triangular vertex corresponding to a cluster of servers. The size of a vertex reflects the number of harvesters or servers in that cluster, and its color corresponds to the average phishing level of the harvesters or servers in the bicluster. Only the biclusters in the giant connected component (GCC) are displayed. In some months, such as March 2006, all of the phishing biclusters were disconnected from the GCC, so they are excluded from the visualization. The smaller connected components are, however, included in the contingency tables.
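The restriction to the giant connected component can be illustrated with a minimal sketch. It assumes the bicluster interaction network is available as an undirected edge list between cluster vertices. The vertex names (H1, S1, ...) and the function name are hypothetical, and this is not the code used to produce the visualizations.

```python
from collections import defaultdict, deque


def giant_connected_component(edges):
    """Return the vertex set of the largest connected component."""
    # Build an undirected adjacency map from the edge list.
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    largest = set()
    for start in list(adjacency):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        if len(component) > len(largest):
            largest = component
    return largest


# Hypothetical toy network: harvester clusters (H*) linked to server clusters (S*).
toy_edges = [("H1", "S1"), ("S1", "H2"), ("H2", "S2"), ("H3", "S3")]
gcc = giant_connected_component(toy_edges)
print(sorted(gcc))
```

In this toy network the pair H3, S3 forms a smaller component, so it would be left out of the picture while still counting toward the contingency tables.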

The contingency tables show the distribution of phishing and non-phishing harvesters and servers in phishing and non-phishing biclusters. In the row and column labels, the prefixes P and N denote phishing and non-phishing, respectively.

Month    Visualization    Bicluster  Pharvester  Nharvester  Pserver  Nserver
2006-01  clu_2006_01.png  Pcluster          120           4      155        8
                          Ncluster           16         466       21     3077
2006-02  clu_2006_02.png  Pcluster          127           1      202        3
                          Ncluster           14         557       25     3745
2006-03  clu_2006_03.png  Pcluster          209          25      299       51
                          Ncluster           18         639       28     5290
2006-04  clu_2006_04.png  Pcluster          244          26      331       34
                          Ncluster           19         586       22     4389
2006-05  clu_2006_05.png  Pcluster          222          21      255       24
                          Ncluster           24         697       34     4723
2006-06  clu_2006_06.png  Pcluster          187          11      232       20
                          Ncluster           20         788       52    17639
2006-07  clu_2006_07.png  Pcluster          188          10      238       18
                          Ncluster           33         864       43    22431
2006-08  clu_2006_08.png  Pcluster          193          14      214       24
                          Ncluster           19         879       71    29738
2006-09  clu_2006_09.png  Pcluster          165           9      215       19
                          Ncluster           25         965      108    37554
2006-10  clu_2006_10.png  Pcluster          172           7      231       11
                          Ncluster           20        1333     1465    73748
2006-11  clu_2006_11.png  Pcluster          147          10      192       16
                          Ncluster           44        1287      832    61970
2006-12  clu_2006_12.png  Pcluster          135          13      180       11
                          Ncluster           18        1316     1832    59957
2007-01  clu_2007_01.png  Pcluster          132           7      171       18
                          Ncluster           23        1245     1502    59982
2007-02  clu_2007_02.png  Pcluster          130           6      198       23
                          Ncluster           37        1228     1085    65320
2007-03  clu_2007_03.png  Pcluster          110           5      146       24
                          Ncluster           17        1131      609    67203
2007-04  clu_2007_04.png  Pcluster          117           8      293       21
                          Ncluster           26        1760     1057    79528
2007-05  clu_2007_05.png  Pcluster          115          13      258       37
                          Ncluster          100        1965     1380    74307
2007-06  clu_2007_06.png  Pcluster          163          18      396       59
                          Ncluster          103        2347     1807    88705
2007-07  clu_2007_07.png  Pcluster          168          32      448      105
                          Ncluster          110        2700     2229   147110
2007-08  clu_2007_08.png  Pcluster          256          58      891      164
                          Ncluster           87        2918     2667   195619
2007-09  clu_2007_09.png  Pcluster          410          73     1248      372
                          Ncluster           86        3938     3205   236327
2007-10  clu_2007_10.png  Pcluster          295          71      990     2416
                          Ncluster          122        5390      258   452859
2007-11  clu_2007_11.png  Pcluster          323          84     1000      297
                          Ncluster          127        5983     5773   740193
2007-12  clu_2007_12.png  Pcluster          242          49      698      200
                          Ncluster          124        7137     4232   992336
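
As a rough illustration of how a contingency table can be read, the sketch below summarizes the January 2006 harvester counts listed above. The two summary quantities (the fraction of harvesters whose type matches their bicluster type, and the fraction of phishing harvesters that fall in phishing biclusters) and the function name are assumptions chosen for illustration, not measures reported in the paper.

```python
def contingency_summary(table):
    """Summarize a 2x2 table laid out as
    [[P items in Pcluster, N items in Pcluster],
     [P items in Ncluster, N items in Ncluster]].
    """
    (p_in_p, n_in_p), (p_in_n, n_in_n) = table
    total = p_in_p + n_in_p + p_in_n + n_in_n
    # Fraction of items whose phishing label matches their bicluster label.
    agreement = (p_in_p + n_in_n) / total
    # Fraction of phishing items that fall in phishing biclusters.
    phishing_in_pcluster = p_in_p / (p_in_p + p_in_n)
    return agreement, phishing_in_pcluster


# Harvester counts for 2006-01, copied from the table above.
harvesters_2006_01 = [[120, 4], [16, 466]]
agreement, phishing_in_pcluster = contingency_summary(harvesters_2006_01)
print(f"agreement: {agreement:.3f}")
print(f"phishing harvesters in phishing biclusters: {phishing_in_pcluster:.3f}")
```

The same function applies to the server counts of any month once they are arranged in the same two-by-two layout.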

Code and Data

To request access to the code and data used in the analysis, please contact Kevin Xu at the address below.

Contact