Discovery Challenge
Introduction
This year's competition deals with personalized spam filtering and
generalization across related learning tasks. People spend an
increasing amount of time reading messages and deciding whether
they are spam or non-spam. Some users spend additional time labeling
their received spam messages in order to train local spam filters
running on their desktop machines. Email service providers want to
relieve users of this burden by installing server-based spam filters.
Training such filters cannot rely on labeled messages from the
individual users; it has to rely on publicly available sources, such as
newsgroup messages or emails received through "spam traps" (email
addresses that are published in a way that is invisible to humans but
gets collected by the web crawlers of spammers).
This combined source of training data differs from the distributions
of the emails received by individual users. When learning spam filters
for individual users from this type of data, one needs to cope with a
discrepancy between the distributions governing training and test data,
and one needs to balance generalization and adaptation. Both
generalization and adaptation can rely on the large amounts of
unlabeled email in the users' inboxes that are accessible to
server-based spam filters. Using this unlabeled data, a spam filter
can be adapted to the properties of a specific user's inbox; when only
little unlabeled data is available for a user, generalization over
multiple users is advisable.
We provide labeled training data collected from publicly available
sources. The unlabeled inboxes of several users serve as test data. The
inboxes differ in the distribution of emails. The goal is to construct
a spam filter for each single user that correctly classifies its emails
as spam or non-spam. This requires a clever way of utilizing the
available sets of unlabeled emails from the different users.
There will be a Discovery Challenge workshop at ECML/PKDD 2006 in
Berlin, where we will discuss the results, different approaches, and
other issues related to the problem setting.
We are looking forward to an interesting competition/workshop and encourage your
participation.
Workshop Schedule
The workshop will take place on September 22nd, 2006, at
Humboldt-Universität zu Berlin, Germany, Unter den Linden 6, room
3094/96.
10:40-11:10 | ECML/PKDD Discovery Challenge 2006 Overview [paper] | Steffen Bickel
11:10-11:45 | (invited) Semi-Supervised Support Vector Machines and Application to Spam Filtering [slides] | Alexander Zien, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
11:45-12:10 | A Semi-Supervised Spam Mail Detector [paper] | Bernhard Pfahringer
12:10-13:30 | Lunch Break
13:30-13:55 | Harnessing Unlabeled Examples through Iterative Application of Dynamic Markov Modeling [paper] | Gordon Cormack
13:55-14:20 | A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering [paper] | Khurum Junejo, Mirza Yousaf, Asim Karim
14:20-14:45 | Text Classification Using Clustering [paper] | Antonia Kyriakopoulou, Theodore Kalamboukis
14:45-15:00 | Discussion
15:00-15:30 | Coffee Break
15:30-15:55 | Using Tri-Training and Support Vector Machines for Addressing the ECML/PKDD 2006 Discovery Challenge [paper] | Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, Michalis Vazirgiannis
15:55-16:20 | TPN²: Using Positive-Only Learning to deal with the Heterogeneity of Labeled and Unlabeled Data [paper] | Nikolaos Trogkanis, Georgios Paliouras
16:20-16:35 | Discussion, final remarks
(canceled) | Identifying SPAM with Predictive Models [paper] | Dan Steinberg, Mikhaylo Golovnya
Workshop Proceedings
The print version of the proceedings will be available at the
registration desk of the ECML/PKDD 2006 conference. The online
proceedings can be accessed as one PDF file here. Single papers can be accessed from the links next to the entries in the workshop schedule.
Important Dates
March 1st, 2006 | Tasks and datasets available online.
June 7th, 2006 | Submission of spam filtering results due (by midnight CET).
June 9th, 2006 | Submission of algorithm descriptions due (by midnight CET).
June 14th, 2006 | Notification of winners / publication of results on webpage.
June 16th, 2006 | Notification of creativity award winners.
July 26th, 2006 | Workshop paper submission deadline.
July 31st, 2006 | Notification of acceptance.
August 14th, 2006 | Workshop proceedings (camera-ready) deadline.
September 20th, 2006 | Discovery Challenge Overview, time: 14:00, location: Audimax.
September 22nd, 2006 | ECML/PKDD 2006 Discovery Challenge Workshop, time: 10:40-17:00, location: room 3094/96.
Winners
- Spam Filtering Performance Award - Task A (the following three teams share the first rank):
  - Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim, Lahore University of Management Sciences, Pakistan
  - Bernhard Pfahringer, University of Waikato, New Zealand
  - Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja, Inductis India Pvt Ltd
- Spam Filtering Performance Award - Task B:
  - Gordon Cormack, University of Waterloo, Canada
- Spam Filtering Creativity Award - Task A/B (we decided to award one team for both tasks instead of one for each task because most teams used the same algorithm for both tasks):
  - Bernhard Pfahringer, University of Waikato, New Zealand
Please find the detailed results below in the next section.
Detailed Results
57 teams from 19 different countries participated in the challenge;
26 teams submitted results for the evaluation. Not all participants submitted results for both tasks.
We averaged the AUC values over all inboxes as described above
and determined the ranking.
We conducted significance tests at a significance level of 5% to test the null hypothesis
that the second rank has a higher AUC value than the first. The test statistic
is computed as described by Hanley and McNeil (A Method of Comparing the Areas under
Receiver Operating Characteristic Curves Derived from the Same Cases, 1983).
For Task A we could not reject the null hypothesis for ranks two and three;
this means there is no statistically significant difference among the top three teams, and they all win the
Spam Filtering Performance Award - Task A. For Task B we could reject the null hypothesis for
the second rank; this means there is a single winner of the
Spam Filtering Performance Award - Task B.
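For illustration, a minimal C sketch of this kind of comparison is shown below; it is not the code used for the official evaluation. The standard error of a single AUC follows Hanley and McNeil's formula, the correlation coefficient r between the two areas is assumed to be estimated separately (the paper provides a table for this), and the numbers in main are hypothetical.

#include <math.h>
#include <stdio.h>

/* Standard error of a single AUC estimate a, following Hanley and
 * McNeil, given the numbers of positive and negative examples.    */
static double auc_se(double a, int n_pos, int n_neg)
{
    double q1 = a / (2.0 - a);
    double q2 = 2.0 * a * a / (1.0 + a);
    double var = (a * (1.0 - a)
                  + (n_pos - 1) * (q1 - a * a)
                  + (n_neg - 1) * (q2 - a * a)) / ((double) n_pos * n_neg);
    return sqrt(var);
}

/* z statistic for the difference of two AUCs measured on the same cases;
 * r is the estimated correlation between the two areas.                 */
static double auc_z(double a1, double a2, double r, int n_pos, int n_neg)
{
    double se1 = auc_se(a1, n_pos, n_neg);
    double se2 = auc_se(a2, n_pos, n_neg);
    return (a1 - a2) / sqrt(se1 * se1 + se2 * se2 - 2.0 * r * se1 * se2);
}

int main(void)
{
    /* Hypothetical inputs, not the numbers actually used in the challenge. */
    double z = auc_z(0.95, 0.93, 0.5, 1250, 1250);
    printf("z = %.3f (one-sided 5%% critical value: 1.645)\n", z);
    return 0;
}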
Detailed results Task A:
Rank Task A | Average AUC | Team
1 | 0.950667 | Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim, Lahore University of Management Sciences, Pakistan
1 | 0.949094 | Bernhard Pfahringer, University of Waikato, New Zealand
1 | 0.948666 | Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja, Inductis India Pvt Ltd
2 | 0.936457 | Nikolaos Trogkanis, National Technical University of Athens, Greece; Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 | 0.927839 | Chao Xu, Yiming Zhou, Beihang University, Beijing, China
4 | 0.927694 | Lalit Wangikar, Mansi Khanna, Ankush Talwar, Inductis India Pvt Ltd
5 | 0.914380 | Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos, Dimosthenis Christopoulos, Michalis Vazirgiannis, DB-NET Lab, Informatics Dept., Athens University EB, Greece
6 | 0.912040 |
7 | 0.900059 |
8 | 0.897554 |
9 | 0.892118 |
10 | 0.887429 |
11 | 0.886873 |
12 | 0.840133 |
13 | 0.838834 |
14 | 0.823030 |
15 | 0.806275 |
16 | 0.743754 |
17 | 0.733333 |
18 | 0.661819 |
19 | 0.637488 |
20 | 0.548266 |
21 | 0.523404 |
Detailed results Task B:
Rank Task B | Average AUC | Team
1 | 0.946469 | Gordon Cormack, University of Waterloo, Canada
2 | 0.918251 | Nikolaos Trogkanis, National Technical University of Athens, Greece; Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
3 | 0.907398 | Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja, Inductis India Pvt Ltd
4 | 0.899241 | D’yakonov Alexander, Moscow State University, Russia
5 | 0.893333 | Wenyuan Dai, Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University
6 | 0.870906 |
7 | 0.864615 |
8 | 0.850165 |
9 | 0.815666 |
10 | 0.800084 |
11 | 0.744548 |
12 | 0.685038 |
13 | 0.597331 |
14 | 0.597331 |
15 | 0.572055 |
16 | 0.562000 |
17 | 0.556495 |
18 | 0.352228 |
Updated Results (unofficial)
Some participants have improved their algorithms after the challenge
evaluation deadline and after the publication of the true labels.
The following tables show the updated rankings according to the results
reported in the workshop papers. Please note that these are not official
results and the organizers cannot guarantee their correctness.
Updated results Task A:
Average AUC | Team
0.9875 | Khurram Nazir Junejo, Mirza Muhammad Yousaf, Asim Karim, Lahore University of Management Sciences, Pakistan
0.9731 | Antonia Kyriakopoulou, Theodore Kalamboukis, Athens University of Economics and Business
0.9672 | Alexander Zien (invited speaker), ∇S3VM algorithm, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9588 | Nikolaos Trogkanis, National Technical University of Athens, Greece; Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9491 | Bernhard Pfahringer, University of Waikato, New Zealand
0.9487 | Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja, Inductis India Pvt Ltd
0.9300 | Gordon Cormack, University of Waterloo, Canada
Updated results Task B:
Average AUC | Team
0.9767 | Antonia Kyriakopoulou, Theodore Kalamboukis, Athens University of Economics and Business
0.9524 | Nikolaos Trogkanis, National Technical University of Athens, Greece; Georgios Paliouras, National Center of Scientific Research "Demokritos", Greece
0.9490 | Gordon Cormack, University of Waterloo, Canada
0.9374 | Alexander Zien (invited speaker), ∇S3VM algorithm, Max Planck Institute for Biological Cybernetics, Tuebingen, Germany
0.9074 | Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah, Chirag Taneja, Inductis India Pvt Ltd
Tasks
The competition is split into two tasks, A and B.
For each task we provide the following resources:
- Labeled training emails comprising 50% spam and 50% non-spam, collected from publicly available sources.
- Several inboxes from different users with unlabeled evaluation emails.
- Tuning data consisting of an additional set of labeled training data and inboxes with labeled tuning emails.
Tasks:
- Task A deals with a setting where many training emails and many unlabeled emails for each user are available. Our assumption is that, for this task, adaptation to the user-specific characteristics of the inboxes is critical for good classification performance.
- For Task B we provide many separate inboxes, but only little training data and little unlabeled data per inbox. We assume that, to be successful in this setting, a learning algorithm needs to generalize over the different users: user-specific properties should be taken into account, while the data from the other users is utilized in a way that enhances classification performance.
The goal in both tasks is to train separate spam/non-spam classifiers,
one for each user, that correctly classify the evaluation emails in the
inboxes. You are given the evaluation data without labels. You need to
submit the classification results for the evaluation data, which we will
use to calculate the classification performance based on the held-back
true labels.
The task-specific numbers of emails and inboxes can be found in the
following table:
| Task A | Task B
Number of labeled training emails | 4000 | 100
Number of emails within one evaluation/tuning inbox | 2500 | 400
Number of inboxes for evaluation | 3 | 15
Number of inboxes for tuning | 1 | 2
The tuning data can be used for parameter tuning but cannot be used to
augment the training data, because the word/feature IDs (the dictionary)
of the tuning data differ from those of the training/evaluation data.
Submission data will be a list of decision function values for each inbox. Please follow
the detailed instructions in section "Submission" for submitting your results.
You will also be asked to submit a short description of your algorithm (please see section
"Submission" for details). The interestingness of the algorithms will be
judged: non-straightforward approaches will be valued, and the most
innovative ideas will be selected by the Discovery Challenge chair and a
group of experts for the "Creativity Award".
Evaluation Criterion and Winner Selection
Evaluation Criterion
You are asked
to submit the output of your classifier as decision function values,
separately for each inbox (please see section "Submission" for details).
The evaluation will run on the held-back labels of the emails in the
users' inboxes. The evaluation criterion is the AUC value, the area
under the ROC curve (Receiver Operating Characteristic curve). A ROC
curve is a plot of the true positive rate vs. the false positive rate as
the prediction threshold sweeps through all possible values.
The area under this curve has the nice property that it equals the
probability that, when we draw one positive and one negative example at
random, the decision function assigns a higher value to the positive
than to the negative example.
We compute AUC values for each inbox separately and average them over all
inboxes of the task. In case you want to tune parameters of your
algorithm on the provided tuning data and do not know how to compute
the AUC value, we provide a small program with C source code below. You
do not need to use this program, but it may save you time.
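For illustration only (this is not the provided program), the following minimal C sketch computes the AUC of a single inbox directly from the pairwise interpretation above: it counts, over all (positive, negative) pairs, how often the positive example receives the larger decision function value, counting ties as one half. The example data in main is made up.

#include <stdio.h>

/* AUC of one inbox from decision function values and true labels
 * (+1 = spam, -1 = non-spam), using the pairwise definition:
 * fraction of (positive, negative) pairs ranked correctly,
 * ties counted as 0.5. O(n^2), fine for a few thousand emails. */
double auc(const double *score, const int *label, int n)
{
    double correct = 0.0;
    long pairs = 0;
    for (int i = 0; i < n; i++) {
        if (label[i] != 1) continue;           /* positives */
        for (int j = 0; j < n; j++) {
            if (label[j] != -1) continue;      /* negatives */
            pairs++;
            if (score[i] > score[j])        correct += 1.0;
            else if (score[i] == score[j])  correct += 0.5;
        }
    }
    return pairs > 0 ? correct / pairs : 0.0;
}

int main(void)
{
    /* Tiny hypothetical example: two spam and two non-spam emails. */
    double score[] = { 2.3, -0.4, 1.1, -1.7 };
    int    label[] = {  1,   -1,   1,   -1  };
    printf("AUC = %f\n", auc(score, label, 4));   /* prints AUC = 1.000000 */
    return 0;
}

Averaging such per-inbox values over the three (Task A) or fifteen (Task B) evaluation inboxes gives the score used for the ranking.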
Winner Selection
There will be four winners, two for each of the tasks A and B:
- Spam Filtering Performance Award - Task A,
- Spam Filtering Performance Award - Task B,
- Spam Filtering Creativity Award - Task A,
- Spam Filtering Creativity Award - Task B.
The winners of the "Spam Filtering Performance Award - Task A/B" will be
determined as follows: all participants are ranked according to their
average AUC value on the evaluation data, and the winner is the
participant with the maximum average AUC value.
The winner of the "Spam Filtering Creativity Award - Task A/B" will be
the participant whose algorithm stands out for its innovative ideas, as
judged by the challenge chair and a group of experts.
Download of Data Sets
Data set task A (7 MB)
Data set task B (2 MB)
AUC calculation C source code (8 kB)
Evaluation data task A and B with true labels (5 MB)
Format
The files you downloaded are archives compressed with zip. Most decompression programs
(e.g. Winzip, RAR) can decompress this format. The archives contain the following files:
Task | File name | Data set size | Description
Task A | task_a_labeled_train.tf | 4000 emails | Labeled training emails.
Task A | task_a_u00_eval.tf ... task_a_u02_eval.tf | 2500 emails each | Unlabeled evaluation data: 3 inboxes from different users.
Task A | task_a_labeled_tune.tf | 4000 emails | Labeled training emails for parameter tuning. Feature representation corresponds only to file task_a_u00_tune.tf.
Task A | task_a_u00_tune.tf | 2500 emails | Labeled test emails of one user's inbox for parameter tuning. Feature representation corresponds only to file task_a_labeled_tune.tf.
Task B | task_b_labeled_train.tf | 100 emails | Labeled training emails.
Task B | task_b_u00_eval.tf ... task_b_u14_eval.tf | 400 emails each | Unlabeled evaluation data: 15 inboxes from different users.
Task B | task_b_labeled_tune.tf | 100 emails | Labeled training emails for parameter tuning. Feature representation corresponds only to files task_b_u00_tune.tf and task_b_u01_tune.tf.
Task B | task_b_u00_tune.tf, task_b_u01_tune.tf | 400 emails each | Labeled test emails from two users' inboxes for parameter tuning. Feature representation corresponds only to file task_b_labeled_tune.tf.
We do not provide the raw text of the emails. The emails are in a
bag-of-words vector space representation; attributes are the term
frequencies of the words. We removed words that occur fewer than four
times in the data set, resulting in a dictionary size of about 150,000
words. The data set files are in the sparse data format used by SVMlight.
Each line represents one email; the first token in each line is the
class label (+1=spam; -1=non-spam; 0=unlabeled-evaluation-data).
The tokens following the label are pairs of word IDs and
term frequencies, in ascending order of the word IDs.
To give an example, the first line in task_a_labeled_train.tf starts like this:
1 9:3 94:1 109:1 163:1
This line shows the class label "1" (spam) followed by four
word/frequency pairs. The word ID of the first pair is 9, and the word
occurs 3 times within this email, as indicated by ":3".
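To illustrate the format, here is a minimal C sketch that parses such a line into its class label and (word ID, term frequency) pairs; the hard-coded line and the use of strtok/sscanf are just an example, not part of the provided tools.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one line of the sparse format: "<label> <id>:<tf> <id>:<tf> ..."
 * Labels: +1 = spam, -1 = non-spam, 0 = unlabeled evaluation data.      */
int main(void)
{
    char line[] = "1 9:3 94:1 109:1 163:1";   /* first tokens of the example line */
    char *tok = strtok(line, " ");
    int label = atoi(tok);                    /* class label */
    printf("label = %d\n", label);

    while ((tok = strtok(NULL, " ")) != NULL) {
        int id, tf;
        if (sscanf(tok, "%d:%d", &id, &tf) == 2)
            printf("word %d occurs %d time(s)\n", id, tf);
    }
    return 0;
}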
The program for the calculation of the AUC is written in C and compiles
with gcc. There is a makefile and a short readme included.
Organization
Challenge Chair: Steffen Bickel
Humboldt-Universität zu Berlin, Germany
The Discovery Challenge 2006 is organized in cooperation with Strato Rechenzentrum AG.