IEEE ICDM Data Mining Competition

Call for Proposals

ICDM 2011 Data Mining Contest

We invite proposals for organizers of the ICDM 2011 Data Mining Contest. The Data Mining Contest is an integral part of the ICDM conference and provides an opportunity for teams of scientists and domain experts to compete in order to develop data mining techniques for real-world applications.


Proposals should contain the following information.

1. Contest title and abstract

2. Contact information for the organizer(s)

3. General description of the problem

4. Description of the dataset to be used

5. Description of the specific competition tasks

6. Evaluation metrics or tool to be used for evaluation

7. Plans for contest web site

8. Prize structure

9. Short biography of organizer(s)

10. Description of compensation (if any) that could be awarded to winners (organizers are encouraged to solicit awards and prizes from sponsoring entities)

General Guidelines

•  The dataset should be interesting, available, and sufficiently large

•  The task(s) should be interesting and accessible without too much domain knowledge

•  The evaluation measures should be well-defined

•  The contest timeline should adhere to the dates below

•  Proposers are encouraged to review past ICDM contests

Important Dates

•  April 29, 2011: Deadline for contest proposals emailed to both contest co-chairs

•  May 13, 2011: Notification of acceptance

•  June 17, 2011: Contest begins, announcement to participants (tasks, datasets, metrics, website available)

•  September 16, 2011: Contest ends, final submission of contest results

•  September 23, 2011: Submission of 2-3 page report from each contest team in ICDM format

•  October 7, 2011: Submission of camera-ready papers by selected teams

•  December 11-14, 2011: Winners announced at ICDM conference

ICDM Contest Co-Chairs

• Larry Holder, Washington State University

• Ashok Srivastava, NASA Ames



NASA TV Show: The Leading Edge

I’ll be on NASA TV on March 23, 2011, at 11 am ET for the Leading Edge TV show. You can watch on NASA TV or via a streamed broadcast. We will also be doing a NASA Chat later that day.

NASA Chat, Mar 23, 2pm ET

When an airplane flies, hundreds of data streams fly from it every second — pilot reports, incident reports, control positions, instrument positions, warning modes. But there’s so much data, it’s been nearly impossible for airlines to do anything other than look back for the cause of something that’s already happened. Data mining is the art of digging through mountains of data when you don’t know what you’re looking for or what you might find. Popular search engines like Google™ do this every second. NASA is mining terabytes of aviation data to find issues before they become incidents. Ashok Srivastava will talk to us about what computer tools NASA is building to do the digging.

Jeff Hamlett will talk about how Southwest Airlines is already using data mining “gold” to update their flight operations.

  • How is NASA figuring out how to find the needle in a haystack when we don’t know what either looks like?
  • What’s an “algorithm”? What’s an “anomaly”? What’s a “precursor” and why do data miners use those words all the time?
  • What has Southwest changed in its practices thanks to data mining?
  • How is our data mining different from Google’s or Amazon’s? How is it the same?

Join the chat on Mar 23, a few minutes before 2 pm ET at:


Conference on Intelligent Data Understanding 2010

The conference ended last week and featured many international participants as well as numerous talks and posters from top researchers. It included a keynote talk by Stephen Boyd from Stanford University and invited talks by speakers such as Piero Bonissone from GE, Vipin Kumar from the University of Minnesota, Dinkar Mylaraswamy from Honeywell, and Rama Nemani from NASA.

Thanks to everyone for participating and see you at CIDU 2011.

Studying Pilot Fatigue with Data Mining Technologies

Here is an article that recently appeared in Flight International about some novel data mining and machine learning techniques that we are developing to potentially detect pilot fatigue.  We are using anomaly detection techniques as well as predictive methods to look for objective indicators of fatigue.  This is ongoing work between a number of groups within NASA and outside partners.

Knowledge Discovery and Data Mining 2010

The annual ACM SIGKDD conference is the premier international forum for data mining researchers and practitioners from academia, industry, and government to share their ideas, research results and experiences. KDD-2010 will feature keynote presentations, oral paper presentations, poster sessions, workshops, tutorials, panels, exhibits, demonstrations, and the KDD Cup competition.

KDD-2010 will run from July 25-28 in Washington, DC and will feature hundreds of practitioners and academic data miners converging on one location.

I’ll be giving an invited talk on Discovering Precursors to Aviation Safety Incidents and participating on a panel on the Next Generation of Transportation Systems:  Greenhouse Emissions and Data Mining.  We also have a paper on Multiple Kernel Learning for Heterogeneous Anomaly Detection.

Algorithms for Modern Massive Data Sets

The Workshops on Algorithms for Modern Massive Data Sets (MMDS) will address algorithmic, mathematical, and statistical challenges in modern large-scale data analysis. The goals of this series of workshops are to explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured scientific and internet data sets, and to bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to promote cross-fertilization of ideas.

Next Generation Data Mining

The world is facing a number of critical challenges. Finding the next generation of solutions for energy supply, greenhouse emissions, and transportation problems is critical to sustaining the world and our civilization. The energy crisis is a major challenge that must be addressed to sustain and further develop the world. Greenhouse emissions are widely believed to be connected with energy consumption. The transportation system has a significant effect on energy consumption and on greenhouse emissions. Many problems related to greenhouse emissions and the transportation industry are critically connected to the consumption and supply of energy. Information processing and advanced data analysis techniques are likely to play important roles in solving these problems for the next generation.

Efficient production, distribution, and consumption of existing and alternative energy would require supporting information processing networks in order to adaptively control and protect the underlying physical systems. Understanding the effects of greenhouse emissions requires advanced data analysis techniques for understanding remotely sensed data. Reducing the carbon footprints of buildings, vehicles, and airplanes would require continuous monitoring of sensors and detecting deviation from desired behavior. Designing the next generation of transportation networks becomes particularly challenging in the context of increasing demand for energy supplies and the need to reduce greenhouse emissions. Sensor networks for highways and vehicles equipped with diagnostic data buses, along with the availability of machine-to-machine wireless communication networks, are going to make the role of advanced data mining techniques very important in the transportation industry. Computing in itself is under scrutiny from the perspective of its effect on greenhouse emissions and pollution. We need to pay close attention to the environmental impacts of computing and the supporting infrastructure. Overall, we need to explore technology for sustainable computing and computing technology for a sustainable world.

The “Next Generation Data Mining (NGDM’09) Summit: Dealing with Energy Crisis, Greenhouse Emissions, and Transportation Challenges” will bring together data mining researchers, scientists, and engineers from diverse backgrounds along with domain experts.

NGDM’09 will focus on the following areas:

1)      Energy crisis, information processing and data mining
2)      Greenhouse emissions, climate change, and data mining
3)      Transportation, emissions, and data mining

The summit will generate a report based on the presentations and discussions of the participants.


Chandra Bhat, University of Texas at Austin
Kirk Borne, George Mason University
Alok Choudhary, Northwestern University
Umesh Dayal, HP Labs
Wei Fan, IBM T. J. Watson Research Laboratory
Douglas Fisher, National Science Foundation
Auroop Ganguly, Oak Ridge National Laboratory
Johannes Gehrke, Cornell University
Carla Gomes, Cornell University
Vipin Kumar, University of Minnesota
Rich Lechner, IBM
Edward Maibach, George Mason University
Mark McGranaghan, Electric Power Research Inst.
Paul Melby, MITRE Corporation
Robert Neff, UMBC
Dino Pedreschi, Univ. of Pisa & Northeastern Univ.
Krishna Rajan, Iowa State University
Shashi Shekhar, University of Minnesota
Ashok Srivastava, NASA Ames Research Center
Eugene Tierney, US Env. Protection Agency
Ramasamy Uthurusamy, General Motors (Ret.)
Brian Worley, Oak Ridge National Laboratory
Philip Yu, University of Illinois at Chicago
Vince Mow, Mactec Federal Programs
Chris Stock, Verizon

Tutorial on Anomaly Detection and Prediction

A tutorial at the International Workshop on Structural Health Monitoring 2009

September 8th, 2009 1pm-5pm

The tutorial will present methods and applications in the area of data mining and machine learning for large-scale systems such as those found in structural health management applications. The purpose of the tutorial is to discuss and disseminate new publicly available data mining algorithms for anomaly detection and prediction in large-scale applications including distributed systems. We will discuss technical hurdles and possible solutions. Specific focus areas include:

  • New anomaly detection algorithms that are fast and highly accurate.
  • New prediction algorithms appropriate for massive data sets.
  • Distributed data mining algorithms which are provably correct (they give the same answer whether data is centralized or distributed).
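
As a toy illustration of the last point (this is not one of the tutorial's algorithms), here is a minimal Python sketch of a detector that gives the same answer whether the data is centralized or distributed. It assumes a simple z-score rule and that each site can share three numbers: its count, sum, and sum of squares. The names `Stats`, `find_anomalies`, and the sample readings are hypothetical and chosen only for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    """Mergeable summary of a data set: count, sum, and sum of squares."""
    n: int = 0
    s: float = 0.0
    ss: float = 0.0

    @classmethod
    def from_values(cls, values):
        return cls(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Adding the summaries of two sites gives the summary of their union.
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)

    def mean(self):
        return self.s / self.n

    def std(self):
        # Population standard deviation from the summary; clamp tiny negatives.
        return math.sqrt(max(self.ss / self.n - self.mean() ** 2, 0.0))


def find_anomalies(values, stats, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = stats.mean(), stats.std()
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical readings held at three separate sites.
sites = [[10, 11, 9, 10], [10, 12, 50], [9, 11, 10]]

# Distributed: each site summarizes locally, then only summaries are merged.
merged = Stats()
for site in sites:
    merged = merged.merge(Stats.from_values(site))
distributed_hits = [v for site in sites for v in find_anomalies(site, merged)]

# Centralized: pool all raw data in one place first.
all_values = [v for site in sites for v in site]
central = Stats.from_values(all_values)
central_hits = find_anomalies(all_values, central)

print(merged == central, distributed_hits, central_hits)
```

The two routes agree here because the summary is additive, so merging per-site summaries reproduces the summary of the pooled data. With floating-point inputs the sums can differ in the last bits depending on the order of addition, so an exact match should only be expected for integer-valued data like the sample above.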

Tutorial Format

The 4-hour tutorial will be organized as a series of short lectures with adequate time for audience participation. We will provide an overview of the algorithms covered as well as demonstrations of the methods on real-world data sets. The tutorial will feature multiple speakers who are experts in data mining.


Ashok Srivastava, Ph.D., NASA Ames Research Center

Nikunj Oza, Ph.D., NASA Ames Research Center

Santanu Das, Ph.D., UARC, NASA Ames Research Center

Kanishka Bhaduri, Ph.D., MCT Inc, NASA Ames Research Center

Cost:  $50

To register, please visit: Tutorial Registration