Real-Time Active Inference and Learning (RAIL)

A Bayesian inference tool using active probing for real-time, adaptive problem diagnosis in distributed systems.

Date Posted: July 17, 2007

What is Real-Time Active Inference and Learning (RAIL)?

As distributed computer systems and networks continue to grow in size and complexity, systems management tasks such as real-time fault localization and problem diagnosis become significantly more challenging and call for higher levels of automation. RAIL is a part of our ongoing development of self-healing systems capable of making inferences about their own behavior, such as diagnosing faults and performance degradations.

How does it work?

This tool uses a cost-efficient technique for adaptive diagnosis that combines probabilistic inference with online, active selection of the most informative measurements, called probes. Probes are end-to-end test transactions that collect information about the availability and performance of a distributed system. Examples of probes include ping or traceroute commands; Web-, e-mail-, and database-access transactions; and application-specific transactions. Given the probe results (symptoms), RAIL performs Bayesian inference to find the most likely explanation (cause), such as a failed system component. Candidate causes include both hardware components (such as network nodes and links) and software components (such as database tables and applications).
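The inference step described above can be illustrated with a minimal sketch. This is not RAIL's implementation: it assumes at most one failed component, a uniform prior, and a simple noise model in which each probe reports the wrong outcome with a small fixed probability. The component names, probe names, and the `posterior` function are hypothetical.

```python
def posterior(prior, probe_paths, results, noise=0.05):
    """Posterior over single-fault hypotheses given probe results.

    prior:       dict mapping hypothesis -> prior probability
                 (a component name, or "no_fault")
    probe_paths: dict mapping probe name -> set of components it traverses
    results:     dict mapping probe name -> True if the probe failed
    noise:       assumed probability that a probe reports the wrong outcome
    """
    unnormalized = {}
    for hypothesis, p in prior.items():
        likelihood = 1.0
        for probe, failed in results.items():
            # A probe is predicted to fail if the failed component is on its path.
            predicted_fail = hypothesis in probe_paths[probe]
            likelihood *= (1 - noise) if predicted_fail == failed else noise
        unnormalized[hypothesis] = p * likelihood
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Hypothetical system: three components and three probes.
probe_paths = {
    "ping": {"router"},
    "http": {"router", "web"},
    "sql": {"router", "db"},
}
prior = {h: 0.25 for h in ("no_fault", "router", "web", "db")}

# Symptoms: only the HTTP probe failed.
results = {"ping": False, "http": True, "sql": False}

post = posterior(prior, probe_paths, results)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Here the only hypothesis consistent with all three symptoms is a failure of the `web` component, so it receives most of the posterior mass. A real deployment would need to handle multiple simultaneous faults and a richer dependency model than this sketch does.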

An important difference between RAIL's approach and "passive" data analysis is RAIL's ability to select and execute probes online. This approach, called active probing, uses an information-theoretic criterion called information gain to adaptively select only a small set of the most informative probes at any given time. It significantly reduces the overall number of probes required to diagnose a problem (in many cases, by up to 75%), which in turn reduces probing costs and speeds up diagnosis.
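As a rough illustration of probe selection by information gain, the sketch below scores each candidate probe by the expected reduction in entropy of the current belief over fault hypotheses and picks the highest-scoring one. It is a simplified example under assumed conditions (single-fault hypotheses, a fixed probe noise rate), not RAIL's actual algorithm. All function, probe, and component names are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict hypothesis -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def information_gain(belief, path, noise=0.05):
    """Expected entropy reduction from running one probe.

    belief: dict hypothesis -> current probability
    path:   set of components the probe traverses
    noise:  assumed probability that the probe reports the wrong outcome
    """
    expected_after = 0.0
    for failed in (True, False):
        # Joint probability of each hypothesis with this probe outcome.
        joint = {}
        for hypothesis, p in belief.items():
            predicted_fail = hypothesis in path
            likelihood = (1 - noise) if predicted_fail == failed else noise
            joint[hypothesis] = p * likelihood
        p_outcome = sum(joint.values())
        if p_outcome > 0:
            updated = {h: v / p_outcome for h, v in joint.items()}
            expected_after += p_outcome * entropy(updated)
    return entropy(belief) - expected_after


def select_probe(belief, probe_paths, noise=0.05):
    """Return the name of the probe with the highest information gain."""
    return max(
        probe_paths,
        key=lambda name: information_gain(belief, probe_paths[name], noise),
    )


# Hypothetical candidates: each probe covers a different set of components.
probe_paths = {
    "ping": {"router"},
    "http": {"router", "web"},
    "full": {"router", "web", "db"},
}
belief = {h: 0.25 for h in ("no_fault", "router", "web", "db")}

print(select_probe(belief, probe_paths))
```

With a uniform belief, the `http` probe splits the four hypotheses evenly into two groups, so it is expected to be the most informative. In an adaptive loop, the belief would be updated with the observed result and the selection repeated until one hypothesis is sufficiently likely.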

About the technology author(s)

Irina Rish

Irina Rish, Ph.D., is a research staff member at IBM® T. J. Watson Research Center. She received an M.S. in applied mathematics from Moscow Gubkin Institute, Russia, and a Ph.D. in computer science from the University of California, Irvine. Dr. Rish's primary research interests are in the areas of probabilistic inference, machine learning, and information theory. In particular, she has been working on approximate inference in probabilistic graphical models, information-theoretic experiment design, active learning, and their applications to automated management of complex distributed systems. Dr. Rish has also taught several machine learning courses in the Electrical Engineering and Computer Science departments of Columbia University as an adjunct professor.

Natalia Odintsova worked at IBM T. J. Watson Research Center on algorithms and software for problem diagnosis in distributed systems. She was one of the primary developers of the RAIL tool and its demonstration, as well as of tools for network topology analysis and cost-efficient probe selection. Mrs. Odintsova holds an M.S. in physics from Lomonosov University in Moscow, Russia, and an M.S. in computer science from Polytechnic University, N.Y. Her research interests include machine learning, distributed systems management, and graph visualization algorithms.

Genady Grabarnik, Ph.D., works in the Distributed Computing Department at the IBM T. J. Watson Research Center in Hawthorne, N.Y. He received his Ph.D. from the Mathematical Institute of the Academy of Science, Uzbekistan. Dr. Grabarnik's research interests include operator algebras, automated planning, data mining, and management of computer systems.

The team would like to thank Shang Guo and David Loewenstern for further extending the functionality of RAIL, as well as Mark Brodie, Alina Beygelzimer, and Shang Ma for contributing their ideas at various stages of the project. The team would especially like to thank Herb Lee and his EPP team, particularly Mariusz Sabath and Jeff Perry, for helping to combine RAIL with the EPP measurement tool, and for multiple other contributions at various stages of their joint project, from putting together the RAIL/EPP demonstration to maintaining the production version in multiple customer environments.
