Computer science and engineering research group awarded NSF grant

A group of researchers in the Department of Computer Science and Engineering has been awarded a National Science Foundation (NSF) grant for research on improving distributed computing environments.

Graduate student Srikanth Sastry, TEES Research Assistant Professor Scott Pike, and Professor Jennifer Welch received the award for their project, "A Prescription for Partial Synchrony," which will be funded through 2014.

Their research seeks to develop a new theory of partial synchrony for building reliable distributed services on top of crash-prone distributed systems. Here, synchrony refers to the timing properties of events within a distributed system. The new theory will allow such systems to detect and tolerate processor crashes.

Welch explains, "If you have a well-behaved system where you know how fast the processors run and how long messages take to reach their destinations, it is easy to detect whether a processor within the system has crashed. Unfortunately, many real-life distributed systems are only partially synchronous. That is, the knowledge of how fast the processors run and how long messages take is limited."

The new theory of partial synchrony will ensure that even with such limited knowledge, if one part of the system crashes unannounced, the applications and services running on top of the system continue to behave correctly, eliminating errors that can occur because of machine crashes.

Sastry continues, "The old models of partial synchrony have used real time to specify bounds on message delays and process speeds. Unfortunately, such models are intrinsically limited, because the characteristic property of partial synchrony is not timeliness with respect to real time but rather fairness. By fairness, we mean the relative order of steps, which is independent of the real time it takes to execute those steps.
"We are placing more importance on the correct ordering of computational steps and data communication than on the real times at which they take place, or the real-time rates at which computation and communication occur."

Today's large distributed systems, such as those that support online commerce, require significant human intervention to remain reliable. Typically, applications running on these systems rely on real-time bounds to detect crashes. The researchers contend that much of this human effort could be reduced by replacing existing real-time-based techniques with the fairness-based approaches to be developed in the project.

Sastry is a doctoral student in the Department of Computer Science and Engineering. Before joining Texas A&M in 2004, he worked as a software engineer at Cisco Systems from 2001 to 2004. In 2001, he graduated with a B.Tech. in computer science and engineering from the National Institute of Technology, Calicut, in India.

Pike works on the Windows Fundamentals Performance team at Microsoft and holds a research assistant professor appointment with the Department of Computer Science and Engineering at Texas A&M. He received his Ph.D. from Ohio State University and his B.S. in philosophy from Yale University.

Welch received her S.M. and Ph.D. from the Massachusetts Institute of Technology and her B.A. from the University of Texas at Austin. She currently holds the Chevron II Professorship in the Department of Computer Science and Engineering at Texas A&M. Her research interests are in the theory of distributed computing; recent foci have included metamorphic robots, mobile ad hoc networks, and shared memory consistency conditions.