While Big Data is a recent trend, data-intensive computing in the eScience and eEngineering space has evolved steadily. The recent emphasis is the culmination of a continuum spanning the last decade. While raw benchmarks may be one approach to evaluating research advances in distributed systems and Big Data frameworks [i], validation on real scientific and engineering applications serves the same purpose while also helping science progress.
The abstractions, algorithms and frameworks that form the lab’s research are firmly grounded in computer science principles, spanning programming models, distributed algorithms, scheduling/optimization techniques and heuristics, and software architectures. These problems are inherent to emerging distributed systems such as Clouds and accelerators. However, data-driven scientific domains offer further inspiration in defining novel but pragmatic problems to tackle. Consequently, it is valuable to interact with research scientists and faculty from other disciplines to identify and characterize Grand Challenge problems that they face but that do not have a natural, out-of-the-box software solution. Rather, such problems require key advances in computer science research to arrive at a viable solution. Initiatives such as the National Science Foundation’s (NSF) Software Infrastructure for Sustained Innovation (SI2) program identify software cyber-infrastructure as a key component of transformative scientific advancement. In this spirit, the research activities of the lab will initially engage with two data-driven scientific domains, selected based on prior experience in working with researchers and applications in this space as well as the existing expertise within IISc and peer institutions: Cyber Physical Systems (CPS) and Computational Biology.
CPS is a class of applications that lies at the intersection of physical infrastructure and computational cyber-infrastructure. The operational aspect of CPS imposes an inherent sense-and-actuate logic that requires computing systems to monitor and manage the physical system. As a result, the physical system can operate more efficiently and reliably. Smart Power Grids [ii] and Smart Transportation are canonical examples of CPS. These also exhibit Big Data features, where data fusion and management for continuous integration and analysis at large volumes and varieties are important for effective optimization. Consequently, Cyber Physical Informatics is a prime space for sustained research.
Computational Biology and Genomics have seen rapid changes due to enhancements in Next Generation Sequencing (NGS) techniques [iii]. This has also led to an explosion in the quantity of data generated by biologists, and presents serious challenges to its effective analysis and interpretation. Research in bioinformatics is still nascent when it comes to developing parallel algorithms and utilizing distributed systems for managing the data volume. Of particular importance, given the need to feed our growing population, is plant genomics, where there is much less focus on tools and algorithms compared to human genomics. Plant genome data also has unique properties that make plant genomes harder to sequence using existing tools, and the analysis even harder to scale. Genomics is a classic Big Data domain with as yet untackled problems in scalable computing and distributed algorithms that can effectively deal with the growing ability to cheaply sequence genomes.