Harp: Collective Communication on Hadoop

Friday, 7 November 2014 - 11:00am - 12:00pm
Dr. Judy Qiu, Assistant Professor of Computer Science, School of Informatics and Computing, Indiana University
SL 108 IUPUI Campus

Dr. Judy Qiu is an assistant professor of Computer Science in the School of Informatics and Computing at Indiana University and an assistant director of the school’s Digital Science Center. Her research interests are parallel and distributed systems, cloud computing, and high-performance computing. Qiu leads the SALSA project, involving professional staff and Ph.D. students from the School of Informatics and Computing. SALSA focuses on data-intensive computing at the intersection of cloud and multicore technologies with an emphasis on scientific data analysis applications by using MapReduce and traditional parallel computing approaches.

Her research has been funded by NSF, NIH, Microsoft, Google and Indiana University. She is a recipient of NSF CAREER Award in 2012 and Indiana University Trustees Award for TeachingExcellence in 2013-2014.

Link to website: http://www.cs.indiana.edu/~xqiu/

Many scientific applications are data-intensive. It is estimated that organizations with high end computing infrastructures and data centers are doubling the amount of data that they are archiving every year. Harp extends MapReduce, enabling HPC-Cloud Interoperability. We show how to apply Harp to support large-scale iterative computations that are common in many important data mining and machine learning applications. Further one needs additional communication patterns than those made familiar in MapReduce.

This leads us to the Map-Collective programming model that captures the full range of traditional MapReduce and MPI features, which is built on a new communication abstraction, Harp, that is integrated with Hadoop. It provides optimized communication operations on different data abstractions such as arrays, key-values and graphs.

With improved expressiveness and performance on collective communication, Hadoop/Harp can do in-memory communication between Map tasks without writing intermediate data to HDFS, enabling simultaneous support of applications from HPC to Cloud. Our work includes a detailed performance evaluation on IaaS or HPC environments such as FutureGrid and the Big Red II supercomputer, and provides useful insights to both frameworks and applications.

View event flyer