High Energy Physics/CS Special Seminar
Abstract: The demand for analyzing large-scale telemetry, machine, and quality data is rapidly increasing in the software industry. Data scientists are becoming increasingly common within software teams. We conducted a large-scale survey of 793 professional data scientists at Microsoft to understand their educational backgrounds, the problem topics they work on, their tool usage, and their activities.
To process massive quantities of data, data scientists leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Hadoop, and Apache Spark. While DISC systems help to address the scalability challenges of big data analytics, they also introduce new challenges in debugging. In this talk, I will first describe interactive, real-time debugging primitives that we designed for Apache Spark, a next-generation data-intensive scalable cloud computing platform, and briefly describe the data provenance and optimized incremental computation capabilities that we built within Apache Spark to support debugging effectively and efficiently. Then, I will describe automated debugging that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs.
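To make the idea of a minimum set of failure-inducing inputs concrete, here is a minimal sketch of classic delta debugging over a list of input records. It is an illustration only, not the speaker's system: it runs in plain Python rather than on Apache Spark, it does not use data provenance to narrow the search, and the names `ddmin`, `records`, and `fails` are hypothetical. It assumes the failure is deterministic and that the full input reproduces it.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a 1-minimal sublist on which `fails` is still True.

    1-minimal means: removing any single remaining record makes the
    failure disappear. `fails` is a hypothetical test oracle supplied by
    the caller (for example, "rerun the job and check for the bad output").
    """
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")

    n = 2  # current granularity: number of chunks to split the input into
    while len(current) >= 2:
        n = min(n, len(current))
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]

        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk, in the original order.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # One chunk alone reproduces the failure: keep only that chunk.
                current, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The chunk is irrelevant to the failure: drop it.
                current, n, reduced = complement, max(n - 1, 2), True
                break

        if not reduced:
            if n >= len(current):
                break  # already at single-record granularity: 1-minimal
            n = min(len(current), n * 2)  # retry with finer chunks

    return current


if __name__ == "__main__":
    # Hypothetical failure: the job breaks only when records 7 and 13 co-occur.
    print(ddmin(list(range(1, 21)), lambda s: 7 in s and 13 in s))
```

In this sketch every candidate subset requires rerunning the failing test, which is where the cost lies on large inputs; the talk's combination with data provenance and incremental computation addresses that cost, and is not modeled here.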