A few years ago, Hive brought SQL to Hadoop and enabled Hadoop's widespread adoption by data analysts. Today, Spark has become the tool of choice for data engineers, who use it to build powerful data pipelines. However, Spark is fairly complex: using it efficiently requires some understanding of its inner workings (shuffle, caching, memory, …). We will cover the challenges we faced in bringing Spark to an audience of less technical users, some of the solutions (like auto-tuning), and how improvements to Spark (memory management, statistics, new APIs, …) help bring its power to every data citizen.
Clément Stenac is a passionate software engineer and CTO of Dataiku, the maker of DSS, an integrated development environment that helps data analysts, scientists and engineers collaborate to build and run data applications. Clément was previously head of development at Exalead, where he led the design and implementation of large-scale search engine software. He also has extensive experience with open source software, as a former developer on the VideoLAN (VLC) and Debian projects.