With ever-larger volumes of heterogeneous data available, processing it and extracting knowledge demands increasingly complex software architectures. In this context, Apache Spark provides a powerful and efficient approach to large-scale data processing. This talk will give a general overview of the Spark framework, describe how to launch applications in a cluster, and briefly introduce its machine learning library, MLlib. A demo will then show how to simulate a Spark cluster on a local machine using images from a public Docker Hub repository. Finally, another demo will show how to save time by writing unit tests that validate jobs before running them on a cluster.
Joel received his Bachelor's degree in Computer Science from Universidade Federal de Pelotas (Brazil) in 2005 and his PhD in Informatics from Universidad de Salamanca (Spain) in 2010. His thesis focused mainly on Big Data and Recommender Systems. For two years he was part of the R&D sector of HP Brazil and subsequently was responsible for building recommender system architectures at Mobjoy Games. For the last three years, he has worked as a data scientist at Tail Target.