Data + AI Summit 2020 Highlights: What’s new for the Apache Spark community?
I just attended the first edition of the Data + AI Summit, the rebranded name of the Spark Summit conference organized twice a year by Databricks. This was the European edition, meaning the talks took place in a European-friendly time zone. In practice, it drew participants from all over the world, since the conference was fully virtual and free because of the ongoing pandemic.
The conference featured pre-recorded video talks aired at a specific time (as if they were live), with the speakers available to answer questions in real time in a live chat. All the talks are available for free replay on the conference’s website: http://dataaisummit.com/. In total, the conference featured 125 seminars, keynotes, and breakout sessions from technical leaders worldwide.
In this article we’ll go over the highlights of the conference, focusing on features recently added to Apache Spark or arriving in the next few months.
Keynotes: Focus on PySpark and COVID-19
On Wednesday morning, November 18, we had a keynote introduced by Ali Ghodsi, CEO and co-founder of Databricks. He brought quite a few people on stage, highlighting in particular the push to improve the Python experience of Apache Spark, with discussions of Project Zen and the new Koalas API, as well as the performance boosts enabled by the latest Spark 3.0 release.
After about an hour and a half of technical discussion, Malcolm Gladwell (the well-known author of The Tipping Point and Blink) discussed COVID-19 data and how it could have helped us navigate the pandemic differently, had we used it to make better decisions. He went on to apply data science to the pandemic, which he said Japan, China, and South Korea handled well, but Europe and the USA did not.
He offered three guiding principles: framing, usefulness, and courage. We need to use the data to properly frame our understanding of the problem. We need to make sure the data we have selected is the most useful (and useful doesn’t mean perfect). Lastly, by courage he means “to have the courage to follow the data, to do what the data tells you to do.” As an example, he said that “10% of the American population has a risk of dying from COVID-19” based on obesity, and that we need to take special precautions to help the obese, but we don’t because of a social norm against shaming people. He felt that those who follow these three principles will become the proper leaders of the rest of us.
Getting Started with Apache Spark on Kubernetes, as this architecture will be declared “production ready” in Spark 3.1
Link to the talk: https://databricks.com/session_eu20/getting-started-with-apache-spark-on-kubernetes
In this talk, Data Mechanics co-founder Jean-Yves Stephan (an ex-Databricks engineer himself) went over the pros and cons of running Spark on Kubernetes, followed by a concrete guide to making Spark on Kubernetes stable and cost-effective at scale.
They also revealed that Spark on Kubernetes will officially be declared Generally Available and Production-Ready with the upcoming version of Spark (3.1). Update (March 2021): Spark 3.1 has been officially released!
One specific feature is the ability to migrate shuffle files before a Spark executor is terminated (due to dynamic allocation or a spot-instance kill), which will significantly improve Spark’s robustness and performance in elastic cloud architectures.
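As a rough illustration, here is how graceful decommissioning is switched on through configuration in the Spark 3.1 era. This is a minimal sketch based on my reading of the 3.1 settings, so double-check the flag names against the documentation for your Spark version:

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark 3.1-era decommissioning flags (verify against
# your version's docs before relying on them).
spark = (
    SparkSession.builder
    .appName("graceful-decommissioning-demo")
    # Let executors be decommissioned gracefully instead of killed outright.
    .config("spark.decommission.enabled", "true")
    # Migrate block data off executors that are shutting down...
    .config("spark.storage.decommission.enabled", "true")
    # ...including shuffle blocks, so running jobs don't recompute them.
    .config("spark.storage.decommission.shuffleBlocks.enabled", "true")
    .getOrCreate()
)
```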
His co-founder Julien Dumazert then illustrated the development workflow with a live demo running on the Data Mechanics platform. He used a single script to build a Docker image with all the required dependencies, push it to the registry, and run it at scale on the Kubernetes cluster, all with a 30-second iteration cycle. This is a big win for the developer experience compared to my experience deploying Spark on traditional Hadoop YARN clusters.
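For comparison, here is roughly what pointing a PySpark driver at a Kubernetes cluster looks like. The API-server address and container image below are placeholders of my own, not the demo’s actual values:

```python
from pyspark.sql import SparkSession

# Rough sketch of a Spark-on-Kubernetes session; the master URL and
# container image are placeholders.
spark = (
    SparkSession.builder
    .master("k8s://https://my-cluster-api-server:443")
    .config("spark.kubernetes.container.image", "my-registry/my-spark-app:latest")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```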
What is New with Apache Spark Performance Monitoring in Spark 3.0
Link to the talk: https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0
The speaker, Luca Canali, is a data engineer at CERN, where Spark (running on Kubernetes) is used to process an exabyte of data from the Large Hadron Collider. What was exciting about this talk is that it clearly explained how to write custom metrics-monitoring code for Spark 3, building on the SparkListener API and the expanded metrics instrumentation in Spark 3.0. This will make monitoring Spark 3 applications much easier, and it is a big win for the Spark dev community.
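To give a flavor of the new monitoring hooks (this is my own minimal sketch, not the speaker’s code): Spark 3.0 added an experimental Prometheus endpoint for executor metrics that can be enabled purely through configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable Spark 3.0's experimental Prometheus endpoint
# for executor metrics (not the speaker's exact setup).
spark = (
    SparkSession.builder
    .config("spark.ui.prometheus.enabled", "true")
    # Collect process-tree memory metrics on executors (off by default).
    .config("spark.executor.processTreeMetrics.enabled", "true")
    .getOrCreate()
)
# Executor metrics can then be scraped from the driver UI at
#   /metrics/executors/prometheus
```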
Some open-source projects have already started to use these new capabilities in Apache Spark 3.0, such as Data Mechanics Delight, a free, cross-platform Spark UI replacement with new metrics and data visualizations. You can install it on your Spark infrastructure by downloading their open-source Spark agent (which uses this listener interface).
Update (April 2021): Delight has been officially released!
From Query Plan to Query Performance: Supercharging your Spark Queries using the Spark UI SQL Tab
Link to the talk: https://databricks.com/session_eu20/from-query-plan-to-query-performance-supercharging-your-apache-spark-queries-using-the-spark-ui-sql-tab
If you love join strategies and efficiency, you will be intrigued by this use case: joining a ship to a port, but only if the ship’s location made it physically possible for the ship to be at that port.
They created clever geographical hash regions based on latitude and longitude and only attempt the join when the geo-hashes match, which converts the query into an equi-join that Spark implements as a BroadcastHashJoin. This geo-filtering technique sped up the join considerably. Whew!
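Here is a toy reconstruction of the idea in PySpark. This is my own sketch, with made-up data and a crude rounding-based bucketing standing in for their real geo-hash:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample data standing in for the real ship and port feeds.
ships = spark.createDataFrame(
    [("ship-1", 51.95, 4.14), ("ship-2", 35.45, 139.65)],
    ["ship_id", "ship_lat", "ship_lon"],
)
ports = spark.createDataFrame(
    [("Rotterdam", 51.95, 4.13), ("Yokohama", 35.45, 139.64)],
    ["port", "port_lat", "port_lon"],
)

# Crude stand-in for their geo-hash: bucket coordinates into coarse
# cells so matching becomes an equality test on the cell key.
def with_cell(df, lat_col, lon_col):
    return df.withColumn(
        "cell", F.concat_ws(",", F.round(lat_col, 1), F.round(lon_col, 1))
    )

# The equi-join on `cell` lets Catalyst pick a BroadcastHashJoin when
# the ports side is small, instead of a cross join plus distance filter.
joined = with_cell(ships, "ship_lat", "ship_lon").join(
    F.broadcast(with_cell(ports, "port_lat", "port_lon")),
    on="cell",
    how="inner",
)
joined.show()
```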
Interoperability Between Koalas and Apache Spark
For those who haven’t heard of Koalas and its relationship to PySpark, “the Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark.”
Koalas allows for a much richer experience on top of PySpark DataFrames: the pandas functionality we know and love is exposed, yet at runtime we still get a Spark job. One example of the newer features is generating a Koalas DataFrame with the df.to_koalas() function, which also supports custom indexing of the resulting Koalas DataFrame (as shown below).
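The screenshot from the talk isn’t reproduced here, but the pattern looks roughly like this (a minimal sketch using the standalone databricks.koalas package from the Spark 3.0 era):

```python
import databricks.koalas  # importing patches Spark DataFrames with to_koalas()
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# index_col picks a custom index for the Koalas DataFrame
# instead of the default generated one.
kdf = sdf.to_koalas(index_col="id")
print(kdf.head())
```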
Project Zen: Improving Apache Spark for Python Users
I used to work at Databricks, in 2015, and at that time Python was not nearly as popular as it is today. Back then people still believed that Scala would take off, but it never really did. Instead, we have seen a dramatic rise in the popularity of Python worldwide, at the expense of Java (now considered old) and Scala (considered obscure these days). According to the speaker, Hyukjin Kwon, PySpark applications now account for 68% of the queries run in Databricks notebooks (vs. 11% for Scala). That is a huge jump in only five years; back then Python was at around 35%. Simply amazing.
This talk was about the ways PySpark has been found lacking and the effort to fortify it for the current needs of the PySpark dev community. The talk referenced the clever poem known as “The Zen of Python”:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
The goal of the Apache Spark Project Zen (JIRA) is to make PySpark the first-class citizen it was always marketed to be. Those of us on the inside knew from the start that it was a second-class citizen, so it is high time to say so openly and to build out the PySpark API and documentation properly. As of June 2020, PySpark was being downloaded more than a million times each week.
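One concrete deliverable is inline type hints for PySpark, so editors and type checkers can validate PySpark code. A tiny sketch of my own to show what that enables:

```python
from pyspark.sql import DataFrame

# With PySpark's inline type hints (part of Project Zen), tools like
# mypy and IDE autocompletion can check code like this.
def top_rows(df: DataFrame, n: int) -> DataFrame:
    """Return the first n rows as a new DataFrame."""
    return df.limit(n)
```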
Conclusion: Spark is alive and thriving
I was at the Spark Summit in 2015, and it was extremely exciting. Back then I was just learning Spark and starting to teach and consult about Spark for Databricks. Since then I have traveled the world teaching and mentoring teams on Spark. If the Data + AI Summit we just attended is any indicator, Apache Spark is still alive and thriving.
About the “AI” part of the event title, well, those who know will say: “If it is written in PowerPoint it is AI, and if it is written in Python it is Machine Learning.” I will stick to Python and PySpark ML, and now Koalas, thank you!
About the author
Laurent Weichberger is a full time consultant at Hashmap, Inc. where he works as a Sr. Cloud Architect. He is also a published author of spiritual books which are available on amazon.com. Laurent lives with his wife and children (and their pet cat), in North Carolina.