Facebook open sources its SQL-on-Hadoop engine, and the web rejoices


SUMMARY: Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.

Facebook has open sourced Presto, the interactive SQL-on-Hadoop engine the company first discussed in June. Presto is Facebook’s take on Cloudera’s Impala or Google’s Dremel, and it already has some big-name fans in Dropbox and Airbnb.

Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been demand for something that lets users run ad hoc, exploratory queries on Hadoop data, much as they would with a massively parallel relational database.

Presto is 10 times faster than Hive for most queries, according to Facebook software engineer Martin Traverso in a blog post detailing today’s news.

Source: Facebook

Technologically, Hive and Presto are very different, mainly because the former relies on MapReduce to carry out its processing and the latter does not. This is by and large the difference that makes Presto suitable for low-latency queries, while the MapReduce-based Hive can take a long time, especially over Facebook’s many petabytes of data, because it must scan everything in the cluster and requires lots of disk writes. Presto also works with a variety of data sources beyond the Hadoop Distributed File System and uses ANSI SQL, whereas Hive uses its own SQL-like language.

Presto is currently running in numerous Facebook data centers and the company has scaled a single cluster up to 1,000 nodes. More than 1,000 employees run queries on Presto, and they do more than 30,000 of them per day over a petabyte of data. Traverso’s post gives a lot more details about how Presto works and how Facebook plans to improve its speed and functionality in the near term.

A Presto screenshot

However, I think the most interesting part about Presto might be less technological and more about its effects on the Hadoop industry, which is projected to be worth tens of billions of dollars in the next few years. The mere fact that Facebook chose to create a website for the project says something about how seriously the company takes it. And although Facebook has technically open sourced quite a few Hadoop improvements over the years, this is the first since Hive where I’ve noticed such fast (if any) uptake from external companies.

It will be interesting to watch how, if at all, Presto affects adoption of Cloudera’s Impala, Hortonworks’ Stinger project, Pivotal’s HAWQ or any other of the myriad SQL-on-Hadoop engines currently fighting for mindshare. The fact that Presto is open source and ready to use certainly has to be a big draw for some users, and could help it establish a solid user base while other technologies are still maturing.

Facebook isn’t looking to compete with other projects and doesn’t have a horse in the race from a business perspective (it will likely go along using and improving Presto at its own pace regardless of what happens), but serious uptake could inspire the Hadoop vendors to change their strategies when it comes to the SQL engines they support. Much of the early innovation in Hadoop came from power users (including Yahoo and Facebook) rather than software companies, and it’s possible we haven’t seen the end of that trend.