Parallel databases vs MySQL

Following my discussion with Mikael Ronstrom regarding Parallel databases through comments on a old blog post ( mysql-for-a-massively-parallel-database/#comments ) I want to clarify why i like the Shard-Query tool.

Shard-Query is a open source distributed parallel query engine for MySQL. It offers a way to answer the question of parallel query execution across shards http://code.google.com/p/shard-query/.

But first what about MySQL Cluster as a parallel database.
MySQL Cluster has effectively a very efficient MPP architecture. It is a remarkable product for massive write and primary key lookups. Its fast failover, thanks to it true shared nothing architecture is quite unique. With respect to these two points it is much better that other architectures. Oracle RAC with its shared disk architecture, distributed cache and distributed Lock manager is much less efficient on both points.

Bust as usual you have the default associated with your qualities. Regarding complex read queries MySQL cluster has always been weak.I agree that some great progress have been with condition pushdown and Adaptive Query localization that allow to push joins and filtering to the data nodes. Both mechanism alleviate the sql node from doing work (filtering, joins) and also reduce network bandwidth consumption.

But this is far from what do parallel databases like Greenplum or Microsoft Parallel Datawarehouse Solution. To have a truly parallel database you really need to analyze the query and split it in pieces. There is a lot of intelligence to be put at the SQL node to split the query plan into subtasks. You sometime even have to implement a multi path approach(a map reduce pattern) to fully execute the request. In that case the sql node need to implement a lot of intelligence regarding parallel query execution. The MySQL Cluster sql node does not currently do it. If this work was ever to happen in the SQL layer it would be great to have it work with various storage engines. it should be abstracted from any particular storage engine like NDB.

Shard-Query presented by Justin Swanhart at FOSDEM 2013 (slides) does it at the client layer. It is using php and Gearman to spread the work in parallel to various MySQL shards. This is not perfect but this is a nice approach that can greatly help in many cases. It can parallelize across shards or using MySQL partitioning. Another interesting point about Shard-Query is that it can also works with MySQL compliant column store.

I would prefer to have this intelligence implemented in the MySQL server with full guaranty of correctly supporting the client/server protocol but this is a much harder approach.

Leave a Reply Cancel reply