Parallel databases vs MySQL

Following my discussion with Mikael Ronstrom regarding Parallel databases through comments on a old blog post ( mysql-for-a-massively-parallel-database/#comments ) I want to clarify why i like the Shard-Query tool.

Shard-Query is a open source distributed parallel query engine for MySQL. It offers a way to answer the question of parallel query execution across shards http://code.google.com/p/shard-query/.

But first what about MySQL Cluster as a parallel database.
MySQL Cluster has effectively a very efficient MPP architecture. It is a remarkable product for massive write and primary key lookups. Its fast failover, thanks to it true shared nothing architecture is quite unique. With respect to these two points it is much better that other architectures. Oracle RAC with its shared disk architecture, distributed cache and distributed Lock manager is much less efficient on both points.

Bust as usual you have the default associated with your qualities. Regarding complex read queries MySQL cluster has always been weak.I agree that some great progress have been with condition pushdown and Adaptive Query localization that allow to push joins and filtering to the data nodes. Both mechanism alleviate the sql node from doing work (filtering, joins) and also reduce network bandwidth consumption.

But this is far from what do parallel databases like Greenplum or Microsoft Parallel Datawarehouse Solution. To have a truly parallel database you really need to analyze the query and split it in pieces. There is a lot of intelligence to be put at the SQL node to split the query plan into subtasks. You sometime even have to implement a multi path approach(a map reduce pattern) to fully execute the request. In that case the sql node need to implement a lot of intelligence regarding parallel query execution. The MySQL Cluster sql node does not currently do it. If this work was ever to happen in the SQL layer it would be great to have it work with various storage engines. it should be abstracted from any particular storage engine like NDB.

Shard-Query presented by Justin Swanhart at FOSDEM 2013 (slides) does it at the client layer. It is using php and Gearman to spread the work in parallel to various MySQL shards. This is not perfect but this is a nice approach that can greatly help in many cases. It can parallelize across shards or using MySQL partitioning. Another interesting point about Shard-Query is that it can also works with MySQL compliant column store.

I would prefer to have this intelligence implemented in the MySQL server with full guaranty of correctly supporting the client/server protocol but this is a much harder approach.

1 comment to Parallel databases vs MySQL

  • Hi,

    Shard-Query comes with a MySQL proxy module which uses the MySQL gearman UDF functions to send queries to shard-query and get the results back. The LUA script for the proxy is very simple:
    1) intercept query
    2) send query to shard_query
    3) get result from shard_query as JSON encoded array. The already is already structured for direct proxy output.
    4) decode JSON and return output array to client

    The same thing could be done with mysqld actually, using the open source ‘shim’ from Kickfire (if the source is still somewhere on the net somewhere) to intercept the query, send it via the gearman C function, then return the results :)

Leave a Reply

  

  

  


six + = 8

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" highlight="">