The CONNECT Storage engine implement the concept of a table made of multiple tables. These underlying tables can be distributed remotely. For example the underlying remote tables can be of ODBC or MySQL table type. this allows to execute distributed queries. What is nice is that we can execute this distributed query with parallelism.
How does it work ?
To explain it let us suppose we have 4 nodes : Node0 and Node1 Node2 Node3
Node0 has a MariaDB 10.0.4 installed with the CONNECT storage engine activated.
MariaDB [dbt3]> install plugin connect soname 'ha_connect.so';
On the 3 other nodes we have MariaDB or plain MySQL installed.
the 'lineitem' table of the dbt3 benchmark has been created and loaded with one chunk of the 'lineitem' table on each node. This is a sharded table. For example to generate the second chunk of 3 chunks 'lineitem' table:
./dbgen -T L -fF -q -b dists.dss -s 1 -C 3 -S 2
Now on node0 we can create a CONNECT table. This table definition embed a SRCDEF parameter that force the computation of an aggregate on the remote server :
create Table lineitem1 ENGINE=CONNECT TABLE_TYPE=MYSQL
SRCDEF='select l_suppkey, sum(l_quantity) qt from dbt3.lineitem3 group by l_suppkey'
connection='mysql://proxy:manager1@node1:3306/dbt3/lineitem3';
if we do a select on this table on node0 we get the result data that has been aggregated on node1. We access only one shard of the data.
MariaDB [dbt3]> select * from lineitem1;
+-----------+------+
| l_suppkey | qt |
+-----------+------+
| 1 | 2302 |
| 2 | 1960 |
...
| 9999 | 1908 |
| 1000 | 1756 |
+-----------+------+
10000 rows in set (26.82 sec)
Let us now do the same on for node2 and node3 :
create Table lineitem2 ENGINE=CONNECT TABLE_TYPE=MYSQL
SRCDEF='select l_suppkey, sum(l_quantity) qt from dbt3.lineitem3 group by l_suppkey'
connection='mysql://proxy:manager1@node2:3306/dbt3/lineitem3';
create Table lineitem3 ENGINE=CONNECT TABLE_TYPE=MYSQL
SRCDEF='select l_suppkey, sum(l_quantity) qt from dbt3.lineitem3 group by l_suppkey'
connection='mysql://proxy:manager1@node3:3306/dbt3/lineitem3';
Now we can use the ability of the the CONNECT storage engine to build a table that allows to see the 3 shards as a single table :
create Table alllineitem (`l_suppkey` INT(4) NOT NULL,`qt` DOUBLE ) ENGINE=CONNECT
TABLE_TYPE=TBL
table_list='lineitem1,lineitem2,lineitem3';
We can now query this table. All the aggregation computation will be done on the remote nodes and not on node0:
MariaDB [dbt3]> select * from alllineitem;
+-----------+------+
| l_suppkey | qt |
+-----------+------+
| 1 | 2302 |
| 2 | 1960 |
...
| 9998 | 2395 |
| 9999 | 1908 |
| 1000 | 1756 |
+-----------+------+
30000 rows in set (1 min 19.23 sec)
This basically takes 3 times longer than querying a single shard. This is because the request is executed sequentially on node 1 2 and 3.
We can request the CONNECT storage engine to do the job in parallel by adding the extra option : option_list='thread=1';
create Table alllineitem2 (`l_suppkey` INT(4) NOT NULL,`qt` DOUBLE ) ENGINE=CONNECT
TABLE_TYPE=TBL
table_list='lineitem1,lineitem2,lineitem3' option_list='thread=1';
If we rerun the query we get
MariaDB [dbt3]> select * from alllineitem2;
+-----------+------+
| l_suppkey | qt |
+-----------+------+
| 1 | 2302 |
| 2 | 1960 |
...
| 9999 | 1908 |
| 1000 | 1756 |
+-----------+------+
30000 rows in set (26.69 sec)
We go back to the same execution time as with a single shard ! Is not that cool. I am sure you have a lot of objections, ideas about this approach. I do. But it is nice anyways.
If you want to know more do not forget to attend to the MariaDB CONNECT Storage Engine webinar :
November 7, 2013 - 5 pm CET / 4 PM UTC / 8 AM PST Register
and if you want to give it a try you are welcome and the doc to help is here:
MariaDB Connect Storage Engine documentation
Your help is welcome : bug reports, documentation fixes, usage feedback, suggestions.
Be indulgent it is still Alpha software 😉