Probability perspective on MySQL Group Replication and Galera Cluster

Comparing Oracle MySQL Group Replication and Galera Cluster from a probability perspective is quite interesting.

At commit time both use a group certification process that requires network round trips. The time required for these network round trips is what mainly determines the cost of a transaction. Let us try to compute an estimate of the certification process cost. The duration of a network round trip can be modeled as a random variable with an associated probability distribution.

For Galera Cluster all nodes take part in the process, so the time taken is dimensioned by the worst round-trip duration.
For Group Replication certification relies on majority voting, so the time taken is dimensioned by the worst round trip within the majority.

Said another way: if we have N+1 nodes (N being an even number, to be sure to have a clear majority vote), we randomly get N uncorrelated round-trip durations. We exclude one node (the local node where the transaction is taking place) as no network round trip is required for it, even if it logically takes part in the certification process. This local cost is in any case much smaller than the other round-trip costs.

For Galera Cluster the longest of the N durations is what dimensions the duration of the certification process.
For Group Replication the maximum of the N/2 smallest durations dimensions the duration of the certification process.

For example, for a 9-node cluster we need to compare the slowest of 8 round trips vs the slowest of the 4 fastest among 8 ;-)
These two metrics have very different statistical behaviour! One converges toward the mean while the other grows with the number of nodes and, unfortunately, depends on the standard deviation. The bigger the number of nodes and the bigger the standard deviation, the more it diverges from the mean value :-(

What is very important to notice is that the more you increase the total number of nodes in the cluster, the bigger the maximum of the N random durations grows. The speed of this growth depends on the standard deviation and on the number of nodes. On the other hand, the largest of the N/2 smallest durations converges to the mean and shows less variance as nodes are added.
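
This behaviour is easy to check with a small Monte Carlo simulation. The sketch below assumes normally distributed round-trip times with an arbitrary mean and standard deviation, for a 9-node cluster (N = 8 remote round trips); all numbers are illustrative:

```python
import random

random.seed(42)

MEAN, STDDEV = 1.0, 0.3   # hypothetical round-trip time (ms) and its spread
N = 8                     # 9-node cluster: 8 remote round trips
TRIALS = 100_000

galera_total = 0.0   # waits for the slowest of all N round trips
group_total = 0.0    # waits for the slowest round trip within the fastest N/2

for _ in range(TRIALS):
    rtts = sorted(random.gauss(MEAN, STDDEV) for _ in range(N))
    galera_total += rtts[-1]          # max of the N durations
    group_total += rtts[N // 2 - 1]   # max of the N/2 smallest durations

print(f"Galera-style certification cost (avg):            {galera_total / TRIALS:.3f}")
print(f"Group-Replication-style certification cost (avg): {group_total / TRIALS:.3f}")
```

The max-of-N estimate ends up well above the mean round-trip time, while the majority-based estimate stays close to (in fact slightly below) it, which is exactly the gap discussed above.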

So the more nodes we have in a cluster, the bigger the performance gap between Group Replication and Galera Cluster. In the same way, with a high standard deviation Galera Cluster has a natural tendency to drift far from the mean. It is important to notice that the standard deviation naturally increases on a loaded system. On a loaded system the network round-trip cost can statistically show outlier values far from the mean. These outliers do not matter for Group Replication, but they do for Galera Cluster, as the cost of certification is based on the worst round-trip duration.

This very simple statistical fact certainly explains part of the difference seen in benchmarks comparing Group Replication and Galera Cluster performance.

For those interested in the mathematics behind this behaviour, the expected value of the maximum of N normal random variables is described here:

http://math.stackexchange.com/questions/473229/expected-value-for-maximum-of-n-normal-random-variable

This document basically tells us that if you have N independent random variables distributed according to a normal law with mean M and standard deviation S, then the expected value of the maximum of all these random variables is:
M + k*S
where in our case k is a coefficient that depends on the number of nodes:

0.56418958354775628695 for 2 nodes
0.84628437532163443042 for 3 nodes
1.0293753730039641321 for 4 nodes
1.1629644736405196128 for 5 nodes
1.2672063606114712976 for 6 nodes
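
These coefficients can be reproduced numerically. For n independent standard normal variables, the expected maximum is the integral of x * n * pdf(x) * cdf(x)^(n-1). A short Python sketch (the integration bounds and step count are arbitrary choices, just taken wide and fine enough):

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def expected_max(n, lo=-12.0, hi=12.0, steps=200_000):
    """E[max of n standard normals] via trapezoidal integration of
    x * n * pdf(x) * cdf(x)**(n-1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        w = 0.5 if i in (0, steps) else 1.0   # trapezoidal end-point weights
        total += w * x * n * normal_pdf(x) * normal_cdf(x) ** (n - 1)
    return total * h

for n in range(2, 7):
    print(f"k for {n} nodes: {expected_max(n):.20f}")
```

The printed values match the table above to many decimal places.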

So to summarise: if we have 4 nodes, the Galera Cluster certification process will cost on average M + 1.03*S.
This can be big, as it depends on the standard deviation, and unfortunately when you benchmark you are close to the limit, which can introduce very high variance in the network round-trip cost.

The distribution was supposed to be a normal law, but for network round trips a better model might be an Erlang distribution. I think that might make the result even worse. Some readers of this blog post may have better knowledge than me of system/network modelling: any objection is welcome ;-)

MariaDB Backup released

MariaDB Backup has been released with MariaDB Server 10.1.23. It offers support for MariaDB Compression and Encryption on Linux and Windows Server.

MariaDB Engineering rocks !

This is different from Oracle MySQL, where you need a commercial licence for your server to make backups with MySQL Enterprise Backup (MEB) ;-) MariaDB Backup is under the GPL licence and can be used freely.

Of course there is the XtraBackup product, but it is a separate project, not part of the Oracle MySQL server codebase and not developed by the Oracle engineering team that drives Oracle MySQL. XtraBackup does not support Windows Server or MariaDB Encryption. MariaDB Backup is a fork of XtraBackup.

https://mariadb.com/resources/blog/mariadb-backup-released-mariadb-server-10123

MariaDB Backup is Alpha-maturity code, which means it is OK for testing but not recommended for production environments.

MariaDB AWS Key Management Service (KMS) Encryption Plugin

MariaDB 10.1 introduced Data at Rest Encryption. By default we provide a file_key_management plugin. This is a basic plugin storing keys in a file that can itself be encrypted. This file can come from a USB stick removed once the keys have been brought into memory. But this remains a basic solution, not suitable for strict security compliance rules.
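
A minimal sketch of such a setup might look like this (the file path is a placeholder; check the option names against the MariaDB documentation for your version):

```ini
# my.cnf fragment -- illustrative values only
[mysqld]
plugin_load_add = file_key_management
# keys file: one "<key id>;<hex-encoded key>" entry per line
file_key_management_filename = /mount/usbkey/keys.txt
# optionally encrypt InnoDB tables by default
innodb_encrypt_tables = ON
```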

To secure keys in a better way we have introduced a new plugin called the Amazon Web Services (AWS) Key Management Service (KMS) Encryption Plugin. We provide a setup guide and an advanced setup guide with some nice Go code to do two-factor authentication (sample code written by Kolbe).

The AWS KMS encryption plugin is only compiled into the MariaDB Enterprise binaries. The source code of this plugin is GPL and part of the MariaDB Server repository available here. The instructions for building the plugin from source are there.

This plugin is a good example of how to write a plugin that interfaces with a KMS. It can serve as a template for developing plugins for other KMSs (Thales, Gemalto/SafeNet, Azure Key Vault…). The KMS itself can be software-only or associated with an HSM (Hardware Security Module) to introduce hardware-protected keys and hardware encryption through a cryptoprocessor. For some businesses this is part of compliance rules (PCI PTS).

MariaDB and Native JSON support ?

A question raised by my previous post is: what about MariaDB and native JSON support? In my previous post I mentioned the possibility of using the MariaDB CONNECT storage engine to store and access JSON content in a normal text field. Of course, having a native JSON datatype brings more value: it introduces JSON validation and more efficient access to attributes through an optimized storage format.

Native JSON support, indexing of purely virtual columns and optimizer improvements to decide smartly to use such an index will come with MariaDB 10.2.
These JSON features are part of the MariaDB 10.2 roadmap. Here is the MariaDB 10.2 roadmap.
For a more detailed feature list for MariaDB 10.2, here is the full list of MariaDB 10.2 community-voted features.
The MariaDB engineering team has taken its time to evaluate various aspects of a native JSON implementation. Two different implementations exist: beside Oracle's MySQL implementation, Facebook has developed its own implementation, called DocStore, which is part of the Facebook MySQL branch. Both have been evaluated by MariaDB engineering. Another area that must be addressed is the current evolution of the SQL/JSON standard.

Oracle MySQL 5.7 already supports a native JSON data type. This JSON datatype comes with an extensive set of functions to create/modify/search JSON content. Oracle MySQL 5.7 has also introduced virtual columns (called generated columns). MySQL 5.7 generated columns can be materialized or not. MariaDB already had virtual columns, but Oracle MySQL 5.7 is currently ahead of MariaDB 10.1, as it is possible to add an index on a purely virtual column that is not persistent. The MySQL 5.7 optimizer has also been improved to detect the use, in the WHERE clause, of an expression corresponding to a virtual column. These virtual columns with indexing capabilities allow optimized filtering of the native JSON datatype through a WHERE clause.
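
As an illustration, a MySQL 5.7 sketch of this pattern might look like the following (the table and attribute names are made up):

```sql
CREATE TABLE products (
  id INT PRIMARY KEY,
  doc JSON,
  -- purely virtual generated column extracting one JSON attribute
  price INT GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.price')) VIRTUAL,
  INDEX price_idx (price)
);

-- the optimizer can use price_idx to filter on the JSON attribute
SELECT id FROM products WHERE price < 100;
```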

Neither MariaDB 10.2 nor Oracle MySQL 5.7 implements in-place update of JSON content. This is an area where a fully JSON-based datastore like MongoDB performs better. This advantage is balanced by the extra space consumed for padding. This extra padding space is required to avoid having to move data around and to avoid updating many indexes for a single field update. This area definitely needs to be benchmarked, and the results depend on the JSON content size and access pattern.

However, for certain use cases native JSON has some drawbacks compared to storing JSON in a text field. The insert rate is much lower, as the JSON needs to be parsed and converted to the internal storage format. If the client application completely handles the JSON content, we are wasting server resources.

So, MariaDB fans and users, be patient: native JSON is coming with MariaDB 10.2. And of course feel free to comment. Our engineering and development are not done behind closed doors and you can be part of it ;-)

MariaDB JSON text indexing

It is not new that we can store JSON content in a normal table text field. This has always been possible. But two key features were missing: filtering based on JSON content attributes, and indexing of the JSON content. With the MariaDB 10.1 CONNECT storage engine we offer support for external content, including JSON files. The MariaDB CONNECT storage engine also comes with a set of JSON-related UDFs. This allows us to do the following.

MariaDB > CREATE TABLE jassets (
       item_name varchar(32) primary key, /* A common attribute for all items */
       json_cols varchar(512)  /* JSON data will be stored here */
     );
MariaDB > INSERT INTO jassets VALUES ('MariaDB T-shirt', Json_Object('blue' color, 'XL' size));
MariaDB > INSERT INTO jassets VALUES ('Thinkpad Laptop', Json_Object('black' color, 500 price));
...
MariaDB > SELECT * FROM jassets;
+-----------------+-------------------------------+
| item_name       | json_cols                     |
+-----------------+-------------------------------+
| MariaDB T-shirt | {"color":"blue","size":"XL"}  |
| Thinkpad Laptop | {"color":"black","price":500} |
...
+-----------------+-------------------------------+

Through the JSON UDFs we can select a limited number of field values from the JSON content. If we want to restrict access to only the records having a specific value for one of the JSON attributes, we can write a query like:

MariaDB [test]> SELECT item_name, JsonGet_String(json_cols, 'color') AS color FROM jassets 
    ->   where JsonGet_String(json_cols, 'color') = 'blue';
+-------------------+-------+
| item_name         | color |
+-------------------+-------+
| MariaDB T-shirt   | blue  |
+-------------------+-------+

This is quite inefficient. MariaDB needs to do a full table scan, and for each row the JSON content must be parsed to evaluate the JSON extract function (defined as a UDF by the MariaDB CONNECT storage engine). Is there any possible speedup?

First, in MariaDB we can define a virtual column implementing this JSON function call. We can make this virtual column persistent. We can then put an index on this persistent column (in MariaDB 10.1, being persistent is mandatory to be able to add an index :-( )

ALTER TABLE jassets ADD COLUMN color varchar(10) AS (JsonGet_String(json_cols, 'color')) PERSISTENT;
CREATE INDEX colori ON jassets (color);

We can then issue a query like:

MariaDB [test]> SELECT item_name, color FROM jassets where color='blue';
+-------------------+-------+
| item_name         | color |
+-------------------+-------+
| MariaDB T-shirt   | blue  |
+-------------------+-------+

This is much more efficient, as only records satisfying the WHERE clause are accessed through the index. This does not require any full table scan. JSON parsing will still be required to extract the color value from the JSON content of the rows that match the WHERE clause.

It must be noticed that having a JSON field with multiple associated indexes can turn out to be very inefficient on updates. Note also that an InnoDB table can contain a maximum of 64 secondary indexes; this can be a limitation for complex JSON structures. After a modification of the JSON content there is currently no way to determine which indexes are impacted: all of them need to be updated.

This technique of storing JSON in a text field works with any storage engine. This can be of interest for specific workloads requiring a continuous high insert rate, like what we can obtain with TokuDB and its fractal tree indexing technology.

Here is the Documentation on MariaDB CONNECT Storage Engine support for JSON.

Using JSON’s Arrays for MariaDB Dynamic Columns

The JSON format includes the concept of an array. A JSON object can contain an attribute of array type. We have seen that we can use the UDFs (user defined functions) provided by the MariaDB CONNECT storage engine to implement dynamic columns.

Let us create a table with a text column containing a JSON string and insert some JSON content into this column. In JSON an attribute can be an array; here the size attribute is a JSON array. We will use this array to manage the list of available sizes for a product.

MariaDB > CREATE TABLE jassets (
       item_name varchar(32) primary key, 
       json_cols varchar(512) 
     );
MariaDB > INSERT INTO jassets VALUES ('T-shirt', 
       Json_Object('blue' color, Json_Array('XL','M') size));
MariaDB > INSERT INTO jassets VALUES ('Thinkpad Laptop', Json_Object('black' color, 500 price));
MariaDB > SELECT * FROM jassets;
+-----------------+------------------------------------+
| item_name       | json_cols                          |
+-----------------+------------------------------------+
| T-shirt         | {"color":"blue","size":["XL","M"]} |
| Thinkpad Laptop | {"color":"black","price":500}      |
+-----------------+------------------------------------+

Through the available MariaDB JSON UDFs it is possible to add or remove values in this array:

MariaDB > UPDATE jassets SET json_cols = json_array_add(json_cols, "S", 0, 'size')  WHERE item_name='T-shirt';
MariaDB > SELECT * FROM jassets;
+-----------------+----------------------------------------+
| item_name       | json_cols                              |
+-----------------+----------------------------------------+
| T-shirt         | {"color":"blue","size":["S","XL","M"]} |
| Thinkpad Laptop | {"color":"black","price":500}          |
+-----------------+----------------------------------------+
MariaDB >
MariaDB > UPDATE jassets SET json_cols = json_array_delete(json_cols, 2, 'size')  WHERE item_name='T-shirt';
MariaDB > SELECT *  from jassets;
+-----------------+------------------------------------+
| item_name       | json_cols                          |
+-----------------+------------------------------------+
| T-shirt         | {"color":"blue","size":["S","XL"]} |
| Thinkpad Laptop | {"color":"black","price":500}      |
+-----------------+------------------------------------+

This new way of addressing the dynamic column problem using JSON arrays gives great flexibility. This capability to use arrays does not currently exist in MariaDB dynamic columns.
This hybrid approach, mixing relational fixed-column content with JSON schemaless content, overcomes the limitations of a pure NoSQL database. It relies on the strongly proven and consistent capabilities of a relational database.

As I said previously, this JSON access through the MariaDB CONNECT storage engine should not be confused with the native JSON support available in Oracle MySQL 5.7 (and in Facebook's MySQL DocStore).

Using JSON as Dynamic Columns with MariaDB

MariaDB CONNECT storage engine handles access to JSON files through standard SQL. It comes with a set of UDFs (user defined functions) to manipulate the JSON format. This JSON content can be stored in a normal text column. This approach can be used to implement dynamic columns. The dynamic column concept was first introduced with MariaDB 5.3.

The schema-less flexibility offered by a free-content JSON column gives MariaDB a NoSQL flavor. This creates more flexibility for developers than the usual static column definition of a relational database.
Let us create a table with a column json_cols of type varchar(512). This column can contain a JSON string that will embed dynamic attributes.

MariaDB > CREATE TABLE jassets (
       item_name varchar(32) primary key, /* A common attribute for all items */
       json_cols varchar(512)  /* JSON data will be stored here */
     );

Let us insert some JSON content into this column. It represents content that would not fit in a strictly typed, predefined column set. This schemaless JSON content can be freely modified, as in a NoSQL database. This is a very classic pattern for modeling products and attributes on an e-commerce web site.

MariaDB > INSERT INTO jassets VALUES ('MariaDB T-shirt', Json_Object('blue' color, 'XL' size));
MariaDB > INSERT INTO jassets VALUES ('Thinkpad Laptop', Json_Object('black' color, 500 price));
MariaDB > SELECT * FROM jassets;
+-----------------+-------------------------------+
| item_name       | json_cols                     |
+-----------------+-------------------------------+
| MariaDB T-shirt | {"color":"blue","size":"XL"}  |
| Thinkpad Laptop | {"color":"black","price":500} |
+-----------------+-------------------------------+

We can access a subset of the attributes embedded in this JSON column's content.

SELECT item_name, JsonGet_String(json_cols, 'color') AS color FROM jassets;
+-----------------+-------+
| item_name       | color |
+-----------------+-------+
| MariaDB T-shirt | blue  |
| Thinkpad Laptop | black |
+-----------------+-------+
2 rows in set (0.00 sec)

It is possible to add or remove attributes from the dynamic column:

MariaDB > UPDATE jassets SET json_cols=Json_Object_Delete(json_cols, 'price')
    -> WHERE JsonGet_String(json_cols, 'color')='black';
MariaDB > UPDATE jassets SET json_cols=Json_Object_Add(json_cols, '3 years' warranty)
    -> WHERE item_name='Thinkpad Laptop';
 
MariaDB > select * from jassets;
+-----------------+----------------------------------------+
| item_name       | json_cols                              |
+-----------------+----------------------------------------+
| MariaDB T-shirt | {"color":"blue","size":"XL"}           |
| Thinkpad Laptop | {"color":"black","warranty":"3 years"} |
+-----------------+----------------------------------------+
2 rows in set (0.00 sec)

We can also list all attributes names of the dynamic columns, or get them together with their values in JSON format:

SELECT item_name, Json_Object_List(json_cols) FROM jassets;
+-----------------+-----------------------------+
| item_name       | Json_Object_List(json_cols) |
+-----------------+-----------------------------+
| MariaDB T-shirt | ["color","size"]            |
| Thinkpad Laptop | ["color","warranty"]        |
+-----------------+-----------------------------+

As we have seen before, we can also get the full JSON value of the column, as it is plain text. This is very useful for a client application that directly manipulates JSON (a JavaScript app, for example).

SELECT item_name, json_cols FROM jassets;
+-----------------+----------------------------------------+
| item_name       | json_cols                              |
+-----------------+----------------------------------------+
| MariaDB T-shirt | {"color":"blue","size":"XL"}           |
| Thinkpad Laptop | {"color":"black","warranty":"3 years"} |
+-----------------+----------------------------------------+

This new way of addressing the dynamic column problem with the JSON format gives great flexibility. This approach is open to the use of a JSON-friendly language like JavaScript on the client side. In some use cases the full JSON content will be sent to the client, where it will be created, validated and modified.

This JSON access through the MariaDB CONNECT storage engine should not be confused with the native JSON support available in MySQL 5.7 (and in Facebook DocStore). Storing JSON in a binary optimized format and storing JSON as standard text each have their advantages and drawbacks.

The advantage of the text format is that it works with any storage engine: InnoDB, TokuDB, RocksDB.
If the JSON content is handled by the client, there is no overhead paid on the MariaDB server for validation and parsing. The binary format also has a space consumption impact.

On the contrary, the binary format might be more efficient for server-side manipulations.

MariaDB CONNECT Storage Engine JSON Autodiscovery

The MariaDB CONNECT storage engine offers access to JSON files and allows you to see an external JSON file as a MariaDB table.
A nice feature of the CONNECT storage engine is its capability to auto-discover a table structure when the table corresponds to external data. In our case the CONNECT storage engine will automatically define the columns based on the JSON file's hierarchical structure.
This is possible thanks to a feature of the MariaDB storage engine API. This auto-discovery feature allows a CREATE statement to delegate the discovery of the table structure to the chosen storage engine.

The CONNECT storage engine also uses this auto-discovery feature for other table types, like ODBC tables. This relieves the developer of the burden of defining all the columns that have already been defined in the target data source.

Let us start from the file “biblio3.json”. This file consists of an array of objects; some object properties have object or array values:

[
{
  "ISBN": "9782212090819",
  "LANG": "fr",
  "SUBJECT": "applications",
  "AUTHOR": [
    {
      "FIRSTNAME": "Jean-Christophe",
      "LASTNAME": "Bernadac"
    },
    {
      "FIRSTNAME": "François",
      "LASTNAME": "Knab"
    }
  ],
  "TITLE": "Construire une application XML",
  "PUBLISHER": {
    "NAME": "Eyrolles",
    "PLACE": "Paris"
  },
  "DATEPUB": 1999
  },
  {
    "ISBN": "9782840825685",
    "LANG": "fr",
    "SUBJECT": "applications",
    "AUTHOR": [
      {
        "FIRSTNAME": "William J.",
        "LASTNAME": "Pardi"
      }
    ],
    "TITLE": "XML en Action",
    "TRANSLATED": {
      "PREFIX": "adapté de l'anglais par",
      "TRANSLATOR": {
        "FIRSTNAME": "James",
        "LASTNAME": "Guerin"
      }
    },
    "PUBLISHER": {
      "NAME": "Microsoft Press",
      "PLACE": "Paris"
    },
    "DATEPUB": 1999
  }
]

To use the CONNECT storage engine JSON auto-discovery capabilities, the CREATE TABLE statement is minimalist. There is no need to describe what should be retrieved from the JSON file and how it should be processed:

MariaDB > CREATE TABLE jsample engine=CONNECT table_type=JSON 
file_name='/var/lib/mysql/json/biblio3.json';

We can check how the table has been created with the SHOW CREATE TABLE statement:

MariaDB > show create table jsample;
...
CREATE TABLE `jsample` (
  `ISBN` char(13) NOT NULL,
  `LANG` char(2) NOT NULL,
  `SUBJECT` char(12) NOT NULL,
  `AUTHOR` varchar(256) DEFAULT NULL,
  `TITLE` char(30) NOT NULL,
  `TRANSLATED` varchar(256) DEFAULT NULL,
  `PUBLISHER` varchar(256) DEFAULT NULL,
  `DATEPUB` int(4) NOT NULL
) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='JSON' `FILE_NAME`='biblio3.json';

Auto discovery does a mapping from hierarchical JSON data to tabular data. Doing a CREATE TABLE with auto discovery is a good way to help build a more precise CREATE TABLE once the JSON structure has been discovered. After auto discovery the column names come from the object keys and the column definitions are deduced from the object values.

The column sizes are calculated from the file as the maximum length of the corresponding column when it holds a normal value. For columns that are JSON arrays or objects, the column is specified as a VARCHAR string of length 256, supposedly big enough to contain the sub-object's concatenated values. Nullable is set to true if the column is null or missing in some rows or if its JPATH contains arrays.

MariaDB [test]> select * from jsample;
+---------------+------+--------------+--------------------------+--------------------------------+--------------------------+--------------+-----------------------+---------+
| ISBN          | LANG | SUBJECT      | AUTHOR                   | TITLE                          | TRANSLATION              | TRANSLATOR   | PUBLISHER             | DATEPUB |
+---------------+------+--------------+--------------------------+--------------------------------+--------------------------+--------------+-----------------------+---------+
| 9782212090819 | fr   | applications | Jean-Christophe Bernadac | Construire une application XML | NULL                     | NULL         | Eyrolles Paris        |    1999 |
| 9782840825685 | fr   | applications | William J. Pardi         | XML en Action                  | adapté de l'anglais par  | James Guerin | Microsoft Press Paris |    1999 |
+---------------+------+--------------+--------------------------+--------------------------------+--------------------------+--------------+-----------------------+---------+
2 rows in set (0.00 sec)

It is important to notice that CONNECT cannot guess what you want to do with arrays. Here the AUTHOR array is left undefined, which means that only its first value will be retrieved. You can specify through a CONNECT table creation parameter whether you want this array presented in a normalized or denormalized way (“Expand=AUTHOR” in the option list), with all the array elements presented.

It is also important to notice that by default the auto-discovery feature only looks at the first level of the JSON structure.
If a more complex definition is desired, you can ask CONNECT to analyze the JPATH up to a given level using the LEVEL option in the option list. The level value is the number of sub-objects that are taken into the JPATH. For instance:

create table jsample2 engine=connect table_type=JSON 
file_name='biblio3.json' 
option_list='level=1';

This will define the table as:

CREATE TABLE `jsample2` (
  `ISBN` char(13) NOT NULL,
  `LANG` char(2) NOT NULL,
  `SUBJECT` char(12) NOT NULL,
  `AUTHOR_FIRSTNAME` char(15) NOT NULL `FIELD_FORMAT`='AUTHOR::FIRSTNAME',
  `AUTHOR_LASTNAME` char(8) NOT NULL `FIELD_FORMAT`='AUTHOR::LASTNAME',
  `TITLE` char(30) NOT NULL,
  `TRANSLATED_PREFIX` char(23) DEFAULT NULL `FIELD_FORMAT`='TRANSLATED:PREFIX',
  `TRANSLATED_TRANSLATOR` varchar(256) DEFAULT NULL `FIELD_FORMAT`='TRANSLATED:TRANSLATOR',
  `PUBLISHER_NAME` char(15) NOT NULL `FIELD_FORMAT`='PUBLISHER:NAME',
  `PUBLISHER_PLACE` char(5) NOT NULL `FIELD_FORMAT`='PUBLISHER:PLACE',
  `DATEPUB` int(4) NOT NULL
) ENGINE=CONNECT DEFAULT CHARSET=latin1 
`TABLE_TYPE`='JSON' `FILE_NAME`='biblio3.json' `OPTION_LIST`='level=1';

In this table definition the column names come from concatenating the attribute names along the JSON hierarchy. For example, AUTHOR_FIRSTNAME comes from the concatenation of the AUTHOR and FIRSTNAME attribute names.

MariaDB [test]> select * from jsample2;
+---------------+------+--------------+------------------+-----------------+--------------------------------+--------------------------+----------------------+---------------------+-----------------+-----------------+---------+
| ISBN          | LANG | SUBJECT      | AUTHOR_FIRSTNAME | AUTHOR_LASTNAME | TITLE                          | TRANSLATION              | TRANSLATOR_FIRSTNAME | TRANSLATOR_LASTNAME | PUBLISHER_NAME  | PUBLISHER_PLACE | DATEPUB |
+---------------+------+--------------+------------------+-----------------+--------------------------------+--------------------------+----------------------+---------------------+-----------------+-----------------+---------+
| 9782212090819 | fr   | applications | Jean-Christophe  | Bernadac        | Construire une application XML | NULL                     | NULL                 | NULL                | Eyrolles        | Paris           |    1999 |
| 9782840825685 | fr   | applications | William J.       | Pardi           | XML en Action                  | adapté de l'anglais par  | James                | Guerin              | Microsoft Press | Paris           |    1999 |
+---------------+------+--------------+------------------+-----------------+--------------------------------+--------------------------+----------------------+---------------------+-----------------+-----------------+---------+
2 rows in set (0.00 sec)

This method can be used as a quick way to make a “template” table definition that can later be edited into the desired definition. Column names are constructed from all the object keys of their path in order to have distinct column names. They can be manually edited to the desired names, provided their JPATH key names are not modified.

Here we want the array of authors to be transformed into a concatenated string:

CREATE TABLE `jsample6` (
    -> `ISBN` char(13) NOT NULL,
    -> `AUTHOR` char(128) DEFAULT NULL `field_format`='AUTHOR:[" and "]'
    -> ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`='JSON' `FILE_NAME`='biblio.json' ;
Query OK, 0 rows affected (0.00 sec)

This gives the following flat representation of the author array attribute:

MariaDB [test]> select * from jsample6;
+---------------+---------------------------------------------+
| ISBN          | AUTHOR                                      |
+---------------+---------------------------------------------+
| 9782212090819 | Jean-Christophe Bernadac and François Knab  |
| 9782840825685 | William J. Pardi                            |
+---------------+---------------------------------------------+
2 rows in set (0.00 sec)

The MariaDB CONNECT storage engine JSON auto-discovery capability can be a very easy way to search or compute aggregates on a JSON-structured file. This can be very useful, as loading such hierarchical data into tables can be quite complex. You get the benefit of the well-known SQL query language.

This JSON file access through the MariaDB CONNECT storage engine should not be confused with the native JSON support available in MySQL 5.7 (and in Facebook DocStore). The MariaDB CONNECT storage engine's target is to provide access to data sources that remain external to MariaDB.

Happy to see this MySQL JSON momentum !

Last week at Percona Live, Facebook presented for the first time DocStore, a native JSON implementation in MySQL. Oracle also presented its MySQL 5.7 lab release that includes the implementation of a native JSON type.
This is an important move, as MySQL was behind other RDBMSs regarding JSON (PostgreSQL already had a nice and complete implementation). With JSON being widely adopted in various areas (JS development, data exchange, database formats, …), this move was awaited.

The two implementations of a native JSON type are different, but they both provide validation of input data and a binary JSON storage format.

If we look at Oracle's native JSON implementation, the binary format is optimized for property retrieval. To achieve good performance, the binary JSON storage format includes a preamble with a lookup table. This lookup table speeds up key/value pair access when executing functions like JSN_EXTRACT.
Speeding up JSN_EXTRACT is important for indexing JSON fields. In the Oracle implementation, JSON indexes are indexes on non-materialized virtual columns (functional indexes). These virtual columns are a call to JSN_EXTRACT for the JSON property we want to index.

The Facebook native JSON implementation uses a docpath syntax that is understood by the MySQL parser. It is possible to have a secondary composite index containing a JSON index part. The optimizer can use the secondary index for covering indexes, index-only range scans, … There is not yet full optimizer support, but this is a good start.
Contrary to Facebook, Oracle has not modified the parser : everything is wrapped in JSON helper functions.
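To illustrate the difference in flavor between the two approaches (the docpath and function spellings below are only illustrative, not the exact syntax of either release, and the customer table is made up) :

```sql
-- Facebook-style docpath access, understood by the parser (illustrative):
SELECT doc.address.city FROM customer WHERE doc.id = 1;

-- Oracle-style access wrapped in helper functions (illustrative):
SELECT JSN_EXTRACT(doc, '$.address.city')
FROM customer
WHERE JSN_EXTRACT(doc, '$.id') = 1;
```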

The main difference I see between these two implementations is that with the Oracle implementation the secondary index is bound to a virtual column that needs to be declared in the DDL. With the Oracle implementation it is also not possible to have a composite index mixing a regular column and a JSON property. So for the moment the Facebook implementation seems more flexible on this point.
What I like about the Oracle implementation is that the provided JSON functions work both on the native JSON format and on JSON text. That might explain why the UDF lab release has been removed from the downloads.

So will it still make sense to store JSON as a text field ?
It depends on the use case. Having no validation and no binary format conversion in and out to present the JSON payload to the application makes inserts faster. For small JSON documents the speedup resulting from the lookup table might be marginal when running JSN_EXTRACT. And the index entry is built just once, so access through the index will not have to redo the JSN_EXTRACT.
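The text-field alternative being discussed is simply this (a sketch, with a made-up table) :

```sql
-- Storing JSON as plain text: no validation, no binary
-- conversion on insert, but every extraction has to
-- re-parse the whole document.
CREATE TABLE event_txt (
  id  INTEGER,
  doc TEXT   -- raw JSON string, accepted as-is
);
INSERT INTO event_txt VALUES (1, '{"type": "click", "x": 10}');
```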

What I really like about this addition of the JSON format to MySQL is that we are able to have a hybrid approach mixing tabular data with JSON columns. Contrary to JSON-only, schema-less datastores like MongoDB, this hybrid data store might turn out to be the best choice for a wide range of applications and might please developers.

Select from a MariaDB table into JSON format ?

Can we output content of a mariaDB table into JSON using the CONNECT Storage Engine ?
MariaDB and MySQL are not currently JSON-friendly databases. The usage of JSON in MySQL is almost nonexistent, with the exception of the EXPLAIN JSON output and of the MariaDB dynamic column dump.
There also exist a few UDFs (in the MySQL lab area) to help manipulate JSON strings (searching, modification, …).
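The EXPLAIN JSON output mentioned above is obtained like this (t being any existing table) :

```sql
-- EXPLAIN can emit its query plan as a JSON document:
EXPLAIN FORMAT=JSON SELECT * FROM t WHERE id = 1;
```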

A basic question often raised is how to output the content of a MariaDB table in JSON format. Can the MariaDB CONNECT storage engine help do that ?

Suppose we have a simple table :

MariaDB > create table book
(
  id       integer,
  isbn     varchar(20),
  lang     varchar(20),
  subject  varchar(80),
  title    varchar(80),
  datepub  date
)  engine=innodb;

Then insert a few rows.

insert into book values (1,'9782212090819','fr','applications','Construire une application XML','2008-7-04');
insert into book values (2,'9782840825685','fr','applications','XML en Action','1999-7-04');

There is no direct way to query this InnoDB table and get the result in JSON format. Many users have written PHP (or other language) scripts to do this in a more or less generic way.
Let us create a CONNECT/JSON table based on this innoDB table.

MariaDB > create table bookj  
  engine=CONNECT table_type=JSON 
  file_name='/var/lib/mysql/json/bookj.jsn'  
  as select * from book;

This create table will produce a JSON file containing the JSON representation of the table content :

bookj.jsn :
[
	{
		"id": 1,
		"isbn": "9782212090819",
		"lang": "fr",
		"subject": "applications",
		"title": "Construire une application XML",
		"datepub": 1215129600
	},
	{
		"id": 2,
		"isbn": "9782840825685",
		"lang": "fr",
		"subject": "applications",
		"title": "XML en Action",
		"datepub": 931046400
	}
]

But how can we directly request the JSON representation through a select ?
We will create a second CONNECT/JSON table based on the same underlying file produced by the previous step.

create table bookj2 (
	ROW varchar(500) field_format='*'
)
engine=CONNECT table_type=JSON file_name='/var/lib/mysql/json/bookj.jsn';

The field_format='*' is a JSONPath specification meaning that we take the JSON as is and do not try to navigate its properties.

MariaDB > select * from bookj2;
+------------------------------------------------------------------------------------------------------------------------------------+
| ROW                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------+
| {"id":1,"isbn":"9782212090819","lang":"fr","subject":"applications","title":"Construire une application XML","datepub":1215129600} |
| {"id":2,"isbn":"9782840825685","lang":"fr","subject":"applications","title":"XML en Action","datepub":931046400}                   |
+------------------------------------------------------------------------------------------------------------------------------------+
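Conversely, individual JSON properties can be exposed as regular columns. By default CONNECT matches each column name against the JSON key of the same name, so no field_format is needed for a flat document like ours (a sketch; the exact path syntax expected by CONNECT for nested properties may differ) :

```sql
-- Hypothetical sketch: expose only some JSON properties as columns;
-- each column name is matched against the JSON key of the same name.
create table bookj3 (
  isbn  varchar(20),
  title varchar(80)
)
engine=CONNECT table_type=JSON file_name='/var/lib/mysql/json/bookj.jsn';

select * from bookj3;  -- returns only the isbn and title properties
```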

That is not really simple, but it does the job. The structure produced is a flat structure without arrays or embedded JSON.
In the real world the equivalent of a JSON hierarchy is a set of tables linked through foreign keys. With MariaDB this can also include dynamic columns for multi-valued columns.
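Those dynamic columns can themselves be dumped as JSON, which is the other small piece of JSON support already present in MariaDB :

```sql
-- MariaDB dynamic columns: COLUMN_JSON dumps a dynamic
-- column blob as a JSON object.
SELECT COLUMN_JSON(COLUMN_CREATE('lang', 'fr', 'subject', 'applications'));
```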