Continuing our series on storing JSON directly within your database, we are now looking at the JSON functionality built into PostgreSQL.  You can find the first two posts in this series, which explore MySQL’s JSON functionality, at Storing JSON in Your Databases: Tips and Tricks For MySQL Part One and Storing JSON in Your Databases: Tips and Tricks For MySQL Part Two.  I used the exact same table structures and datasets as in my MySQL tests.  You can get the instructions to follow along at home here on GitHub.

PostgreSQL has two JSON data types available to you: JSON and JSONB.  The docs highlight the differences pretty well:

[Table: comparison of the JSON and JSONB data types, from the PostgreSQL documentation]

Basically, JSONB is stored decomposed, making it more efficient for the database to interact with, while the JSON type is stored as an exact text copy of what was input.  This means formatting, white space, etc., will be lost with JSONB.  That said, the general suggestion is to use JSONB.  The trade-off is slightly slower inserts in exchange for faster overall read performance (on top of the storage differences mentioned above).

In terms of performance, how much of a difference is there?  That is tricky to answer, as it depends heavily on the size and complexity of the JSON document being stored.  The larger and more complex the document, the larger the difference you could see.  To illustrate, we will use our movie JSON to select the same data from both JSON and JSONB.

I created the following tables and loaded them with movie-related JSON:
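A minimal sketch of that schema, assuming the table and column names used throughout this post (the ai_myid surrogate key is an assumption; the dataset load itself is covered in the GitHub instructions above):

```sql
-- Identical tables, differing only in the type of the JSON column.
CREATE TABLE movies_json (
    ai_myid     SERIAL PRIMARY KEY,
    json_column JSON NOT NULL
);

CREATE TABLE movies_jsonb (
    ai_myid      SERIAL PRIMARY KEY,
    jsonb_column JSONB NOT NULL
);

-- After loading the movie documents into movies_json,
-- copy them into the JSONB table:
INSERT INTO movies_jsonb (jsonb_column)
SELECT json_column::jsonb FROM movies_json;
```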

Querying the JSON column results in slower retrieval:
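Something along these lines (the title literal and field names are assumptions based on the movie dataset):

```sql
SELECT json_column->>'title'       AS title,
       json_column->>'imdb_rating' AS imdb_rating
FROM movies_json
WHERE json_column->>'title' = 'Avengers Endgame (2019)';
```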

While the JSONB column is faster (3x):
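The same query, against the JSONB table:

```sql
SELECT jsonb_column->>'title'       AS title,
       jsonb_column->>'imdb_rating' AS imdb_rating
FROM movies_jsonb
WHERE jsonb_column->>'title' = 'Avengers Endgame (2019)';
```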

To ensure this was not a single data point, I wrote a small script to run this test (with this title and other random titles).  Over 100 runs, the results were pretty consistent:

JSON:

  • Average Time: 2.5492
  • Min Time: 2.5297
  • Max Time: 2.5654

JSONB:

  • Average Time: 0.747
  • Min Time: 0.7298
  • Max Time: 0.7828

As stated, inserts/updates will lose some of that performance, so keep that in mind.  It is, however, easier to interact with and update data in a JSONB column.  For instance, some functions are only available for the JSONB data type, like JSONB_SET, which allows you to update your JSON in place:
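A sketch of such an in-place update (the overview field is an assumption from the movie dataset):

```sql
-- Replace a single field, leaving the rest of the document untouched.
UPDATE movies_jsonb
SET jsonb_column = jsonb_set(jsonb_column, '{overview}', '"A shorter summary"')
WHERE jsonb_column->>'title' = 'Avengers Endgame (2019)';
```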

I have seen references in several places to the insert performance difference between JSON and JSONB, but I have not seen any concrete numbers.  I decided to run a few tests:

  • Inserting 10K records into a JSON column (via insert into select from):  72.851 ms
  • Inserting 10K records into a JSONB column (via insert into select from):  754.045 ms
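These timings came from statements along the following lines, timed with psql’s \timing (the staging table names are assumptions):

```sql
\timing on

-- 10K documents into the JSON table...
INSERT INTO movies_json_test (json_column)
SELECT json_column FROM movies_json LIMIT 10000;

-- ...and the same 10K documents into the JSONB table.
INSERT INTO movies_jsonb_test (jsonb_column)
SELECT json_column::jsonb FROM movies_json LIMIT 10000;
```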

If you are doing a heavy insert workload, the difference is significant enough to take note of and plan for.

Another distinct advantage of JSONB is the ability to use GIN indexes over the JSON document.  GIN indexes are designed and optimized for text searches, which lends itself well to JSON.  That said, the syntax required to make use of GIN indexes can be a bit more complicated.

Here is how you create a GIN index:
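For example, using the default jsonb_ops operator class (the index name is an assumption):

```sql
CREATE INDEX idx_movies_jsonb_gin ON movies_jsonb USING GIN (jsonb_column);
```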

You can see the original query is still not using this index:
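Prefixing the earlier lookup with EXPLAIN shows something like this:

```sql
EXPLAIN ANALYZE
SELECT jsonb_column->>'title'
FROM movies_jsonb
WHERE jsonb_column->>'title' = 'Avengers Endgame (2019)';
-- Still a Seq Scan: the WHERE clause compares text values,
-- which the GIN index over the jsonb document cannot service.
```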

Why not?  The ->> operator returns the value as text, and that conversion from JSONB to text prevents the index from being used.  There are a couple of ways to work around this.  The first is to use @@, which checks a JSON path predicate against the document (only the first item of the result is considered):
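For example (jsonpath support requires PostgreSQL 12+; the field name is an assumption):

```sql
SELECT jsonb_column->>'title'
FROM movies_jsonb
WHERE jsonb_column @@ '$.title == "Avengers Endgame (2019)"';
```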

This works.  You can also use @>, which checks whether the JSON value on the left contains the JSON value on the right:
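For example:

```sql
SELECT jsonb_column->>'title'
FROM movies_jsonb
WHERE jsonb_column @> '{"title": "Avengers Endgame (2019)"}';
```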

Not only did this use the index, it is now over 3x faster overall.  The PostgreSQL documentation offers a fairly detailed overview of the different options to interact with JSON, especially when it comes to the functions and operators available to you (note the table in the PG 12 docs giving an overview of the operators is a little more readable).  It is important to note that not all operators and functions will work with GIN indexes.  For example, using ‘like_regex’:
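A sketch of such a query; the jsonpath is valid, but a like_regex predicate cannot be serviced by the GIN index, so the plan falls back to a sequential scan:

```sql
SELECT jsonb_column->>'title'
FROM movies_jsonb
WHERE jsonb_column @? '$.title ? (@ like_regex "^Avengers")';
```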

In addition to GIN indexes, you can create hash or btree indexes on your JSON columns; however, these are only useful if you are comparing or searching the entire JSON document.  Instead, you want to use an expression index (often referred to as a functional index in other databases).

Generated Columns and Expression Indexes

While GIN indexes work for some use cases, you will often find it easier and more convenient to use actual columns or the normal functions you are familiar with.  The simplest use of an expression index is to pull out and index a field you will use often.  Let’s say we often want to search by title.  We can create an index on the title extracted from the JSON in our previous example:
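For example (the index name is an assumption):

```sql
-- Note the double parentheses required around the expression.
CREATE INDEX idx_movies_jsonb_title
    ON movies_jsonb ((jsonb_column->>'title'));
```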

This creates a btree index on the jsonb_column->>'title' expression, allowing us to search by title.

Originally, when we searched for movies with the title ‘Avengers Endgame (2019)’, there was no index (unless you modified the SQL to make use of the GIN index).  Now, with this new index, our original query is no longer doing a sequential scan and is instead using the btree index.  If we also want the index to be used for LIKE ‘string%’ statements, to help with partial matching, we need something extra added to the index.  Note: this is not something you can do by default using only the GIN index (though you can potentially look into the pg_trgm extension for this).  Here we create the same index with the operator class text_pattern_ops, which is designed to work with LIKE and regex patterns:
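A sketch of that index (again, the index name is an assumption):

```sql
CREATE INDEX idx_movies_jsonb_title_pattern
    ON movies_jsonb ((jsonb_column->>'title') text_pattern_ops);
```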

Before adding this index, a LIKE ‘Avengers%’ query on the extracted title falls back to a sequential scan; after, the query plan shows the new btree index being used.

Now you can also perform a similar operation but store the result as a queryable generated column instead:
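For example (stored generated columns require PostgreSQL 12+; the column name is arbitrary):

```sql
ALTER TABLE movies_jsonb
    ADD COLUMN title TEXT
    GENERATED ALWAYS AS (jsonb_column->>'title') STORED;
```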

Now we have the option to add an index on the title column, select it without the JSON formatting, etc.  Note: generated columns cannot be updated directly.

We can create our entire movie_json table with extra columns and expression indexes and make it look a bit more like a normal table:
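A sketch of what that could look like, with assumed field and index names from the movie dataset:

```sql
CREATE TABLE movies_jsonb_generated (
    ai_myid      SERIAL PRIMARY KEY,
    jsonb_column JSONB,
    -- Generated columns extracted (and explicitly cast) from the JSON document:
    title        TEXT    GENERATED ALWAYS AS (jsonb_column->>'title') STORED,
    imdb_rating  NUMERIC GENERATED ALWAYS AS ((jsonb_column->>'imdb_rating')::numeric) STORED
);

-- A regular btree index on a generated column:
CREATE INDEX idx_gen_title ON movies_jsonb_generated (title);

-- An expression index, with text_pattern_ops for LIKE 'prefix%' searches:
CREATE INDEX idx_gen_overview
    ON movies_jsonb_generated ((jsonb_column->>'overview') text_pattern_ops);
```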

You may have noticed we had to explicitly cast the columns to specific data types.  This casting is one of the challenges, or at least one of the more difficult things to get used to, with JSON in both MySQL and PostgreSQL.

It’s worth remembering that a JSON document does not have explicit data types, so you will often find that some functions or indexes may not work as expected because you are comparing data of two different types (and some operations will return different results, i.e., sorting ASCII vs. numeric).  For example, let’s try to add 1 to our IMDB rating:
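A sketch of the naive attempt (the imdb_rating field and the WHERE clause are assumptions; the error shown is along the lines of what PostgreSQL raises here):

```sql
UPDATE movies_jsonb
SET jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}',
                             jsonb_column->>'imdb_rating' + 1)
WHERE ai_myid = 1;
-- ERROR:  operator does not exist: text + integer
```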

Here, the imdb_rating returned by ->> is text, not numeric, so you cannot add 1 to it.  Logically, then, you want to cast the value to a numeric to allow the addition:
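Something like:

```sql
UPDATE movies_jsonb
SET jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}',
                             (jsonb_column->>'imdb_rating')::numeric + 1)
WHERE ai_myid = 1;
-- ERROR:  function jsonb_set(jsonb, unknown, numeric) does not exist
```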

Again, we run into a type issue.  This time, jsonb_set cannot accept a numeric value (nor does it implicitly convert one) where it expects a JSONB value.  So let’s cast the result to JSONB:
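```sql
UPDATE movies_jsonb
SET jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}',
                             ((jsonb_column->>'imdb_rating')::numeric + 1)::jsonb)
WHERE ai_myid = 1;
-- ERROR:  cannot cast type numeric to jsonb
```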

Here we find we cannot cast a numeric directly to JSONB.  Instead, we have to convert the numeric to text, then to JSONB:
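```sql
-- numeric -> text -> jsonb works:
UPDATE movies_jsonb
SET jsonb_column = jsonb_set(jsonb_column, '{imdb_rating}',
                             ((jsonb_column->>'imdb_rating')::numeric + 1)::text::jsonb)
WHERE ai_myid = 1;
```

(Wrapping the arithmetic in to_jsonb(...) is an equivalent, arguably cleaner, spelling of the same conversion.)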

For some developers, this may prove a bit tricky to get the hang of.  It also extends to operations that work but are slow or produce the wrong data.  For instance, when comparing, sorting, or filtering, you may or may not pick up the index you are looking for, and you may end up comparing the ASCII values of numerics.  Below, you can see us explicitly convert columns in order to compare them; the same applies when you generate a column or create an expression index:
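For instance (the casts are shown explicitly; the rating threshold is arbitrary):

```sql
-- Compare and sort the rating as a number, not as text:
SELECT jsonb_column->>'title'                  AS title,
       (jsonb_column->>'imdb_rating')::numeric AS imdb_rating
FROM movies_jsonb
WHERE (jsonb_column->>'imdb_rating')::numeric > 8.0
ORDER BY (jsonb_column->>'imdb_rating')::numeric DESC;
```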

A quick side note:  there will be a performance difference for inserts/updates on JSON/JSONB columns when one or more generated columns or expression indexes are present.  Because of how JSONB is stored, it fares better here: in my tests, updates on a JSONB column with several generated columns/expression indexes were 3-4x or more faster than on a JSON column with the same setup.  That said, the more generated columns and expression indexes you have on a table, the more they can impact performance… this is a trade-off.

  • A table with a single generated column took 7,650ms to load 100K rows.
  • A table with six generated columns took 8,576ms to load 100K rows.

While this is a very simple setup, with just a single JSONB column plus the generated columns, you can see a slight overhead.  This held true when updating the same 100K records: updating one column in the table with six generated columns took 4.7 seconds vs. 3.6 seconds for the table with a single generated column (a ~25% difference).  Again, not an in-depth test, but it illustrates that adding a ton of expression indexes or generated columns will not be without some cost.

Recap & What’s Next

A few quick takeaways:

  • Using JSONB is probably going to be your best option in most use cases
  • Be very careful of type conversions and making assumptions on the data within your JSON/JSONB columns.  You may get errors or odd results.
  • Use the available indexes, generated columns, and expression indexes to gain substantial performance benefits

Now you have the basics of JSON in PostgreSQL.  In part two of this series, we will review some of the more advanced options, do a deeper dive on performance, and explore a normalized schema vs. one that is heavily JSON.
