Trying ZomboDB

11/04/2021

ZomboDB is a postgres extension that provides an interesting solution for mixing postgres and elasticsearch (ES). The basic idea is to use ES as a postgres index and to enforce multiversion concurrency control (MVCC) correctness. It’s important to note that there’s been rewrite of zombodb to Rust and it’s considered in beta at the moment. Installation notes are at the end of the post, otherwise lets jump straight in.

Create the extension:

CREATE EXTENSION IF NOT EXISTS zombodb;

Create an example report table with the zombodb type:

DROP TABLE IF EXISTS report;

CREATE TABLE report (
    id serial primary key,
    content zdb.fulltext
);

note: zdb.fulltext maps to {"type": "text", "copy_to": "zdb_all", "analyzer": "zdb_standard"}. All mappings are found here.

Generate 10^6 rows with duplicated sample_reports (sample data at the end of the post):

INSERT INTO
    report (content)
SELECT
    (select * from sample_reports ORDER BY random()+(i-i) LIMIT 1)
FROM
    generate_series(1, 1000000) s(i);

note: random()+(i-i) is a hack so that postgres will not optimise random().

At this point there is nothing particularly special about the report table. To enable the ES integration we need to create the zombodb index:


DROP INDEX IF EXISTS idx_report;

CREATE INDEX idx_report
    ON report
    USING zombodb ((report.*))
    WITH (url='http://localhost:9200/');

Depending on how large your table is, this might take a while. For my 10^6 rows it took ~30s. Internally zombodb is now indexing the table in ES. If the indexing fails, you will need to create the index again. We can verify the index by looking at ES:

curl localhost:9200/_cat/indices
green open 16386.2200.1938790.1938804 ECLYLKcwQRyU6oqD8wVSsQ 5 0 1000001 1 168.9mb 168.9mb

Notice that there are 10^6+1 rows. That’s not bug, zombodb stores transaction information in ES in a separate document.

We can also inspect the mapping that zombodb generated:

curl localhost:9200/16386.2200.1938790.1938804?pretty
{
    "16386.2200.1938790.1938804" : {
    ...
    "mappings" : {
            "properties" : {
                "content" : {
                    "type" : "text",
                    "analyzer" : "fulltext",
                    "fielddata" : true
                },
                "id" : {
                    "type" : "integer"
                },
                "zdb_aborted_xids" : {
                    "type" : "long"
                },
                "zdb_all" : {
                    "type" : "text",
                    "analyzer" : "zdb_all_analyzer"
                },
                "zdb_cmax" : {
                    "type" : "integer"
                },
                "zdb_cmin" : {
                    "type" : "integer"
                },
                "zdb_ctid" : {
                    "type" : "long"
                },
                "zdb_xmax" : {
                    "type" : "long"
                },
                "zdb_xmin" : {
                    "type" : "long"
                }
                ...
        }
    ...

This is one of the nice features of zombodb. If you were to roll your own integration, you’d have to keep the mapping up-to-date yourself. Also, notice all the zdb_xxx properties which zombodb is using to keep track of transaction states.

We can see the first couple documents with:

curl localhost:9200/16386.2200.1938790.1938804/_search?pretty -H 'Content-Type: application/json' -d '{"size": 2, "query": {"match_all": {}}}'
{
  "took" : 2,
  ...
  "hits" : {
    ...
    "hits" : [
      {
        "_index" : "16386.2200.1938790.1938804",
        "_type" : "_doc",
        "_id" : "zdb_aborted_xids",
        "_score" : 1.0,
        "_source" : {
          "zdb_aborted_xids" : [ ]
        }
      },
      {
        "_index" : "16386.2200.1938790.1938804",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "id" : 2,
          "content" : "2020-02-26\n\n# Chief Complaint\nNo complaints.\n\n# History of Present Illness\nJeanene972\n is a 62 year-old non-hispanic white female. Patient has a history of acute bronchitis (disorder), viral sinusitis (disorder).\n\n# Social History\nPatient is married. Patient is an active smoker and is an alcoholic.\n Patient identifies as heterosexual.\n\nPatient comes from a high socioeconomic background.\n Patient is a college graduate.\nPatient currently has Humana.\n\n# Allergies\nNo Known Allergies.\n\n# Medications\nacetaminophen 325 mg oral tablet; hydrochlorothiazide 25 mg oral tablet\n\n# Assessment and Plan\nPatient is presenting with pneumonia (disorder), hypoxemia (disorder), respiratory distress (finding), acute deep venous thrombosis (disorder). ",
          "zdb_ctid" : 2,
          "zdb_cmin" : 0,
          "zdb_cmax" : 0,
          "zdb_xmin" : 3
        }
      }
    ]
  }
}

The first hit is the extra document that zombodb uses to keep track of pending transactions.

Let’s start querying the db:

test=# SELECT id
test-#   FROM report
test-#  WHERE report ==> 'content:(dyspnea)' LIMIT 3;
 id
----
  2
  6
  7
(3 rows)

Time: 45.459 ms

You can get the count with:

test=# SELECT count(*)
  FROM report
 WHERE report ==> 'content:(dyspnea)';
 count
--------
 333410
(1 row)

Time: 1542.891 ms (00:01.543)

But it’s a bit slow because postgres doesn’t know it can use the ES index to get the count directly, so it’s just doing an index scan.

test=# EXPLAIN SELECT count(*)
  FROM report
 WHERE report ==> 'content:(dyspnea)';
                                        QUERY PLAN
-------------------------------------------------------------------------------------------
 Aggregate  (cost=396.75..396.76 rows=1 width=8)
   ->  Index Scan using idx_report on report  (cost=0.00..390.50 rows=2500 width=0)
         Index Cond: (ctid ==> '{"query_string":{"query":"content:(dyspnea)"}}'::zdbquery)
(3 rows)

To get around this you can query ES directly using zombodb sql functions:

SELECT zdb.count('idx_report', 'content:(dyspnea)');
 count
--------
 333410
(1 row)

Time: 107.924 ms)

Afaik there’s similar issues with joins, since postgres will need to fetch all rows from ES before trying to do the join. In ES you can do joins with nested documents or parent/child, but it doesn’t look like zombodb supports that at the moment.

Finally, zombodb leverages VACUUM to remove dead rows from ES. You can also force the re-index with VACUUM FULL. More info on the process here.

Lets delete some rows and see what zombodb does:

test=# DELETE FROM report WHERE report ==> 'content:(dyspnea)';
DELETE 333410
Time: 24001.698 ms (00:24.002)
curl localhost:9200/_cat/indices
green open 16386.2200.1938790.1938804 objcHsq_SomKKsUyRnn10Q 5 0 1000001 333412 242.6mb 242.6mb
curl localhost:9200/_cat/indices
green open 16386.2200.1938790.1938804 objcHsq_SomKKsUyRnn10Q 5 0 666591 1000232 264.3mb 264.3mb
test=#VACUUM report;
Time: 249.829 ms
curl localhost:9200/_cat/indices?pretty
green open 16386.2200.1938790.1938804 objcHsq_SomKKsUyRnn10Q 5 0 666591 666821 361.2mb 361.2mb

The doc count makes sense, but the deleted count looks a little strange but I’m guessing it’s related to how zombodb is enforcing MVCC. Let do a full re-index anyway.

test=#VACUUM FULL report;
Time: 36720.200 ms (00:36.720)
curl localhost:9200/_cat/indices
green open 16386.2200.1938856.1938865 8VxgtTRVSYOk5Q5050a7Qg 5 0 666591 1 111.5mb 111.5mb

There’s plenty more to investigate in zombodb but I’ve seen enough to know that for certain projects it could help simplify database management significantly. I’ll be keeping a eye on it and I’m looking forward to the full release. The project looks like it is supported by sponsorships alone, you can sponsor here (I am not affiliated with the project at all).

Installation

Zombodb requires postgres 10+ and ES 7+. Pre-built CentOS/RHEL or Ubuntu/Debian binaries are available for sponsors only. Otherwise it’s pretty straight forward to install from source. Follow the general instructions in the readme. Below is how I installed it from source on Ubuntu server 18.04-LTS.

Install packages

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list

sudo apt-get update

sudo apt -y install build-essential bison flex zlib1g zlib1g-dev pkg-config libssl-dev libreadline-dev clang libclang-dev postgresql-12 postgresql-server-dev-12 elasticsearch-7

Install rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Build zombodb from source

cargo pgx init --pg12=`which pg_config`
cargo pgx install --release

Sample data

Sample texts were created using synthea.

DROP TABLE IF EXISTS sample_reports;

CREATE TABLE sample_reports (
    value text
);

INSERT INTO sample_reports (value)
VALUES
    (
        '2020-08-09

# Chief Complaint
No complaints.

# History of Present Illness
Jeanene972
 is a 62 year-old non-hispanic white female. Patient has a history of covid-19, acute bronchitis (disorder), fever (finding), fatigue (finding), respiratory distress (finding), cough (finding), viral sinusitis (disorder), loss of taste (finding), suspected covid-19, acute deep venous thrombosis (disorder), pneumonia (disorder).

# Social History
Patient is married. Patient is an active smoker and is an alcoholic.
 Patient identifies as heterosexual.

Patient comes from a high socioeconomic background.
 Patient is a college graduate.
Patient currently has Humana.

# Allergies
No Known Allergies.

# Medications
acetaminophen 325 mg oral tablet; hydrochlorothiazide 25 mg oral tablet

# Assessment and Plan


## Plan
Patient was given the following immunizations: influenza, seasonal, injectable, preservative free.
The patient was prescribed the following medications:
- hydrochlorothiazide 25 mg oral tablet'
    ),
    (
        '2020-09-10

# Chief Complaint
No complaints.

# History of Present Illness
Marilu588
 is a 37 year-old non-hispanic white female. Patient has a history of dyspnea (finding), wheezing (finding), sore throat symptom (finding), viral sinusitis (disorder), suspected covid-19, fracture of ankle, covid-19, normal pregnancy, acute bronchitis (disorder), fever (finding), sputum finding (finding), sinusitis (disorder), cough (finding).

# Social History
Patient is single. Patient is an active smoker and is an alcoholic.
 Patient identifies as heterosexual.

Patient comes from a middle socioeconomic background.
 Patient has completed some college courses.
Patient currently has UnitedHealthcare.

# Allergies
No Known Allergies.

# Medications
acetaminophen 325 mg oral tablet; acetaminophen 325 mg / hydrocodone bitartrate 7.5 mg oral tablet; etonogestrel 68 mg drug implant; ibuprofen 200 mg oral tablet; seasonique 91 day pack; carbamazepine[tegretol]; amoxicillin 250 mg / clavulanate 125 mg oral tablet; diazepam 5 mg oral tablet

# Assessment and Plan


## Plan

The following procedures were conducted:
- removal of subcutaneous contraceptive
- insertion of subcutaneous contraceptive'
    ),
    (
        '2020-06-08

# Chief Complaint
No complaints.

# History of Present Illness
Twila243
 is a 52 year-old non-hispanic white female. Patient has a history of muscle pain (finding), covid-19, fever (finding), sputum finding (finding), joint pain (finding), acute viral pharyngitis (disorder), cough (finding), viral sinusitis (disorder), loss of taste (finding), suspected covid-19.

# Social History
Patient is married. Patient is an active smoker and is an alcoholic.
 Patient identifies as heterosexual.

Patient comes from a middle socioeconomic background.
 Patient has completed some college courses.
Patient currently has Humana.

# Allergies
No Known Allergies.

# Medications
amoxicillin 250 mg / clavulanate 125 mg oral tablet

# Assessment and Plan


## Plan
Patient was given the following immunizations: influenza, seasonal, injectable, preservative free.
The following procedures were conducted:
- medication reconciliation (procedure)'
    );