Exploring genomic data with Superset



As part of a recent hackathon, we wanted to explore visualization tools that could help diagnose rare diseases. A critical part of the diagnosis process is comparing variations in the patient’s genetic information against reference information. There can be a lot of variations: on the order of millions, sometimes tens of millions, per patient. I think what is missing is a centralized tool that researchers can use to explore their genomic data and share useful visualizations. In the business world these kinds of tools have been around for a while. They are generally called business intelligence applications, and one such tool is Superset.

Superset is an open source business intelligence tool built by Airbnb. It lets you create interactive dashboards and connects to any database that SQLAlchemy supports.


Deployment was simple and straightforward. Just follow the instructions and you’ll be running a Superset instance in under 10 minutes. If you followed the instructions you should also have some example data to play with.


Once you have the Superset app running, you’ll want to connect it to a database with your genomic information. For simplicity we concentrated on patient VCF data only. In genomics there is no standard database or schema for storing these variants. In fact, I would say it’s one of the challenges in our industry at the moment, but I’ll save that discussion for another blog post. Luckily for us, we already have a pipeline that produces a SQLite database through GEMINI. For performance reasons we exported that data into a MemSQL server. Alternatively, you could simply import a VCF into Hail and export a CSV to be imported into any database that SQLAlchemy supports.
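If you have neither GEMINI nor Hail at hand, here is a rough sketch of the same idea using only the Python standard library: read the fixed VCF columns and load them into a SQLite table that Superset could then query. This is not the pipeline we used. The sample records, the `variants` table and its column names are all made up for illustration, and the parser ignores the INFO and genotype columns entirely.

```python
import io
import sqlite3

# Hypothetical VCF content, used only for illustration.
SAMPLE_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t12345\trs1\tA\tG\t50\tPASS\t.
chr1\t22345\t.\tC\tT\t99\tPASS\t.
chr2\t500\t.\tG\tA,C\t10\tq10\t.
"""


def parse_vcf(handle):
    """Yield one row per ALT allele from the first seven VCF columns."""
    for line in handle:
        # Skip meta-information lines, the header line and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, vid, ref, alts, qual, flt = line.rstrip("\n").split("\t")[:7]
        # Multi-allelic sites list ALT alleles separated by commas.
        for alt in alts.split(","):
            yield (chrom, int(pos), vid, ref, alt,
                   None if qual == "." else float(qual), flt)


def load_variants(conn, rows):
    """Create the (assumed) variants table and bulk-insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants ("
        "chrom TEXT, pos INTEGER, variant_id TEXT, "
        "ref TEXT, alt TEXT, qual REAL, filter TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


# An in-memory database keeps the example self-contained;
# a real setup would write to a file or a database server instead.
conn = sqlite3.connect(":memory:")
load_variants(conn, parse_vcf(io.StringIO(SAMPLE_VCF)))

# The kind of aggregate a Superset chart might run: variants per chromosome.
counts = conn.execute(
    "SELECT chrom, COUNT(*) FROM variants GROUP BY chrom ORDER BY chrom"
).fetchall()
```

If the table were written to a file instead, Superset could be pointed at it with a SQLAlchemy URI of the form `sqlite:///path/to/variants.db` (the path here is a placeholder). For millions of variants per patient you would most likely want a proper database server, as we did.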


The following are some thoughts based on a day’s usage of Superset:




I think Superset is a nice visualization tool that lets you explore large SQL datasets quickly. Keep in mind that a lot of the visualizations are quite simple and any customization requires knowledge of SQL.

From a genomics perspective, I think Superset is lacking the visualizations that would make it useful. That being said, I could imagine a team of people customizing Superset for genomics. A lot of the fundamental features are there, and given some genomics-specific visualizations, I think it could become a useful tool in the field.