GGPlot Tutorial

[1]:
import hail as hl
from hail.ggplot import *

import plotly

The Hail team has implemented a plotting module for Hail based on the popular ggplot2 package from R’s tidyverse. ggplot2 is very fully featured, and this module will never be quite as flexible, but even a subset of its functionality is enough to make highly customizable plots.

The Grammar of Graphics

The key idea here is that there’s not one magic function to make the plot you want. Plots are built up from a set of core primitives that allow for extensive customization. Let’s start with an example. We are going to plot y = x^2 for the integers x from 0 to 9. First we make a Hail table representing that data:

[2]:
ht = hl.utils.range_table(10)
ht = ht.annotate(squared = ht.idx**2)
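
As a quick check on what the table holds, the same rows can be written out in plain Python. This is only an illustration of the data, not a Hail call; `rows` is a name chosen here for the sketch.

```python
# Plain-Python picture of the rows in `ht`: range_table(10) yields idx = 0..9,
# and the annotation adds squared = idx ** 2 to every row.
rows = [{"idx": i, "squared": i ** 2} for i in range(10)]

for row in rows:
    print(row["idx"], row["squared"])
```

The last row is idx 9 with squared 81, so the plotted curve stops at x = 9.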

Every plot starts with a call to ggplot, and then requires adding a geom to specify what kind of plot you’d like to create.

[3]:
fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_line()
fig.show()
Initializing Hail with default parameters...
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-09f2439d4b:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.133-4c60fddb171a
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20241004-2013-0.2.133-4c60fddb171a.log

aes creates an “aesthetic mapping”, which maps Hail expressions to aspects of the plot. Each geom supports a predefined list of aesthetics; most take at least an x and a y.
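
The idea of building a plot by adding layers can be modelled in a few lines of plain Python. The sketch below is a conceptual toy, not Hail's implementation: the `Plot` and `Geom` classes are hypothetical stand-ins for what `ggplot()` and a geom function return, and the mapping is just a dict of field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Geom:
    """Hypothetical stand-in for a geom such as geom_line() or geom_point()."""
    kind: str


@dataclass(frozen=True)
class Plot:
    """Hypothetical stand-in for the object returned by ggplot()."""
    mapping: dict
    layers: tuple = ()

    def __add__(self, geom):
        # Adding a geom returns a new plot with one more layer;
        # the original plot object is left unchanged.
        return Plot(self.mapping, self.layers + (geom,))


base = Plot(mapping={"x": "idx", "y": "squared"})
line_plot = base + Geom("line")
both = line_plot + Geom("point")

print([g.kind for g in both.layers])  # ['line', 'point']
print(len(base.layers))               # 0
```

In this toy model, each `+` yields a new object, so a partially built plot can be reused with different geoms.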

With this interface, it’s easy to change the plotting representation independently of the data. We can plot bars:

[4]:
fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_col()
fig.show()

Or points:

[5]:
fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_point()
fig.show()

There are optional aesthetics too. If we want, we could color the points based on whether they’re even or odd:

[6]:
fig = ggplot(ht, aes(x=ht.idx, y=ht.squared, color=hl.if_else(ht.idx % 2 == 0, "even", "odd"))) + geom_point()
fig.show()

Note that by default the color aesthetic takes an expression that evaluates to strings and assigns a discrete color to each distinct string.
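
One way to picture the discrete color assignment is a lookup from each distinct string to a palette entry. The sketch below is an assumption-laden illustration: `assign_colors`, the palette, and the first-appearance ordering are all choices made here, not a description of how Hail picks colors.

```python
def assign_colors(labels, palette):
    """Give each distinct label the next palette color, in order of first appearance."""
    colors = {}
    for label in labels:
        if label not in colors:
            # Wrap around if there are more labels than palette entries.
            colors[label] = palette[len(colors) % len(palette)]
    return colors


# Same even/odd labels as the plot above, for idx = 0..9.
labels = ["even" if i % 2 == 0 else "odd" for i in range(10)]
palette = ["blue", "orange", "green"]  # hypothetical palette

print(assign_colors(labels, palette))  # {'even': 'blue', 'odd': 'orange'}
```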

Say we wanted to plot the line with the colored points overlaid on top of it. We could try:

[7]:
fig = (ggplot(ht, aes(x=ht.idx, y=ht.squared, color=hl.if_else(ht.idx % 2 == 0, "even", "odd"))) +
       geom_line() +
       geom_point()
      )
fig.show()

But that colors the line as well, leaving us with interlocking blue and orange lines, which isn’t what we want. For that reason, it’s possible to define aesthetics that apply only to certain geoms.

[8]:
fig = (ggplot(ht, aes(x=ht.idx, y=ht.squared)) +
       geom_line() +
       geom_point(aes(color=hl.if_else(ht.idx % 2 == 0, "even", "odd")))
      )
fig.show()

All geoms can take their own aesthetic mapping, which lets them set aesthetics that apply only to them. geom_point still inherits the x and y aesthetics from the mapping defined in ggplot().
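
The inheritance described here can be thought of as a dictionary merge. The following sketch uses plain dicts of field names rather than Hail expressions; `effective_aes` is a hypothetical helper, and the rule that a geom-level entry wins on conflict is an assumption borrowed from ggplot2 rather than something stated above.

```python
def effective_aes(plot_aes, geom_aes):
    """Combine plot-level and geom-level aesthetics.

    Assumption: when both define the same aesthetic, the geom-level value wins.
    """
    return {**plot_aes, **geom_aes}


plot_aes = {"x": "idx", "y": "squared"}

# geom_line() has no mapping of its own, so it sees only x and y.
line_aes = effective_aes(plot_aes, {})

# geom_point(aes(color=...)) adds color while keeping the inherited x and y.
point_aes = effective_aes(plot_aes, {"color": "even_or_odd"})

print(line_aes)   # {'x': 'idx', 'y': 'squared'}
print(point_aes)  # {'x': 'idx', 'y': 'squared', 'color': 'even_or_odd'}
```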

Geoms that group

Some geoms implicitly do an aggregation based on the x aesthetic, and so don’t take a y value. Consider this dataset from gapminder with information about countries around the world, with one datapoint taken per country in the years 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, and 2007.

[9]:
gp = hl.Table.from_pandas(plotly.data.gapminder())
gp.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'country': str
    'continent': str
    'year': int32
    'lifeExp': float64
    'pop': int32
    'gdpPercap': float64
    'iso_alpha': str
    'iso_num': int32
----------------------------------------
Key: []
----------------------------------------

Let’s filter the data to 2007 for our first experiments:

[10]:
gp_2007 = gp.filter(gp.year == 2007)

If we want to see how many countries from each continent we have, we can use geom_bar, which just takes in an x aesthetic and then implicitly counts how many values of each x there are.

[11]:
ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar()
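
The implicit aggregation is the same kind of tally that `collections.Counter` produces. The sketch below uses a small hypothetical list of continent labels, not the gapminder data, and is meant only to show what is being counted.

```python
from collections import Counter

# Hypothetical x values, one per row; the real table has one row per country.
continents = ["Asia", "Europe", "Africa", "Asia", "Europe", "Asia"]

# One bar per distinct x value; the bar height is the number of rows with that value.
counts = Counter(continents)

for continent, n in sorted(counts.items()):
    print(continent, n)
```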

To make it a little prettier, let’s color per continent as well. We use fill to specify the color of shapes, as opposed to color for points and lines. (On a bar chart, color sets the color of the bar outline.)

[12]:
ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar(aes(fill=gp_2007.continent))

Maybe we instead want to see not the number of countries per continent, but the number of people living on each continent. We can do this with geom_bar as well by specifying a weight.

[13]:
ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar(aes(fill=gp_2007.continent, weight=gp_2007.pop))
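
With a weight, each row contributes its weight to the bar instead of contributing 1. A minimal plain-Python sketch of that sum, using hypothetical (continent, pop) pairs rather than the real data:

```python
from collections import defaultdict

# Hypothetical (continent, pop) rows standing in for gp_2007.
rows = [("Asia", 100), ("Europe", 40), ("Asia", 25), ("Africa", 60)]

# Each bar's height is the sum of the weights of the rows that share its x value.
totals = defaultdict(int)
for continent, pop in rows:
    totals[continent] += pop

print(dict(totals))  # {'Asia': 125, 'Europe': 40, 'Africa': 60}
```

Setting every weight to 1 gives back the plain row count from the unweighted bar chart.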

Histograms are similar to bar plots, except they break a continuous x axis into bins. Let’s import the iris dataset for this.

[14]:
iris = hl.Table.from_pandas(plotly.data.iris())
iris.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Row fields:
    'sepal_length': float64
    'sepal_width': float64
    'petal_length': float64
    'petal_width': float64
    'species': str
    'species_id': int32
----------------------------------------
Key: []
----------------------------------------

Let’s make a histogram:

[15]:
ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram()
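
Binning a continuous axis can be sketched in plain Python as follows. This is one common equal-width scheme, written under assumptions: the function name, the parameter names, and the handling of the right edge are choices made for this sketch, and the sample values are hypothetical rather than taken from iris.

```python
def bin_counts(values, min_val, max_val, bins):
    """Count values in `bins` equal-width bins covering [min_val, max_val]."""
    width = (max_val - min_val) / bins
    counts = [0] * bins
    for v in values:
        if v < min_val or v > max_val:
            continue  # ignore values outside the plotted range
        # A value equal to max_val is placed in the last bin.
        index = min(int((v - min_val) / width), bins - 1)
        counts[index] += 1
    return counts


sepal_lengths = [4.3, 5.0, 5.1, 5.8, 6.4, 7.0, 7.9]  # hypothetical sample
print(bin_counts(sepal_lengths, 4.0, 8.0, 4))  # [1, 3, 1, 2]
```

With a fill aesthetic, the same counting would be done once per group within each bin.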

By default, geom_histogram stacks the groups on top of each other, which is not always easy to interpret. We can pass the position argument to geom_histogram to get different behavior. "dodge" puts the bars next to each other:

[16]:
ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram(position="dodge")

And "identity" plots them over each other. It helps to set an alpha value to make the bars slightly transparent in this case:

[17]:
ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram(position="identity", alpha=0.8)
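
The three position values differ only in where each group's bar is placed within a bin. The sketch below makes that concrete for a single bin; `bar_layout`, the per-group counts, and the convention of splitting the bin width evenly for "dodge" are all assumptions made for illustration, not Hail's drawing code.

```python
def bar_layout(counts, position):
    """Place each group's bar within one bin.

    Returns {group: (x_offset, bottom, top)}, where x_offset is a fraction
    of the bin width.
    """
    if position not in ("stack", "dodge", "identity"):
        raise ValueError(f"unknown position: {position}")
    layout = {}
    running = 0
    for i, (group, n) in enumerate(counts.items()):
        if position == "stack":
            # Each bar starts where the previous group's bar ended.
            layout[group] = (0.0, running, running + n)
            running += n
        elif position == "dodge":
            # Bars sit side by side, each starting from zero.
            layout[group] = (i / len(counts), 0, n)
        else:
            # "identity": bars overlap, all starting from zero at the same x.
            layout[group] = (0.0, 0, n)
    return layout


counts = {"setosa": 5, "versicolor": 3, "virginica": 2}  # hypothetical counts for one bin
print(bar_layout(counts, "stack"))
print(bar_layout(counts, "identity"))
```

Because "identity" draws every bar from zero at the same horizontal position, taller bars can hide shorter ones, which is why some transparency helps.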

Labels and Axes

It’s always a good idea to label your axes. This can be done most easily with xlab and ylab. We can also use ggtitle to add a title. Let’s pull in the same plot from above, and add labels.

[18]:
(ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) +
 geom_histogram(position="identity", alpha=0.8) +
 xlab("Sepal Length") + ylab("Number of samples") + ggtitle("Sepal length by flower type")
)