Plotting Tutorial

The Hail plot module allows for easy plotting of data. This notebook contains examples of how to use the plotting functions in this module, many of which can also be found in the first tutorial.

In [1]:
import hail as hl

from import show, output_notebook
from bokeh.layouts import gridplot
Running on Apache Spark version 2.2.0
SparkUI available at
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.8-ce91c8aa54d8
LOGGING: writing to /hail/repo/hail/build/tmp/python/hail/docs/tutorials/hail-20190120-2208-0.2.8-ce91c8aa54d8.log
Loading BokehJS ...
In [2]:
mt = hl.read_matrix_table('data/')
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
mt = mt.annotate_cols(**table[mt.s])
mt = hl.sample_qc(mt)

2019-01-20 22:08:23 Hail: INFO: 1KG files found
2019-01-20 22:08:25 Hail: INFO: Reading table to impute column types
Global fields:
Column fields:
    's':     str
    'Population':     str
    'SuperPopulation':     str
    'isFemale':     bool
    'PurpleHair':     bool
    'CaffeineConsumption':     int32
    'sample_qc':     struct {
        dp_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        gq_stats: struct {
            mean: float64,
            stdev: float64,
            min: float64,
            max: float64
        call_rate: float64,
        n_called: int64,
        n_not_called: int64,
        n_hom_ref: int64,
        n_het: int64,
        n_hom_var: int64,
        n_non_ref: int64,
        n_singleton: int64,
        n_snp: int64,
        n_insertion: int64,
        n_deletion: int64,
        n_transition: int64,
        n_transversion: int64,
        n_star: int64,
        r_ti_tv: float64,
        r_het_hom_var: float64,
        r_insertion_deletion: float64
Row fields:
    'locus':     locus<GRCh37>
    'alleles':     array<str>
    'rsid':     str
    'qual':     float64
    'filters':     set<str>
    'info':     struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
Entry fields:
    'GT':     call
    'AD':     array<int32>
    'DP':     int32
    'GQ':     int32
    'PL':     array<int32>
Column key: ['s']
Row key: ['locus', 'alleles']
2019-01-20 22:08:25 Hail: INFO: Finished type imputation
  Loading column 'Sample' as type 'str' (imputed)
  Loading column 'Population' as type 'str' (imputed)
  Loading column 'SuperPopulation' as type 'str' (imputed)
  Loading column 'isFemale' as type 'bool' (imputed)
  Loading column 'PurpleHair' as type 'bool' (imputed)
  Loading column 'CaffeineConsumption' as type 'int32' (imputed)


The histogram() method takes as an argument an aggregated hist expression, as well as optional arguments for the legend and title of the plot.

In [3]:
dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')

This method, like all Hail plotting methods, also allows us to pass in fields of our data set directly. Choosing not to specify the range and bins arguments would result in a range being computed based on the largest and smallest values in the dataset and a default bins value of 50.

In [4]:
p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)

Cumulative Histogram

The cumulative_histogram() method works in a similar way to histogram().

In [5]:
p = hl.plot.cumulative_histogram(mt.DP, range=(0,30), bins=30)


The scatter() method can also take in either Python types or Hail fields as arguments for x and y.

In [6]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.call_rate, xlabel='Mean DP', ylabel='Call Rate')
2019-01-20 22:08:33 Hail: INFO: Coerced sorted dataset

We can also pass in a Hail field as a label argument, which determines how to color the data points.

In [7]:
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
                        (mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                        (mt.GT.is_hom_var() & (ab >= 0.9)))
mt = mt.filter_entries(filter_condition_ab)
mt = hl.variant_qc(mt).cache()
common_mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
gwas = hl.linear_regression_rows(y=common_mt.CaffeineConsumption, x=common_mt.GT.n_alt_alleles(), covariates=[1.0])
pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(common_mt.GT)
2019-01-20 22:08:34 Hail: INFO: Coerced sorted dataset
2019-01-20 22:08:36 Hail: INFO: Coerced sorted dataset
2019-01-20 22:08:39 Hail: INFO: linear_regression_rows: running on 250 samples for 1 response variable y,
    with input variable x, and 1 additional covariate...
2019-01-20 22:08:44 Hail: INFO: hwe_normalized_pca: running PCA using 9169 variants.
2019-01-20 22:08:46 Hail: INFO: pca: running PCA with 10 components...
2019-01-20 22:08:52 Hail: INFO: Coerced sorted dataset
In [8]:
p = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
                    title='PCA', xlabel='PC1', ylabel='PC2', collect_all=True)
2019-01-20 22:08:52 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2019-01-20 22:08:53 Hail: INFO: Coerced sorted dataset

Hail’s downsample aggregator is incorporated into the scatter(), qq(), and manhattan() functions. The collect_all parameter tells the plot function whether to collect all values or downsample. Choosing not to set this parameter results in downsampling.

In [9]:
p2 = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
                    title='PCA (downsampled)', xlabel='PC1', ylabel='PC2', collect_all=False, n_divisions=50)

show(gridplot([p, p2], ncols=2, plot_width=400, plot_height=400))
2019-01-20 22:08:53 Hail: INFO: Coerced sorted dataset

Q-Q (Quantile-Quantile)

The qq() function requires either a Python type or a Hail field containing p-values to be plotted. This function also allows for downsampling.

In [10]:
p = hl.plot.qq(gwas.p_value, collect_all=True)
p2 = hl.plot.qq(gwas.p_value, n_divisions=75)

show(gridplot([p, p2], ncols=2, plot_width=400, plot_height=400))
2019-01-20 22:08:56 Hail: INFO: Ordering unsorted dataset with network shuffle


The manhattan() function requires a Hail field containing p-values.

In [11]:
p = hl.plot.manhattan(gwas.p_value)