Plotting Tutorial
The Hail plot module allows for easy plotting of data. This notebook contains examples of how to use the plotting functions in this module, many of which can also be found in the first tutorial.
[1]:
import hail as hl
hl.init()
from bokeh.io import show
from bokeh.layouts import gridplot
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See https://www.slf4j.org/codes.html#noProviders for further details.
SLF4J: Class path contains SLF4J bindings targeting slf4j-api versions 1.7.x or earlier.
SLF4J: Ignoring binding found at [jar:file:/usr/local/lib/python3.9/dist-packages/pyspark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See https://www.slf4j.org/codes.html#ignoredBindings for an explanation.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.0
SparkUI available at http://hostname-8dc6a28a63:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.124-13536b531342
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20230921-1904-0.2.124-13536b531342.log
[2]:
hl.utils.get_1kg('data/')
mt = hl.read_matrix_table('data/1kg.mt')
table = (hl.import_table('data/1kg_annotations.txt', impute=True)
.key_by('Sample'))
mt = mt.annotate_cols(**table[mt.s])
mt = hl.sample_qc(mt)
mt.describe()
2023-09-21 19:04:26.912 Hail: INFO: 1KG files found
2023-09-21 19:04:33.516 Hail: INFO: Reading table to impute column types
2023-09-21 19:04:35.270 Hail: INFO: Finished type imputation
Loading field 'Sample' as type str (imputed)
Loading field 'Population' as type str (imputed)
Loading field 'SuperPopulation' as type str (imputed)
Loading field 'isFemale' as type bool (imputed)
Loading field 'PurpleHair' as type bool (imputed)
Loading field 'CaffeineConsumption' as type int32 (imputed)
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
'Population': str
'SuperPopulation': str
'isFemale': bool
'PurpleHair': bool
'CaffeineConsumption': int32
'sample_qc': struct {
dp_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
gq_stats: struct {
mean: float64,
stdev: float64,
min: float64,
max: float64
},
call_rate: float64,
n_called: int64,
n_not_called: int64,
n_filtered: int64,
n_hom_ref: int64,
n_het: int64,
n_hom_var: int64,
n_non_ref: int64,
n_singleton: int64,
n_snp: int64,
n_insertion: int64,
n_deletion: int64,
n_transition: int64,
n_transversion: int64,
n_star: int64,
r_ti_tv: float64,
r_het_hom_var: float64,
r_insertion_deletion: float64
}
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
BaseQRankSum: float64,
ClippingRankSum: float64,
DP: int32,
DS: bool,
FS: float64,
HaplotypeScore: float64,
InbreedingCoeff: float64,
MLEAC: array<int32>,
MLEAF: array<float64>,
MQ: float64,
MQ0: int32,
MQRankSum: float64,
QD: float64,
ReadPosRankSum: float64,
set: str
}
----------------------------------------
Entry fields:
'GT': call
'AD': array<int32>
'DP': int32
'GQ': int32
'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
Histogram
The histogram()
method takes as an argument an aggregated hist expression, as well as optional arguments for the legend and title of the plot.
[3]:
dp_hist = mt.aggregate_entries(hl.expr.aggregators.hist(mt.DP, 0, 30, 30))
p = hl.plot.histogram(dp_hist, legend='DP', title='DP Histogram')
show(p)
[Stage 3:> (0 + 1) / 1]
This method, like all Hail plotting methods, also allows us to pass in fields of our data set directly. Choosing not to specify the range
and bins
arguments would result in a range being computed based on the largest and smallest values in the dataset and a default bins value of 50.
[4]:
p = hl.plot.histogram(mt.DP, range=(0, 30), bins=30)
show(p)
[Stage 4:> (0 + 1) / 1]
Cumulative Histogram
The cumulative_histogram()
method works in a similar way to histogram()
.
[5]:
p = hl.plot.cumulative_histogram(mt.DP, range=(0,30), bins=30)
show(p)
[Stage 5:> (0 + 1) / 1]
Scatter
The scatter()
method can also take in either Python types or Hail fields as arguments for x and y.
[6]:
p = hl.plot.scatter(mt.sample_qc.dp_stats.mean, mt.sample_qc.call_rate, xlabel='Mean DP', ylabel='Call Rate')
show(p)
2023-09-21 19:04:46.278 Hail: WARN: aggregate_cols(): Aggregates over cols ordered by 'col_key'.
To preserve matrix table column order, first unkey columns with 'key_cols_by()'
[Stage 7:> (0 + 1) / 1]
We can also pass in a Hail field as a label
argument, which determines how to color the data points.
[7]:
mt = mt.filter_cols((mt.sample_qc.dp_stats.mean >= 4) & (mt.sample_qc.call_rate >= 0.97))
ab = mt.AD[1] / hl.sum(mt.AD)
filter_condition_ab = ((mt.GT.is_hom_ref() & (ab <= 0.1)) |
(mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
(mt.GT.is_hom_var() & (ab >= 0.9)))
mt = mt.filter_entries(filter_condition_ab)
mt = hl.variant_qc(mt).cache()
common_mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01)
gwas = hl.linear_regression_rows(y=common_mt.CaffeineConsumption, x=common_mt.GT.n_alt_alleles(), covariates=[1.0])
pca_eigenvalues, pca_scores, _ = hl.hwe_normalized_pca(common_mt.GT)
2023-09-21 19:05:13.613 Hail: INFO: wrote matrix table with 10879 rows and 250 columns in 1 partition to /tmp/persist_MatrixTableV4r1rzE386
2023-09-21 19:05:14.951 Hail: INFO: linear_regression_rows: running on 250 samples for 1 response variable y,
with input variable x, and 1 additional covariate...
2023-09-21 19:05:17.784 Hail: INFO: wrote table with 9095 rows in 1 partition to /tmp/persist_TableqTwgYWPFQ8
2023-09-21 19:05:19.145 Hail: INFO: hwe_normalize: found 9087 variants after filtering out monomorphic sites.
2023-09-21 19:05:20.903 Hail: INFO: pca: running PCA with 10 components...
2023-09-21 19:05:27.775 Hail: INFO: wrote table with 0 rows in 0 partitions to /tmp/persist_Tablek7OJNi39H1
[8]:
p = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
label=common_mt.cols()[pca_scores.s].SuperPopulation,
title='PCA', xlabel='PC1', ylabel='PC2', collect_all=True)
show(p)
2023-09-21 19:05:28.810 Hail: INFO: Coerced sorted dataset
2023-09-21 19:05:29.341 Hail: INFO: Coerced sorted dataset
Hail’s downsample aggregator is incorporated into the scatter()
, qq()
, and manhattan()
functions. The collect_all
parameter tells the plot function whether to collect all values or downsample. Choosing not to set this parameter results in downsampling.
[9]:
p2 = hl.plot.scatter(pca_scores.scores[0], pca_scores.scores[1],
label=common_mt.cols()[pca_scores.s].SuperPopulation,
title='PCA (downsampled)', xlabel='PC1', ylabel='PC2', collect_all=False, n_divisions=50)
show(gridplot([p, p2], ncols=2, width=400, height=400))
2023-09-21 19:05:30.855 Hail: INFO: Coerced sorted dataset
2023-09-21 19:05:31.291 Hail: INFO: Coerced sorted dataset
2-D histogram
For visualizing relationships between variables in large datasets (where scatter plots may be less informative since they highlight outliers), the histogram_2d()
function will create a heatmap with the number of observations in each section of a 2-d grid based on two variables.
[10]:
p = hl.plot.histogram2d(pca_scores.scores[0], pca_scores.scores[1])
show(p)
2023-09-21 19:05:32.687 Hail: WARN: At least one range was not defined in histogram_2d. Doing two passes...
2023-09-21 19:05:35.209 Hail: INFO: Ordering unsorted dataset with network shuffle
Q-Q (Quantile-Quantile)
The qq()
function requires either a Python type or a Hail field containing p-values to be plotted. This function also allows for downsampling.
[11]:
p = hl.plot.qq(gwas.p_value, collect_all=True)
p2 = hl.plot.qq(gwas.p_value, n_divisions=75)
show(gridplot([p, p2], ncols=2, width=400, height=400))
2023-09-21 19:05:37.824 Hail: INFO: Ordering unsorted dataset with network shuffle
2023-09-21 19:05:40.854 Hail: INFO: Ordering unsorted dataset with network shuffle
2023-09-21 19:05:43.138 Hail: INFO: Ordering unsorted dataset with network shuffle
2023-09-21 19:05:45.177 Hail: INFO: Ordering unsorted dataset with network shuffle
Manhattan
The manhattan()
function requires a Hail field containing p-values.
[12]:
p = hl.plot.manhattan(gwas.p_value)
show(p)
We can also pass in a dictionary of fields that we would like to show up as we hover over a data point, and choose not to downsample if the dataset is relatively small.
[13]:
hover_fields = dict([('alleles', gwas.alleles)])
p = hl.plot.manhattan(gwas.p_value, hover_fields=hover_fields, collect_all=True)
show(p)