MatrixTable Tutorial
If you’ve gotten this far, you’re probably thinking:
“Can’t I do all of this in
pandas
orR
?”“What does this have to do with biology?”
The two crucial features that Hail adds are scalability and the domain-specific primitives needed to work easily with biological data. Fear not! You’ve learned most of the basic concepts of Hail and now are ready for the bit that makes it possible to represent and compute on genetic matrices: the MatrixTable.
In the last example of the Table Joins Tutorial, the ratings table had a compound key: movie_id
and user_id
. The ratings were secretly a movie-by-user matrix!
However, since this matrix is very sparse, it is reasonably represented in a so-called “coordinate form” Table
, where each row of the table is an entry of the sparse matrix. For large and dense matrices (like sequencing data), the per-row overhead of coordinate reresentations is untenable. That’s why we built MatrixTable
, a 2-dimensional generalization of Table
.
MatrixTable Anatomy
Recall that Table
has two kinds of fields:
global fields
row fields
MatrixTable
has four kinds of fields:
global fields
row fields
column fields
entry fields
Row fields are fields that are stored once per row. These can contain information about the rows, or summary data calculated per row.
Column fields are stored once per column. These can contain information about the columns, or summary data calculated per column.
Entry fields are the piece that makes this structure a matrix – there is an entry for each (row, column) pair.
Importing and Reading
Like tables, matrix tables can be imported from a variety of formats: VCF, (B)GEN, PLINK, TSV, etc. Matrix tables can also be read from a “native” matrix table format. Let’s read a sample of prepared 1KG data.
[1]:
import hail as hl
from bokeh.io import output_notebook, show
output_notebook()
hl.utils.get_1kg('data/')
Initializing Hail with default parameters...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Running on Apache Spark version 3.5.0
SparkUI available at http://hostname-96b797a886:4040
Welcome to
__ __ <>__
/ /_/ /__ __/ /
/ __ / _ `/ / /
/_/ /_/\_,_/_/_/ version 0.2.132-678e1f52b999
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20240709-1724-0.2.132-678e1f52b999.log
2024-07-09 17:25:04.643 Hail: INFO: 1KG files found
[2]:
mt = hl.read_matrix_table('data/1kg.mt')
mt.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
BaseQRankSum: float64,
ClippingRankSum: float64,
DP: int32,
DS: bool,
FS: float64,
HaplotypeScore: float64,
InbreedingCoeff: float64,
MLEAC: array<int32>,
MLEAF: array<float64>,
MQ: float64,
MQ0: int32,
MQRankSum: float64,
QD: float64,
ReadPosRankSum: float64,
set: str
}
----------------------------------------
Entry fields:
'GT': call
'AD': array<int32>
'DP': int32
'GQ': int32
'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
There are a few things to note:
There is a single column field
s
. This is the sample ID from the VCF. It is also the column key.There is a compound row key:
locus
andalleles
.locus
has typelocus<GRCh37>
alleles
has typearray<str>
GT has type
call
. That’s a genotype call!
Whereas table expressions could be indexed by nothing or indexed by rows, matrix table expression have four options: nothing, indexed by row, indexed by column, or indexed by row and column (the entries). Let’s see some examples.
[3]:
mt.s.describe()
--------------------------------------------------------
Type:
str
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f289c301c10>
Index:
['column']
--------------------------------------------------------
[4]:
mt.GT.describe()
--------------------------------------------------------
Type:
call
--------------------------------------------------------
Source:
<hail.matrixtable.MatrixTable object at 0x7f289c301c10>
Index:
['column', 'row']
--------------------------------------------------------
MatrixTable operations
We belabored the operations on tables because they all have natural analogs (sometimes several) on matrix tables. For example:
count
=>count_{rows, cols}
(andcount
which returns both)filter
=>filter_{rows, cols, entries}
annotate
=>annotate_{rows, cols, entries}
(and globals for both)select
=>select_{rows, cols, entries}
(and globals for both)transmute
=>transmute_{rows, cols, entries}
(and globals for both)group_by
=>group_{rows, cols}_by
explode
=>expode_{rows, cols}
aggregate
=>aggregate_{rows, cols, entries}
Some operations are unique to MatrixTable
:
The row fields can be accessed as a
Table
with rowsThe column fields can be accessed as a
Table
with cols.The entire field space of a
MatrixTable
can be accessed as a coordinate-formTable
with entries. Be careful with this! While it’s fast to aggregate or query, trying to write thisTable
to disk could produce files thousands of times larger than the correspondingMatrixTable
.
Let’s explore mt
using these tools. Let’s get the size of the dataset.
[5]:
mt.count() # (rows, cols)
[5]:
(10879, 284)
Let’s look at the first few row keys (variants) and column keys (sample IDs).
[6]:
mt.rows().select().show()
SLF4J: Failed to load class "org.slf4j.impl.StaticMDCBinder".
SLF4J: Defaulting to no-operation MDCAdapter implementation.
SLF4J: See http://www.slf4j.org/codes.html#no_static_mdc_binder for further details.
showing top 10 rows
[7]:
mt.s.show()
showing top 10 rows
Let’s investigate the genotypes and the call rate. Let’s look at the first few genotypes:
[8]:
mt.GT.show()
showing top 10 rows
showing the first 4 of 284 columns
Let’s look at the distribution of genotype calls:
[9]:
mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))
[9]:
{0: 1730651, 1: 741176, 2: 576444, None: 41365}
Let’s compute the overall call rate directly, and then plot the distribution of call rate per variant.
[10]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
[10]:
0.9866116914743355
Here’s a nice trick: you can use an aggregator inside annotate_rows
and it will aggregate over columns, that is, summarize the values in the row using the aggregator. Let’s compute and plot call rate per variant.
[11]:
mt2 = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
mt2.describe()
----------------------------------------
Global fields:
None
----------------------------------------
Column fields:
's': str
----------------------------------------
Row fields:
'locus': locus<GRCh37>
'alleles': array<str>
'rsid': str
'qual': float64
'filters': set<str>
'info': struct {
AC: array<int32>,
AF: array<float64>,
AN: int32,
BaseQRankSum: float64,
ClippingRankSum: float64,
DP: int32,
DS: bool,
FS: float64,
HaplotypeScore: float64,
InbreedingCoeff: float64,
MLEAC: array<int32>,
MLEAF: array<float64>,
MQ: float64,
MQ0: int32,
MQRankSum: float64,
QD: float64,
ReadPosRankSum: float64,
set: str
}
'call_rate': float64
----------------------------------------
Entry fields:
'GT': call
'AD': array<int32>
'DP': int32
'GQ': int32
'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
[12]:
p = hl.plot.histogram(mt2.call_rate, range=(0,1.0), bins=100,
title='Variant Call Rate Histogram', legend='Call Rate')
show(p)
Exercise: GQ vs DP
In this exercise, you’ll use Hail to investigate a strange property of sequencing datasets.
The DP
field is the sequencing depth (the number of reads).
Let’s first plot a histogram of DP
:
[13]:
p = hl.plot.histogram(mt.DP, range=(0,40), bins=40, title='DP Histogram', legend='DP')
show(p)
[Stage 9:> (0 + 1) / 1]
Now, let’s do the same thing for GQ.
The GQ
field is the phred-scaled “genotype quality”. The formula to convert to a linear-scale confidence (0 to 1) is 10 ** -(mt.GQ / 10)
. GQ is truncated to lie between 0 and 99.
[14]:
p = hl.plot.histogram(mt.GQ, range=(0,100), bins=100, title='GQ Histogram', legend='GQ')
show(p)
[Stage 10:> (0 + 1) / 1]
Whoa! That’s a strange distribution! There’s a big spike at 100. The rest of the values have roughly the same shape as the DP distribution, but form a Dimetrodon. Use Hail to figure out what’s going on!
[ ]: