MatrixTable Tutorial

If you’ve gotten this far, you’re probably thinking:

  • “Can’t I do all of this in pandas or R?”

  • “What does this have to do with biology?”

The two crucial features that Hail adds are scalability and the domain-specific primitives needed to work easily with biological data. Fear not! You’ve learned most of the basic concepts of Hail and now are ready for the bit that makes it possible to represent and compute on genetic matrices: the MatrixTable.

In the last example of the Table Joins Tutorial, the ratings table had a compound key: movie_id and user_id. The ratings were secretly a movie-by-user matrix!

However, since this matrix is very sparse, it is reasonably represented in a so-called “coordinate form” Table, where each row of the table is an entry of the sparse matrix. For large and dense matrices (like sequencing data), the per-row overhead of coordinate representations is untenable. That’s why we built MatrixTable, a 2-dimensional generalization of Table.
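To make the distinction concrete, here is a toy sketch in plain Python (not Hail, and not how either structure is actually stored) of the same ratings matrix in coordinate form versus dense form. The movie and user names are made up for illustration:

```python
# A sparse movie-by-user ratings matrix in "coordinate form":
# one (row key, column key, value) record per non-missing entry.
coordinate_form = [
    ("movie_1", "user_1", 5),
    ("movie_1", "user_3", 3),
    ("movie_2", "user_2", 4),
]

# The equivalent dense form stores one value per (row, column) pair,
# so the keys are paid for once per row/column instead of once per entry.
movies = ["movie_1", "movie_2"]
users = ["user_1", "user_2", "user_3"]
dense_form = [[None] * len(users) for _ in movies]
for movie, user, rating in coordinate_form:
    dense_form[movies.index(movie)][users.index(user)] = rating

print(dense_form)  # [[5, None, 3], [None, 4, None]]
```

When most entries are missing, the coordinate form wins; when most entries are present (as in sequencing data), the per-entry keys dominate the storage, which is the overhead MatrixTable avoids.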

MatrixTable Anatomy

Recall that Table has two kinds of fields:

  • global fields

  • row fields

MatrixTable has four kinds of fields:

  • global fields

  • row fields

  • column fields

  • entry fields

Row fields are fields that are stored once per row. These can contain information about the rows, or summary data calculated per row.

Column fields are stored once per column. These can contain information about the columns, or summary data calculated per column.

Entry fields are the piece that makes this structure a matrix – there is an entry for each (row, column) pair.
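As a toy mental model in plain Python (illustrative only; this is not how Hail stores data), the four field kinds fit together like this. The global value and the genotypes are made up; the row and column keys echo the 1KG data shown below:

```python
# A toy mental model of MatrixTable's four field kinds.
matrix_table = {
    "globals": {"dataset": "1kg-sample"},                     # stored once
    "rows": [{"locus": "1:904165"}, {"locus": "1:909917"}],   # once per row
    "cols": [{"s": "HG00096"}, {"s": "HG00099"}],             # once per column
    # one entry per (row, column) pair:
    "entries": [
        [{"GT": "0/0"}, {"GT": "0/1"}],
        [{"GT": "0/0"}, {"GT": "0/0"}],
    ],
}

# The entries form a matrix whose shape matches rows x cols.
assert len(matrix_table["entries"]) == len(matrix_table["rows"])
assert len(matrix_table["entries"][0]) == len(matrix_table["cols"])
```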

Importing and Reading

Like tables, matrix tables can be imported from a variety of formats: VCF, (B)GEN, PLINK, TSV, etc. Matrix tables can also be read from a “native” matrix table format. Let’s read a sample of prepared 1KG data.

[1]:
import hail as hl
from bokeh.io import output_notebook, show
output_notebook()

hl.utils.get_1kg('data/')
Loading BokehJS ...
Initializing Hail with default parameters...
24/02/16 20:04:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.2
SparkUI available at http://hostname-e0866086c5:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.128-eead8100a1c1
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20240216-2004-0.2.128-eead8100a1c1.log
2024-02-16 20:04:50.507 Hail: INFO: 1KG files found
[2]:
mt = hl.read_matrix_table('data/1kg.mt')
mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
    }
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

There are a few things to note:

  • There is a single column field s. This is the sample ID from the VCF. It is also the column key.

  • There is a compound row key: locus and alleles.

    • locus has type locus<GRCh37>

    • alleles has type array<str>

  • GT has type call. That’s a genotype call!

Whereas table expressions can be indexed by nothing or indexed by rows, matrix table expressions have four options: indexed by nothing, indexed by row, indexed by column, or indexed by row and column (the entries). Let’s see some examples.

[3]:
mt.s.describe()
--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7ff81625e580>
Index:
    ['column']
--------------------------------------------------------
[4]:
mt.GT.describe()
--------------------------------------------------------
Type:
        call
--------------------------------------------------------
Source:
    <hail.matrixtable.MatrixTable object at 0x7ff81625e580>
Index:
    ['row', 'column']
--------------------------------------------------------

MatrixTable operations

We belabored the operations on tables because they all have natural analogs (sometimes several) on matrix tables. For example:

  • count => count_{rows, cols} (and count which returns both)

  • filter => filter_{rows, cols, entries}

  • annotate => annotate_{rows, cols, entries} (and globals for both)

  • select => select_{rows, cols, entries} (and globals for both)

  • transmute => transmute_{rows, cols, entries} (and globals for both)

  • group_by => group_{rows, cols}_by

  • explode => explode_{rows, cols}

  • aggregate => aggregate_{rows, cols, entries}

Some operations are unique to MatrixTable:

  • The row fields can be accessed as a Table with rows

  • The column fields can be accessed as a Table with cols.

  • The entire field space of a MatrixTable can be accessed as a coordinate-form Table with entries. Be careful with this! While it’s fast to aggregate or query, trying to write this Table to disk could produce files thousands of times larger than the corresponding MatrixTable.

Let’s explore mt using these tools, starting with the size of the dataset.

[5]:
mt.count() # (rows, cols)
[5]:
(10879, 284)

Let’s look at the first few row keys (variants) and column keys (sample IDs).

[6]:
mt.rows().select().show()
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh37> | array<str> |
+---------------+------------+
| 1:904165      | ["G","A"]  |
| 1:909917      | ["G","A"]  |
| 1:986963      | ["C","T"]  |
| 1:1563691     | ["T","G"]  |
| 1:1707740     | ["T","G"]  |
| 1:2252970     | ["C","T"]  |
| 1:2284195     | ["T","C"]  |
| 1:2779043     | ["T","C"]  |
| 1:2944527     | ["G","A"]  |
| 1:3761547     | ["C","A"]  |
+---------------+------------+

showing top 10 rows

[7]:
mt.s.show()
+-----------+
| s         |
+-----------+
| str       |
+-----------+
| "HG00096" |
| "HG00099" |
| "HG00105" |
| "HG00118" |
| "HG00129" |
| "HG00148" |
| "HG00177" |
| "HG00182" |
| "HG00242" |
| "HG00254" |
+-----------+

showing top 10 rows

Let’s investigate the genotypes and the call rate. Let’s look at the first few genotypes:

[8]:
mt.GT.show()
+---------------+------------+-----------+-----------+-----------+-----------+
|               |            | 'HG00096' | 'HG00099' | 'HG00105' | 'HG00118' |
+---------------+------------+-----------+-----------+-----------+-----------+
| locus         | alleles    | GT        | GT        | GT        | GT        |
+---------------+------------+-----------+-----------+-----------+-----------+
| locus<GRCh37> | array<str> | call      | call      | call      | call      |
+---------------+------------+-----------+-----------+-----------+-----------+
| 1:904165      | ["G","A"]  | 0/0       | 0/0       | 0/0       | 0/0       |
| 1:909917      | ["G","A"]  | 0/0       | 0/0       | 0/0       | 0/0       |
| 1:986963      | ["C","T"]  | 0/0       | 0/0       | 0/0       | 0/0       |
| 1:1563691     | ["T","G"]  | NA        | 0/0       | 0/0       | 0/0       |
| 1:1707740     | ["T","G"]  | 0/1       | 0/1       | 0/1       | 0/0       |
| 1:2252970     | ["C","T"]  | 0/0       | NA        | 0/0       | 0/0       |
| 1:2284195     | ["T","C"]  | 1/1       | 0/1       | 0/1       | 0/1       |
| 1:2779043     | ["T","C"]  | 0/1       | 0/1       | 1/1       | 0/0       |
| 1:2944527     | ["G","A"]  | 0/0       | 0/1       | 0/1       | 0/1       |
| 1:3761547     | ["C","A"]  | 0/0       | 0/0       | 0/0       | 0/0       |
+---------------+------------+-----------+-----------+-----------+-----------+

showing top 10 rows

showing the first 4 of 284 columns

Let’s look at the distribution of genotype calls:

[9]:
mt.aggregate_entries(hl.agg.counter(mt.GT.n_alt_alleles()))
[9]:
{0: 1730651, 1: 741176, 2: 576444, None: 41365}

Let’s compute the overall call rate directly, and then plot the distribution of call rate per variant.

[10]:
mt.aggregate_entries(hl.agg.fraction(hl.is_defined(mt.GT)))
[10]:
0.9866116914743355
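As a cross-check, the same number falls out of the genotype counter above: the call rate is the number of defined entries divided by the total number of entries, and the total is exactly the 10,879 rows × 284 columns reported by mt.count(). A quick plain-Python verification:

```python
# Genotype counts from aggregate_entries(hl.agg.counter(...)) above.
counts = {0: 1730651, 1: 741176, 2: 576444, None: 41365}

defined = sum(v for k, v in counts.items() if k is not None)
total = sum(counts.values())

assert total == 10879 * 284  # rows * cols from mt.count()
print(defined / total)  # ~0.986612, matching aggregate_entries above
```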

Here’s a nice trick: you can use an aggregator inside annotate_rows and it will aggregate over columns, that is, summarize the values in the row using the aggregator. Let’s compute and plot call rate per variant.

[11]:
mt2 = mt.annotate_rows(call_rate = hl.agg.fraction(hl.is_defined(mt.GT)))
mt2.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AC: array<int32>,
        AF: array<float64>,
        AN: int32,
        BaseQRankSum: float64,
        ClippingRankSum: float64,
        DP: int32,
        DS: bool,
        FS: float64,
        HaplotypeScore: float64,
        InbreedingCoeff: float64,
        MLEAC: array<int32>,
        MLEAF: array<float64>,
        MQ: float64,
        MQ0: int32,
        MQRankSum: float64,
        QD: float64,
        ReadPosRankSum: float64,
        set: str
    }
    'call_rate': float64
----------------------------------------
Entry fields:
    'GT': call
    'AD': array<int32>
    'DP': int32
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
[12]:
p = hl.plot.histogram(mt2.call_rate, range=(0,1.0), bins=100,
                      title='Variant Call Rate Histogram', legend='Call Rate')
show(p)

Exercise: GQ vs DP

In this exercise, you’ll use Hail to investigate a strange property of sequencing datasets.

The DP field is the sequencing depth (the number of reads).

Let’s first plot a histogram of DP:

[13]:
p = hl.plot.histogram(mt.DP, range=(0,40), bins=40, title='DP Histogram', legend='DP')
show(p)

Now, let’s do the same thing for GQ.

The GQ field is the phred-scaled “genotype quality”: the linear-scale probability that the genotype call is wrong is 10 ** -(mt.GQ / 10), so higher GQ means a more confident call. GQ is truncated to lie between 0 and 99.
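A minimal plain-Python sketch of the phred conversion (the helper name is ours, not part of Hail):

```python
def phred_to_error_prob(gq):
    """Convert a phred-scaled quality to a linear-scale error probability."""
    return 10 ** (-gq / 10)

# GQ 10 -> 1-in-10 chance of error, GQ 20 -> 1-in-100,
# and the cap of GQ 99 -> roughly 1-in-8-billion.
print(phred_to_error_prob(10))  # 0.1
print(phred_to_error_prob(20))  # 0.01
```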

[14]:
p = hl.plot.histogram(mt.GQ, range=(0,100), bins=100, title='GQ Histogram', legend='GQ')
show(p)

Whoa! That’s a strange distribution! There’s a big spike at 100. The rest of the values have roughly the same shape as the DP distribution, but form a Dimetrodon. Use Hail to figure out what’s going on!

[ ]: