Datasets

Warning

All functionality described on this page is experimental. Datasets and methods are subject to change.

This page describes genetic datasets that are hosted in a public repository on Google Cloud Platform and are available for use through Hail’s load_dataset() function.

To load a dataset from this repository into a Hail pipeline, provide the name, version, and reference genome build of the dataset you would like to use as strings to the load_dataset() function. The available dataset names, versions, and reference genome builds are listed in the table below.

Name                              Versions      Reference Genomes
1000_genomes                      phase3        GRCh37, GRCh38
Ensembl_CDS_regions               release_93    GRCh37, GRCh38
Ensembl_cDNA_regions              release_93    GRCh37, GRCh38
Ensembl_human_reference_genome    release_93    GRCh37, GRCh38
Ensembl_low_complexity_regions    release_93    GRCh37, GRCh38
Ensembl_ncRNA_regions             release_93    GRCh37, GRCh38
Ensembl_peptide_sequences         release_93    GRCh37, GRCh38
GERP_elements                     GERP++        GRCh37, GRCh38
GERP_scores                       GERP++        GRCh37, GRCh38
GTEx_eQTL_associations            v7            GRCh37
GTEx_exons                        v7            GRCh37, GRCh38
GTEx_genes                        v7            GRCh37, GRCh38
GTEx_transcripts                  v7            GRCh37, GRCh38
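
The call described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes that load_dataset() is exposed as hl.experimental.load_dataset and accepts the keyword arguments name, version, and reference_genome, which may differ between Hail releases. The AVAILABLE mapping and the check_dataset() and load() helpers are hypothetical names introduced only for this example; they copy the table above so that a mistyped combination fails before Hail is contacted.

```python
# Hypothetical lookup table copied from the dataset table above:
# name -> (version, supported reference genome builds).
BOTH = ("GRCh37", "GRCh38")
AVAILABLE = {
    "1000_genomes": ("phase3", BOTH),
    "Ensembl_CDS_regions": ("release_93", BOTH),
    "Ensembl_cDNA_regions": ("release_93", BOTH),
    "Ensembl_human_reference_genome": ("release_93", BOTH),
    "Ensembl_low_complexity_regions": ("release_93", BOTH),
    "Ensembl_ncRNA_regions": ("release_93", BOTH),
    "Ensembl_peptide_sequences": ("release_93", BOTH),
    "GERP_elements": ("GERP++", BOTH),
    "GERP_scores": ("GERP++", BOTH),
    "GTEx_eQTL_associations": ("v7", ("GRCh37",)),
    "GTEx_exons": ("v7", BOTH),
    "GTEx_genes": ("v7", BOTH),
    "GTEx_transcripts": ("v7", BOTH),
}


def check_dataset(name, version, reference_genome):
    """Raise ValueError unless the combination appears in the table above."""
    if name not in AVAILABLE:
        raise ValueError(f"unknown dataset name: {name!r}")
    known_version, builds = AVAILABLE[name]
    if version != known_version:
        raise ValueError(f"{name}: version {version!r} is not listed")
    if reference_genome not in builds:
        raise ValueError(f"{name}: build {reference_genome!r} is not listed")


def load(name, version, reference_genome):
    """Validate the arguments, then hand them to Hail as strings."""
    check_dataset(name, version, reference_genome)
    import hail as hl  # imported lazily so validation works without Hail

    # Assumption: the function lives in hail.experimental and takes
    # these three keyword arguments.
    return hl.experimental.load_dataset(
        name=name, version=version, reference_genome=reference_genome
    )
```

With Hail installed and the public repository reachable, load("1000_genomes", "phase3", "GRCh38") would forward those three strings to load_dataset(), while load("GTEx_eQTL_associations", "v7", "GRCh38") would raise ValueError because the table lists that dataset for GRCh37 only.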