Datasets

Warning

All functionality described on this page is experimental. Datasets and methods are subject to change.

This page describes genetic datasets that are hosted in a public repository on Google Cloud Platform and are available for use through Hail’s load_dataset() function.

To load a dataset from this repository into a Hail pipeline, pass the dataset's name, version, and reference genome build to the load_dataset() function as strings. The available dataset names, versions, and reference genome builds are listed in the table below.
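As a minimal sketch of the call described above: the three strings are collected in a dictionary and forwarded to load_dataset(). The hl.experimental location and the keyword names name, version, and reference_genome are assumptions about the installed Hail release, and some releases may expect additional arguments. The wrapper load_example_dataset() is a hypothetical helper used only for illustration.

```python
# The three required values, all given as strings.
# This particular combination is taken from the table on this page.
dataset_args = {
    "name": "1000_genomes",
    "version": "phase3",
    "reference_genome": "GRCh37",
}


def load_example_dataset():
    """Hypothetical helper: load the dataset described by dataset_args.

    Hail is imported inside the function so that the arguments can be
    inspected without Hail being installed. Assumes load_dataset() is
    exposed as hl.experimental.load_dataset with these keyword names.
    """
    import hail as hl

    return hl.experimental.load_dataset(**dataset_args)
```

Calling load_example_dataset() needs a working Hail installation and access to the public repository. The returned object can then be inspected with its describe() method.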

Name                            Versions    Reference Genomes
1000_genomes                    phase3      GRCh37, GRCh38
Ensembl_CDS_regions             release_93  GRCh37, GRCh38
Ensembl_cDNA_regions            release_93  GRCh37, GRCh38
Ensembl_human_reference_genome  release_93  GRCh37, GRCh38
Ensembl_low_complexity_regions  release_93  GRCh37, GRCh38
Ensembl_ncRNA_regions           release_93  GRCh37, GRCh38
Ensembl_peptide_sequences       release_93  GRCh37, GRCh38
GERP_elements                   GERP++      GRCh37, GRCh38
GERP_scores                     GERP++      GRCh37, GRCh38
GTEx_eQTL_associations          v7          GRCh37
GTEx_exons                      v7          GRCh37, GRCh38
GTEx_genes                      v7          GRCh37, GRCh38
GTEx_transcripts                v7          GRCh37, GRCh38
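
Not every dataset is available for both reference genome builds (GTEx_eQTL_associations is listed for GRCh37 only), so it can help to check a combination against the table above before requesting it. The following is a sketch, not part of Hail: AVAILABLE, check_dataset_args(), and load_checked() are hypothetical names, and the hl.experimental.load_dataset call with these keyword names is an assumption about the installed Hail release.

```python
# (name, version) -> reference genome builds, copied from the table above.
BOTH_BUILDS = ("GRCh37", "GRCh38")
AVAILABLE = {
    ("1000_genomes", "phase3"): BOTH_BUILDS,
    ("Ensembl_CDS_regions", "release_93"): BOTH_BUILDS,
    ("Ensembl_cDNA_regions", "release_93"): BOTH_BUILDS,
    ("Ensembl_human_reference_genome", "release_93"): BOTH_BUILDS,
    ("Ensembl_low_complexity_regions", "release_93"): BOTH_BUILDS,
    ("Ensembl_ncRNA_regions", "release_93"): BOTH_BUILDS,
    ("Ensembl_peptide_sequences", "release_93"): BOTH_BUILDS,
    ("GERP_elements", "GERP++"): BOTH_BUILDS,
    ("GERP_scores", "GERP++"): BOTH_BUILDS,
    ("GTEx_eQTL_associations", "v7"): ("GRCh37",),
    ("GTEx_exons", "v7"): BOTH_BUILDS,
    ("GTEx_genes", "v7"): BOTH_BUILDS,
    ("GTEx_transcripts", "v7"): BOTH_BUILDS,
}


def check_dataset_args(name, version, reference_genome):
    """Raise ValueError unless the combination appears in the table."""
    builds = AVAILABLE.get((name, version))
    if builds is None:
        raise ValueError(f"unknown dataset/version: {name!r} {version!r}")
    if reference_genome not in builds:
        raise ValueError(
            f"{name} {version} is listed only for: {', '.join(builds)}"
        )


def load_checked(name, version, reference_genome):
    """Hypothetical wrapper: validate against the table, then load."""
    check_dataset_args(name, version, reference_genome)
    import hail as hl  # imported lazily; requires Hail to be installed

    # Assumed location and keyword names of load_dataset().
    return hl.experimental.load_dataset(
        name=name, version=version, reference_genome=reference_genome
    )
```

With this sketch, load_checked("GTEx_eQTL_associations", "v7", "GRCh38") raises a ValueError before any request is made. Because the datasets are subject to change, a hard-coded copy of the table like this one can go stale.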