Datasets

Warning

All functionality described on this page is experimental. Datasets and methods are subject to change.

This page describes genetic datasets that are hosted in a public repository on Google Cloud Platform and are available for use through Hail’s load_dataset() function.

To load a dataset from this repository into a Hail pipeline, pass the dataset's name, version, and reference genome build as strings to the load_dataset() function. You must also specify the region (‘us’ or ‘eu’) so that the data is read from the appropriate bucket. The available dataset names, versions, and reference genome builds are listed in the table below.
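As a minimal sketch of the call described above, the snippet below gathers the four arguments and forwards them to hl.experimental.load_dataset(). The helper names dataset_request and load are hypothetical and exist only for illustration. The keyword names (name, version, reference_genome, region) are assumed to match the installed Hail release, and the actual load needs Hail installed with access to the bucket, so it is left commented out.

```python
VALID_REGIONS = ("us", "eu")  # the two bucket regions named in the text above


def dataset_request(name, version, reference_genome, region="us"):
    """Collect the load_dataset() arguments, rejecting unknown regions.

    Hypothetical helper for illustration; not part of Hail.
    """
    if region not in VALID_REGIONS:
        raise ValueError(f"region must be one of {VALID_REGIONS}, got {region!r}")
    return {
        "name": name,
        "version": version,
        "reference_genome": reference_genome,
        "region": region,
    }


def load(request):
    """Forward a request built above to Hail (hypothetical wrapper)."""
    import hail as hl  # imported lazily so dataset_request works without Hail

    # Assumes this Hail release accepts these four keyword arguments.
    return hl.experimental.load_dataset(**request)


request = dataset_request("1000_Genomes_autosomes", "phase_3", "GRCh38", region="us")
# dataset = load(request)  # uncomment where Hail and bucket access are available
```

Checking the region before calling Hail is a design choice of this sketch; Hail may perform its own validation.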

Name                                         Versions    Reference Genomes
-------------------------------------------  ----------  -----------------
1000_Genomes_autosomes                       phase_3     GRCh37, GRCh38
1000_Genomes_chrMT                           phase_3     GRCh37
1000_Genomes_chrX                            phase_3     GRCh37, GRCh38
1000_Genomes_chrY                            phase_3     GRCh37, GRCh38
CADD                                         1.4         GRCh37, GRCh38
DANN                                         None        GRCh37, GRCh38
Ensembl_homo_sapiens_low_complexity_regions  release_95  GRCh37, GRCh38
Ensembl_homo_sapiens_reference_genome        release_95  GRCh37, GRCh38
GTEx_RNA_seq_gene_read_counts                v7          GRCh37
GTEx_RNA_seq_gene_TPMs                       v7          GRCh37
GTEx_RNA_seq_junction_read_counts            v7          GRCh37
UK_Biobank_Rapid_GWAS_both_sexes             v2          GRCh37
UK_Biobank_Rapid_GWAS_female                 v2          GRCh37
UK_Biobank_Rapid_GWAS_male                   v2          GRCh37
clinvar_gene_summary                         2019-07     None
clinvar_variant_summary                      2019-07     GRCh37, GRCh38
dbNSFP_genes                                 4.0         None
dbNSFP_variants                              4.0         GRCh37, GRCh38
gencode                                      v19, v31    GRCh37, GRCh38
gerp_elements                                hg19        GRCh37, GRCh38
gerp_scores                                  hg19        GRCh37, GRCh38
gnomad_exome_sites                           2.1.1       GRCh37, GRCh38
gnomad_genome_sites                          2.1.1       GRCh37, GRCh38
gnomad_lof_metrics                           2.1.1       GRCh37, GRCh38
ldsc_baselineLD_annotations                  2.2         GRCh37
ldsc_baselineLD_ldscores                     2.2         GRCh37
ldsc_baseline_ldscores                       1.1         GRCh37
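The table can also be used to catch typos before a load is attempted. The sketch below copies a handful of rows into a dictionary and checks a requested combination against them. CATALOG and is_listed are hypothetical names, not part of Hail, and only a few rows are included for brevity. Because the table lists versions and reference genomes in separate columns, passing this check does not guarantee that a particular version exists for a particular build.

```python
# A few rows copied from the table above: name -> (versions, reference genomes).
# None mirrors a "None" entry in the table (no version or no reference genome).
CATALOG = {
    "1000_Genomes_chrMT": (("phase_3",), ("GRCh37",)),
    "CADD": (("1.4",), ("GRCh37", "GRCh38")),
    "DANN": ((None,), ("GRCh37", "GRCh38")),
    "clinvar_gene_summary": (("2019-07",), (None,)),
    "gencode": (("v19", "v31"), ("GRCh37", "GRCh38")),
}


def is_listed(name, version, reference_genome):
    """Return True if the name, version, and build each appear in the row."""
    if name not in CATALOG:
        return False
    versions, genomes = CATALOG[name]
    return version in versions and reference_genome in genomes


# CADD is listed for both builds; the chrMT dataset only for GRCh37.
cadd_ok = is_listed("CADD", "1.4", "GRCh38")
chrmt_ok = is_listed("1000_Genomes_chrMT", "phase_3", "GRCh38")
```

Here cadd_ok is True and chrmt_ok is False, matching the corresponding rows of the table.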