ReferenceGenome

class hail.genetics.ReferenceGenome[source]

An object that represents a reference genome.

Examples

>>> contigs = ["1", "X", "Y", "MT"]
>>> lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
>>> par = [("X", 60001, 2699521)]
>>> my_ref = hl.ReferenceGenome("my_ref", contigs, lengths, "X", "Y", "MT", par)

Notes

Hail comes with predefined reference genomes (case sensitive!):

  • GRCh37, Genome Reference Consortium Human Build 37

  • GRCh38, Genome Reference Consortium Human Build 38

  • GRCm38, Genome Reference Consortium Mouse Build 38

  • CanFam3, Canis lupus familiaris (dog)

You can access these reference genome objects using get_reference():

>>> rg = hl.get_reference('GRCh37')
>>> rg = hl.get_reference('GRCh38')
>>> rg = hl.get_reference('GRCm38')
>>> rg = hl.get_reference('CanFam3')

Note that constructing a new reference genome, either with the class constructor or with read(), adds it to the list of known references. It can then be accessed with get_reference() at any time.

Note

Reference genome names must be unique. It is not possible to overwrite the built-in reference genomes.

Note

Hail allows setting a default reference so that the reference_genome argument of import_vcf() does not need to be passed on every call. Currently, a custom reference genome cannot be used as the default_reference argument of init(). To set a custom reference genome as the default, pass it to default_reference() after initializing Hail.

Parameters:
  • name (str) – Name of reference. Must be unique and NOT one of Hail’s predefined references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3' and 'default'.

  • contigs (list of str) – Contig names.

  • lengths (dict of str to int) – Dict of contig names to contig lengths.

  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.

  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.

  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.

  • par (list of tuple of (str, int, int)) – List of (contig, start, end) tuples describing the pseudoautosomal regions.

Attributes

contigs

Contig names.

global_positions_dict

Get a dictionary mapping contig names to their global genomic positions.

lengths

Dict of contig name to contig length.

mt_contigs

Mitochondrial contigs.

name

Name of reference genome.

par

Pseudoautosomal regions.

x_contigs

X contigs.

y_contigs

Y contigs.

Methods

add_liftover

Register a chain file for liftover.

add_sequence

Load the reference sequence from a FASTA file.

contig_length

Contig length.

from_fasta_file

Create reference genome from a FASTA file.

has_liftover

True if a liftover chain file is available from this reference genome to the destination reference.

has_sequence

True if the reference sequence has been loaded.

locus_from_global_position

Constructs a locus from a global position in reference genome.

read

Load reference genome from a JSON file.

remove_liftover

Remove liftover to dest_reference_genome.

remove_sequence

Remove the reference sequence.

write

Write this reference genome to a file in JSON format.

add_liftover(chain_file, dest_reference_genome)[source]

Register a chain file for liftover.

Examples

Access GRCh37 and GRCh38 using get_reference():

>>> rg37 = hl.get_reference('GRCh37') 
>>> rg38 = hl.get_reference('GRCh38') 

Add a chain file from 37 to 38:

>>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38) 

Notes

This method can only be run once per reference genome. Use has_liftover() to test whether a chain file has been registered.
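Because add_liftover() can only be run once, a guard around it is one way to keep repeated notebook cells or re-run scripts from failing. The following is a minimal sketch under stated assumptions: ensure_liftover and FakeReference are hypothetical names introduced here and are not part of Hail. FakeReference only imitates the has_liftover() and add_liftover() methods documented in this section so the example runs without a Hail installation.

```python
def ensure_liftover(rg, chain_file, dest_reference_genome):
    """Register chain_file only if no liftover to the destination exists yet.

    Hypothetical helper, not part of Hail. Returns True if the chain file
    was newly registered and False if one was already present.
    """
    if rg.has_liftover(dest_reference_genome):
        return False
    rg.add_liftover(chain_file, dest_reference_genome)
    return True


class FakeReference:
    """Hypothetical in-memory stand-in for a reference genome (not Hail)."""

    def __init__(self):
        self._chains = {}

    def has_liftover(self, dest_reference_genome):
        return dest_reference_genome in self._chains

    def add_liftover(self, chain_file, dest_reference_genome):
        # Assumed behavior for this sketch: a second registration is an error.
        if dest_reference_genome in self._chains:
            raise RuntimeError("liftover already registered")
        self._chains[dest_reference_genome] = chain_file


rg37 = FakeReference()
chain = "gs://hail-common/references/grch37_to_grch38.over.chain.gz"
first = ensure_liftover(rg37, chain, "GRCh38")
second = ensure_liftover(rg37, chain, "GRCh38")
print(first, second)  # True False
```

With a real ReferenceGenome, the same helper would be called with the objects returned by get_reference().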

The chain file format is described here.

Chain files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37 to GRCh38 gs://hail-common/references/grch37_to_grch38.over.chain.gz

GRCh38 to GRCh37 gs://hail-common/references/grch38_to_grch37.over.chain.gz

Public download links are available here.

Parameters:
  • chain_file (str) – Path to chain file. Can be compressed (GZIP) or uncompressed.

  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.

add_sequence(fasta_file, index_file=None)[source]

Load the reference sequence from a FASTA file.

Examples

Access the GRCh37 reference genome using get_reference():

>>> rg = hl.get_reference('GRCh37') 

Add a sequence file:

>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz',
...                 'gs://hail-common/references/human_g1k_v37.fasta.fai') 

Add a sequence file with the default index location:

>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz') 

Notes

This method can only be run once per reference genome. Use has_sequence() to test whether a sequence is loaded.

FASTA and index files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37

  • FASTA file: gs://hail-common/references/human_g1k_v37.fasta.gz

  • Index file: gs://hail-common/references/human_g1k_v37.fasta.fai

GRCh38

  • FASTA file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.gz

  • Index file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.fai

Public download links are available here.

Parameters:
  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.

  • index_file (None or str) – Path to FASTA index file. Must be uncompressed. If None, replace the fasta_file’s extension with fai.
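The default index location can be illustrated with a small sketch. It assumes that "extension" means only the final suffix of the path, which is consistent with the hosted file names listed above (a .fasta.gz file paired with a .fasta.fai index). The helper name default_index_path is hypothetical, and Hail's actual implementation may differ in edge cases such as paths without any extension.

```python
import posixpath


def default_index_path(fasta_file):
    """Hypothetical helper: swap the final extension of fasta_file for .fai.

    Assumption: only the last suffix is replaced, so 'ref.fasta.gz'
    becomes 'ref.fasta.fai'. A path with no extension just gains '.fai'.
    """
    root, _extension = posixpath.splitext(fasta_file)
    return root + ".fai"


index = default_index_path("gs://hail-common/references/human_g1k_v37.fasta.gz")
print(index)  # gs://hail-common/references/human_g1k_v37.fasta.fai
```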

contig_length(contig)[source]

Contig length.

Parameters:

contig (str) – Contig name.

Returns:

int – Length of contig.

property contigs

Contig names.

Returns:

list of str

classmethod from_fasta_file(name, fasta_file, index_file, x_contigs=[], y_contigs=[], mt_contigs=[], par=[])[source]

Create reference genome from a FASTA file.

Parameters:
  • name (str) – Name for new reference genome.

  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.

  • index_file (str) – Path to FASTA index file. Must be uncompressed.

  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.

  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.

  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.

  • par (list of tuple of (str, int, int)) – List of (contig, start, end) tuples describing the pseudoautosomal regions.

Returns:

ReferenceGenome

property global_positions_dict

Get a dictionary mapping contig names to their global genomic positions.

Returns:

dict – A dictionary of contig names to global genomic positions.
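To make "global genomic position" concrete, the sketch below computes such a mapping in plain Python for a toy reference. It assumes that a contig's global position is the zero-based offset of its first base, that is, the sum of the lengths of all contigs that precede it in contig order. The function name global_positions is hypothetical and the code does not call Hail.

```python
from itertools import accumulate


def global_positions(contigs, lengths):
    """Map each contig to the zero-based global offset of its first base.

    Assumption of this sketch: offsets are running sums of the preceding
    contig lengths, taken in the order given by `contigs`.
    """
    # accumulate(..., initial=0) yields 0, len(c1), len(c1) + len(c2), ...
    # zip() stops after the last contig, dropping the final total.
    offsets = accumulate((lengths[c] for c in contigs), initial=0)
    return dict(zip(contigs, offsets))


contigs = ["x", "y", "z"]
lengths = {"x": 500, "y": 300, "z": 200}
positions = global_positions(contigs, lengths)
print(positions)  # {'x': 0, 'y': 500, 'z': 800}
```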

has_liftover(dest_reference_genome)[source]

True if a liftover chain file is available from this reference genome to the destination reference.

Parameters:

dest_reference_genome (str or ReferenceGenome)

Returns:

bool

has_sequence()[source]

True if the reference sequence has been loaded.

Returns:

bool

property lengths

Dict of contig name to contig length.

Returns:

dict of str to int

locus_from_global_position(global_pos)[source]

Constructs a locus from a global position in the reference genome. The inverse of Locus.position().

Examples

>>> rg = hl.get_reference('GRCh37')
>>> rg.locus_from_global_position(0)
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> rg.locus_from_global_position(2824183054)
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> rg = hl.get_reference('GRCh38')
>>> rg.locus_from_global_position(2824183054)
Locus(contig=chr22, position=1, reference_genome=GRCh38)

Parameters:

global_pos (int) – Zero-based global base position along the reference genome.

Returns:

Locus
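The lookup can be sketched in plain Python as a binary search over cumulative contig offsets. This is an illustration under assumptions, not Hail's implementation: it treats the global position as a zero-based offset into the contigs laid end to end and returns a one-based position within the contig, matching the first example above where global position 0 maps to position 1. The standalone function here is hypothetical and works on a toy reference.

```python
from bisect import bisect_right
from itertools import accumulate


def locus_from_global_position(contigs, lengths, global_pos):
    """Return (contig, one_based_position) for a zero-based global position.

    Hypothetical standalone sketch; it does not call Hail.
    """
    starts = list(accumulate((lengths[c] for c in contigs), initial=0))
    total = starts.pop()  # the last running sum is the total genome length
    if not 0 <= global_pos < total:
        raise ValueError(f"global_pos must be in [0, {total})")
    # Index of the last contig whose start offset is <= global_pos.
    i = bisect_right(starts, global_pos) - 1
    return contigs[i], global_pos - starts[i] + 1


contigs = ["x", "y", "z"]
lengths = {"x": 500, "y": 300, "z": 200}
print(locus_from_global_position(contigs, lengths, 0))    # ('x', 1)
print(locus_from_global_position(contigs, lengths, 500))  # ('y', 1)
print(locus_from_global_position(contigs, lengths, 999))  # ('z', 200)
```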

property mt_contigs

Mitochondrial contigs.

Returns:

list of str

property name

Name of reference genome.

Returns:

str

property par

Pseudoautosomal regions.

Returns:

list of Interval

classmethod read(path)[source]

Load reference genome from a JSON file.

Notes

The JSON file must have the following format:

{"name": "my_reference_genome",
 "contigs": [{"name": "1", "length": 10000000},
             {"name": "2", "length": 20000000},
             {"name": "X", "length": 19856300},
             {"name": "Y", "length": 78140000},
             {"name": "MT", "length": 532}],
 "xContigs": ["X"],
 "yContigs": ["Y"],
 "mtContigs": ["MT"],
 "par": [{"start": {"contig": "X","position": 60001},"end": {"contig": "X","position": 2699521}},
         {"start": {"contig": "Y","position": 10001},"end": {"contig": "Y","position": 2649521}}]
}

name must be unique and not overlap with Hail’s pre-instantiated references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3', and 'default'. The contig names in xContigs, yContigs, and mtContigs must be present in contigs. The intervals listed in par must have contigs in either xContigs or yContigs and must have positions between 0 and the contig length given in contigs.
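The constraints above can be checked before a file is ever handed to read(). The sketch below builds a document in the format shown from constructor-style arguments, using only the standard library. The helper name reference_json is hypothetical, and the check that each start is not greater than its end is an extra assumption of this sketch rather than something the text above states.

```python
import json

# Names the text above lists as reserved for Hail's own references.
RESERVED_NAMES = {"GRCh37", "GRCh38", "GRCm38", "CanFam3", "default"}


def reference_json(name, contigs, lengths, x_contigs=(), y_contigs=(),
                   mt_contigs=(), par=()):
    """Hypothetical helper (not a Hail function): validate, then serialize."""
    if name in RESERVED_NAMES:
        raise ValueError(f"{name!r} is reserved for a built-in reference")
    known = set(contigs)
    for key, group in (("xContigs", x_contigs), ("yContigs", y_contigs),
                       ("mtContigs", mt_contigs)):
        unknown = [c for c in group if c not in known]
        if unknown:
            raise ValueError(f"{key} not present in contigs: {unknown}")
    sex_contigs = set(x_contigs) | set(y_contigs)
    for contig, start, end in par:
        if contig not in sex_contigs:
            raise ValueError(f"par contig {contig!r} is not an X or Y contig")
        # start <= end is an added assumption of this sketch.
        if not 0 <= start <= end <= lengths[contig]:
            raise ValueError(f"par interval out of range on {contig!r}")
    document = {
        "name": name,
        "contigs": [{"name": c, "length": lengths[c]} for c in contigs],
        "xContigs": list(x_contigs),
        "yContigs": list(y_contigs),
        "mtContigs": list(mt_contigs),
        "par": [{"start": {"contig": c, "position": s},
                 "end": {"contig": c, "position": e}} for c, s, e in par],
    }
    return json.dumps(document, indent=1)


text = reference_json(
    "my_reference_genome",
    ["1", "X", "Y", "MT"],
    {"1": 10000000, "X": 19856300, "Y": 78140000, "MT": 532},
    x_contigs=["X"], y_contigs=["Y"], mt_contigs=["MT"],
    par=[("X", 60001, 2699521), ("Y", 10001, 2649521)],
)
print(text)
```

Writing the returned string to a file gives a document that follows the layout shown above.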

Parameters:

path (str) – Path to JSON file.

Returns:

ReferenceGenome

remove_liftover(dest_reference_genome)[source]

Remove liftover to dest_reference_genome.

Parameters:

dest_reference_genome (str or ReferenceGenome)

remove_sequence()[source]

Remove the reference sequence.

write(output)[source]

Write this reference genome to a file in JSON format.

Examples

>>> my_rg = hl.ReferenceGenome("new_reference", ["x", "y", "z"], {"x": 500, "y": 300, "z": 200})
>>> my_rg.write("output/new_reference.json")

Notes

Use read() to reimport the exported reference genome in a new HailContext session.

Parameters:

output (str) – Path of JSON file to write.

property x_contigs

X contigs.

Returns:

list of str

property y_contigs

Y contigs.

Returns:

list of str