ReferenceGenome

class hail.genetics.ReferenceGenome(name, contigs, lengths, x_contigs=[], y_contigs=[], mt_contigs=[], par=[], _builtin=False)

An object that represents a reference genome.

Examples

>>> contigs = ["1", "X", "Y", "MT"]
>>> lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
>>> par = [("X", 60001, 2699521)]
>>> my_ref = hl.ReferenceGenome("my_ref", contigs, lengths, "X", "Y", "MT", par)

Notes

Hail comes with the following predefined reference genomes (names are case sensitive):

  • GRCh37

  • GRCh38

  • GRCm38

You can access these reference genome objects using get_reference():

>>> rg = hl.get_reference('GRCh37')

Constructing a new reference genome, either with the class constructor or with ReferenceGenome.read(), adds it to the list of known references. It can then be retrieved with get_reference() at any time.

Note

Reference genome names must be unique. It is not possible to overwrite the built-in reference genomes.

Parameters
  • name (str) – Name of reference. Must be unique and NOT one of Hail’s predefined references: 'GRCh37', 'GRCh38', 'GRCm38', and 'default'.

  • contigs (list of str) – Contig names.

  • lengths (dict of str to int) – Dict of contig names to contig lengths.

  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.

  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.

  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.

  • par (list of tuple of (str, int, int)) – Pseudoautosomal regions, given as a list of (contig, start, end) tuples.

Attributes

contigs

Contig names.

lengths

Dict of contig name to contig length.

mt_contigs

Mitochondrial contigs.

name

Name of reference genome.

par

Pseudoautosomal regions.

x_contigs

X contigs.

y_contigs

Y contigs.

Methods

__init__

Initialize self.

add_liftover

Register a chain file for liftover.

add_sequence

Load the reference sequence from a FASTA file.

contig_length

Contig length.

from_fasta_file

Create reference genome from a FASTA file.

has_liftover

True if a liftover chain file is available from this reference genome to the destination reference.

has_sequence

True if the reference sequence has been loaded.

read

Load reference genome from a JSON file.

remove_liftover

Remove liftover to dest_reference_genome.

remove_sequence

Remove the reference sequence.

write

Write this reference genome to a file in JSON format.

add_liftover(chain_file, dest_reference_genome)

Register a chain file for liftover.

Examples

Access GRCh37 and GRCh38 using get_reference():

>>> rg37 = hl.get_reference('GRCh37') # doctest: +SKIP
>>> rg38 = hl.get_reference('GRCh38') # doctest: +SKIP

Add a chain file from 37 to 38:

>>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38) # doctest: +SKIP

Notes

This method can only be run once per reference genome. Use has_liftover() to test whether a chain file has been registered.

The chain file format is described here.

Chain files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37 to GRCh38 gs://hail-common/references/grch37_to_grch38.over.chain.gz

GRCh38 to GRCh37 gs://hail-common/references/grch38_to_grch37.over.chain.gz

Public download links are available here.

Parameters
  • chain_file (str) – Path to chain file. Can be compressed (GZIP) or uncompressed.

  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
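Because add_liftover() can only be run once per reference genome, a call is often guarded with has_liftover(). The helper below is a minimal sketch of that pattern. The name ensure_liftover is hypothetical and not part of Hail; it relies only on the two documented methods, so it accepts a ReferenceGenome or any object that exposes them.

```python
def ensure_liftover(rg, chain_file, dest_reference_genome):
    """Register chain_file on rg unless a liftover to the destination exists.

    rg is expected to be a ReferenceGenome (or any object providing the
    documented has_liftover() and add_liftover() methods).
    Returns True if this call registered the chain file, False otherwise.
    """
    if rg.has_liftover(dest_reference_genome):
        # A chain file is already registered; do not register another one.
        return False
    rg.add_liftover(chain_file, dest_reference_genome)
    return True
```

With Hail initialized, this could be called as ensure_liftover(hl.get_reference('GRCh37'), 'gs://hail-common/references/grch37_to_grch38.over.chain.gz', 'GRCh38').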

add_sequence(fasta_file, index_file=None)

Load the reference sequence from a FASTA file.

Examples

Access the GRCh37 reference genome using get_reference():

>>> rg = hl.get_reference('GRCh37') # doctest: +SKIP

Add a sequence file:

>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz',
...                 'gs://hail-common/references/human_g1k_v37.fasta.fai') # doctest: +SKIP

Add a sequence file with the default index location:

>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz') # doctest: +SKIP

Notes

This method can only be run once per reference genome. Use has_sequence() to test whether a sequence is loaded.

FASTA and index files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37

  • FASTA file: gs://hail-common/references/human_g1k_v37.fasta.gz

  • Index file: gs://hail-common/references/human_g1k_v37.fasta.fai

GRCh38

  • FASTA file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.gz

  • Index file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.fai

Public download links are available here.

Parameters
  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.

  • index_file (None or str) – Path to FASTA index file. Must be uncompressed. If None, the index path is derived by replacing the extension of fasta_file with fai.
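To make the default index location concrete, the sketch below derives an index path from a FASTA path. It assumes that "extension" means only the final dot-suffix, which is consistent with the example above pairing human_g1k_v37.fasta.gz with human_g1k_v37.fasta.fai; Hail's actual rule may differ in edge cases. The function name default_index_path is hypothetical.

```python
import posixpath

def default_index_path(fasta_file):
    """Guess the index path by swapping the final extension for '.fai'.

    Assumption: only the last dot-suffix is replaced, so
    'ref.fasta.gz' becomes 'ref.fasta.fai' and 'ref.fasta' becomes 'ref.fai'.
    """
    root, _extension = posixpath.splitext(fasta_file)
    return root + '.fai'
```

If the index file lives anywhere else, pass index_file explicitly rather than relying on the derived path.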

contig_length(contig)

Contig length.

Parameters

contig (str) – Contig name.

Returns

int – Length of contig.

contigs

Contig names.

Returns

list of str

classmethod from_fasta_file(name, fasta_file, index_file, x_contigs=[], y_contigs=[], mt_contigs=[], par=[])

Create reference genome from a FASTA file.

Parameters
  • name (str) – Name for new reference genome.

  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.

  • index_file (str) – Path to FASTA index file. Must be uncompressed.

  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.

  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.

  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.

  • par (list of tuple of (str, int, int)) – Pseudoautosomal regions, given as a list of (contig, start, end) tuples.

Returns

ReferenceGenome
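A FASTA index (.fai, as produced by samtools faidx) holds one tab-separated line per contig, whose first two fields are the contig name and its length. The sketch below shows how values matching the contigs and lengths constructor parameters could be read from such text. It is an illustration only, not Hail's implementation, and parse_fai is a hypothetical name.

```python
def parse_fai(fai_text):
    """Return (contigs, lengths) parsed from the text of a .fai index.

    contigs is a list of contig names in file order; lengths maps each
    contig name to its length, mirroring the constructor parameters.
    """
    contigs = []
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split('\t')
        name, length = fields[0], int(fields[1])
        contigs.append(name)
        lengths[name] = length
    return contigs, lengths
```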

has_liftover(dest_reference_genome)

True if a liftover chain file is available from this reference genome to the destination reference.

Parameters

dest_reference_genome (str or ReferenceGenome)

Returns

bool

has_sequence()

True if the reference sequence has been loaded.

Returns

bool

lengths

Dict of contig name to contig length.

Returns

dict of str to int

mt_contigs

Mitochondrial contigs.

Returns

list of str

name

Name of reference genome.

Returns

str

par

Pseudoautosomal regions.

Returns

list of Interval

classmethod read(path)

Load reference genome from a JSON file.

Notes

The JSON file must have the following format:

{"name": "my_reference_genome",
 "contigs": [{"name": "1", "length": 10000000},
             {"name": "2", "length": 20000000},
             {"name": "X", "length": 19856300},
             {"name": "Y", "length": 78140000},
             {"name": "MT", "length": 532}],
 "xContigs": ["X"],
 "yContigs": ["Y"],
 "mtContigs": ["MT"],
 "par": [{"start": {"contig": "X","position": 60001},"end": {"contig": "X","position": 2699521}},
         {"start": {"contig": "Y","position": 10001},"end": {"contig": "Y","position": 2649521}}]
}

name must be unique and not overlap with Hail’s pre-instantiated references: 'GRCh37', 'GRCh38', 'GRCm38', and 'default'. The contig names in xContigs, yContigs, and mtContigs must be present in contigs. The intervals listed in par must have contigs in either xContigs or yContigs and must have positions between 0 and the contig length given in contigs.
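The rules above can be checked before calling read(). The following is a minimal sketch that inspects an already parsed JSON dict; validate_reference_json and RESERVED_NAMES are hypothetical names, not part of Hail. It cannot check uniqueness against references registered in a running session, only against the four predefined names listed above.

```python
RESERVED_NAMES = ('GRCh37', 'GRCh38', 'GRCm38', 'default')

def validate_reference_json(ref):
    """Return a list of problems found in a parsed reference-genome JSON dict."""
    problems = []
    if ref['name'] in RESERVED_NAMES:
        problems.append(f"name {ref['name']!r} is reserved")

    # Map each declared contig to its length.
    lengths = {c['name']: c['length'] for c in ref['contigs']}

    # Every X, Y, and mitochondrial contig must be declared in contigs.
    for key in ('xContigs', 'yContigs', 'mtContigs'):
        for contig in ref.get(key, []):
            if contig not in lengths:
                problems.append(f"{key}: {contig!r} is not in contigs")

    # PAR endpoints must lie on an X or Y contig and within its length.
    sex_contigs = set(ref.get('xContigs', [])) | set(ref.get('yContigs', []))
    for interval in ref.get('par', []):
        for side in ('start', 'end'):
            contig = interval[side]['contig']
            position = interval[side]['position']
            if contig not in sex_contigs:
                problems.append(f"par {side}: {contig!r} is not an X or Y contig")
            elif contig in lengths and not 0 <= position <= lengths[contig]:
                problems.append(f"par {side}: position {position} is outside {contig!r}")
    return problems
```

An empty list means none of the documented rules were violated; the dict can be obtained with json.load() from the file that will be passed to read().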

Parameters

path (str) – Path to JSON file.

Returns

ReferenceGenome

remove_liftover(dest_reference_genome)

Remove liftover to dest_reference_genome.

Parameters

dest_reference_genome (str or ReferenceGenome)

remove_sequence()

Remove the reference sequence.

Returns

bool

write(output)

Write this reference genome to a file in JSON format.

Examples

>>> my_rg = hl.ReferenceGenome("new_reference", ["x", "y", "z"], {"x": 500, "y": 300, "z": 200})
>>> my_rg.write("output/new_reference.json")

Notes

Use read() to load the exported reference genome in a new HailContext session.

Parameters

output (str) – Path of JSON file to write.
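A file in the layout documented under read() can also be assembled by hand from constructor-style arguments, for example to prepare a reference description without starting Hail. The sketch below assumes that documented layout; whether write() emits exactly the same field order or whitespace is not stated here. The name to_reference_json is hypothetical.

```python
import json

def to_reference_json(name, contigs, lengths,
                      x_contigs=(), y_contigs=(), mt_contigs=(), par=()):
    """Serialize constructor-style arguments to the JSON layout used by read().

    par is an iterable of (contig, start, end) tuples, as in the constructor.
    """
    ref = {
        "name": name,
        "contigs": [{"name": c, "length": lengths[c]} for c in contigs],
        "xContigs": list(x_contigs),
        "yContigs": list(y_contigs),
        "mtContigs": list(mt_contigs),
        # Each (contig, start, end) tuple becomes a start/end pair of loci.
        "par": [
            {"start": {"contig": contig, "position": start},
             "end": {"contig": contig, "position": end}}
            for contig, start, end in par
        ],
    }
    return json.dumps(ref, indent=1)
```

The returned string can be written to a file and that path passed to read().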

x_contigs

X contigs.

Returns

list of str

y_contigs

Y contigs.

Returns

list of str