ReferenceGenome

class hail.genetics.ReferenceGenome(name, contigs, lengths, x_contigs=[], y_contigs=[], mt_contigs=[], par=[])[source]

An object that represents a reference genome.

Examples

>>> contigs = ["1", "X", "Y", "MT"]
>>> lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
>>> par = [("X", 60001, 2699521)]
>>> my_ref = hl.ReferenceGenome("my_ref", contigs, lengths, "X", "Y", "MT", par)
Parameters:
  • name (str) – Name of reference. Must be unique and NOT one of Hail’s predefined references: 'GRCh37', 'GRCh38', 'GRCm38', and 'default'.
  • contigs (list of str) – Contig names.
  • lengths (dict of str to int) – Dict of contig names to contig lengths.
  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.
  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.
  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.
  • par (list of tuple of (str, int, int)) – Pseudoautosomal regions, given as a list of (contig, start, end) tuples.

Attributes

contigs Contig names.
lengths Dict of contig name to contig length.
mt_contigs Mitochondrial contigs.
name Name of reference genome.
par Pseudoautosomal regions.
x_contigs X contigs.
y_contigs Y contigs.

Methods

__init__ Initialize self.
add_liftover Register a chain file for liftover.
add_sequence Load the reference sequence from a FASTA file.
contig_length Contig length.
from_fasta_file Create reference genome from a FASTA file.
has_liftover True if a liftover chain file is available from this reference genome to the destination reference.
has_sequence True if the reference sequence has been loaded.
read Load reference genome from a JSON file.
remove_liftover Remove liftover to dest_reference_genome.
remove_sequence Remove the reference sequence.
write Write this reference genome to a file in JSON format.
add_liftover(chain_file, dest_reference_genome)[source]

Register a chain file for liftover.

Notes

This method can only be run once per reference genome. Use has_liftover() to test whether a chain file has been registered.

The chain file format is described here.

Chain files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37 to GRCh38 gs://hail-common/references/grch37_to_grch38.over.chain.gz

GRCh38 to GRCh37 gs://hail-common/references/grch38_to_grch37.over.chain.gz

Public download links are available here.

Parameters:
  • chain_file (str) – Path to chain file. Can be compressed (GZIP) or uncompressed.
  • dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
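
Because add_liftover() can only be run once, a caller may want to guard it with has_liftover(). The following is a minimal sketch of that pattern, not part of Hail itself: ensure_liftover is a hypothetical helper, and _StubReference is a hypothetical stand-in that mimics only the two documented methods so the example runs without Hail installed. With Hail available, a real ReferenceGenome would be passed in place of the stub.

```python
def ensure_liftover(rg, chain_file, dest_reference_genome):
    """Register chain_file only if no liftover to the destination exists yet.

    Hypothetical helper. Returns True if this call registered the chain file.
    """
    if rg.has_liftover(dest_reference_genome):
        return False
    rg.add_liftover(chain_file, dest_reference_genome)
    return True


class _StubReference:
    """Hypothetical stand-in exposing has_liftover() and add_liftover()."""

    def __init__(self):
        self._liftovers = {}

    def has_liftover(self, dest_reference_genome):
        return dest_reference_genome in self._liftovers

    def add_liftover(self, chain_file, dest_reference_genome):
        # Assumption for this stub: a second registration is an error.
        if dest_reference_genome in self._liftovers:
            raise RuntimeError("liftover already registered")
        self._liftovers[dest_reference_genome] = chain_file


stub = _StubReference()
chain = "gs://hail-common/references/grch37_to_grch38.over.chain.gz"
first = ensure_liftover(stub, chain, "GRCh38")   # registers the chain file
second = ensure_liftover(stub, chain, "GRCh38")  # skipped, already registered
```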
add_sequence(fasta_file, index_file)[source]

Load the reference sequence from a FASTA file.

Notes

This method can only be run once per reference genome. Use has_sequence() to test whether a sequence is loaded.

FASTA and index files are hosted on Google Cloud for some of Hail’s built-in references:

GRCh37

  • FASTA file: gs://hail-common/references/human_g1k_v37.fasta.gz
  • Index file: gs://hail-common/references/human_g1k_v37.fasta.fai

GRCh38

  • FASTA file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.gz
  • Index file: gs://hail-common/references/Homo_sapiens_assembly38.fasta.fai

Public download links are available here.

Parameters:
  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.
  • index_file (str) – Path to FASTA index file. Must be uncompressed.
contig_length(contig)[source]

Contig length.

Parameters:contig (str) – Contig name.
Returns:int – Length of contig.
contigs

Contig names.

Returns:list of str
classmethod from_fasta_file(name, fasta_file, index_file, x_contigs=[], y_contigs=[], mt_contigs=[], par=[])[source]

Create reference genome from a FASTA file.

Parameters:
  • name (str) – Name for new reference genome.
  • fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.
  • index_file (str) – Path to FASTA index file. Must be uncompressed.
  • x_contigs (str or list of str) – Contigs to be treated as X chromosomes.
  • y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.
  • mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.
  • par (list of tuple of (str, int, int)) – Pseudoautosomal regions, given as a list of (contig, start, end) tuples.
Returns:

ReferenceGenome
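
To see what a FASTA index contributes, it can help to read one by hand. In the .fai format produced by samtools faidx, each line is tab-separated and its first two columns are the sequence name and its length. The sketch below is illustrative only and is not Hail's implementation: contigs_from_fai is a hypothetical helper, and the offset and line-width columns in the sample are made-up values. It derives a contigs list and a lengths dict in the shapes the ReferenceGenome constructor expects.

```python
import io


def contigs_from_fai(lines):
    """Return (contigs, lengths) parsed from the lines of a .fai index.

    Hypothetical helper: only the first two tab-separated columns
    (name, length) are used; the remaining columns are ignored.
    """
    contigs = []
    lengths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        name, length = fields[0], int(fields[1])
        contigs.append(name)
        lengths[name] = length
    return contigs, lengths


# Sample index text; columns 3-5 are illustrative values only.
example_fai = io.StringIO(
    "1\t249250621\t52\t60\t61\n"
    "MT\t16569\t253404903\t60\t61\n"
)
contigs, lengths = contigs_from_fai(example_fai)
```

The resulting contigs and lengths could then be passed to hl.ReferenceGenome(name, contigs, lengths, ...) as in the constructor example.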

has_liftover(dest_reference_genome)[source]

True if a liftover chain file is available from this reference genome to the destination reference.

Parameters:dest_reference_genome (str or ReferenceGenome) – Destination reference genome.
Returns:bool
has_sequence()[source]

True if the reference sequence has been loaded.

Returns:bool
lengths

Dict of contig name to contig length.

Returns:dict of str to int
mt_contigs

Mitochondrial contigs.

Returns:list of str
name

Name of reference genome.

Returns:str
par

Pseudoautosomal regions.

Returns:list of Interval
classmethod read(path)[source]

Load reference genome from a JSON file.

Notes

The JSON file must have the following format:

{"name": "my_reference_genome",
 "contigs": [{"name": "1", "length": 10000000},
             {"name": "2", "length": 20000000},
             {"name": "X", "length": 19856300},
             {"name": "Y", "length": 78140000},
             {"name": "MT", "length": 532}],
 "xContigs": ["X"],
 "yContigs": ["Y"],
 "mtContigs": ["MT"],
 "par": [{"start": {"contig": "X","position": 60001},"end": {"contig": "X","position": 2699521}},
         {"start": {"contig": "Y","position": 10001},"end": {"contig": "Y","position": 2649521}}]
}

name must be unique and not overlap with Hail’s pre-instantiated references: 'GRCh37', 'GRCh38', 'GRCm38', and 'default'. The contig names in xContigs, yContigs, and mtContigs must be present in contigs. The intervals listed in par must have contigs in either xContigs or yContigs and must have positions between 0 and the contig length given in contigs.
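
The constraints above can be checked before calling read(). The following is a rough sketch in plain Python, not Hail's own validation: validate_reference_json is a hypothetical helper, the position bounds are assumed to be inclusive, and uniqueness against references already registered in a session is not checked because that requires Hail itself.

```python
RESERVED_NAMES = ("GRCh37", "GRCh38", "GRCm38", "default")


def validate_reference_json(doc):
    """Return a list of human-readable problems found in doc (empty if none)."""
    errors = []
    if doc["name"] in RESERVED_NAMES:
        errors.append("name %r is reserved" % doc["name"])

    lengths = {c["name"]: c["length"] for c in doc["contigs"]}

    # xContigs, yContigs, and mtContigs must all appear in contigs.
    for key in ("xContigs", "yContigs", "mtContigs"):
        for contig in doc.get(key, []):
            if contig not in lengths:
                errors.append("%s entry %r is not in contigs" % (key, contig))

    # PAR endpoints must lie on an X or Y contig and within its length.
    sex_contigs = set(doc.get("xContigs", [])) | set(doc.get("yContigs", []))
    for interval in doc.get("par", []):
        for side in ("start", "end"):
            contig = interval[side]["contig"]
            position = interval[side]["position"]
            if contig not in sex_contigs:
                errors.append("par %s contig %r is not an X or Y contig" % (side, contig))
            elif contig in lengths and not 0 <= position <= lengths[contig]:
                errors.append("par %s position %d is out of range on %r" % (side, position, contig))
    return errors


good_doc = {
    "name": "my_reference_genome",
    "contigs": [{"name": "1", "length": 10000000},
                {"name": "X", "length": 19856300},
                {"name": "Y", "length": 78140000},
                {"name": "MT", "length": 532}],
    "xContigs": ["X"],
    "yContigs": ["Y"],
    "mtContigs": ["MT"],
    "par": [{"start": {"contig": "X", "position": 60001},
             "end": {"contig": "X", "position": 2699521}}],
}
bad_doc = dict(good_doc, name="GRCh37")  # same content, reserved name

good_errors = validate_reference_json(good_doc)
bad_errors = validate_reference_json(bad_doc)
```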

Parameters:path (str) – Path to JSON file.
Returns:ReferenceGenome
remove_liftover(dest_reference_genome)[source]

Remove liftover to dest_reference_genome.

Parameters:dest_reference_genome (str or ReferenceGenome)
Returns:bool
remove_sequence()[source]

Remove the reference sequence.

Returns:bool
write(output)[source]

Write this reference genome to a file in JSON format.

Examples

>>> my_rg = hl.ReferenceGenome("new_reference", ["x", "y", "z"], {"x": 500, "y": 300, "z": 200})
>>> my_rg.write("output/new_reference.json")

Notes

Use read() to reimport the exported reference genome in a new HailContext session.

Parameters:output (str) – Path of JSON file to write.
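
To relate the constructor arguments to the JSON layout documented under read(), the sketch below maps one onto the other using only the standard library. It assumes the written file follows that documented layout and makes no claim to match the exact output of write(). to_reference_json is a hypothetical helper, not a Hail function.

```python
import json


def to_reference_json(name, contigs, lengths,
                      x_contigs=(), y_contigs=(), mt_contigs=(), par=()):
    """Build a dict in the JSON layout documented under read()."""

    def as_list(value):
        # The constructor accepts either a single str or a list of str.
        return [value] if isinstance(value, str) else list(value)

    return {
        "name": name,
        "contigs": [{"name": c, "length": lengths[c]} for c in contigs],
        "xContigs": as_list(x_contigs),
        "yContigs": as_list(y_contigs),
        "mtContigs": as_list(mt_contigs),
        # Each (contig, start, end) tuple becomes a start/end pair of loci.
        "par": [{"start": {"contig": c, "position": start},
                 "end": {"contig": c, "position": end}}
                for c, start, end in par],
    }


doc = to_reference_json(
    "my_ref",
    ["1", "X", "Y", "MT"],
    {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569},
    "X", "Y", "MT",
    [("X", 60001, 2699521)],
)
text = json.dumps(doc, indent=1)
```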
x_contigs

X contigs.

Returns:list of str
y_contigs

Y contigs.

Returns:list of str