ReferenceGenome
- class hail.genetics.ReferenceGenome
An object that represents a reference genome.
Examples
>>> contigs = ["1", "X", "Y", "MT"]
>>> lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
>>> par = [("X", 60001, 2699521)]
>>> my_ref = hl.ReferenceGenome("my_ref", contigs, lengths, "X", "Y", "MT", par)
Notes
Hail comes with predefined reference genomes (case sensitive!):
GRCh37, Genome Reference Consortium Human Build 37
GRCh38, Genome Reference Consortium Human Build 38
GRCm38, Genome Reference Consortium Mouse Build 38
CanFam3, Canis lupus familiaris (dog)
You can access these reference genome objects using get_reference():
>>> rg = hl.get_reference('GRCh37')
>>> rg = hl.get_reference('GRCh38')
>>> rg = hl.get_reference('GRCm38')
>>> rg = hl.get_reference('CanFam3')
Constructing a new reference genome, either with the class constructor or with read(), adds it to the list of known references. It can then be accessed with get_reference() at any time afterwards.
Note
Reference genome names must be unique. It is not possible to overwrite the built-in reference genomes.
Note
Hail allows setting a default reference so that the reference_genome argument of import_vcf() does not need to be used constantly. It is a current limitation of Hail that a custom reference genome cannot be used as the default_reference argument of init(). In order to set a custom reference genome as default, pass the reference as an argument to default_reference() after initializing Hail.
- Parameters:
name (str) – Name of reference. Must be unique and NOT one of Hail’s predefined references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3' and 'default'.
contigs (list of str) – Contig names.
lengths (dict of str to int) – Dict of contig names to contig lengths.
x_contigs (str or list of str) – Contigs to be treated as X chromosomes.
y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.
mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.
par (list of tuple of (str, int, int)) – List of tuples with (contig, start, end).
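The constructor arguments have to agree with one another. The sketch below shows one way to sanity-check them in plain Python before building a reference. The helper name check_reference_args is hypothetical, and its rules are assumptions modelled on the constraints documented for the JSON format under read(). It is not Hail's own validation code.

```python
def check_reference_args(contigs, lengths, x_contigs=(), y_contigs=(),
                         mt_contigs=(), par=()):
    """Return a list of problems in constructor-style arguments (empty if none).

    Hypothetical helper for illustration only; the rules are assumptions
    based on the documented JSON constraints, not Hail's implementation.
    """
    def as_list(value):
        # The parameters accept either a single str or a list of str.
        return [value] if isinstance(value, str) else list(value)

    problems = []
    known = set(contigs)

    # Every contig needs a length.
    for contig in contigs:
        if contig not in lengths:
            problems.append(f"contig {contig!r} has no length")

    # X, Y and MT contigs must be drawn from the contig list.
    groups = {
        "x_contigs": as_list(x_contigs),
        "y_contigs": as_list(y_contigs),
        "mt_contigs": as_list(mt_contigs),
    }
    for label, names in groups.items():
        for contig in names:
            if contig not in known:
                problems.append(f"{label} entry {contig!r} is not in contigs")

    # PAR intervals must sit on an X or Y contig and within its length.
    sex_contigs = set(groups["x_contigs"]) | set(groups["y_contigs"])
    for contig, start, end in par:
        if contig not in sex_contigs:
            problems.append(f"PAR contig {contig!r} is not an X or Y contig")
        elif not 0 <= start <= end <= lengths.get(contig, 0):
            problems.append(f"PAR interval on {contig!r} is out of range")
    return problems


contigs = ["1", "X", "Y", "MT"]
lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
par = [("X", 60001, 2699521)]
print(check_reference_args(contigs, lengths, "X", "Y", "MT", par))
```

An empty list means the arguments passed every check in this sketch. Hail may apply different or additional checks.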
Attributes
contigs – Contig names.
global_positions_dict – Get a dictionary mapping contig names to their global genomic positions.
lengths – Dict of contig name to contig length.
mt_contigs – Mitochondrial contigs.
name – Name of reference genome.
par – Pseudoautosomal regions.
x_contigs – X contigs.
y_contigs – Y contigs.
Methods
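add_liftover() – Register a chain file for liftover.
add_sequence() – Load the reference sequence from a FASTA file.
contig_length() – Contig length.
from_fasta_file() – Create reference genome from a FASTA file.
has_liftover() – True if a liftover chain file is available from this reference genome to the destination reference.
has_sequence() – True if the reference sequence has been loaded.
locus_from_global_position() – Constructs a locus from a global position in the reference genome.
read() – Load reference genome from a JSON file.
remove_liftover() – Remove liftover to dest_reference_genome.
remove_sequence() – Remove the reference sequence.
write() – Write this reference genome to a file in JSON format.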
- add_liftover(chain_file, dest_reference_genome)
Register a chain file for liftover.
Examples
Access GRCh37 and GRCh38 using get_reference():
>>> rg37 = hl.get_reference('GRCh37')
>>> rg38 = hl.get_reference('GRCh38')
Add a chain file from 37 to 38:
>>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38)
Notes
This method can only be run once per reference genome. Use has_liftover() to test whether a chain file has been registered.
The chain file format is described here.
Chain files are hosted on Google Cloud for some of Hail’s built-in references:
GRCh37 to GRCh38 gs://hail-common/references/grch37_to_grch38.over.chain.gz
GRCh38 to GRCh37 gs://hail-common/references/grch38_to_grch37.over.chain.gz
Public download links are available here.
- Parameters:
chain_file (str) – Path to chain file. Can be compressed (GZIP) or uncompressed.
dest_reference_genome (str or ReferenceGenome) – Reference genome to convert to.
- add_sequence(fasta_file, index_file=None)
Load the reference sequence from a FASTA file.
Examples
Access the GRCh37 reference genome using get_reference():
>>> rg = hl.get_reference('GRCh37')
Add a sequence file:
>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz',
...                 'gs://hail-common/references/human_g1k_v37.fasta.fai')
Add a sequence file with the default index location:
>>> rg.add_sequence('gs://hail-common/references/human_g1k_v37.fasta.gz')
Notes
This method can only be run once per reference genome. Use has_sequence() to test whether a sequence is loaded.
FASTA and index files are hosted on Google Cloud for some of Hail’s built-in references:
GRCh37
FASTA file:
gs://hail-common/references/human_g1k_v37.fasta.gz
Index file:
gs://hail-common/references/human_g1k_v37.fasta.fai
GRCh38
FASTA file:
gs://hail-common/references/Homo_sapiens_assembly38.fasta.gz
Index file:
gs://hail-common/references/Homo_sapiens_assembly38.fasta.fai
Public download links are available here.
- classmethod from_fasta_file(name, fasta_file, index_file, x_contigs=[], y_contigs=[], mt_contigs=[], par=[])
Create reference genome from a FASTA file.
- Parameters:
name (str) – Name for new reference genome.
fasta_file (str) – Path to FASTA file. Can be compressed (GZIP) or uncompressed.
index_file (str) – Path to FASTA index file. Must be uncompressed.
x_contigs (str or list of str) – Contigs to be treated as X chromosomes.
y_contigs (str or list of str) – Contigs to be treated as Y chromosomes.
mt_contigs (str or list of str) – Contigs to be treated as mitochondrial DNA.
par (list of tuple of (str, int, int)) – List of tuples with (contig, start, end).
- Returns:
ReferenceGenome
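A FASTA index (.fai) file, as produced by samtools faidx, has one tab-separated line per contig with the columns NAME, LENGTH, OFFSET, LINEBASES and LINEWIDTH. The sketch below shows how contig names and lengths could be recovered from such an index in plain Python. The helper name contigs_from_fai and the two-contig example index are hypothetical. How Hail itself reads the index is not described here.

```python
def contigs_from_fai(fai_text):
    """Parse FASTA index text into (contig names in file order, name -> length).

    Hypothetical helper for illustration; not part of Hail.
    """
    contigs = []
    lengths = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # Column 1 is the contig name, column 2 its length in bases.
        name, length = fields[0], int(fields[1])
        contigs.append(name)
        lengths[name] = length
    return contigs, lengths


# A made-up index describing two small contigs.
example_fai = "1\t1000\t3\t60\t61\nX\t500\t1023\t60\t61\n"
fai_contigs, fai_lengths = contigs_from_fai(example_fai)
print(fai_contigs)
print(fai_lengths)
```

The resulting list and dict have the same shape as the contigs and lengths arguments of the class constructor.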
- property global_positions_dict
Get a dictionary mapping contig names to their global genomic positions.
- Returns:
dict – A dictionary of contig names to global genomic positions.
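One plausible reading of a "global genomic position" is the zero-based offset of a contig's first base when all contigs are laid end to end in order. The sketch below computes that mapping in plain Python. The function name global_positions is hypothetical, and the cumulative-sum definition is an assumption consistent with the locus_from_global_position(0) example in this section, not a copy of Hail's code.

```python
def global_positions(contigs, lengths):
    """Map each contig to the zero-based global offset of its first base.

    Hypothetical helper; assumes contigs are concatenated in list order.
    """
    offsets = {}
    running = 0
    for contig in contigs:
        offsets[contig] = running
        running += lengths[contig]
    return offsets


contigs = ["1", "X", "Y", "MT"]
lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
print(global_positions(contigs, lengths))
# {'1': 0, 'X': 249250621, 'Y': 404521181, 'MT': 463894747}
```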
- has_liftover(dest_reference_genome)
True if a liftover chain file is available from this reference genome to the destination reference.
- Parameters:
dest_reference_genome (str or ReferenceGenome)
- Returns:
bool
- locus_from_global_position(global_pos)
Constructs a locus from a global position in the reference genome. The inverse of Locus.position().
Examples
>>> rg = hl.get_reference('GRCh37')
>>> rg.locus_from_global_position(0)
Locus(contig=1, position=1, reference_genome=GRCh37)
>>> rg.locus_from_global_position(2824183054)
Locus(contig=21, position=42584230, reference_genome=GRCh37)
>>> rg = hl.get_reference('GRCh38')
>>> rg.locus_from_global_position(2824183054)
Locus(contig=chr22, position=1, reference_genome=GRCh38)
- Parameters:
global_pos (int) – Zero-based global base position along the reference genome.
- Returns:
Locus
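The conversion from a zero-based global position to a contig and a one-based position can be illustrated with a binary search over cumulative contig offsets. This is a minimal plain-Python sketch. The function name locus_from_global is hypothetical, and the assumption that contigs are concatenated in list order is inferred from the examples above rather than taken from Hail's source.

```python
from bisect import bisect_right


def locus_from_global(global_pos, contigs, lengths):
    """Return (contig, one-based position) for a zero-based global position.

    Hypothetical helper; assumes contigs are concatenated in list order.
    """
    # Zero-based global offset at which each contig starts.
    starts = []
    running = 0
    for contig in contigs:
        starts.append(running)
        running += lengths[contig]

    if not 0 <= global_pos < running:
        raise ValueError(f"global position {global_pos} is outside the genome")

    # Rightmost contig whose start offset is <= global_pos.
    index = bisect_right(starts, global_pos) - 1
    return contigs[index], global_pos - starts[index] + 1


contigs = ["1", "X", "Y", "MT"]
lengths = {"1": 249250621, "X": 155270560, "Y": 59373566, "MT": 16569}
print(locus_from_global(0, contigs, lengths))          # ('1', 1)
print(locus_from_global(249250621, contigs, lengths))  # ('X', 1)
```

The second call shows the boundary case: the position just past the last base of contig "1" is the first base of contig "X".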
- classmethod read(path)
Load reference genome from a JSON file.
Notes
The JSON file must have the following format:
{
  "name": "my_reference_genome",
  "contigs": [{"name": "1", "length": 10000000},
              {"name": "2", "length": 20000000},
              {"name": "X", "length": 19856300},
              {"name": "Y", "length": 78140000},
              {"name": "MT", "length": 532}],
  "xContigs": ["X"],
  "yContigs": ["Y"],
  "mtContigs": ["MT"],
  "par": [{"start": {"contig": "X", "position": 60001}, "end": {"contig": "X", "position": 2699521}},
          {"start": {"contig": "Y", "position": 10001}, "end": {"contig": "Y", "position": 2649521}}]
}
name must be unique and not overlap with Hail’s pre-instantiated references: 'GRCh37', 'GRCh38', 'GRCm38', 'CanFam3', and 'default'. The contig names in xContigs, yContigs, and mtContigs must be present in contigs. The intervals listed in par must have contigs in either xContigs or yContigs and must have positions between 0 and the contig length given in contigs.
- Parameters:
path (str) – Path to JSON file.
- Returns:
ReferenceGenome
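To make the relationship between the JSON layout and the constructor parameters concrete, the sketch below converts a JSON document of the documented shape into constructor-style arguments using only the standard library. The function name constructor_args_from_json is hypothetical, and the requirement that each par interval starts and ends on the same contig is an assumption drawn from the example. It is not Hail's parser.

```python
import json


def constructor_args_from_json(text):
    """Turn reference-genome JSON into (name, contigs, lengths, x, y, mt, par).

    Hypothetical helper for illustration; not part of Hail.
    """
    doc = json.loads(text)
    contigs = [entry["name"] for entry in doc["contigs"]]
    lengths = {entry["name"]: entry["length"] for entry in doc["contigs"]}

    par = []
    for interval in doc["par"]:
        start, end = interval["start"], interval["end"]
        # Assumption: a PAR interval never spans two contigs.
        if start["contig"] != end["contig"]:
            raise ValueError("PAR interval spans more than one contig")
        par.append((start["contig"], start["position"], end["position"]))

    return (doc["name"], contigs, lengths,
            doc["xContigs"], doc["yContigs"], doc["mtContigs"], par)


example = """
{"name": "my_reference_genome",
 "contigs": [{"name": "1", "length": 10000000},
             {"name": "X", "length": 19856300},
             {"name": "Y", "length": 78140000},
             {"name": "MT", "length": 532}],
 "xContigs": ["X"], "yContigs": ["Y"], "mtContigs": ["MT"],
 "par": [{"start": {"contig": "X", "position": 60001},
          "end": {"contig": "X", "position": 2699521}}]}
"""
args = constructor_args_from_json(example)
print(args)
```

The tuple is ordered like the positional parameters of the class constructor, so each JSON field can be matched to the argument it corresponds to.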
- remove_liftover(dest_reference_genome)
Remove liftover to dest_reference_genome.
- Parameters:
dest_reference_genome (str or ReferenceGenome)
- write(output)
Write this reference genome to a file in JSON format.
Examples
>>> my_rg = hl.ReferenceGenome("new_reference", ["x", "y", "z"], {"x": 500, "y": 300, "z": 200})
>>> my_rg.write("output/new_reference.json")
Notes
Use read() to reimport the exported reference genome in a new HailContext session.
- Parameters:
output (str) – Path of JSON file to write.
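For a sense of what such a file might contain, the sketch below builds a JSON document from constructor-style arguments using the field names given for read(). The function name reference_to_json is hypothetical, and the assumption that write() emits exactly this layout is unverified. Treat it as an illustration of the format, not of Hail's serializer.

```python
import json


def reference_to_json(name, contigs, lengths, x_contigs=(), y_contigs=(),
                      mt_contigs=(), par=()):
    """Serialize constructor-style arguments using the JSON field names from read().

    Hypothetical helper for illustration; not part of Hail.
    """
    doc = {
        "name": name,
        "contigs": [{"name": c, "length": lengths[c]} for c in contigs],
        "xContigs": list(x_contigs),
        "yContigs": list(y_contigs),
        "mtContigs": list(mt_contigs),
        # Each (contig, start, end) tuple becomes a start/end pair of loci.
        "par": [
            {"start": {"contig": c, "position": s},
             "end": {"contig": c, "position": e}}
            for c, s, e in par
        ],
    }
    return json.dumps(doc, indent=2)


text = reference_to_json("new_reference", ["x", "y", "z"],
                         {"x": 500, "y": 300, "z": 200})
print(text)
```

Parsing the output with json.loads recovers the same names and lengths, which is the round trip that write() followed by read() is meant to provide.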