Import / Export
This page describes functionality for moving data in and out of Hail.
Hail has a suite of functionality for importing and exporting data to and from general-purpose, genetics-specific, and high-performance native file formats.
Native file formats
When saving data to disk with the intent to later use Hail, we highly recommend
that you use the native file formats to store Table
and
MatrixTable
objects. These binary formats not only smaller than other formats
(especially textual ones) in most cases, but also are significantly faster to
read into Hail later.
These files can be created with methods on the Table
and
MatrixTable
objects:
These files can be read into a Hail session later using the following methods:
|
Read in a |
|
Read in a |
Import
General purpose
The import_table()
function is widely-used to import textual data
into a Hail Table
. import_matrix_table()
is used to import
two-dimensional matrix data in textual representations into a Hail
MatrixTable
. Finally, it is possible to create a Hail Table
from a pandas
DataFrame with Table.from_pandas()
.
|
Import delimited text file (text table) as |
|
Import tab-delimited file(s) as a |
|
Import lines of file(s) as a |
Genetics
Hail has several functions to import genetics-specific file formats into Hail
MatrixTable
or Table
objects:
|
Import VCF file(s) as a |
|
Import a PLINK dataset (BED, BIM, FAM) as a |
|
Import a UCSC BED file as a |
|
Import BGEN file(s) as a |
|
Index BGEN files as required by |
|
Import GEN file(s) as a |
|
Import a PLINK FAM file into a |
|
Import a locus interval list as a |
Export
General purpose
Some of the most widely-used export functionality is found as class methods
on the Table
and Expression
objects:
Table.export()
: Used to write a Table to a text table (TSV).Expression.export()
: Used to write an expression to a text file. For one-dimensional expressions (table row fields, matrix table row or column fields), this is very similar toTable.export()
. For two-dimensional expressions (entry expressions on matrix tables), a text matrix representation that can be imported withimport_matrix_table()
will be produced.Table.to_pandas()
: Used to convert a Hail table to apandas
DataFrame.
Genetics
Hail can export to some of the genetics-specific file formats:
|
Export a |
|
Export MatrixTable as |
|
Export a |
|
Export a |
|
Export a |
|
Extract metadata from VCF header. |
Reference documentation
- hail.methods.read_matrix_table(path, *, _intervals=None, _filter_intervals=False, _drop_cols=False, _drop_rows=False, _create_row_uids=False, _create_col_uids=False, _n_partitions=None, _assert_type=None, _load_refs=True)[source]
Read in a
MatrixTable
written withMatrixTable.write()
.- Parameters:
path (
str
) – File to read.- Returns:
- hail.methods.read_table(path, *, _intervals=None, _filter_intervals=False, _n_partitions=None, _assert_type=None, _load_refs=True, _create_row_uids=False)[source]
Read in a
Table
written withTable.write()
.
- hail.methods.import_bed(path, reference_genome='default', skip_invalid_intervals=False, contig_recoding=None, **kwargs)[source]
Import a UCSC BED file as a
Table
.Examples
The file formats are
$ cat data/file1.bed track name="BedTest" 20 1 14000000 20 17000000 18000000 ... $ cat file2.bed track name="BedTest" 20 1 14000000 cnv1 20 17000000 18000000 cnv2 ...
Add the row field cnv_region indicating inclusion in at least one interval of the three-column BED file:
>>> bed = hl.import_bed('data/file1.bed', reference_genome='GRCh37') >>> result = dataset.annotate_rows(cnv_region = hl.is_defined(bed[dataset.locus]))
Add a row field cnv_id with the value given by the fourth column of a BED file:
>>> bed = hl.import_bed('data/file2.bed') >>> result = dataset.annotate_rows(cnv_id = bed[dataset.locus].target)
Notes
The table produced by this method has one of two possible structures. If the .bed file has only three fields (chrom, chromStart, and chromEnd), then the produced table has only one column:
If the .bed file has four or more columns, then Hail will store the fourth column as a row field in the table:
UCSC bed files can have up to 12 fields, but Hail will only ever look at the first four. Hail ignores header lines in BED files.
Warning
Intervals in UCSC BED files are 0-indexed and half open. The line “5 100 105” correpsonds to the interval
[5:101-5:106)
in Hail’s 1-indexed notation. Details here.- Parameters:
path (
str
) – Path to .bed file.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.skip_invalid_intervals (
bool
) – IfTrue
and reference_genome is notNone
, skip lines with intervals that are not consistent with the reference genome.contig_recoding (
dict
of (str
,str
)) – Mapping from contig name in BED to contig name in loaded dataset. All contigs must be present in the reference_genome, so this is useful for mapping differently-formatted data onto known references.**kwargs – Additional optional arguments to
import_table()
are valid arguments here except: no_header, delimiter, impute, skip_blank_lines, types, and comment as these are used by import_bed.
- Returns:
Table
– Interval-keyed table.
- hail.methods.import_bgen(path, entry_fields, sample_file=None, n_partitions=None, block_size=None, index_file_map=None, variants=None, _row_fields=['varid', 'rsid'])[source]
Import BGEN file(s) as a
MatrixTable
.Examples
Import a BGEN file as a matrix table with GT and GP entry fields:
>>> ds_result = hl.import_bgen("data/example.8bits.bgen", ... entry_fields=['GT', 'GP'], ... sample_file="data/example.8bits.sample")
Import a BGEN file as a matrix table with genotype dosage entry field:
>>> ds_result = hl.import_bgen("data/example.8bits.bgen", ... entry_fields=['dosage'], ... sample_file="data/example.8bits.sample")
Load a single variant from a BGEN file:
>>> ds_result = hl.import_bgen("data/example.8bits.bgen", ... entry_fields=['dosage'], ... sample_file="data/example.8bits.sample", ... variants=[hl.eval(hl.parse_variant('1:2000:A:G'))])
Load a set of variants specified by a table expression from a BGEN file:
>>> variants = hl.import_table("data/bgen-variants.txt") >>> variants = variants.annotate(v=hl.parse_variant(variants.v)).key_by('v') >>> ds_result = hl.import_bgen("data/example.8bits.bgen", ... entry_fields=['dosage'], ... sample_file="data/example.8bits.sample", ... variants=variants.v)
Load a set of variants specified by a table keyed by ‘locus’ and ‘alleles’ from a BGEN file:
>>> ds_result = hl.import_bgen("data/example.8bits.bgen", ... entry_fields=['dosage'], ... sample_file="data/example.8bits.sample", ... variants=variants_table)
Notes
Hail supports importing data from v1.2 of the BGEN file format. Genotypes must be unphased and diploid, genotype probabilities must be stored with 8 bits, and genotype probability blocks must be compressed with zlib or uncompressed. All variants must be bi-allelic.
Each BGEN file must have a corresponding index file, which can be generated with
index_bgen()
. All files must have been indexed with the same reference genome. To load multiple files at the same time, use Hadoop Glob Patterns.If n_partitions and block_size are both specified, block_size is used. If neither are specified, the default is a 128MB block size.
Column Fields
s (
tstr
) – Column key. This is the sample ID imported from the first column of the sample file if given. Otherwise, the sample ID is taken from the sample identifying block in the first BGEN file if it exists; else IDs are assigned from _0, _1, to _N.
Row Fields
Between two and four row fields are created. The locus and alleles are always included. _row_fields determines if varid and rsid are also included. For best performance, only include fields necessary for your analysis. NOTE: the _row_fields parameter is considered an experimental feature and may be removed without warning.
locus (
tlocus
ortstruct
) – Row key. The chromosome and position. If reference_genome is defined, the type will betlocus
parameterized by reference_genome. Otherwise, the type will be atstruct
with two fields: contig with typetstr
and position with typetint32
.alleles (
tarray
oftstr
) – Row key. An array containing the alleles of the variant. The reference allele is the first element in the array.varid (
tstr
) – The variant identifier. The third field in each variant identifying block.rsid (
tstr
) – The rsID for the variant. The fifth field in each variant identifying block.
Entry Fields
Up to three entry fields are created, as determined by entry_fields. For best performance, include precisely those fields required for your analysis. It is also possible to pass an empty tuple or list for entry_fields, which can greatly accelerate processing speed if your workflow does not use the genotype data.
GT (
tcall
) – The hard call corresponding to the genotype with the greatest probability. If there is not a unique maximum probability, the hard call is set to missing.GP (
tarray
oftfloat64
) – Genotype probabilities as defined by the BGEN file spec. For bi-allelic variants, the array has three elements giving the probabilities of homozygous reference, heterozygous, and homozygous alternate genotype, in that order.dosage (
tfloat64
) – The expected value of the number of alternate alleles, given by the probability of heterozygous genotype plus twice the probability of homozygous alternate genotype. All variants must be bi-allelic.
See also
- Parameters:
entry_fields (
list
ofstr
) – List of entry fields to create. Options:'GT'
,'GP'
,'dosage'
.sample_file (
str
, optional) – Sample file to read the sample ids from. If specified, the number of samples in the file must match the number in the BGEN file(s).n_partitions (
int
, optional) – Number of partitions.block_size (
int
, optional) – Block size, in MB.index_file_map (
dict
ofstr
tostr
, optional) – Dict of BGEN file to index file location. Cannot use Hadoop glob patterns in file names.variants (
StructExpression
orLocusExpression
orlist
ofStruct
orlist
ofLocus
orTable
) – Variants to filter to. The underlying type of the input (row key in the case of aTable
) must be either atlocus
, a struct with one field locus, or a struct with two fields: locus and alleles. The type of locus can either be atlocus
or atstruct
with two fields: contig of typetstr
and position of typetint
. If the type of locus istlocus
, the reference genome must match that used to index the BGEN file(s). The type of alleles is atarray
oftstr
._row_fields (
list
ofstr
) – List of non-key row fields to create. Options:'varid'
,'rsid'
- Returns:
- hail.methods.index_bgen(path, index_file_map=None, reference_genome='default', contig_recoding=None, skip_invalid_loci=False, _buffer_size=16000000)[source]
Index BGEN files as required by
import_bgen()
.If index_file_map is unspecified, then, for each BGEN file, the index file is written in the same directory and as the associated BGEN file with the same filename appended by .idx2. Otherwise, the index_file_map must specify a distinct idx2 path for each BGEN file.
Example
Index a BGEN file, renaming contig name “01” to “1”:
>>> hl.index_bgen("data/example.8bits.bgen", ... contig_recoding={"01": "1"}, ... reference_genome='GRCh37')
Warning
While this method parallelizes over a list of BGEN files, each file is indexed serially by one core. Indexing several BGEN files on a large cluster is a waste of resources, so indexing should generally be done once, separately from large analyses.
See also
- Parameters:
path (
str
orlist
ofstr
) – The .bgen files to index. May be one of: a BGEN file path, a list of BGEN file paths, or the path of a directory that contains BGEN files.index_file_map (
dict
ofstr
tostr
, optional) – Dict of BGEN file to index file location. Index file location must have a .idx2 file extension. Cannot use Hadoop glob patterns in file names.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.contig_recoding (
dict
ofstr
tostr
, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given by reference_genome.skip_invalid_loci (
bool
) – IfTrue
, skip loci that are not consistent with reference_genome.
- hail.methods.import_fam(path, quant_pheno=False, delimiter='\\\\s+', missing='NA')[source]
Import a PLINK FAM file into a
Table
.Examples
Import a tab-separated FAM file with a case-control phenotype:
>>> fam_kt = hl.import_fam('data/case_control_study.fam')
Import a FAM file with a quantitative phenotype:
>>> fam_kt = hl.import_fam('data/quantitative_study.fam', quant_pheno=True)
Notes
In Hail, unlike PLINK, the user must explicitly distinguish between case-control and quantitative phenotypes. Importing a quantitative phenotype with
quant_pheno=False
will return an error (unless all values happen to be 0, 1, 2, or -9):The resulting
Table
will have fields, types, and values that are interpreted as missing.One of:
Warning
Hail will interpret the value “-9” as a valid quantitative phenotype, which differs from default PLINK behavior. Use
missing='-9'
to interpret this value as missing.
- hail.methods.import_gen(path, sample_file=None, tolerance=0.2, min_partitions=None, chromosome=None, reference_genome='default', contig_recoding=None, skip_invalid_loci=False)[source]
Import GEN file(s) as a
MatrixTable
.Examples
>>> ds = hl.import_gen('data/example.gen', ... sample_file='data/example.sample', ... reference_genome='GRCh37')
Notes
For more information on the GEN file format, see here.
If the GEN file has only 5 columns before the start of the genotype probability data (chromosome field is missing), you must specify the chromosome using the chromosome parameter.
To load multiple files at the same time, use Hadoop Glob Patterns.
Column Fields
s (
tstr
) – Column key. This is the sample ID imported from the first column of the sample file.
Row Fields
locus (
tlocus
ortstruct
) – Row key. The genomic location consisting of the chromosome (1st column if present, otherwise given by chromosome) and position (4th column if chromosome is not defined). If reference_genome is defined, the type will betlocus
parameterized by reference_genome. Otherwise, the type will be atstruct
with two fields: contig with typetstr
and position with typetint32
.alleles (
tarray
oftstr
) – Row key. An array containing the alleles of the variant. The reference allele (4th column if chromosome is not defined) is the first element of the array and the alternate allele (5th column if chromosome is not defined) is the second element.varid (
tstr
) – The variant identifier. 2nd column of GEN file if chromosome present, otherwise 1st column.rsid (
tstr
) – The rsID. 3rd column of GEN file if chromosome present, otherwise 2nd column.
Entry Fields
GT (
tcall
) – The hard call corresponding to the genotype with the highest probability.GP (
tarray
oftfloat64
) – Genotype probabilities as defined by the GEN file spec. The array is set to missing if the sum of the probabilities is a distance greater than the tolerance parameter from 1.0. Otherwise, the probabilities are normalized to sum to 1.0. For example, the input[0.98, 0.0, 0.0]
will be normalized to[1.0, 0.0, 0.0]
.
- Parameters:
sample_file (
str
) – Sample file to import.tolerance (
float
) – If the sum of the genotype probabilities for a genotype differ from 1.0 by more than the tolerance, set the genotype to missing.min_partitions (
int
, optional) – Number of partitions.chromosome (
str
, optional) – Chromosome if not included in the GEN filereference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.contig_recoding (
dict
ofstr
tostr
, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given by reference_genome.skip_invalid_loci (
bool
) – IfTrue
, skip loci that are not consistent with reference_genome.
- Returns:
- hail.methods.import_locus_intervals(path, reference_genome='default', skip_invalid_intervals=False, contig_recoding=None, **kwargs)[source]
Import a locus interval list as a
Table
.Examples
Add the row field capture_region indicating inclusion in at least one locus interval from capture_intervals.txt:
>>> intervals = hl.import_locus_intervals('data/capture_intervals.txt', reference_genome='GRCh37') >>> result = dataset.annotate_rows(capture_region = hl.is_defined(intervals[dataset.locus]))
Notes
Hail expects an interval file to contain either one, three or five fields per line in the following formats:
contig:start-end
contig start end
(tab-separated)contig start end direction target
(tab-separated)
A file in either of the first two formats produces a table with one field:
interval (
tinterval
) - Row key. Genomic interval. If reference_genome is defined, the point type of the interval will betlocus
parameterized by the reference_genome. Otherwise, the point type is atstruct
with two fields: contig with typetstr
and position with typetint32
.
A file in the third format (with a “target” column) produces a table with two fields:
If reference_genome is defined AND the file has one field, intervals are parsed with
parse_locus_interval()
. See the documentation for valid inputs.If reference_genome is NOT defined and the file has one field, intervals are parsed with the regex
`"([^:]*):(\d+)\-(\d+)"
where contig, start, and end match each of the three capture groups.start
andend
match positions inclusively, e.g.start <= position <= end
.For files with three or five fields,
start
andend
match positions inclusively, e.g.start <= position <= end
.- Parameters:
path (
str
) – Path to file.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.skip_invalid_intervals (
bool
) – IfTrue
and reference_genome is notNone
, skip lines with intervals that are not consistent with the reference genome.contig_recoding (
dict
of (str
,str
)) – Mapping from contig name in file to contig name in loaded dataset. All contigs must be present in the reference_genome, so this is useful for mapping differently-formatted data onto known references.**kwargs – Additional optional arguments to
import_table()
are valid arguments here except: no_header, comment, impute, and types, as these are used byimport_locus_intervals()
.
- Returns:
Table
– Interval-keyed table.
- hail.methods.import_matrix_table(paths, row_fields={}, row_key=[], entry_type=dtype('int32'), missing='NA', min_partitions=None, no_header=False, force_bgz=False, sep=None, delimiter=None, comment=())[source]
Import tab-delimited file(s) as a
MatrixTable
.Examples
Consider the following file containing counts from a RNA sequencing dataset:
$ cat data/matrix1.tsv Barcode Tissue Days GENE1 GENE2 GENE3 GENE4 TTAGCCA brain 1.0 0 0 1 0 ATCACTT kidney 5.5 3 0 2 0 CTCTTCT kidney 2.5 0 0 0 1 CTATATA brain 7.0 0 0 3 0
The field
Days
contains floating-point numbers and each of theGENE
fields contain integers. contains integers.To import this matrix:
>>> matrix1 = hl.import_matrix_table('data/matrix1.tsv', ... row_fields={'Barcode': hl.tstr, 'Tissue': hl.tstr, 'Days':hl.tfloat32}, ... row_key='Barcode') >>> matrix1.describe() ---------------------------------------- Global fields: None ---------------------------------------- Column fields: 'col_id': str ---------------------------------------- Row fields: 'Barcode': str 'Tissue': str 'Days': float32 ---------------------------------------- Entry fields: 'x': int32 ---------------------------------------- Column key: 'col_id': str Row key: 'Barcode': str ----------------------------------------
In this example, the header information is missing for the row fields, but the column IDs are still present:
$ cat data/matrix2.tsv GENE1 GENE2 GENE3 GENE4 TTAGCCA brain 1.0 0 0 1 0 ATCACTT kidney 5.5 3 0 2 0 CTCTTCT kidney 2.5 0 0 0 1 CTATATA brain 7.0 0 0 3 0
The row fields get imported as f0, f1, and f2, so we need to do:
>>> matrix2 = hl.import_matrix_table('data/matrix2.tsv', ... row_fields={'f0': hl.tstr, 'f1': hl.tstr, 'f2':hl.tfloat32}, ... row_key='f0') >>> matrix2.rename({'f0': 'Barcode', 'f1': 'Tissue', 'f2': 'Days'})
Sometimes, the header and row information is missing completely:
$ cat data/matrix3.tsv 0 0 1 0 3 0 2 0 0 0 0 1 0 0 3 0
>>> matrix3 = hl.import_matrix_table('data/matrix3.tsv', no_header=True)
In this case, the file has no row fields, so we use the default row index as a key for the imported matrix table.
Notes
The resulting matrix table has the following structure:
The row fields are named as specified in the column header. If they are missing from the header or
no_header=True
, row field names are set to the strings f0, f1, … (0-indexed) in column order. The types of all row fields must be specified in the row_fields argument.The row key is taken from the row_key argument, and must be a subset of row fields. If left empty, the row key will be a new row field row_id of type
int
, whose values 0, 1, … index the original rows of the matrix.There is one column field, col_id, which is a key field of type :obj:str or :obj:int. By default, its values are the strings given by the corresponding column names in the header line. If
no_header=True
, column IDs are set to integers 0, 1, … (also 0-indexed) in column order.There is one entry field, x, that contains the data from the imported matrix.
All columns to be imported as row fields must be at the start of the row.
Unlike
import_table()
, no type imputation is done so types must be specified for all columns that should be imported as row fields. (The other columns are imported as entries in the matrix.)The header information for row fields is allowed to be missing, if the column IDs are present, but the header must then consist only of tab-delimited column IDs (no row field names).
The column IDs will never be missing, even if the missing string appears in the column IDs.
- Parameters:
row_fields (
dict
ofstr
toHailType
) – Columns to take as row fields in the MatrixTable. They must be located before all entry columns.row_key (
str
orlist
ofstr
) – Key fields(s). If empty, creates an index row_id to use as key.entry_type (
HailType
) – Type of entries in matrix table. Must be one of:tint32
,tint64
,tfloat32
,tfloat64
, ortstr
. Default:tint32
.missing (
str
) – Identifier to be treated as missing. Default: NAmin_partitions (
int
orNone
) – Minimum number of partitions.no_header (
bool
) – IfTrue
, assume the file has no header and name the row fields f0, f1, … fK (0-indexed) and the column keys 0, 1, … N.force_bgz (
bool
) – IfTrue
, load .gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec.sep (
str
) – This parameter is a deprecated name for delimiter, please use that instead.delimiter (
str
) – A single character string which separates values in the file.comment (
str
orlist
ofstr
) – Skip lines beginning with the given string if the string is a single character. Otherwise, skip lines that match the regex specified. Multiple comment characters or patterns should be passed as a list.
- Returns:
MatrixTable
– MatrixTable constructed from imported data.
- hail.methods.import_plink(bed, bim, fam, min_partitions=None, delimiter='\\\\s+', missing='NA', quant_pheno=False, a2_reference=True, reference_genome='default', contig_recoding=None, skip_invalid_loci=False, n_partitions=None, block_size=None)[source]
Import a PLINK dataset (BED, BIM, FAM) as a
MatrixTable
.Examples
>>> ds = hl.import_plink(bed='data/test.bed', ... bim='data/test.bim', ... fam='data/test.fam', ... reference_genome='GRCh37')
Notes
Only binary SNP-major mode files can be read into Hail. To convert your file from individual-major mode to SNP-major mode, use PLINK to read in your fileset and use the
--make-bed
option.Hail uses the individual ID (column 2 in FAM file) as the sample id (s). The individual IDs must be unique.
The resulting
MatrixTable
has the following fields:Row fields:
locus (
tlocus
ortstruct
) – Row key. The chromosome and position. If reference_genome is defined, the type will betlocus
parameterized by reference_genome. Otherwise, the type will be atstruct
with two fields: contig with typetstr
and position with typetint32
.alleles (
tarray
oftstr
) – Row key. An array containing the alleles of the variant. The reference allele (A2 if a2_reference isTrue
) is the first element in the array.rsid (
tstr
) – Column 2 in the BIM file.cm_position (
tfloat64
) – Column 3 in the BIM file, the position in centimorgans.
Column fields:
s (
tstr
) – Column 2 in the Fam file (key field).fam_id (
tstr
) – Column 1 in the FAM file. Set to missing if ID equals “0”.pat_id (
tstr
) – Column 3 in the FAM file. Set to missing if ID equals “0”.mat_id (
tstr
) – Column 4 in the FAM file. Set to missing if ID equals “0”.is_female (
tstr
) – Column 5 in the FAM file. Set to missing if value equals “-9”, “0”, or “N/A”. Set to true if value equals “2”. Set to false if value equals “1”.is_case (
tbool
) – Column 6 in the FAM file. Only present if quant_pheno equals False. Set to missing if value equals “-9”, “0”, “N/A”, or the value specified by missing. Set to true if value equals “2”. Set to false if value equals “1”.quant_pheno (
tfloat
) – Column 6 in the FAM file. Only present if quant_pheno equals True. Set to missing if value equals missing.
Entry fields:
GT (
tcall
) – Genotype call (diploid, unphased).
Warning
Hail will interpret the value “-9” as a valid quantitative phenotype, which differs from default PLINK behavior. Use
missing='-9'
to interpret this value as missing.- Parameters:
bed (
str
) – PLINK BED file.bim (
str
) – PLINK BIM file.fam (
str
) – PLINK FAM file.min_partitions (
int
, optional) – Minimum number of partitions. Useful in conjunction with block_size.missing (
str
) – String used to denote missing values only for the phenotype field. This is in addition to “-9”, “0”, and “N/A” for case-control phenotypes.delimiter (
str
) – FAM file field delimiter regex.quant_pheno (
bool
) – IfTrue
, FAM phenotype is interpreted as quantitative.a2_reference (
bool
) – IfTrue
, A2 is treated as the reference allele. If False, A1 is treated as the reference allele.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.contig_recoding (
dict
ofstr
tostr
, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given byreference_genome
. IfNone
, the default is dependent on thereference_genome
. For “GRCh37”, the default is{'23': 'X', '24': 'Y', '25': 'X', '26': 'MT'}
. For “GRCh38”, the default is{'1': 'chr1', ..., '22': 'chr22', '23': 'chrX', '24': 'chrY', '25': 'chrX', '26': 'chrM'}
.skip_invalid_loci (
bool
) – IfTrue
, skip loci that are not consistent with reference_genome.n_partitions (
int
, optional) – Number of partitions. If both n_partitions and block_size are specified, n_partitions will be used.block_size (
int
, optional) – Block size, in MB. Default: 128MB blocks.
- Returns:
- hail.methods.import_table(paths, key=None, min_partitions=None, impute=False, no_header=False, comment=(), delimiter='\t', missing='NA', types={}, quote=None, skip_blank_lines=False, force_bgz=False, filter=None, find_replace=None, force=False, source_file_field=None)[source]
Import delimited text file (text table) as
Table
.The resulting
Table
will have no key fields. UseTable.key_by()
to specify keys. See also:import_matrix_table()
.Examples
Consider this file:
$ cat data/samples1.tsv Sample Height Status Age PT-1234 154.1 ADHD 24 PT-1236 160.9 Control 19 PT-1238 NA ADHD 89 PT-1239 170.3 Control 55
The field
Height
contains floating-point numbers and the fieldAge
contains integers.To import this table using field types:
>>> table = hl.import_table('data/samples1.tsv', ... types={'Height': hl.tfloat64, 'Age': hl.tint32})
Note
Sample
andStatus
need no type, becausetstr
is the default type.To import a table using type imputation (which causes the file to be parsed twice):
>>> table = hl.import_table('data/samples1.tsv', impute=True)
Detailed examples
Let’s import fields from a CSV file with missing data and special characters:
$ cat data/samples2.csv Batch,PT-ID 1kg,PT-0001 1kg,PT-0002 study1,PT-0003 study3,PT-0003 .,PT-0004 1kg,PT-0005 .,PT-0006 1kg,PT-0007
In this case, we should:
Pass the non-default delimiter
,
Pass the non-default missing value
.
>>> table = hl.import_table('data/samples2.csv', delimiter=',', missing='.')
Let’s import a table from a file with no header and sample IDs that need to be transformed. Suppose the sample IDs are of the form
NA#####
. This file has no header line, and the sample ID is hidden in a field with other information.To import:
>>> t = hl.import_table('data/samples3.tsv', no_header=True) >>> t = t.annotate(sample = t.f0.split("_")[1]).key_by('sample')
Let’s import a table from a file where one of the fields is a JSON object.
To import, we need to specify the types argument.
>>> my_types = {"id": hl.tint32, "json_field":hl.tstruct(foo=hl.tstr, x=hl.tint32)} >>> ht_with_json = hl.import_table('data/table_with_json.tsv', types=my_types)
Notes
The impute parameter tells Hail to scan the file an extra time to gather information about possible field types. While this is a bit slower for large files because the file is parsed twice, the convenience is often worth this cost.
The delimiter parameter is either a delimiter character (if a single character) or a field separator regex (2 or more characters). This regex follows the Java regex standard.
Note
Use
delimiter='\s+'
to specify whitespace delimited files.If set, the comment parameter causes Hail to skip any line that starts with the given string(s). For example, passing
comment='#'
will skip any line beginning in a pound sign. If the string given is a single character, Hail will skip any line beginning with the character. Otherwise if the length of the string is greater than 1, Hail will interpret the string as a regex and will filter out lines matching the regex. For example, passingcomment=['#', '^track.*']
will filter out lines beginning in a pound sign and any lines that match the regex'^track.*'
.The missing parameter defines the representation of missing data in the table.
Note
The missing parameter is NOT a regex. The comment parameter is treated as a regex ONLY if the length of the string is greater than 1 (not a single character).
The no_header parameter indicates that the file has no header line. If this option is passed, then the field names will be f0, f1, … fN (0-indexed).
The types parameter allows the user to pass the types of fields in the table. It is an
dict
keyed bystr
, withHailType
values. See the examples above for a standard usage. Additionally, this option can be used to override type imputation. For example, if the fieldChromosome
only contains the values1
through22
, it will be imputed to have typetint32
, whereas most Hail methods expect that a chromosome field will be of typetstr
. Settingimpute=True
andtypes={'Chromosome': hl.tstr}
solves this problem.- Parameters:
min_partitions (
int
orNone
) – Minimum number of partitions.no_header (
bool
) – IfTrue`
, assume the file has no header and name the N fields f0, f1, … fN (0-indexed).impute (
bool
) – IfTrue
, Impute field types from the file.comment (
str
orlist
ofstr
) – Skip lines beginning with the given string if the string is a single character. Otherwise, skip lines that match the regex specified. Multiple comment characters or patterns should be passed as a list.delimiter (
str
) – Field delimiter regex.missing (
str
orlist
[str
]) – Identifier(s) to be treated as missing.types (
dict
mappingstr
toHailType
) – Dictionary defining field types.skip_blank_lines (
bool
) – IfTrue
, ignore empty lines. Otherwise, throw an error if an empty line is found.force_bgz (
bool
) – IfTrue
, load files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not'.bgz'
, but the file is blocked gzip, so that the file can be read in parallel and not on a single node.filter (
str
, optional) – Line filter regex. A partial match results in the line being removed from the file. Applies before find_replace, if both are defined.find_replace ((
str
,str
)) – Line substitution regex. Functions likere.sub
, but obeys the exact semantics of Java’s String.replaceAll.force (
bool
) – IfTrue
, load gzipped files serially on one core. This should be used only when absolutely necessary, as processing time will be increased due to lack of parallelism.source_file_field (
str
, optional) – If defined, the source file name for each line will be a field of the table with this name. Can be useful when importing multiple tables using glob patterns.
- Returns:
- hail.methods.import_lines(paths, min_partitions=None, force_bgz=False, force=False, file_per_partition=False)[source]
Import lines of file(s) as a
Table
of strings.Examples
To import a file as a table of strings:
>>> ht = hl.import_lines('data/matrix2.tsv') >>> ht.describe() ---------------------------------------- Global fields: None ---------------------------------------- Row fields: 'file': str 'text': str ---------------------------------------- Key: [] ----------------------------------------
- Parameters:
min_partitions (
int
orNone
) – Minimum number of partitions.force_bgz (
bool
) – IfTrue
, load files as blocked gzip files, assuming that they were actually compressed using the BGZ codec. This option is useful when the file extension is not'.bgz'
, but the file is blocked gzip, so that the file can be read in parallel and not on a single node.force (
bool
) – IfTrue
, load gzipped files serially on one core. This should be used only when absolutely necessary, as processing time will be increased due to lack of parallelism.file_per_partition (
bool
) – IfTrue
, each file will be in a seperate partition. Not recommended for most uses. Error thrown ifTrue
and min_partitions is less than the number of files
- Returns:
Table
– Table constructed from imported data.
- hail.methods.import_vcf(path, force=False, force_bgz=False, header_file=None, min_partitions=None, drop_samples=False, call_fields=['PGT'], reference_genome='default', contig_recoding=None, array_elements_required=True, skip_invalid_loci=False, entry_float_type=dtype('float64'), filter=None, find_replace=None, n_partitions=None, block_size=None, _create_row_uids=False, _create_col_uids=False)[source]
Import VCF file(s) as a
MatrixTable
.Examples
Import a standard bgzipped VCF with GRCh37 as the reference genome.
>>> ds = hl.import_vcf('data/example2.vcf.bgz', reference_genome='GRCh37')
Import a variant-partitioned dataset stored in one or more VCF files. The
*
is a glob pattern which matches any string of characters.>>> ds = hl.import_vcf('data/samplepart*.vcf')
Import a VCF dataset and override every header with the header from
data/samplepart1.vcf
. If the other VCF files are missing headers, have differing sample names, or otherwise have incompatible headers, header_file can be used to enforce a consistent header.>>> ds = hl.import_vcf('data/samplepart*.vcf', header_file='data/samplepart1.vcf')
Import a VCF with GRCh38 as the reference genome that incorrectly uses the contig names from GRCh37 (i.e. uses contig name “1” instead of “chr1”).
>>> recode = {f"{i}":f"chr{i}" for i in (list(range(1, 23)) + ['X', 'Y'])} >>> ds = hl.import_vcf('data/grch38_bad_contig_names.vcf', reference_genome='GRCh38', contig_recoding=recode)
Import a bgzipped VCF which uses the “gz” extension rather than the “bgz” extension:
>>> ds = hl.import_vcf('data/sample.vcf.gz', force_bgz=True)
Import a VCF which has missing values (“.”) inside INFO or FORMAT array fields:
>>> print(open('data/missing-values-in-array-fields.vcf').read()) ##fileformat=VCFv4.1 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=X,Number=.,Type=Integer,Description=""> ##FORMAT=<ID=Y,Number=.,Type=Integer,Description=""> ##FORMAT=<ID=Z,Number=.,Type=Integer,Description=""> ##INFO=<ID=A,Number=A,Type=Integer,Description=""> ##INFO=<ID=B,Number=R,Type=Float,Description=""> ##INFO=<ID=C,Number=3,Type=Float,Description=""> ##INFO=<ID=D,Number=.,Type=Float,Description=""> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 1 123456 . A C . . A=1,.;B=.,2,.;C=. GT:X:Y:Z 0/0:1,.,1:.
>>> ds = hl.import_vcf('data/missing-values-in-array-fields.vcf', array_elements_required=False) >>> ds.show(n_rows=1, n_cols=1, include_row_fields=True) +---------------+------------+------+---------+----------+--------------+ | locus | alleles | rsid | qual | filters | info.A | +---------------+------------+------+---------+----------+--------------+ | locus<GRCh37> | array<str> | str | float64 | set<str> | array<int32> | +---------------+------------+------+---------+----------+--------------+ | 1:123456 | ["A","C"] | NA | NA | NA | [1,NA] | +---------------+------------+------+---------+----------+--------------+ +------------------+----------------+----------------+--------------+ | info.B | info.C | info.D | 'SAMPLE1'.GT | +------------------+----------------+----------------+--------------+ | array<float64> | array<float64> | array<float64> | call | +------------------+----------------+----------------+--------------+ | [NA,2.00e+00,NA] | NA | NA | 0/0 | +------------------+----------------+----------------+--------------+ +--------------+--------------+--------------+ | 'SAMPLE1'.X | 'SAMPLE1'.Y | 'SAMPLE1'.Z | +--------------+--------------+--------------+ | array<int32> | array<int32> | array<int32> | +--------------+--------------+--------------+ | [1,NA,1] | NA | NA | +--------------+--------------+--------------+
Notes
Hail is designed to be maximally compatible with files in the VCF v4.2 spec.
import_vcf()
takes a list of VCF files to load. All files must have the same header and the same set of samples in the same order (e.g., a dataset split by chromosome). Files can be specified as Hadoop glob patterns.Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, it is likely that the file is actually block-compressed, and you should rename the file to .vcf.bgz accordingly. If you are unable to rename this file, please use force_bgz=True to ignore the extension and treat this file as block-gzipped.
If you have a non-block (aka standard) gzipped file, you may use force=True; however, we strongly discourage this because each file will be processed by a single core. Import will take significantly longer for any non-trivial dataset.
import_vcf()
does not perform deduplication - if the provided VCF(s) contain multiple records with the same chrom, pos, ref, alt, all these records will be imported as-is (in multiple rows) and will not be collapsed into a single variant.Note
Using the FILTER field:
The information in the FILTER field of a VCF is contained in the
filters
row field. This annotation is aset<str>
and can be queried for filter membership with expressions likeds.filters.contains("VQSRTranche99.5...")
. Variants that are flagged as “PASS” will have no filters applied; for these variants,hl.len(ds.filters)
is0
. Thus, filtering to PASS variants can be done withMatrixTable.filter_rows()
as follows:>>> pass_ds = dataset.filter_rows(hl.len(dataset.filters) == 0)
Column Fields
s (
tstr
) – Column key. This is the sample ID.
Row Fields
locus (
tlocus
ortstruct
) – Row key. The chromosome (CHROM field) and position (POS field). If reference_genome is defined, the type will betlocus
parameterized by reference_genome. Otherwise, the type will be atstruct
with two fields: contig with typetstr
and position with typetint32
.alleles (
tarray
oftstr
) – Row key. An array containing the alleles of the variant. The reference allele (REF field) is the first element in the array and the alternate alleles (ALT field) are the subsequent elements.filters (
tset
oftstr
) – Set containing all filters applied to a variant.rsid (
tstr
) – rsID of the variant.qual (
tfloat64
) – Floating-point number in the QUAL field.info (
tstruct
) – All INFO fields defined in the VCF header can be found in the struct info. Data types match the type specified in the VCF header, and if the declaredNumber
is not 1, the result will be stored as an array.
Entry Fields
import_vcf()
generates an entry field for each FORMAT field declared in the VCF header. The types of these fields are generated according to the same rules as INFO fields, with one difference – “GT” and other fields specified in call_fields will be read astcall
.- Parameters:
path (
str
orlist
ofstr
) – One or more paths to VCF files to read. Each path may or may not include glob expressions like*
,?
, or[abc123]
.force (
bool
) – IfTrue
, load .vcf.gz files serially. No downstream operations can be parallelized, so this mode is strongly discouraged.force_bgz (
bool
) – IfTrue
, load .vcf.gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec.header_file (
str
, optional) – Optional header override file. If not specified, the first file in path is used. Glob patterns are not allowed in the header_file.min_partitions (
int
, optional) – Minimum partitions to load per file.drop_samples (
bool
) – IfTrue
, create sites-only dataset. Don’t load sample IDs or entries.call_fields (
list
ofstr
) – List of FORMAT fields to load astcall
. “GT” is loaded as a call automatically.reference_genome (
str
orReferenceGenome
, optional) – Reference genome to use.contig_recoding (
dict
of (str
,str
), optional) – Mapping from contig name in VCF to contig name in loaded dataset. All contigs must be present in the reference_genome, so this is useful for mapping differently-formatted data onto known references.array_elements_required (
bool
) – IfTrue
, all elements in an array field must be present. Set this parameter toFalse
for Hail to allow array fields with missing values such as1,.,5
. In this case, the second element will be missing. However, in the case of a single missing element.
, the entire field will be missing and not an array with one missing element.skip_invalid_loci (
bool
) – IfTrue
, skip loci that are not consistent with reference_genome.entry_float_type (
HailType
) – Type of floating point entries in matrix table. Must be one of:tfloat32
ortfloat64
. Default:tfloat64
.filter (
str
, optional) – Line filter regex. A partial match results in the line being removed from the file. Applies before find_replace, if both are defined.find_replace ((
str
,str
)) – Line substitution regex. Functions likere.sub
, but obeys the exact semantics of Java’s String.replaceAll.n_partitions (
int
, optional) – Number of partitions. If both n_partitions and block_size are specified, n_partitions will be used.block_size (
int
, optional) – Block size, in MB. Default: 128MB blocks.
- Returns:
- hail.methods.export_vcf(dataset, output, append_to_header=None, parallel=None, metadata=None, *, tabix=False)[source]
Export a
MatrixTable
orTable
as a VCF file.Note
Requires the dataset to have a compound row key:
Examples
Export to VCF as a block-compressed file:
>>> hl.export_vcf(dataset, 'output/example.vcf.bgz')
Notes
export_vcf()
writes the dataset to disk in VCF format as described in the VCF 4.2 spec.Use the
.vcf.bgz
extension rather than.vcf
in the output file name for blocked GZIP compression.Note
We strongly recommended compressed (
.bgz
extension) and parallel output (parallel set to'separate_header'
or'header_per_shard'
) when exporting large VCFs.Hail exports the fields of struct info as INFO fields, the elements of
set<str>
filters as FILTERS, the value of str rsid as ID, and the value of float64 qual as QUAL. No other row fields are exported.The FORMAT field is generated from the entry schema, which must be a
tstruct
. There is a FORMAT field for each field of the struct. If dataset is aTable
, then there will be no FORMAT field and the output will be a sites-only VCF.INFO and FORMAT fields may be generated from Struct fields of type
tcall
,tint32
,tfloat32
,tfloat64
, ortstr
. If a field has typetint64
, every value must be a validint32
. Arrays and sets containing these types are also allowed but cannot be nested; for example,array<array<int32>>
is invalid. Arrays and sets are written with the same comma-separated format. Fields of typetbool
are also permitted in info and will generate INFO fields of VCF type Flag.Hail also exports the name, length, and assembly of each contig as a VCF header line, where the assembly is set to the
ReferenceGenome
name.Consider the workflow of importing a VCF and immediately exporting the dataset back to VCF. The output VCF header will contain FORMAT lines for each entry field and INFO lines for all fields in info, but these lines will have empty Description fields and the Number and Type fields will be determined from their corresponding Hail types. To output a desired Description, Number, and/or Type value in a FORMAT or INFO field or to specify FILTER lines, use the metadata parameter to supply a dictionary with the relevant information. See
get_vcf_metadata()
for how to obtain the dictionary corresponding to the original VCF, and for info on how this dictionary should be structured.The output VCF header will also contain CONTIG lines with ID, length, and assembly fields derived from the reference genome of the dataset.
The output VCF header will not contain lines added by external tools (such as bcftools and GATK) unless they are explicitly inserted using the append_to_header parameter.
Warning
INFO fields stored at VCF import are not automatically modified to reflect filtering of samples or genotypes, which can affect the value of AC (allele count), AF (allele frequency), AN (allele number), etc. If a filtered dataset is exported to VCF without updating info, downstream tools which may produce erroneous results. The solution is to create new fields in info or overwrite existing fields. For example, in order to produce an accurate AC field, one can run
variant_qc()
and copy the variant_qc.AC field to info.AC as shown below.>>> ds = dataset.filter_entries(dataset.GQ >= 20) >>> ds = hl.variant_qc(ds) >>> ds = ds.annotate_rows(info = ds.info.annotate(AC=ds.variant_qc.AC)) >>> hl.export_vcf(ds, 'output/example.vcf.bgz')
Warning
Do not export to a path that is being read from in the same pipeline.
- Parameters:
dataset (
MatrixTable
) – Dataset.output (
str
) – Path of .vcf or .vcf.bgz file to write.append_to_header (
str
, optional) – Path of file to append to VCF header.parallel (
str
, optional) – If'header_per_shard'
, return a set of VCF files (one per partition) rather than serially concatenating these files. If'separate_header'
, return a separate VCF header file and a set of VCF files (one per partition) without the header. IfNone
, concatenate the header and all partitions into one VCF file.metadata (
dict
[str
,dict
[str
,dict
[str
,str
]]], optional) – Dictionary with information to fill in the VCF header. Seeget_vcf_metadata()
for how this dictionary should be structured.tabix (
bool
, optional) – If true, writes a tabix index for the output VCF. Note: This feature is experimental, and the interface and defaults may change in future versions.
- hail.methods.export_elasticsearch(t, host, port, index, index_type, block_size, config=None, verbose=True)[source]
Export a
Table
to Elasticsearch.By default, this method supports Elasticsearch versions 6.8.x - 7.x.x. Older versions of elasticsearch will require recompiling hail.
Warning
export_elasticsearch()
is EXPERIMENTAL.Note
Table rows may be exported more than once. For example, if a task has to be retried after being preempted midway through processing a partition. To avoid duplicate documents in Elasticsearch, use a config with the es.mapping.id option set to a field that contains a unique value for each row.
- hail.methods.export_bgen(mt, output, gp=None, varid=None, rsid=None, parallel=None, compression_codec='zlib')[source]
Export MatrixTable as
MatrixTable
as BGEN 1.2 file with 8 bits of per probability. Also writes SAMPLE file.If parallel is
None
, the BGEN file is written tooutput + '.bgen'
. Otherwise,output + '.bgen'
will be a directory containing many BGEN files. In either case, the SAMPLE file is written tooutput + '.sample'
. For example,>>> hl.export_bgen(mt, '/path/to/dataset')
Will write two files: /path/to/dataset.bgen and /path/to/dataset.sample. In contrast,
>>> hl.export_bgen(mt, '/path/to/dataset', parallel='header_per_shard')
Will create /path/to/dataset.sample and will create
mt.n_partitions()
files into the directory /path/to/dataset.bgen/.Notes
The
export_bgen()
function requires genotype probabilities, either as an entry field of mt (of typearray<float64>
), or an entry expression passed in the gp argument.- Parameters:
mt (
MatrixTable
) – Input matrix table.output (
str
) – Root for output BGEN and SAMPLE files.gp (
ArrayExpression
of typetfloat64
, optional) – Expression for genotype probabilities. IfNone
, entry field GP is used if it exists and is of typetarray
with element typetfloat64
.varid (
StringExpression
, optional) – Expression for the variant ID. IfNone
, the row field varid is used if defined and is of typetstr
. The default and missing value ishl.delimit([mt.locus.contig, hl.str(mt.locus.position), mt.alleles[0], mt.alleles[1]], ':')
rsid (
StringExpression
, optional) – Expression for the rsID. IfNone
, the row field rsid is used if defined and is of typetstr
. The default and missing value is"."
.parallel (
str
, optional) – IfNone
, write a single BGEN file. If'header_per_shard'
, write a collection of BGEN files (one per partition), each with its own header. If'separate_header'
, write a file for each partition, without header, and a header file for the combined dataset. Note that the files produced by'separate_header'
are each individually invalid BGEN files, they can only be read if they are concatenated together with the header file.compresssion_codec (str, optional) – Compression codec. One of ‘zlib’, ‘zstd’.
- hail.methods.export_gen(dataset, output, precision=4, gp=None, id1=None, id2=None, missing=None, varid=None, rsid=None)[source]
Export a
MatrixTable
as GEN and SAMPLE files.Note
Requires the dataset to have a compound row key:
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Examples
Import genotype probability data, filter variants based on INFO score, and export data to a GEN and SAMPLE file:
>>> example_ds = hl.import_gen('data/example.gen', sample_file='data/example.sample') >>> example_ds = example_ds.filter_rows(agg.info_score(example_ds.GP).score >= 0.9) >>> hl.export_gen(example_ds, 'output/infoscore_filtered')
Notes
Writes out the dataset to a GEN and SAMPLE fileset in the Oxford spec.
- Parameters:
dataset (
MatrixTable
) – Dataset.output (
str
) – Filename root for output GEN and SAMPLE files.precision (
int
) – Number of digits to write after the decimal point.gp (
ArrayExpression
of typetfloat64
, optional) – Expression for the genotype probabilities to output. IfNone
, the entry field GP is used if defined and is of typetarray
with element typetfloat64
. The array length must be 3. The values at indices 0, 1, and 2 are exported as the probabilities of homozygous reference, heterozygous, and homozygous variant, respectively. The default and missing value is[0, 0, 0]
.id1 (
StringExpression
, optional) – Expression for the first column of the SAMPLE file. IfNone
, the column key of the dataset is used and must be one field of typetstr
.id2 (
StringExpression
, optional) – Expression for the second column of the SAMPLE file. IfNone
, the column key of the dataset is used and must be one field of typetstr
.missing (
NumericExpression
, optional) – Expression for the third column of the SAMPLE file, which is the sample missing rate. Values must be between 0 and 1.varid (
StringExpression
, optional) – Expression for the variant ID (2nd column of the GEN file). IfNone
, the row field varid is used if defined and is of typetstr
. The default and missing value ishl.delimit([dataset.locus.contig, hl.str(dataset.locus.position), dataset.alleles[0], dataset.alleles[1]], ':')
rsid (
StringExpression
, optional) – Expression for the rsID (3rd column of the GEN file). IfNone
, the row field rsid is used if defined and is of typetstr
. The default and missing value is"."
.
- hail.methods.export_plink(dataset, output, call=None, fam_id=None, ind_id=None, pat_id=None, mat_id=None, is_female=None, pheno=None, varid=None, cm_position=None)[source]
Export a
MatrixTable
as PLINK2 BED, BIM and FAM files.Note
Requires the dataset to be keyed by two fields:
Note
Requires the column key to be one field of type
tstr
Note
Requires the dataset to contain no multiallelic variants. Use
split_multi()
orsplit_multi_hts()
to split multiallelic sites, orMatrixTable.filter_rows()
to remove them.Note
Requires the dataset to contain only diploid and unphased genotype calls. Use
call()
to recode genotype calls ormissing()
to set genotype calls to missing.Examples
Import data from a VCF file, split multi-allelic variants, and export to PLINK files with the FAM file individual ID set to the sample ID:
>>> ds = hl.split_multi_hts(dataset) >>> hl.export_plink(ds, 'output/example', ind_id = ds.s)
Notes
On an imported VCF, the example above will behave similarly to the PLINK conversion command
plink --vcf /path/to/file.vcf --make-bed --out sample --const-fid --keep-allele-order
except that:
Variants that result from splitting a multi-allelic variant may be re-ordered relative to the BIM and BED files.
- Parameters:
dataset (
MatrixTable
) – Dataset.output (
str
) – Filename root for output BED, BIM, and FAM files.call (
CallExpression
, optional) – Expression for the genotype call to output. IfNone
, the entry field GT is used if defined and is of typetcall
.fam_id (
StringExpression
, optional) – Expression for the family ID. The default and missing values are'0'
.ind_id (
StringExpression
, optional) – Expression for the individual (proband) ID. IfNone
, the column key of the dataset is used and must be one field of typetstr
.pat_id (
StringExpression
, optional) – Expression for the paternal ID. The default and missing values are'0'
.mat_id (
StringExpression
, optional) – Expression for the maternal ID. The default and missing values are'0'
.is_female (
BooleanExpression
, optional) – Expression for the proband sex.True
is output as'2'
andFalse
is output as'1'
. The default and missing values are'0'
.pheno (
BooleanExpression
orNumericExpression
, optional) – Expression for the phenotype. If pheno is a boolean expression,True
is output as'2'
andFalse
is output as'1'
. The default and missing values are'NA'
.varid (
StringExpression
, optional) – Expression for the variant ID (2nd column of the BIM file). The default value ishl.delimit([dataset.locus.contig, hl.str(dataset.locus.position), dataset.alleles[0], dataset.alleles[1]], ':')
cm_position (
Float64Expression
, optional) – Expression for the 3rd column of the BIM file (position in centimorgans). The default value is0.0
. The missing value is0.0
.
- hail.methods.get_vcf_metadata(path)[source]
Extract metadata from VCF header.
Examples
>>> hl.get_vcf_metadata('data/example2.vcf.bgz') {'filter': {'LowQual': {'Description': ''}, ...}, 'format': {'AD': {'Description': 'Allelic depths for the ref and alt alleles in the order listed', 'Number': 'R', 'Type': 'Integer'}, ...}, 'info': {'AC': {'Description': 'Allele count in genotypes, for each ALT allele, in the same order as listed', 'Number': 'A', 'Type': 'Integer'}, ...}}
Notes
This method parses the VCF header to extract the ID, Number, Type, and Description fields from FORMAT and INFO lines as well as ID and Description for FILTER lines. For example, given the following header lines:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FILTER=<ID=LowQual,Description="Low quality"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
The resulting Python dictionary returned would be
metadata = {'filter': {'LowQual': {'Description': 'Low quality'}}, 'format': {'DP': {'Description': 'Read Depth', 'Number': '1', 'Type': 'Integer'}}, 'info': {'MQ': {'Description': 'RMS Mapping Quality', 'Number': '1', 'Type': 'Float'}}}
which can be used with
export_vcf()
to fill in the relevant fields in the header.