# Import / Export¶

• export_cassandra(table, address, keyspace, …) – Export a Table to Cassandra.
• export_gen(dataset, output[, precision]) – Export a MatrixTable as GEN and SAMPLE files.
• export_plink(dataset, output, **fam_args) – Export a MatrixTable as PLINK2 BED, BIM and FAM files.
• export_solr(table, zk_host, collection[, …]) – Export a Table to Solr.
• export_vcf(dataset, output[, …]) – Export a MatrixTable as a VCF file.
• get_vcf_metadata(path) – Extract metadata from VCF header.
• import_bed(path[, reference_genome, …]) – Import a UCSC BED file as a Table.
• import_bgen(path, entry_fields[, …]) – Import BGEN file(s) as a MatrixTable.
• index_bgen(path) – Index BGEN files as required by import_bgen().
• import_fam(path[, quant_pheno, delimiter, …]) – Import a PLINK FAM file into a Table.
• import_gen(path[, sample_file, tolerance, …]) – Import GEN file(s) as a MatrixTable.
• import_locus_intervals(path[, …]) – Import a locus interval list as a Table.
• import_matrix_table(paths[, row_fields, …]) – Import tab-delimited file(s) as a MatrixTable.
• import_plink(bed, bim, fam[, …]) – Import a PLINK dataset (BED, BIM, FAM) as a MatrixTable.
• import_table(paths[, key, min_partitions, …]) – Import delimited text file (text table) as Table.
• import_vcf(path[, force, force_bgz, …]) – Import VCF file(s) as a MatrixTable.
• read_matrix_table(path[, _drop_cols, _drop_rows]) – Read in a MatrixTable written with MatrixTable.write().
• read_table(path) – Read in a Table written with Table.write().
hail.methods.export_cassandra(table, address, keyspace, table_name, block_size=100, rate=1000)[source]

Export a Table to Cassandra.

Warning

export_cassandra() is EXPERIMENTAL.

hail.methods.export_gen(dataset, output, precision=4)[source]

Export a MatrixTable as GEN and SAMPLE files.

Note

Requires the dataset to be keyed by two fields: locus (tlocus) and alleles (tarray of tstr).

Also requires that locus is the partition key.

Note

Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Import genotype probability data, filter variants based on INFO score, and export data to a GEN and SAMPLE file:

>>> example_ds = hl.import_gen('data/example.gen', sample_file='data/example.sample')
>>> example_ds = example_ds.filter_rows(agg.info_score(example_ds.GP).score >= 0.9)
>>> hl.export_gen(example_ds, 'output/infoscore_filtered')


Notes

Writes out the dataset to a GEN and SAMPLE fileset in the Oxford spec.

This method requires a GP (genotype probabilities) entry field of type array<float64>. The values at indices 0, 1, and 2 are exported as the probabilities of homozygous reference, heterozygous, and homozygous variant, respectively. Missing GP values are exported as 0 0 0.
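The export rule above can be sketched in plain Python. This is an illustrative helper only; the exact number formatting Hail uses is an assumption based on the precision parameter, not Hail's actual output code:

```python
def format_gp(gp, precision=4):
    """Format one GP triple for a GEN line (illustrative sketch)."""
    if gp is None:
        return "0 0 0"  # missing GP values are exported as "0 0 0"
    return " ".join(f"{p:.{precision}f}" for p in gp)
```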

The first six columns of the GEN file are as follows:

• chromosome (locus.contig)
• variant ID (varid if defined, else Contig:Position:Ref:Alt)
• rsID (rsid if defined, else .)
• position (locus.position)
• reference allele (alleles[0])
• alternate allele (alleles[1])

The SAMPLE file has three columns:

• ID_1 and ID_2 are identical and set to the sample ID (s).
• The third column (missing) is set to 0 for all samples.
Parameters:
• dataset (MatrixTable) – Dataset with entry field GP of type array<float64>.
• output (str) – Filename root for output GEN and SAMPLE files.
• precision (int) – Number of digits to write after the decimal point.

hail.methods.export_plink(dataset, output, **fam_args)[source]

Export a MatrixTable as PLINK2 BED, BIM and FAM files.

Note

Requires the dataset to be keyed by two fields: locus (tlocus) and alleles (tarray of tstr).

Also requires that locus is the partition key.

Note

Requires the column key to be one field of type tstr.

Note

Requires the dataset to contain no multiallelic variants. Use SplitMulti or split_multi_hts() to split multiallelic sites, or MatrixTable.filter_rows() to remove them.

Examples

Import data from a VCF file, split multi-allelic variants, and export to PLINK files with the FAM file individual ID set to the sample ID:

>>> ds = hl.split_multi_hts(dataset)
>>> hl.export_plink(ds, 'output/example', id = ds.s)


Notes

fam_args may be used to set the fields in the output FAM file via expressions with column and global fields in scope:

If no assignment is given, the corresponding PLINK missing value is written: 0 for IDs and sex, NA for phenotype. Only one of is_case or quant_pheno can be assigned. For Boolean expressions, true and false are output as 2 and 1, respectively (i.e., female and case are 2).
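The Boolean encoding rule above can be sketched in plain Python (a hypothetical helper for illustration, not part of Hail):

```python
def plink_encode_bool(value, missing='0'):
    """Encode a Boolean FAM value per the rule above:
    True -> '2', False -> '1', None -> the PLINK missing value
    ('0' for sex; a case/control phenotype uses 'NA')."""
    if value is None:
        return missing
    return '2' if value else '1'
```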

The BIM file ID field has the form chr:pos:ref:alt with values given by v.contig, v.start, v.ref, and v.alt.

On an imported VCF, the example above will behave similarly to the PLINK conversion command

plink --vcf /path/to/file.vcf --make-bed --out sample --const-fid --keep-allele-order


except that:

• Variants that result from splitting a multi-allelic variant may be re-ordered relative to the BIM and BED files.
• PLINK uses the rsID for the BIM file ID.
Parameters:
• dataset (MatrixTable) – Dataset.
• output (str) – Filename root for output BED, BIM, and FAM files.
• fam_args (varargs of hail.expr.expressions.Expression) – Named expressions defining FAM field values.
hail.methods.export_solr(table, zk_host, collection, block_size=100)[source]

Export a Table to Solr.

Warning

export_solr() is EXPERIMENTAL.

hail.methods.export_vcf(dataset, output, append_to_header=None, parallel=None, metadata=None)[source]

Export a MatrixTable as a VCF file.

Note

Requires the dataset to be keyed by two fields: locus (tlocus) and alleles (tarray of tstr).

Also requires that locus is the partition key.

Examples

Export to VCF as a block-compressed file:

>>> hl.export_vcf(dataset, 'output/example.vcf.bgz')


Notes

export_vcf() writes the dataset to disk in VCF format as described in the VCF 4.2 spec.

Use the .vcf.bgz extension rather than .vcf in the output file name for blocked GZIP compression.

Note

We strongly recommend compressed (.bgz extension) and parallel output (parallel set to 'separate_header' or 'header_per_shard') when exporting large VCFs.

Hail exports the fields of struct info as INFO fields, the elements of set<str> filters as FILTERS, and the value of float64 qual as QUAL. No other row fields are exported.

The FORMAT field is generated from the entry schema, which must be a tstruct. There is a FORMAT field for each field of the Struct.

INFO and FORMAT fields may be generated from Struct fields of type tcall, tint32, tfloat32, tfloat64, or tstr. If a field has type tint64, every value must be a valid int32. Arrays and sets containing these types are also allowed but cannot be nested; for example, array<array<int32>> is invalid. Arrays and sets are written with the same comma-separated format. Fields of type tbool are also permitted in info and will generate INFO fields of VCF type Flag.

Hail also exports the name, length, and assembly of each contig as a VCF header line, where the assembly is set to the ReferenceGenome name.

Consider the workflow of importing a VCF and immediately exporting the dataset back to VCF. The output VCF header will contain FORMAT lines for each entry field and INFO lines for all fields in info, but these lines will have empty Description fields and the Number and Type fields will be determined from their corresponding Hail types. To output a desired Description, Number, and/or Type value in a FORMAT or INFO field or to specify FILTER lines, use the metadata parameter to supply a dictionary with the relevant information. See get_vcf_metadata() for how to obtain the dictionary corresponding to the original VCF, and for info on how this dictionary should be structured.

The output VCF header will also contain CONTIG lines with ID, length, and assembly fields derived from the reference genome of the dataset.

The output VCF header will not contain lines added by external tools (such as bcftools and GATK) unless they are explicitly inserted using the append_to_header parameter.

Warning

INFO fields stored at VCF import are not automatically modified to reflect filtering of samples or genotypes, which can affect the value of AC (allele count), AF (allele frequency), AN (allele number), etc. If a filtered dataset is exported to VCF without updating info, downstream tools may produce erroneous results. The solution is to create new fields in info or overwrite existing fields. For example, in order to produce an accurate AC field, one can run variant_qc() and copy the variant_qc.AC field to info.AC as shown below.

>>> ds = dataset.filter_entries(dataset.GQ >= 20)
>>> ds = hl.variant_qc(ds)
>>> ds = ds.annotate_rows(info = ds.info.annotate(AC=ds.variant_qc.AC))
>>> hl.export_vcf(ds, 'output/example.vcf.bgz')

Parameters:
• dataset (MatrixTable) – Dataset.
• output (str) – Path of .vcf or .vcf.bgz file to write.
• append_to_header (str, optional) – Path of file to append to VCF header.
• parallel (str, optional) – If 'header_per_shard', return a set of VCF files (one per partition) rather than serially concatenating these files. If 'separate_header', return a separate VCF header file and a set of VCF files (one per partition) without the header. If None, concatenate the header and all partitions into one VCF file.
• metadata (dict[str, dict[str, dict[str, str]]], optional) – Dictionary with information to fill in the VCF header. See get_vcf_metadata() for how this dictionary should be structured.
hail.methods.get_vcf_metadata(path)[source]

Extract metadata from VCF header.

Examples

>>> metadata = hl.get_vcf_metadata('data/example2.vcf.bgz')
{'filter': {'LowQual': {'Description': ''}, ...},
 'format': {'AD': {'Description': 'Allelic depths for the ref and alt alleles in the order listed',
                   'Number': 'R',
                   'Type': 'Integer'}, ...},
 'info': {'AC': {'Description': 'Allele count in genotypes, for each ALT allele, in the same order as listed',
                 'Number': 'A',
                 'Type': 'Integer'}, ...}}


Notes

This method parses the VCF header to extract the ID, Number, Type, and Description fields from FORMAT and INFO lines as well as ID and Description for FILTER lines. For example, given the following header lines:

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FILTER=<ID=LowQual,Description="Low quality">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">


The resulting Python dictionary returned would be

metadata = {'filter': {'LowQual': {'Description': 'Low quality'}},
            'format': {'DP': {'Description': 'Read Depth',
                              'Number': '1',
                              'Type': 'Integer'}},
            'info': {'MQ': {'Description': 'RMS Mapping Quality',
                            'Number': '1',
                            'Type': 'Float'}}}


which can be used with export_vcf() to fill in the relevant fields in the header.

Parameters:
• path (str) – VCF file(s) to read. If more than one file is given, the first file is used.

Returns: dict of str to (dict of str to (dict of str to str))
hail.methods.import_bed(path, reference_genome='default', skip_invalid_intervals=False) → hail.table.Table[source]

Import a UCSC BED file as a Table.

Examples

The file formats are:

$ cat data/file1.bed
track name="BedTest"
20    1          14000000
20    17000000   18000000
...

$ cat data/file2.bed
track name="BedTest"
20    1          14000000  cnv1
20    17000000   18000000  cnv2
...


Add the row field cnv_region indicating inclusion in at least one interval of the three-column BED file:

>>> bed = hl.import_bed('data/file1.bed')
>>> result = dataset.annotate_rows(cnv_region = hl.is_defined(bed[dataset.locus]))


Add a row field cnv_id with the value given by the fourth column of a BED file:

>>> bed = hl.import_bed('data/file2.bed')
>>> result = dataset.annotate_rows(cnv_id = bed[dataset.locus].target)


Notes

The table produced by this method has one of two possible structures. If the .bed file has only three fields (chrom, chromStart, and chromEnd), then the produced table has only one column:

If the .bed file has four or more columns, then Hail will store the fourth column as a row field in the table:

UCSC BED files can have up to 12 fields, but Hail will only ever look at the first four. Hail ignores header lines in BED files.

Warning

UCSC BED files are 0-indexed and end-exclusive. The line “5 100 105” will contain locus 5:105 but not 5:100.
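The coordinate conversion implied by the warning above can be sketched as a one-line helper (the function name is illustrative):

```python
def bed_to_inclusive_interval(bed_start, bed_end):
    """Convert 0-indexed, end-exclusive BED coordinates to a 1-indexed,
    end-inclusive locus range: BED line "5 100 105" covers 5:101-5:105."""
    return bed_start + 1, bed_end
```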

Parameters:
• path (str) – Path to .bed file.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• skip_invalid_intervals (bool) – If True and reference_genome is not None, skip lines with intervals that are not consistent with the reference genome.

Returns: Table – Interval-keyed table.
hail.methods.import_bgen(path, entry_fields, sample_file=None, min_partitions=None, reference_genome='default', contig_recoding=None, tolerance=0.2) → hail.matrixtable.MatrixTable[source]

Import BGEN file(s) as a MatrixTable.

Examples

Import a BGEN file as a matrix table with GT and GP entry fields, renaming contig name “01” to “1”:

>>> ds_result = hl.import_bgen("data/example.8bits.bgen",
...                            entry_fields=['GT', 'GP'],
...                            sample_file="data/example.8bits.sample",
...                            contig_recoding={"01": "1"})


Import a BGEN file as a matrix table with genotype dosage entry field, renaming contig name “01” to “1”:

>>> ds_result = hl.import_bgen("data/example.8bits.bgen",
...                             entry_fields=['dosage'],
...                             sample_file="data/example.8bits.sample",
...                             contig_recoding={"01": "1"})


Notes

Hail supports importing data from v1.1 and v1.2 of the BGEN file format. For v1.2, genotypes must be unphased and diploid, and genotype probability blocks must be compressed with zlib or uncompressed. If entry_fields includes 'dosage', all variants must be bi-allelic.

Each BGEN file must have a corresponding index file, which can be generated with index_bgen(). To load multiple files at the same time, use Hadoop Glob Patterns.

Column Fields

• s (tstr) – Column key. This is the sample ID imported from the first column of the sample file if given. Otherwise, the sample ID is taken from the sample identifying block in the first BGEN file if it exists; otherwise, IDs _0, _1, …, _N are assigned.

Row Fields

Entry Fields

Up to three entry fields are created, as determined by entry_fields which must be non-empty. For best performance, include precisely those fields required for your analysis. For BGEN v1.1 files, all entry fields are set to missing if the sum of the genotype probabilities is a distance greater than tolerance from 1.0.

• GT (tcall) – The hard call corresponding to the genotype with the greatest probability.
• GP (tarray of tfloat64) – Genotype probabilities as defined by the BGEN file spec. For bi-allelic variants, the array has three elements giving the probabilities of homozygous reference, heterozygous, and homozygous alternate genotype, in that order. For v1.2 files, no modifications are made to these genotype probabilities. For v1.1 files, the probabilities are normalized to sum to 1.0. For example, [0.98, 0.0, 0.0] is normalized to [1.0, 0.0, 0.0].
• dosage (tfloat64) – The expected value of the number of alternate alleles, given by the probability of heterozygous genotype plus twice the probability of homozygous alternate genotype. All variants must be bi-allelic.
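The dosage definition above is just an expected value over the GP array; a minimal pure-Python sketch (illustrative helper name, not Hail's implementation):

```python
def dosage_from_gp(gp):
    """Expected number of alternate alleles for a bi-allelic variant:
    P(het) + 2 * P(hom alt)."""
    return gp[1] + 2 * gp[2]
```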
Parameters:
• path (str or list of str) – BGEN file(s) to read.
• entry_fields (list of str) – List of entry fields to create. Options: 'GT', 'GP', 'dosage'.
• sample_file (str, optional) – Sample file to read the sample ids from. If specified, the number of samples in the file must match the number in the BGEN file(s).
• min_partitions (int, optional) – Number of partitions.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• contig_recoding (dict of str to str, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given by reference_genome.
• tolerance (float) – If the sum of the probabilities for an entry differs from 1.0 by more than the tolerance, set the entry to missing. Only applicable to v1.1.

Returns: MatrixTable
hail.methods.index_bgen(path)[source]

Index BGEN files as required by import_bgen().

The index file is generated in the same directory as path with the filename of path appended by .idx.

Example

>>> hl.index_bgen("data/example.8bits.bgen")


Warning

While this method parallelizes over a list of BGEN files, each file is indexed serially by one core. Indexing several BGEN files on a large cluster is a waste of resources, so indexing should generally be done once, separately from large analyses.

Parameters:
• path (str or list of str) – .bgen files to index.
hail.methods.import_fam(path, quant_pheno=False, delimiter='\\s+', missing='NA') → hail.table.Table[source]

Import a PLINK FAM file into a Table.

Examples

Import a tab-separated FAM file with a case-control phenotype:

>>> fam_kt = hl.import_fam('data/case_control_study.fam')


Import a FAM file with a quantitative phenotype:

>>> fam_kt = hl.import_fam('data/quantitative_study.fam', quant_pheno=True)


Notes

In Hail, unlike PLINK, the user must explicitly distinguish between case-control and quantitative phenotypes. Importing a quantitative phenotype without quant_pheno=True will return an error (unless all values happen to be 0, 1, 2, or -9):

The resulting Table contains one of the following phenotype fields, depending on quant_pheno:

• is_case (tbool) – Case-control phenotype (missing = “0”, “-9”, non-numeric, or the missing argument, if given).
• quant_pheno (tfloat64) – Quantitative phenotype (missing = “NA” or the missing argument, if given).
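The case-control parsing rule can be sketched in plain Python (a hypothetical helper; Hail's internal parsing may differ):

```python
def parse_is_case(field, missing='NA'):
    """Parse a FAM case-control phenotype string:
    '2' -> case (True), '1' -> control (False);
    '0', '-9', non-numeric, or the missing string -> missing (None)."""
    if field == '2':
        return True
    if field == '1':
        return False
    return None
```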
Parameters:
• path (str) – Path to FAM file.
• quant_pheno (bool) – If True, phenotype is interpreted as quantitative.
• delimiter (str) – Field delimiter regex.
• missing (str) – The string used to denote missing values. For case-control, 0, -9, and non-numeric are also treated as missing.

Returns: Table
hail.methods.import_gen(path, sample_file=None, tolerance=0.2, min_partitions=None, chromosome=None, reference_genome='default', contig_recoding=None) → hail.matrixtable.MatrixTable[source]

Import GEN file(s) as a MatrixTable.

Examples

>>> ds = hl.import_gen('data/example.gen',
...                    sample_file='data/example.sample')


Notes

If the GEN file has only 5 columns before the start of the genotype probability data (chromosome field is missing), you must specify the chromosome using the chromosome parameter.

To load multiple files at the same time, use Hadoop Glob Patterns.

Column Fields

• s (tstr) – Column key. This is the sample ID imported from the first column of the sample file.

Row Fields

• locus (tlocus or tstruct) – Row key. The genomic location consisting of the chromosome (1st column if present, otherwise given by chromosome) and position (3rd column if chromosome is not defined). If reference_genome is defined, the type will be tlocus parameterized by reference_genome. Otherwise, the type will be a tstruct with two fields: contig with type tstr and position with type tint32.
• alleles (tarray of tstr) – Row key. An array containing the alleles of the variant. The reference allele (4th column if chromosome is not defined) is the first element of the array and the alternate allele (5th column if chromosome is not defined) is the second element.
• varid (tstr) – The variant identifier. 2nd column of GEN file if chromosome present, otherwise 1st column.
• rsid (tstr) – The rsID. 3rd column of GEN file if chromosome present, otherwise 2nd column.

Entry Fields

• GT (tcall) – The hard call corresponding to the genotype with the highest probability.
• GP (tarray of tfloat64) – Genotype probabilities as defined by the GEN file spec. The array is set to missing if the sum of the probabilities is a distance greater than the tolerance parameter from 1.0. Otherwise, the probabilities are normalized to sum to 1.0. For example, the input [0.98, 0.0, 0.0] will be normalized to [1.0, 0.0, 0.0].
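The tolerance check and normalization described above can be sketched as (illustrative, not Hail's implementation):

```python
def normalize_gp(gp, tolerance=0.2):
    """Return normalized genotype probabilities, or None (missing) when
    the sum is farther than `tolerance` from 1.0."""
    total = sum(gp)
    if abs(total - 1.0) > tolerance:
        return None
    return [p / total for p in gp]
```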
Parameters:
• path (str or list of str) – GEN files to import.
• sample_file (str) – Sample file to import.
• tolerance (float) – If the sum of the genotype probabilities for a genotype differs from 1.0 by more than the tolerance, set the genotype to missing.
• min_partitions (int, optional) – Number of partitions.
• chromosome (str, optional) – Chromosome if not included in the GEN file.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• contig_recoding (dict of str to str, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given by reference_genome.

Returns: MatrixTable
hail.methods.import_locus_intervals(path, reference_genome='default', skip_invalid_intervals=False) → hail.table.Table[source]

Import a locus interval list as a Table.

Examples

Add the row field capture_region indicating inclusion in at least one locus interval from capture_intervals.txt:

>>> intervals = hl.import_locus_intervals('data/capture_intervals.txt')
>>> result = dataset.annotate_rows(capture_region = hl.is_defined(intervals[dataset.locus]))


Notes

Hail expects an interval file to contain either one, three or five fields per line in the following formats:

• contig:start-end
• contig  start  end (tab-separated)
• contig  start  end  direction  target (tab-separated)

A file in either of the first two formats produces a table with one field:

A file in the third format (with a “target” column) produces a table with two fields:

If reference_genome is defined AND the file has one field, intervals are parsed with parse_locus_interval(). See the documentation for valid inputs.

If reference_genome is NOT defined and the file has one field, intervals are parsed with the regex "([^:]*):(\d+)\-(\d+)" where contig, start, and end match each of the three capture groups. start and end match positions inclusively, e.g. start <= position <= end.
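For example, applying the same regex with Python's re module to a one-field line:

```python
import re

# The regex quoted above, used when reference_genome is not defined
INTERVAL_RE = re.compile(r"([^:]*):(\d+)\-(\d+)")

m = INTERVAL_RE.match("20:1-14000000")
contig, start, end = m.group(1), int(m.group(2)), int(m.group(3))
# start and end match positions inclusively
```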

For files with three or five fields, start and end match positions inclusively, e.g. start <= position <= end.

Parameters:
• path (str) – Path to file.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• skip_invalid_intervals (bool) – If True and reference_genome is not None, skip lines with intervals that are not consistent with the reference genome.

Returns: Table – Interval-keyed table.
hail.methods.import_matrix_table(paths, row_fields={}, row_key=[], entry_type=dtype('int32'), missing='NA', min_partitions=None, no_header=False, force_bgz=False) → hail.matrixtable.MatrixTable[source]

Import tab-delimited file(s) as a MatrixTable.

Examples

Consider the following file containing counts from a RNA sequencing dataset:

$ cat data/matrix1.tsv
Barcode Tissue  Days  GENE1  GENE2  GENE3  GENE4
TTAGCCA brain   1.0   0      0      1      0
ATCACTT kidney  5.5   3      0      2      0
CTCTTCT kidney  2.5   0      0      0      1
CTATATA brain   7.0   0      0      3      0


The fields Barcode and Tissue contain strings and the field Days contains floating-point numbers. To import this matrix:

>>> matrix1 = hl.import_matrix_table('data/matrix1.tsv',
...                                  row_fields={'Barcode': hl.tstr, 'Tissue': hl.tstr, 'Days': hl.tfloat32},
...                                  row_key='Barcode')
>>> matrix1.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'col_id': str
----------------------------------------
Row fields:
    'Barcode': str
    'Tissue': str
    'Days': float32
----------------------------------------
Entry fields:
    'x': int32
----------------------------------------
Column key:
    'col_id': str
Row key:
    'Barcode': str
Partition key:
    'Barcode': str
----------------------------------------


In this example, the header information is missing for the row fields, but the column IDs are still present:

$ cat data/matrix2.tsv
GENE1   GENE2   GENE3   GENE4
TTAGCCA brain   1.0     0       0       1       0
ATCACTT kidney  5.5     3       0       2       0
CTCTTCT kidney  2.5     0       0       0       1
CTATATA brain   7.0     0       0       3       0


The row fields get imported as f0, f1, and f2, so we need to do:

>>> matrix2 = hl.import_matrix_table('data/matrix2.tsv',
...                                  row_fields={'f0': hl.tstr, 'f1': hl.tstr, 'f2':hl.tfloat32},
...                                  row_key='f0')
>>> matrix2 = matrix2.rename({'f0': 'Barcode', 'f1': 'Tissue', 'f2': 'Days'})


Sometimes, the header and row information is missing completely:

$ cat data/matrix3.tsv
0    0    1    0
3    0    2    0
0    0    0    1
0    0    3    0


>>> matrix3 = hl.import_matrix_table('data/matrix3.tsv', no_header=True)


In this case, the file has no row fields, so we use the default row index as a key for the imported matrix table.

Notes

The resulting matrix table has the following structure:

• The row fields are named as specified in the column header. If they are missing from the header or no_header=True, row field names are set to the strings f0, f1, … (0-indexed) in column order. The types of all row fields must be specified in the row_fields argument.
• The row key is taken from the row_key argument, and must be a subset of row fields. If left empty, the row key will be a new row field row_idx of type int, whose values 0, 1, … index the original rows of the matrix.
• There is one column field, col_id, which is a key field of type str or int. By default, its values are the strings given by the corresponding column names in the header line. If no_header=True, column IDs are set to integers 0, 1, … (also 0-indexed) in column order.
• There is one entry field, x, that contains the data from the imported matrix.

All columns to be imported as row fields must be at the start of the row. Unlike import_table(), no type imputation is done, so types must be specified for all columns that should be imported as row fields. (The other columns are imported as entries in the matrix.)

The header information for row fields is allowed to be missing if the column IDs are present, but the header must then consist only of tab-delimited column IDs (no row field names).

Parameters:
• paths (str or list of str) – Files to import.
• row_fields (dict of str to HailType) – Columns to take as row fields in the MatrixTable. They must be located before all entry columns.
• row_key (str or list of str) – Key field(s). If empty, creates an index row_id to use as key.
• entry_type (HailType) – Type of entries in matrix table. Must be one of: tint32, tint64, tfloat32, tfloat64, or tstr. Default: tint32.
• missing (str) – Identifier to be treated as missing. Default: NA.
• min_partitions (int or None) – Minimum number of partitions.
• no_header (bool) – If True, assume the file has no header and name the row fields f0, f1, … fK (0-indexed) and the column keys 0, 1, … N.
• force_bgz (bool) – If True, load .gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec.

Returns: MatrixTable – MatrixTable constructed from imported data.

hail.methods.import_plink(bed, bim, fam, …) → hail.matrixtable.MatrixTable[source]

Import a PLINK dataset (BED, BIM, FAM) as a MatrixTable.

Examples

>>> ds = hl.import_plink(bed="data/test.bed",
...                      bim="data/test.bim",
...                      fam="data/test.fam")


Notes

Only binary SNP-major mode files can be read into Hail. To convert your file from individual-major mode to SNP-major mode, use PLINK to read in your fileset and use the --make-bed option.

Hail ignores the centimorgan position (column 3 in the BIM file).

Hail uses the individual ID (column 2 in the FAM file) as the sample ID (s). The individual IDs must be unique.

The resulting MatrixTable has the following fields:

Row Fields

Column Fields

• s (tstr) – Column 2 in the FAM file (key field).
• fam_id (tstr) – Column 1 in the FAM file. Set to missing if ID equals “0”.
• pat_id (tstr) – Column 3 in the FAM file. Set to missing if ID equals “0”.
• mat_id (tstr) – Column 4 in the FAM file. Set to missing if ID equals “0”.
• is_female (tbool) – Column 5 in the FAM file. Set to missing if value equals “-9”, “0”, or “N/A”. Set to true if value equals “2”. Set to false if value equals “1”.
• is_case (tbool) – Column 6 in the FAM file. Only present if quant_pheno equals False. Set to missing if value equals “-9”, “0”, “N/A”, or the value specified by missing. Set to true if value equals “2”. Set to false if value equals “1”.
• quant_pheno (tfloat64) – Column 6 in the FAM file. Only present if quant_pheno equals True. Set to missing if value equals missing.

Entry Fields

Parameters:
• bed (str) – PLINK BED file.
• bim (str) – PLINK BIM file.
• fam (str) – PLINK FAM file.
• min_partitions (int, optional) – Number of partitions.
• missing (str) – String used to denote missing values only for the phenotype field. This is in addition to “-9”, “0”, and “N/A” for case-control phenotypes.
• delimiter (str) – FAM file field delimiter regex.
• quant_pheno (bool) – If true, FAM phenotype is interpreted as quantitative.
• a2_reference (bool) – If True, A2 is treated as the reference allele. If False, A1 is treated as the reference allele.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• contig_recoding (dict of str to str, optional) – Dict of old contig name to new contig name. The new contig name must be in the reference genome given by reference_genome.

Returns: MatrixTable

hail.methods.import_table(paths, key=(), min_partitions=None, impute=False, no_header=False, comment=(), delimiter='\t', missing='NA', types={}, quote=None, skip_blank_lines=False) → hail.table.Table[source]

Import delimited text file (text table) as Table.

The resulting Table will have no key fields. Use Table.key_by() to specify keys.

Examples

Consider this file:

$ cat data/samples1.tsv
Sample     Height  Status  Age
PT-1236    160.9   Control 19
PT-1239    170.3   Control 55


The field Height contains floating-point numbers and the field Age contains integers.

To import this table using field types:

>>> table = hl.import_table('data/samples1.tsv',
...                              types={'Height': hl.tfloat64, 'Age': hl.tint32})


Note

Sample and Status need no type, because tstr is the default type.

To import a table using type imputation (which causes the file to be parsed twice):

>>> table = hl.import_table('data/samples1.tsv', impute=True)


Detailed examples

Let’s import fields from a CSV file with missing data and special characters:

$ cat data/samples2.tsv
Batch,PT-ID
1kg,PT-0001
1kg,PT-0002
study1,PT-0003
study3,PT-0003
.,PT-0004
1kg,PT-0005
.,PT-0006
1kg,PT-0007


In this case, we should:

• Pass the non-default delimiter ,
• Pass the non-default missing value .
>>> table = hl.import_table('data/samples2.tsv', delimiter=',', missing='.')


Let’s import a table from a file with no header and sample IDs that need to be transformed. Suppose the sample IDs are of the form NA#####. This file has no header line, and the sample ID is hidden in a field with other information.

To import:

>>> t = hl.import_table('data/samples3.tsv', no_header=True)
>>> t = t.annotate(sample = t.f0.split("_")[1]).key_by('sample')


Notes

The impute parameter tells Hail to scan the file an extra time to gather information about possible field types. While this is a bit slower for large files because the file is parsed twice, the convenience is often worth this cost.

The delimiter parameter is either a delimiter character (if a single character) or a field separator regex (2 or more characters). This regex follows the Java regex standard.

Note

Use delimiter='\s+' to specify whitespace delimited files.

If set, the comment parameter causes Hail to skip any line that starts with the given string(s). For example, passing comment='#' will skip any line beginning in a pound sign. If the string given is a single character, Hail will skip any line beginning with the character. Otherwise if the length of the string is greater than 1, Hail will interpret the string as a regex and will filter out lines matching the regex. For example, passing comment=['#', '^track.*'] will filter out lines beginning in a pound sign and any lines that match the regex '^track.*'.
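The single-character vs. regex behavior described above can be sketched in plain Python (a hypothetical helper, not Hail's code):

```python
import re

def is_comment(line, comments):
    """True if `line` should be skipped: single-character entries match
    as literal line prefixes, longer entries are treated as regexes."""
    for c in comments:
        if len(c) == 1:
            if line.startswith(c):
                return True
        elif re.match(c, line):
            return True
    return False
```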

The missing parameter defines the representation of missing data in the table.

Note

The missing parameter is NOT a regex. The comment parameter is treated as a regex ONLY if the length of the string is greater than 1 (not a single character).

The no_header parameter indicates that the file has no header line. If this option is passed, then the field names will be f0, f1, … fN (0-indexed).

The types parameter allows the user to pass the types of fields in the table. It is a dict keyed by str, with HailType values. See the examples above for standard usage. This option can also be used to override type imputation. For example, if the field Chromosome only contains the values 1 through 22, it will be imputed to have type tint32, whereas most Hail methods expect a chromosome field to be of type tstr. Setting impute=True and types={'Chromosome': hl.tstr} solves this problem.
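The interaction between impute and types amounts to the explicit types taking precedence over the imputed ones, as in this dict-merge sketch (illustrative only, with hypothetical Position and Gene fields; this is not Hail's code):

```python
# Types the imputation pass might infer for three fields.
imputed = {"Chromosome": "int32", "Position": "int32", "Gene": "str"}
# Analogous to passing types={'Chromosome': hl.tstr}.
explicit = {"Chromosome": "str"}
# Explicit types win over imputed ones.
final = {**imputed, **explicit}
print(final)  # {'Chromosome': 'str', 'Position': 'int32', 'Gene': 'str'}
```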

Parameters:

• paths (str or list of str) – Files to import.
• key (str or list of str) – Key field(s).
• min_partitions (int or None) – Minimum number of partitions.
• no_header (bool) – If True, assume the file has no header and name the N fields f0, f1, … fN (0-indexed).
• impute (bool) – If True, impute field types from the file.
• comment (str or list of str) – Skip lines beginning with the given string if the string is a single character. Otherwise, skip lines that match the regex specified.
• delimiter (str) – Field delimiter regex.
• missing (str) – Identifier to be treated as missing.
• types (dict mapping str to HailType) – Dictionary defining field types.
• quote (str or None) – Quote character.
• skip_blank_lines (bool) – If True, ignore empty lines. Otherwise, throw an error if an empty line is found.

Returns: Table
hail.methods.import_vcf(path, force=False, force_bgz=False, header_file=None, min_partitions=None, drop_samples=False, call_fields=[], reference_genome='default', contig_recoding=None) → hail.matrixtable.MatrixTable[source]

Import VCF file(s) as a MatrixTable.

Examples

>>> ds = hl.import_vcf('data/example2.vcf.bgz')


Notes

Hail is designed to be maximally compatible with files in the VCF v4.2 spec.

import_vcf() takes a list of VCF files to load. All files must have the same header and the same set of samples in the same order (e.g., a dataset split by chromosome). Files can be specified as Hadoop glob patterns.

Ensure that the VCF file is correctly prepared for import: VCFs should either be uncompressed (.vcf) or block compressed (.vcf.bgz). If you have a large compressed VCF that ends in .vcf.gz, the file is most likely block compressed, and you should rename it to .vcf.bgz. If you actually have a standard gzipped file, it is possible to import it into Hail using the force parameter. However, this is not recommended: all parsing will take place on one node, because gzip decompression is not parallelizable, and import will take significantly longer.
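One way to check whether a .vcf.gz file is actually block compressed is to look at its gzip header: BGZF blocks set the FEXTRA flag and carry a 'BC' extra subfield, which plain gzip output lacks. This helper is not part of Hail; it is a sketch based on the BGZF specification:

```python
def looks_block_compressed(path):
    # BGZF files start with a gzip header whose FEXTRA flag (bit 2 of
    # byte 3) is set and whose extra field begins with the 'BC' subfield.
    with open(path, "rb") as f:
        header = f.read(14)
    return (len(header) == 14
            and header[:2] == b"\x1f\x8b"
            and header[3] & 0x04 != 0
            and header[12:14] == b"BC")
```

If this returns True, renaming the file to .vcf.bgz should be safe; if it returns False, the file is ordinary gzip and will need force (serial import) or recompression with bgzip.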

import_vcf() does not perform deduplication: if the provided VCF(s) contain multiple records with the same chrom, pos, ref, and alt, all of these records are imported as separate rows and are not collapsed into a single variant.
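Because duplicates are kept, it can be worth checking for them up front. In plain Python terms (an illustration of the behavior, not Hail API), duplicated (chrom, pos, ref, alt) keys simply appear as multiple rows:

```python
from collections import Counter

# Hypothetical records keyed by (chrom, pos, ref, alt).
records = [("1", 100, "A", "T"),
           ("1", 100, "A", "T"),
           ("1", 200, "G", "C")]
counts = Counter(records)
dups = [k for k, n in counts.items() if n > 1]
print(dups)  # [('1', 100, 'A', 'T')] -- would be imported as two rows
```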

Note

Using the FILTER field:

The information in the FILTER field of a VCF is contained in the filters row field. This annotation is a set<str> and can be queried for filter membership with expressions like ds.filters.contains("VQSRTranche99.5..."). Variants that are flagged as “PASS” will have no filters applied; for these variants, hl.len(ds.filters) is 0. Thus, filtering to PASS variants can be done with MatrixTable.filter_rows() as follows:

>>> pass_ds = dataset.filter_rows(hl.len(dataset.filters) == 0)


Column Fields

• s (str) – Column key. This is the sample ID.

Row Fields

• locus (tlocus) – Row key. The genomic locus (CHROM and POS fields).
• alleles (tarray of tstr) – Row key. An array containing the reference allele (REF field) followed by the alternate alleles (ALT field).
• rsid (tstr) – The ID field.
• qual (tfloat64) – The QUAL field.
• filters (tset of tstr) – The FILTER field.
• info (tstruct) – The INFO field, with one field per declared INFO header line.

Entry Fields

import_vcf() generates an entry field for each FORMAT field declared in the VCF header. The types of these fields are generated according to the same rules as INFO fields, with one difference: "GT" and other fields specified in call_fields will be read as tcall.

Parameters:

• path (str or list of str) – VCF file(s) to read.
• force (bool) – If True, load .vcf.gz files serially. No downstream operations can be parallelized, so this mode is strongly discouraged.
• force_bgz (bool) – If True, load .vcf.gz files as blocked gzip files, assuming that they were actually compressed using the BGZ codec.
• header_file (str, optional) – Optional header override file. If not specified, the first file in path is used.
• min_partitions (int, optional) – Minimum partitions to load per file.
• drop_samples (bool) – If True, create sites-only dataset. Don't load sample IDs or entries.
• call_fields (list of str) – List of FORMAT fields to load as tcall. "GT" is loaded as a call automatically.
• reference_genome (str or ReferenceGenome, optional) – Reference genome to use.
• contig_recoding (dict of (str, str)) – Mapping from contig name in VCF to contig name in loaded dataset. All contigs must be present in the reference_genome, so this is useful for mapping differently-formatted data onto known references.

Returns: MatrixTable
hail.methods.read_matrix_table(path, _drop_cols=False, _drop_rows=False) → hail.matrixtable.MatrixTable[source]

Read in a MatrixTable written with MatrixTable.write().

Parameters: path (str) – File to read. Returns: MatrixTable
hail.methods.read_table(path) → hail.table.Table[source]

Read in a Table written with Table.write().

Parameters: path (str) – File to read. Returns: Table