Change Log¶
Frequently Asked Questions¶
With a version like 0.x, is Hail ready for use in publications?¶
Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.
What stability is guaranteed?¶
We do not intentionally break back-compatibility of interfaces or file formats. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. The exception to this rule is experimental functionality, denoted as such in the reference documentation, which may change at any time.
Please note that forward compatibility should not be expected, especially relating to file formats: this means that it may not be possible to use an earlier version of Hail to read files written in a later version.
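As a rough illustration, the two guarantees above can be expressed as a version comparison. This is a plain-Python sketch, not part of Hail: the helper names `parse_version` and `read_expected_to_work` are hypothetical, and the sketch assumes versions are plain dotted integers such as 0.2.28.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version string such as '0.2.28' into (0, 2, 28)."""
    return tuple(int(part) for part in version.split("."))


def read_expected_to_work(reader: str, writer: str) -> bool:
    """True when `reader` is in the same 0.x series as `writer` and is not older.

    Hypothetical helper: it only encodes the stated policy (newer releases read
    older files; older releases may fail on newer files).
    """
    r, w = parse_version(reader), parse_version(writer)
    same_series = r[:2] == w[:2]
    return same_series and r >= w
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.2.9" would sort after "0.2.10".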
Version 0.2.28¶
Released 2019-11-22
Critical correctness bug fix¶
- (#7588) Fixes a bug where filtering old matrix tables in newer versions of Hail did not work as expected. Please update from 0.2.27.
Bug fixes¶
New Features¶
Version 0.2.27¶
Released 2019-11-15
New Features¶
- (#7379) Add delimiter argument to hl.import_matrix_table.
- (#7389) Add force and force_bgz arguments to hl.experimental.import_gtf.
- (#7386)(#7394) Add {Table, MatrixTable}.tail.
- (#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.
- (#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.
- (#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
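The "return missing on failure" behavior described for the parse functions can be modeled in plain Python, with None standing in for a missing value. This is only a sketch of the semantics, not Hail code: the function names are hypothetical, and Python's int() and float() may accept or reject some strings differently than Hail does.

```python
from typing import Optional


def parse_int_or_missing(text: str) -> Optional[int]:
    """Return the parsed integer, or None (standing in for 'missing') on failure."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_float_or_missing(text: str) -> Optional[float]:
    """Return the parsed float, or None (standing in for 'missing') on failure."""
    try:
        return float(text)
    except ValueError:
        return None
```

Returning a missing value instead of raising lets one malformed field pass through a pipeline without aborting it.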
Bug fixes¶
Version 0.2.25¶
Released 2019-10-14
New features¶
- (#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.
- (#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).
- (#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.
- (#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().
- (#7294) Add HTML output for hl.summarize_variants().
Bug fixes¶
Version 0.2.24¶
Released 2019-10-03
hailctl dataproc¶
- (#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.
New features¶
- (#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.
- (#7121) Add hl.contig_length function.
- (#7130) Add window method on LocusExpression, which creates an interval around a locus.
- (#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.
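To make "an interval around a locus" concrete, here is a plain-Python sketch of the idea. It does not call Hail: the contig name and length are made up, and the choices to use a closed interval and to clip at the contig boundaries are assumptions of this sketch rather than statements about the window method itself.

```python
from typing import Dict, Tuple

# Hypothetical contig lengths, for illustration only.
CONTIG_LENGTHS: Dict[str, int] = {"toy1": 1000}


def window(contig: str, position: int, before: int, after: int) -> Tuple[str, int, int]:
    """Closed interval [start, end] around a 1-based position, clipped to the contig."""
    length = CONTIG_LENGTHS[contig]
    start = max(1, position - before)
    end = min(length, position + after)
    return contig, start, end
```

A contig-length lookup is what makes the clipping at the far end possible.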
Version 0.2.23¶
Released 2019-09-23
hailctl dataproc¶
Bug fixes¶
New features¶
- (#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.
- (#7076) Improved memory management in generated code, and added log statements about allocated memory to improve debugging.
- (#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).
- (#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.
Version 0.2.20¶
Released 2019-08-19
Critical memory management fix¶
- (#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.
Bug fixes¶
Version 0.2.19¶
Released 2019-08-01
Critical performance bug fix¶
Bug fixes¶
- (#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
- (#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.
Performance Improvements¶
- (#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
- (#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
- (#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
- (#6775) Fixed a major performance problem related to reading block matrices.
Version 0.2.18¶
Released 2019-07-12
Critical performance bug fix¶
- (#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.
Version 0.2.17¶
Released 2019-07-10
New features¶
- (#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
- (#6405) When a matrix table has string column-keys, matrixtable.show uses the column key as the column name.
- (#6345) Added an improved scan implementation, which reduces the memory load on master.
- (#6462) Added export_bgen method.
- (#6473) Improved performance of hl.agg.array_sum by about 50%.
- (#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
- (#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
- (#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
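For readers unfamiliar with the genomic control inflation factor, the usual textbook definition is the median of the observed 1-degree-of-freedom chi-squared statistics divided by the median of the chi-squared distribution with 1 degree of freedom (about 0.455). The sketch below computes that from p-values using only the Python standard library. It illustrates the definition and is not necessarily how hl.lambda_gc is implemented; the function names are hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p: float) -> float:
    """Chi-squared (1 d.f.) statistic whose upper-tail probability is p, for 0 < p <= 1.

    A chi-squared variable with 1 d.f. is a squared standard normal, so the
    statistic is z**2 where z is the normal quantile at p / 2.
    """
    z = _STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def lambda_gc(p_values) -> float:
    """Median observed statistic divided by the median expected under the null."""
    observed = median(chi2_1df_from_p(p) for p in p_values)
    expected = chi2_1df_from_p(0.5)  # median of chi-squared with 1 d.f.
    return observed / expected
```

Uniformly distributed p-values give a value close to 1; values well above 1 suggest inflated test statistics.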
Bug fixes¶
- (#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.
- (#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
- (#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
- (#6496) Fixed HTML output of show methods to truncate long field contents.
- (#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
- (#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
- (#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
- (#6583) Fixed hl.plot.manhattan for non-default reference genomes.
Version 0.2.16¶
Released 2019-06-19
Version 0.2.15¶
Released 2019-06-14
After some infrastructural changes to our development process, we should be getting back to frequent releases.
hailctl¶
Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.
New features¶
- (#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
- (#5980) Added log option to hl.plot.histogram2d.
- (#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
- (#5913) Added functionality that makes arrays of structs easier to work with.
- (#6089) Added HTML output to Expression.show when running in a notebook.
- (#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
- (#6123) Added hl.binary_search to search sorted numeric arrays.
- (#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
- (#6214) Updated Hail Python dependencies.
- (#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
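The intuition behind rewriting a key filter as an interval filter can be shown with a toy example: when rows are sorted by key, an equality filter on the key does not need to test every row, because the matching rows form one contiguous range that a binary search can locate. The sketch below is plain Python and says nothing about how Hail's optimizer is actually implemented; the class KeyedRows and its methods are hypothetical.

```python
from bisect import bisect_left, bisect_right


class KeyedRows:
    """Toy table: rows are (key, value) tuples kept sorted by key."""

    def __init__(self, rows):
        self.rows = sorted(rows, key=lambda row: row[0])
        self.keys = [row[0] for row in self.rows]  # sorted key column

    def filter_scan(self, key):
        """Baseline: evaluate the predicate on every row."""
        return [row for row in self.rows if row[0] == key]

    def filter_interval(self, key):
        """Same result, found by binary search for the range of matching keys."""
        lo = bisect_left(self.keys, key)
        hi = bisect_right(self.keys, key)
        return self.rows[lo:hi]
```

Both methods return the same rows; the second touches only a logarithmic number of keys plus the matches, which is why point queries benefit most.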
Bug fixes¶
- (#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
- (#6013) Turned off feature in HTSJDK that caused crashes in hl.import_vcf due to header fields being overwritten with different types, if the field had a different type than the type in the VCF 4.2 spec.
- (#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
- (#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
- (#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
- (#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
- (#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.
Version 0.2.14¶
Released 2019-04-24
A back-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.
Version 0.2.13¶
Released 2019-04-18
Hail is now using Spark 2.4.x by default. If you build Hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.
New features¶
- (#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
- (#5860) Improve performance of some column annotation pipelines.
- (#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
- (#5799) mt.entries() is four times faster.
- (#5756) Hail now uses Spark 2.4.x by default.
- (#5677) MatrixTable now also supports show.
- (#5793)(#5701) Add array.index(x) which finds the first index of array whose value is equal to x.
- (#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
- (#5690) Improve performance of ld_matrix.
- (#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
- (#5758) Failure to parse an interval will now produce a much more detailed error message.
- (#5723) hl.import_matrix_table can now import a matrix table with no columns.
- (#5724) hl.rand_norm2d samples from a two dimensional random normal.
Bug fixes¶
- (#5885) Fix Table.to_spark in the presence of fields of tuples.
- (#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
- (#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
- (#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
- (#5830)(#5835) Fix Nirvana support.
- (#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
- (#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
- (#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
- (#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
- (#5725) Docs now consistently refer to hl.agg not agg.
- (#5730)(#5782) Taught import_bgen to optimize its variants argument.
Version 0.2.12¶
Released 2019-03-28
New features¶
Bug fixes¶
Experimental¶
- (#5524) Add summarize functions to Table, MatrixTable, and Expression.
- (#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
- (#5571) Add log parameter to hl.plot.histogram.
- (#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
- (#5608) Add LD score simulation framework.
- (#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.
Version 0.2.11¶
Released 2019-03-06
New features¶
- (#5374) Add default arguments to hl.add_sequence for running on GCP.
- (#5481) Added sample_cols method to MatrixTable.
- (#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
- (#5480) Added n_cols argument to MatrixTable.head.
- (#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
- (#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
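For readers new to the terms, a semi join keeps the rows whose key appears in another table, and an anti join keeps the rows whose key does not; neither adds fields from the other table. The sketch below models those standard relational semantics on lists of dicts. It is plain Python rather than Hail code, and the function signatures are chosen for illustration only.

```python
def semi_join(rows, other_rows, key):
    """Keep rows whose key value also occurs in other_rows."""
    other_keys = {row[key] for row in other_rows}
    return [row for row in rows if row[key] in other_keys]


def anti_join(rows, other_rows, key):
    """Keep rows whose key value does not occur in other_rows."""
    other_keys = {row[key] for row in other_rows}
    return [row for row in rows if row[key] not in other_keys]
```

Together the two results partition the left-hand rows, which makes them convenient for include-list and exclude-list filtering.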
Bug fixes¶
- (#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
- (#5419) Resolved issue with import_vcf force_bgz and file size checks.
- (#5427) Resolved issue with Table.show and dictionary field types.
- (#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
- (#5492) Fixed hl.agg.collect crashing when collecting float32 values.
- (#5525) Fixed hl.trio_matrix crashing when complete_trios is False.
Version 0.2.10¶
Released 2019-02-15
New features¶
- (#5272) Added a new ‘delimiter’ option to Table.export.
- (#5251) Add utility aliases to hl.plot for output_notebook and show.
- (#5249) Add histogram2d function to hl.plot module.
- (#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
- (#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
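A rough model of line-level preprocessing with a filter regex and a find/replace pair is sketched below. It is plain Python, not Hail. The helper name preprocess_lines is hypothetical, and the sketch assumes that lines matching the filter regex are dropped, that find_replace is a (pattern, replacement) pair, and that filtering happens before substitution.

```python
import re


def preprocess_lines(lines, filter_regex=None, find_replace=None):
    """Drop lines that match filter_regex, then apply a regex substitution to the rest."""
    result = []
    for line in lines:
        # Assumption: a partial match anywhere in the line removes it.
        if filter_regex is not None and re.search(filter_regex, line):
            continue
        if find_replace is not None:
            pattern, replacement = find_replace
            line = re.sub(pattern, replacement, line)
        result.append(line)
    return result
```

For example, a filter of "^#" would drop comment lines, and a find_replace of (";", "\t") would turn semicolon-separated text into tab-separated text before parsing.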
Performance improvements¶
- (#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
Bug fixes¶
- (#5306) Fix ReferenceGenome.add_sequence causing a crash.
- (#5268) Fix Table.export writing a file called ‘None’ in the current directory.
- (#5265) Fix hl.get_reference raising an exception when called before hl.init().
- (#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
- (#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
- (#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
- (#5295) Fix crash in Table.index related to key field incompatibilities.
Version 0.2.9¶
Released 2019-01-30
New features¶
Performance improvements¶
Bug fixes¶
- (#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
- (#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
- (#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
- (#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
- (#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).
Version 0.2.7¶
Released 2019-01-03
Version 0.2.6¶
Released 2018-12-17
New features¶
- (#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
- (#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
- (#4926) Expanded default GRCh38 contig recoding behavior in import_plink.
Bug fixes¶
- (#4941) Fixed variable scoping error in regression methods.
- (#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
- (#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
- (#4920) Fixed bad error message in Table.order_by.
Version 0.2.5¶
Released 2018-12-07
New features¶
- (#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
- (#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
- (#4820) Added hl.reversed for reversing arrays and strings.
- (#4895) Added include_strand option to the hl.liftover function.
Performance improvements¶
Bug fixes¶
- (#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
- (#4888) Fixed assertion error in BlockMatrix.sum.
- (#4871) Fixed possible error in locally sorting nested collections.
- (#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
- (#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].