Change Log¶
Frequently Asked Questions¶
With a version like 0.x, is Hail ready for use in publications?¶
Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.
What stability is guaranteed?¶
We do not intentionally break back-compatibility of interfaces or file formats. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. The exception to this rule is experimental functionality, denoted as such in the reference documentation, which may change at any time.
Please note that forward compatibility should not be expected, especially relating to file formats: this means that it may not be possible to use an earlier version of Hail to read files written in a later version.
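The policy above can be pictured with a small plain-Python sketch. The helper names `parse_version` and `can_expect_to_read` are hypothetical and are not part of Hail. They only encode the stated rule: files from an older 0.2.x release should remain readable by a newer 0.2.x release, while the reverse is not promised.

```python
def parse_version(version):
    """Split a version string such as '0.2.58' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def can_expect_to_read(writer_version, reader_version):
    """Return True when the stated policy covers reading the file.

    Hypothetical helper, not a Hail API: backward compatibility means a
    reader at least as new as the writer, within the same 0.2 series.
    """
    writer = parse_version(writer_version)
    reader = parse_version(reader_version)
    same_series = writer[:2] == reader[:2]
    return same_series and reader >= writer


# A file written by 0.2.5 and read by 0.2.61 is covered; the reverse is not.
covered = can_expect_to_read("0.2.5", "0.2.61")
not_covered = can_expect_to_read("0.2.61", "0.2.5")
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.2.5" sorts after "0.2.61".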
Version 0.2.61¶
Released 2020-12-03
Version 0.2.60¶
Released 2020-11-16
Version 0.2.59¶
Released 2020-10-22
Version 0.2.58¶
Released 2020-10-08
New features¶
(#9524) Hail should now be buildable using Spark 3.0.
(#9549) Add ignore_in_sample_frequency flag to hl.de_novo.
(#9501) Configurable cache size for BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9474) Add ArrayExpression.first and ArrayExpression.last.
(#9459) Add StringExpression.join, an analogue to Python’s str.join.
(#9398) Hail will now throw HailUserErrors if the or_error branch of a CaseBuilder is hit.
Bug fixes¶
(#9503) NDArrays can now hold arbitrary data types, though only ndarrays of primitives can be collected to Python.
(#9501) Remove memory leak in BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9424) hl.experimental.writeBlockMatrices didn’t correctly support overwrite flag.
hailctl dataproc¶
Version 0.2.57¶
Released 2020-09-03
Version 0.2.55¶
Released 2020-08-19
Bug fixes¶
(#9250) hailctl dataproc no longer uses deprecated gcloud flags. Consequently, users must update to a recent version of gcloud.
(#9294) The “Python 3” kernel in notebooks in clusters started by hailctl dataproc now features the same Spark monitoring widget found in the “Hail” kernel. There is now no reason to use the “Hail” kernel.
Version 0.2.53¶
Released 2020-07-30
Version 0.2.52¶
Released 2020-07-29
Version 0.2.51¶
Released 2020-07-28
Version 0.2.50¶
Released 2020-07-23
Version 0.2.49¶
Released 2020-07-08
Version 0.2.48¶
Released 2020-07-07
Bug fixes¶
(#9029) Fix crash when using hl.agg.linreg with no aggregated data records.
(#9028) Fixed memory leak affecting Table.annotate with scans, hl.experimental.densify, and Table.group_by/aggregate.
(#8978) Fixed aggregation behavior of MatrixTable.{group_rows_by, group_cols_by} to skip filtered entries.
Version 0.2.47¶
Released 2020-06-23
Version 0.2.46¶
Released 2020-06-17
Version 0.2.43¶
Released 2020-05-28
Version 0.2.40¶
Released 2020-05-12
Version 0.2.39¶
Released 2020-04-29
Bug fixes¶
(#8615) Fix contig ordering in the CanFam3 (dog) reference genome.
(#8622) Fix bug that causes inscrutable JVM Bytecode errors.
(#8645) Ease unnecessarily strict assertion that caused errors when aggregating by key (e.g. hl.experimental.spread).
(#8621) hl.nd.array now supports arrays with no elements (e.g. hl.nd.array([]).reshape((0, 5))) and, consequently, matmul with an inner dimension of zero.
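To see why a zero inner dimension is a meaningful case, here is a plain-Python sketch (not Hail code, and not how Hail implements matmul). Under the usual empty-sum convention, multiplying a (2, 0) matrix by a (0, 3) matrix gives a (2, 3) matrix of zeros. The `matmul` helper and its explicit shape arguments are assumptions made for illustration, since an empty nested list cannot record its own column count.

```python
def matmul(a, b, inner, n_cols):
    """Multiply an (m x inner) matrix by an (inner x n_cols) matrix.

    The shapes are passed explicitly because an empty nested list does
    not carry its column count.
    """
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(n_cols)]
        for i in range(len(a))
    ]


left = [[], []]  # shape (2, 0): two rows, no columns
right = []       # shape (0, 3): no rows, three columns
product = matmul(left, right, inner=0, n_cols=3)
```

Each entry is a sum over zero terms, so every entry of `product` is 0.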
New features¶
(#8571) hl.init(skip_logging_configuration=True) will skip configuration of Log4j. Users may use this to configure their own logging.
(#8588) Users who manually build Python wheels will experience less unnecessary output when doing so.
(#8572) Add hl.parse_json which converts a string containing JSON into a Hail object.
Performance Improvements¶
Documentation¶
Version 0.2.38¶
Released 2020-04-21
Critical Linreg Aggregator Correctness Bug¶
(#8575) Fixed a correctness bug in the linear regression aggregator. This was introduced in version 0.2.29. See https://discuss.hail.is/t/possible-incorrect-linreg-aggregator-results-in-0-2-29-0-2-37/1375 for more details.
Version 0.2.37¶
Released 2020-04-14
Bug fixes¶
(#8487) Fix incorrect handling of badly formatted data for hl.gp_dosage.
(#8497) Fix handling of missingness for hl.hamming.
(#8537) Fix compile-time error.
(#8539) Fix compiler error in Table.multi_way_zip_join.
(#8488) Fix hl.agg.call_stats to appropriately throw an error for badly-formatted calls.
Version 0.2.36¶
Released 2020-04-06
Version 0.2.35¶
Released 2020-04-02
Critical Memory Management Bug Fix¶
(#8412) Fixed a serious per-partition memory leak that causes certain pipelines to run out of memory unexpectedly. Please update from 0.2.34.
Bug fixes¶
Performance Improvements¶
Version 0.2.34¶
Released 2020-03-12
New features¶
Bug fixes¶
hailctl dataproc¶
(#8253) hailctl dataproc now supports new flags --requester-pays-allow-all and --requester-pays-allow-buckets. This will configure your hail installation to be able to read from requester pays buckets. The charges for reading from these buckets will be billed to the project that the cluster is created in.
(#8268) The data sources for VEP have been moved to gs://hail-us-vep, gs://hail-eu-vep, and gs://hail-uk-vep, which are requester-pays buckets in Google Cloud. hailctl dataproc will automatically infer which of these buckets you should pull data from based on the region your cluster is spun up in. If you are in none of those regions, please contact us on discuss.hail.is.
Version 0.2.33¶
Released 2020-02-27
Bug fixes¶
(#8153) Fixed compiler bug causing MatchError in import_bgen.
(#8123) Fixed an issue with multiple Python HailContexts running on the same cluster.
(#8150) Fixed an issue where output from VEP about failures was not reported in error message.
(#8152) Fixed an issue where the row count of a MatrixTable coming from import_matrix_table was incorrect.
(#8175) Fixed a bug where persist did not actually do anything.
Version 0.2.32¶
Released 2020-02-07
Critical performance regression fix¶
(#7989) Fixed performance regression leading to a large slowdown when hl.variant_qc was run after filtering columns.
Performance¶
Bug fixes¶
(#7976) Fixed divide-by-zero error in hl.concordance with no overlapping rows or cols.
(#7965) Fixed optimizer error leading to crashes caused by MatrixTable.union_rows.
(#8035) Fix compiler bug in Table.multi_way_zip_join.
(#8021) Fix bug in computing shape after BlockMatrix.filter.
(#7986) Fix error in NDArray matrix/vector multiply.
Version 0.2.31¶
Released 2020-01-22
New features¶
(#7787) Added transition/transversion information to hl.summarize_variants.
(#7792) Add Python stack trace to array index out of bounds errors in Hail pipelines.
(#7832) Add spark_conf argument to hl.init, permitting configuration of Spark runtime for a Hail session.
(#7823) Added datetime functions hl.experimental.strptime and hl.experimental.strftime.
(#7888) Added hl.nd.array constructor from nested standard arrays.
File size¶
(#7923) Fixed compression problem since 0.2.23 resulting in larger-than-expected matrix table files for datasets with few entry fields (e.g. GT-only datasets).
Performance¶
Version 0.2.29¶
Released 2019-12-17
Bug fixes¶
(#7229) Fixed hl.maximal_independent_set tie breaker functionality.
(#7732) Fixed incompatibility with old files leading to incorrect data read when filtering intervals after read_matrix_table.
(#7642) Fixed crash when constant-folding functions that throw errors.
(#7611) Fixed hl.hadoop_ls to handle glob patterns correctly.
(#7653) Fixed crash in ld_prune by unfiltering missing GTs.
Performance improvements¶
New features¶
(#7686) Added comment argument to import_matrix_table, allowing lines with certain prefixes to be ignored.
(#7688) Added experimental support for NDArrayExpressions in new hl.nd module.
(#7608) hl.grep now has a show argument that allows users to either print the results (default) or return a dictionary of the results.
Version 0.2.28¶
Released 2019-11-22
Critical correctness bug fix¶
(#7588) Fixes a bug where filtering old matrix tables in newer versions of hail did not work as expected. Please update from 0.2.27.
Bug fixes¶
New Features¶
Version 0.2.27¶
Released 2019-11-15
New Features¶
(#7379) Add delimiter argument to hl.import_matrix_table.
(#7389) Add force and force_bgz arguments to hl.experimental.import_gtf.
(#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.
(#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.
(#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
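The "return missing on failure" behaviour described for the parse functions can be pictured in plain Python, with None standing in for a missing value. This is only an analogy: `parse_int_or_none` is a hypothetical helper, and Hail's actual parsing rules (for example the 32-bit and 64-bit ranges) are not modelled here.

```python
def parse_int_or_none(text):
    """Parse text as an integer, returning None instead of raising.

    Hypothetical stand-in for parse-or-missing semantics; None plays the
    role of a missing value.
    """
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


# Well-formed input parses; malformed input yields None rather than an error.
parsed = [parse_int_or_none(s) for s in ["123", "12a", ""]]
```

Returning a missing value instead of raising lets one malformed field be filtered or imputed downstream without aborting the whole computation.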
Bug fixes¶
Version 0.2.25¶
Released 2019-10-14
New features¶
(#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.
(#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).
(#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.
(#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().
(#7294) Add HTML output for hl.summarize_variants().
Bug fixes¶
Version 0.2.24¶
Released 2019-10-03
hailctl dataproc¶
(#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.
New features¶
(#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.
(#7121) Add hl.contig_length function.
(#7130) Add window method on LocusExpression, which creates an interval around a locus.
(#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.
Version 0.2.23¶
Released 2019-09-23
hailctl dataproc¶
Bug fixes¶
New features¶
(#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.
(#7076) Improved memory management in generated code and added log statements about allocated memory to improve debugging.
(#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).
(#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.
Version 0.2.20¶
Released 2019-08-19
Critical memory management fix¶
(#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.
Bug fixes¶
Version 0.2.19¶
Released 2019-08-01
Critical performance bug fix¶
Bug fixes¶
(#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
(#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.
Performance Improvements¶
(#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
(#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
(#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
(#6775) Fixed a major performance problem related to reading block matrices.
Version 0.2.18¶
Released 2019-07-12
Critical performance bug fix¶
(#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.
Version 0.2.17¶
Released 2019-07-10
New features¶
(#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
(#6405) When a matrix table has string column-keys, matrixtable.show uses the column key as the column name.
(#6345) Added an improved scan implementation, which reduces the memory load on master.
(#6462) Added export_bgen method.
(#6473) Improved performance of hl.agg.array_sum by about 50%.
(#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
(#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
(#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
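For readers unfamiliar with the genomic control inflation factor, the following plain-Python sketch shows the textbook definition: the median of the observed 1-degree-of-freedom chi-square statistics divided by the median of that distribution under the null (about 0.4549). This is an assumption-laden illustration using only the standard library. It is not Hail's implementation, which may, for example, use an approximate median.

```python
from statistics import NormalDist, median


def lambda_gc(p_values):
    """Textbook genomic control inflation factor from two-sided p-values.

    Each p-value must lie strictly between 0 and 1 (p = 1 is also
    accepted); p = 0 has no finite chi-square statistic.
    """
    std_normal = NormalDist()
    # A 1-df chi-square statistic is the square of the standard normal
    # quantile at 1 - p/2.
    chi_sq = [std_normal.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    # Median of the 1-df chi-square distribution, approximately 0.4549.
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi_sq) / expected_median


# Evenly spread p-values mimic a well-calibrated test, so lambda is near 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
inflation = lambda_gc(uniform_p)
```

Values well above 1 are conventionally read as a sign of inflated test statistics, for example from population stratification.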
Bug fixes¶
(#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.
(#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
(#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
(#6496) Fixed html output of show methods to truncate long field contents.
(#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
(#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
(#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
(#6583) Fixed hl.plot.manhattan for non-default reference genomes.
Version 0.2.16¶
Released 2019-06-19
Version 0.2.15¶
Released 2019-06-14
After some infrastructural changes to our development process, we should be getting back to frequent releases.
hailctl¶
Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.
New features¶
(#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
(#5980) Added log option to hl.plot.histogram2d.
(#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
(#5913) Added functionality that makes arrays of structs easier to work with.
(#6089) Added HTML output to Expression.show when running in a notebook.
(#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
(#6123) Added hl.binary_search to search sorted numeric arrays.
(#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
(#6214) Updated Hail Python dependencies.
(#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
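As a toy model of why rewriting a key filter as an interval filter helps point queries, consider keys stored in sorted order. A generic filter must test every key, while an interval lookup can jump straight to the matching range by binary search. The sketch below uses Python's bisect module on a plain list. The function names are hypothetical, and it does not reflect how Hail's optimizer or file format actually work.

```python
from bisect import bisect_left, bisect_right


def filter_by_scan(sorted_keys, target):
    """Generic filter: test every key, O(n) comparisons."""
    return [key for key in sorted_keys if key == target]


def filter_by_interval(sorted_keys, target):
    """Interval lookup on sorted keys: O(log n) to find the bounds."""
    start = bisect_left(sorted_keys, target)
    stop = bisect_right(sorted_keys, target)
    return sorted_keys[start:stop]


keys = [1, 3, 3, 7, 9, 12]
scan_result = filter_by_scan(keys, 3)
interval_result = filter_by_interval(keys, 3)
```

Both functions return the same rows. Only the amount of data that has to be examined differs, and that difference is where the speedup for point queries comes from.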
Bug fixes¶
(#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
(#6013) Turned off feature in HTSJDK that caused crashes in hl.import_vcf due to header fields being overwritten with different types, if the field had a different type than the type in the VCF 4.2 spec.
(#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
(#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
(#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
(#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
(#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.
Version 0.2.14¶
Released 2019-04-24
A back-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.
Version 0.2.13¶
Released 2019-04-18
Hail is now using Spark 2.4.x by default. If you build hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.
New features¶
(#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
(#5860) Improve performance of some column annotation pipelines.
(#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
(#5799) mt.entries() is four times faster.
(#5756) Hail now uses Spark 2.4.x by default.
(#5677) MatrixTable now also supports show.
(#5793)(#5701) Add array.index(x) which finds the first index of array whose value is equal to x.
(#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
(#5690) Improve performance of ld_matrix.
(#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
(#5758) Failure to parse an interval will now produce a much more detailed error message.
(#5723) hl.import_matrix_table can now import a matrix table with no columns.
(#5724) hl.rand_norm2d samples from a two dimensional random normal.
Bug fixes¶
(#5885) Fix Table.to_spark in the presence of fields of tuples.
(#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
(#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
(#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
(#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
(#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
(#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
(#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
(#5725) Docs now consistently refer to hl.agg not agg.
(#5730)(#5782) Taught import_bgen to optimize its variants argument.
Version 0.2.12¶
Released 2019-03-28
New features¶
Bug fixes¶
Experimental¶
(#5524) Add summarize functions to Table, MatrixTable, and Expression.
(#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
(#5571) Add log parameter to hl.plot.histogram.
(#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
(#5608) Add LD score simulation framework.
(#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.
Version 0.2.11¶
Released 2019-03-06
New features¶
(#5374) Add default arguments to hl.add_sequence for running on GCP.
(#5481) Added sample_cols method to MatrixTable.
(#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
(#5480) Added n_cols argument to MatrixTable.head.
(#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
(#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
Bug fixes¶
(#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
(#5419) Resolved issue with import_vcf force_bgz and file size checks.
(#5427) Resolved issue with Table.show and dictionary field types.
(#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
(#5492) Fixed hl.agg.collect crashing when collecting float32 values.
(#5525) Fixed hl.trio_matrix crashing when complete_trios is False.
Version 0.2.10¶
Released 2019-02-15
New features¶
(#5272) Added a new ‘delimiter’ option to Table.export.
(#5251) Add utility aliases to hl.plot for output_notebook and show.
(#5249) Add histogram2d function to hl.plot module.
(#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
(#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
Performance improvements¶
(#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
Bug fixes¶
(#5306) Fix ReferenceGenome.add_sequence causing a crash.
(#5268) Fix Table.export writing a file called ‘None’ in the current directory.
(#5265) Fix hl.get_reference raising an exception when called before hl.init().
(#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
(#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
(#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
(#5295) Fix crash in Table.index related to key field incompatibilities.
Version 0.2.9¶
Released 2019-01-30
New features¶
Performance improvements¶
Bug fixes¶
(#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
(#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
(#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
(#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
(#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).
Version 0.2.7¶
Released 2019-01-03
Version 0.2.6¶
Released 2018-12-17
New features¶
(#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
(#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
(#4926) Expanded default GRCh38 contig recoding behavior in import_plink.
Bug fixes¶
(#4941) Fixed variable scoping error in regression methods.
(#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
(#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
(#4920) Fixed bad error message in Table.order_by.
Version 0.2.5¶
Released 2018-12-07
New features¶
(#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
(#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
(#4820) Added hl.reversed for reversing arrays and strings.
(#4895) Added include_strand option to the hl.liftover function.
Performance improvements¶
Bug fixes¶
(#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
(#4888) Fixed assertion error in BlockMatrix.sum.
(#4871) Fixed possible error in locally sorting nested collections.
(#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
(#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].