Change Log

Frequently Asked Questions

With a version like 0.x, is Hail ready for use in publications?

Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.

What stability is guaranteed?

We do not intentionally break back-compatibility of interfaces or file formats. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. The exception to this rule is experimental functionality, denoted as such in the reference documentation, which may change at any time.

Please note that forward compatibility should not be expected, especially relating to file formats: this means that it may not be possible to use an earlier version of Hail to read files written in a later version.
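
A minimal sketch of the two rules above, written in plain Python rather than Hail. The helper names (parse_version, script_expected_to_run, file_expected_to_read) are hypothetical and not part of Hail, and the sketch ignores experimental functionality, which may change at any time.

```python
def parse_version(version):
    """Turn a version string such as '0.2.79' into a tuple of ints: (0, 2, 79)."""
    return tuple(int(part) for part in version.split("."))


def script_expected_to_run(written_for, running_on):
    """Backward compatibility: same 0.2 series, and the runtime is not older."""
    dev, run = parse_version(written_for), parse_version(running_on)
    return dev[:2] == run[:2] and run >= dev


def file_expected_to_read(written_by, read_by):
    """No forward compatibility: the reader should be at least as new as the writer."""
    writer, reader = parse_version(written_by), parse_version(read_by)
    return writer[:2] == reader[:2] and reader >= writer


print(script_expected_to_run("0.2.5", "0.2.79"))  # True
print(file_expected_to_read("0.2.79", "0.2.5"))   # False
```

A False result here only means that success should not be expected, not that the operation is certain to fail.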

Version 0.2.79

Released 2021-11-17

Bug fixes

  • (#11023) Fixed bug in call decoding that was introduced in version 0.2.78.

New features

  • (#10993) New function p_value_excess_het.

Version 0.2.78

Released 2021-10-19

Bug fixes

  • (#10766) Don’t throw out of memory error when broadcasting more than 2^(31) - 1 bytes.

  • (#10910) Filters on key field won’t be slowed down by uses of MatrixTable.localize_entries or Table.rename.

  • (#10959) Don’t throw an error in certain situations where some key fields are optimized away.

New features

  • (#10855) Arbitrary aggregations can be implemented using hl.agg.fold.
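
To illustrate the idea behind a fold-style aggregation, here is a minimal sketch in plain Python rather than Hail. It assumes an aggregation defined by an initial value, a sequential operation that absorbs one record, and a combine operation that merges two partial results (for example, one per partition). The fold helper and its parameter names are hypothetical; consult the reference documentation for the actual hl.agg.fold signature.

```python
from functools import reduce


def fold(partitions, initial_value, seq_op, comb_op):
    """Fold each partition separately, then merge the partial results."""
    partials = [reduce(seq_op, part, initial_value) for part in partitions]
    return reduce(comb_op, partials, initial_value)


# Example: sum of squares across three "partitions" of data.
partitions = [[1, 2], [3], [4, 5]]
sum_of_squares = fold(
    partitions,
    initial_value=0,
    seq_op=lambda acc, x: acc + x * x,
    comb_op=lambda left, right: left + right,
)
print(sum_of_squares)  # 55
```

In this sketch the initial value must be neutral for the combine operation (0 for addition), because it is used once per partition and once more when merging.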

Performance Improvements

  • (#10971) Substantially improve the speed of Table.collect when collecting large amounts of data.

Version 0.2.77

Released 2021-09-21

Bug fixes

  • (#10888) Fix crash when calling hl.liftover.

  • (#10883) Fix crash / long compilation times writing matrix tables with many partitions.

Version 0.2.76

Released 2021-09-15

Bug fixes

  • (#10872) Fix long compile times or method size errors when writing tables with many partitions

  • (#10878) Fix crash importing or sorting tables with empty data partitions

Version 0.2.75

Released 2021-09-10

Bug fixes

  • (#10733) Fix a bug in tabix parsing when the size of the list of all sequences is large.

  • (#10765) Fix rare bug where valid pipelines would fail to compile if intervals were created conditionally.

  • (#10746) Various compiler improvements that decrease the likelihood of ClassTooLarge errors.

  • (#10829) Fix a bug where hl.missing and CaseBuilder.or_error failed if their type was a struct containing a field starting with a number.

New features

  • (#10768) Support multiplying StringExpressions to repeat them, as with normal Python strings.

Performance improvements

  • (#10625) Reduced the need to copy strings around; pipelines with many string operations should get faster.

  • (#10775) Improved performance of to_matrix_table_row_major on both BlockMatrix and Table.

Version 0.2.74

Released 2021-07-26

Bug fixes

  • (#10697) Fixed bug in read_table when the table has missing keys and _n_partitions is specified.

  • (#10695) Fixed bug in hl.experimental.loop causing incorrect results when loop state contained pointers.

Version 0.2.73

Released 2021-07-22

Bug fixes

  • (#10684) Fixed a rare bug reading arrays from disk where short arrays would have their first elements corrupted and long arrays would cause segfaults.

  • (#10523) Fixed bug where liftover would fail with “Could not initialize class” errors.

Version 0.2.72

Released 2021-07-19

New Features

  • (#10655) Revamped many Hail error messages to give useful Python stack traces.

  • (#10663) Added DictExpression.items() to mirror Python’s dict.items().

  • (#10657) now supports mapping over multiple lists like Python’s built-in map.
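
For reference, the Python built-ins that the entries above are described as mirroring behave as follows. This is plain Python, not Hail code; the Hail counterparts operate on Hail expressions rather than Python values, and the variable names are illustrative.

```python
# dict.items() yields (key, value) pairs.
counts = {"A": 1, "B": 2}
pairs = list(counts.items())
print(pairs)  # [('A', 1), ('B', 2)]

# map() with several lists calls the function with one element from each list.
xs = [1, 2, 3]
ys = [10, 20, 30]
sums = list(map(lambda x, y: x + y, xs, ys))
print(sums)  # [11, 22, 33]
```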

Bug fixes

  • (#10662) Fixed partitioning logic in hl.import_plink.

  • (#10669) NDArrayNumericExpression.sum() now works correctly on ndarrays of booleans.

Version 0.2.71

Released 2021-07-08

New Features

  • (#10632) Added support for weighted linear regression to hl.linear_regression_rows.

  • (#10635) Added hl.nd.maximum and hl.nd.minimum.

  • (#10602) Added hl.starmap.

Bug fixes

  • (#10038) Fixed crashes when writing/reading matrix tables with 0 partitions.

  • (#10624) Fixed out of bounds bug with _quantile_from_cdf.

hailctl dataproc

  • (#10633) Added --scopes parameter to hailctl dataproc start.

Version 0.2.70

Released 2021-06-21

Version 0.2.69

Released 2021-06-14

New Features

  • (#10592) Added hl.get_hgdp function.

  • (#10555) Added hl.hadoop_scheme_supported function.

  • (#10551) Indexing ndarrays now supports ellipses.

Bug fixes

  • (#10553) Dividing two integers now returns a float64, not a float32.

  • (#10595) Don’t include nans in lambda_gc_agg.

hailctl dataproc

  • (#10574) Hail logs will now be stored in /home/hail by default.

Version 0.2.68

Released 2021-05-27

Version 0.2.67

Released 2021-05-06

Critical performance fix

  • (#10451) Fixed a memory leak / performance bug triggered by hl.literal(...).contains(...)

Version 0.2.66

Released 2021-05-03

New features

  • (#10398) Added new method BlockMatrix.to_ndarray.

  • (#10251) Added support for haploid GT calls to VCF combiner.

Version 0.2.65

Released 2021-04-14

Default Spark Version Change

  • Starting from version 0.2.65, Hail uses Spark 3.1.1 by default. This will also allow the use of all Python versions >= 3.6. By building Hail from source, it is still possible to use older versions of Spark.

New features

  • (#10290) Added hl.nd.solve.

  • (#10187) Added NDArrayNumericExpression.sum.

Performance improvements

  • (#10233) Loops created with hl.experimental.loop will now clean up unneeded memory between iterations.

Bug fixes

  • (#10227) hl.nd.qr now supports ndarrays that have 0 rows or columns.

Version 0.2.64

Released 2021-03-11

New features

  • (#10164) Add source_file_field parameter to hl.import_table to allow lines to be associated with their original source file.

Bug fixes

  • (#10182) Fixed serious memory leak in certain uses of filter_intervals.

  • (#10133) Fix bug where some pipelines incorrectly infer missingness, leading to a type error.

  • (#10134) Teach hl.king to treat filtered entries as missing values.

  • (#10158) Fixes hail usage in latest versions of jupyter that rely on asyncio.

  • (#10174) Fixed bad error message when incorrect return type specified with hl.loop.

Version 0.2.63

Released 2021-03-01

  • (#10105) Hail will now return frozenset and hail.utils.frozendict instead of normal sets and dicts.
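
As a reminder of what this change can mean for code that consumes such results, the sketch below shows the relevant behavior of Python's built-in frozenset. It uses only the standard library; hail.utils.frozendict is not imported here, and the variable names are illustrative.

```python
genes = frozenset({"BRCA1", "TP53"})

# Frozen sets are hashable, so they can be dictionary keys or set elements.
lookup = {genes: "panel-1"}
print(lookup[frozenset({"TP53", "BRCA1"})])  # panel-1

# They are immutable: there is no add() method, so build a new value instead.
has_add = hasattr(genes, "add")
extended = genes | {"EGFR"}
print(has_add, sorted(extended))  # False ['BRCA1', 'EGFR', 'TP53']
```

Code that needs a mutable copy of a frozenset can build one with set(...).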

Bug fixes

  • (#10035) Fix mishandling of NaN values in hl.agg.hist, where they were unintentionally included in the first bin.

  • (#10007) Improve error message from hadoop_ls when file does not exist.

Performance Improvements

  • (#10068) Make certain array copies faster.

  • (#10061) Improve code generation of hl.if_else and hl.coalesce.

Version 0.2.62

Released 2021-02-03

New features

  • (#9936) Deprecated hl.null in favor of hl.missing for naming consistency.

  • (#9973) hl.vep now includes a vep_proc_id field to aid in debugging unexpected output.

  • (#9839) Hail now eagerly deletes temporary files produced by some BlockMatrix operations.

  • (#9835) hl.any and hl.all now also support a single collection argument and a varargs of Boolean expressions.

  • (#9816) hl.pc_relate now includes values on the diagonal of kinship, IBD-0, IBD-1, and IBD-2

  • (#9736) Let NDArrayExpression.reshape take varargs instead of mandating a tuple.

  • (#9766) hl.export_vcf now warns if INFO field names are invalid according to the VCF 4.3 spec.

Bug fixes

  • (#9976) Fixed show() representation of Hail dictionaries.

Performance improvements

  • (#9909) Improved performance of hl.experimental.densify by approximately 35%.

Version 0.2.61

Released 2020-12-03

New features

  • (#9749) Add or_error method to SwitchBuilder (hl.switch)

Bug fixes

  • (#9775) Fixed race condition leading to invalid intermediate files in VCF combiner.

  • (#9751) Fix bug where constructing an array of empty structs causes type error.

  • (#9731) Fix error and incorrect behavior when using hl.import_matrix_table with int64 data types.

Version 0.2.60

Released 2020-11-16

New features

  • (#9696) hl.experimental.export_elasticsearch will now support Elasticsearch versions 6.8 - 7.x by default.

Bug fixes

  • (#9641) Showing hail ndarray data now always prints in correct order.

hailctl dataproc

  • (#9610) Support interval fields in hailctl dataproc describe

Version 0.2.59

Released 2020-10-22

Datasets / Annotation DB

  • (#9605) The Datasets API and the Annotation Database now support AWS, and users are required to specify what cloud platform they’re using.

hailctl dataproc

  • (#9609) Fixed bug where hailctl dataproc modify did not correctly print corresponding gcloud command.

Version 0.2.58

Released 2020-10-08

New features

  • (#9524) Hail should now be buildable using Spark 3.0.

  • (#9549) Add ignore_in_sample_frequency flag to hl.de_novo.

  • (#9501) Configurable cache size for BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.

  • (#9474) Add ArrayExpression.first and ArrayExpression.last.

  • (#9459) Add StringExpression.join, an analogue to Python’s str.join.

  • (#9398) Hail will now throw HailUserErrors if the or_error branch of a CaseBuilder is hit.

Bug fixes

  • (#9503) NDArrays can now hold arbitrary data types, though only ndarrays of primitives can be collected to Python.

  • (#9501) Remove memory leak in BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.

  • (#9424) hl.experimental.writeBlockMatrices didn’t correctly support overwrite flag.

Performance improvements

  • (#9506) hl.agg.ndarray_sum will now do a tree aggregation.

hailctl dataproc

  • (#9502) Fix hailctl dataproc modify to install dependencies of the wheel file.

  • (#9420) Add --debug-mode flag to hailctl dataproc start. This will enable heap dumps on OOM errors.

  • (#9520) Add support for requester pays buckets to hailctl dataproc describe.


Deprecations

  • (#9482) ArrayExpression.head has been deprecated in favor of ArrayExpression.first.

Version 0.2.57

Released 2020-09-03

New features

  • (#9343) Implement the KING method for relationship inference as hl.methods.king.

Version 0.2.56

Released 2020-08-31

New features

  • (#9308) Add hl.enumerate in favor of hl.zip_with_index, which is now deprecated.

  • (#9278) Add ArrayExpression.grouped, a function that groups hail arrays into fixed size subarrays.
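
The following plain-Python sketch shows the behavior these two additions are described as providing: pairing elements with their index, and splitting a sequence into fixed-size subarrays. It does not call Hail; the grouped helper is a hypothetical stand-in, and how Hail treats a final, shorter group is an assumption here.

```python
def grouped(values, group_size):
    """Split a list into consecutive sublists of at most group_size elements."""
    return [values[i:i + group_size] for i in range(0, len(values), group_size)]


alleles = ["A", "C", "G", "T", "N"]

# enumerate pairs each element with its position.
indexed = list(enumerate(alleles))
print(indexed[:2])  # [(0, 'A'), (1, 'C')]

print(grouped(alleles, 2))  # [['A', 'C'], ['G', 'T'], ['N']]
```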


Performance improvements

  • (#9373)(#9374) Decrease the amount of memory used when slicing or filtering along a single BlockMatrix dimension.

Bug fixes

  • (#9304) Fix crash in run_combiner caused by inputs where VCF lines and BGZ blocks align.

hailctl dataproc

  • (#9263) Add support for --expiration-time argument to hailctl dataproc start.

  • (#9263) Add support for --no-max-idle, --no-max-age, --max-age, and --expiration-time to hailctl dataproc modify.

Version 0.2.55

Released 2020-08-19


Performance improvements

  • (#9264) Table.checkpoint now uses a faster LZ4 compression scheme.

Bug fixes

  • (#9250) hailctl dataproc no longer uses deprecated gcloud flags. Consequently, users must update to a recent version of gcloud.

  • (#9294) The “Python 3” kernel in notebooks in clusters started by hailctl dataproc now features the same Spark monitoring widget found in the “Hail” kernel. There is now no reason to use the “Hail” kernel.

Version 0.2.54

Released 2020-08-07

VCF Combiner

  • (#9224)(#9237) Breaking change: Users are now required to pass a partitioning argument to the command-line interface or run_combiner method. See documentation for details.

  • (#8963) Improved performance of VCF combiner by ~4x.

New features

  • (#9209) Add hl.agg.ndarray_sum aggregator.

Bug fixes

  • (#9206)(#9207) Improved error messages from invalid usages of Hail expressions.

  • (#9223) Fixed error in bounds checking for NDArray slicing.

Version 0.2.53

Released 2020-07-30

Bug fixes

  • (#9173) Use less confusing column key behavior in

  • (#9172) Add a missing Python dependency to Hail: google-cloud-storage.

  • (#9170) Change Hail tree aggregate depth logic to correctly respect the branching factor set in hl.init.

Version 0.2.52

Released 2020-07-29

Bug fixes

  • (#8944)(#9169) Fixed crash (error 134 or SIGSEGV) in MatrixTable.annotate_cols, hl.sample_qc, and more.

Version 0.2.51

Released 2020-07-28

Bug fixes

  • (#9161) Fix bug that prevented concatenating ndarrays that are fields of a table.

  • (#9152) Fix bounds in NDArray slicing.

  • (#9161) Fix bugs calculating row_id in hl.import_matrix_table.

Version 0.2.50

Released 2020-07-23

Bug fixes

  • (#9114) Fixed crash when using repeated calls to hl.filter_intervals.

New features

  • (#9101) Add hl.nd.{concat, hstack, vstack} to concatenate ndarrays.

  • (#9105) Add hl.nd.{eye, identity} to create identity matrix ndarrays.

  • (#9093) Add hl.nd.inv to invert ndarrays.

  • (#9063) Add BlockMatrix.tree_matmul to improve matrix multiply performance with a large inner dimension.

Version 0.2.49

Released 2020-07-08

Bug fixes

  • (#9058) Fixed memory leak affecting Table.aggregate, MatrixTable.annotate_cols aggregations, and hl.sample_qc.

Version 0.2.48

Released 2020-07-07

Bug fixes

  • (#9029) Fix crash when using hl.agg.linreg with no aggregated data records.

  • (#9028) Fixed memory leak affecting Table.annotate with scans, hl.experimental.densify, and Table.group_by / aggregate.

  • (#8978) Fixed aggregation behavior of MatrixTable.{group_rows_by, group_cols_by} to skip filtered entries.

Version 0.2.47

Released 2020-06-23

Bug fixes

  • (#9009) Fix memory leak when counting per-partition. This caused excessive memory use in BlockMatrix.write_from_entry_expr, and likely in many other places.

  • (#9006) Fix memory leak in hl.export_bgen.

  • (#9001) Fix double close error that showed up on Azure Cloud.

Version 0.2.46

Released 2020-06-17


  • (#8955) Natural language documentation search

Bug fixes

  • (#8981) Fix BlockMatrix OOM triggered by the MatrixWriteBlockMatrix WriteBlocksRDD method

Version 0.2.45

Released 2020-06-15

Bug fixes

  • (#8948) Fix integer overflow error when reading files >2G with hl.import_plink.

  • (#8903) Fix Python type annotations for empty collection constructors and hl.shuffle.

  • (#8942) Refactored VCF combiner to support other GVCF schemas.

  • (#8941) Fixed hl.import_plink with multiple data partitions.

hailctl dataproc

  • (#8946) Fix bug when a user specifies packages in hailctl dataproc start that are also dependencies of the Hail package.

  • (#8939) Support tuples in hailctl dataproc describe.

Version 0.2.44

Released 2020-06-06

New Features

  • (#8914) hl.export_vcf can now export tables as sites-only VCFs.

  • (#8894) Added hl.shuffle function to randomly permute arrays.

  • (#8854) Add composable option to parallel text export for use with gsutil compose.

Bug fixes

  • (#8883) Fix an issue related to failures in pipelines with force_bgz=True.


Performance improvements

  • (#8887) Substantially improve the performance of hl.experimental.import_gtf.

Version 0.2.43

Released 2020-05-28

Bug fixes

  • (#8867) Fix a major correctness bug occurring when calling BlockMatrix.transpose on sparse, non-symmetric BlockMatrices.

  • (#8876) Fixed “ChannelClosedException: null” in {Table, MatrixTable}.write.

Version 0.2.42

Released 2020-05-27

New Features

  • (#8822) Add optional non-centrality parameter to hl.pchisqtail.

  • (#8861) Add contig_recoding option to hl.experimental.run_combiner.

Bug fixes

  • (#8863) Fixes VCF combiner to successfully import GVCFs with alleles called as .

  • (#8845) Fixed issue where accessing an element of an ndarray in a call to Table.transmute would fail.

  • (#8855) Fix crash in filter_intervals.

Version 0.2.41

Released 2020-05-15

Bug fixes

  • (#8799)(#8786) Fix ArrayIndexOutOfBoundsException seen in pipelines that reuse a tuple value.

hailctl dataproc

  • (#8790) Use configured compute zone as default for hailctl dataproc connect and hailctl dataproc modify.

Version 0.2.40

Released 2020-05-12

VCF Combiner

  • (#8706) Add option to key by both locus and alleles for final output.

Bug fixes

  • (#8729) Fix assertion error in Table.group_by(...).aggregate(...)

  • (#8708) Fix assertion error in reading tables and matrix tables with _intervals option.

  • (#8756) Fix return type of LocusExpression.window to use locus’s reference genome instead of default RG.

Version 0.2.39

Released 2020-04-29

Bug fixes

  • (#8615) Fix contig ordering in the CanFam3 (dog) reference genome.

  • (#8622) Fix bug that causes inscrutable JVM Bytecode errors.

  • (#8645) Ease unnecessarily strict assertion that caused errors when aggregating by key (e.g. hl.experimental.spread).

  • (#8621) hl.nd.array now supports arrays with no elements (e.g. hl.nd.array([]).reshape((0, 5))) and, consequently, matmul with an inner dimension of zero.

New features

  • (#8571) hl.init(skip_logging_configuration=True) will skip configuration of Log4j. Users may use this to configure their own logging.

  • (#8588) Users who manually build Python wheels will experience less unnecessary output when doing so.

  • (#8572) Add hl.parse_json which converts a string containing JSON into a Hail object.

Performance Improvements

  • (#8535) Increase speed of import_vcf.

  • (#8618) Increase speed of Jupyter Notebook file listing and Notebook creation when buckets contain many objects.

  • (#8613) hl.experimental.export_entries_by_col stages files for improved reliability and performance.


Documentation

  • (#8619) Improve installation documentation to suggest better performing LAPACK and BLAS libraries.

  • (#8647) Clarify that a LAPACK or BLAS library is a requirement for a complete Hail installation.

  • (#8654) Add link to document describing the creation of a Microsoft Azure HDInsight Hail cluster.

Version 0.2.38

Released 2020-04-21

Critical Linreg Aggregator Correctness Bug

Performance improvements

  • (#8558) Make hl.experimental.export_entries_by_col more fault tolerant.

Version 0.2.37

Released 2020-04-14

Bug fixes

  • (#8487) Fix incorrect handling of badly formatted data for hl.gp_dosage.

  • (#8497) Fix handling of missingness for hl.hamming.

  • (#8537) Fix compile-time error.

  • (#8539) Fix compiler error in Table.multi_way_zip_join.

  • (#8488) Fix hl.agg.call_stats to appropriately throw an error for badly-formatted calls.

New features

  • (#8327) Attempting to write to the same file being read from in a pipeline will now throw an error instead of corrupting data.

Version 0.2.36

Released 2020-04-06

Critical Memory Management Bug Fix

  • (#8463) Reverted a change (separate to the bug in 0.2.34) that led to a memory leak in version 0.2.35.

Bug fixes

  • (#8371) Fix runtime error in joins leading to “Cannot set required field missing” error message.

  • (#8436) Fix compiler bug leading to possibly-invalid generated code.

Version 0.2.35

Released 2020-04-02

Critical Memory Management Bug Fix

  • (#8412) Fixed a serious per-partition memory leak that causes certain pipelines to run out of memory unexpectedly. Please update from 0.2.34.

New features

  • (#8404) Added “CanFam3” (a reference genome for dogs) as a bundled reference genome.

Bug fixes

  • (#8420) Fixed a bug where hl.binom_test’s "lower" and "upper" alternative options were reversed.

  • (#8377) Fixed “inconsistent agg or scan environments” error.

  • (#8322) Fixed bug where aggregate_rows did not interact with hl.agg.array_agg correctly.

Performance Improvements

  • (#8413) Improves internal region memory management, decreasing JVM overhead.

  • (#8383) Significantly improve GVCF import speed.

  • (#8358) Fixed memory leak in hl.experimental.export_entries_by_col.

  • (#8326) Codegen infrastructure improvement resulting in ~3% overall speedup.

hailctl dataproc

  • (#8399) Enable spark speculation by default.

  • (#8340) Add new Australia region to --vep.

  • (#8347) Support all GCP machine types as potential master machines.

Version 0.2.34

Released 2020-03-12

New features

  • (#8233) StringExpression.matches can now take a Hail StringExpression, as opposed to only regular Python strings.

  • (#8198) Improved matrix multiplication interoperation between hail NDArrayExpression and numpy.

Bug fixes

  • (#8279) Fix a bug where hl.agg.approx_cdf failed inside of a group_cols_by.

  • (#8275) Fix bad error message coming from mt.make_table() when keys are missing.

  • (#8274) Fix memory leak in hl.export_bgen.

  • (#8273) Fix segfault caused by hl.agg.downsample inside of an array_agg or group_by.

hailctl dataproc

  • (#8253) hailctl dataproc now supports new flags --requester-pays-allow-all and --requester-pays-allow-buckets. This will configure your hail installation to be able to read from requester pays buckets. The charges for reading from these buckets will be billed to the project that the cluster is created in.

  • (#8268) The data sources for VEP have been moved to gs://hail-us-vep, gs://hail-eu-vep, and gs://hail-uk-vep, which are requester-pays buckets in Google Cloud. hailctl dataproc will automatically infer which of these buckets you should pull data from based on the region your cluster is spun up in. If you are in none of those regions, please contact us on

Version 0.2.33

Released 2020-02-27

New features

  • (#8173) Added new method hl.zeros.

Bug fixes

  • (#8153) Fixed compiler bug causing MatchError in import_bgen.

  • (#8123) Fixed an issue with multiple Python HailContexts running on the same cluster.

  • (#8150) Fixed an issue where output from VEP about failures was not reported in error message.

  • (#8152) Fixed an issue where the row count of a MatrixTable coming from import_matrix_table was incorrect.

  • (#8175) Fixed a bug where persist did not actually do anything.

hailctl dataproc

  • (#8079) Using connect to open the jupyter notebook browser will no longer crash if your project contains requester-pays buckets.

Version 0.2.32

Released 2020-02-07

Critical performance regression fix

  • (#7989) Fixed performance regression leading to a large slowdown when hl.variant_qc was run after filtering columns.


Performance improvements

  • (#7962) Improved performance of hl.pc_relate.

  • (#8032) Drastically improve performance of pipelines calling hl.variant_qc and hl.sample_qc iteratively.

  • (#8037) Improve performance of NDArray matrix multiply by using native linear algebra libraries.

Bug fixes

  • (#7976) Fixed divide-by-zero error in hl.concordance with no overlapping rows or cols.

  • (#7965) Fixed optimizer error leading to crashes caused by MatrixTable.union_rows.

  • (#8035) Fix compiler bug in Table.multi_way_zip_join.

  • (#8021) Fix bug in computing shape after BlockMatrix.filter.

  • (#7986) Fix error in NDArray matrix/vector multiply.

New features

  • (#8007) Add hl.nd.diagonal function.

Cheat sheets

  • (#7940) Added cheat sheet for MatrixTables.

  • (#7963) Improved Table cheat sheet.

Version 0.2.31

Released 2020-01-22

New features

  • (#7787) Added transition/transversion information to hl.summarize_variants.

  • (#7792) Add Python stack trace to array index out of bounds errors in Hail pipelines.

  • (#7832) Add spark_conf argument to hl.init, permitting configuration of Spark runtime for a Hail session.

  • (#7823) Added datetime functions hl.experimental.strptime and hl.experimental.strftime.

  • (#7888) Added hl.nd.array constructor from nested standard arrays.

File size

  • (#7923) Fixed compression problem since 0.2.23 resulting in larger-than-expected matrix table files for datasets with few entry fields (e.g. GT-only datasets).


Performance improvements

  • (#7867) Fix performance regression leading to extra scans of data when order_by and key_by appeared close together.

  • (#7901) Fix performance regression leading to extra scans of data when group_by/aggregate and key_by appeared close together.

  • (#7830) Improve performance of array arithmetic.

Bug fixes

  • (#7922) Fix still-not-well-understood serialization error about ApproxCDFCombiner.

  • (#7906) Fix optimizer error by relaxing unnecessary assertion.

  • (#7788) Fix possible memory leak in ht.tail and ht.head.

  • (#7796) Fix bug in ingesting numpy arrays not in row-major orientation.

Version 0.2.30

Released 2019-12-20


Performance improvements

  • (#7771) Fixed extreme performance regression in scans.

  • (#7764) Fixed mt.entry_field.take performance regression.

New features

  • (#7614) Added experimental support for loops with hl.experimental.loop.


  • (#7745) Changed export_vcf to only use scientific notation when necessary.

Version 0.2.29

Released 2019-12-17

Bug fixes

  • (#7229) Fixed hl.maximal_independent_set tie breaker functionality.

  • (#7732) Fixed incompatibility with old files leading to incorrect data read when filtering intervals after read_matrix_table.

  • (#7642) Fixed crash when constant-folding functions that throw errors.

  • (#7611) Fixed hl.hadoop_ls to handle glob patterns correctly.

  • (#7653) Fixed crash in ld_prune by unfiltering missing GTs.

Performance improvements

  • (#7719) Generate more efficient IR for Table.flatten.

  • (#7740) Wrap large let bindings in methods to keep method size down.

New features

  • (#7686) Added comment argument to import_matrix_table, allowing lines with certain prefixes to be ignored.

  • (#7688) Added experimental support for NDArrayExpressions in new hl.nd module.

  • (#7608) hl.grep now has a show argument that allows users to either print the results (default) or return a dictionary of the results.

hailctl dataproc

  • (#7717) Throw an error when arguments are misspelled instead of silently quitting.

Version 0.2.28

Released 2019-11-22

Critical correctness bug fix

  • (#7588) Fixes a bug where filtering old matrix tables in newer versions of hail did not work as expected. Please update from 0.2.27.

Bug fixes

  • (#7571) Don’t set GQ to missing if PL is missing in split_multi_hts.

  • (#7577) Fixed an optimizer bug.

New Features

  • (#7561) Added hl.plot.visualize_missingness() to plot missingness patterns for MatrixTables.

  • (#7575) Added hl.version() to quickly check hail version.

hailctl dataproc

  • (#7586) hailctl dataproc now supports --gcloud_configuration option.


Cheat sheets

  • (#7570) Hail now has a cheat sheet for Tables.

Version 0.2.27

Released 2019-11-15

New Features

  • (#7379) Add delimiter argument to hl.import_matrix_table

  • (#7389) Add force and force_bgz arguments to hl.experimental.import_gtf

  • (#7386)(#7394) Add {Table, MatrixTable}.tail.

  • (#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.

  • (#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.

  • (#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
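
The "return missing on failure" behavior of the parsing functions listed above can be pictured with a plain-Python analogue. This sketch does not call Hail: parse_int32_or_none is a hypothetical helper, None stands in for a Hail missing value, and the range check is an assumption about what a 32-bit parser would reject.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def parse_int32_or_none(text):
    """Return text as an int if it is a valid 32-bit integer, else None."""
    try:
        value = int(text)
    except ValueError:
        return None
    return value if INT32_MIN <= value <= INT32_MAX else None


print(parse_int32_or_none("42"))          # 42
print(parse_int32_or_none("chr1"))        # None
print(parse_int32_or_none("4294967296"))  # None
```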

Bug fixes

  • (#7479)(#7368)(#7402) Fix optimizer bugs.

  • (#7506) Updated to latest htsjdk to resolve VCF parsing problems.

hailctl dataproc

  • (#7460) The Spark monitor widget now automatically collapses after a job completes.

Version 0.2.26

Released 2019-10-24

New Features

  • (#7325) Add string.reverse function.

  • (#7328) Add string.translate function.

  • (#7344) Add hl.reverse_complement function.

  • (#7306) Teach the VCF combiner to handle allele specific (AS_*) fields.

  • (#7346) Add hl.agg.approx_median function.

Bug Fixes

  • (#7361) Fix AD calculation in sparse_split_multi.

Performance Improvements

  • (#7355) Improve performance of IR copying.

Version 0.2.25

Released 2019-10-14

New features

  • (#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.

  • (#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).

  • (#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.

  • (#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().

  • (#7294) Add HTML output for hl.summarize_variants().

Bug fixes

  • (#7200) Fix VCF parsing with missingness inside arrays of floating-point values in the FORMAT field.

  • (#7219) Fix crash due to invalid optimizer rule.

Performance improvements

  • (#7187) Dramatically improve performance of chained BlockMatrix multiplies without checkpoints in between.

  • (#7195)(#7194) Improve performance of group[_rows]_by / aggregate.

  • (#7201) Permit code generation of larger aggregation pipelines.

Version 0.2.24

Released 2019-10-03

hailctl dataproc

  • (#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.

New features

  • (#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.

  • (#7121) Add hl.contig_length function.

  • (#7130) Add window method on LocusExpression, which creates an interval around a locus.

  • (#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.

Bug fixes

  • (#7070) Fix unintentionally strict type error in MatrixTable.union_rows.

  • (#7170) Fix issues created downstream of BlockMatrix.T.

  • (#7146) Fix bad handling of edge cases in BlockMatrix.filter.

  • (#7182) Fix problem parsing VCFs where lines end in an INFO field of type flag.

Version 0.2.23

Released 2019-09-23

hailctl dataproc

  • (#7087) Added back progress bar to notebooks, with links to the correct Spark UI url.

  • (#7104) Increased disk requested when using --vep to address the “colony collapse” cluster error mode.

Bug fixes

  • (#7066) Fixed generated code when methods from multiple reference genomes appear together.

  • (#7077) Fixed crash in hl.agg.group_by.

New features

  • (#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.

  • (#7076) Improved memory management in generated code, and added log statements about allocated memory to improve debugging.

  • (#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).

  • (#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.


Performance improvements

  • (#7086) Improved performance of JSON import.

  • (#6981) Improved performance of Hail min/max/mean operators. Improved performance of split_multi_hts by an additional 33%.

  • (#7082)(#7096)(#7098) Improved performance of large pipelines involving many annotate calls.

Version 0.2.22

Released 2019-09-12

New features

  • (#7013) Added contig_recoding to import_bed and import_locus_intervals.


Performance improvements

  • (#6969) Improved performance of hl.agg.mean, hl.agg.stats, and hl.agg.corr.

  • (#6987) Improved performance of import_matrix_table.

  • (#7033)(#7049) Various improvements leading to overall 10-15% improvement.

hailctl dataproc

  • (#7003) Pass through extra arguments for hailctl dataproc list and hailctl dataproc stop.

Version 0.2.21

Released 2019-09-03

Bug fixes

  • (#6945) Fixed expand_types to preserve ordering by key, also affects to_pandas and to_spark.

  • (#6958) Fixed stack overflow errors when counting the result of a Table.union.

New features

  • (#6856) Teach hl.agg.counter to weigh each value differently.

  • (#6903) Teach hl.range to treat a single argument as 0..N.

  • (#6903) Teach BlockMatrix how to checkpoint.


Performance improvements

  • (#6895) Improved performance of hl.import_bgen(...).count().

  • (#6948) Fixed performance bug in BlockMatrix filtering functions.

  • (#6943) Improved scaling of Table.union.

  • (#6980) Reduced compute time for split_multi_hts by as much as 40%.

hailctl dataproc

  • (#6904) Added --dry-run option to submit.

  • (#6951) Fixed --max-idle and --max-age arguments to start.

  • (#6919) Added --update-hail-version to modify.

Version 0.2.20

Released 2019-08-19

Critical memory management fix

  • (#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.

Bug fixes

  • (#6769) Fixed non-functional hl.lambda_gc method.

  • (#6847) Fixed bug in handling of NaN in hl.agg.min and hl.agg.max. These will now properly ignore NaN (the intended semantics). Note that hl.min and hl.max propagate NaN; use hl.nanmin and hl.nanmax to ignore NaN.

New features

  • (#6847) Added hl.nanmin and hl.nanmax functions.
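
The difference between NaN-propagating and NaN-ignoring minima can be sketched in plain Python as follows. These helpers are illustrative stand-ins, not Hail functions, and returning NaN when every input is NaN is an assumption of the sketch.

```python
import math


def propagating_min(values):
    """NaN-propagating minimum: any NaN input makes the result NaN."""
    values = list(values)
    if any(math.isnan(v) for v in values):
        return math.nan
    return min(values)


def nan_ignoring_min(values):
    """NaN-ignoring minimum: NaN inputs are skipped before comparing."""
    kept = [v for v in values if not math.isnan(v)]
    # Assumption: with nothing left to compare, report NaN.
    return min(kept) if kept else math.nan


data = [1.0, math.nan, 3.0]
propagated = propagating_min(data)   # NaN, because one input is NaN
ignored = nan_ignoring_min(data)     # 1.0, because NaN is skipped
```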

Version 0.2.19

Released 2019-08-01

Critical performance bug fix

  • (#6629) Fixed a critical performance bug introduced in (#6266). This bug led to long hang times when reading in Hail tables and matrix tables written in version 0.2.18.

Bug fixes

  • (#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.

  • (#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.

Performance Improvements

  • (#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.

  • (#6669) Improve performance of hl.split_multi and hl.split_multi_hts.

  • (#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.

  • (#6775) Fixed a major performance problem related to reading block matrices.

hailctl dataproc

  • (#6760) Fixed the address pointed at by ui in connect, after Google changed proxy settings that rendered the UI URL incorrect. Also added new address hist/spark-history.

Version 0.2.18

Released 2019-07-12

Critical performance bug fix

  • (#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.

Bug fixes

  • (#6598) Fixed code generated by MatrixTable.unfilter_entries to improve performance. This will slightly improve the performance of hwe_normalized_pca and relatedness computation methods, which use unfilter_entries internally.

Version 0.2.17

Released 2019-07-10

New features

  • (#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.

  • (#6405) When a matrix table has string column-keys, uses the column key as the column name.

  • (#6345) Added an improved scan implementation, which reduces the memory load on master.

  • (#6462) Added export_bgen method.

  • (#6473) Improved performance of hl.agg.array_sum by about 50%.

  • (#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.

  • (#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.

  • (#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
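
For readers unfamiliar with the genomic control inflation factor, the sketch below computes it from p-values using the textbook definition: the median observed chi-squared statistic (1 degree of freedom) divided by the median expected under the null. It uses only the Python standard library and is not Hail's implementation, which may differ in details such as how the median is approximated.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-squared distribution with 1 degree of freedom (about 0.455).
_CHI2_1_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def lambda_gc(p_values):
    """Genomic control lambda from two-sided p-values in (0, 1]."""
    # A two-sided p-value p corresponds to chi2 = z**2 with Phi(z) = 1 - p/2.
    chi2 = [_STD_NORMAL.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    return median(chi2) / _CHI2_1_MEDIAN


null_like = lambda_gc([0.5, 0.5, 0.5])      # well-calibrated: lambda is 1
inflated = lambda_gc([0.01, 0.02, 0.03])    # small p-values: lambda above 1
```

A value near 1 suggests well-calibrated test statistics, while values well above 1 suggest inflation.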

Bug fixes

  • (#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.

  • (#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.

  • (#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.

  • (#6496) Fixed html output of show methods to truncate long field contents.

  • (#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.

  • (#6504) Fix collecting data twice while running in Jupyter notebooks.

  • (#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.

  • (#6583) Fixed hl.plot.manhattan for non-default reference genomes.


  • (#6488) Exposed table.multi_way_zip_join. This takes a list of tables of identical types, and zips them together into one table.

Version 0.2.16

Released 2019-06-19


  • (#6357) Accommodated Google Dataproc bug causing cluster creation failures.

Bug fixes

  • (#6378) Fixed problem in how entry_float_type was being handled in import_vcf.

Version 0.2.15

Released 2019-06-14

After some infrastructural changes to our development process, we should be getting back to frequent releases.


Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.

New features

  • (#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.

  • (#5980) Added log option to hl.plot.histogram2d.

  • (#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.

  • (#5913) Added functionality that makes arrays of structs easier to work with.

  • (#6089) Added HTML output to when running in a notebook.

  • (#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.

  • (#6123) Added hl.binary_search to search sorted numeric arrays.

  • (#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.

  • (#6214) Updated Hail Python dependencies.

  • (#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
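
The all_matches behavior described above (returning every matching row instead of a single one) can be pictured with this plain-Python sketch of looking up intervals that overlap a position. The half-open interval convention, the function name, and the choice to return the first hit when all_matches is off are assumptions for illustration only.

```python
def index_intervals(intervals, position, all_matches=False):
    """Look up named intervals containing a position.

    intervals: iterable of (start, end, name), treated as half-open [start, end).
    """
    hits = [name for (start, end, name) in intervals if start <= position < end]
    if all_matches:
        return hits                     # every overlapping interval
    return hits[0] if hits else None    # a single match, or None if there is none


genes = [(100, 200, "geneA"), (150, 250, "geneB"), (300, 400, "geneC")]
single = index_intervals(genes, 175)
every = index_intervals(genes, 175, all_matches=True)
```

Here position 175 falls inside two intervals, so only the all_matches form reports both.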

Bug fixes

  • (#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.

  • (#6013) Turned off a feature in HTSJDK that caused crashes in hl.import_vcf by overwriting header fields whose declared type differed from the type in the VCF 4.2 spec.

  • (#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.

  • (#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.

  • (#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.

  • (#6226) Restored lost behavior where Table.show(n) with n < 0 shows the entire table.

  • (#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.

Version 0.2.14

Released 2019-04-24

A back-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.

New features

  • (#5915) Added hl.cite_hail and hl.cite_hail_bibtex functions to generate appropriate citations.

  • (#5872) Fixed hl.init when the idempotent parameter is True.

Version 0.2.13

Released 2019-04-18

Hail is now using Spark 2.4.x by default. If you build Hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.

New features

  • (#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.

  • (#5860) Improve performance of some column annotation pipelines.

  • (#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.

  • (#5799) mt.entries() is four times faster.

  • (#5756) Hail now uses Spark 2.4.x by default.

  • (#5677) MatrixTable now also supports show.

  • (#5793)(#5701) Add array.index(x), which finds the first index of the array whose value is equal to x.

  • (#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.

  • (#5690) Improve performance of ld_matrix.

  • (#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.

  • (#5758) Failure to parse an interval will now produce a much more detailed error message.

  • (#5723) hl.import_matrix_table can now import a matrix table with no columns.

  • (#5724) hl.rand_norm2d samples from a two-dimensional random normal.
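
The semantics of the new array helpers, index(x) and head(), can be mimicked in plain Python as below. None stands in for a missing value, and treating a value that is not found as missing is an assumption of this sketch rather than something the notes above state.

```python
def first_index(array, x):
    """Index of the first element equal to x, or None if there is no such element."""
    for i, value in enumerate(array):
        if value == x:
            return i
    return None  # assumption: not found is reported as missing


def head(array):
    """First element of the array, or None (missing) if the array is empty."""
    return array[0] if array else None


found = first_index([7, 3, 7], 7)
first = head([7, 3, 7])
empty = head([])
```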

Bug fixes

  • (#5885) Fix Table.to_spark in the presence of fields of tuples.

  • (#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.

  • (#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.

  • (#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.

  • (#5830)(#5835) Fix Nirvana support.

  • (#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.

  • (#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.

  • (#5770) Hail now uses the correct Unicode string encoding, which resolves a number of issues when a Table or MatrixTable has a key field containing Unicode characters.

  • (#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.

  • (#5725) Docs now consistently refer to hl.agg not agg.

  • (#5730)(#5782) Taught import_bgen to optimize its variants argument.


  • (#5732) The hl.agg.approx_quantiles aggregator computes an approximation of the quantiles of an expression.

  • (#5693)(#5396) Table._multi_way_zip_join now correctly handles keys that have been truncated.

Version 0.2.12

Released 2019-03-28

New features

  • (#5614) Add support for multiple missing values in hl.import_table.

  • (#5666) Produce HTML table output for when running in Jupyter notebook.

Bug fixes

  • (#5603)(#5697) Fixed issue where min_partitions on hl.import_table was non-functional.

  • (#5611) Fix hl.nirvana crash.


  • (#5524) Add summarize functions to Table, MatrixTable, and Expression.

  • (#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.

  • (#5571) Add log parameter to hl.plot.histogram.

  • (#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.

  • (#5608) Add LD score simulation framework.

  • (#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.

Version 0.2.11

Released 2019-03-06

New features

  • (#5374) Add default arguments to hl.add_sequence for running on GCP.

  • (#5481) Added sample_cols method to MatrixTable.

  • (#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.

  • (#5480) Added n_cols argument to MatrixTable.head.

  • (#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.

  • (#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
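
For readers new to the terms, a semi join keeps the rows whose key appears in another dataset, and an anti join keeps the rows whose key does not. The plain-Python sketch below shows both on lists of dictionaries; the function names and the row representation are illustrative and not part of Hail.

```python
def semi_join(rows, other_keys, key):
    """Keep rows whose key value appears in other_keys."""
    wanted = set(other_keys)
    return [row for row in rows if row[key] in wanted]


def anti_join(rows, other_keys, key):
    """Keep rows whose key value does NOT appear in other_keys."""
    unwanted = set(other_keys)
    return [row for row in rows if row[key] not in unwanted]


samples = [{"id": "s1"}, {"id": "s2"}, {"id": "s3"}]
kept = semi_join(samples, ["s1", "s3", "s9"], key="id")
dropped = anti_join(samples, ["s1", "s3", "s9"], key="id")
```

Neither operation adds fields from the other dataset, so together they partition the original rows.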

Bug fixes

  • (#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.

  • (#5419) Resolved issue with import_vcf force_bgz and file size checks.

  • (#5427) Resolved issue with and dictionary field types.

  • (#5468) Resolved ordering problem with on key fields that are not the first key.

  • (#5492) Fixed hl.agg.collect crashing when collecting float32 values.

  • (#5525) Fixed hl.trio_matrix crashing when complete_trios is False.

Version 0.2.10

Released 2019-02-15

New features

  • (#5272) Added a new ‘delimiter’ option to Table.export.

  • (#5251) Add utility aliases to hl.plot for output_notebook and show.

  • (#5249) Add histogram2d function to hl.plot module.

  • (#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.

  • (#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
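
A rough picture of what line-level filter and find_replace options do to text input, written with Python's re module. This is a sketch under assumptions: lines partially matching the filter pattern are dropped, the filter is applied before substitution, and Python regex syntax stands in for whatever engine Hail uses. The function and parameter names are hypothetical.

```python
import re


def preprocess_lines(lines, filter_pattern=None, find_replace=None):
    """Drop lines matching filter_pattern, then apply a (pattern, replacement) pair."""
    out = []
    for line in lines:
        if filter_pattern is not None and re.search(filter_pattern, line):
            continue                                   # line is removed entirely
        if find_replace is not None:
            pattern, replacement = find_replace
            line = re.sub(pattern, replacement, line)  # rewrite the surviving line
        out.append(line)
    return out


raw = ["#comment", "chr1\t100", "chr2\t200"]
cleaned = preprocess_lines(raw, filter_pattern=r"^#", find_replace=(r"^chr", ""))
```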

Performance improvements

  • (#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.

Bug fixes

  • (#5306) Fix ReferenceGenome.add_sequence causing a crash.

  • (#5268) Fix Table.export writing a file called ‘None’ in the current directory.

  • (#5265) Fix hl.get_reference raising an exception when called before hl.init().

  • (#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.

  • (#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.

  • (#5294) Fix crash in hl.trio_matrix when sample IDs are missing.

  • (#5295) Fix crash in Table.index related to key field incompatibilities.

Version 0.2.9

Released 2019-01-30

New features

  • (#5149) Added bitwise transformation functions: hl.bit_{and, or, xor, not, lshift, rshift}.

  • (#5154) Added hl.rbind function, which is similar to hl.bind but expects a function as the last argument instead of the first.
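
The only difference between the two binding forms is where the function goes, which the plain-Python analogy below makes explicit. These are ordinary Python functions written for illustration, not Hail's implementation, and they operate on plain values rather than Hail expressions.

```python
def bind(f, *values):
    """Function first, then the values it is applied to."""
    return f(*values)


def rbind(*args):
    """Values first, with the function as the last argument."""
    *values, f = args
    return f(*values)


a = bind(lambda x, y: x + y, 1, 2)
b = rbind(1, 2, lambda x, y: x + y)
```

Putting the function last tends to read better when it is a multi-line lambda or a named helper defined just above the call.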

Performance improvements

  • (#5107) Hail’s Python interface generates tighter intermediate code, which should result in moderate performance improvements in many pipelines.

  • (#5172) Fix unintentional performance deoptimization related to introduced in 0.2.8.

  • (#5078) Improve performance of hl.ld_prune by up to 30x.

Bug fixes

  • (#5144) Fix crash caused by hl.index_bgen (since 0.2.7).

  • (#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.

  • (#5173) Fix bug causing to throw an error when the table is empty (since 0.2.8).

  • (#5210) Fix bug causing to always print types, regardless of types argument (since 0.2.8).

  • (#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).

Version 0.2.8

Released 2019-01-15

New features

  • (#5072) Added multi-phenotype option to hl.logistic_regression_rows.

  • (#5077) Added support for importing VCF floating-point FORMAT fields as float32 as well as float64.

Performance improvements

  • (#5068) Improved optimization of MatrixTable.count_cols.

  • (#5131) Fixed performance bug related to hl.literal on large values with missingness.

Bug fixes

  • (#5088) Fixed name separator in MatrixTable.make_table.

  • (#5104) Fixed optimizer bug related to experimental functionality.

  • (#5122) Fixed error constructing Table or MatrixTable objects with fields with certain character patterns like $.

Version 0.2.7

Released 2019-01-03

New features

  • (#5046)(experimental) Added option to BlockMatrix.export_rectangles to export as NumPy-compatible binary.

Performance improvements

  • (#5050) Short-circuit iteration in logistic_regression_rows and poisson_regression_rows if NaNs appear.

Version 0.2.6

Released 2018-12-17

New features

  • (#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.

  • (#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.

  • (#4926) Expanded default GRCh38 contig recoding behavior in import_plink.

Performance improvements

  • (#4952) Resolved lingering issues related to (#4909).

Bug fixes

  • (#4941) Fixed variable scoping error in regression methods.

  • (#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.

  • (#4932) Fixed possible error in export_plink related to tolerance of writer process failure.

  • (#4920) Fixed bad error message in Table.order_by.

Version 0.2.5

Released 2018-12-07

New features

  • (#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.

  • (#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.

  • (#4820) Added hl.reversed for reversing arrays and strings.

  • (#4895) Added include_strand option to the hl.liftover function.

Performance improvements

  • (#4907)(#4911) Addressed one aspect of bad scaling in enormous literal values (triggered by a list of 300,000 sample IDs) related to logging.

  • (#4909)(#4914) Fixed a check in Table/MatrixTable initialization that scaled O(n^2) with the total number of fields.

Bug fixes

  • (#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.

  • (#4888) Fixed assertion error in BlockMatrix.sum.

  • (#4871) Fixed possible error in locally sorting nested collections.

  • (#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.

  • (#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].

Version 0.2.4: Beginning of history!

We didn’t start manually curating information about user-facing changes until version 0.2.4.

The full commit history is available here.