Change Log

0.2.22

Released 2019-09-12

New features

  • (#7013) Add contig_recoding to import_bed and import_locus_intervals.

Performance

  • (#6969) Improve performance of hl.agg.mean, hl.agg.stats, and hl.agg.corr.
  • (#6987) Improve performance of import_matrix_table.
  • (#7033)(#7049) Various improvements leading to an overall 10-15% performance improvement.

hailctl dataproc

  • (#7003) Pass through extra arguments for hailctl dataproc list and hailctl dataproc stop.

0.2.21

Released 2019-09-03

Bug fixes

  • (#6945) Fix expand_types to preserve ordering by key. This also affects to_pandas and to_spark.
  • (#6958) Fix stack overflow errors when counting the result of a Table.union.

New features

  • (#6856) Teach hl.agg.counter to weigh each value differently.
  • (#6903) Teach hl.range to treat a single argument as 0..N.
  • (#6903) Teach BlockMatrix how to checkpoint.
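
As a plain-Python illustration of the semantics described in the first two entries above (this is not Hail code, and the helper name weighted_counter is hypothetical), a weighted counter sums a per-value weight instead of adding 1 per occurrence. The single-argument range is assumed here to mirror Python's built-in range, which excludes N itself:

```python
from collections import defaultdict


def weighted_counter(values, weights):
    """Sum the weight of each occurrence instead of counting 1 per occurrence."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return dict(totals)


# "a" appears twice with weights 0.5 and 1.5; "b" once with weight 2.0.
counts = weighted_counter(["a", "b", "a"], [0.5, 2.0, 1.5])

# Python's own range treats a single argument N as 0, 1, ..., N - 1.
single_arg_range = list(range(3))
```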

Performance

  • (#6895) Improve performance of hl.import_bgen(...).count().
  • (#6948) Fix performance bug in BlockMatrix filtering functions.
  • (#6943) Improve scaling of Table.union.
  • (#6980) Reduce compute time for split_multi_hts by as much as 40%.

hailctl dataproc

  • (#6904) Add --dry-run option to submit.
  • (#6951) Fix --max-idle and --max-age arguments to start.
  • (#6919) Add --update-hail-version to modify.

0.2.20

Released 2019-08-19

Critical memory management fix

  • (#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.

Bug fixes

  • (#6769) Fix non-functional hl.lambda_gc method.
  • (#6847) Fix bug in handling of NaN in hl.agg.min and hl.agg.max. These will now properly ignore NaN (the intended semantics). Note that hl.min and hl.max propagate NaN; use hl.nanmin and hl.nanmax to ignore NaN.

New features

  • (#6847) Added hl.nanmin and hl.nanmax functions.
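
The difference between propagating and ignoring NaN can be sketched in plain Python. This only illustrates the two semantics named in (#6847) and is not Hail's implementation. The function names are hypothetical, and returning NaN when every value is NaN is an assumption of the sketch:

```python
import math


def propagating_min(values):
    """Return NaN if any value is NaN, otherwise the ordinary minimum."""
    if any(math.isnan(v) for v in values):
        return math.nan
    return min(values)


def nan_ignoring_min(values):
    """Drop NaN values before taking the minimum."""
    kept = [v for v in values if not math.isnan(v)]
    # Assumption of this sketch: all-NaN input yields NaN.
    return min(kept) if kept else math.nan


data = [3.0, math.nan, 1.0]
propagated = propagating_min(data)  # NaN, because one input is NaN
ignored = nan_ignoring_min(data)    # 1.0, because NaN is skipped
```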

0.2.19

Released 2019-08-01

Critical performance bug fix

  • (#6629) Fixed a critical performance bug introduced in (#6266). This bug led to long hang times when reading in Hail tables and matrix tables written in version 0.2.18.

Bug fixes

  • (#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
  • (#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.

Performance Improvements

  • (#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
  • (#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
  • (#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
  • (#6775) Fixed a major performance problem related to reading block matrices.

hailctl dataproc

  • (#6760) Fixed the address pointed at by ui in connect, after Google changed proxy settings that rendered the UI URL incorrect. Also added new address hist/spark-history.

0.2.18

Released 2019-07-12

Critical performance bug fix

  • (#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.

Bug fixes

  • (#6598) Fixed code generated by MatrixTable.unfilter_entries to improve performance. This will slightly improve the performance of hwe_normalized_pca and relatedness computation methods, which use unfilter_entries internally.

0.2.17

Released 2019-07-10

New features

  • (#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
  • (#6405) When a matrix table has string column keys, MatrixTable.show uses the column key as the column name.
  • (#6345) Added an improved scan implementation, which reduces the memory load on master.
  • (#6462) Added export_bgen method.
  • (#6473) Improved performance of hl.agg.array_sum by about 50%.
  • (#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
  • (#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
  • (#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.

Bug fixes

  • (#6404) n_rows and n_cols parameters added to Expression.show for consistency with other show methods.
  • (#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
  • (#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
  • (#6496) Fixed html output of show methods to truncate long field contents.
  • (#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
  • (#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
  • (#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
  • (#6583) Fixed hl.plot.manhattan for non-default reference genomes.

Experimental

  • (#6488) Exposed table.multi_way_zip_join. This takes a list of tables of identical types, and zips them together into one table.

0.2.16

Released 2019-06-19

hailctl

  • (#6357) Accommodated Google Dataproc bug causing cluster creation failures.

Bug fixes

  • (#6378) Fixed problem in how entry_float_type was being handled in import_vcf.

0.2.15

Released 2019-06-14

After some infrastructural changes to our development process, we should be getting back to frequent releases.

hailctl

Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.

New features

  • (#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
  • (#5980) Added log option to hl.plot.histogram2d.
  • (#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
  • (#5913) Added functionality that makes arrays of structs easier to work with.
  • (#6089) Added HTML output to Expression.show when running in a notebook.
  • (#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
  • (#6123) Added hl.binary_search to search sorted numeric arrays.
  • (#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
  • (#6214) Updated Hail Python dependencies.
  • (#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
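
The idea behind the optimizer pass in (#5979) can be sketched in plain Python. On data sorted by key, a filter such as key == target is equivalent to the closed interval [target, target], whose bounds can be located by binary search rather than by testing every row. The function names below are hypothetical, and the sketch ignores partitioning and everything else Hail does in practice:

```python
from bisect import bisect_left, bisect_right


def full_scan(sorted_keys, target):
    """Baseline: test every key against the filter predicate."""
    return [i for i, key in enumerate(sorted_keys) if key == target]


def interval_lookup(sorted_keys, target):
    """Treat `key == target` as the interval [target, target] and find
    its bounds by binary search instead of scanning."""
    lo = bisect_left(sorted_keys, target)
    hi = bisect_right(sorted_keys, target)
    return list(range(lo, hi))


keys = [1, 3, 3, 7, 9, 12]
hits = interval_lookup(keys, 3)    # same rows as the scan, found in O(log n)
misses = interval_lookup(keys, 5)  # no matching key, so the result is empty
```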

Bug fixes

  • (#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
  • (#6013) Turned off a feature in HTSJDK that overwrote the type of a header field when it differed from the type in the VCF 4.2 spec, which caused crashes in hl.import_vcf.
  • (#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
  • (#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
  • (#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
  • (#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
  • (#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.

0.2.14

Released 2019-04-24

A backward-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.

New features

  • (#5915) Added hl.cite_hail and hl.cite_hail_bibtex functions to generate appropriate citations.
  • (#5872) Fixed hl.init when the idempotent parameter is True.

0.2.13

Released 2019-04-18

Hail is now using Spark 2.4.x by default. If you build Hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.

New features

  • (#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
  • (#5860) Improve performance of some column annotation pipelines.
  • (#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
  • (#5799) mt.entries() is four times faster.
  • (#5756) Hail now uses Spark 2.4.x by default.
  • (#5677) MatrixTable now also supports show.
  • (#5793)(#5701) Add array.index(x) which finds the first index of the array whose value is equal to x.
  • (#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
  • (#5690) Improve performance of ld_matrix.
  • (#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
  • (#5758) Failure to parse an interval will now produce a much more detailed error message.
  • (#5723) hl.import_matrix_table can now import a matrix table with no columns.
  • (#5724) hl.rand_norm2d samples from a two-dimensional normal distribution.
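
The behavior described for array.index(x) and array.head() can be sketched on ordinary Python lists, with None standing in for a missing value. The helper names are hypothetical, and returning missing when x is not found is an assumption, since the entry above only specifies the empty-array case for head():

```python
def index_of(array, x):
    """First index whose element equals x, or None if x is absent (assumed)."""
    for i, element in enumerate(array):
        if element == x:
            return i
    return None


def head(array):
    """First element, or None (standing in for missing) if the array is empty."""
    return array[0] if array else None


first_seven = index_of([5, 7, 9, 7], 7)  # index of the first 7, not the last
empty_head = head([])                    # missing, since there is no first element
```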

Bug fixes

  • (#5885) Fix Table.to_spark in the presence of fields of tuples.
  • (#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
  • (#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
  • (#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
  • (#5830)(#5835) Fix Nirvana support.
  • (#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
  • (#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
  • (#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
  • (#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
  • (#5725) Docs now consistently refer to hl.agg not agg.
  • (#5730)(#5782) Taught import_bgen to optimize its variants argument.

Experimental

  • (#5732) The hl.agg.approx_quantiles aggregator computes an approximation of the quantiles of an expression.
  • (#5693)(#5396) Table._multi_way_zip_join now correctly handles keys that have been truncated.

0.2.12

Released 2019-03-28

New features

  • (#5614) Add support for multiple missing values in hl.import_table.
  • (#5666) Produce HTML table output for Table.show() when running in Jupyter notebook.

Bug fixes

  • (#5603)(#5697) Fixed issue where min_partitions on hl.import_table was non-functional.
  • (#5611) Fix hl.nirvana crash.

Experimental

  • (#5524) Add summarize functions to Table, MatrixTable, and Expression.
  • (#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
  • (#5571) Add log parameter to hl.plot.histogram.
  • (#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
  • (#5608) Add LD score simulation framework.
  • (#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.

0.2.11

Released 2019-03-06

New features

  • (#5374) Add default arguments to hl.add_sequence for running on GCP.
  • (#5481) Added sample_cols method to MatrixTable.
  • (#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
  • (#5480) Added n_cols argument to MatrixTable.head.
  • (#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
  • (#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
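
For readers new to the join flavors added in (#5529), here is a plain-Python sketch of the usual semi-join and anti-join semantics on lists of dictionaries. The function names and the row layout are hypothetical and are not Hail's API. Both operations only filter the left side and add no fields from the right:

```python
def semi_join(rows, other_keys, key):
    """Keep rows whose key appears in the other table."""
    return [row for row in rows if row[key] in other_keys]


def anti_join(rows, other_keys, key):
    """Keep rows whose key does not appear in the other table."""
    return [row for row in rows if row[key] not in other_keys]


samples = [{"s": "A", "dp": 30}, {"s": "B", "dp": 12}, {"s": "C", "dp": 25}]
passing = {"A", "C"}  # keys of the other table

kept = semi_join(samples, passing, "s")
dropped = anti_join(samples, passing, "s")
```

Together the two results partition the left table: every row lands in exactly one of them.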

Bug fixes

  • (#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
  • (#5419) Resolved issue with import_vcf force_bgz and file size checks.
  • (#5427) Resolved issue with Table.show and dictionary field types.
  • (#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
  • (#5492) Fixed hl.agg.collect crashing when collecting float32 values.
  • (#5525) Fixed hl.trio_matrix crashing when complete_trios is False.

0.2.10

Released 2019-02-15

New features

  • (#5272) Added a new ‘delimiter’ option to Table.export.
  • (#5251) Add utility aliases to hl.plot for output_notebook and show.
  • (#5249) Add histogram2d function to hl.plot module.
  • (#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
  • (#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.

Performance improvements

  • (#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
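
The size reduction in (#5298) relies on the VCF convention that trailing missing fields in a sample column may be dropped. A minimal sketch of that trimming follows. The function name is hypothetical, and whether Hail applies the trimming beyond fully missing genotypes is not stated above:

```python
def format_genotype(fields):
    """Join per-sample FORMAT values with ':' after dropping trailing '.' values.

    `fields` is a list of strings in FORMAT order; '.' marks a missing value.
    The first field is always kept.
    """
    kept = list(fields)
    while len(kept) > 1 and kept[-1] == ".":
        kept.pop()
    return ":".join(kept)


all_missing = format_genotype(["./.", ".", ".", "."])  # only the first field survives
interior = format_genotype(["0/1", ".", "12"])         # interior '.' must stay
```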

Bug fixes

  • (#5306) Fix ReferenceGenome.add_sequence causing a crash.
  • (#5268) Fix Table.export writing a file called ‘None’ in the current directory.
  • (#5265) Fix hl.get_reference raising an exception when called before hl.init().
  • (#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
  • (#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
  • (#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
  • (#5295) Fix crash in Table.index related to key field incompatibilities.

0.2.9

Released 2019-01-30

New features

  • (#5149) Added bitwise transformation functions: hl.bit_{and, or, xor, not, lshift, rshift}.
  • (#5154) Added hl.rbind function, which is similar to hl.bind but expects a function as the last argument instead of the first.
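
The difference in argument order between hl.bind and hl.rbind described in (#5154) can be mimicked with two small plain-Python helpers. These are hypothetical stand-ins that operate on ordinary values, not Hail expressions:

```python
def bind(f, *values):
    """Function first, followed by the values it is applied to."""
    return f(*values)


def rbind(*args):
    """Values first, function last."""
    *values, f = args
    return f(*values)


total_bind = bind(lambda x, y: x + y, 1, 2)
total_rbind = rbind(1, 2, lambda x, y: x + y)
```

Placing the function last tends to read better when it is a multi-line lambda, since the bound values appear before the body that uses them.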

Performance improvements

  • (#5107) Hail’s Python interface generates tighter intermediate code, which should result in moderate performance improvements in many pipelines.
  • (#5172) Fix unintentional performance deoptimization related to Table.show introduced in 0.2.8.
  • (#5078) Improve performance of hl.ld_prune by up to 30x.

Bug fixes

  • (#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
  • (#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
  • (#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
  • (#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
  • (#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).

0.2.8

Released 2019-01-15

New features

  • (#5072) Added multi-phenotype option to hl.logistic_regression_rows.
  • (#5077) Added support for importing VCF floating-point FORMAT fields as float32 as well as float64.

Performance improvements

  • (#5068) Improved optimization of MatrixTable.count_cols.
  • (#5131) Fixed performance bug related to hl.literal on large values with missingness.

Bug fixes

  • (#5088) Fixed name separator in MatrixTable.make_table.
  • (#5104) Fixed optimizer bug related to experimental functionality.
  • (#5122) Fixed error constructing Table or MatrixTable objects with fields with certain character patterns like $.

0.2.7

Released 2019-01-03

New features

  • (#5046)(experimental) Added option to BlockMatrix.export_rectangles to export as NumPy-compatible binary.

Performance improvements

  • (#5050) Short-circuit iteration in logistic_regression_rows and poisson_regression_rows if NaNs appear.

0.2.6

Released 2018-12-17

New features

  • (#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
  • (#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
  • (#4926) Expanded default GRCh38 contig recoding behavior in import_plink.

Performance improvements

  • (#4952) Resolved lingering issues related to (#4909).

Bug fixes

  • (#4941) Fixed variable scoping error in regression methods.
  • (#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
  • (#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
  • (#4920) Fixed bad error message in Table.order_by.

0.2.5

Released 2018-12-07

New features

  • (#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
  • (#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
  • (#4820) Added hl.reversed for reversing arrays and strings.
  • (#4895) Added include_strand option to the hl.liftover function.

Performance improvements

  • (#4907)(#4911) Addressed one aspect of bad scaling in enormous literal values (triggered by a list of 300,000 sample IDs) related to logging.
  • (#4909)(#4914) Fixed a check in Table/MatrixTable initialization that scaled O(n^2) with the total number of fields.

Bug fixes

  • (#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
  • (#4888) Fixed assertion error in BlockMatrix.sum.
  • (#4871) Fixed possible error in locally sorting nested collections.
  • (#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
  • (#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].

0.2.4: Beginning of history!

We didn’t start manually curating information about user-facing changes until version 0.2.4.

The full commit history is available here.