Change Log¶
Frequently Asked Questions¶
With a version like 0.x, is Hail ready for use in publications?¶
Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.
What stability is guaranteed?¶
We do not intentionally break back-compatibility of interfaces or file formats. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. The exception to this rule is experimental functionality, denoted as such in the reference documentation, which may change at any time.
Please note that forward compatibility should not be expected, especially relating to file formats: this means that it may not be possible to use an earlier version of Hail to read files written in a later version.
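The two rules above can be read as a simple version comparison. The sketch below is only an illustration under stated assumptions: parse_version, script_should_run, and file_should_be_readable are hypothetical helper names (not part of Hail), version strings are assumed to be plain "0.2.N", and experimental functionality is outside its scope.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a plain 'major.minor.patch' string into integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def script_should_run(developed_on: str, running_on: str) -> bool:
    """Backward compatibility: a script written for an earlier release is
    expected to keep working on any later release in the same 0.x series."""
    dev, run = parse_version(developed_on), parse_version(running_on)
    return dev[:2] == run[:2] and run >= dev


def file_should_be_readable(written_by: str, read_by: str) -> bool:
    """No forward compatibility: only rely on reading a file with a release
    at least as new as the one that wrote it."""
    written, reader = parse_version(written_by), parse_version(read_by)
    return written[:2] == reader[:2] and reader >= written
```

For example, script_should_run("0.2.5", "0.2.112") is True, while file_should_be_readable("0.2.112", "0.2.5") is False.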
Version 0.2.112¶
Released 2023-03-15
Version 0.2.111¶
Released 2023-03-13
Version 0.2.110¶
Released 2023-03-08
New Features¶
(#12643) In Query on Batch, hl.skat(..., logistic=True) is now supported.
(#12643) In Query on Batch, hl.liftover is now supported.
(#12629) In Query on Batch, hl.ibd is now supported.
(#12722) Add hl.simulate_random_mating to generate a population from founders under the assumption of random mating.
(#12701) Query on Spark now officially supports Spark 3.3.0 and Dataproc 2.1.x.
Performance Improvements¶
(#12679) In Query on Batch, hl.balding_nichols_model is slightly faster. Also added hl.utils.genomic_range_table to quickly create a table keyed by locus.
Bug Fixes¶
(#12711) In Query on Batch, fix null pointer exception (manifesting as scala.MatchError: null) when reading data from requester pays buckets.
(#12739) Fix hl.plot.cdf, hl.plot.pdf, and hl.plot.joint_plot, which were broken by changes in Hail and changes in bokeh.
(#12735) Fix (#11738) by allowing user to override default types in to_pandas.
(#12760) Mitigate some JVM bytecode generation errors, particularly those related to too many method parameters.
(#12766) Fix (#12759) by loosening parsimonious dependency pin.
(#12732) In Query on Batch, fix bug that sometimes prevented terminating a pipeline using Control-C.
(#12771) Use a version of jgscm whose version complies with PEP 440.
Version 0.2.109¶
Released 2023-02-08
New Features¶
(#12605) Add hl.pgenchisq, the cumulative distribution function of the generalized chi-squared distribution.
(#12637) Query-on-Batch now supports hl.skat(..., logistic=False).
(#12645) Added hl.vds.truncate_reference_blocks to transform a VDS to checkpoint reference blocks in order to drastically improve interval filtering performance. Also added hl.vds.merge_reference_blocks to merge adjacent reference blocks according to user criteria to better compress reference data.
Bug Fixes¶
(#12650) Hail will now throw an exception on hl.export_bgen when there is no GP field, instead of exporting null records.
(#12635) Fix bug where hl.skat did not work on Apple M1 machines.
(#12571) When using Query-on-Batch, hl.hadoop* methods now properly support creation and modification time.
(#12566) Improve error message when combining incompatibly indexed fields in certain operations including array indexing.
Version 0.2.108¶
Released 2023-01-12
Bug fixes¶
(#12585) hail.ggplot plots that have more than one legend group or facet are now interactive. If such a plot has enough legend entries that the legend would be taller than the plot, the legend will now be scrollable. Legend entries for such plots can be clicked to show/hide traces on the plot, but this does not currently work; it is a known issue that will only be addressed if hail.ggplot is migrated off of plotly.
(#12584) Fixed bug which arose as an assertion error about type mismatches. This was usually triggered when working with tuples.
(#12583) Fixed bug which showed an empty table for ht.col_key.show().
(#12582) Fixed bug where matrix tables with duplicate col keys do not show properly. Also fixed bug where tables and matrix tables with HTML unsafe column headers are rendered wrong in Jupyter.
(#12574) Fixed a memory leak when processing tables. Could trigger unnecessarily high memory use and out of memory errors when there are many rows per partition or large key fields.
(#12565) Fixed a bug that prevented exploding on a field of a Table whose value is a random value.
Version 0.2.107¶
Released 2022-12-14
Version 0.2.106¶
Released 2022-12-13
New Features¶
(#12522) Added hailctl config setting 'batch/backend' to specify the default backend to use in batch scripts when not specified in code.
(#12497) Added support for scales, nrow, and ncol arguments, as well as grouped legends, to hail.ggplot.facet_wrap.
(#12471) Added hailctl batch submit command to run local scripts inside batch jobs.
(#12525) Add support for passing arguments to hailctl batch submit.
(#12465) Batch jobs’ status now contains the region the job ran in. The job itself can access which region it is in through the HAIL_REGION environment variable.
(#12464) When using Query-on-Batch, all jobs for a single hail session are inserted into the same batch instead of one batch per action.
(#12457) pca and hwe_normalized_pca are now supported in Query-on-Batch.
(#12376) Added hail.query_table function for reading tables with indices from Python.
(#12139) Random number generation has been updated, but shouldn’t affect most users. If you need to manually set seeds, see https://hail.is/docs/0.2/functions/random.html for details.
(#11884) Added Job.always_copy_output when using the ServiceBackend. The default behavior is False, which is a breaking change from the previous behavior to always copy output files regardless of the job’s completion state.
Bug Fixes¶
(#12487) Fixed a bug causing rare but deterministic job failures deserializing data in Query-on-Batch.
(#12535) QoB will now error if the user reads from and writes to the same path. QoB also now respects the user’s configuration of disable_progress_bar. When disable_progress_bar is unspecified, QoB only disables the progress bar for non-interactive sessions.
(#12517) Fix a performance regression that appears when using hl.split_multi_hts among other methods.
Version 0.2.105¶
Released 2022-10-31 🎃
Bug Fixes¶
(#12384) Fixed a critical bug that disabled tree aggregation and scan executions in 0.2.104, leading to out-of-memory errors.
(#12265) Fix long-standing bug wherein hl.agg.collect_as_set and hl.agg.counter error when applied to types which, in Python, are unhashable. For example, hl.agg.counter(t.list_of_genes) will not error when t.list_of_genes is a list. Instead, the counter dictionary will use FrozenList keys from the frozenlist package.
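The underlying Python limitation behind (#12265) can be reproduced without Hail. The sketch below is only an illustration: it uses the standard library's collections.Counter and a tuple as a stand-in for FrozenList, and the gene lists are made-up example data.

```python
from collections import Counter

# Made-up example data: one list of gene names per row.
genes_per_row = [["BRCA1", "TP53"], ["TP53"], ["BRCA1", "TP53"]]

# Lists are mutable and therefore unhashable, so they cannot be dictionary keys.
try:
    Counter(genes_per_row)
    list_keys_work = True
except TypeError:
    list_keys_work = False

# An immutable, hashable stand-in works as a key (tuple here; the changelog
# entry says Hail uses FrozenList from the frozenlist package).
counts = Counter(tuple(genes) for genes in genes_per_row)

print(list_keys_work)              # False
print(counts[("BRCA1", "TP53")])   # 2
```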
Version 0.2.104¶
Release 2022-10-19
Version 0.2.103¶
Release 2022-10-18
Version 0.2.102¶
Released 2022-10-06
New Features¶
(#12218) Missing values are now supported in primitive columns in Table.to_pandas.
(#12254) Cross-product-style legends for data groups have been replaced with factored ones (consistent with ggplot2’s implementation) for hail.ggplot.geom_point, and support has been added for custom legend group labels.
(#12268) VariantDataset now implements union_rows for combining datasets with the same samples but disjoint variants.
Version 0.2.101¶
Released 2022-10-04
New Features¶
(#12218) Support missing values in primitive columns in Table.to_pandas.
(#12195) Add impute_sex_chr_ploidy_from_interval_coverage to impute sex ploidy directly from a coverage MT.
(#12222) Query-on-Batch pipelines now add worker jobs to the same batch as the driver job instead of producing a new batch per stage.
(#12244) Added support for custom labels for per-group legends to hail.ggplot.geom_point via the legend_format keyword argument.
Version 0.2.100¶
Released 2022-09-23
Version 0.2.99¶
Released 2022-09-13
New Features¶
Version 0.2.98¶
Released 2022-08-22
New Features¶
(#12062) hl.balding_nichols_model now supports an optional boolean parameter, phased, to control the phasedness of the generated genotypes.
Performance improvements¶
Bug fixes¶
(#12115) When using use_new_shuffle=True, fix a bug when there are more than 2^31 rows.
(#12074) Fix bug where hl.init could silently overwrite the global random seed.
(#12079) Fix bug in handling of missing (aka NA) fields in grouped aggregation and distinct by key.
(#12056) Fix hl.export_vcf to actually create tabix files when requested.
(#12020) Fix bug in hl.experimental.densify which manifested as an AssertionError about dtypes.
Version 0.2.97¶
Released 2022-06-30
Bug fixes¶
(#11962) Fix error (logged as (#11891)) in VCF combiner when exactly 10 or 100 files are combined.
(#11969) Fix import_table and import_lines to use multiple partitions when force_bgz is used.
(#11964) Fix erroneous “Bucket is a requester pays bucket but no user project provided.” errors in Google Dataproc by updating to the latest Dataproc image version.
Version 0.2.96¶
Released 2022-06-21
Bug fixes¶
(#11905) Fix erroneous FileNotFoundError in glob patterns
(#11921) and (#11910) Fix file clobbering during text export with speculative execution.
(#11920) Fix array out of bounds error when tree aggregating a multiple of 50 partitions.
(#11937) Fixed correctness bug in scan order for Table.annotate and MatrixTable.annotate_rows in certain circumstances.
(#11887) Escape VCF description strings when exporting.
(#11886) Fix an error in an example in the docs for hl.split_multi.
Version 0.2.95¶
Released 2022-05-13
New features¶
(#11809) Export dtypes_from_pandas in expr.types.
(#11807) Teach smoothed_pdf to add a plot to an existing figure.
(#11746) The ServiceBackend, in interactive mode, will print a link to the currently executing driver batch.
(#11759) hl.logistic_regression_rows, hl.poisson_regression_rows, and hl.skat all now support configuration of the maximum number of iterations and the tolerance.
(#11835) Add hl.ggplot.geom_density which renders a plot of an approximation of the probability density function of its argument.
Bug fixes¶
(#11815) Fix incorrectly missing entries in to_dense_mt at the position of ref block END.
(#11828) Fix hl.init to not ignore its sc argument. This bug was introduced in 0.2.94.
(#11830) Fix an error and relax a timeout which caused hailtop.aiotools.copy to hang.
(#11778) Fix a (different) error which could cause hangs in hailtop.aiotools.copy.
Version 0.2.93¶
Release 2022-03-27
Beta features¶
Several issues with the beta version of Hail Query on Hail Batch are addressed in this release.
Version 0.2.92¶
Release 2022-03-25
New features¶
(#11613) Add hl.ggplot support for scale_fill_hue, scale_color_hue, scale_fill_manual, and scale_color_manual. This allows for an infinite number of discrete colors.
(#11608) Add all remaining and all versions of extant public gnomAD datasets to the Hail Annotation Database and Datasets API. Current as of March 23rd 2022.
(#11662) Add the weight aesthetic to geom_bar.
Beta features¶
This version of Hail includes all the necessary client-side infrastructure to execute Hail Query pipelines on a Hail Batch cluster. This effectively enables a “serverless” version of Hail Query which is independent of Apache Spark. Broad affiliated users should contact the Hail team for help using Hail Query on Hail Batch. Unaffiliated users should also contact the Hail team to discuss the feasibility of running your own Hail Batch cluster. The Hail team is accessible at both https://hail.zulipchat.com and https://discuss.hail.is .
Version 0.2.91¶
Release 2022-03-18
Bug fixes¶
(#11614) Update hail.utils.tutorial.get_movie_lens to use https instead of http. Movie Lens has stopped serving data over insecure HTTP.
(#11563) Fix issue hail-is/hail#11562.
(#11611) Fix a bug that prevents the display of hl.ggplot.geom_hline and hl.ggplot.geom_vline.
Version 0.2.90¶
Release 2022-03-11
Critical BlockMatrix from_numpy correctness bug¶
(#11555) BlockMatrix.from_numpy did not work correctly. Version 1.0 of org.scalanlp.breeze, a dependency of Apache Spark that hail also depends on, has a correctness bug that results in BlockMatrices that repeat the top left block of the block matrix for every block. This affected anyone running Spark 3.0.x or 3.1.x.
Version 0.2.88¶
Release 2022-03-01
This release addresses the deploy issues in the 0.2.87 release of Hail.
Version 0.2.87¶
Release 2022-02-28
An error in the deploy process required us to yank this release from PyPI. Please do not use this release.
Version 0.2.84¶
Release 2022-02-10
Bug fixes¶
(#11328) Fix bug where occasionally files written to disk would be unreadable.
(#11331) Fix bug that potentially caused files written to disk to be unreadable.
(#11312) Fix aggregator memory leak.
(#11340) Fix bug where repeatedly annotating same field name could cause failure to compile.
(#11342) Fix to possible issues about having too many open file handles.
Version 0.2.82¶
Release 2022-01-24
Bug fixes¶
(#11209) Significantly improved usefulness and speed of Table.to_pandas, resolved several bugs with output.
New features¶
Version 0.2.81¶
Release 2021-12-20
Version 0.2.80¶
Release 2021-12-15
New features¶
(#11077) hl.experimental.write_matrix_tables now returns the paths of the written matrix tables.
hailctl dataproc¶
(#11157) Updated Dataproc image version to mitigate the Log4j vulnerability.
(#10900) Added --region parameter to hailctl dataproc submit.
(#11090) Teach hailctl dataproc describe how to read URLs with the protocols s3 (Amazon S3), hail-az (Azure Blob Storage), and file (local file system) in addition to gs (Google Cloud Storage).
Version 0.2.79¶
Release 2021-11-17
Version 0.2.77¶
Release 2021-09-21
Version 0.2.76¶
Released 2021-09-15
Version 0.2.75¶
Released 2021-09-10
Bug fixes¶
(#10733) Fix a bug in tabix parsing when the size of the list of all sequences is large.
(#10765) Fix rare bug where valid pipelines would fail to compile if intervals were created conditionally.
(#10746) Various compiler improvements, decrease likelihood of ClassTooLarge errors.
(#10829) Fix a bug where hl.missing and CaseBuilder.or_error failed if their type was a struct containing a field starting with a number.
Version 0.2.74¶
Released 2021-07-26
Version 0.2.73¶
Released 2021-07-22
Version 0.2.70¶
Released 2021-06-21
Version 0.2.68¶
Released 2021-05-27
Version 0.2.67¶
Version 0.2.66¶
Released 2021-05-03
Version 0.2.65¶
Released 2021-04-14
Default Spark Version Change¶
Starting from version 0.2.65, Hail uses Spark 3.1.1 by default. This will also allow the use of all python versions >= 3.6. By building hail from source, it is still possible to use older versions of Spark.
Version 0.2.64¶
Released 2021-03-11
New features¶
(#10164) Add source_file_field parameter to hl.import_table to allow lines to be associated with their original source file.
Bug fixes¶
(#10182) Fixed serious memory leak in certain uses of filter_intervals.
(#10133) Fix bug where some pipelines incorrectly infer missingness, leading to a type error.
(#10134) Teach hl.king to treat filtered entries as missing values.
(#10158) Fixes hail usage in latest versions of jupyter that rely on asyncio.
(#10174) Fixed bad error message when incorrect return type specified with hl.loop.
Version 0.2.63¶
Released 2021-03-01
(#10105) Hail will now return frozenset and hail.utils.frozendict instead of normal sets and dicts.
Bug fixes¶
Version 0.2.62¶
Released 2021-02-03
New features¶
(#9936) Deprecated hl.null in favor of hl.missing for naming consistency.
(#9973) hl.vep now includes a vep_proc_id field to aid in debugging unexpected output.
(#9839) Hail now eagerly deletes temporary files produced by some BlockMatrix operations.
(#9835) hl.any and hl.all now also support a single collection argument and a varargs of Boolean expressions.
(#9816) hl.pc_relate now includes values on the diagonal of kinship, IBD-0, IBD-1, and IBD-2.
(#9736) Let NDArrayExpression.reshape take varargs instead of mandating a tuple.
(#9766) hl.export_vcf now warns if INFO field names are invalid according to the VCF 4.3 spec.
Version 0.2.61¶
Released 2020-12-03
Version 0.2.60¶
Released 2020-11-16
Version 0.2.59¶
Released 2020-10-22
Version 0.2.58¶
Released 2020-10-08
New features¶
(#9524) Hail should now be buildable using Spark 3.0.
(#9549) Add ignore_in_sample_frequency flag to hl.de_novo.
(#9501) Configurable cache size for BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9474) Add ArrayExpression.first and ArrayExpression.last.
(#9459) Add StringExpression.join, an analogue to Python’s str.join.
(#9398) Hail will now throw a HailUserError if the or_error branch of a CaseBuilder is hit.
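For readers unfamiliar with the Python behavior that StringExpression.join is described as mirroring in (#9459), here is plain str.join. This is standard Python only, not Hail code; the sample IDs are made-up example data, and whether the Hail method matches every edge case shown is an assumption.

```python
# Made-up example data.
samples = ["NA12878", "NA12891", "NA12892"]

# The receiver is the delimiter; the argument is the collection of strings.
joined = ",".join(samples)

# Joining an empty collection yields the empty string.
empty = ",".join([])

print(joined)  # NA12878,NA12891,NA12892
```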
Bug fixes¶
(#9503) NDArrays can now hold arbitrary data types, though only ndarrays of primitives can be collected to Python.
(#9501) Remove memory leak in BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9424) hl.experimental.writeBlockMatrices didn’t correctly support overwrite flag.
hailctl dataproc¶
Version 0.2.57¶
Released 2020-09-03
Version 0.2.55¶
Released 2020-08-19
Bug fixes¶
(#9250) hailctl dataproc no longer uses deprecated gcloud flags. Consequently, users must update to a recent version of gcloud.
(#9294) The “Python 3” kernel in notebooks in clusters started by hailctl dataproc now features the same Spark monitoring widget found in the “Hail” kernel. There is now no reason to use the “Hail” kernel.
Version 0.2.53¶
Released 2020-07-30
Version 0.2.52¶
Released 2020-07-29
Version 0.2.51¶
Released 2020-07-28
Version 0.2.50¶
Released 2020-07-23
Version 0.2.49¶
Released 2020-07-08
Version 0.2.48¶
Released 2020-07-07
Bug fixes¶
(#9029) Fix crash when using hl.agg.linreg with no aggregated data records.
(#9028) Fixed memory leak affecting Table.annotate with scans, hl.experimental.densify, and Table.group_by / aggregate.
(#8978) Fixed aggregation behavior of MatrixTable.{group_rows_by, group_cols_by} to skip filtered entries.
Version 0.2.47¶
Released 2020-06-23
Version 0.2.46¶
Released 2020-06-17
Version 0.2.43¶
Released 2020-05-28
Version 0.2.40¶
Released 2020-05-12
Version 0.2.39¶
Released 2020-04-29
Bug fixes¶
(#8615) Fix contig ordering in the CanFam3 (dog) reference genome.
(#8622) Fix bug that causes inscrutable JVM Bytecode errors.
(#8645) Ease unnecessarily strict assertion that caused errors when aggregating by key (e.g. hl.experimental.spread).
(#8621) hl.nd.array now supports arrays with no elements (e.g. hl.nd.array([]).reshape((0, 5))) and, consequently, matmul with an inner dimension of zero.
New features¶
(#8571) hl.init(skip_logging_configuration=True) will skip configuration of Log4j. Users may use this to configure their own logging.
(#8588) Users who manually build Python wheels will experience less unnecessary output when doing so.
(#8572) Add hl.parse_json which converts a string containing JSON into a Hail object.
Performance Improvements¶
Documentation¶
Version 0.2.38¶
Released 2020-04-21
Critical Linreg Aggregator Correctness Bug¶
(#8575) Fixed a correctness bug in the linear regression aggregator. This was introduced in version 0.2.29. See https://discuss.hail.is/t/possible-incorrect-linreg-aggregator-results-in-0-2-29-0-2-37/1375 for more details.
Version 0.2.37¶
Released 2020-04-14
Bug fixes¶
(#8487) Fix incorrect handling of badly formatted data for hl.gp_dosage.
(#8497) Fix handling of missingness for hl.hamming.
(#8537) Fix compile-time error.
(#8539) Fix compiler error in Table.multi_way_zip_join.
(#8488) Fix hl.agg.call_stats to appropriately throw an error for badly-formatted calls.
Version 0.2.36¶
Released 2020-04-06
Version 0.2.35¶
Released 2020-04-02
Critical Memory Management Bug Fix¶
(#8412) Fixed a serious per-partition memory leak that causes certain pipelines to run out of memory unexpectedly. Please update from 0.2.34.
Bug fixes¶
Performance Improvements¶
Version 0.2.34¶
Released 2020-03-12
New features¶
Bug fixes¶
hailctl dataproc¶
(#8253) hailctl dataproc now supports new flags --requester-pays-allow-all and --requester-pays-allow-buckets. This will configure your hail installation to be able to read from requester pays buckets. The charges for reading from these buckets will be billed to the project that the cluster is created in.
(#8268) The data sources for VEP have been moved to gs://hail-us-vep, gs://hail-eu-vep, and gs://hail-uk-vep, which are requester-pays buckets in Google Cloud. hailctl dataproc will automatically infer which of these buckets you should pull data from based on the region your cluster is spun up in. If you are in none of those regions, please contact us on discuss.hail.is.
Version 0.2.33¶
Released 2020-02-27
Bug fixes¶
(#8153) Fixed compiler bug causing MatchError in import_bgen.
(#8123) Fixed an issue with multiple Python HailContexts running on the same cluster.
(#8150) Fixed an issue where output from VEP about failures was not reported in error message.
(#8152) Fixed an issue where the row count of a MatrixTable coming from import_matrix_table was incorrect.
(#8175) Fixed a bug where persist did not actually do anything.
Version 0.2.32¶
Released 2020-02-07
Critical performance regression fix¶
(#7989) Fixed performance regression leading to a large slowdown when hl.variant_qc was run after filtering columns.
Performance¶
Bug fixes¶
(#7976) Fixed divide-by-zero error in hl.concordance with no overlapping rows or cols.
(#7965) Fixed optimizer error leading to crashes caused by MatrixTable.union_rows.
(#8035) Fix compiler bug in Table.multi_way_zip_join.
(#8021) Fix bug in computing shape after BlockMatrix.filter.
(#7986) Fix error in NDArray matrix/vector multiply.
Version 0.2.31¶
Released 2020-01-22
New features¶
(#7787) Added transition/transversion information to hl.summarize_variants.
(#7792) Add Python stack trace to array index out of bounds errors in Hail pipelines.
(#7832) Add spark_conf argument to hl.init, permitting configuration of Spark runtime for a Hail session.
(#7823) Added datetime functions hl.experimental.strptime and hl.experimental.strftime.
(#7888) Added hl.nd.array constructor from nested standard arrays.
File size¶
(#7923) Fixed compression problem since 0.2.23 resulting in larger-than-expected matrix table files for datasets with few entry fields (e.g. GT-only datasets).
Performance¶
Version 0.2.29¶
Released 2019-12-17
Bug fixes¶
(#7229) Fixed hl.maximal_independent_set tie breaker functionality.
(#7732) Fixed incompatibility with old files leading to incorrect data read when filtering intervals after read_matrix_table.
(#7642) Fixed crash when constant-folding functions that throw errors.
(#7611) Fixed hl.hadoop_ls to handle glob patterns correctly.
(#7653) Fixed crash in ld_prune by unfiltering missing GTs.
Performance improvements¶
New features¶
(#7686) Added comment argument to import_matrix_table, allowing lines with certain prefixes to be ignored.
(#7688) Added experimental support for NDArrayExpression objects in new hl.nd module.
(#7608) hl.grep now has a show argument that allows users to either print the results (default) or return a dictionary of the results.
Version 0.2.28¶
Released 2019-11-22
Critical correctness bug fix¶
(#7588) Fixes a bug where filtering old matrix tables in newer versions of hail did not work as expected. Please update from 0.2.27.
Bug fixes¶
New Features¶
Version 0.2.27¶
Released 2019-11-15
New Features¶
(#7379) Add delimiter argument to hl.import_matrix_table.
(#7389) Add force and force_bgz arguments to hl.experimental.import_gtf.
(#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.
(#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.
(#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
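The "return missing on failure" behavior described in (#7453) can be pictured with a plain-Python analogue. This is a sketch, not Hail code: parse_int_or_missing is a hypothetical name, None stands in for a missing value, and the 32/64-bit range checks that the Hail functions imply are deliberately left out.

```python
from typing import Optional


def parse_int_or_missing(text: str) -> Optional[int]:
    """Return the parsed integer, or None (standing in for 'missing')
    when the string is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


fields = ["42", "4x2", ""]
parsed = [parse_int_or_missing(field) for field in fields]
print(parsed)  # [42, None, None]
```

Returning a missing value instead of raising lets one malformed field pass through a pipeline without aborting the whole job.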
Bug fixes¶
Version 0.2.25¶
Released 2019-10-14
New features¶
(#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.
(#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).
(#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.
(#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().
(#7294) Add HTML output for hl.summarize_variants().
Bug fixes¶
Version 0.2.24¶
Released 2019-10-03
hailctl dataproc¶
(#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.
New features¶
(#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.
(#7121) Add hl.contig_length function.
(#7130) Add window method on LocusExpression, which creates an interval around a locus.
(#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.
Version 0.2.23¶
Released 2019-09-23
hailctl dataproc¶
Bug fixes¶
New features¶
(#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.
(#7076) Improved memory management in generated code, add additional log statements about allocated memory to improve debugging.
(#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).
(#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.
Version 0.2.20¶
Released 2019-08-19
Critical memory management fix¶
(#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.
Bug fixes¶
Version 0.2.19¶
Released 2019-08-01
Critical performance bug fix¶
Bug fixes¶
(#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
(#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.
Performance Improvements¶
(#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
(#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
(#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
(#6775) Fixed a major performance problem related to reading block matrices.
Version 0.2.18¶
Released 2019-07-12
Critical performance bug fix¶
(#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.
Version 0.2.17¶
Released 2019-07-10
New features¶
(#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
(#6405) When a matrix table has string column-keys, matrixtable.show uses the column key as the column name.
(#6345) Added an improved scan implementation, which reduces the memory load on master.
(#6462) Added export_bgen method.
(#6473) Improved performance of hl.agg.array_sum by about 50%.
(#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
(#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
(#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
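For context on (#6498), the genomic control inflation factor is conventionally defined as the median of the observed one-degree-of-freedom chi-squared statistics divided by the median of the chi-squared distribution with one degree of freedom (about 0.4549). The sketch below follows that textbook definition using only the standard library; it is not Hail's implementation, lambda_gc_from_p_values is a hypothetical name, and p-values are assumed to lie in (0, 1].

```python
from statistics import NormalDist, median
from typing import Iterable

_STD_NORMAL = NormalDist()

# Median of a chi-squared distribution with one degree of freedom:
# P(Z^2 <= m) = 0.5  <=>  m = (inverse normal CDF at 0.75)^2, about 0.4549.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def lambda_gc_from_p_values(p_values: Iterable[float]) -> float:
    """Textbook genomic inflation factor from two-sided p-values in (0, 1]."""
    # Convert each p-value to the chi-squared (1 df) statistic it corresponds to.
    chi2_stats = [_STD_NORMAL.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

Under this definition, p-values spread evenly around 0.5 give a value close to 1, and an excess of small p-values pushes the value above 1.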
Bug fixes¶
(#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.
(#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
(#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
(#6496) Fixed html output of show methods to truncate long field contents.
(#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
(#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
(#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
(#6583) Fixed hl.plot.manhattan for non-default reference genomes.
Version 0.2.16¶
Released 2019-06-19
Version 0.2.15¶
Released 2019-06-14
After some infrastructural changes to our development process, we should be getting back to frequent releases.
hailctl¶
Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.
New features¶
(#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
(#5980) Added log option to hl.plot.histogram2d.
(#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
(#5913) Added functionality that makes arrays of structs easier to work with.
(#6089) Added HTML output to Expression.show when running in a notebook.
(#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
(#6123) Added hl.binary_search to search sorted numeric arrays.
(#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
(#6214) Updated Hail Python dependencies.
(#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
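The idea behind rewriting key filters as interval filters (#5979) can be illustrated outside Hail: when rows are sorted by key, a point query can jump straight to the matching rows by binary search instead of testing every row. The toy sketch below uses Python's bisect module; the rows and function names are hypothetical, and it makes no claim about how Hail's optimizer or hl.binary_search is implemented.

```python
from bisect import bisect_left, bisect_right

# Toy rows sorted by key, as in a keyed table: (position, payload).
rows = [(100, "a"), (250, "b"), (250, "c"), (400, "d"), (975, "e")]
keys = [position for position, _ in rows]


def filter_by_scan(target):
    """Test every row: cost grows linearly with the number of rows."""
    return [row for row in rows if row[0] == target]


def filter_by_interval(target):
    """Binary-search the sorted keys for the interval [lo, hi) holding the
    target, then slice: cost grows logarithmically plus the result size."""
    lo = bisect_left(keys, target)
    hi = bisect_right(keys, target)
    return rows[lo:hi]


print(filter_by_interval(250))  # [(250, 'b'), (250, 'c')]
```

Both functions return the same rows; only the amount of data inspected differs, which is where the speedup for point queries comes from.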
Bug fixes¶
(#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
(#6013) Turned off feature in HTSJDK that caused crashes in hl.import_vcf due to header fields being overwritten with different types, if the field had a different type than the type in the VCF 4.2 spec.
(#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
(#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
(#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
(#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
(#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.
Version 0.2.14¶
Released 2019-04-24
A back-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.
Version 0.2.13¶
Released 2019-04-18
Hail is now using Spark 2.4.x by default. If you build hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.
New features¶
(#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
(#5860) Improve performance of some column annotation pipelines.
(#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
(#5799) mt.entries() is four times faster.
(#5756) Hail now uses Spark 2.4.x by default.
(#5677) MatrixTable now also supports show.
(#5793)(#5701) Add array.index(x) which finds the first index of array whose value is equal to x.
(#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
(#5690) Improve performance of ld_matrix.
(#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
(#5758) Failure to parse an interval will now produce a much more detailed error message.
(#5723) hl.import_matrix_table can now import a matrix table with no columns.
(#5724) hl.rand_norm2d samples from a two-dimensional random normal.
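The semantics described for array.index(x) and array.head() can be sketched in plain Python, using None to stand in for a missing value. The helper names `index_of` and `head` are hypothetical, and returning None when x is not found is an assumption, since the entry above does not state what happens in that case.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def index_of(array: Sequence[T], x: T) -> Optional[int]:
    """Return the first index whose value equals x."""
    for i, value in enumerate(array):
        if value == x:
            return i
    # Assumption: "not found" is modelled as missing (None).
    return None


def head(array: Sequence[T]) -> Optional[T]:
    """Return the first element, or missing (None) if the array is empty."""
    return array[0] if len(array) > 0 else None
```

Unlike Python's built-in list.index and array[0], neither helper raises on an absent value or an empty array; both propagate missingness instead.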
Bug fixes¶
(#5885) Fix Table.to_spark in the presence of fields of tuples.
(#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
(#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
(#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
(#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
(#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
(#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
(#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
(#5725) Docs now consistently refer to hl.agg not agg.
(#5730)(#5782) Taught import_bgen to optimize its variants argument.
Version 0.2.12¶
Released 2019-03-28
New features¶
Bug fixes¶
Experimental¶
(#5524) Add summarize functions to Table, MatrixTable, and Expression.
(#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
(#5571) Add log parameter to hl.plot.histogram.
(#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
(#5608) Add LD score simulation framework.
(#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.
Version 0.2.11¶
Released 2019-03-06
New features¶
(#5374) Add default arguments to hl.add_sequence for running on GCP.
(#5481) Added sample_cols method to MatrixTable.
(#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
(#5480) Added n_cols argument to MatrixTable.head.
(#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
(#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
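For readers unfamiliar with the terms, the usual meaning of semi joins and anti joins can be sketched in plain Python. This is an illustration of the general relational operations, not Hail's implementation; the functions below and the dict-based rows are hypothetical.

```python
def semi_join(left, right, key):
    """Keep rows of `left` whose key also appears in `right`.
    No fields from `right` are added to the result."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def anti_join(left, right, key):
    """Keep rows of `left` whose key does NOT appear in `right`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


# Hypothetical example data: samples and a list of samples to match against.
samples = [{"id": "s1", "dp": 30}, {"id": "s2", "dp": 12}, {"id": "s3", "dp": 25}]
flagged = [{"id": "s2"}, {"id": "s9"}]

kept = semi_join(samples, flagged, "id")
dropped_flagged = anti_join(samples, flagged, "id")
```

Together the two results partition the left-hand rows: every row lands in exactly one of them.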
Bug fixes¶
(#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
(#5419) Resolved issue with import_vcf force_bgz and file size checks.
(#5427) Resolved issue with Table.show and dictionary field types.
(#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
(#5492) Fixed hl.agg.collect crashing when collecting float32 values.
(#5525) Fixed hl.trio_matrix crashing when complete_trios is False.
Version 0.2.10¶
Released 2019-02-15
New features¶
(#5272) Added a new ‘delimiter’ option to Table.export.
(#5251) Add utility aliases to hl.plot for output_notebook and show.
(#5249) Add histogram2d function to hl.plot module.
(#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
(#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
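The kind of text preprocessing that filter and find_replace describe can be sketched in plain Python. This is not Hail's code: `preprocess_lines` is a hypothetical helper, and three details are assumptions rather than facts from the entry above: that filter drops lines matching the regex, that find_replace is a (regex, replacement) pair, and that filtering happens before substitution.

```python
import re


def preprocess_lines(lines, filter_regex=None, find_replace=None):
    """Drop lines matching `filter_regex`, then apply one regex substitution.

    filter_regex: pattern; matching lines are removed (assumed semantics).
    find_replace: (pattern, replacement) pair applied to surviving lines.
    """
    result = []
    for line in lines:
        if filter_regex is not None and re.search(filter_regex, line):
            continue  # assumed: a match means the line is filtered out
        if find_replace is not None:
            pattern, replacement = find_replace
            line = re.sub(pattern, replacement, line)
        result.append(line)
    return result


# Hypothetical input: a comment line plus pipe-delimited records.
raw = ["# generated by a pipeline", "s1|0.98", "s2|0.95"]
cleaned = preprocess_lines(raw, filter_regex=r"^#", find_replace=(r"\|", "\t"))
```

Cleaning the text before parsing avoids a separate pass to rewrite the input file on disk.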
Performance improvements¶
(#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
Bug fixes¶
(#5306) Fix ReferenceGenome.add_sequence causing a crash.
(#5268) Fix Table.export writing a file called ‘None’ in the current directory.
(#5265) Fix hl.get_reference raising an exception when called before hl.init().
(#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
(#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
(#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
(#5295) Fix crash in Table.index related to key field incompatibilities.
Version 0.2.9¶
Released 2019-01-30
New features¶
Performance improvements¶
Bug fixes¶
(#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
(#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
(#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
(#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
(#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).
Version 0.2.7¶
Released 2019-01-03
Version 0.2.6¶
Released 2018-12-17
New features¶
(#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
(#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
(#4926) Expanded default GRCh38 contig recoding behavior in import_plink.
Bug fixes¶
(#4941) Fixed variable scoping error in regression methods.
(#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
(#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
(#4920) Fixed bad error message in Table.order_by.
Version 0.2.5¶
Released 2018-12-07
New features¶
(#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
(#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
(#4820) Added hl.reversed for reversing arrays and strings.
(#4895) Added include_strand option to the hl.liftover function.
Performance improvements¶
Bug fixes¶
(#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
(#4888) Fixed assertion error in BlockMatrix.sum.
(#4871) Fixed possible error in locally sorting nested collections.
(#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
(#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].