Change Log And Version Policy
Python Version Compatibility Policy
Hail complies with NumPy’s compatibility policy on Python versions. In particular, Hail officially supports:
All minor versions of Python released 42 months prior to the project, and at minimum the two latest minor versions.
All minor versions of numpy released in the 24 months prior to the project, and at minimum the last three minor versions.
Frequently Asked Questions
With a version like 0.x, is Hail ready for use in publications?
Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.
What is the difference between the Hail Python library version and the native file format version?
The Hail Python library version, the version you see on PyPI, in pip, or in hl.version(), changes every time we release the Python library. The Hail native file format version only changes when we change the format of Hail Table and MatrixTable files. If a version of the Python library introduces a new native file format version, we note that in the change log. All subsequent versions of the Python library can read the new file format version.
The native file format version changes much more slowly than the Python library version. It is not currently possible to view the file format version of a Hail Table or MatrixTable.
What stability is guaranteed?
The Hail file formats and Python API are backwards compatible. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. This also means any file written by Python library versions 0.2.1 through 0.2.5 can be read by 0.2.5.
Forward compatibility of file formats and the Python API is not guaranteed. In particular, a new file format version is only readable by library versions released after the file format. For example, Python library version 0.2.119 introduces a new file format version: 1.7.0. All library versions before 0.2.119, for example 0.2.118, cannot read file format version 1.7.0. All library versions after and including 0.2.119 can read file format version 1.7.0.
Each version of the Hail Python library can only write files using the latest file format version it supports.
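The forward-compatibility rule above can be sketched in a few lines of plain Python. This is only an illustration, not part of Hail: the names parse_version, FORMAT_INTRODUCED_IN, and can_read are hypothetical, and the only data point taken from this page is that library version 0.2.119 introduced file format version 1.7.0.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.2.119' into (0, 2, 119) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical lookup table: file format version -> first Python library
# version able to read it. Only the 1.7.0 entry comes from the change log.
FORMAT_INTRODUCED_IN: Dict[str, str] = {"1.7.0": "0.2.119"}


def can_read(library_version: str, file_format_version: str) -> bool:
    """A library can read a format only if it was released at or after it."""
    introduced_in = FORMAT_INTRODUCED_IN[file_format_version]
    return parse_version(library_version) >= parse_version(introduced_in)


print(can_read("0.2.118", "1.7.0"))  # False
print(can_read("0.2.119", "1.7.0"))  # True
print(can_read("0.2.133", "1.7.0"))  # True
```

Versions are compared as integer tuples rather than strings because a plain string comparison would order "0.2.20" after "0.2.119".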
The hl.experimental package and other methods marked experimental in the docs are exempt from this policy. Their functionality or even existence may change without notice. Please contact us if you critically depend on experimental functionality.
Version 0.2.133
Released 2024-09-25
New Features
(#14619) Teach hailctl dataproc submit to use the --project argument as an argument to gcloud dataproc rather than the submitted script.
Bug Fixes
(#14673) Fix typo in Interpret rule for TableAggregate.
(#14697) Set QUAL="." to missing rather than htsjdk’s sentinel value.
(#14292) Prevent GCS cold storage check from throwing an error when reading from a public access bucket.
(#14651) Remove jackson string length restriction for all backends.
(#14653) Add --public-ip-address argument to gcloud dataproc start command built by hailctl dataproc start, fixing creation of dataproc 2.2 clusters.
Version 0.2.132
Released 2024-07-08
New Features
(#14572) Added StringExpression.find for finding substrings in a Hail str.
Bug Fixes
(#14574) Fixed TypeError bug when initializing Hail Query with backend='batch'.
(#14571) Fixed a deficiency that caused certain pipelines that construct Hail NDArrays from streams to run out of memory.
(#14579) Fix serialization bug that broke some Query-on-Batch pipelines with many complex expressions.
(#14567) Fix Jackson configuration that broke some Query-on-Batch pipelines with many complex expressions.
Version 0.2.131
Released 2024-05-30
New Features
(#14560) The gvcf import stage of the VDS combiner now preserves the GT of reference blocks. Some datasets have haploid calls on sex chromosomes, and the fact that the reference was haploid should be preserved.
Bug Fixes
(#14563) The version of notebook installed in Hail Dataproc clusters has been upgraded from 6.5.4 to 6.5.6 in order to fix a bug where Jupyter Notebooks wouldn’t start on clusters. The workaround involving creating a cluster with --packages='ipython<8.22' is no longer necessary.
Deprecations
(#14158) Hail now supports and primarily tests against Dataproc 2.2.5, Spark 3.5.0, and Java 11. We strongly recommend updating to Spark 3.5.0 and Java 11. You should also update your GCS connector after installing Hail: curl https://broad.io/install-gcs-connector | python3. Do not try to update before installing Hail 0.2.131.
Version 0.2.130
Released 2024-10-02
0.2.129 contained test configuration artifacts that prevented users from starting dataproc clusters with hailctl. Please upgrade to 0.2.130 if you use dataproc.
New Features
(#14447) Added copy_spark_log_on_error initialization flag that, when set, copies the hail driver log to the remote tmpdir if query execution raises an exception.
Bug Fixes
(#14452) Fixes a bug that prevents users from starting dataproc clusters with hailctl
Version 0.2.129
Released 2024-04-02
Documentation
New Features
(#14406) Performance improvements for reading structured data from (Matrix)Tables
(#14255) Added Cochran-Mantel-Haenszel test for association (cochran_mantel_haenszel_test). Our thanks to @Will-Tyler for generously contributing this feature.
(#14393) hail no longer depends on protobuf; users may choose their own version of protobuf.
(#14360) Exposed previously internal _num_allele_type as numeric_allele_type and deprecated it. Add new AlleleType enumeration for users to be able to easily use the values returned by numeric_allele_type.
(#14297) vds.sample_qc now uses independent aggregators. Users may now import these functions and use them directly.
(#14405) VariantDataset.validate now checks that all ref blocks are no longer than the ref_block_max_length field, if it exists.
Bug Fixes
(#14420) Fixes a serious, but likely rare, bug in the Table/MatrixTable reader, which has been present since Sep 2020. It manifests as many (around half or more) of the rows being dropped. This could only happen when 1) reading a (matrix)table whose partitioning metadata allows rows with the same key to be split across neighboring partitions, and 2) reading it with a different partitioning than it was written. 1) would likely only happen by reading data keyed by locus and alleles, and rekeying it to only locus before writing. 2) would likely only happen by using the _intervals or _n_partitions arguments to read_(matrix)_table, or possibly repartition. Please reach out to us if you’re concerned you may have been affected by this.
(#14330) Fixes erroneous error in export_vcf with unphased haploid Calls.
(#14303) Fix missingness error when sampling entries from a MatrixTable.
(#14288) Contigs may now be compared for inequality while filtering rows.
Deprecations
(#14386) MatrixTable.make_table is deprecated. Use .localize_entries instead.
Version 0.2.128
Released 2024-02-16
In GCP, the Hail Annotation DB and Datasets API have moved from multi-regional US and EU buckets to regional US-CENTRAL1 and EUROPE-WEST1 buckets. These buckets are requester pays which means unless your cluster is in the US-CENTRAL1 or EUROPE-WEST1 region, you will pay a per-gigabyte rate to read from the Annotation DB or Datasets API. We must make this change because reading from a multi-regional bucket into a regional VM is no longer free. Unfortunately, cost constraints require us to choose only one region per continent and we have chosen US-CENTRAL1 and EUROPE-WEST1.
Documentation
New Features
(#14206) Introduce hailctl config set http/timeout_in_seconds which Batch and QoB users can use to increase the timeout on their laptops. Laptops tend to have flaky internet connections and a timeout of 300 seconds produces a more robust experience.
(#14178) Reduce VDS Combiner runtime slightly by computing the maximum ref block length without executing the combination pipeline twice.
(#14207) VDS Combiner now verifies that every GVCF path and sample name is unique.
Bug Fixes
(#14300) Require orjson<3.9.12 to avoid a segfault introduced in orjson 3.9.12
(#14071) Use indexed VEP cache files for GRCh38 on both dataproc and QoB.
(#14232) Allow use of large numbers of fields on a table without triggering “ClassTooLargeException: Class too large:”.
(#14246)(#14245) Fix a bug, introduced in 0.2.114, in which Table.multi_way_zip_join and Table.aggregate_by_key could throw “NoSuchElementException: Ref with name __iruid_...” when one or more of the tables had a number of partitions substantially different from the desired number of output partitions.
(#14202) Support coercing {} (the empty dictionary) into any Struct type (with all missing fields).
(#14239) Remove an erroneous statement from the MatrixTable tutorial.
(#14176) hailtop.fs.ls can now list a bucket, e.g. hailtop.fs.ls("gs://my-bucket").
(#14258) Fix import_avro to not raise NullPointerException in certain rare cases (e.g. when using _key_by_assert_sorted).
(#14285) Fix a broken link in the MatrixTable tutorial.
Deprecations
(#14293) Support for the hail-az:// scheme, deprecated in 0.2.116, is now gone. Please use the standard https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH.
Version 0.2.127
Released 2024-01-12
If you have an Apple M1 laptop, verify that file $JAVA_HOME/bin/java returns a message including the phrase “arm64”. If it instead includes the phrase “x86_64” then you must upgrade to a new version of Java. You may find such a version of Java here.
New Features
Bug Fixes
(#14110) Fix hailctl hdinsight start, which has been broken since 0.2.118.
(#14098)(#14090)(#14118) Fix (#14089), which makes hailctl dataproc connect work in Windows Subsystem for Linux.
(#14048) Fix (#13979), affecting Query-on-Batch and manifesting most frequently as “com.github.luben.zstd.ZstdException: Corrupted block detected”.
(#14066) Since 0.2.110, hailctl dataproc set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways including RemoteDisconnectedError and socket closed. See issue (#13960) for details.
(#14057) Fix (#13998) which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger “lowering”).
(#14006) Fix (#14000). Hail now supports identity_by_descent on Apple M1 and M2 chips; however, your Java installation must be an arm64 installation. Using x86_64 Java with Hail on Apple M1 or M2 will cause SIGILL errors. If you have an Apple M1 or Apple M2 and /usr/libexec/java_home -V does not include (arm64), you must switch to an arm64 version of the JVM.
(#14022) Fix (#13937) caused by faulty library code in the Google Cloud Storage API Java client library.
(#13812) Permit hailctl batch submit to accept relative paths. Fix (#13785).
(#13885) Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.
(#14127) hailctl dataproc start ... --dry-run now uses shell escapes such that, after being copied and pasted into a shell, the gcloud command works as expected.
(#14062) Fix (#14052) which caused incorrect results for identity by descent in Query-on-Batch.
(#14122) Ensure that stack traces are transmitted from workers to the driver to the client.
(#14105) When a VCF contains missing values in array fields, Hail now suggests using array_elements_required=False.
Deprecations
(#13987) Deprecate the default_reference parameter to hl.init. Users should instead call hl.default_reference with an argument to set a new default reference, usually shortly after hl.init.
Version 0.2.126
Released 2023-10-30
Bug Fixes
(#13939) Fix a bug introduced in 0.2.125 which could cause dict literals created in python to be decoded incorrectly, causing runtime errors or, potentially, incorrect results.
(#13751) Correct the broadcasting of ndarrays containing at least one dimension of length zero. This previously produced incorrect results.
Version 0.2.125
Released 2023-10-26
New Features
(#13682) hl.export_vcf now clearly reports all Table or Matrix Table fields which cannot be represented in a VCF.
(#13355) Improve the Hail compiler to more reliably rewrite Table.filter and MatrixTable.filter_rows to use hl.filter_intervals. Before this change some queries required reading all partitions even though only a small number of partitions match the filter.
(#13787) Improve speed of reading hail format datasets from disk. Simple pipelines may see as much as a halving in latency.
(#13849) Fix (#13788), improving the error message when hl.logistic_regression_rows is provided row or entry annotations for the dependent variable.
(#13888) hl.default_reference can now be passed an argument to change the default reference genome.
Bug Fixes
(#13702) Fix (#13699) and (#13693). Since 0.2.96, pipelines that combined random functions (e.g. hl.rand_unif) with index(..., all_matches=True) could fail with a ClassCastException.
(#13707) Fix (#13633). hl.maximum_independent_set now accepts strings as the names of individuals. It has always accepted structures containing a single string field.
(#13713) Fix (#13704), in which Hail could encounter an IllegalArgumentException if there are too many transient errors.
(#13730) Fix (#13356) and (#13409). In QoB pipelines with 10K or more partitions, transient “Corrupted block detected” errors were common. This was caused by incorrect retry logic. That logic has been fixed.
(#13732) Fix (#13721) which manifested with the message “Missing Range header in response”. The root cause was a bug in the Google Cloud Storage SDK on which we rely. The fix is to update to a version without this bug. The buggy version of GCS SDK was introduced in 0.2.123.
(#13759) Since Hail 0.2.123, Hail would hang in Dataproc Notebooks due to (#13690).
(#13755) Ndarray concatenation now works with arrays with size zero dimensions.
(#13817) Mitigate new transient error from Google Cloud Storage which manifests as aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548).
(#13715) Fix (#13697), a long standing issue with QoB. When a QoB driver or worker fails, the corresponding Batch Job will also appear as failed.
(#13829) Fix (#13828). The Hail combiner now properly imports PGT fields from GVCFs.
(#13805) Fix (#13767). hailctl dataproc submit now expands ~ in the --files and --pyfiles arguments.
(#13797) Fix (#13756). Operations that collect large results such as to_pandas may require up to 3x less memory.
(#13826) Fix (#13793). Ensure hailctl describe -u overrides the gcs_requester_pays/project config variable.
(#13814) Fix (#13757). Pipelines that are memory-bound by copious use of hl.literal, such as hl.vds.filter_intervals, require substantially less memory.
(#13894) Fix (#13837) in which Hail could break a Spark installation if the Hail JAR appears on the classpath before the Scala JARs.
(#13919) Fix (#13915) which prevented using a glob pattern in hl.import_vcf.
Version 0.2.124
Released 2023-09-21
New Features
(#13608) Change default behavior of hl.ggplot.geom_density to use a new method. The old method is still available using the flag smoothed=True. The new method is typically a much more accurate representation, and works well for any distribution, not just smooth ones.
Version 0.2.123
Released 2023-09-19
New Features
(#13610) Additional setup is no longer required when using hail.plot or hail.ggplot in a Jupyter notebook (calling bokeh.io.output_notebook or hail.plot.output_notebook and/or setting plotly.io.renderers.default = 'iframe' is no longer necessary).
Bug Fixes
(#13634) Fix a bug which caused Query-on-Batch pipelines with a large number of partitions (close to 100k) to run out of memory on the driver after all partitions finish.
(#13619) Fix an optimization bug that, on some pipelines, since at least 0.2.58 (commit 23813af), resulted in Hail using essentially unbounded amounts of memory.
(#13609) Fix a bug in hail.ggplot.scale_color_continuous that sometimes caused errors by generating invalid colors.
Version 0.2.122
Released 2023-09-07
New Features
(#13508) The n parameter of MatrixTable.tail is deprecated in favor of a new n_rows parameter.
Bug Fixes
(#13498) Fix a bug where field names can shadow methods on the StructExpression class, e.g. “items”, “keys”, “values”. Now the only way to access such fields is through the getitem syntax, e.g. “some_struct['items']”. It’s possible this could break existing code that uses such field names.
(#13585) Fix bug introduced in 0.2.121 where Query-on-Batch users could not make requests to batch.hail.is without a domain configuration set.
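The field-shadowing behaviour described in (#13498) above can be illustrated with a small toy class. This is a sketch only: ToyStruct is a hypothetical stand-in written for this example and is not Hail's StructExpression. It shows why a field named "items" is only reachable through getitem syntax once a method of the same name exists.

```python
class ToyStruct:
    """Hypothetical stand-in for a struct-like object; not Hail's API."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def items(self):
        # A real method named "items", like the ones the fix protects.
        return list(self._fields.items())

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so methods
        # such as items() always win over fields with the same name.
        fields = self.__dict__.get("_fields", {})
        if name in fields:
            return fields[name]
        raise AttributeError(name)

    def __getitem__(self, name):
        # Item access always resolves to the field.
        return self._fields[name]


s = ToyStruct(gene="BRCA1", items=[1, 2, 3])
print(s.gene)             # BRCA1 (no method with this name)
print(callable(s.items))  # True (attribute access finds the method)
print(s["items"])         # [1, 2, 3] (getitem syntax finds the field)
```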
Version 0.2.121
Released 2023-09-06
New Features
(#13385) The VDS combiner now supports arbitrary custom call fields via the call_fields parameter.
(#13224) hailctl config get, set, and unset now support shell auto-completion. Run hailctl --install-completion zsh to install the auto-completion for zsh. You must already have completion enabled for zsh.
(#13279) Add hailctl batch init which helps new users interactively set up hailctl for Query-on-Batch and Batch use.
Bug Fixes
(#13573) Fix (#12936) in which VEP frequently failed (due to Docker not starting up) on clusters with a non-trivial number of workers.
(#13485) Fix (#13479) in which hl.vds.local_to_global could produce invalid values when the LA field is too short. There were and are no issues when the LA field has the correct length.
(#13340) Fix copy_log to correctly copy relative file paths.
(#13364) hl.import_gvcf_interval now treats PGT as a call field.
(#13333) Fix interval filtering regression: filter_rows or filter mentioning the same field twice or using two fields incorrectly read the entire dataset. In 0.2.121, these filters will correctly read only the relevant subset of the data.
(#13368) In Azure, Hail now uses fewer “list blobs” operations. This should reduce cost on pipelines that import many files, export many files, or use file glob expressions.
(#13414) Resolves (#13407) in which uses of union_rows could reduce parallelism to one partition resulting in severely degraded performance.
(#13405) MatrixTable.aggregate_cols no longer forces a distributed computation. This should be what you want in the majority of cases. In case you know the aggregation is very slow and should be parallelized, use mt.cols().aggregate instead.
(#13460) In Query-on-Spark, restore hl.read_table optimization that avoids reading unnecessary data in pipelines that do not reference row fields.
(#13447) Fix (#13446). In all three submit commands (batch, dataproc, and hdinsight), Hail now allows and encourages the use of -- to separate arguments meant for the user script from those meant for hailctl. In hailctl batch submit, option-like arguments, for example “--foo”, are now supported before “--” if and only if they do not conflict with a hailctl option.
(#13422) hailtop.hail_frozenlist.frozenlist now has an eval-able repr.
(#13523) hl.Struct is now pickle-able.
(#13505) Fix bug introduced in 0.2.117 by commit c9de81108 which prevented the passing of keyword arguments to Python jobs. This manifested as “ValueError: too many values to unpack”.
(#13536) Fixed (#13535) which prevented the use of Python jobs when the client (e.g. your laptop) Python version is 3.11 or later.
(#13434) In QoB, Hail’s file systems now correctly list all files in a directory, not just the first 1000. This could manifest in an import_table or import_vcf which used a glob expression. In such a case, only the first 1000 files would have been included in the resulting Table or MatrixTable.
(#13550) hl.utils.range_table(n) now supports all valid 32-bit signed integer values of n.
(#13500) In Query-on-Batch, the client-side Python code will not try to list every job when a QoB batch fails. This could take hours for long-running pipelines or pipelines with many partitions.
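The double-dash convention mentioned in (#13447) above, where a bare -- separates arguments for the tool from arguments for the user script, can be sketched as follows. This is a generic illustration of the convention and not hailctl's implementation; split_at_double_dash and the option names in the example are hypothetical.

```python
from typing import List, Tuple


def split_at_double_dash(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first bare '--' into (tool_args, script_args)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    # Without a separator, everything is treated as a tool argument.
    return list(argv), []


# Hypothetical command line: one tool option, a script, then script options.
tool_args, script_args = split_at_double_dash(
    ["--some-tool-option", "value", "script.py", "--", "--foo", "bar"]
)
print(tool_args)    # ['--some-tool-option', 'value', 'script.py']
print(script_args)  # ['--foo', 'bar']
```

Everything after the separator is passed through untouched, so a script option such as --foo cannot be mistaken for an option of the tool itself.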
Deprecations
Version 0.2.120
Released 2023-07-27
New Features
(#13206) The VDS Combiner now works in Query-on-Batch.
Bug Fixes
(#13313) Fix bug introduced in 0.2.119 which causes a serialization error when using Query-on-Spark to read a VCF which is sorted by locus, with split multi-allelics, in which the records sharing a single locus do not appear in the dictionary ordering of their alternate alleles.
(#13264) Fix bug which ignored the partition_hint of a Table group-by-and-aggregate.
(#13239) Fix bug which ignored the HAIL_BATCH_REGIONS argument when determining in which regions to schedule jobs when using Query-on-Batch.
(#13253) Improve hadoop_ls and hfs.ls to quickly list globbed files in a directory. The speed improvement is proportional to the number of files in the directory.
(#13226) Fix the comparison of an hl.Struct to an hl.struct or field of type tstruct. Resolves (#13045) and (#13046).
(#12995) Fixed bug causing poor performance and memory leaks for MatrixTable.annotate_rows aggregations.
Version 0.2.119
Released 2023-06-28
New Features
(#12081) Hail now uses Zstandard as the default compression algorithm for table and matrix table storage, reducing file size by around 20% in most cases.
(#12988) Arbitrary aggregations can now be used on arrays via ArrayExpression.aggregate. This method is useful for accessing functionality that exists in the aggregator library but not the basic expression library, for instance, call_stats.
(#13166) Add an eigh ndarray method, for finding eigenvalues of symmetric matrices (“h” is for Hermitian, the complex analogue of symmetric).
Bug Fixes
(#13184) vds.to_dense_mt no longer densifies past the end of contig boundaries. A logic bug in to_dense_mt could lead to reference data towards the end of one contig being applied to the following contig up until the first reference block of the contig.
(#13173) Fix globbing in scala blob storage filesystem implementations.
File Format
The native file format version is now 1.7.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.118
Released 2023-06-13
New Features
Bug Fixes
(#13126) Query-on-Batch pipelines with one partition are now retried when they encounter transient errors.
(#13113) hail.ggplot.geom_point now displays a legend group for a column even when it has only one value in it.
(#13075) (#13074) Add a new transient error plaguing pipelines in Query-on-Batch in Google: java.net.SocketTimeoutException: connect timed out.
(#12569) The documentation for hail.ggplot.facets is now correctly included in the API reference.
Version 0.2.117
Released 2023-05-22
New Features
(#12875) Parallel export modes now write a manifest file. These manifest files are text files with one filename per line, containing the name of each shard written successfully to the directory. These filenames are relative to the export directory.
(#13007) In Query-on-Batch and hailtop.batch, memory and storage request strings may now be optionally terminated with a B for bytes.
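As a rough illustration of how the manifest described in (#12875) above might be consumed, the sketch below reads a manifest with one relative filename per line and resolves each name against the export directory. It assumes a local filesystem, and the manifest filename "shard-manifest.txt" and the helper read_manifest are assumptions made for this example rather than names confirmed by this change log.

```python
import os
import tempfile
from typing import List


def read_manifest(export_dir: str,
                  manifest_name: str = "shard-manifest.txt") -> List[str]:
    """Return the full path of every shard listed in the manifest.

    The manifest filename is an assumption; pass the real name if it differs.
    """
    manifest_path = os.path.join(export_dir, manifest_name)
    with open(manifest_path) as manifest:
        # One filename per line, relative to the export directory.
        names = [line.strip() for line in manifest if line.strip()]
    return [os.path.join(export_dir, name) for name in names]


# Demonstration against a made-up export directory.
with tempfile.TemporaryDirectory() as export_dir:
    for name in ("part-0", "part-1"):
        with open(os.path.join(export_dir, name), "w") as shard:
            shard.write("data\n")
    with open(os.path.join(export_dir, "shard-manifest.txt"), "w") as manifest:
        manifest.write("part-0\npart-1\n")

    shards = read_manifest(export_dir)
    print([os.path.basename(path) for path in shards])
```

Because only successfully written shards are listed, reading the manifest is a more reliable way to enumerate the export than listing the directory.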
Bug Fixes
(#13065) In Azure Query-on-Batch, fix a resource leak that prevented running pipelines with >500 partitions and created flakiness with >250 partitions.
(#13067) In Query-on-Batch, driver and worker logs no longer buffer so messages should arrive in the UI after a fixed delay rather than proportional to the frequency of log messages.
(#13028) Fix crash in hl.vds.filter_intervals when using a table to filter a VDS that stores the max ref block length.
(#13060) Prevent 500 Internal Server Error in Jupyter Notebooks of Dataproc clusters started by hailctl dataproc.
(#13051) In Query-on-Batch and hailtop.batch, Azure Blob Storage https URLs are now supported.
(#13042) In Query-on-Batch, naive_coalesce no longer performs a full write/read of the dataset. It now operates identically to the Query-on-Spark implementation.
(#13031) In hl.ld_prune, an informative error message is raised when a dataset does not contain diploid calls instead of an assertion error.
(#13032) In Query-on-Batch, in Azure, Hail now uses a newer version of the Azure blob storage libraries to reduce the frequency of “Stream is already closed” errors.
(#13011) In Query-on-Batch, the driver will use ~1/2 as much memory to read results as it did in 0.2.115.
(#13013) In Query-on-Batch, transient errors while streaming from Google Storage are now automatically retried.
Version 0.2.116
Released 2023-05-08
New Features
(#12917) ABS blob URIs in the format of https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH> are now supported.
(#12731) Introduced hailtop.fs that makes public a filesystem module that works for local fs, gs, s3 and abs. This is now used as the Backend.fs for hail query but can be used standalone for Hail Batch users by import hailtop.fs as hfs.
Deprecations
Bug Fixes
Version 0.2.115
Released 2023-04-25
New Features
(#12731) Introduced hailtop.fs that makes public a filesystem module that works for local fs, gs, s3 and abs. This can be used by import hailtop.fs as hfs but has also replaced the underlying implementation of the hl.hadoop_* methods. This means that the hl.hadoop_* methods now support these additional blob storage providers.
(#12917) ABS blob URIs in the form of https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH> are now supported when running in Azure.
Deprecations
(#12917) The hail-az scheme for referencing ABS blobs in Azure is deprecated in favor of the https scheme and will be removed in a future release.
Bug Fixes
(#12919) An interactive hail session is no longer unusable after hitting CTRL-C during a batch execution in Query-on-Batch
(#12913) Fixed bug in hail.ggplot where all legend entries would have the same text if one column had exactly one value for all rows and was mapped to either the shape or the color aesthetic for geom_point.
Version 0.2.114
Released 2023-04-19
New Features
(#12880) Added hl.vds.store_ref_block_max_len to patch old VDSes to make interval filtering faster.
Bug Fixes
(#12860) Fixed memory leak in shuffles in Query-on-Batch.
Version 0.2.113
Released 2023-04-07
New Features
(#12798) Query-on-Batch now supports BlockMatrix.write(..., stage_locally=True).
(#12793) Query-on-Batch now supports hl.poisson_regression_rows.
(#12801) Hitting CTRL-C while interactively using Query-on-Batch cancels the underlying batch.
(#12810) hl.array can now convert 1-d ndarrays into the equivalent list.
(#12851) hl.variant_qc no longer requires a locus field.
(#12816) In Query-on-Batch, hl.logistic_regression('firth', ...) is now supported.
(#12854) In Query-on-Batch, simple pipelines with large numbers of partitions should be substantially faster.
Bug Fixes
(#12783) Fixed bug where logs were not properly transmitted to Python.
(#12812) Fixed bug where Table/MT._calculate_new_partitions returned unbalanced intervals with whole-stage code generation runtime.
(#12839) Fixed hailctl dataproc jupyter notebooks to be compatible with Spark 3.3, which have been broken since 0.2.110.
(#12855) In Query-on-Batch, allow writing to requester pays buckets, which was broken before this release.
Version 0.2.112
Released 2023-03-15
Bug Fixes
(#12784) Removed an internal caching mechanism in Query on Batch that caused stalls in pipelines with large intermediates
Version 0.2.111
Released 2023-03-13
New Features
(#12581) In Query on Batch, users can specify which regions to have jobs run in.
Bug Fixes
(#12772) Fix hailctl hdinsight submit to pass args to the files
Version 0.2.110
Released 2023-03-08
New Features
(#12643) In Query on Batch, hl.skat(..., logistic=True) is now supported.
(#12643) In Query on Batch, hl.liftover is now supported.
(#12629) In Query on Batch, hl.ibd is now supported.
(#12722) Add hl.simulate_random_mating to generate a population from founders under the assumption of random mating.
(#12701) Query on Spark now officially supports Spark 3.3.0 and Dataproc 2.1.x
Performance Improvements
(#12679) In Query on Batch, hl.balding_nichols_model is slightly faster. Also added hl.utils.genomic_range_table to quickly create a table keyed by locus.
Bug Fixes
(#12711) In Query on Batch, fix null pointer exception (manifesting as scala.MatchError: null) when reading data from requester pays buckets.
(#12739) Fix hl.plot.cdf, hl.plot.pdf, and hl.plot.joint_plot which were broken by changes in Hail and changes in bokeh.
(#12735) Fix (#11738) by allowing user to override default types in to_pandas.
(#12760) Mitigate some JVM bytecode generation errors, particularly those related to too many method parameters.
(#12766) Fix (#12759) by loosening parsimonious dependency pin.
(#12732) In Query on Batch, fix bug that sometimes prevented terminating a pipeline using Control-C.
(#12771) Use a version of jgscm whose version complies with PEP 440.
Version 0.2.109
Released 2023-02-08
New Features
(#12605) Add hl.pgenchisq, the cumulative distribution function of the generalized chi-squared distribution.
(#12637) Query-on-Batch now supports hl.skat(..., logistic=False).
(#12645) Added hl.vds.truncate_reference_blocks to transform a VDS to checkpoint reference blocks in order to drastically improve interval filtering performance. Also added hl.vds.merge_reference_blocks to merge adjacent reference blocks according to user criteria to better compress reference data.
Bug Fixes
(#12650) Hail will now throw an exception on hl.export_bgen when there is no GP field, instead of exporting null records.
(#12635) Fix bug where hl.skat did not work on Apple M1 machines.
(#12571) When using Query-on-Batch, hl.hadoop* methods now properly support creation and modification time.
(#12566) Improve error message when combining incompatibly indexed fields in certain operations including array indexing.
Version 0.2.108
Released 2023-01-12
New Features
(#12576) hl.import_bgen and hl.export_bgen now support compression with Zstd.
Bug Fixes
(#12585) hail.ggplots that have more than one legend group or facet are now interactive. If such a plot has enough legend entries that the legend would be taller than the plot, the legend will now be scrollable. Legend entries for such plots can be clicked to show/hide traces on the plot, but this does not work and is a known issue that will only be addressed if hail.ggplot is migrated off of plotly.
(#12584) Fixed bug which arose as an assertion error about type mismatches. This was usually triggered when working with tuples.
(#12583) Fixed bug which showed an empty table for ht.col_key.show().
(#12582) Fixed bug where matrix tables with duplicate col keys do not show properly. Also fixed bug where tables and matrix tables with HTML unsafe column headers are rendered wrong in Jupyter.
(#12574) Fixed a memory leak when processing tables. Could trigger unnecessarily high memory use and out of memory errors when there are many rows per partition or large key fields.
(#12565) Fixed a bug that prevented exploding on a field of a Table whose value is a random value.
Version 0.2.107
Released 2022-12-14
Bug Fixes
(#12543) Fixed hl.vds.local_to_global error when LA array contains non-ascending allele indices.
Version 0.2.106
Released 2022-12-13
New Features
(#12522) Added hailctl config setting 'batch/backend' to specify the default backend to use in batch scripts when not specified in code.
(#12497) Added support for scales, nrow, and ncol arguments, as well as grouped legends, to hail.ggplot.facet_wrap.
(#12471) Added hailctl batch submit command to run local scripts inside batch jobs.
(#12525) Add support for passing arguments to hailctl batch submit.
(#12465) Batch jobs’ status now contains the region the job ran in. The job itself can access which region it is in through the HAIL_REGION environment variable.
(#12464) When using Query-on-Batch, all jobs for a single hail session are inserted into the same batch instead of one batch per action.
(#12457) pca and hwe_normalized_pca are now supported in Query-on-Batch.
(#12376) Added hail.query_table function for reading tables with indices from Python.
(#12139) Random number generation has been updated, but shouldn’t affect most users. If you need to manually set seeds, see https://hail.is/docs/0.2/functions/random.html for details.
(#11884) Added Job.always_copy_output when using the ServiceBackend. The default behavior is False, which is a breaking change from the previous behavior to always copy output files regardless of the job’s completion state.
Bug Fixes
(#12487) Fixed a bug causing rare but deterministic job failures deserializing data in Query-on-Batch.
(#12535) QoB will now error if the user reads from and writes to the same path. QoB also now respects the user’s configuration of disable_progress_bar. When disable_progress_bar is unspecified, QoB only disables the progress bar for non-interactive sessions.
(#12517) Fix a performance regression that appears when using hl.split_multi_hts among other methods.
Version 0.2.105
Released 2022-10-31 🎃
New Features
(#12293) Added support for hail.MatrixTables to hail.ggplot.
Bug Fixes
(#12384) Fixed a critical bug that disabled tree aggregation and scan executions in 0.2.104, leading to out-of-memory errors.
(#12265) Fix long-standing bug wherein hl.agg.collect_as_set and hl.agg.counter error when applied to types which, in Python, are unhashable. For example, hl.agg.counter(t.list_of_genes) will not error when t.list_of_genes is a list. Instead, the counter dictionary will use FrozenList keys from the frozenlist package.
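The Python limitation behind this fix can be reproduced without Hail. The sketch below is plain Python and is not Hail's implementation: it uses a tuple as a stand-in for the FrozenList keys mentioned above (an assumption made only to keep the example dependency-free), and the gene names are made up for illustration.

```python
from collections import Counter

# Hypothetical per-row values, standing in for t.list_of_genes.
rows = [["BRCA1", "TP53"], ["BRCA1", "TP53"], ["EGFR"]]

# Lists are mutable and therefore unhashable, so they cannot be Counter keys.
try:
    Counter(rows)
    list_keys_work = True
except TypeError:
    list_keys_work = False

# An immutable, hashable version of each list (here a tuple) can be counted.
counts = Counter(tuple(genes) for genes in rows)
```

Any immutable sequence type works as the key here; the frozenlist package named in the entry serves the same purpose while keeping list-like behavior.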
Version 0.2.104
Release 2022-10-19
New Features
(#12346): Introduced new progress bars which include total time elapsed and look cool.
Version 0.2.103
Release 2022-10-18
Bug Fixes
(#12305): Fixed a rare crash reading tables/matrixtables with _intervals
Version 0.2.102
Released 2022-10-06
New Features
(#12218) Missing values are now supported in primitive columns in Table.to_pandas.
(#12254) Cross-product-style legends for data groups have been replaced with factored ones (consistent with ggplot2’s implementation) for hail.ggplot.geom_point, and support has been added for custom legend group labels.
(#12268) VariantDataset now implements union_rows for combining datasets with the same samples but disjoint variants.
Bug Fixes
Version 0.2.101
Released 2022-10-04
New Features
(#12218) Support missing values in primitive columns in Table.to_pandas.
(#12195) Add impute_sex_chr_ploidy_from_interval_coverage to impute sex ploidy directly from a coverage MT.
(#12222) Query-on-Batch pipelines now add worker jobs to the same batch as the driver job instead of producing a new batch per stage.
(#12244) Added support for custom labels for per-group legends to hail.ggplot.geom_point via the legend_format keyword argument.
Deprecations
(#12230) The python-dill Batch images in gcr.io/hail-vdc are no longer supported. Use hailgenetics/python-dill instead.
Bug fixes
(#12215) Fix search bar in the Hail Batch documentation.
Version 0.2.100
Released 2022-09-23
New Features
(#12207) Add support for the shape aesthetic to hail.ggplot.geom_point.
Deprecations
(#12213) The batch_size parameter of vds.new_combiner is deprecated in favor of gvcf_batch_size.
Bug fixes
Version 0.2.99
Released 2022-09-13
New Features
Performance Improvements
(#12159) Improve performance of MatrixTable reads when using
_intervals
argument
Bug fixes
Version 0.2.98
Released 2022-08-22
New Features
(#12062) hl.balding_nichols_model now supports an optional boolean parameter, phased, to control the phasedness of the generated genotypes.
Performance improvements
Bug fixes
(#12115) When using use_new_shuffle=True, fix a bug when there are more than 2^31 rows.
(#12074) Fix bug where hl.init could silently overwrite the global random seed.
(#12079) Fix bug in handling of missing (aka NA) fields in grouped aggregation and distinct by key.
(#12056) Fix hl.export_vcf to actually create tabix files when requested.
(#12020) Fix bug in hl.experimental.densify which manifested as an AssertionError about dtypes.
Version 0.2.97
Released 2022-06-30
New Features
(#11756)
hb.BatchPoolExecutor
and Python jobs both now also support async functions.
Bug fixes
(#11962) Fix error (logged as (#11891)) in VCF combiner when exactly 10 or 100 files are combined.
(#11969) Fix import_table and import_lines to use multiple partitions when force_bgz is used.
(#11964) Fix erroneous “Bucket is a requester pays bucket but no user project provided.” errors in Google Dataproc by updating to the latest Dataproc image version.
Version 0.2.96
Released 2022-06-21
New Features
(#11833)
hl.rand_unif
now has default arguments of 0.0 and 1.0
Bug fixes
(#11905) Fix erroneous FileNotFoundError in glob patterns
(#11921) and (#11910) Fix file clobbering during text export with speculative execution.
(#11920) Fix array out of bounds error when tree aggregating a multiple of 50 partitions.
(#11937) Fixed correctness bug in scan order for Table.annotate and MatrixTable.annotate_rows in certain circumstances.
(#11887) Escape VCF description strings when exporting.
(#11886) Fix an error in an example in the docs for
hl.split_multi
.
Version 0.2.95
Released 2022-05-13
New features
(#11809) Export dtypes_from_pandas in expr.types
(#11807) Teach smoothed_pdf to add a plot to an existing figure.
(#11746) The ServiceBackend, in interactive mode, will print a link to the currently executing driver batch.
(#11759) hl.logistic_regression_rows, hl.poisson_regression_rows, and hl.skat all now support configuration of the maximum number of iterations and the tolerance.
(#11835) Add hl.ggplot.geom_density which renders a plot of an approximation of the probability density function of its argument.
Bug fixes
(#11815) Fix incorrectly missing entries in to_dense_mt at the position of ref block END.
(#11828) Fix hl.init to not ignore its sc argument. This bug was introduced in 0.2.94.
(#11830) Fix an error and relax a timeout which caused hailtop.aiotools.copy to hang.
(#11778) Fix a (different) error which could cause hangs in hailtop.aiotools.copy.
Version 0.2.94
Released 2022-04-26
Deprecation
(#11765) Deprecated and removed linear mixed model functionality.
Beta features
(#11782)
hl.import_table
is up to twice as fast for small tables.
New features
hailctl dataproc
(#11710) support pass-through arguments to
connect
Bug fixes
(#11792) Resolved issue where corrupted tables could be created with whole-stage code generation enabled.
Version 0.2.93
Release 2022-03-27
Beta features
Several issues with the beta version of Hail Query on Hail Batch are addressed in this release.
Version 0.2.92
Release 2022-03-25
New features
(#11613) Add hl.ggplot support for scale_fill_hue, scale_color_hue, scale_fill_manual, and scale_color_manual. This allows for an infinite number of discrete colors.
(#11608) Add all remaining and all versions of extant public gnomAD datasets to the Hail Annotation Database and Datasets API. Current as of March 23rd 2022.
(#11662) Add the weight aesthetic to geom_bar.
Beta features
This version of Hail includes all the necessary client-side infrastructure to execute Hail Query pipelines on a Hail Batch cluster. This effectively enables a “serverless” version of Hail Query which is independent of Apache Spark. Broad affiliated users should contact the Hail team for help using Hail Query on Hail Batch. Unaffiliated users should also contact the Hail team to discuss the feasibility of running your own Hail Batch cluster. The Hail team is accessible at both https://hail.zulipchat.com and https://discuss.hail.is .
Version 0.2.91
Release 2022-03-18
Bug fixes
(#11614) Update hail.utils.tutorial.get_movie_lens to use https instead of http. Movie Lens has stopped serving data over insecure HTTP.
(#11563) Fix issue hail-is/hail#11562.
(#11611) Fix a bug that prevents the display of hl.ggplot.geom_hline and hl.ggplot.geom_vline.
Version 0.2.90
Release 2022-03-11
Critical BlockMatrix from_numpy correctness bug
(#11555)
BlockMatrix.from_numpy
did not work correctly. Version 1.0 of org.scalanlp.breeze, a dependency of Apache Spark that hail also depends on, has a correctness bug that results in BlockMatrices that repeat the top left block of the block matrix for every block. This affected anyone running Spark 3.0.x or 3.1.x.
Bug fixes
(#11556) Fixed assertion error occasionally being thrown by valid joins where the join key was a prefix of the left key.
Versioning
(#11551) Support Python 3.10.
Version 0.2.89
Release 2022-03-04
(#11452) Fix
impute_sex_chromosome_ploidy
docs.
Version 0.2.88
Release 2022-03-01
This release addresses the deploy issues in the 0.2.87 release of Hail.
Version 0.2.87
Release 2022-02-28
An error in the deploy process required us to yank this release from PyPI. Please do not use this release.
Bug fixes
(#11401) Fixed bug where
from_pandas
didn’t support missing strings.
Version 0.2.86
Release 2022-02-25
Bug fixes
Performance improvements
(#11306) Newly written tables that have no duplicate keys will be faster to join against.
Version 0.2.85
Release 2022-02-14
Bug fixes
New features
(#11332) Added geom_ribbon and geom_area to hail ggplot.
Version 0.2.84
Release 2022-02-10
Bug fixes
(#11328) Fix bug where occasionally files written to disk would be unreadable.
(#11331) Fix bug that potentially caused files written to disk to be unreadable.
(#11312) Fix aggregator memory leak.
(#11340) Fix bug where repeatedly annotating same field name could cause failure to compile.
(#11342) Fix to possible issues about having too many open file handles.
New features
Version 0.2.83
Release 2022-02-01
Bug fixes
New features
(#11274) Added geom_col to hail.ggplot.
hailctl dataproc
(#11280) Updated dataproc image version to one not affected by log4j vulnerabilities.
Version 0.2.82
Release 2022-01-24
Bug fixes
(#11209) Significantly improved usefulness and speed of
Table.to_pandas
, resolved several bugs with output.
New features
Performance Improvements
(#11216) Significantly improve performance of
parse_locus_interval
Python and Java Support
File Format
The native file format version is now 1.6.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
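The file format notes in this change log imply a simple rule: a library version can read a file format version only if the library is at least as new as the release that introduced that format. The sketch below encodes that rule in plain Python. The helper names parse and can_read are hypothetical and are not part of Hail, and the table only covers the format versions mentioned in this document.

```python
# Library version that introduced each native file format version,
# taken from the "File Format" notes in this change log.
FORMAT_INTRODUCED_IN = {
    "1.1.0": "0.2.17",
    "1.2.0": "0.2.25",
    "1.3.0": "0.2.26",
    "1.4.0": "0.2.34",
    "1.5.0": "0.2.55",
    "1.6.0": "0.2.82",
    "1.7.0": "0.2.119",
}


def parse(version):
    """Turn '0.2.82' into (0, 2, 82) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def can_read(library_version, file_format_version):
    """Hypothetical helper: True if the library is at least as new as the
    release that introduced the given file format version."""
    introduced_in = FORMAT_INTRODUCED_IN[file_format_version]
    return parse(library_version) >= parse(introduced_in)
```

Comparing parsed tuples rather than raw strings matters here: as strings, "0.2.100" sorts before "0.2.82", which would give the wrong answer for three-digit patch versions.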
Version 0.2.81
Release 2021-12-20
hailctl dataproc
(#11182) Updated Dataproc image version to mitigate yet more Log4j vulnerabilities.
Version 0.2.80
Release 2021-12-15
New features
(#11077)
hl.experimental.write_matrix_tables
now returns the paths of the written matrix tables.
hailctl dataproc
(#11157) Updated Dataproc image version to mitigate the Log4j vulnerability.
(#10900) Added --region parameter to hailctl dataproc submit.
(#11090) Teach hailctl dataproc describe how to read URLs with the protocols s3 (Amazon S3), hail-az (Azure Blob Storage), and file (local file system) in addition to gs (Google Cloud Storage).
Version 0.2.79
Release 2021-11-17
Bug fixes
(#11023) Fixed bug in call decoding that was introduced in version 0.2.78.
New features
(#10993) New function
p_value_excess_het
.
Version 0.2.78
Release 2021-10-19
Bug fixes
New features
(#10855) Arbitrary aggregations can be implemented using
hl.agg.fold
.
Performance Improvements
(#10971) Substantially improve the speed of
Table.collect
when collecting large amounts of data.
Version 0.2.77
Release 2021-09-21
Bug fixes
Version 0.2.76
Released 2021-09-15
Bug fixes
Version 0.2.75
Released 2021-09-10
Bug fixes
(#10733) Fix a bug in tabix parsing when the size of the list of all sequences is large.
(#10765) Fix rare bug where valid pipelines would fail to compile if intervals were created conditionally.
(#10746) Various compiler improvements, decrease likelihood of ClassTooLarge errors.
(#10829) Fix a bug where hl.missing and CaseBuilder.or_error failed if their type was a struct containing a field starting with a number.
New features
(#10768) Support multiplying StringExpressions to repeat them, as with normal Python strings.
Performance improvements
Version 0.2.74
Released 2021-07-26
Bug fixes
Version 0.2.73
Released 2021-07-22
Bug fixes
Version 0.2.72
Released 2021-07-19
New Features
Bug fixes
Version 0.2.71
Released 2021-07-08
New Features
Bug fixes
hailctl dataproc
(#10633) Added --scopes parameter to hailctl dataproc start.
Version 0.2.70
Released 2021-06-21
Version 0.2.69
Released 2021-06-14
New Features
Bug fixes
hailctl dataproc
(#10574) Hail logs will now be stored in
/home/hail
by default.
Version 0.2.68
Released 2021-05-27
Version 0.2.67
Critical performance fix
Released 2021-05-06
(#10451) Fixed a memory leak / performance bug triggered by
hl.literal(...).contains(...)
Version 0.2.66
Released 2021-05-03
New features
Version 0.2.65
Released 2021-04-14
Default Spark Version Change
Starting from version 0.2.65, Hail uses Spark 3.1.1 by default. This will also allow the use of all python versions >= 3.6. By building hail from source, it is still possible to use older versions of Spark.
New features
Performance improvements
(#10233) Loops created with
hl.experimental.loop
will now clean up unneeded memory between iterations.
Bug fixes
(#10227)
hl.nd.qr
now supports ndarrays that have 0 rows or columns.
Version 0.2.64
Released 2021-03-11
New features
(#10164) Add source_file_field parameter to hl.import_table to allow lines to be associated with their original source file.
Bug fixes
(#10182) Fixed serious memory leak in certain uses of filter_intervals.
(#10133) Fix bug where some pipelines incorrectly infer missingness, leading to a type error.
(#10134) Teach hl.king to treat filtered entries as missing values.
(#10158) Fixes hail usage in latest versions of jupyter that rely on asyncio.
(#10174) Fixed bad error message when incorrect return type specified with hl.loop.
Version 0.2.63
Released 2021-03-01
(#10105) Hail will now return frozenset and hail.utils.frozendict instead of normal sets and dicts.
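One practical consequence of this change can be shown with Python's built-in frozenset alone. The sketch below does not use Hail or hail.utils.frozendict, and the gene names are made up for illustration.

```python
# A frozenset is immutable and hashable, so it can be a dict key or a set element.
genes_a = frozenset({"BRCA1", "TP53"})
genes_b = frozenset({"TP53", "BRCA1"})

counts = {genes_a: 1}
counts[genes_b] = counts.get(genes_b, 0) + 1  # same elements, so the same key

# A regular set cannot be used as a dict key.
try:
    {set(["BRCA1"]): 1}
    mutable_key_works = True
except TypeError:
    mutable_key_works = False

# In-place mutation is not available on the frozen type.
has_add = hasattr(genes_a, "add")
```

Code that previously mutated a returned collection in place can make a mutable copy first, for example with set(genes_a).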
Bug fixes
Performance Improvements
Version 0.2.62
Released 2021-02-03
New features
(#9936) Deprecated hl.null in favor of hl.missing for naming consistency.
(#9973) hl.vep now includes a vep_proc_id field to aid in debugging unexpected output.
(#9839) Hail now eagerly deletes temporary files produced by some BlockMatrix operations.
(#9835) hl.any and hl.all now also support a single collection argument and a varargs of Boolean expressions.
(#9816) hl.pc_relate now includes values on the diagonal of kinship, IBD-0, IBD-1, and IBD-2.
(#9736) Let NDArrayExpression.reshape take varargs instead of mandating a tuple.
(#9766) hl.export_vcf now warns if INFO field names are invalid according to the VCF 4.3 spec.
Bug fixes
(#9976) Fixed
show()
representation of Hail dictionaries.
Performance improvements
(#9909) Improved performance of
hl.experimental.densify
by approximately 35%.
Version 0.2.61
Released 2020-12-03
New features
(#9749) Add or_error method to SwitchBuilder (
hl.switch
)
Bug fixes
Version 0.2.60
Released 2020-11-16
New features
(#9696)
hl.experimental.export_elasticsearch
will now support Elasticsearch versions 6.8 - 7.x by default.
Bug fixes
(#9641) Showing hail ndarray data now always prints in correct order.
hailctl dataproc
(#9610) Support interval fields in
hailctl dataproc describe
Version 0.2.59
Released 2020-10-22
Datasets / Annotation DB
(#9605) The Datasets API and the Annotation Database now support AWS, and users are required to specify what cloud platform they’re using.
hailctl dataproc
(#9609) Fixed bug where hailctl dataproc modify did not correctly print corresponding gcloud command.
Version 0.2.58
Released 2020-10-08
New features
(#9524) Hail should now be buildable using Spark 3.0.
(#9549) Add ignore_in_sample_frequency flag to hl.de_novo.
(#9501) Configurable cache size for BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9474) Add ArrayExpression.first and ArrayExpression.last.
(#9459) Add StringExpression.join, an analogue to Python’s str.join.
(#9398) Hail will now throw HailUserErrors if the or_error branch of a CaseBuilder is hit.
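For reference, this is the Python str.join behavior that the StringExpression.join entry above compares against. Only the built-in is shown; the Hail call is omitted because its exact signature is not given in this change log.

```python
# Python's built-in str.join: the separator is the receiver,
# and the pieces to join are passed as a single iterable argument.
alleles = ["A", "T", "G"]
joined = ",".join(alleles)
```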
Bug fixes
(#9503) NDArrays can now hold arbitrary data types, though only ndarrays of primitives can be collected to Python.
(#9501) Remove memory leak in BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9424) hl.experimental.writeBlockMatrices didn’t correctly support overwrite flag.
Performance improvements
(#9506)
hl.agg.ndarray_sum
will now do a tree aggregation.
hailctl dataproc
Deprecations
(#9482) ArrayExpression.head has been deprecated in favor of ArrayExpression.first.
Version 0.2.57
Released 2020-09-03
New features
(#9343) Implement the KING method for relationship inference as
hl.methods.king
.
Version 0.2.56
Released 2020-08-31
New features
Performance
Bug fixes
(#9304) Fix crash in
run_combiner
caused by inputs where VCF lines and BGZ blocks align.
hailctl dataproc
Version 0.2.55
Released 2020-08-19
Performance
(#9264) Table.checkpoint now uses a faster LZ4 compression scheme.
Bug fixes
(#9250) hailctl dataproc no longer uses deprecated gcloud flags. Consequently, users must update to a recent version of gcloud.
(#9294) The “Python 3” kernel in notebooks in clusters started by hailctl dataproc now features the same Spark monitoring widget found in the “Hail” kernel. There is now no reason to use the “Hail” kernel.
File Format
The native file format version is now 1.5.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.54
Released 2020-08-07
VCF Combiner
New features
(#9209) Add
hl.agg.ndarray_sum
aggregator.
Bug fixes
Version 0.2.53
Released 2020-07-30
Bug fixes
Version 0.2.52
Released 2020-07-29
Bug fixes
Version 0.2.51
Released 2020-07-28
Bug fixes
Version 0.2.50
Released 2020-07-23
Bug fixes
(#9114) Fixed crash when using repeated calls to hl.filter_intervals.
New features
Version 0.2.49
Released 2020-07-08
Bug fixes
(#9058) Fixed memory leak affecting Table.aggregate, MatrixTable.annotate_cols aggregations, and hl.sample_qc.
Version 0.2.48
Released 2020-07-07
Bug fixes
(#9029) Fix crash when using hl.agg.linreg with no aggregated data records.
(#9028) Fixed memory leak affecting Table.annotate with scans, hl.experimental.densify, and Table.group_by / aggregate.
(#8978) Fixed aggregation behavior of MatrixTable.{group_rows_by, group_cols_by} to skip filtered entries.
Version 0.2.47
Released 2020-06-23
Bug fixes
Version 0.2.46
Released 2020-06-17
Site
(#8955) Natural language documentation search
Bug fixes
(#8981) Fix BlockMatrix OOM triggered by the MatrixWriteBlockMatrix WriteBlocksRDD method
Version 0.2.45
Release 2020-06-15
Bug fixes
hailctl dataproc
Version 0.2.44
Release 2020-06-06
New Features
Bug fixes
(#8883) Fix an issue related to failures in pipelines with
force_bgz=True
.
Performance
(#8887) Substantially improve the performance of
hl.experimental.import_gtf
.
Version 0.2.43
Released 2020-05-28
Bug fixes
Version 0.2.42
Released 2020-05-27
New Features
Bug fixes
Version 0.2.41
Released 2020-05-15
Bug fixes
hailctl dataproc
(#8790) Use configured compute zone as default for hailctl dataproc connect and hailctl dataproc modify.
Version 0.2.40
Released 2020-05-12
VCF Combiner
(#8706) Add option to key by both locus and alleles for final output.
Bug fixes
Version 0.2.39
Released 2020-04-29
Bug fixes
(#8615) Fix contig ordering in the CanFam3 (dog) reference genome.
(#8622) Fix bug that causes inscrutable JVM Bytecode errors.
(#8645) Ease unnecessarily strict assertion that caused errors when aggregating by key (e.g. hl.experimental.spread).
(#8621) hl.nd.array now supports arrays with no elements (e.g. hl.nd.array([]).reshape((0, 5))) and, consequently, matmul with an inner dimension of zero.
New features
(#8571) hl.init(skip_logging_configuration=True) will skip configuration of Log4j. Users may use this to configure their own logging.
(#8588) Users who manually build Python wheels will experience less unnecessary output when doing so.
(#8572) Add hl.parse_json which converts a string containing JSON into a Hail object.
Performance Improvements
Documentation
Version 0.2.38
Released 2020-04-21
Critical Linreg Aggregator Correctness Bug
(#8575) Fixed a correctness bug in the linear regression aggregator. This was introduced in version 0.2.29. See https://discuss.hail.is/t/possible-incorrect-linreg-aggregator-results-in-0-2-29-0-2-37/1375 for more details.
Performance improvements
(#8558) Make
hl.experimental.export_entries_by_col
more fault tolerant.
Version 0.2.37
Released 2020-04-14
Bug fixes
(#8487) Fix incorrect handling of badly formatted data for hl.gp_dosage.
(#8497) Fix handling of missingness for hl.hamming.
(#8537) Fix compile-time error.
(#8539) Fix compiler error in Table.multi_way_zip_join.
(#8488) Fix hl.agg.call_stats to appropriately throw an error for badly-formatted calls.
New features
(#8327) Attempting to write to the same file being read from in a pipeline will now throw an error instead of corrupting data.
Version 0.2.36
Released 2020-04-06
Critical Memory Management Bug Fix
(#8463) Reverted a change (separate to the bug in 0.2.34) that led to a memory leak in version 0.2.35.
Bug fixes
Version 0.2.35
Released 2020-04-02
Critical Memory Management Bug Fix
(#8412) Fixed a serious per-partition memory leak that causes certain pipelines to run out of memory unexpectedly. Please update from 0.2.34.
New features
(#8404) Added “CanFam3” (a reference genome for dogs) as a bundled reference genome.
Bug fixes
Performance Improvements
hailctl dataproc
Version 0.2.34
Released 2020-03-12
New features
Bug fixes
hailctl dataproc
(#8253) hailctl dataproc now supports new flags --requester-pays-allow-all and --requester-pays-allow-buckets. This will configure your hail installation to be able to read from requester pays buckets. The charges for reading from these buckets will be billed to the project that the cluster is created in.
(#8268) The data sources for VEP have been moved to gs://hail-us-vep, gs://hail-eu-vep, and gs://hail-uk-vep, which are requester-pays buckets in Google Cloud. hailctl dataproc will automatically infer which of these buckets you should pull data from based on the region your cluster is spun up in. If you are in none of those regions, please contact us on discuss.hail.is.
File Format
The native file format version is now 1.4.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.33
Released 2020-02-27
New features
(#8173) Added new method
hl.zeros
.
Bug fixes
(#8153) Fixed compiler bug causing MatchError in import_bgen.
(#8123) Fixed an issue with multiple Python HailContexts running on the same cluster.
(#8150) Fixed an issue where output from VEP about failures was not reported in error message.
(#8152) Fixed an issue where the row count of a MatrixTable coming from import_matrix_table was incorrect.
(#8175) Fixed a bug where persist did not actually do anything.
hailctl dataproc
(#8079) Using
connect
to open the jupyter notebook browser will no longer crash if your project contains requester-pays buckets.
Version 0.2.32
Released 2020-02-07
Critical performance regression fix
(#7989) Fixed performance regression leading to a large slowdown when
hl.variant_qc
was run after filtering columns.
Performance
Bug fixes
(#7976) Fixed divide-by-zero error in hl.concordance with no overlapping rows or cols.
(#7965) Fixed optimizer error leading to crashes caused by MatrixTable.union_rows.
(#8035) Fix compiler bug in Table.multi_way_zip_join.
(#8021) Fix bug in computing shape after BlockMatrix.filter.
(#7986) Fix error in NDArray matrix/vector multiply.
New features
(#8007) Add
hl.nd.diagonal
function.
Cheat sheets
Version 0.2.31
Released 2020-01-22
New features
(#7787) Added transition/transversion information to hl.summarize_variants.
(#7792) Add Python stack trace to array index out of bounds errors in Hail pipelines.
(#7832) Add spark_conf argument to hl.init, permitting configuration of Spark runtime for a Hail session.
(#7823) Added datetime functions hl.experimental.strptime and hl.experimental.strftime.
(#7888) Added hl.nd.array constructor from nested standard arrays.
File size
(#7923) Fixed compression problem since 0.2.23 resulting in larger-than-expected matrix table files for datasets with few entry fields (e.g. GT-only datasets).
Performance
Bug fixes
Version 0.2.30
Released 2019-12-20
Performance
New features
(#7614) Added experimental support for loops with
hl.experimental.loop
.
Miscellaneous
(#7745) Changed
export_vcf
to only use scientific notation when necessary.
Version 0.2.29
Released 2019-12-17
Bug fixes
(#7229) Fixed hl.maximal_independent_set tie breaker functionality.
(#7732) Fixed incompatibility with old files leading to incorrect data read when filtering intervals after read_matrix_table.
(#7642) Fixed crash when constant-folding functions that throw errors.
(#7611) Fixed hl.hadoop_ls to handle glob patterns correctly.
(#7653) Fixed crash in ld_prune by unfiltering missing GTs.
Performance improvements
New features
(#7686) Added comment argument to import_matrix_table, allowing lines with certain prefixes to be ignored.
(#7688) Added experimental support for NDArrayExpressions in new hl.nd module.
(#7608) hl.grep now has a show argument that allows users to either print the results (default) or return a dictionary of the results.
hailctl dataproc
(#7717) Throw error when misspelling arguments instead of silently quitting.
Version 0.2.28
Released 2019-11-22
Critical correctness bug fix
(#7588) Fixes a bug where filtering old matrix tables in newer versions of hail did not work as expected. Please update from 0.2.27.
Bug fixes
New Features
hailctl dataproc
(#7586) hailctl dataproc now supports --gcloud_configuration option.
Documentation
(#7570) Hail has a cheatsheet for Tables now.
Version 0.2.27
Released 2019-11-15
New Features
(#7379) Add delimiter argument to hl.import_matrix_table
(#7389) Add force and force_bgz arguments to hl.experimental.import_gtf
(#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.
(#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.
(#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
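The "return missing on failure" behavior described for the hl.parse_int and hl.parse_float functions can be pictured with a plain-Python analogue. This is only a sketch: parse_int_or_none is a hypothetical helper, not a Hail function, and None stands in for Hail's missing value as an assumption.

```python
def parse_int_or_none(text):
    """Hypothetical plain-Python analogue: return an int, or None on failure."""
    try:
        return int(text)
    except (TypeError, ValueError):
        # ValueError: the string is not a valid integer literal.
        # TypeError: the input is not a string or number at all (e.g. None).
        return None
```

The point of this design is that one malformed value yields a missing result for that value instead of raising an error for the whole computation.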
Bug fixes
hailctl dataproc
(#7460) The Spark monitor widget now automatically collapses after a job completes.
Version 0.2.26
Released 2019-10-24
New Features
Bug Fixes
(#7361) Fix AD calculation in sparse_split_multi.
Performance Improvements
(#7355) Improve performance of IR copying.
File Format
The native file format version is now 1.3.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.25
Released 2019-10-14
New features
(#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.
(#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).
(#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.
(#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().
(#7294) Add HTML output for hl.summarize_variants().
Bug fixes
Performance improvements
File Format
The native file format version is now 1.2.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.24
Released 2019-10-03
hailctl dataproc
(#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.
New features
(#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.
(#7121) Add hl.contig_length function.
(#7130) Add window method on LocusExpression, which creates an interval around a locus.
(#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.
Bug fixes
Version 0.2.23
Released 2019-09-23
hailctl dataproc
Bug fixes
New features
(#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.
(#7076) Improved memory management in generated code, add additional log statements about allocated memory to improve debugging.
(#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).
(#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.
Performance
Version 0.2.22
Released 2019-09-12
New features
(#7013) Added contig_recoding to import_bed and import_locus_intervals.
Performance
hailctl dataproc
(#7003) Pass through extra arguments for hailctl dataproc list and hailctl dataproc stop.
Version 0.2.21
Released 2019-09-03
Bug fixes
New features
Performance
hailctl dataproc
Version 0.2.20
Released 2019-08-19
Critical memory management fix
(#6824) Fixed memory management inside
annotate_cols
with aggregations. This was causing memory leaks and segfaults.
Bug fixes
New features
(#6847) Added hl.nanmin and hl.nanmax functions.
Version 0.2.19
Released 2019-08-01
Critical performance bug fix
Bug fixes
(#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
(#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.
Performance Improvements
(#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
(#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
(#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
(#6775) Fixed a major performance problem related to reading block matrices.
hailctl dataproc
(#6760) Fixed the address pointed at by ui in connect, after Google changed proxy settings that rendered the UI URL incorrect. Also added new address hist/spark-history.
Version 0.2.18
Released 2019-07-12
Critical performance bug fix
(#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.
Bug fixes
(#6598) Fixed code generated by MatrixTable.unfilter_entries to improve performance. This will slightly improve the performance of hwe_normalized_pca and relatedness computation methods, which use unfilter_entries internally.
Version 0.2.17
Released 2019-07-10
New features
(#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
(#6405) When a matrix table has string column-keys, matrixtable.show uses the column key as the column name.
(#6345) Added an improved scan implementation, which reduces the memory load on master.
(#6462) Added export_bgen method.
(#6473) Improved performance of hl.agg.array_sum by about 50%.
(#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
(#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
(#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
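For readers unfamiliar with the genomic control inflation factor mentioned in the hl.lambda_gc entry above, the usual textbook definition is the median observed chi-square statistic (1 degree of freedom) divided by its expected median under the null. The sketch below computes that in plain Python using only the standard library. The function names are hypothetical, and whether hl.lambda_gc computes the value exactly this way is an assumption.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p):
    """Chi-square statistic (1 degree of freedom) whose upper tail
    probability is p. Requires 0 < p <= 1."""
    z = _STD_NORMAL.inv_cdf(1.0 - p / 2.0)
    return z * z


def lambda_gc_sketch(p_values):
    """Median observed chi-square statistic divided by the expected median
    under the null (the chi-square value at p = 0.5, about 0.455)."""
    observed = median(chi2_1df_from_p(p) for p in p_values)
    expected = chi2_1df_from_p(0.5)
    return observed / expected
```

A value near 1 indicates test statistics consistent with the null distribution, while values well above 1 suggest inflation.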
Bug fixes
(#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.
(#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
(#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
(#6496) Fixed html output of show methods to truncate long field contents.
(#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
(#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
(#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
(#6583) Fixed hl.plot.manhattan for non-default reference genomes.
Experimental
(#6488) Exposed table.multi_way_zip_join. This takes a list of tables of identical types, and zips them together into one table.
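To illustrate what zipping several tables together means, here is a plain-Python sketch that models each table as a dict from key to row. It assumes outer-join behavior, where each output key maps to one slot per input table and a slot is None when that table lacks the key. This is an illustration of the idea only; the function below is a stand-in and does not reproduce the signature or output schema of Hail's method.

```python
def multi_way_zip_join(tables):
    """Zip tables (each a dict mapping key -> row) into a single dict.

    Each output key maps to a list with one slot per input table. A slot
    holds that table's row, or None if the key is absent from that table.
    """
    all_keys = sorted(set().union(*(t.keys() for t in tables)))
    return {key: [t.get(key) for t in tables] for key in all_keys}


t1 = {"chr1:100": {"af": 0.1}, "chr1:200": {"af": 0.2}}
t2 = {"chr1:200": {"af": 0.3}, "chr1:300": {"af": 0.4}}
joined = multi_way_zip_join([t1, t2])
```

Here joined has one entry per distinct key, and position i of each list always refers to the i-th input table.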
File Format
The native file format version is now 1.1.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.16
Released 2019-06-19
hailctl
(#6357) Accommodated Google Dataproc bug causing cluster creation failures.
Bug fixes
(#6378) Fixed problem in how entry_float_type was being handled in import_vcf.
Version 0.2.15
Released 2019-06-14
After some infrastructural changes to our development process, we should be getting back to frequent releases.
hailctl
Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.
New features
(#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
(#5980) Added log option to hl.plot.histogram2d.
(#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
(#5913) Added functionality that makes arrays of structs easier to work with.
(#6089) Added HTML output to Expression.show when running in a notebook.
(#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
(#6123) Added hl.binary_search to search sorted numeric arrays.
(#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
(#6214) Updated Hail Python dependencies.
(#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
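The idea behind rewriting a key filter as an interval filter can be sketched in plain Python. When rows are sorted by key, a predicate such as key == target can be treated as the closed interval [target, target] and located by binary search instead of testing every row. This is only an illustration of the principle under that assumption, not Hail's optimizer; both function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Keys in sorted order, as in a keyed table.
keys = list(range(0, 1_000_000, 2))


def filter_by_scan(sorted_keys, target):
    """Evaluate the predicate `key == target` on every row."""
    return [k for k in sorted_keys if k == target]


def filter_by_interval(sorted_keys, target):
    """Treat `key == target` as the closed interval [target, target]
    and find its bounds with binary search."""
    lo = bisect_left(sorted_keys, target)
    hi = bisect_right(sorted_keys, target)
    return sorted_keys[lo:hi]


hit = filter_by_interval(keys, 123_456)
miss = filter_by_interval(keys, 123_457)
```

Both functions return the same rows, but the interval version inspects a logarithmic number of keys rather than all of them, which is why point queries benefit most.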
Bug fixes
(#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
(#6013) Turned off feature in HTSJDK that caused crashes in hl.import_vcf due to header fields being overwritten with different types, if the field had a different type than the type in the VCF 4.2 spec.
(#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
(#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
(#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
(#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
(#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.
Version 0.2.14
Released 2019-04-24
A backward-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.
New features
Version 0.2.13
Released 2019-04-18
Hail now uses Spark 2.4.x by default. If you build Hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.
New features
(#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
(#5860) Improve performance of some column annotation pipelines.
(#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
(#5799) mt.entries() is four times faster.
(#5756) Hail now uses Spark 2.4.x by default.
(#5677) MatrixTable now also supports show.
(#5793)(#5701) Add array.index(x) which finds the first index of array whose value is equal to x.
(#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
(#5690) Improve performance of ld_matrix.
(#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
(#5758) Failure to parse an interval will now produce a much more detailed error message.
(#5723) hl.import_matrix_table can now import a matrix table with no columns.
(#5724) hl.rand_norm2d samples from a two dimensional random normal.
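The semantics described for array.index(x) and array.head() can be sketched on plain Python lists, with None standing in for a missing value. The helper names are hypothetical, and returning missing from index when no element matches is an assumption; the entries above only state the behavior for a match and for an empty array.

```python
def array_index(array, x):
    """First index whose value equals x, or None (missing) if there is none."""
    for i, value in enumerate(array):
        if value == x:
            return i
    return None


def array_head(array):
    """First element of the array, or None (missing) if the array is empty."""
    return array[0] if array else None


names = ["Alice", "Bob", "Beth", "Bob"]
first_bob = array_index(names, "Bob")
no_match = array_index(names, "Carol")
first_name = array_head(names)
empty_head = array_head([])
```

Unlike Python's list.index, which raises ValueError when the value is absent, this sketch returns a missing value so that the lookup can be used inside a larger expression without failing.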
Bug fixes
(#5885) Fix Table.to_spark in the presence of fields of tuples.
(#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
(#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
(#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
(#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
(#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
(#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
(#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
(#5725) Docs now consistently refer to hl.agg not agg.
(#5730)(#5782) Taught import_bgen to optimize its variants argument.
Experimental
Version 0.2.12
Released 2019-03-28
New features
Bug fixes
Experimental
(#5524) Add summarize functions to Table, MatrixTable, and Expression.
(#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
(#5571) Add log parameter to hl.plot.histogram.
(#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
(#5608) Add LD score simulation framework.
(#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.
Version 0.2.11
Released 2019-03-06
New features
(#5374) Add default arguments to hl.add_sequence for running on GCP.
(#5481) Added sample_cols method to MatrixTable.
(#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
(#5480) Added n_cols argument to MatrixTable.head.
(#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
(#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
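For readers unfamiliar with the terms, a semi join keeps the rows of one table whose key appears in another table, and an anti join keeps the rows whose key does not. The sketch below shows these standard relational semantics on lists of dicts. The function signatures are hypothetical and are not Hail's API, which operates on keyed tables.

```python
def semi_join(rows, other_rows, key):
    """Keep rows whose key value appears in other_rows."""
    other_keys = {row[key] for row in other_rows}
    return [row for row in rows if row[key] in other_keys]


def anti_join(rows, other_rows, key):
    """Keep rows whose key value does not appear in other_rows."""
    other_keys = {row[key] for row in other_rows}
    return [row for row in rows if row[key] not in other_keys]


samples = [
    {"s": "A", "pop": "EUR"},
    {"s": "B", "pop": "AFR"},
    {"s": "C", "pop": "EAS"},
]
flagged = [{"s": "B"}, {"s": "D"}]

in_flagged = semi_join(samples, flagged, "s")
not_flagged = anti_join(samples, flagged, "s")
```

Neither operation adds fields from the second table; it is used only to decide which rows of the first table survive.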
Bug fixes
(#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
(#5419) Resolved issue with import_vcf force_bgz and file size checks.
(#5427) Resolved issue with Table.show and dictionary field types.
(#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
(#5492) Fixed hl.agg.collect crashing when collecting float32 values.
(#5525) Fixed hl.trio_matrix crashing when complete_trios is False.
Version 0.2.10
Released 2019-02-15
New features
(#5272) Added a new ‘delimiter’ option to Table.export.
(#5251) Add utility aliases to hl.plot for output_notebook and show.
(#5249) Add histogram2d function to hl.plot module.
(#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
(#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
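As a rough illustration of line-level regex filtering and substitution on text input, the sketch below drops lines that match a filter pattern and applies a (pattern, replacement) pair to the remaining lines. The function preprocess_lines and its parameter names are hypothetical. Treating filter as "exclude matching lines" and applying the filter before the substitution are both assumptions, so consult the hl.import_table documentation for the actual behavior.

```python
import re


def preprocess_lines(lines, filter_pattern=None, find_replace=None):
    """Yield lines after optional regex exclusion and substitution."""
    exclude = re.compile(filter_pattern) if filter_pattern else None
    substitution = None
    if find_replace is not None:
        pattern, replacement = find_replace
        substitution = (re.compile(pattern), replacement)
    for line in lines:
        # Assumption: a line matching the filter anywhere is dropped.
        if exclude is not None and exclude.search(line):
            continue
        if substitution is not None:
            line = substitution[0].sub(substitution[1], line)
        yield line


raw = ["## comment", "chr1\t100", "chr2\t200"]
cleaned = list(
    preprocess_lines(raw, filter_pattern=r"^##", find_replace=(r"^chr", ""))
)
```

In this example the comment line is removed and the chr prefix is stripped from the remaining lines before any parsing would take place.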
Performance improvements
(#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
Bug fixes
(#5306) Fix ReferenceGenome.add_sequence causing a crash.
(#5268) Fix Table.export writing a file called ‘None’ in the current directory.
(#5265) Fix hl.get_reference raising an exception when called before hl.init().
(#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
(#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
(#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
(#5295) Fix crash in Table.index related to key field incompatibilities.
Version 0.2.9
Released 2019-01-30
New features
Performance improvements
Bug fixes
(#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
(#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
(#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
(#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
(#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).
Version 0.2.8
Released 2019-01-15
New features
Performance improvements
Bug fixes
Version 0.2.7
Released 2019-01-03
New features
(#5046)(experimental) Added option to BlockMatrix.export_rectangles to export as NumPy-compatible binary.
Performance improvements
(#5050) Short-circuit iteration in logistic_regression_rows and poisson_regression_rows if NaNs appear.
Version 0.2.6
Released 2018-12-17
New features
(#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
(#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
(#4926) Expanded default GRCh38 contig recoding behavior in import_plink.
Performance improvements
Bug fixes
(#4941) Fixed variable scoping error in regression methods.
(#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
(#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
(#4920) Fixed bad error message in Table.order_by.
Version 0.2.5
Released 2018-12-07
New features
(#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
(#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
(#4820) Added hl.reversed for reversing arrays and strings.
(#4895) Added include_strand option to the hl.liftover function.
Performance improvements
Bug fixes
(#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
(#4888) Fixed assertion error in BlockMatrix.sum.
(#4871) Fixed possible error in locally sorting nested collections.
(#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
(#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].
Version 0.2.4: Beginning of history!
We didn’t start manually curating information about user-facing changes until version 0.2.4.
The full commit history is available here.