{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering and Annotation Tutorial\n", "\n", "### Filter\n", "\n", "You can filter the rows of a table with [Table.filter](https://hail.is/docs/0.2/hail.Table.html#hail.Table.filter). This returns a table of those rows for which the expression evaluates to `True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hail as hl\n", "\n", "hl.utils.get_movie_lens('data/')\n", "users = hl.read_table('data/users.ht')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.filter(users.occupation == 'programmer').count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also express this query in multiple ways using [aggregations](https://hail.is/docs/0.2/tutorials/04-aggregation.html): " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.aggregate(hl.agg.filter(users.occupation == 'programmer', hl.agg.count()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.aggregate(hl.agg.counter(users.occupation == 'programmer'))[True]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Annotate\n", "\n", "You can add new fields to a table with [annotate](https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate). As an example, let's create a new column called `cleaned_occupation` that replaces missing entries in the `occupation` field labeled as 'other' with 'none.'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_occupations = hl.set(['other', 'none'])\n", "\n", "t = users.annotate(\n", " cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),\n", " hl.missing('str'),\n", " users.occupation))\n", "t.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare this to what we had before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: `annotate` is functional: it doesn't mutate `users`, but returns a new table. This is also true of `filter`. In fact, all operations in Hail are functional." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select and Transmute\n", "There are two other annotate methods: [select](https://hail.is/docs/0.2/hail.Table.html#hail.Table.select) and [transmute](https://hail.is/docs/0.2/hail.Table.html#hail.Table.transmute). `select` allows you to create new tables from old ones by selecting existing fields, or creating new ones." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's extract the `sex` and `occupation` fields:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "users.select(users.sex, users.occupation).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also create a new field that stores the age relative to the average. Note that new fields *must* be assigned a name (in this case `mean_shifted_age`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mean_age = round(users.aggregate(hl.agg.stats(users.age)).mean)\n", "users.select(users.sex, users.occupation, mean_shifted_age = users.age - mean_age).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`transmute` replaces any fields mentioned on the right-hand side with the new fields, but leaves unmentioned fields unchanged. `transmute` is useful for transforming data into a new form. Compare the following two snippts of code. The second is identical to the first with `transmute` replacing `select.`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_occupations = hl.set(['other', 'none'])\n", "\n", "t = users.select(\n", " cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),\n", " hl.missing('str'),\n", " users.occupation))\n", "t.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing_occupations = hl.set(['other', 'none'])\n", "\n", "t = users.transmute(\n", " cleaned_occupation = hl.if_else(missing_occupations.contains(users.occupation),\n", " hl.missing('str'),\n", " users.occupation))\n", "t.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Global Fields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can add global fields with [annotate_globals](https://hail.is/docs/0.2/hail.Table.html#hail.Table.annotate_globals). Globals are useful for storing metadata about a dataset or storing small data structures like sets and maps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t = users.annotate_globals(cohort = 5, cloudable = hl.set(['sample1', 'sample10', 'sample15']))\n", "t.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t.cloudable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hl.eval(t.cloudable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercises\n", "\n", "\n", "- [Z-score normalize](https://en.wikipedia.org/wiki/Standard_score) the age field of `users`.\n", "- Convert `zip` to an integer. Hint: Not all zipcodes are US zipcodes! Use [hl.int32](https://hail.is/docs/0.2/functions/constructors.html#hail.expr.functions.int32) to convert a string to an integer. Use [StringExpression.matches](https://hail.is/docs/0.2/expressions.html#hail.expr.expressions.StringExpression.matches) to see if a string matches a regular expression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }