{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregation Tutorial\n",
    "\n",
    "In the last section, we inspected the structure of the data and displayed a few example values.\n",
    "\n",
    "How do we get a deeper feel for the data?  One of the most natural things to do is to create a summary of a large number of values.  For example, you could ask:\n",
    "\n",
    " - How many women are in the dataset? How many men?\n",
    " - What is the average age?  Youngest age?  Oldest age?\n",
    " - What are all the occupations that appear, and how many times does each appear?\n",
    "\n",
    "We can answer these questions with *aggregation*.  Aggregation combines many values together to create a summary.\n",
    "\n",
    "To start, we'll aggregate all the values in a table.  (Later, we'll learn how to aggregate over subsets.)\n",
    "\n",
    "We can do this with the [Table.aggregate](https://hail.is/docs/0.2/hail.Table.html#hail.Table.aggregate) method.\n",
    "\n",
    "A call to `aggregate` has two parts:\n",
    "\n",
    " - The expression to aggregate over (e.g. a field of a `Table`).\n",
    " - The *aggregator* to combine the values into the summary.\n",
    " \n",
    "Hail has a large suite of [aggregators](https://hail.is/docs/0.2/aggregators.html) for summarizing data.  Let's see some in action!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `count`\n",
    "Aggregators live in the `hl.agg` module.  The simplest aggregator is [count](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.count).  It takes no arguments and returns the number of values aggregated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bokeh.io import output_notebook, show\n",
    "\n",
    "import hail as hl\n",
    "\n",
    "output_notebook()\n",
    "hl.init()\n",
    "\n",
    "\n",
    "hl.utils.get_movie_lens('data/')\n",
    "users = hl.read_table('data/users.ht')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.count())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.count()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `stats`\n",
    "[stats](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.stats) computes useful statistics about a numeric expression at once.  There are also aggregators for `mean`, `min`, `max`, `sum`, `product` and `array_sum`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.stats(users.age))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `counter`\n",
    "What about non-numeric data, like the `occupation` field?  \n",
    "\n",
    "[counter](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.counter) is modeled on the Python Counter object: it counts the number of times each distinct value occurs in the collection of values being aggregated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.counter(users.occupation))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `filter`\n",
    "You can filter elements of a collection before aggregation by using [filter](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.filter)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.filter(users.sex == 'M', hl.agg.count()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The argument to `filter` should be a Boolean expression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.count_where(users.sex == 'M'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Boolean expressions can be compound, but be sure to use parentheses appropriately. A single '&' denotes logical AND and a single '|' denotes logical OR. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.filter((users.occupation == 'writer') | (users.occupation == 'executive'), hl.agg.count()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.filter((users.sex == 'F') | (users.occupation == 'executive'), hl.agg.count()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `hist`\n",
    "\n",
    "As we saw in the first tutorial, [hist](https://hail.is/docs/0.2/aggregators.html#hail.expr.aggregators.hist) can be used to build a histogram over numeric data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hist = users.aggregate(hl.agg.hist(users.age, 10, 70, 60))\n",
    "hist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p = hl.plot.histogram(hist, legend='Age')\n",
    "show(p)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `take` and `collect`\n",
    "There are a few aggregators for collecting values.\n",
    "\n",
    "- `take` localizes a few values into an array.  It has an optional `ordering`.\n",
    "- `collect` localizes all values into an array.\n",
    "- `collect_as_set` localizes all unique values into a set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.take(users.occupation, 5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "users.aggregate(hl.agg.take(users.age, 5, ordering=-users.age))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Warning!  Aggregators like `collect` and `counter` return Python objects and can fail with out of memory errors if you apply them to collections that are too large (e.g. all 50 trillion genotypes in the UK Biobank dataset)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}