PythonJob

class hailtop.batch.job.PythonJob(batch, token, *, name=None, attributes=None)

Bases: hailtop.batch.job.Job

Object representing a single Python job to execute.

Examples

Create a new Python job that multiplies two numbers and then adds 5 to the result:

# Create a batch object with a default Python image

b = Batch(default_python_image='gcr.io/hail-vdc/python-dill:3.7-slim')

def multiply(x, y):
    return x * y

def add(x, y):
    return x + y

j = b.new_python_job()
result = j.call(multiply, 2, 3)
result = j.call(add, result, 5)

# Write out the str representation of result to a file

b.write_output(result.as_str(), 'hello.txt')

b.run()

Notes

This class should never be created directly by the user. Use Batch.new_python_job() instead.

Methods

call

Execute a Python function.

image

Set the job’s docker image.

call(unapplied, *args, **kwargs)

Execute a Python function.

Examples

import json

def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

def format_as_csv(x, y, add_result, mult_result):
    return f'{x},{y},{add_result},{mult_result}'

def csv_to_json(path):
    data = []
    with open(path) as f:
        for line in f:
            line = line.rstrip()
            fields = line.split(',')
            d = {'x': int(fields[0]),
                 'y': int(fields[1]),
                 'add': int(fields[2]),
                 'mult': int(fields[3])}
            data.append(d)
    return json.dumps(data)

# Get all the multiplication and addition table results

b = Batch(name='add-mult-table')

formatted_results = []

for x in range(3):
    for y in range(3):
        j = b.new_python_job(name=f'{x}-{y}')
        add_result = j.call(add, x, y)
        mult_result = j.call(multiply, x, y)
        result = j.call(format_as_csv, x, y, add_result, mult_result)
        formatted_results.append(result.as_str())

cat_j = b.new_bash_job(name='concatenate')
cat_j.command(f'cat {" ".join(formatted_results)} > {cat_j.output}')

csv_to_json_j = b.new_python_job(name='csv-to-json')
json_output = csv_to_json_j.call(csv_to_json, cat_j.output)

b.write_output(json_output.as_str(), '/output/add_mult_table.json')
b.run()

Notes

Unlike a BashJob, a PythonJob returns a new PythonResult for every invocation of PythonJob.call(). A PythonResult can be used as an argument in subsequent invocations of PythonJob.call(), as an argument in downstream Python jobs, or as an input to a bash job. Likewise, InputResourceFile, JobResourceFile, and ResourceGroup can be passed to PythonJob.call(). Batch automatically detects dependencies between jobs, including between Python jobs and bash jobs.

When a ResourceFile is passed as an argument, it is passed to the function as a string containing the local file path. When a ResourceGroup is passed as an argument, it is passed to the function as a dict where the keys are the resource identifiers in the original ResourceGroup and the values are the local file paths.
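As an illustrative sketch of what this means for the function author (plain Python run outside of Batch; the function names and file contents below are hypothetical), a function written for a PythonJob just works with local paths and dicts of paths:

```python
import os
import tempfile

def count_lines(path):
    # Inside a PythonJob, `path` would be the local path Batch
    # substitutes for a ResourceFile argument.
    with open(path) as f:
        return sum(1 for _ in f)

def group_keys(group):
    # Inside a PythonJob, `group` would be a dict mapping resource
    # identifiers from the original ResourceGroup to local file paths.
    return sorted(group.keys())

# Simulate the values Batch would pass in:
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, 'data.txt')
    with open(p, 'w') as f:
        f.write('a\nb\nc\n')
    print(count_lines(p))                      # 3
    print(group_keys({'bed': p, 'bim': p}))    # ['bed', 'bim']
```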

Like JobResourceFiles, all PythonResults are stored as temporary files and must be written to a permanent location using Batch.write_output() if the output needs to be saved. A PythonResult is saved as a dill-serialized object. However, you can use one of the methods PythonResult.as_str(), PythonResult.as_repr(), or PythonResult.as_json() to convert a PythonResult to a JobResourceFile with the desired output.
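Conceptually, the raw PythonResult file holds the serialized return value, while the as_* helpers write plain-text views of it. A rough stand-in for that behavior (using pickle in place of dill so the sketch is dependency-free; for simple values the two coincide):

```python
import json
import pickle

value = {'x': 1, 'y': 2}

raw = pickle.dumps(value)       # what the PythonResult file holds (dill in Batch)
as_str = str(value)             # roughly what PythonResult.as_str() writes
as_repr = repr(value)           # roughly PythonResult.as_repr()
as_json = json.dumps(value)     # roughly PythonResult.as_json()

print(pickle.loads(raw) == value)  # True
print(as_json)                     # {"x": 1, "y": 2}
```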

Warning

Any non-builtin packages used by unapplied must be installed in the job's image. You can use docker.build_python_image() to build a Python image, compatible with Python jobs, that has additional Python packages installed.

Here are some tips to make sure your function can be used with Batch:

  • Only reference top-level modules, such as numpy or pandas, in your functions.

  • If you get a serialization error, try moving your imports into your function.

  • Instead of serializing a complex class, determine what information is essential and only serialize that, perhaps as a dict or array.
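For example, moving an import inside the function body keeps the serialized function self-contained (a plain-Python sketch; outside of Batch it behaves identically):

```python
def summarize(xs):
    # Importing inside the function means the import travels with the
    # function when it is dill-serialized for a PythonJob, instead of
    # relying on module state captured from the submitting machine.
    import statistics
    return {'mean': statistics.mean(xs), 'n': len(xs)}

print(summarize([1, 2, 3]))
```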

Parameters
  • unapplied (Callable) – A reference to a Python function to execute.

  • args – Positional arguments to the Python function. Must be either a builtin Python object, a Resource, or a dill-serializable object.

  • kwargs – Keyword arguments to the Python function. Must be either a builtin Python object, a Resource, or a dill-serializable object.

Return type

PythonResult

Returns

A PythonResult representing the return value of the function.

image(image)

Set the job’s docker image.

Notes

image must already exist and must have the same version of Python as the machine submitting the Batch. It must also have the dill Python package installed. You can use the function docker.build_python_image() to build a new image containing dill and additional Python packages.

Examples

Set the job’s docker image to gcr.io/hail-vdc/python-dill:3.7-slim:

>>> b = Batch()
>>> j = b.new_python_job()
>>> (j.image('gcr.io/hail-vdc/python-dill:3.7-slim')
...   .call(print, 'hello'))
>>> b.run()

Parameters

image (str) – Docker image to use.

Return type

PythonJob

Returns

Same job object with docker image set.