Using AWS Batch with integrate.ai

Use the integrate.ai SDK to run batch jobs on remote datasets in AWS

AWS Batch provides convenient, scalable compute for machine learning workloads and can be integrated with the SDK.

Required components

integrate.ai SDK and Client Docker image

AWS Batch

Amazon Elastic Container Registry (ECR)

Amazon Elastic Container Service (ECS)

AWS Systems Manager (SSM)

Amazon Simple Storage Service (S3)

AWS Console, CloudFormation, or Terraform

Configure AWS Batch

If this is your first time running a training session in AWS Batch, ensure that you have followed the instructions for Setting up AWS Batch before you continue.

Running AWS Batch jobs through the SDK

Federated learning models are trained through sessions. You define the parameters required to train a federated model, including data and model configurations, in a session. Additional session parameters are required when using AWS Batch.

Install the SDK

Before you begin, ensure that you have installed the latest integrate.ai SDK. See Environment Setup for details.

# Update the CLI tool
pip install -U integrate-ai

# Install the latest SDK
iai sdk install --token <IAI_TOKEN>

Use the integrateai_batch_client.ipynb notebook to follow along and test the examples shown below by filling in your own variables as required.

Specify Batch parameters

In addition to the session definition, there are AWS Batch-specific parameters required by the SDK to run a batch job.

Training and test data paths

Specify the path(s) to your training and test data on S3.

Example:

train_path1 = "s3://{path to training data}"
train_path2 = "s3://{path to training data}"
test_path = "s3://{path to test data}"
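
If you want to confirm that these objects are reachable before submitting a job, a quick sanity check with boto3 might look like the following. This is optional and independent of the SDK; it uses whatever credentials boto3 finds in your environment and assumes they can read the buckets:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def s3_object_exists(s3_uri: str) -> bool:
    """Return True if the object behind an s3:// URI is readable."""
    bucket, _, key = s3_uri.removeprefix("s3://").partition("/")
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False

for path in (train_path1, train_path2, test_path):
    print(path, "found" if s3_object_exists(path) else "missing or not readable")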

AWS Authentication

If you are generating temporary AWS credentials, specify them as in the example below. Otherwise, use the default profile credentials or pass in a dict of AWS credential values.

Example:

import os

aws_creds = {
    'ACCESS_KEY': os.environ.get("AWS_ACCESS_KEY_ID"),
    'SECRET_KEY': os.environ.get("AWS_SECRET_ACCESS_KEY"),
    'SESSION_TOKEN': os.environ.get("AWS_SESSION_TOKEN"),
    'REGION': os.environ.get("AWS_REGION"),
}
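
If you would rather build the same dictionary from a configured AWS profile instead of environment variables, a sketch like the one below works with boto3. The profile name is illustrative, and the session token is None for long-lived access keys:

import boto3

# Build the credential dictionary from a configured profile instead of
# environment variables. "default" is illustrative; use any profile that
# can read your S3 data and submit Batch jobs.
session = boto3.Session(profile_name="default")
frozen = session.get_credentials().get_frozen_credentials()

aws_creds = {
    'ACCESS_KEY': frozen.access_key,
    'SECRET_KEY': frozen.secret_key,
    'SESSION_TOKEN': frozen.token,   # None for long-lived access keys
    'REGION': session.region_name,
}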

Batch environment

Specify the names of the job_queue, job_definition, and ssm_token that you created in Setting up AWS Batch.

job_queue='<aws batch job queue name>'
job_def='<aws batch job definition name>'
ssm_token='<name of iai token stored on SSM>'
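
To confirm these resources exist before submitting anything, you can optionally describe them with boto3. This is a plain AWS sanity check, not part of the integrate.ai SDK, and it assumes your credentials are allowed to describe Batch resources and read the SSM parameter:

import boto3

# Optional sanity check that the Batch environment and the SSM token exist.
batch = boto3.client("batch", region_name=aws_creds['REGION'])
ssm = boto3.client("ssm", region_name=aws_creds['REGION'])

queues = batch.describe_job_queues(jobQueues=[job_queue])["jobQueues"]
print("Job queue found:", len(queues) == 1)

defs = batch.describe_job_definitions(jobDefinitionName=job_def, status="ACTIVE")
print("Active job definition revisions:", len(defs["jobDefinitions"]))

# WithDecryption is required if the token is stored as a SecureString.
ssm.get_parameter(Name=ssm_token, WithDecryption=True)
print("SSM parameter found:", ssm_token)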

Define a training session

Prepare your model configuration and data schema. See Models for information on the models available out-of-the-box in integrate.ai, or see Building a Custom Model for information on building your own model.

Define your training session as usual. The session definition is passed to the batch through the task group, which also contains the tasks for the batch.

Example:

training_session = client.create_fl_session(
    name="Testing notebook",
    description="I am testing a batch job through a notebook",
    min_num_clients=2,
    num_rounds=2,
    package_name="iai_ffnet",
    model_config=<model_config>,
    data_config=<data_schema>,
)

The min_num_clients specified here must match the number of tasks added to the task group.

Specify the model_config and data_config names for the configuration and schema that you want to use.
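
Both are ordinary Python dictionaries. The snippet below is only an illustration of their general shape for an iai_ffnet session; the actual keys, values, and column names must come from the Models documentation (or your custom model's definition) and from your own dataset.

# Illustrative only: the exact keys and values depend on the model
# package you choose (see Models) and on your dataset's columns.
model_config = {
    "strategy": {"name": "FedAvg", "params": {}},
    "model": {"params": {"input_size": 15, "hidden_layer_sizes": [6, 6, 6], "output_size": 2}},
    "ml_task": {"type": "classification", "params": {}},
    "optimizer": {"name": "SGD", "params": {"learning_rate": 0.2, "momentum": 0.0}},
    "seed": 23,
}

data_schema = {
    "predictors": ["x0", "x1", "x2"],
    "target": "y",
}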

Running a batch with task group and taskbuilder

Instead of running the integrate.ai client directly, import and use the taskbuilder and task group helpers. Create a task object for each task, then use the task group to add each task to the batch.

One task is equivalent to one client in integrate.ai terms. The min_num_clients given in the training session definition must match the number of tasks defined in the batch.

Import the required functions.

from integrate_ai_sdk.taskgroup.taskbuilder import aws as taskbuilder_aws
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

Create a taskbuilder object, and provide the required parameters.

tb = taskbuilder_aws.batch(
    ssm_token_parameter_key=ssm_token,
    job_queue=job_queue,
    aws_credentials=aws_creds,
    cpu_job_definition=job_def)

Create a task group and start the batch

The task group defines the training_session and the tasks to run for the batch. The following code snippet creates the task group and starts the job.

task_group_context = SessionTaskGroup(training_session)\
    .add_task(tb.hfl(
        train_path=train_path1,
        test_path=test_path,
        session=training_session,
        vcpus='2',
        memory='16384'))\
    .add_task(tb.hfl(
        train_path=train_path2,
        test_path=test_path,
        session=training_session,
        vcpus='2',
        memory='16384'))\
    .start()

The vcpus and memory parameters are optional. Use them to adjust the values in the job definition if necessary.

Monitor submitted jobs

The task group context contains the session ID.

print(task_group_context.session_id)

To monitor the status of the tasks:

task_group_status = task_group_context.status()
for task_status in task_group_status:
    print(task_status)

Specify a wait time that is appropriate for your tasks and monitor for session completion or failure:

task_group_context.wait(300)
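
If you prefer to poll rather than block on a single long wait, a simple loop over the status call shown above is enough. The interval and number of checks below are illustrative; size them to your job:

import time

# Poll the task group periodically instead of blocking on one long wait().
for _ in range(20):
    for task_status in task_group_context.status():
        print(task_status)
    time.sleep(30)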

You can also review the results of the job(s) in the AWS console.
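
Alternatively, you can list the submitted jobs programmatically with boto3. Again, this is a plain AWS call outside the SDK, and it assumes your credentials may list jobs in the queue:

import boto3

# List recent jobs in the queue used for this session.
batch = boto3.client("batch", region_name=aws_creds['REGION'])
for status in ("RUNNING", "SUCCEEDED", "FAILED"):
    for job in batch.list_jobs(jobQueue=job_queue, jobStatus=status)["jobSummaryList"]:
        print(status, job["jobId"], job["jobName"])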
