Data Requirements

To train a model, each client data silo must have its data in the same format.

To run a training session, the data must be prepared according to the following standards:

  1. The data configuration dictated in the Session configuration

  2. The data requirements for running a Horizontal Federated Learning (HFL) model with integrate.ai

integrate.ai supports tabular data for standard models. You can create custom dataloaders for custom models, which allow image and video data to be provided as a folder instead of a flat file. For more information about dataloaders, see Create a custom dataloader.

The following data requirements are for our standard GLM and FFNet models.

integrate.ai Data Requirements

integrate.ai currently supports HFL to train models across different siloed datasets (or data silos) without transferring data between silos. To train an HFL model, the datasets in each data silo that will participate in an HFL training session must be prepared according to the following requirements:

  1. Data must be in Apache Parquet format (recommended) or .csv, or referenced by S3 URLs (see below).

  2. For custom models, the data can be in either a single-file or folder format.

  3. Data must be fully feature-engineered. Specifically, the data frame must be ready to use as input to a neural network. This means that:

    1. All columns must be numerical

    2. Columns must not contain NULL values (missing values must be imputed beforehand)

    3. Categorical variables must be properly encoded (for example, by one-hot encoding)

    4. Continuous variables must be normalized to have mean = 0 and std = 1

  4. Feature engineering must be consistent across the silos. For example, if the datasets contain categorical values, such as postal codes, these values must be encoded the same way across all the datasets. For the postal code example, this means that the same postal code value must translate to the same numerical value in every dataset that participates in the training.

  5. Column names must be consistent across datasets. All column names (predictors and targets) must contain only letters, numbers, dashes (-), and underscores (_), and must start with a letter. You can select which columns you want to use in a specific training session.

If the above criteria are not met, the training session will fail to run.
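
For illustration, the following is a minimal sketch of preparing a data frame that meets these requirements. It uses pandas and scikit-learn, which are not part of the integrate.ai SDK, and the file paths and column names are placeholders.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder raw data; any tabular source works as long as the resulting
# data frame satisfies the requirements above.
df = pd.read_csv("raw_data.csv")

# No NULL values: impute missing continuous values (here, with the median).
df["age"] = df["age"].fillna(df["age"].median())

# Encode categorical variables, for example by one-hot encoding. The
# category-to-column mapping must be identical across all silos (the same
# postal code value must produce the same encoded columns everywhere).
df = pd.get_dummies(df, columns=["postal_code"], dtype=float)

# Normalize continuous variables to mean = 0 and std = 1; the normalization
# must be applied consistently across silos.
scaler = StandardScaler()
df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])

# Column names: only letters, numbers, dashes, and underscores, starting with a letter.
df.columns = [c.replace(" ", "_") for c in df.columns]

# Save in Apache Parquet format (recommended).
df.to_parquet("prepared_data.parquet", index=False)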

Remote datasets on AWS S3

The integrate.ai client and SDK can work with datasets hosted remotely on AWS S3. You must set up and configure the AWS CLI to use S3 datasets.

Required AWS Credentials

The following environment variables must be set for the iai client to read S3 data locations. You configure these variables as part of setting up the AWS CLI. If you are generating temporary AWS credentials, specify them as in the example below. Otherwise, use the default profile credentials or pass in a dict of AWS credential values.

Example:

import os

# Temporary credentials read from the environment variables configured for the AWS CLI.
aws_creds = {
    'ACCESS_KEY': os.environ.get("AWS_ACCESS_KEY_ID"),
    'SECRET_KEY': os.environ.get("AWS_SECRET_ACCESS_KEY"),
    'SESSION_TOKEN': os.environ.get("AWS_SESSION_TOKEN"),
    'REGION': os.environ.get("AWS_REGION"),
}
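
Optionally, you can confirm that these credentials can read your dataset before starting a session. The following is a minimal sketch using boto3 (not part of the integrate.ai SDK); the bucket name and object key are placeholders.

import boto3

# Hypothetical bucket and key; replace with your dataset location.
s3 = boto3.client(
    "s3",
    aws_access_key_id=aws_creds["ACCESS_KEY"],
    aws_secret_access_key=aws_creds["SECRET_KEY"],
    aws_session_token=aws_creds["SESSION_TOKEN"],
    region_name=aws_creds["REGION"],
)

# Raises an error if the object is missing or the credentials lack read access.
s3.head_object(Bucket="my-dataset-bucket", Key="pathtodata/dataset_1.parquet")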

S3 buckets as data paths

If the AWS CLI environment is properly configured, you can provide S3 URLs as the data_path in the SDK, or for the iai client commands.

For example, the following code demonstrates setting the dataset path:

import subprocess

# IAI_TOKEN, session, filename, dataset_one, and dataset_two are placeholders
# defined elsewhere (for example, during SDK session setup).
data_path = "https://s3.<region>.amazonaws.com/pathtodata"

dataset_1 = subprocess.Popen(
    f"iai client --token {IAI_TOKEN} --session {session.id} --dataset-path {data_path}/{filename} --dataset-name {dataset_one} --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

dataset_2 = subprocess.Popen(
    f"iai client --token {IAI_TOKEN} --session {session.id} --dataset-path {data_path}/{filename} --dataset-name {dataset_two} --remove-after-complete",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
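
Because subprocess.Popen starts the clients asynchronously, you may want to wait for them to finish and inspect their output. A minimal sketch using only the standard library:

# Wait for both clients to finish and surface any errors.
for name, proc in [("dataset_1", dataset_1), ("dataset_2", dataset_2)]:
    stdout, stderr = proc.communicate()
    if proc.returncode != 0:
        print(f"{name} failed:\n{stderr.decode()}")
    else:
        print(f"{name} finished:\n{stdout.decode()}")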

Best practices for setting up S3 data:

  • Use EC2 instances for faster setup and greater flexibility to run experiments quickly. We recommend a memory-optimized instance, such as an r4.large.

  • Configure read-only AWS permissions for the dataset.

See Using AWS Batch with integrate.ai for a tutorial that includes using data hosted in S3 buckets.
