Data Preparation

To train a model, each client data silo must have its data in the same format.

Once a session is created, all relevant users must prepare their data according to two standards:

  1. The data requirements for running a Horizontal Federated Learning (HFL) model with integrate.ai

integrate.ai supports tabular data for standard models. Users creating a custom model can create custom data loaders, which allow image and video data to be provided as a folder rather than a flat file. The following data requirements apply to our standard GLM and FFNet models:

integrate.ai Data Requirements

integrate.ai currently supports HFL to train models across siloed datasets (a.k.a. data silos) without transferring data between the silos. To train an HFL model, the dataset in each data silo that will participate in an HFL training session must be prepared according to the following requirements:

  1. We recommend that the data be in Apache Parquet format, but we also support .csv. For custom models, the data can be in either a single file OR folder format.

  2. The data has to be fully feature-engineered. Specifically, the data frame must be ready to use as input to a neural net. This means that:

    1. All columns must be numerical

    2. Columns must not contain NULL values (missing values must be imputed)

    3. Categorical variables are properly encoded (e.g., via one-hot-encoding)

    4. Continuous variables are normalized to have mean = 0 and std = 1

  3. Feature engineering has to be consistent across the silos for training to work. For example, if the datasets contain categorical values, such as postal codes, these values must be encoded the same way across all datasets: the same postal code value must translate to the same numerical values in every dataset that will participate in the training.

  4. Column names have to be consistent across silo datasets. All column names (predictors and targets) must contain only letters, numbers, hyphens (-), and underscores (_), and must start with a letter. Our product will let you choose which columns you want to use in a specific training session.

If the above criteria are not met, the training will fail to run.
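
Below is a minimal sketch, in Python with pandas (and PyArrow for the Parquet output), of how one silo's dataset might be prepared to satisfy these requirements. The column names, category list, normalization statistics, and file paths are illustrative assumptions, not part of the integrate.ai API; the key points are the shared encoding, the imputation and normalization, the column-name check, and the Parquet output.

```python
# Illustrative preparation of one silo's dataset; column names, categories,
# statistics, and paths are assumptions, not integrate.ai APIs.
import re

import pandas as pd

# Category list and normalization statistics agreed by all silos in advance,
# so that feature engineering is identical everywhere.
POSTAL_CODE_CATEGORIES = ["M5V", "V6B", "H2X"]   # assumed shared category list
INCOME_MEAN, INCOME_STD = 52_000.0, 18_000.0     # assumed shared statistics

df = pd.read_csv("silo_raw.csv")  # raw data for this silo (assumed path)

# No NULL values: impute missing entries.
df["income"] = df["income"].fillna(INCOME_MEAN)

# Encode categorical variables the same way in every silo (fixed category list).
df["postal_code"] = pd.Categorical(df["postal_code"], categories=POSTAL_CODE_CATEGORIES)
df = pd.get_dummies(df, columns=["postal_code"], dtype=float)

# Normalize continuous variables to mean 0 and standard deviation 1.
df["income"] = (df["income"] - INCOME_MEAN) / INCOME_STD

# Column names: letters, numbers, hyphens, and underscores only, starting with a letter.
name_pattern = re.compile(r"^[A-Za-z][A-Za-z0-9_-]*$")
bad_names = [c for c in df.columns if not name_pattern.match(c)]
if bad_names:
    raise ValueError(f"Invalid column names: {bad_names}")

# Write the prepared data in the recommended Parquet format.
df.to_parquet("silo_prepared.parquet", index=False)
```

The same script (or at least the same category lists and normalization constants) would be run against every silo's raw data so that the resulting feature-engineered datasets line up column for column.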

This version of the product will have QA checks that flag columns that are not numeric, as well as referenced columns that do not exist in the prepared data. Ensuring that the data is cleaned and in the right format according to the requirements above is therefore key to ensuring that an HFL model can be trained with IntegrateFL.
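
As a rough illustration of what such checks look like (this is not the product's own QA code), a pre-flight validation along the following lines can catch both problems before a session is started. The function name, signature, and column names are assumptions for the example.

```python
# Illustrative pre-flight checks; check_silo_data and the column names are
# assumptions for the example, not integrate.ai APIs.
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_silo_data(df: pd.DataFrame, session_columns: list) -> None:
    """Verify that every column selected for the session exists and is numeric."""
    missing = [c for c in session_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Referenced columns not found in the prepared data: {missing}")

    non_numeric = [c for c in session_columns if not is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns: {non_numeric}")


df = pd.read_parquet("silo_prepared.parquet")                  # prepared data from the sketch above
check_silo_data(df, ["income", "postal_code_M5V", "target"])   # assumed session columns
```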

Integration Tip - If accessing data from AWS, here are some best practices for setting up your data:

  • For the EAP, we recommend using EC2 instances, as they are the fastest to set up and give you the most flexibility to run experiments quickly. We suggest a memory-optimized instance, such as an r4.large.

  • When mounting the dataset, your data must reside on the same instance that runs the Docker client. We plan to support S3, and potentially other ways of mounting data, in future releases.

Continue reading to learn how to start the session, now that the data is ready and the configurations are complete.
