Custom Model - Quick Start

A follow-along guide to uploading a model into integrate.ai

The custom model feature in integrate.ai lets you upload and federate any type of model, even one that is not pre-built in the tool.

Set up your environment

Before you begin this tutorial, set up your environment, connect to the integrate.ai Docker image in your terminal, and obtain a JWT token from the Client Management > Tokens page.

Create a custom model package

In this step, you write the code for your model. Within the integrate.ai Docker image, create a folder called custom_lstm and create four files in it: model.py, dataset.py, LSTMTagger.json, and TaggerDataset.json. Together they serve as the definition and test case of the model.

The first two files, model.py and dataset.py, serve as the basic definition of your model. The model.py file is the model class definition, and the dataset.py file defines a custom data loader.

Create a model.py file
from iai_powerflow_sdk.base_class import IaiBaseModule
import torch
from torch import nn
from opacus.layers import DPLSTM


class LSTMTagger(IaiBaseModule):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, output_size):
        """
        Here you should instantiate your layers based on the configs.
        @param embedding_dim: size of embedding output
        @param hidden_dim: size of lstm hidden state
        @param vocab_size: size of the tokenizer (total number of all possible words in the input)
        @param output_size: number of classes
        """

        # do not forget to call super init
        super(LSTMTagger, self).__init__()
        self.vocab_size = vocab_size
        self.output_size = output_size

        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim

        self.word_embeddings = nn.Embedding(self.vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.

        # To maintain DP you should replace nn.LSTM with opacus.layers.DPLSTM
        self.lstm = DPLSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, self.output_size)

    def forward(self, sentence):
        """
        the forward path of our model
        @param sentence: input tensor
        @return: the prediction tensor
        """
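        # sentence: (batch, length) -> embeds: (batch, length, embedding_dim)
        embeds = self.word_embeddings(sentence)
        # DPLSTM defaults to (length, batch, features) input. Use transpose to swap
        # the axes; view() would mix tokens across sentences when batch > 1.
        lstm_out, _ = self.lstm(embeds.transpose(0, 1))
        # Back to (batch, length, hidden_dim), then project to tag space
        tag_space = self.hidden2tag(lstm_out.transpose(0, 1))
        # Return (batch, output_size, length), the layout nn.CrossEntropyLoss expects
        return tag_space.transpose(1, 2)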


if __name__ == "__main__":
    # you can test your code here
    loss = nn.CrossEntropyLoss()
    t1 = torch.tensor([[0, 1, 2, 0, 1]])
    t2 = torch.tensor(
        [
            [0.3227, -0.1395, -0.2533],
            [0.2841, -0.1469, -0.2228],
            [0.3267, -0.1632, -0.2441],
            [0.3468, -0.1547, -0.2510],
            [0.3694, -0.1491, -0.2733],
        ]
    )
    # (length, classes) -> (batch=1, classes, length); transposing keeps each row's logits together
    t2 = t2.t().unsqueeze(0)
    print(loss(t2, t1))
Create a dataset.py file
import json
import torch
from iai_powerflow_sdk.base_class import IaiBaseDataset


class TaggerDataset(IaiBaseDataset):
    def __init__(self, path: str, max_len) -> None:
        """
        Here we can load and preprocess the data.
        @param max_len: length of your sentences.
        @param path: you only need this parameter and it points to the root folder or file, depending on your use case.
        """
        super(TaggerDataset, self).__init__(path)

        self.tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "PAD": 3}  # Assign each tag with a unique index
        # Load the word-to-index mapping and close the file when done
        with open(path + "/tokenizer_dict.json", "r") as f:
            self.to_ix = json.load(f)
        self.max_len = max_len

        self.x = []
        self.y = []
        with open(path + "/data.csv") as f:
            for line in f.readlines():
                x, y = self.prepare_sequence(*line.strip().split(","))
                self.x.append(x)
                self.y.append(y)

    def __getitem__(self, item: int) -> torch.Tensor:
        return self.x[item], self.y[item]

    def __len__(self) -> int:
        return len(self.x)

    def prepare_sequence(self, seq, tags):
        """
        Convert words to padded tensors
        @param seq: words in the input sentence
        @param tags: words for labels
        @return: generated padded tensor ids
        """
        idxs = [self.to_ix[w.lower()] for w in seq.strip().split(" ")]
        idxs += [self.to_ix["PAD"]] * (self.max_len - len(idxs))
        tags = [self.tag_to_ix[w] for w in tags.strip().split(" ")]
        tags += [self.tag_to_ix["PAD"]] * (self.max_len - len(tags))
        return torch.tensor(idxs, dtype=torch.long), torch.tensor(tags, dtype=torch.long)


if __name__ == "__main__":

    def create_tokenizer_file():
        """
        The function we used once to generate the tokenizer dictionary.
        """
        data = []
        with open("sample_data/data.csv", "r") as f:
            for l in f.readlines():
                data.append(l.strip().split(",")[0])

        word_to_ix = {"PAD": 0}
        # For each sentence, give every unseen lowercase word the next free index
        for sent in data:
            for word in sent.split(" "):
                word = word.lower()
                if word not in word_to_ix:  # word has not been assigned an index yet
                    word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
        # json is already imported at the top of the file
        with open("sample_data/tokenizer_dict.json", "w") as f:
            json.dump(word_to_ix, f)

    create_tokenizer_file()

    # max_len is a required argument; 5 matches TaggerDataset.json
    ds = TaggerDataset("./sample_data", max_len=5)
    print(ds[0])
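
The dataset class above reads each row of data.csv as a sentence and a tag string separated by a comma, then pads both to max_len. The sketch below reproduces that preprocessing in plain Python (no torch or SDK required) so the expected layout is easier to see. The two rows and the helper names build_vocab and encode_row are illustrative assumptions, not the contents of the real sample_data folder.

```python
# Hypothetical rows in the "sentence,tags" layout that TaggerDataset parses.
ROWS = [
    "The dog ate the apple,DET NN V DET NN",
    "Everybody read that book,NN V DET NN",
]
TAG_TO_IX = {"DET": 0, "NN": 1, "V": 2, "PAD": 3}  # same mapping as dataset.py


def build_vocab(rows):
    """Assign each lowercase word an index, reserving 0 for PAD."""
    vocab = {"PAD": 0}
    for row in rows:
        sentence = row.split(",")[0]
        for word in sentence.split(" "):
            vocab.setdefault(word.lower(), len(vocab))
    return vocab


def encode_row(row, vocab, max_len):
    """Return (word ids, tag ids), both right-padded to max_len."""
    sentence, tags = row.strip().split(",")
    ids = [vocab[w.lower()] for w in sentence.split(" ")]
    tag_ids = [TAG_TO_IX[t] for t in tags.split(" ")]
    ids += [vocab["PAD"]] * (max_len - len(ids))
    tag_ids += [TAG_TO_IX["PAD"]] * (max_len - len(tag_ids))
    return ids, tag_ids


vocab = build_vocab(ROWS)
encoded = [encode_row(r, vocab, max_len=5) for r in ROWS]
print(len(vocab))   # number of entries, including PAD
print(encoded[1])   # the shorter sentence, padded to five positions
```

The vocab_size passed to the model has to be at least the number of entries in the tokenizer dictionary, and max_len has to be at least the length of the longest sentence, otherwise the embedding lookup or the padding step will not behave as intended.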

Next, create the configuration files that serve as your test case and as the default model and data configuration when this model is used in integrate.ai. Each configuration file must have the same name as the corresponding class in model.py or dataset.py. In this case, that means an LSTMTagger.json file (the model configuration) and a TaggerDataset.json file (the data configuration).

Create the LSTMTagger.json
{
	"embedding_dim": 4,
	"hidden_dim": 3,
	"output_size": 4,
	"vocab_size": 9
}
Create the TaggerDataset.json
{
	"max_len": 5
}
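
One way to catch a mismatch before uploading is to compare the keys of each configuration file with the constructor parameters of the class it is named after. The sketch below assumes the configuration keys are passed to the constructor as keyword arguments, which this guide does not state explicitly. The stand-in classes and the helper check_config are hypothetical and only mirror the signatures shown above.

```python
import inspect
import json


# Stand-ins that mirror the constructor signatures in model.py and dataset.py,
# so the check runs without torch or the SDK installed.
class LSTMTagger:
    def __init__(self, embedding_dim, hidden_dim, vocab_size, output_size):
        pass


class TaggerDataset:
    def __init__(self, path, max_len):
        pass


def check_config(cls, config_text, provided=("self",)):
    """Return (missing, unexpected) parameter names for a JSON config."""
    config = json.loads(config_text)
    params = set(inspect.signature(cls.__init__).parameters) - set(provided)
    return sorted(params - set(config)), sorted(set(config) - params)


model_config = '{"embedding_dim": 4, "hidden_dim": 3, "output_size": 4, "vocab_size": 9}'
data_config = '{"max_len": 5}'

# "path" is assumed to be supplied separately (the dataset location),
# not by the configuration file.
print(check_config(LSTMTagger, model_config))
print(check_config(TaggerDataset, data_config, provided=("self", "path")))
```

Two empty lists mean every constructor parameter has a value and no stray keys are present.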

Upload the custom model

Once the model package is written, it can be tested and uploaded using the hfl upload command. This command uses the configuration files for the model and dataset to test the model package before uploading it into the integrate.ai UI.

You will need to reference a dataset so that the model can be tested as part of the upload process. Datasets can be referenced as either a flat file or a folder, depending on how the custom dataset class is defined. This example uses a folder, sample_data. Unzip this folder and save it in your custom_lstm folder to use in the hfl upload command.

Use the following command to upload your custom model package.

hfl upload --token TOKEN --package-path PACKAGE_PATH --dataset-path DATASET_PATH --task classification --batch-size BATCH_SIZE [--package-name PACKAGE_NAME] --model-config-path MODEL_CONFIG_PATH --data-config-path DATA_CONFIG_PATH [--description DESCRIPTION]
  • TOKEN is the token you obtained in the first setup step

  • PACKAGE_PATH is the path to your custom_lstm folder

  • DATASET_PATH is the path to the sample_data folder

  • --task is already specified in the command above; this example uses classification

  • BATCH_SIZE can be set at your discretion

  • PACKAGE_NAME is optional and defaults to the package directory name

  • MODEL_CONFIG_PATH and DATA_CONFIG_PATH need to reference the path to your LSTMTagger.json and TaggerDataset.json files

  • DESCRIPTION is optional and can be set at your discretion

Use the command hfl upload --help to access the definitions of each element of the command.
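
To keep the long command readable, it can be assembled from named values. The sketch below only builds and prints the command line from the flags listed above; it does not run hfl. The paths, the token placeholder, the batch size, the description, and the helper build_upload_command are all illustrative assumptions.

```python
import shlex


def build_upload_command(token, package_path, dataset_path, batch_size,
                         model_config_path, data_config_path,
                         package_name=None, description=None):
    """Assemble the hfl upload argument list from the flags shown above."""
    args = [
        "hfl", "upload",
        "--token", token,
        "--package-path", package_path,
        "--dataset-path", dataset_path,
        "--task", "classification",
        "--batch-size", str(batch_size),
        "--model-config-path", model_config_path,
        "--data-config-path", data_config_path,
    ]
    # Optional flags are added only when a value is given.
    if package_name is not None:
        args += ["--package-name", package_name]
    if description is not None:
        args += ["--description", description]
    return args


command = build_upload_command(
    token="TOKEN",  # placeholder; use the token from the Tokens page
    package_path="custom_lstm",
    dataset_path="custom_lstm/sample_data",
    batch_size=2,
    model_config_path="custom_lstm/LSTMTagger.json",
    data_config_path="custom_lstm/TaggerDataset.json",
    description="LSTM tagger example",
)
print(shlex.join(command))
```

Inside the Docker image, the same list could be passed to subprocess.run(command, check=True) to execute it; shlex.join is used here only to show the quoted command line.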

The following message will appear if you have successfully uploaded the model:

2022-02-14 17:06:52,314 FLOUR MainThread INFO | orchestration.py:315 | Successfully uploaded model definition: <model name>
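
If the upload runs from a script, the captured output can be scanned for this success line. The sketch below is a minimal check that assumes the wording of the message stays as shown above. The helper uploaded_model_name and the model name custom_lstm in the sample line are hypothetical.

```python
import re

# Matches the success message and captures whatever follows it as the model name.
SUCCESS = re.compile(r"Successfully uploaded model definition: (?P<name>.+)$")


def uploaded_model_name(log_text):
    """Return the model name from the first success line, or None if absent."""
    for line in log_text.splitlines():
        match = SUCCESS.search(line)
        if match:
            return match.group("name").strip()
    return None


sample = (
    "2022-02-14 17:06:52,314 FLOUR MainThread INFO | orchestration.py:315 | "
    "Successfully uploaded model definition: custom_lstm"
)
print(uploaded_model_name(sample))
```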

Once uploaded, the custom model appears in the Model Library, and any user in the workspace can log in to create a session using that model.
