Custom Model - Quick Start

A follow-along guide to uploading a model into integrate.ai

The custom model feature in integrate.ai lets you upload and federate any type of model, even one that is not pre-built in the tool.

Set up your environment

Before you begin this tutorial, set up your environment, connect to the integrate.ai Docker image in your terminal, and obtain a JWT token from the Client Management > Tokens page.

Create a custom model package

In this step, you will write the code for your model. Within the integrate.ai Docker image, create a folder called custom_lstm and create four files in it that will serve as the definition and test case of the model.

The first two files serve as the basic definition of your model: model.py contains the model class definition, and dataset.py defines a custom data loader.
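As a rough sketch, the layout of the finished package can be checked with a few lines of Python. The four file names come from this guide; the helper `missing_package_files` is hypothetical and is not part of the integrate.ai SDK.

```python
from pathlib import Path

# File names taken from this guide; the helper itself is hypothetical.
EXPECTED_FILES = ("model.py", "dataset.py", "LSTMTagger.json", "TaggerDataset.json")


def missing_package_files(package_dir):
    """Return the expected file names that are not yet present in package_dir."""
    root = Path(package_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Lists whichever of the four files have not been created yet.
print(missing_package_files("custom_lstm"))
```

An empty list means all four files are in place.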

Create a model.py file

from iai_powerflow_sdk.base_class import IaiBaseModule
import torch
from torch import nn
from opacus.layers import DPLSTM


class LSTMTagger(IaiBaseModule):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, output_size):
        """
        Here you should instantiate your layers based on the configs.
        @param embedding_dim: size of embedding output
        @param hidden_dim: size of lstm hidden state
        @param vocab_size: size of the tokenizer (total number of all possible words in the input)
        @param output_size: number of classes
        """

        # do not forget to call super init
        super(LSTMTagger, self).__init__()
        self.vocab_size = vocab_size
        self.output_size = output_size

        self.hidden_dim = hidden_dim
        self.embedding_dim = embedding_dim

        self.word_embeddings = nn.Embedding(self.vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.

        # To maintain DP you should replace nn.LSTM with opacus.layers.DPLSTM
        self.lstm = DPLSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, self.output_size)

    def forward(self, sentence):
        """
        the forward path of our model
        @param sentence: input tensor
        @return: the prediction tensor
        """
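        # sentence has shape (batch, length); embeds has shape (batch, length, embedding_dim)
        embeds = self.word_embeddings(sentence)
        # DPLSTM defaults to batch_first=False, so swap to (length, batch, embedding_dim).
        # transpose reorders the axes; view would only reinterpret memory and mix samples.
        lstm_out, _ = self.lstm(embeds.transpose(0, 1))
        # back to (batch, length, hidden_dim), then project to (batch, length, output_size)
        tag_space = self.hidden2tag(lstm_out.transpose(0, 1))
        # nn.CrossEntropyLoss expects the class dimension second: (batch, output_size, length)
        return tag_space.permute(0, 2, 1)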


if __name__ == "__main__":
    # you can test your code here
    loss = nn.CrossEntropyLoss()
    t1 = torch.tensor([[0, 1, 2, 0, 1]])
    t2 = torch.tensor(
        [
            [0.3227, -0.1395, -0.2533],
            [0.2841, -0.1469, -0.2228],
            [0.3267, -0.1632, -0.2441],
            [0.3468, -0.1547, -0.2510],
            [0.3694, -0.1491, -0.2733],
        ]
    )
    # rows are tokens and columns are class scores: (length, classes) -> (batch=1, classes, length)
    t2 = t2.t().unsqueeze(0)
    print(loss(t2, t1))

Create a dataset.py file

import json
import torch
from iai_powerflow_sdk.base_class import IaiBaseDataset


class TaggerDataset(IaiBaseDataset):
    def __init__(self, path: str, max_len) -> None:
        """
        Here we can load and preprocess the data.
        @param path: the only parameter the base class requires; it points to the root folder or file, depending on your use case.
        @param max_len: length to which every sentence is padded.
        """
        super(TaggerDataset, self).__init__(path)

        self.tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "PAD": 3}  # Assign each tag with a unique index
        with open(path + "/tokenizer_dict.json", "r") as f:
            self.to_ix = json.load(f)
        self.max_len = max_len

        self.x = []
        self.y = []
        with open(path + "/data.csv") as f:
            for line in f.readlines():
                x, y = self.prepare_sequence(*line.strip().split(","))
                self.x.append(x)
                self.y.append(y)

    def __getitem__(self, item: int) -> tuple:
        return self.x[item], self.y[item]

    def __len__(self) -> int:
        return len(self.x)

    def prepare_sequence(self, seq, tags):
        """
        Convert a sentence and its tags to padded id tensors
        @param seq: input sentence as space-separated words
        @param tags: labels as space-separated tags, one per word
        @return: padded tensors of word ids and tag ids
        """
        idxs = [self.to_ix[w.lower()] for w in seq.strip().split(" ")]
        idxs += [self.to_ix["PAD"]] * (self.max_len - len(idxs))
        tags = [self.tag_to_ix[w] for w in tags.strip().split(" ")]
        tags += [self.tag_to_ix["PAD"]] * (self.max_len - len(tags))
        return torch.tensor(idxs, dtype=torch.long), torch.tensor(tags, dtype=torch.long)


if __name__ == "__main__":

    def create_tokenizer_file():
        """
        The function we used once to generate the tokenizer dictionary.
        """
        data = []
        with open("sample_data/data.csv", "r") as f:
            for l in f.readlines():
                data.append(l.strip().split(",")[0])

        word_to_ix = {"PAD": 0}
        # For each sentence in the data
        for sent in data:
            for word in sent.split(" "):
                word = word.lower()
                if word not in word_to_ix:  # word has not been assigned an index yet
                    word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
        with open("sample_data/tokenizer_dict.json", "w") as f:
            json.dump(word_to_ix, f)

    create_tokenizer_file()

    ds = TaggerDataset("./sample_data", max_len=5)  # max_len is a required argument
    print(ds[0])
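The padding step in prepare_sequence can be illustrated on its own, without torch or the SDK. This is a minimal sketch: `pad_to_length` is a hypothetical helper that mirrors the list arithmetic used above, and the tag indices are copied from tag_to_ix.

```python
def pad_to_length(ids, max_len, pad_id):
    """Right-pad a list of ids with pad_id until it has max_len entries."""
    return ids + [pad_id] * (max_len - len(ids))


# Tag indices copied from TaggerDataset.tag_to_ix.
tag_to_ix = {"DET": 0, "NN": 1, "V": 2, "PAD": 3}

tags = [tag_to_ix[t] for t in "NN V DET NN".split(" ")]
padded = pad_to_length(tags, max_len=5, pad_id=tag_to_ix["PAD"])
print(padded)  # [1, 2, 0, 1, 3]
```

Note that a sequence longer than max_len is returned unchanged, because multiplying a list by a negative number yields an empty list. The same holds for prepare_sequence, so max_len should be at least as long as the longest sentence in the data.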

Next, create the configuration files that serve as your test case and as the default model and data configuration when using this model in integrate.ai. Each configuration file must have the same name as the corresponding class in model.py or dataset.py. In this case, these are LSTMTagger.json (the model configuration) and TaggerDataset.json (the data configuration).

Create the LSTMTagger.json file

{
	"embedding_dim": 4,
	"hidden_dim": 3,
	"output_size": 4,
	"vocab_size": 9
}

Create the TaggerDataset.json file

{
	"max_len": 5
}
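The keys in each configuration file match the constructor parameters of the corresponding class. Assuming the platform passes these keys to the constructor as keyword arguments (this guide implies that but does not state it), a quick local check might look like the sketch below. The `LSTMTagger` class here is a stand-in with the same constructor signature as the one in model.py, and `config_mismatch` is a hypothetical helper.

```python
import inspect
import json


class LSTMTagger:
    """Stand-in with the same constructor signature as the model in model.py."""

    def __init__(self, embedding_dim, hidden_dim, vocab_size, output_size):
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.output_size = output_size


def config_mismatch(cls, config):
    """Return (missing, unexpected) config keys relative to cls.__init__."""
    params = inspect.signature(cls.__init__).parameters
    accepted = {name for name in params if name != "self"}
    required = {
        name
        for name in accepted
        if params[name].default is inspect.Parameter.empty
    }
    missing = sorted(required - set(config))
    unexpected = sorted(set(config) - accepted)
    return missing, unexpected


# Same content as LSTMTagger.json.
config = json.loads(
    '{"embedding_dim": 4, "hidden_dim": 3, "output_size": 4, "vocab_size": 9}'
)
print(config_mismatch(LSTMTagger, config))  # ([], [])
model = LSTMTagger(**config)
```

For TaggerDataset, the JSON file holds only max_len; the path argument is presumably supplied separately when the dataset is referenced.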

Upload the custom model

Once the model package is written, it can be tested and uploaded using the hfl upload command. This command uses the configuration files for the model and dataset to test the model package before uploading it into the integrate.ai UI.

You will need to reference a dataset in order for this model to be tested as part of the upload process. Datasets can be referenced as either a flat file or a folder, depending on how the custom dataset class is defined. In this example we're using a folder. Unzip the attached sample_data.zip and save the resulting folder in your custom_lstm folder to use in the hfl upload command.

sample_data.zip (2KB)
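For reference, TaggerDataset reads two files from the data folder: data.csv, with one "words,tags" pair per line, and tokenizer_dict.json, which maps each lowercased word to an index. The sketch below writes a folder with that layout. The two rows are illustrative only; the actual contents of sample_data.zip are not shown in this guide, and `write_sample_folder` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: "<space-separated words>,<space-separated tags>" per line.
ROWS = [
    "The dog ate the apple,DET NN V DET NN",
    "Everybody read that book,NN V DET NN",
]


def write_sample_folder(root):
    """Write data.csv and tokenizer_dict.json under root, as TaggerDataset reads them."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / "data.csv").write_text("\n".join(ROWS) + "\n")

    # Index 0 is reserved for padding, as in create_tokenizer_file.
    word_to_ix = {"PAD": 0}
    for row in ROWS:
        sentence = row.split(",")[0]
        for word in sentence.lower().split(" "):
            word_to_ix.setdefault(word, len(word_to_ix))
    (root / "tokenizer_dict.json").write_text(json.dumps(word_to_ix))
    return word_to_ix


with tempfile.TemporaryDirectory() as tmp:
    vocab = write_sample_folder(Path(tmp) / "sample_data")
    print(len(vocab))  # 9
```

These two rows yield nine vocabulary entries, the same value as vocab_size in LSTMTagger.json. If you use different data, vocab_size must be at least the number of entries in tokenizer_dict.json, and max_len at least the length of the longest sentence.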