Exploratory Data Analysis - Intersect Mode

Perform basic aggregate statistics on the columns of an overlapping dataset created via PRL

The Exploratory Data Analysis (EDA) Intersect feature for private record linkage (PRL) sessions lets you access summary statistics about a group of datasets without needing access to the data itself. This gives you a basic understanding of a dataset when you cannot access the data directly or are not permitted to run computations on it, and it helps you learn more about the intersection between two datasets.

EDA is an important preliminary step for federated modelling and a simple form of federated analytics. The feature has a built-in differential privacy setting: differential privacy (DP) noise is dynamically added to each histogram generated for each feature in a participating dataset. The added privacy protection introduces slight noise into the end result.

At a high level, the process is similar to that of creating and running a session to train a model. The steps are:

  1. Run a PRL session to determine the intersection between your datasets.

  2. Configure an EDA Intersect session.

  3. Create and start the session.

  4. Run the session and poll for session status.

  5. Analyse the results.

Use the integrateai_eda_intersect_batch.ipynb notebook to follow along and test the examples shown below by filling in your own variables as required.

This documentation provides supplementary and conceptual information to expand on the code demonstration.

API Reference

The core API module that contains the EDA-specific functionality is integrate_ai_sdk.api.eda. This module includes a core object called EdaResults, which contains the results of an EDA session.

If you are comfortable working with the integrate.ai SDK and API, see the API Documentation for details.

Configure an EDA Intersect Session

This example uses AWS Batch to run the session using data in S3 buckets. Ensure that you have completed the Environment Setup and that you have the correct roles and policies for AWS Batch and Fargate. See Using AWS Batch with integrate.ai for details.

To begin exploratory data analysis in Intersect mode, you must first create a session, the same as you would for training a model.

To configure the session, specify the following:

The eda_data_config specifies the names of the datasets used to generate the intersection in the PRL session, in the format dataset_name: columns. If columns is an empty list ([]), EDA Intersect is performed on all columns (see the examples below).

Exclude any columns that were used as identifiers during the PRL sessions.

The eda_config specifies the parameters for the EDA process, such as the strategy.

You must also specify the session ID of a successful PRL session using the datasets listed in the eda_data_config.

Example:

eda_data_config = {"passive_client": [], "active_client": []}
eda_config = {"strategy": {"name": "EDAHistogram", "params": {}}}
prl_session_id = "<PRL_SESSION_ID>"
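
To run EDA Intersect on a subset of columns instead of all of them, list the column names for each dataset rather than passing empty lists. For example (a sketch; the column names here are illustrative):

# Sketch with illustrative column names: restrict EDA Intersect to specific columns
eda_data_config = {"passive_client": ["x1", "x5"], "active_client": ["x0", "x2"]}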

To find the correlation (or apply any other binary operation) between two specific columns, specify those columns as paired columns so that a 2D histogram is calculated for them.

To set which pairs you are interested in, specify the column names in a dictionary with the same structure as eda_data_config.

For example: {"passive_client": ['x1', 'x5'], "active_client": ['x0', 'x2']}

will generate 2D histograms for these pairs of columns:

(x0, x1), (x0, x5), (x2, x1), (x2, x5), (x0, x2), (x1, x5)

For the examples that follow, the paired columns are defined as:

paired_cols = {"active_client": ['x0', 'x2'], "passive_client": ['x1']}

Create and start an EDA Intersect session

The following code sample demonstrates creating and starting an EDA session to perform privacy-preserving data analysis on the intersection of two distinct datasets.

It returns an EDA session ID that you can use to track and reference your session.

eda_session = client.create_eda_session(
    name="EDA Intersect Session",
    description="Testing EDA Intersect mode through a notebook",
    data_config=eda_data_config,
    eda_mode="intersect",  #Generates histograms on an overlap of two distinct datasets
    prl_session_id=prl_session_id,
    hide_intersection="true",    #Optional - Whether to apply CKKS encryption when generating the output. Defaults to True. 
    paired_columns=paired_cols,
).start()

eda_session.id

For more information, see the create_eda_session() definition in the API documentation.

Start an EDA Intersect session using AWS Batch

The data paths and client names must match the information used in the PRL session.

# Example data paths in s3 
active_train_path = 's3://<path to dataset>/active_train.csv'
passive_train_path = 's3://<path to dataset>/passive_train.csv'
active_test_path = 's3://<path to dataset>/active_test.csv'
passive_test_path = 's3://<path to dataset>/passive_test.csv'

# Specify the AWS parameters
job_queue='batch-job-queue'
job_def='batch-job-def'

Specify your AWS credentials if you are generating temporary ones. Otherwise, use the default profile credentials.

import os

aws_creds = {
    "ACCESS_KEY": os.environ.get("AWS_ACCESS_KEY_ID"),
    "SECRET_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    "SESSION_TOKEN": os.environ.get("AWS_SESSION_TOKEN"),
    "REGION": os.environ.get("AWS_REGION"),
}

Set up the task builder and task group.

Import the taskbuilder and taskgroup from the SDK.

from integrate_ai_sdk.taskgroup.taskbuilder import aws as taskbuilder_aws
from integrate_ai_sdk.taskgroup.base import SessionTaskGroup

Specify the batch information to create the task builder object.

tb = taskbuilder_aws.batch( 
    job_queue=job_queue,
    aws_credentials=aws_creds,
    cpu_job_definition=job_def)

Create a task in the task group for each client. The number of tasks in the task group must match the number of clients specified in the data_config used to create the session.

The dataset_name specified in the task must be identical to the client_name specified in the PRL session.

task_group_context = (
    SessionTaskGroup(eda_session)
    .add_task(tb.eda(dataset_path=passive_train_path, dataset_name="passive_client", vcpus='2', memory='16384', client=client))
    .add_task(tb.eda(dataset_path=active_train_path, dataset_name="active_client", vcpus='2', memory='16384', client=client))
    .start()
)

The vcpus and memory parameters are optional overrides for the job definition.

Monitor submitted jobs

Each task in the task group kicks off a job in AWS Batch. You can monitor the jobs through the console or the SDK.

The following code returns the session ID that is included in the job name.

# session available in group context after submission
print(task_group_context.session.id)

Next, check the status of the tasks.

# status of tasks submitted
task_group_status = task_group_context.status()
for task_status in task_group_status:
    print(task_status)

Submitted tasks are in the pending state until the clients join and the session is started. Once started, the status changes to running.
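
If you prefer to poll the task statuses yourself rather than blocking on wait() (shown below), a simple loop over status() is enough; this is only a sketch:

import time

# Sketch: print the task statuses every 30 seconds for up to ~5 minutes
for _ in range(10):
    for task_status in task_group_context.status():
        print(task_status)
    time.sleep(30)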

# Use to monitor if a session has completed successfully or has failed
# You can modify the time to wait as per your specific task
task_group_context.wait(300)

When the session completes successfully, "True" is returned. Otherwise, an error message appears.
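
For example, a minimal sketch (assuming wait() returns a boolean indicating success) that moves on to the results only once the session has finished:

# Sketch: proceed to the results only once wait() reports success
if task_group_context.wait(300):
    results = eda_session.results()
else:
    print("EDA session did not complete successfully")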

Analyse the results

To retrieve the results of the session:

results = eda_session.results()

Example output:

EDA Session collection of datasets: ['active_client', 'passive_client']

Describe

You can use the .describe() function to review the results.

results.describe()

Example output:

Statistics

For categorical columns, you can use other statistics for further exploration, such as unique_count, mode, and uniques.

results["active_client"][["x10", "x2"]].uniques()

Example output:
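
Assuming unique_count and mode follow the same calling pattern as uniques() (a sketch, not confirmed against the API reference):

# Sketch: assumes the same calling pattern as .uniques()
results["active_client"][["x10", "x2"]].unique_count()
results["active_client"][["x10", "x2"]].mode()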

Mean

You can call functions such as .mean(), .median(), and .std() individually.

results["active_client"].mean() 

Example output:
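
The other statistics named above follow the same pattern as .mean():

results["active_client"].median()
results["active_client"].std()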

Histograms

You can create histogram plots using the .plot_hist() function.

saved_dataset_one_hist_plots = results["active_client"].plot_hist()

single_hist = results["active_client"]["x10"].plot_hist()

2D Histograms

You can also plot 2D histograms of the specified paired columns.
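
The 2D histogram, correlation, and arithmetic examples below use active_client and passive_client as shorthand handles for the per-dataset results. These are not defined elsewhere in this guide; a minimal sketch of the assumed definitions is:

# Assumption: shorthand handles for the per-dataset results used in the examples below
active_client = results["active_client"]
passive_client = results["passive_client"]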

fig = results.plot_hist(active_client['x0'], passive_client['x1'])

Example output:

Correlation

You can perform binary calculations on columns specified in paired_columns, such as finding the correlation.

results.corr(active_client['x0'], passive_client['x1'])

Example output:

Addition, subtraction, division

# Addition example. Change the operator to try subtraction, division, etc.
op_res = active_client['x0'] + passive_client['x1']
fig = op_res.plot_hist()

Example output:

GroupBy

You can group by a column from one dataset and aggregate a column from the other. For example:

groupby_result = results.groupby(active_client['x0'])[passive_client['x5']].mean()
print(groupby_result)
