Utilities API¶
Data loading and helper utilities for fairness_training.
create_stratified_dataloaders¶
def create_stratified_dataloaders(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_val: Optional[np.ndarray] = None,
    y_val: Optional[np.ndarray] = None,
    X_test: Optional[np.ndarray] = None,
    y_test: Optional[np.ndarray] = None,
    protected_attr_idx: Union[int, List[int]] = 0,
    batch_size_train: int = 2000,
    batch_size_eval: int = 2000
) -> Tuple[DataLoader, ...]
Create DataLoaders that use stratified sampling, so every batch keeps the same protected-group ratios.
Use these loaders for training when strict fairness guarantees are required.
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `X_train` | `np.ndarray` | required | Training features |
| `y_train` | `np.ndarray` | required | Training targets |
| `X_val` | `np.ndarray` | `None` | Validation features |
| `y_val` | `np.ndarray` | `None` | Validation targets |
| `X_test` | `np.ndarray` | `None` | Test features |
| `y_test` | `np.ndarray` | `None` | Test targets |
| `protected_attr_idx` | `int` or `List[int]` | `0` | Protected attribute column(s) |
| `batch_size_train` | `int` | `2000` | Training batch size |
| `batch_size_eval` | `int` | `2000` | Evaluation batch size |
Returns¶
Tuple of stratified DataLoaders for each provided dataset.
Stratification Behavior¶
- Single attribute: maintains the proportions of 2 groups (A=0, A=1)
- Two attributes: maintains the proportions of 4 intersectional groups (00, 01, 10, 11)
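The grouping described above can be illustrated with a short NumPy sketch. The helper `intersectional_group_ids` is hypothetical and not part of fairness_training. It only shows how one or two binary columns map to 2 or 4 groups, and how the proportions that stratification aims to keep constant could be computed.

```python
import numpy as np

def intersectional_group_ids(X, protected_attr_idx):
    """Hypothetical helper (not part of fairness_training).

    Maps each row to a group id built from its binary protected columns:
    one column gives ids 0..1, two columns give ids 0..3 (00, 01, 10, 11).
    """
    ids = np.zeros(X.shape[0], dtype=np.int64)
    for col in np.atleast_1d(protected_attr_idx):
        # Treat the columns as binary digits, first column most significant.
        ids = ids * 2 + X[:, col].astype(np.int64)
    return ids

# Toy data: columns 0 and 1 are binary protected attributes.
X = np.array([
    [0, 0, 1.5],
    [0, 1, 0.3],
    [1, 0, 2.2],
    [1, 1, 0.9],
    [1, 1, 1.1],
    [0, 0, 0.7],
])

single = intersectional_group_ids(X, 0)
double = intersectional_group_ids(X, [0, 1])

# Group proportions over the whole dataset.
single_props = np.bincount(single, minlength=2) / len(X)
double_props = np.bincount(double, minlength=4) / len(X)
print(single_props)
print(double_props)
```

With two attributes, the group id is simply the two bits read as a binary number, so `10` becomes group 2 and `11` becomes group 3.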
Example¶
from fairness_training import create_stratified_dataloaders

# Single protected attribute
train_loader, val_loader = create_stratified_dataloaders(
    X_train, y_train,
    X_val, y_val,
    protected_attr_idx=0,
    batch_size_train=2000
)

# Two protected attributes
train_loader, val_loader, test_loader = create_stratified_dataloaders(
    X_train, y_train,
    X_val, y_val,
    X_test, y_test,
    protected_attr_idx=[0, 1],
    batch_size_train=2000,
    batch_size_eval=2000
)
Output Example¶
=== Creating stratified training batches ===
Original samples: 19600
Per-batch allocation: group 0=1400, group 1=600
Created 9 batches of size 2000
Total samples used: 18000
Warning: Dropping 1600 samples (1200 from group 0, 400 from group 1) to ensure complete batches
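The figures in this log are internally consistent, which the following sketch checks. The group sizes (13800 and 5800) are not printed in the log. They are inferred here by adding the used and dropped samples per group. How the library rounds the per-batch allocation is not shown in this document, so the allocation is taken from the log as given rather than recomputed.

```python
# Values read directly from the log above.
batch_size = 2000
per_batch = {0: 1400, 1: 600}
dropped = {0: 1200, 1: 400}
num_batches = 9

# Inferred group sizes (assumption: used + dropped = original, per group).
group_sizes = {g: num_batches * per_batch[g] + dropped[g] for g in per_batch}

# A group can fill only as many complete batches as its size allows,
# so the smallest quotient bounds the number of complete batches.
complete_batches = min(group_sizes[g] // per_batch[g] for g in per_batch)

total_used = complete_batches * batch_size
total_dropped = sum(group_sizes.values()) - total_used

print(group_sizes)        # {0: 13800, 1: 5800}
print(complete_batches)   # 9
print(total_used)         # 18000
print(total_dropped)      # 1600
```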
Notes¶
- Protected attributes must be binary (0/1 values)
- Samples that don't fit into complete batches are dropped with a warning
- Batches are pre-arranged (loader uses `shuffle=False`)
- Maximum 2 protected attributes supported
Errors¶
| Error | Cause | Solution |
|---|---|---|
| `ValueError: not binary` | Protected column has non-0/1 values | Binarize your protected attributes |
| `ValueError: must have samples in both groups` | One group is empty | Check that your data contains both groups |
| `ValueError: No complete batches` | Too few samples or batch too large | Reduce batch size or add data |
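The first two errors can be caught before calling `create_stratified_dataloaders` with a small pre-flight check. The function `check_protected_columns` below is a hypothetical helper, not part of fairness_training, and its messages are illustrative rather than the library's own. It does not cover the "No complete batches" case, because the exact batching rule is not documented here.

```python
import numpy as np

def check_protected_columns(X, protected_attr_idx):
    """Hypothetical pre-flight check (not part of fairness_training).

    Returns a list of human-readable problems; an empty list means the
    protected columns contain only 0/1 values and both groups are present.
    """
    problems = []
    for col in np.atleast_1d(protected_attr_idx):
        values = np.unique(X[:, col])
        if not np.isin(values, [0, 1]).all():
            problems.append(f"column {col}: non-binary values {values.tolist()}")
        elif len(values) < 2:
            problems.append(f"column {col}: only one group present {values.tolist()}")
    return problems

X_ok = np.array([[0, 1.2], [1, 0.4], [1, 3.1], [0, 2.2]])
X_bad = np.array([[0, 1.2], [2, 0.4], [1, 3.1], [0, 2.2]])
X_one_group = np.array([[1, 1.2], [1, 0.4], [1, 3.1]])

print(check_protected_columns(X_ok, 0))
print(check_protected_columns(X_bad, 0))
print(check_protected_columns(X_one_group, 0))
```

Returning a list instead of raising lets you report every problematic column at once before fixing the data.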
get_fairness_metric¶
def get_fairness_metric(
    metric: Union[str, FairnessMetric],
    num_protected_attrs: int = 1
) -> FairnessMetric
Factory function to get a fairness metric instance.
See Fairness Metrics API for details.
Data Preparation Tips¶
Protected Attribute Format¶
Protected attributes must be:
- Binary (0 or 1 values)
- In specific columns of your feature matrix
import numpy as np
# Ensure binary
gender = (df['gender'] == 'Male').astype(np.float32)
race = (df['race'] == 'White').astype(np.float32)
# Place in first columns
X = np.column_stack([gender, race, other_features])
# Now protected_attr_idx=[0, 1]
Handling Imbalanced Groups¶
If one group is much smaller:
- Use stratified batching (always recommended)
- Increase the batch size so that both groups are present in every batch
- Check minimum samples per group:
protected_attr_idx = [0, 1]  # columns holding the protected attributes
batch_size = 2000            # training batch size you plan to use
for attr_idx in protected_attr_idx:
    n_0 = (X[:, attr_idx] == 0).sum()
    n_1 = (X[:, attr_idx] == 1).sum()
    print(f"Attribute {attr_idx}: group 0 = {n_0}, group 1 = {n_1}")
    # Minimum per batch
    ratio = n_1 / (n_0 + n_1)
    min_per_batch = max(int(batch_size * ratio), 1)
    print(f"  Min group 1 per batch of {batch_size}: {min_per_batch}")
Train/Val/Test Split¶
Use stratified splitting to maintain group proportions:
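import numpy as np
from sklearn.model_selection import train_test_split
# Stratify by both target and protected attribute.
# Labels are built element-wise because `+` on NumPy string arrays
# is not supported by all NumPy versions.
stratify_col = np.array([f"{t}_{a}" for t, a in zip(y, X[:, 0])])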
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=stratify_col, random_state=42
)
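The snippet above produces only a train and test split. One common way to also obtain a validation set is to call `train_test_split` twice, stratifying each time on the same combined label. The sketch below uses synthetic data and an illustrative 60/20/20 ratio. Neither comes from fairness_training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic data: column 0 is a binary protected attribute.
X = np.column_stack([rng.integers(0, 2, size=n), rng.normal(size=(n, 3))])
y = rng.integers(0, 2, size=n)

# Combined label: one stratum per (target, protected attribute) pair.
strata = np.array([f"{int(t)}_{int(a)}" for t, a in zip(y, X[:, 0])])

# First split: hold out 20% as the test set.
X_rest, X_test, y_rest, y_test, strata_rest, _ = train_test_split(
    X, y, strata, test_size=0.2, stratify=strata, random_state=42
)

# Second split: 25% of the remaining 80% gives a 20% validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=strata_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `strata` through the first split keeps the labels aligned with `X_rest`, so the second split can stratify on them without rebuilding the array.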
See Also¶
- Training Guide - Using dataloaders for training
- FairTrainer - Training utilities