
How much data should you allocate to training and validation?


Note: Jupyter notebook here

Context

When I was working at Mash on application credit scoring models, my manager asked me the following question:

  • Manager: “How did you split the dataset?”
  • Me: “80% training and 20% validation”
  • Manager: “Why not 75/25? Or 50/50? Maybe 50% for training is already sufficient and we will have more data for testing.”
  • Me: realising that I have no better explanation than “Andrew Ng says something like that in his ML course”.

My manager’s question was very pertinent and, given I didn’t have a good answer, it triggered an interesting quest for it. Is there a more scientific way to figure out how to split the data?

Turns out there is.

Note: you might also want to check this fantastic paper by Sebastian Raschka. Hands down the best summary of how to perform model evaluation, selection, and comparison in ML. If, like me, you are sometimes confused about train, valid, and test sets, cross-validation, bootstrapping (to name a few), bookmark this one.

As stated here on StackOverflow:

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance.

https://stackoverflow.com/a/13623707

To reiterate, you want enough training data to reduce the variance of your parameter estimates, i.e. to fit a good model, but also enough testing data to check that your model is actually good, i.e. to measure performance with some confidence (low variance).

How big should the training set be?

But how much is “enough”, exactly?

StackOverflow to the rescue again.

Here is one idea to estimate the impact of the training set’s size on model performance:

  1. We split the entire dataset (let’s say 10k samples) in 2 chunks: 20% validation (2k) and 80% training (8k).
  2. We keep the validation set fixed.
  3. We draw (with replacement) a random 10% sample of the 8k points in the training set (800 points). We train a model on these 800 points and we measure performance (let’s say AUC) on the validation set.
  4. We repeat #3, N times. Each time training on a random selection (with replacement) of 800 data points and each time validating on the fixed validation set. The magic of bootstrapping.
  5. We repeat #3 and #4 with increasing percentages of the training set. 20% (1.6k points), 30% (2.4k points), …, 90% (7.2k points). Each time we resample with replacement. Each time we retrain N times and validate on the same validation set.
  6. We end up with 9 AUC distributions, one per training set size we explored (10%, 20%, …, 80%, 90%).
  7. For each distribution, we calculate the mean and the 2.5/97.5 percentiles. This lets us visualize how the average AUC (and its 95% confidence interval) changes as the training set gets bigger.

This procedure is illustrated in the next slide.

To recap, the question we are answering here is: what is the impact of the training set size on model performance?
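
In condensed form, steps 3–5 boil down to a loop like the sketch below: illustrative names only, assuming X and y are a pandas DataFrame/Series and using an XGBoost classifier as the example model (more on the model choice later). The full routines, with the extra bookkeeping needed for the chart and table, are at the bottom of the post.

import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def training_size_bootstrap(X, y, grid=np.arange(0.1, 1.0, 0.1), n_boot=30):
    # fixed 20% validation set, 80% training pool
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)
    aucs = []
    for perc in grid:
        n = int(len(X_train) * perc)  # e.g. 10% of the 8k training points = 800
        for _ in range(n_boot):
            # steps 3-4: resample n training points with replacement, retrain, score on the fixed validation set
            X_t = X_train.sample(n, replace=True)
            y_t = y_train.loc[X_t.index]
            m = xgb.XGBClassifier(learning_rate=0.03, n_estimators=300)
            m.fit(X_t, y_t)
            aucs.append((perc, n, roc_auc_score(y_valid, m.predict_proba(X_valid)[:, 1])))
    return aucs  # one AUC distribution (n_boot values) per training set size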

Let’s try to apply the above strategy to a real use case. I chose the adult dataset from UCI. The task is to predict whether a person earns more or less than 50k USD/year based on a bunch of personal information such as education, marital_status, occupation, race, sex, etc. I provide the Jupyter notebook to reproduce the full analysis here. I am also attaching the set of relevant Python routines implementing the bootstrapping logic at the bottom of the post.
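
The notebook contains the exact preprocessing; as a rough sketch of what is needed before calling the routine (assuming the OpenML copy of the dataset via scikit-learn’s fetch_openml, which may differ from what the notebook actually does):

import pandas as pd
from sklearn.datasets import fetch_openml

# fetch the UCI Adult dataset from OpenML (the notebook may load/clean it differently)
adult = fetch_openml("adult", version=2, as_frame=True)

# one-hot encode the categorical features so XGBoost can consume them; binarize the target
X = pd.get_dummies(adult.data)
y = (adult.target == ">50K").astype(int)

With X and y in place, we can run the routine: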

df, _ = estimate_impact_size(what="train", X=X, y=y, grid = np.arange(0.1, 1, 0.1), reps=range(30))

The `estimate_impact_size` function runs the approach we just described and produces the following chart and table. A couple of comments and clarifications:

  • the y axis shows AUC (we are running a binary classification here): average and 95% confidence interval.
  • the x axis shows the number of data points (and the corresponding % of the total) the models were trained on.
  • The table below the chart just displays in tabular format what the graph already shows.
  • Bootstraps is the number of training repetitions (that’s N in the procedure I went through above). So, to be clear, each AUC distribution is composed of 30 data points in this case.

Insight: As you can see, the bigger the training set, the higher the AUC and the lower the variance. This is exactly what we expect. An interesting aspect here is that the AUC curve is not plateauing: performance keeps increasing the more data we add. This is not always the case. The dataset I was working with at Mash showed something different: after a while, AUC stopped improving. After (say) 5k data points, it made no difference whether I was training on 6k or 20k. AUC didn’t change. That was super insightful. It meant I could train with a smaller dataset without compromising on performance, hence having faster iterations and more experiments overall. Also, it is a far better chart to show when someone asks, “why are you splitting your data 75/25?”.

Important: I hear you asking: wait, which model are we actually looking at here? I am using an XGBoost Classifier, xgb.XGBClassifier(learning_rate=0.03, n_estimators=300). It is rather obvious that the shape of the chart heavily depends on the model picked. A logistic regression and a neural network are going to behave very differently when we throw more data at them. So, the caveat here is that we have to choose the model first, at least roughly, and then run these data-related experiments. What I did in practice was to run `estimate_impact_size` with different baseline models (I need to parametrize that in the function, actually!), without tinkering with hyperparameters. E.g. I selected learning_rate=0.03, n_estimators=300 out of experience, not as a result of any specific data-driven process.
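
In the meantime, since the `train_model` helper shown at the bottom of the post already accepts the estimator as an argument, a quick (hypothetical) way to try a different baseline would be something like:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# quick baseline check reusing the train_model helper defined at the bottom of the post
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)
auc_logreg = train_model(X_train, y_train, X_valid, y_valid, m=LogisticRegression(max_iter=1000))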

How big should the validation set be?

We can apply more or less the same methodology (in reverse) to estimate the appropriate size of the validation set. Here’s how to do that:

  1. We split the entire dataset (let’s say 10k samples) in 2 chunks: 30% validation (3k) and 70% training (7k).
  2. We keep the training set fixed and we train a model on it. Just one model, trained once on 7k points.
  3. We draw (with replacement) a random 10% sample of the 3k points in the validation set (300 points). We measure the model’s performance (let’s say AUC) on this set.
  4. We repeat #3, N times. Each time calculating AUC on a random selection (with replacement) of 300 data points.
  5. We repeat #3 and #4 with increasing percentages of the validation set: 20% (600 points), 30% (900 points), …, 90% (2.7k points). Each time we resample with replacement. Each time we validate with the same model trained in #2.
  6. We end up with 9 AUC distributions, one per validation set size we explored (10%, 20%, …, 80%, 90%).
  7. For each distribution, we calculate the mean and the 2.5/97.5 percentiles. This lets us visualize how the average AUC (and its 95% confidence interval) changes as the validation set gets bigger.

This strategy is illustrated in the next slide.
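
The main difference from the previous procedure is that we train only once and bootstrap the predictions, so there is no retraining inside the loop. A condensed sketch (illustrative names again, same assumptions as before):

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def validation_size_bootstrap(X, y, grid=np.arange(0.1, 1.0, 0.1), n_boot=50):
    # fixed 70% training set, 30% validation pool
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=123)
    m = xgb.XGBClassifier(learning_rate=0.03, n_estimators=300)
    m.fit(X_train, y_train)  # one model, trained once (step 2)
    preds = pd.DataFrame({"actual": y_valid, "pred": m.predict_proba(X_valid)[:, 1]})
    aucs = []
    for perc in grid:
        n = int(len(X_valid) * perc)
        for _ in range(n_boot):
            # steps 3-4: resample n validation points with replacement and rescore the same model
            sample = preds.sample(n, replace=True)
            aucs.append((perc, n, roc_auc_score(sample.actual, sample.pred)))
    return aucs  # one AUC distribution (n_boot values) per validation set size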

Let’s run the procedure on the Adult dataset.

df, _ = estimate_impact_size(what="test", X=X, y=y, grid = np.arange(0.1, 1, 0.1), reps=range(50))

The results are displayed in the chart and table below.

What do we learn from them? As the validation set increases in size:

  1. AUC doesn’t fluctuate much, being stable around 0.92.
  2. The 95% confidence interval shrinks.

Insight: Both behaviors are expected. The more data we validate on, the lower the variance in our performance metrics. What is interesting is that the confidence interval stops shrinking after a while. It seems that upon hitting 2.1k data points, we don’t gain any further benefit. This is important to know. It means we don’t need the additional 900 data points we had in the full validation set (3k in total). We could add those points to the training set instead, given the previous analysis suggested that the bigger it is, the higher the AUC.

Important: the same caveat I pointed out before, about which model to use, is valid here as well. For consistency, I am going with an XGBoost Classifier.

Conclusion

We could set 2.1k data points aside for the validation set. Ideally, we’d need the same for a test set. The rest can be allocated to the training set. The more data ends up there, the better, but we don’t have much of a choice if we want to measure model performance reliably.
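
In code, the final split could look something like the following sketch (the 2.1k figure is specific to this dataset and model, so treat the numbers as an illustration rather than a rule):

from sklearn.model_selection import train_test_split

# hold out ~2.1k points for validation and ~2.1k for the test set; everything else goes to training
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=2 * 2100, random_state=123)
X_valid, X_test, y_valid, y_test = train_test_split(X_hold, y_hold, test_size=0.5, random_state=123)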

I hope I kept my promise about being able to answer the “how should I split my dataset” question. At least my manager at Mash was way happier with this response compared to the one involving Andrew Ng.

Python implementation

Not the cleanest code ever, I have to admit, but it’ll do for showcasing purposes.

# Imports needed by the routines below.
import time
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost as xgb
from IPython.display import display
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def percentile(n):
    # Named percentile aggregator, so that pandas' groupby/agg shows a readable column name.
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

def train_model(X_train, y_train, X_valid, y_valid, m=xgb.XGBClassifier(learning_rate=0.03, n_estimators=300, n_jobs=-1, use_label_encoder=False)):
    # Fits the estimator `m` (XGBoost by default; pass `m` explicitly to try another baseline)
    # and returns the AUC of its predictions on the validation set.
    m.fit(X_train, y_train)
    probs_valid = m.predict_proba(X_valid)[:,1]
    return roc_auc_score(y_valid, probs_valid)

def estimate_valid_size_df(X, y, grid=np.arange(0.1, 1.1, 0.1), reps=range(30), verbose=False):
    # Trains one model on a fixed 70% training set, then bootstraps AUC on validation subsets of increasing size.
    valid_aucs = []

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=123)
    if verbose: print(f"Training on fixed {len(X_train)} points (70% total). Max validation size (30% total): {len(X_valid)}")

    m=xgb.XGBClassifier(learning_rate=0.03, n_estimators=300, n_jobs=-1, use_label_encoder=False)
    m.fit(X_train, y_train)
    probs_valid = m.predict_proba(X_valid)[:,1]
    valid = pd.DataFrame({'actual': y_valid, 'pred': probs_valid})

    for perc in grid:
        n = int(len(X_valid)*perc)
        if perc==1.0:
            auc = roc_auc_score(y_valid, probs_valid)
            valid_aucs.append((perc, n, auc, len(X_valid), len(X_train), 1))

        if perc<1.0:
            for _ in reps:
                val = valid.sample(n, replace=True)
                auc = roc_auc_score(val.actual, val.pred)
                valid_aucs.append((perc, n, auc, len(val), len(X_train), len(reps)))
    
    df = pd.DataFrame(valid_aucs, columns=['Percentage', 'Sample', 'AUC', 'Valid_size', 'Train_size', 'Bootstraps'])
    return df

def estimate_train_size_df(X, y, grid=np.arange(0.1, 1.1, 0.1), reps=range(30), verbose=False):
    # Retrains the model on training subsets of increasing size, scoring each on a fixed 20% validation set.
    since = time.time()
    train_aucs = []

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)
    if verbose: print(f"Validating on fixed {len(X_valid)} points (20% total). Max training size (80% total): {len(X_train)}")

    for perc in grid:
        n = int(len(X_train)*perc)
        if perc==1.0:
            auc = train_model(X_train, y_train, X_valid, y_valid)
            train_aucs.append((perc, n, auc, len(X_valid), len(X_train), 1))
            if verbose: print(f"Training once on {n} data points: {perc*100}% of {len(X_train)}...")
        
        if perc<1.0:
            if verbose: print(f"Training {len(reps)} times on {n} data points: {np.round(perc*100,1)}% of {len(X_train)}...")
            for _ in reps:
                X_t = X_train.sample(n, replace=True)  # resample with replacement, as in the procedure above
                y_t = y_train.loc[X_t.index]
                auc = train_model(X_t, y_t, X_valid, y_valid)
                train_aucs.append((perc, n, auc, len(X_valid), len(X_t), len(reps)))
    time_elapsed = (time.time() - since)
    df = pd.DataFrame(train_aucs, columns=['Percentage', 'Sample', 'AUC', 'Valid_size', 'Train_size', 'Bootstraps'])
    print("Done in {:.0f}m {:.0f}s".format(time_elapsed // 60, time_elapsed % 60))
    return df

def aggregate_size_df(df):
    df["Perc-Sample"] = (df.Percentage*100).astype(int).astype(str) + "%-" + df.Sample.astype(str)
    df = df.groupby('Perc-Sample').agg(Sample=('Sample', 'min'),
                                    Valid_size=('Valid_size','min'),
                                    Train_size=('Train_size','min'),
                                    Bootstraps=('Bootstraps', 'min'),
                                    AUC_mean=('AUC', 'mean'),
                                    AUC_std=('AUC', 'std'),
                                    AUC_975=('AUC', percentile(97.5)),
                                    AUC_025=('AUC', percentile(2.5))
                                    )
    df["975VSmean_%"] = (df.AUC_975/df.AUC_mean-1) * 100
    df["025VSmean_%"] = (df.AUC_025/df.AUC_mean-1) * 100
    df.sort_values(by='Sample', inplace=True)
    return df

def plot_size_df(df, title=None, plot_std=False):
    _, _ = plt.subplots(figsize=(9, 7))
    plt.plot(df.index, df.AUC_mean, 'k', label="Mean AUC")
    if plot_std: plt.fill_between(df.index, df.AUC_mean - 2 * df.AUC_std, df.AUC_mean + 2 * df.AUC_std, color='b', alpha=0.2, label="2std (95%) AUC interval")
    plt.fill_between(df.index, df.AUC_025, df.AUC_975, color='g', alpha=0.2, label="2.5-97.5 (95%) AUC quantiles")
    plt.ylabel('AUC')
    plt.xlabel('%dataset - #samples')
    if title is not None: plt.title(title)

    for x,y in zip(df.index,df.AUC_mean):
        label = "{:.3f}".format(y)
        plt.annotate(label, # this is the text
                    (x,y), # this is the point to label
                    textcoords="offset points", # how to position the text
                    xytext=(0,10), # distance from text to points (x,y)
                    ha='center') # horizontal alignment can be left, right or center

    plt.legend(loc="lower right")
    plt.xticks(rotation=30)
    plt.show()
    display(df.round(3))

def estimate_impact_size(what: str,
                         X: pd.DataFrame,
                         y: pd.Series,
                         grid: np.ndarray = np.arange(0.1, 1.1, 0.1),
                         reps: range = range(30),
                         verbose: bool = False) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Estimates the impact of the training (or validation) set size on model performance.

    Parameters
    ----------
    what: str
        `train` or `test`. Whether to estimate the impact of the size of the 
        training or test set.
    
    X : pd.DataFrame
        The dataframe containing our whole dataset, ready to be fed to an estimator.

    y: pd.Series
        The ground truth labels

    grid: np.ndarray (default=np.arange(0.1, 1.1, 0.1))
        Array of percentages of the training or validation set to explore.

    reps: range (default=range(30))
        Number of times the training/validation process is repeated at each percentage
        level, i.e. the number of bootstrap repetitions.
    
    verbose: bool (default=False)
        Whether to print relevant info while running
    """
    if what == 'test': original = estimate_valid_size_df(X, y, grid=grid, reps=reps, verbose=verbose)
    elif what == 'train': original = estimate_train_size_df(X, y, grid=grid, reps=reps, verbose=verbose)
    else: raise ValueError(f"`what` accepts `test` or `train` only: {what} was provided instead.")

    df = aggregate_size_df(original)
    if what == 'test': title = f"AUC on validation set of increasing size (up to 30% total - {df.Valid_size.max()} points) \n at fixed training set size (@70% total - {df.Train_size.max()} points)"
    else: title = f"AUC on validation set (fixed @20% total - {df.Valid_size.min()} points) \n at increasing training set size (up to 80% total - {df.Train_size.max()} points)"
    
    plot_size_df(df, title)
    return df, original
