College Board Interview Question

How do you deal with imbalanced data?

Interview Answer

Anonymous

Sep 10, 2019

One strategy for handling a very rare class is to roughly balance the positives and negatives in your training data. For instance, suppose the College Board wanted to train a classifier that labels students as 'true genius' or 'ordinary', and 'true genius' occurs in only 0.1% of the population. If they randomly selected 2000 students from the population, they would expect only about 2 'true genius' examples, far too few for a model to learn from, so training would likely fail. Instead, they could build a training set with 1000 'true genius' and 1000 'ordinary' students. By enriching the training data with the rarer class, they stand a better chance of training a successful model.
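
Here is a minimal sketch of that idea in Python, using only the standard library. Everything in it is an assumption made for illustration: the helper name build_balanced_sample, the population of 1,000,000 students, and the label list itself are hypothetical. It compares a plain random draw of 2000 students with a draw of 1000 from each class.

```python
import random


def build_balanced_sample(labels, n_per_class, seed=0):
    """Return shuffled indices containing n_per_class examples of every class.

    Hypothetical helper: it samples without replacement within each class,
    so every class must have at least n_per_class members.
    """
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    chosen = []
    for label, indices in by_class.items():
        if len(indices) < n_per_class:
            raise ValueError(
                f"class {label!r} has only {len(indices)} examples, "
                f"need {n_per_class}"
            )
        chosen.extend(rng.sample(indices, n_per_class))

    # Shuffle so the classes are not grouped together in the result.
    rng.shuffle(chosen)
    return chosen


# Hypothetical population: 0.1% 'true genius', the rest 'ordinary'.
population_size = 1_000_000
n_genius = population_size // 1000
labels = ["true genius"] * n_genius + ["ordinary"] * (population_size - n_genius)

# Plain random draw of 2000 students: about 2 'true genius' expected on average.
naive = random.Random(0).sample(range(population_size), 2000)
naive_genius = sum(labels[i] == "true genius" for i in naive)
print("random draw, 'true genius' count:", naive_genius)

# Balanced draw: exactly 1000 from each class.
balanced = build_balanced_sample(labels, n_per_class=1000, seed=0)
balanced_genius = sum(labels[i] == "true genius" for i in balanced)
print("balanced draw, 'true genius' count:", balanced_genius)
```

One thing to keep in mind with this approach: a model trained on a 50/50 set sees the rare class far more often than it occurs in the population, so it is usually worth checking its performance on data that keeps the original 0.1% rate.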