When training a machine learning model, class imbalance is a common problem where one class (the majority class) has significantly more samples than another (the minority class). This can cause the model to be biased towards the majority class and perform poorly on the minority class.
SMOTE (Synthetic Minority Over-sampling Technique)[^1] is a simple method to combat this. Instead of simply duplicating existing data, SMOTE creates new, synthetic samples in the feature space that are “close” to the existing minority class samples.
SMOTE works by “drawing lines” between existing minority samples and generating new points along those lines. This helps the classification algorithm to learn better decision boundaries. The process for creating each synthetic sample is as follows:
- Randomly pick a data point from the minority class.
- Identify its k nearest neighbors that also belong to the minority class.
- Randomly select one of those neighbors.
- Generate a new data point at a random location on the line segment connecting the original sample and its chosen neighbor.
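The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name `smote`, its parameters, and its defaults are all hypothetical.

```python
import numpy as np

def smote(X_min, k=5, n_new=100, seed=None):
    """Minimal SMOTE sketch (illustrative names and defaults).

    X_min is assumed to be an (n_samples, n_features) array holding
    only the minority-class points.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Squared pairwise distances between minority points.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]  # k nearest minority neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                # 1. random minority point
        b = neighbors[a, rng.integers(k)]  # 2-3. one of its k neighbors
        lam = rng.random()                 # 4. random spot on the line segment
        synthetic[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return synthetic
```

For real projects, a maintained implementation such as `SMOTE` from the `imbalanced-learn` package is the usual choice; the sketch above only mirrors the conceptual steps.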
This method can be problematic for certain data distributions because it always creates new samples along straight lines. On the “Circle” dataset, you’ll see many new data points appear in the empty middle of the circle. This happens when SMOTE connects two points from opposite sides, generating a new sample along the line that cuts through the center. Increase k for a stronger effect. With the “Two Clusters” dataset, a similar issue occurs: if a point in one cluster selects a neighbor from the other cluster, the new synthetic point will be placed in the empty space between them, incorrectly bridging the gap.
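The circle artifact is easy to reproduce numerically. The snippet below (a self-contained sketch with illustrative values, not taken from any particular demo) places minority points on a radius-1 ring and runs SMOTE-style interpolation with k equal to all other points, so opposite-side pairs are eligible neighbors; some synthetic samples then land near the empty center.

```python
import numpy as np

# Minority class: points evenly spaced on a radius-1 circle.
n, k = 24, 23  # with k = n - 1, every other point counts as a neighbor
angles = np.linspace(0, 2 * np.pi, n, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])

rng = np.random.default_rng(0)
synthetic = []
for _ in range(300):
    a = rng.integers(n)
    b = (a + 1 + rng.integers(k)) % n  # uniform over the other n - 1 points
    lam = rng.random()
    synthetic.append(ring[a] + lam * (ring[b] - ring[a]))
synthetic = np.array(synthetic)

# Chords between far-apart points pass near the origin, so some
# synthetic samples end up well inside the ring.
radii = np.linalg.norm(synthetic, axis=1)
print(f"fraction of synthetic points with radius < 0.5: {(radii < 0.5).mean():.2f}")
```

Shrinking k restricts neighbors to adjacent points on the ring, which keeps the short chords close to the circle and makes the artifact mostly disappear.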
Footnotes

[^1]: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. *Journal of Artificial Intelligence Research*, 16, 321–357. arXiv:1106.1813.