-->

## Friday, April 5, 2013

### When dealing with unbalanced datasets...

Oversampling under-represented categories (e.g. $|C_1|=10, |C_2|=80, |C_3|=20$) does not make sense from a theoretical standpoint if a) your cost function is symmetric (all classes are equally penalized) and b) the distribution of the training set coincides with the distribution of the test set. Oversampling is equivalent to re-weighting your loss function on the samples of the over-sampled categories:

$\sum_{i\in C} w_{c(i)}E(x_i) = w_1\sum_{i\in C_1}E(x_i) + w_2\sum_{i\in C_2}E(x_i) + w_3\sum_{i\in C_3}E(x_i)$

where $C = C_1 \cup C_2 \cup C_3$ and $c(i)$ is the category of sample $i$.
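As a quick numerical check of this equivalence, here is a minimal sketch in NumPy. The per-sample losses are random stand-ins for $E(x_i)$, and the replication factors (8, 1, 4) are an assumption chosen only so that the three categories end up the same size; the helper names are hypothetical.

```python
import numpy as np

def weighted_loss(per_sample_loss, labels, class_weights):
    """Sum of per-sample losses, each scaled by the weight of its category."""
    weights = np.array([class_weights[c] for c in labels])
    return float(np.sum(weights * per_sample_loss))

def oversampled_loss(per_sample_loss, labels, repeats):
    """Sum of losses after duplicating each sample repeats[category] times."""
    counts = np.array([repeats[c] for c in labels])
    return float(np.sum(np.repeat(per_sample_loss, counts)))

rng = np.random.default_rng(0)

# Category sizes from the example: |C_1| = 10, |C_2| = 80, |C_3| = 20.
labels = np.array([1] * 10 + [2] * 80 + [3] * 20)
losses = rng.random(labels.size)  # stand-in for E(x_i), not a real model

# Assumed integer factors: replicate C_1 8 times and C_3 4 times.
factors = {1: 8, 2: 1, 3: 4}

weighted = weighted_loss(losses, labels, factors)
oversampled = oversampled_loss(losses, labels, factors)
print(np.isclose(weighted, oversampled))
```

With integer factors the two sums are the same quantity computed two ways, so duplicating samples brings nothing that a weight $w_k$ in the loss does not already give you.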

The same effect can also be achieved by directly modifying your loss function. It won't improve the global performance over the three categories, but it will improve the precision (or recall) on $C_1$ and $C_3$. According to Cesa-Bianchi, errors come from the bias and variance of your data distribution: you can kill variance by adding large quantities of training data. The immediate consequence is that regularization won't be needed anymore, since variance would tend to zero. Bias can be killed by picking the correct function from the function family.
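To illustrate the variance claim, here is a rough sketch that refits the same model on many independently drawn training sets and measures how much its predictions fluctuate. Everything in it is an assumption made for illustration: the target function ($\sin(\pi x)$), the noise level, the cubic polynomial family and the sample sizes.

```python
import numpy as np

def prediction_variance(n_train, n_trials=200, degree=3, noise=0.3, seed=0):
    """Average variance of the fitted predictions across resampled training sets."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(-1.0, 1.0, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh training set from the same (assumed) distribution.
        x = rng.uniform(-1.0, 1.0, n_train)
        y = np.sin(np.pi * x) + rng.normal(0.0, noise, n_train)
        # Least-squares fit of a fixed polynomial family, no regularization.
        coeffs = np.polyfit(x, y, degree)
        preds[t] = np.polyval(coeffs, x_test)
    # Variance over trials at each test point, averaged over test points.
    return float(preds.var(axis=0).mean())

small = prediction_variance(20)
large = prediction_variance(2000)
print(f"variance with 20 samples:   {small:.6f}")
print(f"variance with 2000 samples: {large:.6f}")
```

The variance term shrinks as the training set grows, while the function family (here, cubic polynomials) stays the same, so whatever error remains at large sample sizes is due to bias and noise rather than variance.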