Dummy Variable trap? Not a problem!

Hua Shi
3 min read · Feb 20, 2020


When we get a dataset, most of the time we can see that it already contains dummy variables, or that it has categorical columns we need to transform into dummy variables. However, there are some points we need to understand and a trap we need to avoid.

Dummy Variable Trap

When we have a categorical column that only contains “yes” or “no” values, we don’t need to worry about it. When a column has more than two categories, we need to know how to avoid the Dummy Variable trap.

Definition: The Dummy Variable trap is a scenario in which the independent variables are multicollinear, meaning two or more variables are highly correlated; in simple terms, one variable can be predicted from the others.
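To make the trap concrete, here is a minimal sketch with a made-up categorical column (the data and column names are hypothetical, not from the article): once every category gets its own dummy column, any one column is an exact linear combination of the others, which is exactly the multicollinearity described above.

```python
import pandas as pd

# Made-up example: three color categories, one dummy column per category
colors = pd.Series(["red", "blue", "green", "red", "blue"])
dummies = pd.get_dummies(colors).astype(int)

# The dummy columns always sum to 1, so any one of them can be
# predicted perfectly from the others: red = 1 - blue - green.
# That perfect predictability is the dummy variable trap.
print((dummies["red"] == 1 - dummies["blue"] - dummies["green"]).all())  # True
```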

There are two different ways to avoid it.

First: Drop one dummy column

For example: there is a dataset called df, and we can see it has a column called “brands”. In this table, we want to know the relationship between car brand and price. We can use the pandas library to automatically generate dummy variables for the “brands” column.

After applying pd.get_dummies(), we can choose any one brand to drop; here I just drop the first one. If we don’t drop one category, there will be multicollinearity, and the VIF values for those dummy variables will become extremely large or go toward positive infinity.
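A minimal sketch of this step, assuming a toy df with a “brands” column and a “price” column (the numbers below are made up for illustration); drop_first=True tells pandas to drop the first category for us:

```python
import pandas as pd

# Made-up df: car brands and prices (in thousands)
df = pd.DataFrame({
    "brands": ["Acura", "Audi", "BMW", "Benz", "Jeep", "BMW", "Acura"],
    "price":  [30.0, 36.0, 30.5, 42.0, 27.0, 31.0, 29.5],
})

# drop_first=True drops the first category (Acura here), which then
# becomes the baseline that the intercept will represent
dummies = pd.get_dummies(df["brands"], drop_first=True)
print(dummies.columns.tolist())   # ['Audi', 'BMW', 'Benz', 'Jeep']

# To drop a specific brand instead, generate all dummies and drop that column:
# pd.get_dummies(df["brands"]).drop(columns=["BMW"])
```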

Our model is:

price = β0 + β1·Audi + β2·BMW + β3·Benz + β4·Jeep + error

where each brand variable is a dummy that equals 1 if the car is that brand and 0 otherwise, and Acura is the dropped baseline.

How do we interpret these coefficients?

β0 is the intercept. If the car is an Acura, then all the dummy variables in this model are zero, so price = β0 + error; in other words, the intercept is the mean price of Acura (plus the error).

If the brand is BMW, then BMW = 1 and the other dummies are 0, so price = β0 + β2 + error. From the regression result we can say the mean price of BMW is 0.5 thousand higher than that of Acura.

  • β0 measures the mean price of Acura (the category we dropped).
  • β1 measures the difference in mean price between Audi and Acura.
  • β2 measures the difference in mean price between BMW and Acura.
  • β3 measures the difference in mean price between Benz and Acura.
  • β4 measures the difference in mean price between Jeep and Acura.
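Here is a sketch of how these coefficients come out of a fitted model, using the statsmodels formula API and the same kind of made-up data as above (the numbers are illustrative, not the article’s actual regression output); the formula interface drops one level (Acura) automatically and adds the intercept:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data: each brand repeated four times, prices in thousands
df = pd.DataFrame({
    "brands": ["Acura", "Audi", "BMW", "Benz", "Jeep"] * 4,
    "price":  [30.0, 36.0, 30.5, 42.0, 27.0] * 4,
})

# One level (Acura) is dropped automatically and an intercept is added,
# so the fit matches the model written above.
model = smf.ols("price ~ C(brands)", data=df).fit()
print(model.params)
# Intercept         -> mean price of Acura (the baseline), here 30.0
# C(brands)[T.BMW]  -> mean price of BMW minus mean price of Acura, here 0.5
# and similarly for the other brands.
```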

Second: Remove the intercept!

Now we keep the dummy variables for all brands in the data. Then we run the regression without adding an intercept.

From the regression result we can see that there is no intercept, every brand has its own coefficient, and there is no baseline comparison in the model. Each coefficient is the mean price of its own brand.
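A quick sketch of this second approach with the same made-up data: keep all the brand dummies and fit without a constant, so each coefficient comes out as that brand’s mean price.

```python
import pandas as pd
import statsmodels.api as sm

# Same made-up data as before: each brand repeated four times
df = pd.DataFrame({
    "brands": ["Acura", "Audi", "BMW", "Benz", "Jeep"] * 4,
    "price":  [30.0, 36.0, 30.5, 42.0, 27.0] * 4,
})

# Keep ALL brand dummies and do NOT add a constant (no sm.add_constant):
# with no intercept there is no baseline, and each coefficient is simply
# the mean price of its brand.
X = pd.get_dummies(df["brands"]).astype(float)
model = sm.OLS(df["price"], X).fit()
print(model.params)
# Acura 30.0, Audi 36.0, BMW 30.5, Benz 42.0, Jeep 27.0  (brand means)
```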
