Breast Cancer Classification - Random Forest

Hua Shi
6 min read · Dec 12, 2020



Introduction

Breast cancer is the most common cancer among women. Approximately 30% of cancers diagnosed in women are breast cancers, and more than 40,000 women die from breast cancer in the USA each year. To help doctors diagnose breast cancer, machine learning can be applied to predict whether a patient has breast cancer based on ten real-valued features of the cell nucleus.

The data is from Kaggle.com (the Wisconsin Diagnostic Breast Cancer dataset). Each record contains an id, a diagnosis (M = malignant, B = benign), and ten real-valued features of the cell nucleus: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. Each feature is recorded as a mean, a standard error (suffix _se), and a "worst" value, giving 30 feature columns in total.

EDA

Distribution

First of all, there are no missing values, so we can dive directly into the EDA part. The target is the "diagnosis" column, which has two unique values, "malignant" and "benign". We need to encode this column numerically: 1 means "malignant" and 0 means "benign".

# categorical column - diagnosis (M = malignant = 1, B = benign = 0)
df['target'] = df['diagnosis'].map({'M': 1, 'B': 0})

Apart from the "diagnosis" column, all other columns are numerical. The charts below show that the means and medians of most features are higher for malignant breast cancer than for benign breast cancer. There is no big difference between the class distributions of fractal_dimension_mean, texture_se, smoothness_se, and symmetry_se.
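A minimal sketch of how such per-class distribution plots can be drawn (it assumes df already contains the 0/1 target column created above):

import matplotlib.pyplot as plt
import seaborn as sns

# Per-class distribution of every numeric feature
features = df.drop(columns=['id', 'diagnosis', 'target']).columns
fig, axes = plt.subplots(6, 5, figsize=(20, 22))
for ax, col in zip(axes.ravel(), features):
    sns.kdeplot(df.loc[df.target == 0, col], ax=ax, label='benign (0)')
    sns.kdeplot(df.loc[df.target == 1, col], ax=ax, label='malignant (1)')
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()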

Target

The distribution of the target shows that those two classes are not balanced, so we can use the oversampling method in the data engineering part.
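A quick way to check the class balance (a small sketch, using the target column created above):

# Check how unbalanced the two classes are
print(df['target'].value_counts())
print(df['target'].value_counts(normalize=True))
sns.countplot(x='target', data=df)
plt.title('Diagnosis (0 = benign, 1 = malignant)')
plt.show()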

Features vs Target

Then we can check the features-vs-target charts to look for outliers and to compare the means of those features across the two classes.

Most features' means in class #1 (malignant) are higher than those in class #0 (benign), which means that when a patient takes a breast cancer test, the higher these feature values are, the more likely the cancer is malignant. For example, if a person's test result shows a radius_mean between 10 and 15, that person is more likely to have benign breast cancer.
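Box plots of each feature against the target make both the outliers and the class differences easy to see; here is one possible sketch (again assuming df with the 0/1 target column):

# Box plots of every feature split by class: outliers appear as points beyond
# the whiskers, and the boxes show how the class medians differ
features = df.drop(columns=['id', 'diagnosis', 'target']).columns
fig, axes = plt.subplots(6, 5, figsize=(20, 22))
for ax, col in zip(axes.ravel(), features):
    sns.boxplot(x='target', y=col, data=df, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()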

Data Engineering & Data Pre-processing

  • Train Test Split — In this part, we need to declare the inputs and the target. Then we can import train_test_split from sklearn.model_selection and split the data into 80% training and 20% test sets.
# declare inputs and target
inputs = df.drop(columns=['id', 'diagnosis', 'target'])
target = df.target
# Import the module for the split
from sklearn.model_selection import train_test_split
# Split the variables with an 80-20 split and some random state
# To have the same split as mine, use random_state = 365
x_train, x_test, y_train, y_test = train_test_split(inputs, target, test_size=0.2, random_state=365)
  • Data Standardization — In this part, we simply use StandardScaler to standardize our data.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
  • Feature selection — I select the features using the correlation matrix shown below.
sns.set(style="white")# Compute the correlation matrix
corr = x_train.corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(10, 10))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=2, center=0,
square=True, linewidths=.5, cbar_kws={"shrink":.5})
plt.show()

The more pink a square is, the stronger the correlation between the two features. Based on those correlation values, some columns will be dropped. Feature selection is important because it enables the machine learning algorithm to train faster, it reduces the complexity of the model and makes it easier to interpret, and it can improve the accuracy of the model if the right subset of features is chosen.

Now let us drop those columns with high correlation values —

corr_matrix = x_train.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find feature columns with a correlation greater than 0.9 to another column
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]
x_train.drop(columns=to_drop, inplace=True)
x_test.drop(columns=to_drop, inplace=True)
x_train.shape  # double-check the shape of our data
>>> (455, 20)  # There were 30 columns, now we have 20 columns
to_drop  # we can see those ten columns got dropped
>>> ['perimeter_mean', 'area_mean', 'concave points_mean', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'concave points_worst']
  • Upsampling Method — Since the two classes in the target are not balanced, we can oversample the minority class in the training data.
import pandas as pd
from sklearn.utils import resample

# concatenate our training data back together
training = pd.DataFrame()
training[x_train.columns] = x_train
training['target'] = list(y_train)
# separate minority and majority classes
MM = training[training.target == 1]  # minority (malignant)
BB = training[training.target == 0]  # majority (benign)
MM_upsampled = resample(MM,
                        replace=True,       # sample with replacement
                        n_samples=len(BB),  # match number in majority class
                        random_state=23)    # reproducible results
upsampled = pd.concat([BB, MM_upsampled])
upsampled.target.value_counts()
y_train = upsampled.target
x_train = upsampled.drop('target', axis=1)

Machine Learning Application

In this section, a Random Forest is applied to predict our target. First of all, we need to set up a baseline and then tune the hyperparameters of our model.
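One possible baseline (a sketch, not necessarily the exact baseline used originally) is a Random Forest with default hyperparameters, evaluated with 10-fold cross-validation on the upsampled training data:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Baseline: default hyperparameters, 10-fold cross-validated F1 score
baseline = RandomForestClassifier(random_state=666)
scores = cross_val_score(baseline, x_train, y_train, cv=10, scoring='f1')
print('Baseline CV f1: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))

With the baseline recorded, we can tune the hyperparameters with RandomizedSearchCV: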

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# candidate values for the hyperparameters
random_grid = {
    'n_estimators': list(range(220, 240)),
    'max_depth': list(range(1, 30, 2)),
    'max_features': list(range(1, 19)),
    'min_samples_leaf': list(range(1, 19)),
    'min_samples_split': list(range(2, 19))  # must be at least 2
}
# random forest model
rfc = RandomForestClassifier(random_state=666)
# randomized search with 10-fold cross-validation
RS = RandomizedSearchCV(rfc, random_grid, cv=10)
# fit the train data
RS.fit(x_train, y_train)
RS.best_params_, RS.best_score_
>>>
({'n_estimators': 222,
  'min_samples_split': 2,
  'min_samples_leaf': 3,
  'max_features': 12,
  'max_depth': 25},
 0.9810344827586208)  # the best score

Now we can train our model with those "best parameters"! Here I did not simply type in the numbers; instead I extract them with RS.best_params_["n_estimators"] and so on, so that I do not need to change the numbers every time I rerun the search.

rfc = RandomForestClassifier(
    n_estimators=RS.best_params_["n_estimators"],
    min_samples_split=RS.best_params_["min_samples_split"],
    min_samples_leaf=RS.best_params_["min_samples_leaf"],
    max_features=RS.best_params_["max_features"],
    max_depth=RS.best_params_["max_depth"],
    n_jobs=-1,
    random_state=42)
rfc.fit(x_train, y_train)
# let us take a look at the results
print('Accuracy score of train data: {}'.format(rfc.score(x_train, y_train)))
print('Accuracy score of test data: {}'.format(rfc.score(x_test, y_test)))
# obtain train data f1_Score
pre_train = rfc.predict(x_train)
print('Train data f1_Score: {}'.format(f1_score(y_train, pre_train, average='weighted')))
# obtain test data f1_Score
y_pred = rfc.predict(x_test)
print('Test data f1_Score: {}'.format(f1_score(y_test, y_pred, average='weighted')))
>>> Accuracy score of train data: 0.9758620689655172
Accuracy score of test data: 0.9210526315789473
Train data f1_Score: 0.9758574758574758
Test data f1_Score: 0.9206156885757346

It seems that our model works well: the F1 score on the test data is above 90%. If we look at the confusion matrix (sketched below), we can see that the model also works well for each class, which matters because we do not want high accuracy for only one class. (For more information and code details, please visit my Github!)
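A minimal sketch of how the confusion matrix can be computed and plotted from the predictions above:

from sklearn.metrics import confusion_matrix

# Confusion matrix on the test set: rows are the true class, columns the predicted class
cm = confusion_matrix(y_test, y_pred)
print(cm)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['benign (0)', 'malignant (1)'],
            yticklabels=['benign (0)', 'malignant (1)'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()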

Conclusion

This project is a good one for beginners to learn machine learning and classification. PCA (Principal Component Analysis) can also be used to observe trends, jumps, clusters, and outliers. When we do not know much about the features and are not sure whether we can drop some of them in the feature selection part, we can use PCA to uncover the relationships between observations and variables, and among the variables themselves.

If you want to learn more about PCA with this project, please visit my Github.
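For reference, here is a minimal PCA sketch (it assumes the inputs and target variables defined in the train-test-split step): standardize the 30 features and project them onto the first two principal components, coloring the points by diagnosis.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize all 30 features and project them onto the first two principal components
scaled = StandardScaler().fit_transform(inputs)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
plt.scatter(components[:, 0], components[:, 1], c=target, cmap='coolwarm', alpha=0.6)
plt.xlabel('PC1 ({:.1%} of variance)'.format(pca.explained_variance_ratio_[0]))
plt.ylabel('PC2 ({:.1%} of variance)'.format(pca.explained_variance_ratio_[1]))
plt.title('First two principal components, colored by diagnosis')
plt.show()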
