Cancer detection is a popular example of an imbalanced classification problem because there are often significantly more cases of non-cancer than actual cancer.
A standard imbalanced classification dataset is the mammography dataset that involves detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram. This dataset was constructed by scanning the images, segmenting them into candidate objects, and using computer vision techniques to describe each candidate object.
It is a popular dataset for imbalanced classification because of the severe class imbalance, specifically where 98 percent of candidate microcalcifications are not cancer and only 2 percent were labeled as cancer by an experienced radiographer.
In this tutorial, you will discover how to develop and evaluate models for the imbalanced mammography cancer classification dataset.
After completing this tutorial, you will know:
- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with cost-sensitive techniques.
- How to fit a final model and use it to predict class labels for specific cases.
Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.
Let's get started.



Develop an Imbalanced Classification Model to Detect Microcalcifications
Photo by Bernard Spragg. NZ, some rights reserved.
Tutorial Overview
This tutorial is divided into five parts; they are:
- Mammography Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
  - Evaluate Machine Learning Algorithms
  - Evaluate Cost-Sensitive Algorithms
- Make Predictions on New Data
Mammography Dataset
In this project, we will use a standard imbalanced machine learning dataset referred to as the “mammography” dataset, or sometimes “Woods Mammography.”
The dataset is credited to Kevin Woods, et al. and the 1993 paper titled “Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography.”
The focus of the problem is on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.
The dataset started with 24 mammograms with a known cancer diagnosis that were scanned. The images were then pre-processed using image segmentation computer vision algorithms to extract candidate objects from the mammogram images. Once segmented, the objects were then manually labeled by an experienced radiologist.
A total of 29 features were extracted from the segmented objects thought to be most relevant to pattern recognition, which was reduced to 18, and then finally to seven, as follows (taken directly from the paper):
- Area of object (in pixels).
- Average gray level of the object.
- Gradient strength of the object's perimeter pixels.
- Root mean square noise fluctuation in the object.
- Contrast, average gray level of the object minus the average of a two-pixel wide border surrounding the object.
- A low order moment based on shape descriptor.
There are two classes and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.
- Non-microcalcifications: negative case, or majority class.
- Microcalcifications: positive case, or minority class.
A number of models were evaluated and compared in the original paper, such as neural networks, decision trees, and k-nearest neighbors. Models were evaluated using ROC Curves and compared using the area under the ROC Curve, or ROC AUC for short.
ROC Curves and area under ROC Curves were chosen with the intent to minimize the false-positive rate (complement of the specificity) and maximize the true-positive rate (sensitivity), the two axes of the ROC Curve. The use of ROC Curves also suggests the desire for a probabilistic model from which an operator can select a probability threshold as the cut-off between the acceptable false-positive and true-positive rates.
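As an aside, the sketch below shows how these quantities can be computed with scikit-learn. It is a minimal, self-contained illustration on a synthetic imbalanced problem (not the mammography dataset) and is not part of the original paper's workflow:

```python
# illustrative sketch: ROC AUC and candidate thresholds on synthetic data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# synthetic binary problem with a roughly 98/2 class split
X, y = make_classification(n_samples=1000, weights=[0.98], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
# fit a probabilistic classifier
model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)
# predicted probabilities for the positive (minority) class
probs = model.predict_proba(X_test)[:, 1]
# the two axes of the ROC Curve, plus the thresholds that produce them
fpr, tpr, thresholds = roc_curve(y_test, probs)
# area under the ROC Curve
print('ROC AUC: %.3f' % roc_auc_score(y_test, probs))
```

An operator could then scan the (fpr, tpr, threshold) triples to pick an acceptable operating cut-off.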
Their results suggested a “linear classifier” (seemingly a Gaussian Naive Bayes classifier) performed the best, with a ROC AUC of 0.936 averaged over 100 runs.
Next, let's take a closer look at the data.
Explore the Dataset
The Mammography dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.
One example is the popular SMOTE data oversampling technique.
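For reference, the sketch below shows what SMOTE oversampling looks like using the imbalanced-learn library on a synthetic imbalanced problem. It is illustrative only and assumes imbalanced-learn is installed; oversampling is not used in the remainder of this tutorial:

```python
# illustrative sketch: SMOTE oversampling with imbalanced-learn
# (not used in the rest of this tutorial)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# a small synthetic imbalanced dataset, roughly 98/2
X, y = make_classification(n_samples=1000, weights=[0.98], random_state=1)
print('Before:', Counter(y))
# create synthetic minority class examples until the classes are balanced
X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)
print('After:', Counter(y_res))
```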
A version of this dataset was made available that has some differences to the dataset described in the original paper.
First, download the dataset and save it in your current working directory with the name “mammography.csv”.
Review the contents of the file.
The first few lines of the file should look as follows:
```
0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223,'-1'
0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324,'-1'
0.54608818,0.13141457,-0.45638679,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.10298725,-0.3949941,-0.14081588,0.97970269,-0.37786573,1.0135658,'-1'
...
```
We can see that the dataset has six rather than the seven input variables. It is possible that the first input variable listed in the paper (area in pixels) was removed from this version of the dataset.
The input variables are numerical (real-valued) and the target variable is a string, with '-1' for the majority class and '1' for the minority class. These values will need to be encoded as 0 and 1 respectively to meet the expectations of classification algorithms on binary imbalanced classification problems.
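For example, here is a minimal sketch of how scikit-learn's LabelEncoder maps these string labels (note that the values in the file include the single quotes):

```python
# illustrative sketch: encoding the string class labels as integers
from sklearn.preprocessing import LabelEncoder

# labels as they appear in the file, quotes included
labels = ["'-1'", "'-1'", "'1'", "'-1'"]
# '-1' (majority) becomes 0 and '1' (minority) becomes 1
print(LabelEncoder().fit_transform(labels))
```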
The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.
```python
...
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
```
Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.
```python
...
# summarize the shape of the dataset
print(dataframe.shape)
```
We can also summarize the number of examples in each class using the Counter object.
```python
...
# summarize the class distribution
target = dataframe.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))
```
Tying this together, the complete example of loading and summarizing the dataset is listed below.
```python
# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))
```
Running the example first loads the dataset and confirms the number of rows and columns: 11,183 rows, with six input variables and one target variable.
The class distribution is then summarized, confirming the severe class imbalance with approximately 98 percent for the majority class (no cancer) and approximately 2 percent for the minority class (cancer).
```
(11183, 7)
Class='-1', Count=10923, Percentage=97.675%
Class='1', Count=260, Percentage=2.325%
```
The dataset appears to generally match the dataset described in the SMOTE paper, specifically in terms of the ratio of negative to positive examples.
A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels.
— SMOTE: Synthetic Minority Over-sampling Technique, 2002.
Also, the specific number of examples in the minority and majority classes matches the paper.
The experiments were conducted on the mammography dataset. There were 10923 examples in the majority class and 260 examples in the minority class originally.
— SMOTE: Synthetic Minority Over-sampling Technique, 2002.
I believe this is the same dataset, although I cannot explain the mismatch in the number of input features, e.g. six compared to seven in the original paper.
We can also take a look at the distribution of the six numerical input variables by creating a histogram for each.
The complete example is listed below.
```python
# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# histograms of all variables
df.hist()
pyplot.show()
```
Running the example creates a figure with one histogram subplot for each of the six numerical input variables in the dataset.
We can see that the variables have differing scales and that most of the variables have an exponential distribution, e.g. most cases falling into one bin, and the rest falling into a long tail. The final variable appears to have a bimodal distribution.
Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.
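As a quick sketch of this idea, the input variables could be power transformed and the histograms re-plotted; this is illustrative only, as the tutorial applies the transform properly within cross-validation in a later section:

```python
# illustrative sketch: power transform the inputs and re-plot histograms
from pandas import read_csv, DataFrame
from sklearn.preprocessing import PowerTransformer
from matplotlib import pyplot

# load the dataset
df = read_csv('mammography.csv', header=None)
# transform the six numerical input columns (the last column is the label)
values = PowerTransformer().fit_transform(df.values[:, :-1])
# histograms of the transformed variables
DataFrame(values).hist()
pyplot.show()
```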



Histogram Plots of the Numerical Input Variables for the Mammography Dataset
We can also create a scatter plot for each pair of input variables, called a scatter plot matrix.
This can be helpful to see if any variables relate to each other or change in the same direction, e.g. are correlated.
We can also color the dots of each scatter plot according to the class label. In this case, the majority class (no cancer) will be mapped to blue dots and the minority class (cancer) will be mapped to red dots.
The complete example is listed below.
```python
# create pairwise scatter plots of numeric input variables
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# define a mapping of class values to colors
color_dict = {"'-1'": 'blue', "'1'": 'red'}
# map each row to a color based on the class value
colors = [color_dict[str(x)] for x in df.values[:, -1]]
# pairwise scatter plots of all numerical variables
scatter_matrix(df, diagonal='kde', color=colors)
pyplot.show()
```
Running the example creates a figure showing the scatter plot matrix, with six plots by six plots, comparing each of the six numerical input variables with each other. The diagonal of the matrix shows the density distribution of each variable.
Each pairing appears twice, both above and below the top-left to bottom-right diagonal, providing two ways to review the same variable interactions.
We can see that the distributions for many variables do differ for the two class labels, suggesting that some reasonable discrimination between the cancer and no-cancer cases will be feasible.



Scatter Plot Matrix by Class for the Numerical Input Variables in the Mammography Dataset
Now that we have reviewed the dataset, let's look at developing a test harness for evaluating candidate models.
Model Test and Baseline Result
We will evaluate candidate models using repeated stratified k-fold cross-validation.
The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 11183/10, or about 1,118, examples.
Stratified means that each fold will contain the same mixture of examples by class, that is about 98 percent to 2 percent no-cancer to cancer objects. Repetition indicates that the evaluation process will be performed multiple times to help avoid fluke results and to better capture the variance of the chosen model. We will use three repeats.
This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.
This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
We will evaluate and compare models using the area under the ROC Curve, or ROC AUC, calculated via the roc_auc_score() function.
We can define a function to load the dataset and split the columns into input and output variables. We will correctly encode the class labels as 0 and 1. The load_dataset() function below implements this.
```python
# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y
```
We can then define a function that will evaluate a given model on the dataset and return a list of ROC AUC scores for each fold and repeat.
The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores.
```python
# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores
```
Finally, we can evaluate a baseline model on the dataset using this test harness.
A model that predicts a random class in proportion to the base rate of each class will result in a ROC AUC of 0.5, the baseline in performance on this dataset. This is a so-called “no skill” classifier.
This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “strategy” argument to 'stratified'.
```python
...
# define the reference model
model = DummyClassifier(strategy='stratified')
```
Once the model is evaluated, we can report the mean and standard deviation of the ROC AUC scores directly.
```python
...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))
```
Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.
```python
# test harness and baseline model evaluation
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='stratified')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))
```
Running the example first loads and summarizes the dataset.
We can see that we have the correct number of rows loaded, and that we have six computer vision derived input variables. Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.
Next, the mean of the ROC AUC scores is reported.
As expected, the no-skill classifier achieves the worst-case performance of a mean ROC AUC of approximately 0.5. This provides a baseline in performance, above which a model can be considered skillful on this dataset.
```
(11183, 6) (11183,) Counter({0: 10923, 1: 260})
Mean ROC AUC: 0.503 (0.016)
```
Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.
Evaluate Models
In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.
The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.
The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).
Can you do better? If you can achieve better ROC AUC performance using the same test harness, I'd love to hear about it. Let me know in the comments below.
Evaluate Machine Learning Algorithms
Let's start by evaluating a mixture of machine learning models on the dataset.
It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what does not.
We will evaluate the following machine learning models on the mammography dataset:
- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)
We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.
We will define each model in turn and add them to a list so that we can evaluate them sequentially. The get_models() function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.
```python
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=1000))
    names.append('GBM')
    return models, names
```
We can then enumerate the list of models in turn and evaluate each, reporting the mean ROC AUC and storing the scores for later plotting.
```python
...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
```
At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.
```python
...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```
Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the mammography dataset is listed below.
```python
# spot check machine learning algorithms on the mammography dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=1000))
    names.append('GBM')
    return models, names

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```
Running the example evaluates each algorithm in turn and reports the mean and standard deviation ROC AUC.
Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.
In this case, we can see that all of the tested algorithms have skill, achieving a ROC AUC above the default of 0.5.
The results suggest that the ensembles of decision tree algorithms perform better on this dataset, with perhaps Random Forest performing the best, with a ROC AUC of about 0.950.
It is interesting to note that this is better than the ROC AUC described in the paper of 0.93, although we used a different model evaluation procedure.
The evaluation was a little unfair to the LR and SVM algorithms as we did not scale the input variables prior to fitting the model. We can explore this in the next section.
```
>LR 0.919 (0.040)
>SVM 0.880 (0.049)
>BAG 0.941 (0.041)
>RF 0.950 (0.036)
>GBM 0.918 (0.037)
```
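As an aside, scaling for the scale-sensitive LR and SVM models could be addressed with a pipeline, so that the scaler is fit only on the training folds of each cross-validation split. The sketch below is illustrative only; the next section uses a power transform instead, which also normalizes scale:

```python
# illustrative sketch: scale inputs for scale-sensitive models via a pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# scaling happens inside each cross-validation fold when evaluated,
# e.g. scores = evaluate_model(X, y, pipeline)
pipeline = Pipeline(steps=[('s', MinMaxScaler()), ('m', SVC(gamma='scale'))])
```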
A figure is created showing one box and whisker plot for each algorithm's sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.
We can see that both BAG and RF have a tight distribution and a mean and median that closely align, perhaps suggesting a non-skewed and Gaussian distribution of scores, e.g. stable.



Box and Whisker Plot of Machine Learning Models on the Imbalanced Mammography Dataset
Now that we have a good first set of results, let's see if we can improve them with cost-sensitive classifiers.
Evaluate Cost-Sensitive Algorithms
Some machine learning algorithms can be adapted to pay more attention to one class than another when fitting the model.
These are referred to as cost-sensitive machine learning models, and they can be used for imbalanced classification by specifying a cost that is inversely proportional to the class distribution. For example, with a 98 percent to 2 percent distribution for the majority and minority classes, we can specify to give errors on the minority class a weighting of 98 and errors for the majority class a weighting of 2.
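For example, a weighting like this can be specified manually via a dictionary mapping class labels to weights; the exact values below are illustrative:

```python
# illustrative sketch: manually specified class weights, inversely
# proportional to a 98/2 class distribution (0=majority, 1=minority)
from sklearn.svm import SVC

# misclassifying a minority (cancer) example costs far more
weights = {0: 2, 1: 98}
model = SVC(gamma='scale', class_weight=weights)
```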
Three algorithms that offer this capability are:
- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Random Forest (RF)
This can be achieved in scikit-learn by setting the “class_weight” argument to “balanced” to make these algorithms cost-sensitive.
For example, the updated get_models() function below defines the cost-sensitive versions of these three algorithms to be evaluated on our dataset.
```python
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale', class_weight='balanced'))
    names.append('SVM')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    return models, names
```
Additionally, when exploring the dataset, we noticed that many of the variables had a seemingly exponential data distribution. Sometimes we can better spread the data for a variable by using a power transform on each variable. This may be particularly helpful to the LR and SVM algorithms and may also help the RF algorithm.
We can implement this within each fold of the cross-validation model evaluation process using a Pipeline. The first step will learn the PowerTransformer on the training set folds and apply it to the training and test set folds. The second step will be the model that we are evaluating. The pipeline can then be evaluated directly using our evaluate_model() function, for example:
```python
...
# define pipeline steps
steps = [('p', PowerTransformer()), ('m', models[i])]
# define pipeline
pipeline = Pipeline(steps=steps)
# evaluate the pipeline and store results
scores = evaluate_model(X, y, pipeline)
```
Tying this together, the complete example of evaluating power-transformed cost-sensitive machine learning algorithms on the mammography dataset is listed below.
```python
# cost-sensitive machine learning algorithms on the mammography dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale', class_weight='balanced'))
    names.append('SVM')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    return models, names

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # define pipeline steps
    steps = [('p', PowerTransformer()), ('m', models[i])]
    # define pipeline
    pipeline = Pipeline(steps=steps)
    # evaluate the pipeline and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```
Running the example evaluates each algorithm in turn and reports the mean and standard deviation ROC AUC.
Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.
In this case, we can see that all three of the tested algorithms achieved a lift in ROC AUC compared to their non-transformed and cost-insensitive versions. It would be interesting to repeat the experiment without the transform to see if it was the transform, the cost-sensitive versions of the algorithms, or both that resulted in the lifts in performance.
We can see that the SVM achieved the best performance, performing better than RF in this and the previous section and achieving a mean ROC AUC of about 0.957.
```
>LR 0.922 (0.036)
>SVM 0.957 (0.024)
>RF 0.951 (0.035)
```
Box and whisker plots are then created comparing the distribution of ROC AUC scores.
The SVM distribution appears compact compared to the other two models. As such, the performance is likely stable and it may make a good choice for a final model.



Box and Whisker Plots of Cost-Sensitive Machine Learning Models on the Imbalanced Mammography Dataset
Next, let's see how we might use a final model to make predictions on new data.
Make Predictions on New Data
In this section, we will fit a final model and use it to make predictions on single rows of data.
We will use the cost-sensitive version of the SVM model as the final model, with a power transform applied to the data prior to fitting the model and making a prediction. Using a pipeline will ensure that the transform is always performed correctly on input data.
First, we can define the model as a pipeline.
```python
...
# define model to evaluate
model = SVC(gamma='scale', class_weight='balanced')
# power transform then fit model
pipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])
```
Once defined, we can fit it on the entire training dataset.
```python
...
# fit the model
pipeline.fit(X, y)
```
Once fit, we can use the pipeline to make predictions for new data by calling the predict() function. This will return the class label of 0 for “no cancer”, or 1 for “cancer”.
For example:
```python
...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])
```
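Note that, as configured, the SVC model does not provide probability estimates (probability=True was not set), so if a score is needed for threshold tuning rather than a crisp label, the pipeline's decision_function() can be used instead. A sketch, assuming the fitted pipeline and row from above:

```python
...
# signed distance from the decision boundary; larger values favor class 1
score = pipeline.decision_function([row])
print(score)
```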
To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know whether the case is no cancer or cancer.
The complete example is listed below.
```python
# fit a model and make predictions for the mammography dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define model to evaluate
model = SVC(gamma='scale', class_weight='balanced')
# power transform then fit model
pipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some no cancer cases (known class 0)
print('No Cancer:')
data = [[0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223],
    [0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324],
    [-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 0)' % (label))
# evaluate on some cancer cases (known class 1)
print('Cancer:')
data = [[2.0158239,0.15353258,-0.32114211,2.1923706,-0.37786573,0.96176503],
    [2.3191888,0.72860087,-0.50146835,-0.85955255,-0.37786573,-0.94572324],
    [0.19224721,-0.2003556,-0.230979,1.2003796,2.2620867,1.132403]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 1)' % (label))
```
Running the example first fits the model on the entire training dataset.
Then the fit model is used to predict the label for no cancer cases chosen from the dataset file. We can see that all cases are correctly predicted.
Then some cases of actual cancer are used as input to the model and the label is predicted. As we might have hoped, the correct labels are predicted for all cases.
```
No Cancer:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
Cancer:
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
```
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Summary
In this tutorial, you discovered how to develop and evaluate models for the imbalanced mammography cancer classification dataset.
Specifically, you learned:
- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with cost-sensitive techniques.
- How to fit a final model and use it to predict class labels for specific cases.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.