Upload files to "Tugas.Classification"

This commit is contained in:
202310715065 ANANDA DWI PRASETYO 2025-11-19 10:59:25 +07:00
parent e7e976ea6a
commit c195dd8fc6
4 changed files with 2993 additions and 0 deletions

@ -0,0 +1,781 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"text-align:center\">\n",
" <a href=\"https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01\" target=\"_blank\">\n",
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\">\n",
" </a>\n",
"</p>\n",
"\n",
"# Decision Trees\n",
"\n",
"Estimated time needed: **15** minutes\n",
"\n",
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"* Develop a classification model using Decision Tree Algorithm\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this lab exercise, you will learn a popular machine learning algorithm, Decision Trees. You will use this classification algorithm to build a model from the historical data of patients, and their response to different medications. Then you will use the trained decision tree to predict the class of an unknown patient, or to find a proper drug for a new patient.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Table of contents</h1>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ol>\n",
" <li><a href=\"#about_dataset\">About the dataset</a></li>\n",
" <li><a href=\"#downloading_data\">Downloading the Data</a></li>\n",
" <li><a href=\"#pre-processing\">Pre-processing</a></li>\n",
" <li><a href=\"#setting_up_tree\">Setting up the Decision Tree</a></li>\n",
" <li><a href=\"#modeling\">Modeling</a></li>\n",
" <li><a href=\"#prediction\">Prediction</a></li>\n",
" <li><a href=\"#evaluation\">Evaluation</a></li>\n",
" <li><a href=\"#visualization\">Visualization</a></li>\n",
" </ol>\n",
"</div>\n",
"<br>\n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the Following Libraries:\n",
"\n",
"<ul>\n",
" <li> <b>numpy (as np)</b> </li>\n",
" <li> <b>pandas</b> </li>\n",
" <li> <b>DecisionTreeClassifier</b> from <b>sklearn.tree</b> </li>\n",
"</ul>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you are using your own version, comment out the cell below.\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# Suppress warnings:\n",
"def warn(*args, **kwargs):\n",
"    pass\n",
"import warnings\n",
"warnings.warn = warn"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"import numpy as np \n",
"import pandas as pd\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"import sklearn.tree as tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"about_dataset\">\n",
" <h2>About the dataset</h2>\n",
" Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, and Drug Y. \n",
" <br>\n",
" <br>\n",
" Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure (BP), Cholesterol, and sodium-to-potassium ratio (Na_to_K) of the patients, and the target is the drug that each patient responded to.\n",
" <br>\n",
" <br>\n",
" This is an example of a multiclass classification problem. You can use the training part of the dataset \n",
" to build a decision tree, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"downloading_data\"> \n",
" <h2>Downloading the Data</h2>\n",
" To download the data, we will use the pandas library to read it directly into a dataframe from IBM Object Storage.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>BP</th>\n",
" <th>Cholesterol</th>\n",
" <th>Na_to_K</th>\n",
" <th>Drug</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>23</td>\n",
" <td>F</td>\n",
" <td>HIGH</td>\n",
" <td>HIGH</td>\n",
" <td>25.355</td>\n",
" <td>drugY</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>47</td>\n",
" <td>M</td>\n",
" <td>LOW</td>\n",
" <td>HIGH</td>\n",
" <td>13.093</td>\n",
" <td>drugC</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>47</td>\n",
" <td>M</td>\n",
" <td>LOW</td>\n",
" <td>HIGH</td>\n",
" <td>10.114</td>\n",
" <td>drugC</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>28</td>\n",
" <td>F</td>\n",
" <td>NORMAL</td>\n",
" <td>HIGH</td>\n",
" <td>7.798</td>\n",
" <td>drugX</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>61</td>\n",
" <td>F</td>\n",
" <td>LOW</td>\n",
" <td>HIGH</td>\n",
" <td>18.043</td>\n",
" <td>drugY</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age Sex BP Cholesterol Na_to_K Drug\n",
"0 23 F HIGH HIGH 25.355 drugY\n",
"1 47 M LOW HIGH 13.093 drugC\n",
"2 47 M LOW HIGH 10.114 drugC\n",
"3 28 F NORMAL HIGH 7.798 drugX\n",
"4 61 F LOW HIGH 18.043 drugY"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"my_data = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv', delimiter=\",\")\n",
"my_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"practice\"> \n",
" <h3>Practice</h3> \n",
" What is the size of data? \n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of data: (200, 6)\n",
"Number of rows: 200\n",
"Number of columns: 6\n"
]
}
],
"source": [
"# Display the shape of the data\n",
"print(\"Shape of data:\", my_data.shape)\n",
"\n",
"# Display the number of rows and columns separately\n",
"print(\"Number of rows:\", my_data.shape[0])\n",
"print(\"Number of columns:\", my_data.shape[1])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"my_data.shape\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div id=\"pre-processing\">\n",
" <h2>Pre-processing</h2>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using <b>my_data</b> as the drug200.csv data read by pandas, declare the following variables: <br>\n",
"\n",
"<ul>\n",
" <li> <b> X </b> as the <b> Feature Matrix </b> (data of my_data) </li>\n",
" <li> <b> y </b> as the <b> response vector </b> (target) </li>\n",
"</ul>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove the column containing the target name since it doesn't contain numeric values.\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[23, 'F', 'HIGH', 'HIGH', 25.355],\n",
" [47, 'M', 'LOW', 'HIGH', 13.093],\n",
" [47, 'M', 'LOW', 'HIGH', 10.114],\n",
" [28, 'F', 'NORMAL', 'HIGH', 7.798],\n",
" [61, 'F', 'LOW', 'HIGH', 18.043]], dtype=object)"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = my_data[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values\n",
"X[0:5]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you may have noticed, some features in this dataset are categorical, such as **Sex** or **BP**. Unfortunately, scikit-learn decision trees do not handle categorical variables directly. We can still convert these features to numerical values using **LabelEncoder**,\n",
"which maps each category to an integer code (it does not create dummy/indicator variables).\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[23, 0, 0, 0, 25.355],\n",
" [47, 1, 1, 0, 13.093],\n",
" [47, 1, 1, 0, 10.114],\n",
" [28, 0, 2, 0, 7.798],\n",
" [61, 0, 1, 0, 18.043]], dtype=object)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import preprocessing\n",
"le_sex = preprocessing.LabelEncoder()\n",
"le_sex.fit(['F','M'])\n",
"X[:,1] = le_sex.transform(X[:,1]) \n",
"\n",
"\n",
"le_BP = preprocessing.LabelEncoder()\n",
"le_BP.fit([ 'LOW', 'NORMAL', 'HIGH'])\n",
"X[:,2] = le_BP.transform(X[:,2])\n",
"\n",
"\n",
"le_Chol = preprocessing.LabelEncoder()\n",
"le_Chol.fit([ 'NORMAL', 'HIGH'])\n",
"X[:,3] = le_Chol.transform(X[:,3]) \n",
"\n",
"X[0:5]\n"
]
},
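{
"cell_type": "markdown",
"metadata": {},
"source": [
"The integer codes shown above follow the sorted order of the category labels rather than the order in which the labels were passed to `fit`. The following is a minimal sketch of that behaviour, assuming only that scikit-learn is installed; the name `le_demo` is introduced here for illustration and is not part of the lab.\n",
"\n",
"```python\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Hypothetical encoder, fitted on the same blood pressure categories as le_BP\n",
"le_demo = LabelEncoder()\n",
"le_demo.fit(['LOW', 'NORMAL', 'HIGH'])\n",
"\n",
"# classes_ holds the labels in sorted order; the index of a label is its code\n",
"print(le_demo.classes_)                              # ['HIGH' 'LOW' 'NORMAL']\n",
"print(le_demo.transform(['HIGH', 'LOW', 'NORMAL']))  # [0 1 2]\n",
"```\n",
"\n",
"Because a decision tree splits on thresholds, these codes impose an arbitrary order on the categories (HIGH, then LOW, then NORMAL), which is worth keeping in mind when reading the split conditions of the fitted tree.\n"
]
},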
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can fill the target variable.\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 drugY\n",
"1 drugC\n",
"2 drugC\n",
"3 drugX\n",
"4 drugY\n",
"Name: Drug, dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = my_data[\"Drug\"]\n",
"y[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"<div id=\"setting_up_tree\">\n",
" <h2>Setting up the Decision Tree</h2>\n",
" We will be using <b>train/test split</b> on our <b>decision tree</b>. Let's import <b>train_test_split</b> from <b>sklearn.model_selection</b>.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now <b> train_test_split </b> will return four arrays. We will name them:<br>\n",
"X_trainset, X_testset, y_trainset, y_testset <br> <br>\n",
"The <b> train_test_split </b> function will need the parameters: <br>\n",
"X, y, test_size=0.3, and random_state=3. <br> <br>\n",
"The <b>X</b> and <b>y</b> are the arrays to be split, the <b>test_size</b> is the fraction of the data held out for testing, and the <b>random_state</b> ensures that we obtain the same split on every run.\n"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)"
]
},
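{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of `test_size` and `random_state` can be checked on stand-in data. This is a minimal sketch rather than part of the original lab: `X_demo` and `y_demo` are hypothetical arrays with the same number of rows as the drug dataset.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Hypothetical stand-in data: 200 rows, 2 features\n",
"X_demo = np.arange(400).reshape(200, 2)\n",
"y_demo = np.arange(200)\n",
"\n",
"# Two calls with the same random_state\n",
"a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)\n",
"b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=3)\n",
"\n",
"print(a_train.shape, a_test.shape)     # (140, 2) (60, 2)\n",
"print(np.array_equal(a_test, b_test))  # True: the same rows are held out\n",
"```\n",
"\n",
"Changing `random_state` to another integer would select a different set of 60 test rows, so reported accuracy can vary slightly from split to split.\n"
]
},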
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Practice</h3>\n",
"Print the shape of X_trainset and y_trainset. Ensure that the dimensions match.\n"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of X training set (140, 5) & Size of Y training set (140,)\n"
]
}
],
"source": [
"print('Shape of X training set {}'.format(X_trainset.shape),'&',' Size of Y training set {}'.format(y_trainset.shape))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"print('Shape of X training set {}'.format(X_trainset.shape),'&',' Size of Y training set {}'.format(y_trainset.shape))\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Print the shape of X_testset and y_testset. Ensure that the dimensions match.\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of X test set (60, 5) & Size of y test set (60,)\n"
]
}
],
"source": [
"print('Shape of X test set {}'.format(X_testset.shape),'&','Size of y test set {}'.format(y_testset.shape))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"print('Shape of X test set {}'.format(X_testset.shape),'&','Size of y test set {}'.format(y_testset.shape))\n",
"\n",
"```\n",
"\n",
"</details>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"<div id=\"modeling\">\n",
" <h2>Modeling</h2>\n",
" We will first create an instance of the <b>DecisionTreeClassifier</b> called <b>drugTree</b>.<br>\n",
" Inside the classifier, specify <i> criterion=\"entropy\" </i> so that each split is chosen by information gain. We also set <i> max_depth=4 </i> to limit the depth of the tree.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(criterion='entropy', max_depth=4)"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drugTree = DecisionTreeClassifier(criterion=\"entropy\", max_depth=4)\n",
"drugTree # displays the classifier with the parameters set above"
]
},
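{
"cell_type": "markdown",
"metadata": {},
"source": [
"With `criterion='entropy'`, candidate splits are compared by information gain, that is, the entropy of the parent node minus the weighted entropy of its child nodes. The sketch below works through that arithmetic on made-up labels. The helper `entropy` and the label lists are illustrative assumptions, not scikit-learn internals.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def entropy(labels):\n",
"    # Shannon entropy (in bits) of a collection of class labels\n",
"    _, counts = np.unique(labels, return_counts=True)\n",
"    p = counts / counts.sum()\n",
"    return -np.sum(p * np.log2(p))\n",
"\n",
"# Hypothetical node with four patients per class, split perfectly in two\n",
"parent = ['drugX'] * 4 + ['drugY'] * 4\n",
"left, right = parent[:4], parent[4:]\n",
"\n",
"# Information gain = parent entropy - weighted average of child entropies\n",
"weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)\n",
"gain = entropy(parent) - weighted_children\n",
"\n",
"print(entropy(parent))  # 1.0 bit for a 50/50 node\n",
"print(gain)             # 1.0, since both children are pure\n",
"```\n",
"\n",
"A split that leaves the class mix unchanged in both children would give a gain of 0, so the tree prefers splits with higher gain.\n"
]
},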
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we will fit the data with the training feature matrix <b> X_trainset </b> and training response vector <b> y_trainset </b>\n"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(criterion='entropy', max_depth=4)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drugTree.fit(X_trainset,y_trainset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"<div id=\"prediction\">\n",
" <h2>Prediction</h2>\n",
" Let's make some <b>predictions</b> on the testing dataset and store them in a variable called <b>predTree</b>.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
"predTree = drugTree.predict(X_testset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can print out <b>predTree</b> and <b>y_testset</b> if you want to visually compare the predictions to the actual values.\n"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['drugY' 'drugX' 'drugX' 'drugX' 'drugX']\n",
"40 drugY\n",
"51 drugX\n",
"139 drugX\n",
"197 drugX\n",
"170 drugX\n",
"Name: Drug, dtype: object\n"
]
}
],
"source": [
"print(predTree[0:5])\n",
"print(y_testset[0:5])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"<div id=\"evaluation\">\n",
" <h2>Evaluation</h2>\n",
" Next, let's import <b>metrics</b> from sklearn and check the accuracy of our model.\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DecisionTrees's Accuracy: 0.9833333333333333\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"import matplotlib.pyplot as plt\n",
"print(\"DecisionTrees's Accuracy: \", metrics.accuracy_score(y_testset, predTree))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Accuracy classification score** computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.\n",
"\n",
"In multilabel classification, a sample counts as correct only if its entire set of predicted labels strictly matches the true set of labels. In a single-label multiclass problem such as this one, accuracy is simply the fraction of test samples whose predicted drug equals the actual drug.\n"
]
},
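{
"cell_type": "markdown",
"metadata": {},
"source": [
"For single-label predictions, the accuracy score can be reproduced by hand as the mean of element-wise matches. The sketch below uses made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the test set above).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"# Hypothetical labels: three of the four predictions are correct\n",
"y_true_demo = np.array(['drugY', 'drugX', 'drugX', 'drugC'])\n",
"y_pred_demo = np.array(['drugY', 'drugX', 'drugA', 'drugC'])\n",
"\n",
"print(np.mean(y_true_demo == y_pred_demo))      # 0.75\n",
"print(accuracy_score(y_true_demo, y_pred_demo)) # 0.75\n",
"```\n"
]
},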
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"\n",
"<div id=\"visualization\">\n",
" <h2>Visualization</h2>\n",
"\n",
"Let's visualize the tree\n",
"\n",
"</div>\n"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"# Notice: The next cell calls the Graphviz `dot` command. You might need to uncomment and run the lines below if pydotplus and graphviz are not installed yet\n",
"#!conda install -c conda-forge pydotplus -y\n",
"#!conda install -c conda-forge python-graphviz -y\n",
"\n",
"# After executing the code below, a file named 'tree.png' will be generated, which contains the decision tree image."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.tree import export_graphviz\n",
"export_graphviz(drugTree, out_file='tree.dot', filled=True, feature_names=['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K'])\n",
"!dot -Tpng tree.dot -o tree.png\n"
]
},
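{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the Graphviz `dot` command is not available, scikit-learn's `plot_tree` function can draw a fitted tree with matplotlib alone. The sketch below is an alternative to the export above, not the lab's method; it fits a small tree on made-up data, and `X_demo`, `y_demo`, `demo_tree`, and the output file name are assumptions for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.tree import DecisionTreeClassifier, plot_tree\n",
"\n",
"# Hypothetical data: two features (Age, Na_to_K) and two drug classes\n",
"X_demo = np.array([[20, 10.0], [30, 25.0], [40, 12.0], [50, 30.0]])\n",
"y_demo = np.array(['drugX', 'drugY', 'drugX', 'drugY'])\n",
"demo_tree = DecisionTreeClassifier(criterion='entropy', max_depth=4).fit(X_demo, y_demo)\n",
"\n",
"# Draw the tree with matplotlib and save it to an image file\n",
"plt.figure(figsize=(6, 4))\n",
"plot_tree(demo_tree, feature_names=['Age', 'Na_to_K'],\n",
"          class_names=list(demo_tree.classes_), filled=True)\n",
"plt.savefig('tree_demo.png')\n",
"```\n",
"\n",
"The same call should work with `drugTree` and the five feature names used in this lab.\n"
]
},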
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"## Author\n",
"\n",
"Saeed Aghabozorgi\n",
"\n",
"### Other Contributors\n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01\" target=\"_blank\">Joseph Santarcangelo</a>\n",
"\n",
"<a href=\"https://www.linkedin.com/in/richard-ye/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01\" target=\"_blank\">Richard Ye</a>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. </h3>\n",
" \n",
"<!--\n",
"## Change Log\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"| ----------------- | ------- | ---------- | ------------------------------------------------ |\n",
"| 2022-05-24 | 2.3 | Richard Ye | Fixed ability to work in JupyterLite and locally |\n",
"| 2020-11-20 | 2.2 | Lakshmi | Changed import statement of StringIO |\n",
"| 2020-11-03 | 2.1 | Lakshmi | Changed URL of the csv |\n",
"| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"-->\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"prev_pub_hash": "1228bf81fd1be0f6e7dda62256f4ffcb19b064217fc51f2e012abde9b84c2b0d"
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

@ -0,0 +1,705 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"text-align:center\">\n",
" <a href=\"https://skills.network\" target=\"_blank\">\n",
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\">\n",
" </a>\n",
"</p>\n",
"\n",
"\n",
"# Logistic Regression with Python\n",
"\n",
"\n",
"Estimated time needed: **25** minutes\n",
" \n",
"\n",
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"* Use scikit-learn Logistic Regression to classify data\n",
"* Understand the confusion matrix\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, you will learn about Logistic Regression and then create a model for a telecommunications company to predict when its customers will leave for a competitor, so that the company can take action to retain them.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Table of contents</h1>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ol>\n",
" <li><a href=\"#about_dataset\">About the dataset</a></li>\n",
" <li><a href=\"#preprocessing\">Data pre-processing and selection</a></li>\n",
" <li><a href=\"#modeling\">Modeling (Logistic Regression with Scikit-learn)</a></li>\n",
" <li><a href=\"#evaluation\">Evaluation</a></li>\n",
" <li><a href=\"#practice\">Practice</a></li>\n",
" </ol>\n",
"</div>\n",
"<br>\n",
"<hr>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a id=\"ref1\"></a>\n",
"## What is the difference between Linear and Logistic Regression?\n",
"\n",
"While Linear Regression is suited for estimating continuous values (e.g. estimating house price), it is not the best tool for predicting the class of an observed data point. In order to estimate the class of a data point, we need some sort of guidance on what would be the <b>most probable class</b> for that data point. For this, we use <b>Logistic Regression</b>.\n",
"\n",
"<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 20px\">\n",
"<font size = 3><strong>Recall linear regression:</strong></font>\n",
"<br>\n",
"<br>\n",
" As you know, <b>Linear regression</b> finds a function that relates a continuous dependent variable, <b>y</b>, to some predictors (independent variables $x_1$, $x_2$, etc.). For example, linear regression with several predictors assumes a function of the form:\n",
"<br><br>\n",
"$$\n",
"y = \\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots\n",
"$$\n",
"<br>\n",
"and finds the values of parameters $\\theta_0, \\theta_1, \\theta_2$, etc., where the term $\\theta_0$ is the \"intercept\". It can be generally shown as:\n",
"<br><br>\n",
"$$\n",
"h_\\theta(x) = \\theta^T X\n",
"$$\n",
"<p></p>\n",
"\n",
"</div>\n",
"\n",
"Logistic Regression is a variation of Linear Regression, used when the observed dependent variable, <b>y</b>, is categorical. It produces a formula that predicts the probability of the class label as a function of the independent variables.\n",
"\n",
"Logistic regression fits a special s-shaped curve by taking the linear regression function and transforming the numeric estimate into a probability with the following function, which is called the sigmoid function $\\sigma$:\n",
"\n",
"$$\n",
"h_\\theta(x) = \\sigma(\\theta^T X) = \\frac{e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots)}}{1 + e^{(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\cdots)}}\n",
"$$\n",
"Or:\n",
"$$\n",
"\\text{Probability of class 1} = P(Y=1 \\mid X) = \\sigma(\\theta^T X) = \\frac{e^{\\theta^T X}}{1 + e^{\\theta^T X}}\n",
"$$\n",
"\n",
"In this equation, $\\theta^T X$ is the regression result (the sum of the variables weighted by the coefficients), $e$ is the base of the natural exponential function, and $\\sigma(\\theta^T X)$ is the sigmoid or [logistic function](http://en.wikipedia.org/wiki/Logistic_function), also called the logistic curve. It has a common \"S\" shape (sigmoid curve).\n",
"\n",
"So, briefly, Logistic Regression passes the input through the logistic/sigmoid function and then treats the result as a probability:\n",
"\n",
"<img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/images/mod_ID_24_final.png\" width=\"400\" align=\"center\">\n",
"\n",
"\n",
"The objective of the __Logistic Regression__ algorithm is to find the best parameters $\\theta$ for $h_\\theta(x) = \\sigma(\\theta^T X)$, in such a way that the model best predicts the class of each case.\n"
]
},
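{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sigmoid formula can be evaluated directly with NumPy. The sketch below is only an illustration of the equation: the parameter vector `theta_demo` and the input `x_demo` are made-up values, with a leading 1 in `x_demo` standing in for the intercept term.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # 1 / (1 + e^(-z)) is algebraically the same as e^z / (1 + e^z)\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical parameters (intercept first) and one input row\n",
"theta_demo = np.array([-1.0, 0.5, 2.0])\n",
"x_demo = np.array([1.0, 2.0, 0.0])  # the leading 1 multiplies the intercept\n",
"\n",
"z = theta_demo @ x_demo  # -1.0 + 0.5 * 2.0 + 2.0 * 0.0 = 0.0\n",
"print(sigmoid(z))        # 0.5: a linear score of 0 maps to a probability of 0.5\n",
"print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # values near 0, exactly 0.5, near 1\n",
"```\n",
"\n",
"Large negative scores map close to 0 and large positive scores close to 1, which is what lets the output be read as P(Y=1|X).\n"
]
},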
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customer churn with Logistic Regression\n",
"A telecommunications company is concerned about the number of customers leaving their land-line business for cable competitors. They need to understand who is leaving. Imagine that you are an analyst at this company and you have to find out who is leaving and why.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install scikit-learn==0.23.1\n",
"!pip install scikit-learn\n",
"!pip install matplotlib\n",
"!pip install pandas \n",
"!pip install numpy \n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first import the required libraries:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pylab as pl\n",
"import numpy as np\n",
"import scipy.optimize as opt\n",
"from sklearn import preprocessing\n",
"%matplotlib inline \n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"about_dataset\">About the dataset</h2>\n",
"We will use a telecommunications dataset for predicting customer churn. This is a historical customer dataset where each row represents one customer. The data is relatively easy to understand, and you may uncover insights you can use immediately. Typically it is less expensive to keep customers than acquire new ones, so the focus of this analysis is to predict the customers who will stay with the company. \n",
"\n",
"\n",
"This data set provides information to help you predict what behavior will help you to retain customers. You can analyze all relevant customer data and develop focused customer retention programs.\n",
"\n",
"\n",
"\n",
"The dataset includes information about:\n",
"\n",
"- Customers who left within the last month (the column is called Churn)\n",
"- Services that each customer has signed up for: phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies\n",
"- Customer account information: how long they had been a customer, contract, payment method, paperless billing, monthly charges, and total charges\n",
"- Demographic information about customers: gender, age range, and whether they have partners and dependents\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load the Telco Churn data \n",
"Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and it records various demographic and service usage information. Before you can work with the data, you must use the URL to get the ChurnData.csv.\n",
"\n",
"To download the data, we will use `!wget` to download it from IBM Object Storage.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Click here and press Shift+Enter\n",
"!wget -O ChurnData.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data From CSV File \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"churn_df = pd.read_csv(\"ChurnData.csv\")\n",
"churn_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"preprocessing\">Data pre-processing and selection</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's select some features for the modeling. We also change the target data type to integer, as required by the scikit-learn algorithm:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless','churn']]\n",
"churn_df['churn'] = churn_df['churn'].astype('int')\n",
"churn_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Practice\n",
"How many rows and columns are in this dataset in total? What are the names of columns?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# write your code here\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"print(churn_df.shape)    # (number of rows, number of columns)\n",
"print(churn_df.columns)  # names of the columns\n",
"\n",
"```\n",
"\n",
"</details>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's define X and y for our dataset:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])\n",
"X[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = np.asarray(churn_df['churn'])\n",
"y[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also standardize the features so that each has zero mean and unit variance:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import preprocessing\n",
"X = preprocessing.StandardScaler().fit(X).transform(X)\n",
"X[0:5]"
]
},
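{
"cell_type": "markdown",
"metadata": {},
"source": [
"What `StandardScaler` does to each column can be reproduced by hand: subtract the column mean and divide by the column standard deviation. The sketch below checks this on a small made-up array (`X_demo` is hypothetical and not the churn data).\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Hypothetical feature matrix with two columns on different scales\n",
"X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])\n",
"\n",
"scaled = StandardScaler().fit_transform(X_demo)\n",
"\n",
"# Manual version: (x - column mean) / column standard deviation\n",
"manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)\n",
"\n",
"print(np.allclose(scaled, manual))  # True\n",
"print(scaled.mean(axis=0))          # approximately 0 for each column\n",
"print(scaled.std(axis=0))           # approximately 1 for each column\n",
"```\n",
"\n",
"A common refinement, not used in this lab, is to fit the scaler on the training rows only and then apply it to the test rows, so that no information from the test set influences the scaling.\n"
]
},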
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train/Test dataset\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We split our dataset into training and test sets:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)\n",
"print('Train set:', X_train.shape, y_train.shape)\n",
"print('Test set:', X_test.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"modeling\">Modeling (Logistic Regression with Scikit-learn)</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build our model using __LogisticRegression__ from the Scikit-learn package. This class implements logistic regression and can use different numerical optimizers to find parameters, including the newton-cg, lbfgs, liblinear, sag, and saga solvers. You can find extensive information about the pros and cons of these optimizers by searching the internet.\n",
"\n",
"The version of Logistic Regression in Scikit-learn supports regularization. Regularization is a technique used to reduce overfitting in machine learning models.\n",
"The __C__ parameter indicates the __inverse of regularization strength__, which must be a positive float. Smaller values specify stronger regularization. \n",
"Now let's fit our model with the training set:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import confusion_matrix\n",
"LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)\n",
"LR"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can predict using our test set:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yhat = LR.predict(X_test)\n",
"yhat"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"__predict_proba__ returns estimates for all classes, ordered by the label of classes. So, the first column is the probability of class 0, P(Y=0|X), and the second column is the probability of class 1, P(Y=1|X):\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yhat_prob = LR.predict_proba(X_test)\n",
"yhat_prob"
]
},
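{
"cell_type": "markdown",
"metadata": {},
"source": [
"The link between `predict_proba` and `predict` can be checked on synthetic data. This is a hedged sketch, not part of the lab: `X_demo`, `y_demo`, and `clf_demo` are made-up names, and the data are random numbers rather than the churn dataset.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Hypothetical binary problem: the label depends mostly on the first feature\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.randn(100, 3)\n",
"y_demo = (X_demo[:, 0] + 0.5 * rng.randn(100) > 0).astype(int)\n",
"\n",
"clf_demo = LogisticRegression(C=0.01, solver='liblinear').fit(X_demo, y_demo)\n",
"proba = clf_demo.predict_proba(X_demo)\n",
"\n",
"# Each row holds P(Y=0|X) and P(Y=1|X), so each row sums to 1\n",
"print(np.allclose(proba.sum(axis=1), 1.0))\n",
"\n",
"# The predicted class is expected to be the column with the larger probability\n",
"print(np.array_equal(clf_demo.predict(X_demo), clf_demo.classes_[proba.argmax(axis=1)]))\n",
"```\n",
"\n",
"In other words, for two classes `predict` behaves like thresholding P(Y=1|X) at 0.5.\n"
]
},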
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"evaluation\">Evaluation</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Jaccard index\n",
"Let's try the Jaccard index for accuracy evaluation. We can define the Jaccard index as the size of the intersection divided by the size of the union of the two label sets. If the set of predicted labels matches the set of true labels exactly, then the Jaccard index is 1.0; if the two sets have nothing in common, it is 0.0.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import jaccard_score\n",
"jaccard_score(y_test, yhat,pos_label=0)"
]
},
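{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a binary problem, the Jaccard score of a chosen label equals TP / (TP + FP + FN) counted for that label. The sketch below verifies this on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical), using `pos_label=0` as in the cell above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.metrics import jaccard_score\n",
"\n",
"# Hypothetical true and predicted labels\n",
"y_true_demo = np.array([0, 0, 0, 1, 1, 0])\n",
"y_pred_demo = np.array([0, 0, 1, 1, 0, 0])\n",
"\n",
"# Counts for label 0: intersection (tp) and the rest of the union (fp, fn)\n",
"tp = np.sum((y_true_demo == 0) & (y_pred_demo == 0))  # 3\n",
"fp = np.sum((y_true_demo != 0) & (y_pred_demo == 0))  # 1\n",
"fn = np.sum((y_true_demo == 0) & (y_pred_demo != 0))  # 1\n",
"\n",
"print(tp / (tp + fp + fn))                                  # 0.6\n",
"print(jaccard_score(y_true_demo, y_pred_demo, pos_label=0)) # 0.6\n",
"```\n"
]
},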
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confusion matrix\n",
"Another way of looking at the accuracy of the classifier is to look at the __confusion matrix__.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, confusion_matrix\n",
"import itertools\n",
"def plot_confusion_matrix(cm, classes,\n",
" normalize=False,\n",
" title='Confusion matrix',\n",
" cmap=plt.cm.Blues):\n",
" \"\"\"\n",
" This function prints and plots the confusion matrix.\n",
" Normalization can be applied by setting `normalize=True`.\n",
" \"\"\"\n",
" if normalize:\n",
" cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
" print(\"Normalized confusion matrix\")\n",
" else:\n",
" print('Confusion matrix, without normalization')\n",
"\n",
" print(cm)\n",
"\n",
" plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
" plt.title(title)\n",
" plt.colorbar()\n",
" tick_marks = np.arange(len(classes))\n",
" plt.xticks(tick_marks, classes, rotation=45)\n",
" plt.yticks(tick_marks, classes)\n",
"\n",
" fmt = '.2f' if normalize else 'd'\n",
" thresh = cm.max() / 2.\n",
" for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
" plt.text(j, i, format(cm[i, j], fmt),\n",
" horizontalalignment=\"center\",\n",
" color=\"white\" if cm[i, j] > thresh else \"black\")\n",
"\n",
" plt.tight_layout()\n",
" plt.ylabel('True label')\n",
" plt.xlabel('Predicted label')\n",
"print(confusion_matrix(y_test, yhat, labels=[1,0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute confusion matrix\n",
"cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])\n",
"np.set_printoptions(precision=2)\n",
"\n",
"\n",
"# Plot non-normalized confusion matrix\n",
"plt.figure()\n",
"plot_confusion_matrix(cnf_matrix, classes=['churn=1','churn=0'],normalize= False, title='Confusion matrix')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at first row. The first row is for customers whose actual churn value in the test set is 1.\n",
"As you can calculate, out of 40 customers, the churn value of 15 of them is 1. \n",
"Out of these 15 cases, the classifier correctly predicted 6 of them as 1, and 9 of them as 0. \n",
"\n",
"This means, for 6 customers, the actual churn value was 1 in test set and classifier also correctly predicted those as 1. However, while the actual label of 9 customers was 1, the classifier predicted those as 0, which is not very good. We can consider it as the error of the model for first row.\n",
"\n",
"What about the customers with churn value 0? Lets look at the second row.\n",
"It looks like there were 25 customers whom their churn value were 0. \n",
"\n",
"\n",
"The classifier correctly predicted 24 of them as 0, and one of them wrongly as 1. So, it has done a good job in predicting the customers with churn value 0. A good thing about the confusion matrix is that it shows the models ability to correctly predict or separate the classes. In a specific case of the binary classifier, such as this example, we can interpret these numbers as the count of true positives, false positives, true negatives, and false negatives. \n"
]
},
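{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a worked sketch of what `normalize=True` would do with the counts quoted above (rows and columns ordered as churn=1, churn=0), each row is divided by its row total. This is an illustrative calculation from those quoted counts, not output produced by this notebook:\n",
"\n",
"$$\n",
"\\begin{pmatrix} 6 & 9 \\\\ 1 & 24 \\end{pmatrix} \\rightarrow \\begin{pmatrix} 6/15 & 9/15 \\\\ 1/25 & 24/25 \\end{pmatrix} = \\begin{pmatrix} 0.40 & 0.60 \\\\ 0.04 & 0.96 \\end{pmatrix}\n",
"$$\n",
"\n",
"The diagonal of the row-normalized matrix is the recall of each class.\n"
]
},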
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print (classification_report(y_test, yhat))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the count of each section, we can calculate precision and recall of each label:\n",
"\n",
"\n",
"- __Precision__ is a measure of the accuracy provided that a class label has been predicted. It is defined by: precision = TP / (TP + FP)\n",
"\n",
"- __Recall__ is the true positive rate. It is defined as: Recall =  TP / (TP + FN)\n",
"\n",
" \n",
"So, we can calculate the precision and recall of each class.\n",
"\n",
"__F1 score:__\n",
"Now we are in the position to calculate the F1 scores for each label based on the precision and recall of that label. \n",
"\n",
"The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It is a good way to show that a classifer has a good value for both recall and precision.\n",
"\n",
"\n",
"Finally, we can tell the average accuracy for this classifier is the average of the F1-score for both labels, which is 0.72 in our case.\n"
]
},
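{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the counts quoted earlier for the confusion matrix and treating churn=1 as the positive class (TP = 6, FN = 9, FP = 1, TN = 24), the definitions above work out as follows. This is an illustrative calculation from those quoted counts, not output produced by this notebook:\n",
"\n",
"$$\n",
"\\text{Precision} = \\frac{6}{6 + 1} \\approx 0.86, \\qquad \\text{Recall} = \\frac{6}{6 + 9} = 0.40\n",
"$$\n",
"\n",
"$$\n",
"F_1 = \\frac{2 \\cdot \\text{Precision} \\cdot \\text{Recall}}{\\text{Precision} + \\text{Recall}} = \\frac{2 \\cdot 6}{2 \\cdot 6 + 1 + 9} \\approx 0.55\n",
"$$\n",
"\n",
"Repeating the calculation with churn=0 as the positive class (TP = 24, FP = 9, FN = 1) gives a precision of 24/33 ≈ 0.73, a recall of 24/25 = 0.96, and an F1 score of 48/58 ≈ 0.83. Weighting the two F1 scores by their class sizes (15 and 25 out of 40) gives about 0.72.\n"
]
},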
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### log loss\n",
"Now, let's try __log loss__ for evaluation. In logistic regression, the output can be the probability of customer churn is yes (or equals to 1). This probability is a value between 0 and 1.\n",
"Log loss( Logarithmic loss) measures the performance of a classifier where the predicted output is a probability value between 0 and 1. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import log_loss\n",
"log_loss(y_test, yhat_prob)"
]
},
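{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the binary log loss computed above can be written as follows, where $n$ is the number of test samples, $y_i \\in \\{0, 1\\}$ is the true label, and $\\hat{p}_i$ is the predicted probability of class 1 (notation introduced here for illustration):\n",
"\n",
"$$\n",
"\\text{LogLoss} = -\\frac{1}{n} \\sum_{i=1}^{n} \\left[ y_i \\log \\hat{p}_i + (1 - y_i) \\log (1 - \\hat{p}_i) \\right]\n",
"$$\n",
"\n",
"With the natural logarithm, a churner ($y_i = 1$) predicted with $\\hat{p}_i = 0.9$ contributes $-\\log 0.9 \\approx 0.105$, while the same churner predicted with $\\hat{p}_i = 0.1$ contributes $-\\log 0.1 \\approx 2.303$, so confident wrong predictions are penalized heavily.\n"
]
},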
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"practice\">Practice</h2>\n",
"Try to build Logistic Regression model again for the same dataset, but this time, use different __solver__ and __regularization__ values? What is new __logLoss__ value?\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Columns: Index(['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip',\n",
" 'callcard', 'wireless', 'longmon', 'tollmon', 'equipmon', 'cardmon',\n",
" 'wiremon', 'longten', 'tollten', 'cardten', 'voice', 'pager',\n",
" 'internet', 'callwait', 'confer', 'ebill', 'loglong', 'logtoll',\n",
" 'lninc', 'custcat', 'churn'],\n",
" dtype='object')\n",
"New LogLoss value: 0.5939226974155307\n"
]
}
],
"source": [
"# ======================================\n",
"# IMPORT LIBRARIES\n",
"# ======================================\n",
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import log_loss\n",
"\n",
"# ======================================\n",
"# LOAD DATASET\n",
"# ======================================\n",
"df = pd.read_csv(\"ChurnData.csv\")\n",
"\n",
"# Tampilkan kolom supaya yakin\n",
"print(\"Columns:\", df.columns)\n",
"\n",
"# ======================================\n",
"# PREPROCESSING\n",
"# ======================================\n",
"df['churn'] = df['churn'].astype(int)\n",
"\n",
"X = np.asarray(df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])\n",
"y = np.asarray(df['churn'])\n",
"\n",
"# Train/test split\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)\n",
"\n",
"# ======================================\n",
"# LOGISTIC REGRESSION\n",
"# ======================================\n",
"model = LogisticRegression(solver='liblinear', C=0.5, max_iter=2000)\n",
"model.fit(X_train, y_train)\n",
"\n",
"yhat_prob = model.predict_proba(X_test)\n",
"new_logloss = log_loss(y_test, yhat_prob)\n",
"\n",
"print(\"New LogLoss value:\", new_logloss)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"LR2 = LogisticRegression(C=0.01, solver='sag').fit(X_train,y_train)\n",
"yhat_prob2 = LR2.predict_proba(X_test)\n",
"print (\"LogLoss: : %.2f\" % log_loss(y_test, yhat_prob2))\n",
"\n",
"```\n",
"\n",
"</details>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"\n",
"## Author\n",
"\n",
"Saeed Aghabozorgi\n",
"\n",
"\n",
"### Other Contributors\n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n",
"\n",
"<!--\n",
"\n",
"## Change Log\n",
"\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"|---|---|---|---|\n",
"| 2021-01-21 | 2.2 | Lakshmi | Updated sklearn library|\n",
"| 2020-11-03 | 2.1 | Lakshmi | Updated URL of csv |\n",
"| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"| | | | |\n",
"| | | | |\n",
"\n",
"--!>\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
},
"prev_pub_hash": "93c3096a9aa003ffd3856deed45478e9e7e2e1d7091dd85d842ce88e28b0a595"
},
"nbformat": 4,
"nbformat_minor": 4
}

@ -0,0 +1,581 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p style=\"text-align:center\">\n",
" <a href=\"https://skills.network\" target=\"_blank\">\n",
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\">\n",
" </a>\n",
"</p>\n",
"\n",
"\n",
"# SVM (Support Vector Machines)\n",
"\n",
"\n",
"Estimated time needed: **15** minutes\n",
" \n",
"\n",
"## Objectives\n",
"\n",
"After completing this lab you will be able to:\n",
"\n",
"* Use scikit-learn to Support Vector Machine to classify\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, you will use SVM (Support Vector Machines) to build and train a model using human cell records, and classify cells to whether the samples are benign or malignant.\n",
"\n",
"SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. Following this, characteristics of new data can be used to predict the group to which a new record should belong.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Table of contents</h1>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <ol>\n",
" <li><a href=\"#load_dataset\">Load the Cancer data</a></li>\n",
" <li><a href=\"#modeling\">Modeling</a></li>\n",
" <li><a href=\"#evaluation\">Evaluation</a></li>\n",
" <li><a href=\"#practice\">Practice</a></li>\n",
" </ol>\n",
"</div>\n",
"<br>\n",
"<hr>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install scikit-learn\n",
"!pip install matplotlib\n",
"!pip install pandas \n",
"!pip install numpy \n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import pylab as pl\n",
"import numpy as np\n",
"import scipy.optimize as opt\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"%matplotlib inline \n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"load_dataset\">Load the Cancer data</h2>\n",
"The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007)[http://mlearn.ics.uci.edu/MLRepository.html]. The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n",
"\n",
"|Field name|Description|\n",
"|--- |--- |\n",
"|ID|Clump thickness|\n",
"|Clump|Clump thickness|\n",
"|UnifSize|Uniformity of cell size|\n",
"|UnifShape|Uniformity of cell shape|\n",
"|MargAdh|Marginal adhesion|\n",
"|SingEpiSize|Single epithelial cell size|\n",
"|BareNuc|Bare nuclei|\n",
"|BlandChrom|Bland chromatin|\n",
"|NormNucl|Normal nucleoli|\n",
"|Mit|Mitoses|\n",
"|Class|Benign or malignant|\n",
"\n",
"<br>\n",
"<br>\n",
"\n",
"For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. To download the data, we will use `!wget` to download it from IBM Object Storage. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Click here and press Shift+Enter\n",
"!wget -O cell_samples.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Data From CSV File \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_df = pd.read_csv(\"cell_samples.csv\")\n",
"cell_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n",
"\n",
"The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n",
"\n",
"Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ax = cell_df[cell_df['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');\n",
"cell_df[cell_df['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data pre-processing and selection\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first look at columns data types:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It looks like the __BareNuc__ column includes some values that are not numerical. We can drop those rows:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]\n",
"cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\n",
"cell_df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"feature_df = cell_df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]\n",
"X = np.asarray(feature_df)\n",
"X[0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want the model to predict the value of Class (that is, benign (=2) or malignant (=4)).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y = np.asarray(cell_df['Class'])\n",
"y [0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train/Test dataset\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We split our dataset into train and test set:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=4)\n",
"print ('Train set:', X_train.shape, y_train.shape)\n",
"print ('Test set:', X_test.shape, y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"modeling\">Modeling (SVM with Scikit-learn)</h2>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The SVM algorithm offers a choice of kernel functions for performing its processing. Basically, mapping data into a higher dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n",
"\n",
" 1.Linear\n",
" 2.Polynomial\n",
" 3.Radial basis function (RBF)\n",
" 4.Sigmoid\n",
"Each of these functions has its characteristics, its pros and cons, and its equation, but as there's no easy way of knowing which function performs best with any given dataset. We usually choose different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function) for this lab.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import svm\n",
"clf = svm.SVC(kernel='rbf')\n",
"clf.fit(X_train, y_train) "
]
},
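{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the four kernels listed above take the following forms in scikit-learn's `SVC`, where $\\gamma$, $r$, and $d$ correspond to the `gamma`, `coef0`, and `degree` parameters, and $x$ and $x'$ denote two feature vectors (notation introduced here for illustration):\n",
"\n",
"$$\n",
"K_{\\text{linear}}(x, x') = x^T x'\n",
"$$\n",
"\n",
"$$\n",
"K_{\\text{poly}}(x, x') = (\\gamma \\, x^T x' + r)^d\n",
"$$\n",
"\n",
"$$\n",
"K_{\\text{rbf}}(x, x') = \\exp\\left(-\\gamma \\, \\|x - x'\\|^2\\right)\n",
"$$\n",
"\n",
"$$\n",
"K_{\\text{sigmoid}}(x, x') = \\tanh(\\gamma \\, x^T x' + r)\n",
"$$\n",
"\n",
"The linear kernel leaves the features as they are, while the other three implicitly map them into a higher-dimensional space.\n"
]
},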
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After being fitted, the model can then be used to predict new values:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"yhat = clf.predict(X_test)\n",
"yhat [0:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"evaluation\">Evaluation</h2>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, confusion_matrix\n",
"import itertools"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def plot_confusion_matrix(cm, classes,\n",
" normalize=False,\n",
" title='Confusion matrix',\n",
" cmap=plt.cm.Blues):\n",
" \"\"\"\n",
" This function prints and plots the confusion matrix.\n",
" Normalization can be applied by setting `normalize=True`.\n",
" \"\"\"\n",
" if normalize:\n",
" cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]\n",
" print(\"Normalized confusion matrix\")\n",
" else:\n",
" print('Confusion matrix, without normalization')\n",
"\n",
" print(cm)\n",
"\n",
" plt.imshow(cm, interpolation='nearest', cmap=cmap)\n",
" plt.title(title)\n",
" plt.colorbar()\n",
" tick_marks = np.arange(len(classes))\n",
" plt.xticks(tick_marks, classes, rotation=45)\n",
" plt.yticks(tick_marks, classes)\n",
"\n",
" fmt = '.2f' if normalize else 'd'\n",
" thresh = cm.max() / 2.\n",
" for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
" plt.text(j, i, format(cm[i, j], fmt),\n",
" horizontalalignment=\"center\",\n",
" color=\"white\" if cm[i, j] > thresh else \"black\")\n",
"\n",
" plt.tight_layout()\n",
" plt.ylabel('True label')\n",
" plt.xlabel('Predicted label')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Compute confusion matrix\n",
"cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])\n",
"np.set_printoptions(precision=2)\n",
"\n",
"print (classification_report(y_test, yhat))\n",
"\n",
"# Plot non-normalized confusion matrix\n",
"plt.figure()\n",
"plot_confusion_matrix(cnf_matrix, classes=['Benign(2)','Malignant(4)'],normalize= False, title='Confusion matrix')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also easily use the __f1_score__ from sklearn library:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import f1_score\n",
"f1_score(y_test, yhat, average='weighted') "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try the jaccard index for accuracy:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import jaccard_score\n",
"jaccard_score(y_test, yhat,pos_label=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"practice\">Practice</h2>\n",
"Can you rebuild the model, but this time with a __linear__ kernel? You can use __kernel='linear'__ option, when you define the svm. How the accuracy changes with the new kernel function?\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Avg F1-score: 0.9639038982104676\n",
"Jaccard score: 0.9038461538461539\n",
"\n",
"Classification Report:\n",
"\n",
" precision recall f1-score support\n",
"\n",
" 0 1.00 0.94 0.97 90\n",
" 1 0.90 1.00 0.95 47\n",
"\n",
" micro avg 0.96 0.96 0.96 137\n",
" macro avg 0.95 0.97 0.96 137\n",
"weighted avg 0.97 0.96 0.96 137\n",
"\n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"from sklearn import preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import svm\n",
"from sklearn.metrics import classification_report, f1_score\n",
"\n",
"# Fungsi hitung Jaccard manual (untuk sklearn lama)\n",
"def jaccard_manual(y_true, y_pred):\n",
" intersection = np.logical_and(y_true == 1, y_pred == 1).sum()\n",
" union = np.logical_or(y_true == 1, y_pred == 1).sum()\n",
" return intersection / union if union != 0 else 0\n",
"\n",
"# ================================\n",
"# 1. Load Data\n",
"# ================================\n",
"cell_df = pd.read_csv(\"cell_samples.csv\")\n",
"\n",
"# ================================\n",
"# 2. Bersihkan Kolom BareNuc\n",
"# ================================\n",
"cell_df = cell_df[pd.to_numeric(cell_df['BareNuc'], errors='coerce').notnull()]\n",
"cell_df['BareNuc'] = cell_df['BareNuc'].astype('int')\n",
"\n",
"# ================================\n",
"# 3. Buat Feature dan Label\n",
"# ================================\n",
"feature_df = cell_df[['Clump','UnifSize','UnifShape','MargAdh',\n",
" 'SingEpiSize','BareNuc','BlandChrom','NormNucl','Mit']].astype(float)\n",
"\n",
"X = np.asarray(feature_df)\n",
"y = np.where(cell_df['Class'] == 2, 0, 1) # 0 = Benign, 1 = Malignant\n",
"\n",
"# ================================\n",
"# 4. Split Train/Test\n",
"# ================================\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)\n",
"\n",
"# ================================\n",
"# 5. Model SVM Kernel LINEAR\n",
"# ================================\n",
"model = svm.SVC(kernel='linear')\n",
"model.fit(X_train, y_train)\n",
"\n",
"# ================================\n",
"# 6. Prediksi\n",
"# ================================\n",
"y_pred = model.predict(X_test)\n",
"\n",
"# ================================\n",
"# 7. Evaluasi\n",
"# ================================\n",
"print(\"Avg F1-score:\", f1_score(y_test, y_pred, average='weighted'))\n",
"print(\"Jaccard score:\", jaccard_manual(y_test, y_pred))\n",
"print(\"\\nClassification Report:\\n\")\n",
"print(classification_report(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<details><summary>Click here for the solution</summary>\n",
"\n",
"```python\n",
"clf2 = svm.SVC(kernel='linear')\n",
"clf2.fit(X_train, y_train) \n",
"yhat2 = clf2.predict(X_test)\n",
"print(\"Avg F1-score: %.4f\" % f1_score(y_test, yhat2, average='weighted'))\n",
"print(\"Jaccard score: %.4f\" % jaccard_score(y_test, yhat2,pos_label=2))\n",
"\n",
"```\n",
"\n",
"</details>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Thank you for completing this lab!\n",
"\n",
"\n",
"## Author\n",
"\n",
"Saeed Aghabozorgi\n",
"\n",
"\n",
"### Other Contributors\n",
"\n",
"<a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>\n",
"\n",
"## <h3 align=\"center\"> © IBM Corporation 2020. All rights reserved. <h3/>\n",
"\n",
"<!--\n",
"## Change Log\n",
"\n",
"\n",
"| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n",
"|---|---|---|---|\n",
"| 2021-01-21 | 2.2 | Lakshmi | Updated sklearn library |\n",
"| 2020-11-03 | 2.1 | Lakshmi | Updated URL of csv |\n",
"| 2020-08-27 | 2.0 | Lavanya | Moved lab to course repo in GitLab |\n",
"| | | | |\n",
"| | | | |\n",
"--!>\n",
"\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python",
"language": "python",
"name": "conda-env-python-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
},
"prev_pub_hash": "33c7dcfb268d8bbcaef711e72c89e89dc7bc1929452f1913b971040b140900c5"
},
"nbformat": 4,
"nbformat_minor": 4
}