\n",
- "\n",
- "\n",
- "# SVM (Support Vector Machines)\n",
- "\n",
- "\n",
- "Estimated time needed: **15** minutes\n",
- " \n",
- "\n",
- "## Objectives\n",
- "\n",
- "After completing this lab you will be able to:\n",
- "\n",
- "* Use scikit-learn's Support Vector Machine (SVM) implementation to classify data\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "In this notebook, you will use SVM (Support Vector Machines) to build and train a model on human cell records, and classify each sample as benign or malignant.\n",
- "\n",
- "SVM works by mapping data to a high-dimensional feature space so that data points can be categorized even when they are not linearly separable in the original space. The data is transformed in such a way that the separator between the categories can be drawn as a hyperplane. The characteristics of new data can then be used to predict the group to which a new record belongs.\n"
- ]
- },
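To make the idea of mapping data to a higher-dimensional space concrete, here is a minimal sketch in plain Python. The toy 1-D points, their labels, the feature map `phi(x) = (x, x**2)`, and the threshold `2.5` are all illustrative assumptions; they are not part of the lab's dataset or of scikit-learn.

```python
# Toy 1-D data (made up for illustration): class 1 sits between two groups
# of class 0, so no single threshold on x separates the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]


def phi(x):
    """Hypothetical feature map: lift a scalar x into 2-D as (x, x**2)."""
    return (x, x * x)


def separable_by_threshold(values, labels):
    """Return True if a single threshold on `values` splits the two classes."""
    sorted_labels = [label for _, label in sorted(zip(values, labels))]
    # The classes are separable iff the label changes at most once
    # as we walk along the sorted axis.
    changes = sum(1 for a, b in zip(sorted_labels, sorted_labels[1:]) if a != b)
    return changes <= 1


# In the original 1-D space, no threshold works.
separable_in_1d = separable_by_threshold(xs, ys)

# After the map, the second coordinate (x**2) separates the classes:
# the line x2 = 2.5 acts as a separating hyperplane in the lifted space.
lifted = [phi(x) for x in xs]
predictions = [1 if x2 < 2.5 else 0 for _, x2 in lifted]

print(separable_in_1d)    # False
print(predictions == ys)  # True
```

A real SVM does not need the map spelled out like this; the sketch only shows why a transformation can turn a non-separable problem into a separable one.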
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "The example is based on a dataset that is publicly available from the UCI Machine Learning Repository [Asuncion and Newman, 2007](http://mlearn.ics.uci.edu/MLRepository.html). The dataset consists of several hundred human cell sample records, each of which contains the values of a set of cell characteristics. The fields in each record are:\n",
- "\n",
- "|Field name|Description|\n",
- "|--- |--- |\n",
- "|ID|Patient identifier|\n",
- "|Clump|Clump thickness|\n",
- "|UnifSize|Uniformity of cell size|\n",
- "|UnifShape|Uniformity of cell shape|\n",
- "|MargAdh|Marginal adhesion|\n",
- "|SingEpiSize|Single epithelial cell size|\n",
- "|BareNuc|Bare nuclei|\n",
- "|BlandChrom|Bland chromatin|\n",
- "|NormNucl|Normal nucleoli|\n",
- "|Mit|Mitoses|\n",
- "|Class|Benign or malignant|\n",
- "\n",
- " \n",
- " \n",
- "\n",
- "For the purposes of this example, we're using a dataset that has a relatively small number of predictors in each record. To download the data from IBM Object Storage, we will use `!wget`.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "--2025-11-20 13:13:55-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv\n",
- "Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
- "Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
- "HTTP request sent, awaiting response... 200 OK\n",
- "Length: 19975 (20K) [text/csv]\n",
- "Saving to: ‘cell_samples.csv’\n",
- "\n",
- "cell_samples.csv 100%[===================>] 19.51K --.-KB/s in 0.001s \n",
- "\n",
- "2025-11-20 13:13:56 (31.7 MB/s) - ‘cell_samples.csv’ saved [19975/19975]\n",
- "\n"
- ]
- }
- ],
- "source": [
- "#Click here and press Shift+Enter\n",
- "!wget -O cell_samples.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/cell_samples.csv"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Load Data From CSV File \n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- " ID Clump UnifSize UnifShape MargAdh SingEpiSize BareNuc \\\n",
- "0 1000025 5 1 1 1 2 1 \n",
- "1 1002945 5 4 4 5 7 10 \n",
- "2 1015425 3 1 1 1 2 2 \n",
- "3 1016277 6 8 8 1 3 4 \n",
- "4 1017023 4 1 1 3 2 1 \n",
- "\n",
- " BlandChrom NormNucl Mit Class \n",
- "0 3 1 1 2 \n",
- "1 3 2 1 2 \n",
- "2 3 1 1 2 \n",
- "3 3 7 1 2 \n",
- "4 3 1 1 2 "
- ]
- },
- "execution_count": 4,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
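- "import pandas as pd  # in case it was not imported in an earlier cell\n",
- "# Load the downloaded CSV into a DataFrame and preview the first five rows\n",
- "cell_df = pd.read_csv(\"cell_samples.csv\")\n",
- "cell_df.head()"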
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The ID field contains the patient identifiers. The characteristics of the cell samples from each patient are contained in fields Clump to Mit. The values are graded from 1 to 10, with 1 being the closest to benign.\n",
- "\n",
- "The Class field contains the diagnosis, as confirmed by separate medical procedures, as to whether the samples are benign (value = 2) or malignant (value = 4).\n",
- "\n",
- "Let's look at the distribution of the classes based on Clump thickness and Uniformity of cell size:\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "",
- "text/plain": [
- "
\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The SVM algorithm offers a choice of kernel functions for performing its processing. Mapping data into a higher-dimensional space is called kernelling. The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:\n",
- "\n",
- "1. Linear\n",
- "2. Polynomial\n",
- "3. Radial basis function (RBF)\n",
- "4. Sigmoid\n",
- "\n",
- "Each of these functions has its own characteristics, pros and cons, and equation, but there is no easy way of knowing which function performs best with a given dataset, so we usually try different functions in turn and compare the results. Let's just use the default, RBF (Radial Basis Function), for this lab.\n"
- ]
- },
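As a rough illustration of the four kernel types listed above, the sketch below writes each one as a small plain-Python function, using the formulas as they are commonly defined. The helper names and the sample vectors `u` and `v` are hypothetical; the parameter names `degree`, `gamma`, and `coef0` are chosen to mirror the corresponding `SVC` parameters, but the default values here are arbitrary and do not reproduce scikit-learn's defaults.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # K(u, v) = <u, v>
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    # K(u, v) = (gamma * <u, v> + coef0) ** degree
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # K(u, v) = tanh(gamma * <u, v> + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)


# Made-up sample vectors for illustration only.
u = [1.0, 2.0]
v = [3.0, 4.0]

print(linear_kernel(u, v))                # 11.0
print(polynomial_kernel(u, v, degree=2))  # 121.0
print(rbf_kernel(u, u))                   # 1.0 (a point compared with itself)
```

In scikit-learn the kernel is chosen with the `kernel` argument of `SVC`, which accepts `'linear'`, `'poly'`, `'rbf'`, and `'sigmoid'`, so comparing kernels usually means refitting the model with each value and checking the results.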
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "
SVC()