Building an Iris classification model with sklearn

1. Loading the data

In [1]:

from sklearn.datasets import load_iris
iris = load_iris()
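
Before building the DataFrame, it can help to look at what `load_iris()` returns. The following is a minimal sketch that only reads standard attributes of the returned object; the printing is purely illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
# Integer class labels (0, 1, 2), one per flower
print(iris.target.shape)
# Column names and class names that are reused later in this post
print(iris.feature_names)
print(list(iris.target_names))
```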

2. Introducing the data and preparing the training data

Showing the Iris data as a DataFrame

In [2]:

import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
sy = pd.Series(iris.target, dtype="category")
sy = sy.cat.rename_categories(iris.target_names)
df['species'] = sy
df.tail()

Out[2]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	species
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

Counting the samples per species
We use the pandas groupby and count methods.

In [3]:

df["count"] = 1
df.groupby(["species"])["count"].count()

Out[3]:

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
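
The same tally can probably be obtained without adding a helper column. The sketch below is an alternative rather than the approach used in this post: it rebuilds the species column from `iris.target` and calls `value_counts`. The variable names `species` and `counts` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Map each integer label to its species name
species = pd.Series(iris.target_names[iris.target], name="species")

# value_counts tallies each distinct value directly
counts = species.value_counts()
print(counts)
```

This keeps the DataFrame free of the extra `count` column, which matters if you later select feature columns by position.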

Assigning the independent and dependent variables [ Allocate Iris data into IV(X) and DV(y) ]

In [4]:

# Select the four feature columns with the DataFrame's iloc indexer and convert them to a NumPy array
X = df.iloc[:,:4].to_numpy()

In [5]:

# Assign an integer ID to each label
categories_id = {
    "setosa":0, 
    "versicolor":1,
    "virginica":2
}

df["species_id"] = df["species"].apply(lambda x:categories_id[x] )

In [6]:

# Convert the label column to a NumPy array
y = df["species_id"].to_numpy()
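
Because the species column is categorical, the integer IDs are already stored as category codes, and they match `iris.target`. The sketch below shows this shortcut under the assumption that the categories are in the order setosa, versicolor, virginica; it is an alternative to the dictionary mapping above, and the names `species` and `y_codes` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a categorical species column whose category order follows iris.target_names
species = pd.Series(
    pd.Categorical.from_codes(iris.target, categories=list(iris.target_names))
)

# The integer code of each category plays the role of the species ID
y_codes = species.cat.codes.to_numpy()

# The codes agree with the labels shipped with the dataset
print(np.array_equal(y_codes, iris.target))
```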

Splitting the data into train and test sets

In [7]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

In [8]:

# Shape of the data passed to the QDA model provided by sklearn
print(X_train.shape)
print(y_train.shape)

(120, 4)
(120,)
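
A plain shuffled split does not force the three species to appear equally often in the test set. If balanced classes are wanted, `train_test_split` accepts a `stratify` argument. The sketch below compares the two; it is an optional variation, not the split used in the rest of this post.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as in the post: class counts in the test set are left to chance
_, _, _, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)
print(np.bincount(y_test))

# stratify=y keeps the class ratio of y in both the train and test subsets
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(np.bincount(y_test_strat))
```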

3. Data analysis

The following three models are used to solve the classification problem:
3.1 QDA
3.2 LDA
3.3 Naive Bayes analysis

3.1 QDA (quadratic discriminant analysis)

A detailed description of the model is available in the sklearn documentation.
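
As a rough illustration of the idea behind QDA, the sketch below fits one Gaussian per class, each with its own mean and covariance matrix, and assigns every test sample to the class with the highest log-posterior. This is a hand-written approximation for illustration, not sklearn's internal implementation; the variable names are made up, and the data is reloaded so that the block runs on its own.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

classes = np.unique(y_train)
scores = []
for k in classes:
    X_k = X_train[y_train == k]
    mean_k = X_k.mean(axis=0)
    # Each class gets its own covariance matrix (this is what makes it "quadratic")
    cov_k = np.cov(X_k, rowvar=False)
    log_prior = np.log(len(X_k) / len(X_train))
    # log p(x | class k) + log p(class k), up to a constant shared by all classes
    scores.append(multivariate_normal.logpdf(X_test, mean=mean_k, cov=cov_k) + log_prior)

# Pick the class with the largest score for every test sample
manual_pred = classes[np.argmax(np.column_stack(scores), axis=1)]
print((manual_pred == y_test).mean())
```

LDA follows the same recipe but shares a single covariance matrix across all classes, which makes the decision boundaries linear.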

Training the QDA model

In [9]:

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Create the model
qda = QuadraticDiscriminantAnalysis(store_covariance=True)
# Train the model
qda.fit(X_train, y_train)

Out[9]:

QuadraticDiscriminantAnalysis(store_covariance=True)

Checking the training results

Testing the model with a classification report (precision, recall, and F1 score per class)

In [10]:

y_pred = qda.predict(X_test)

In [11]:

from sklearn.metrics import classification_report

target_names = ['setosa', 'versicolor', 'virginica']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.89      0.94         9
   virginica       0.92      1.00      0.96        11

    accuracy                           0.97        30
   macro avg       0.97      0.96      0.97        30
weighted avg       0.97      0.97      0.97        30
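
The report above summarizes per-class metrics. To see the underlying confusion matrix itself, `sklearn.metrics.confusion_matrix` can be used. The sketch below is self-contained and refits the model with the same split settings as in this post; rows are true classes and columns are predicted classes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# Entry [i, j] counts test samples of true class i that were predicted as class j
cm = confusion_matrix(y_test, qda.predict(X_test), labels=[0, 1, 2])
print(cm)
```

Off-diagonal entries show which pairs of species the model confuses, which the aggregated report does not reveal directly.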

Visualizing the ROC curve of the QDA model

In [12]:

# Import packages
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

In [13]:

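# Draw the ROC curves using the test data

# One-hot encode the test labels; recent sklearn versions require `classes` as a keyword argument
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
for i, target_name in enumerate(target_names):
    # One-vs-rest ROC curve for class i, scored by the decision function
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, i], qda.decision_function(X_test)[:, i])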
    plt.plot(fpr, tpr, label=target_name)
    
plt.xlabel('Fall-Out')
plt.ylabel('Recall')
plt.legend()
plt.show()
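
A ROC curve is often summarized by the area under it (AUC). The sketch below computes a one-vs-rest AUC per class with `roc_auc_score`, using the same decision-function scores as the plot. It is an optional addition; the dictionary name `auc_per_class` is illustrative, and the model is refit so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, shuffle=True
)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# One column of 0/1 indicators per class
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
scores = qda.decision_function(X_test)

# Area under the one-vs-rest ROC curve for each species
auc_per_class = {}
for i, name in enumerate(iris.target_names):
    auc_per_class[str(name)] = roc_auc_score(y_test_bin[:, i], scores[:, i])
print(auc_per_class)
```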

Using the model

In [14]:

import numpy as np

# Take example rows from the test data (every fifth sample)
example_x = X_test[::5]
# Compute the predicted probability of each class with predict_proba
example_ratio = qda.predict_proba(example_x)
example_ratio = np.append(example_x, example_ratio, axis=1)

In [15]:

# Use the model!
example_predict = qda.predict(X_test[::5]).reshape(-1, 1)
example = np.append(example_ratio, example_predict, axis=1)

In [16]:

# Use pandas to inspect the predictions for the example data
pd.options.display.float_format = '{:.5f}'.format
categories = {0: "setosa", 1: "versicolor", 2: "virginica",}
data_frame_ratio = pd.DataFrame(
    example, 
    columns=(
        "sepal length (cm)", "sepal width (cm)","petal length (cm)","petal width (cm)", 
        "setosa proba", "versicolor proba", "virginica proba", 
        "predict"
    )
)
data_frame_ratio["predict"] = data_frame_ratio["predict"].apply(lambda x : categories[x])

In [17]:

data_frame_ratio

Out[17]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	setosa proba	versicolor proba	virginica proba	predict
0	6.10000	2.80000	4.70000	1.20000	0.00000	0.97125	0.02875	versicolor
1	5.40000	3.40000	1.50000	0.40000	1.00000	0.00000	0.00000	setosa
2	6.50000	3.20000	5.10000	2.00000	0.00000	0.00865	0.99135	virginica
3	6.30000	3.30000	4.70000	1.60000	0.00000	0.99979	0.00021	versicolor
4	4.70000	3.20000	1.60000	0.20000	1.00000	0.00000	0.00000	setosa
5	6.70000	3.00000	5.20000	2.30000	0.00000	0.00000	1.00000	virginica
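
The same two calls work for a single new flower. The sketch below classifies one hypothetical measurement; the values in `new_flower` are invented for illustration and do not come from the dataset split used above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, shuffle=True
)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

# Hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])

# predict_proba returns one row of class probabilities per input row
proba = qda.predict_proba(new_flower)[0]
# predict returns the integer class ID, which indexes into target_names
label = str(iris.target_names[qda.predict(new_flower)[0]])

print(label, proba)
```

Note that the input must be two-dimensional, with one row per sample, even when there is only one sample.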

3.2 LDA (linear discriminant analysis)

A detailed description of the model is available in the sklearn documentation.

Training the LDA model

In [18]:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=3-1, solver="svd", store_covariance=True)
lda.fit(X_train, y_train)

Out[18]:

LinearDiscriminantAnalysis(n_components=2, store_covariance=True)
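
The `n_components` argument does not affect classification; it controls how many discriminant axes `transform` returns, and it can be at most the number of classes minus one. The sketch below, which refits the model so it runs on its own, projects the training data onto those two axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

lda = LinearDiscriminantAnalysis(n_components=2, solver="svd")
lda.fit(X_train, y_train)

# Project the four measurements onto the two discriminant axes
X_train_2d = lda.transform(X_train)
print(X_train_2d.shape)

# Share of between-class variance captured by each axis
print(lda.explained_variance_ratio_)
```

The two-dimensional projection is convenient for a scatter plot colored by species.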

Checking the training results

Testing the model with a classification report (precision, recall, and F1 score per class)

In [19]:

y_pred = lda.predict(X_test)

In [20]:

from sklearn.metrics import classification_report

target_names = ['setosa', 'versicolor', 'virginica']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Visualizing the ROC curve of the LDA model

In [21]:

# Import packages
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

In [22]:

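# Draw the ROC curves using the test data

# One-hot encode the test labels; recent sklearn versions require `classes` as a keyword argument
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
for i, target_name in enumerate(target_names):
    # One-vs-rest ROC curve for class i, scored by the decision function
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, i], lda.decision_function(X_test)[:, i])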
    plt.plot(fpr, tpr, label=target_name)
    
plt.xlabel('Fall-Out')
plt.ylabel('Recall')
plt.legend()
plt.show()

Using the model

In [23]:

import numpy as np

# Take example rows from the test data (every fifth sample)
example_x = X_test[::5]
# Compute the predicted probability of each class with predict_proba
example_ratio = lda.predict_proba(example_x)
example_ratio = np.append(example_x, example_ratio, axis=1)

In [24]:

# Use the model!
example_predict = lda.predict(X_test[::5]).reshape(-1, 1)
example = np.append(example_ratio, example_predict, axis=1)

In [25]:

# Use pandas to inspect the predictions for the example data
pd.options.display.float_format = '{:.5f}'.format
categories = {0: "setosa", 1: "versicolor", 2: "virginica",}
data_frame_ratio = pd.DataFrame(
    example, 
    columns=(
        "sepal length (cm)", "sepal width (cm)","petal length (cm)","petal width (cm)", 
        "setosa proba", "versicolor proba", "virginica proba", 
        "predict"
    )
)
data_frame_ratio["predict"] = data_frame_ratio["predict"].apply(lambda x : categories[x])

In [26]:

data_frame_ratio

Out[26]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	setosa proba	versicolor proba	virginica proba	predict
0	6.10000	2.80000	4.70000	1.20000	0.00000	0.99887	0.00113	versicolor
1	5.40000	3.40000	1.50000	0.40000	1.00000	0.00000	0.00000	setosa
2	6.50000	3.20000	5.10000	2.00000	0.00000	0.02183	0.97817	virginica
3	6.30000	3.30000	4.70000	1.60000	0.00000	0.98842	0.01158	versicolor
4	4.70000	3.20000	1.60000	0.20000	1.00000	0.00000	0.00000	setosa
5	6.70000	3.00000	5.20000	2.30000	0.00000	0.00016	0.99984	virginica

3.3 Naive Bayes analysis (Gaussian distribution)

A detailed description of the model is available in the sklearn documentation.

Training the Gaussian Naive Bayes model

In [27]:

from sklearn.naive_bayes import GaussianNB
model_norm = GaussianNB()
model_norm.fit(X_train, y_train)

Out[27]:

GaussianNB()
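
Gaussian Naive Bayes models each feature of each class as an independent normal distribution, so the fitted model boils down to class priors and per-class feature means and variances. The sketch below inspects the priors and means after fitting and checks that the stored means match the sample means; the model is refit so the block runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

gnb = GaussianNB().fit(X_train, y_train)

# Fraction of training samples in each class
print(gnb.class_prior_)
# Mean of each feature within each class, shape (n_classes, n_features)
print(gnb.theta_)

# The stored means equal the per-class sample means of the training data
setosa_mean = X_train[y_train == 0].mean(axis=0)
print(np.allclose(gnb.theta_[0], setosa_mean))
```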

Checking the training results

Testing the model with a classification report (precision, recall, and F1 score per class)

In [28]:

y_pred = model_norm.predict(X_test)

In [29]:

from sklearn.metrics import classification_report

target_names = ['setosa', 'versicolor', 'virginica']
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Visualizing the ROC curve of the Gaussian Naive Bayes model

In [30]:

# Import packages
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

In [31]:

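# Draw the ROC curves using the test data

# One-hot encode the test labels; recent sklearn versions require `classes` as a keyword argument
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
for i, target_name in enumerate(target_names):
    # GaussianNB has no decision_function, so the class probabilities serve as scores
    fpr, tpr, thresholds = roc_curve(y_test_bin[:, i], model_norm.predict_proba(X_test)[:, i])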
    plt.plot(fpr, tpr, label=target_name)
    
plt.xlabel('Fall-Out')
plt.ylabel('Recall')
plt.legend()
plt.show()

Using the model

In [32]:

import numpy as np

# Take example rows from the test data (every fifth sample)
example_x = X_test[::5]
# Compute the predicted probability of each class with predict_proba
example_ratio = model_norm.predict_proba(example_x)
example_ratio = np.append(example_x, example_ratio, axis=1)

In [33]:

# Use the model!
example_predict = model_norm.predict(X_test[::5]).reshape(-1, 1)
example = np.append(example_ratio, example_predict, axis=1)

In [34]:

# Use pandas to inspect the predictions for the example data
pd.options.display.float_format = '{:.5f}'.format
categories = {0: "setosa", 1: "versicolor", 2: "virginica",}
data_frame_ratio = pd.DataFrame(
    example, 
    columns=(
        "sepal length (cm)", "sepal width (cm)","petal length (cm)","petal width (cm)", 
        "setosa proba", "versicolor proba", "virginica proba", 
        "predict"
    )
)
data_frame_ratio["predict"] = data_frame_ratio["predict"].apply(lambda x : categories[x])

In [35]:

data_frame_ratio

Out[35]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	setosa proba	versicolor proba	virginica proba	predict
0	6.10000	2.80000	4.70000	1.20000	0.00000	0.99564	0.00436	versicolor
1	5.40000	3.40000	1.50000	0.40000	1.00000	0.00000	0.00000	setosa
2	6.50000	3.20000	5.10000	2.00000	0.00000	0.00059	0.99941	virginica
3	6.30000	3.30000	4.70000	1.60000	0.00000	0.60293	0.39707	versicolor
4	4.70000	3.20000	1.60000	0.20000	1.00000	0.00000	0.00000	setosa
5	6.70000	3.00000	5.20000	2.30000	0.00000	0.00000	1.00000	virginica
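
All three reports above come from one 80/20 split with only 30 test samples, so small differences between the models may be due to chance. One way to get a steadier comparison is k-fold cross-validation. The sketch below is an optional follow-up rather than part of the original analysis; the choice of 5 folds and the names `models` and `results` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

models = {
    "QDA": QuadraticDiscriminantAnalysis(),
    "LDA": LinearDiscriminantAnalysis(),
    "GaussianNB": GaussianNB(),
}

# Mean accuracy over 5 folds for each model
results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    results[name] = fold_scores.mean()

for name, mean_accuracy in results.items():
    print(f"{name}: {mean_accuracy:.3f}")
```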
