6-1. p.286~ Clustering Algorithms

햄스텅 2021. 10. 28. 15:59

Written in Google Colab!

First, prepare the fruit image data.

!wget https://bit.ly/fruits_300_data -O fruits_300.npy

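+) In a Colab code cell, a line that starts with "!" is treated as a Linux shell command rather than Python code. The wget command downloads data from a remote address and saves it. Here the -O option sets the name of the saved file to fruits_300.npy.

Next, load the data from this file.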

import numpy as np
import matplotlib.pyplot as plt

fruits = np.load('fruits_300.npy')

print(fruits.shape)

The first dimension of this array is the number of samples (300), the second is the image height (100), and the third is the image width (100).

print(fruits[0,0,:])

This prints the values of the 100 pixels in the first row of the first image.
The NumPy array holds grayscale images, so every value is an integer from 0 to 255.
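The indexing above can be tried on a small stand-in array without downloading anything. This is only a sketch: the `images` array and its 3 x 4 x 5 shape are made up for illustration and are not the fruit data.

```python
import numpy as np

# Hypothetical stand-in for fruits: 3 grayscale images of 4 x 5 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(3, 4, 5), dtype=np.uint8)

# (samples, height, width), the same layout as fruits.shape
n_samples, height, width = images.shape

# First row of the first image: one value per column
first_row = images[0, 0, :]

print(images.shape)     # (3, 4, 5)
print(first_row.shape)  # (5,)
print(images.dtype)     # uint8
```

With the real data, the same slice fruits[0,0,:] has 100 values because each image is 100 pixels wide.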

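# Draw the first image; with cmap='gray', values near 0 are dark and values near 255 are light
plt.imshow(fruits[0], cmap='gray')
plt.show()

# Draw the same image with the reversed colormap so the background becomes light
plt.imshow(fruits[0], cmap='gray_r')
plt.show()

Be sure to remember that with cmap='gray_r', light areas are values close to 0 and dark areas are values close to 255.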
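One way to think about the reversed colormap: drawing an image with 'gray_r' should look the same as drawing 255 minus the image with 'gray'. The tiny `image` array below is a made-up example, not part of the fruit data, and the sketch only shows the arithmetic, not the plot.

```python
import numpy as np

# Hypothetical one-row image: darkest, middle, and brightest pixel values
image = np.array([[0, 128, 255]], dtype=np.uint8)

# Inverting the values swaps light and dark
inverted = 255 - image

print(image)     # [[  0 128 255]]
print(inverted)  # [[255 127   0]]
```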

# Show one pineapple (index 100) and one banana (index 200) side by side
fig,axs = plt.subplots(1,2)
axs[0].imshow(fruits[100], cmap='gray_r')
axs[1].imshow(fruits[200], cmap='gray_r')
plt.show()

# Split the array by fruit and flatten each 100x100 image into a vector of length 10,000
apple = fruits[0:100].reshape(-1,100*100)
pineapple = fruits[100:200].reshape(-1,100*100)
banana = fruits[200:300].reshape(-1,100*100)

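# Each group is now 100 samples of 10,000 pixels: (100, 10000)
print(apple.shape)

# Mean pixel value of each apple image (axis=1 averages over the pixels of one sample)
print(apple.mean(axis=1))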

# Compare the distribution of per-image mean pixel values for the three fruits
plt.hist(np.mean(apple,axis=1), alpha=0.8)
plt.hist(np.mean(pineapple,axis=1), alpha=0.8)
plt.hist(np.mean(banana,axis=1), alpha=0.8)
plt.legend(['apple', 'pineapple', 'banana'])
plt.show()

# Mean of each pixel position across the 100 images of a fruit (axis=0), drawn as bar charts
fig, axs = plt.subplots(1,3,figsize=(20,5))
axs[0].bar(range(10000), np.mean(apple, axis=0))
axs[1].bar(range(10000), np.mean(pineapple, axis=0))
axs[2].bar(range(10000), np.mean(banana, axis=0))
plt.show()

# Reshape each per-pixel mean back to 100x100 and draw it as an "average image" of the fruit
apple_mean = np.mean(apple, axis=0).reshape(100,100)
pineapple_mean = np.mean(pineapple, axis=0).reshape(100,100)
banana_mean = np.mean(banana, axis=0).reshape(100,100)
fig, axs = plt.subplots(1,3,figsize=(20,5))
axs[0].imshow(apple_mean, cmap='gray_r')
axs[1].imshow(pineapple_mean, cmap='gray_r')
axs[2].imshow(banana_mean, cmap='gray_r')
plt.show()

# Absolute difference between every image and apple_mean, averaged over each image's pixels
abs_diff = np.abs(fruits - apple_mean)
abs_mean = np.mean(abs_diff, axis=(1,2))
print(abs_mean.shape)

# Indices of the 100 images with the smallest difference, drawn on a 10x10 grid
apple_index = np.argsort(abs_mean)[:100]
fig,axs = plt.subplots(10,10,figsize=(10,10))
for i in range(10):
    for j in range(10):
        axs[i,j].imshow(fruits[apple_index[i*10+j]],cmap='gray_r')
        axs[i,j].axis('off')
plt.show()

Picking the 100 images closest to apple_mean returned nothing but apples.
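The selection step (mean absolute difference, then argsort) can be checked on toy data where the right answer is obvious. This is a minimal sketch under assumptions: `group_a`, `group_b`, and their 4 x 4 size are invented for illustration and stand in for two kinds of fruit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: 5 dark images and 5 bright images of 4 x 4 pixels
group_a = rng.normal(loc=50, scale=5, size=(5, 4, 4))
group_b = rng.normal(loc=200, scale=5, size=(5, 4, 4))
images = np.concatenate([group_a, group_b])  # shape (10, 4, 4)

# Average image of the first group, shape (4, 4)
mean_a = images[:5].mean(axis=0)

# Broadcasting subtracts mean_a from every image; then average over height and width
abs_diff = np.abs(images - mean_a)
abs_mean = abs_diff.mean(axis=(1, 2))  # one score per image, shape (10,)

# The 5 smallest scores belong to the images closest to mean_a
closest = np.argsort(abs_mean)[:5]
print(sorted(closest.tolist()))  # [0, 1, 2, 3, 4]
```

The two toy groups are far apart on purpose, so the five closest images are exactly the first group. Real photos overlap more, which is why the result is worth checking by eye as done above.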

That completes the task of gathering fruit photos using only the pixel values of the grayscale images!

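Grouping similar samples together like this is called clustering.
Clustering is one of the most common unsupervised learning tasks.
A group produced by a clustering algorithm is called a cluster.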
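To make the idea of a cluster concrete, the sketch below gives every sample the label of its nearest group mean. It assumes the group means are already known, which a real clustering algorithm would not have, so it illustrates grouping by similarity rather than a full clustering method. The names `centers`, `samples`, and `means` and all the numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: three groups of 20 flattened 16-pixel "images"
centers = [40, 120, 220]
samples = np.concatenate([rng.normal(c, 5, size=(20, 16)) for c in centers])  # (60, 16)

# Assumption: the mean of each group is known in advance
means = np.stack([samples[i * 20:(i + 1) * 20].mean(axis=0) for i in range(3)])  # (3, 16)

# Mean absolute difference between every sample and every group mean, shape (60, 3)
dist = np.abs(samples[:, None, :] - means[None, :, :]).mean(axis=2)

# Each sample joins the group whose mean is closest
labels = dist.argmin(axis=1)
print(np.bincount(labels))  # [20 20 20]
```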

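Key points by keyword

Unsupervised learning = A type of machine learning in which the training data has no targets. Because there are no targets, the model has to learn something useful on its own, without outside help. Typical unsupervised learning tasks include clustering and dimensionality reduction.
Histogram = A graph showing how often values occur in each interval. Usually the x-axis is the value intervals and the y-axis is the frequency.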
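The counting behind a histogram can be seen without drawing anything by using np.histogram. The `values` array and the choice of 3 bins over the range 0 to 9 are made up for this sketch.

```python
import numpy as np

# Hypothetical sample values
values = np.array([1, 2, 2, 3, 3, 3, 8, 9])

# Split the range 0..9 into 3 equal intervals and count the values in each
counts, bin_edges = np.histogram(values, bins=3, range=(0, 9))

print(bin_edges)  # [0. 3. 6. 9.]
print(counts)     # [3 3 2]
```

The bin edges are what a plot puts on the x-axis and the counts are the bar heights on the y-axis. Every interval includes its left edge, and only the last one also includes its right edge, which is why 3 is counted in the second interval and 9 in the third.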