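Overview
CPU (NumPy) vs. GPU (CuPy and PyTorch) comparison test
Contents
- Matrix multiplication performance comparison:
  - Compares NumPy (CPU), CuPy (GPU), and PyTorch (GPU) implementations
  - Tests matrix sizes [128, 256, 512, 1024, 2048]
  - Measures each implementation's execution time and its speedup over the CPU
  - Visualizes execution time
- Neural network training performance comparison:
  - Implements a simple binary-classification neural network
  - Compares CPU-based and GPU-based training time
  - Tests hidden layer sizes (64, 128, 256)
  - Measures training time and GPU speedup for each setting
- Implementation and test environment: Colab PRO (GPU: Tesla T4)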
# Check whether a GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current GPU device: {torch.cuda.get_device_name()}")
PyTorch version: 2.5.1+cu121
CUDA available: True
Current GPU device: Tesla T4
import cupy as cp
print(f"CuPy version: {cp.__version__}")
CuPy version: 12.2.0
# Import required libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import time
# Define a function to perform matrix multiplication (practical scenario)
def matrix_multiplication(a, b):
    return np.dot(a, b)

def gpu_matrix_multiplication_cupy(a, b):
    # The host-to-device copies and the copy back are part of what the caller times
    a_gpu = cp.array(a)
    b_gpu = cp.array(b)
    result_gpu = cp.dot(a_gpu, b_gpu)
    return cp.asnumpy(result_gpu)

def gpu_matrix_multiplication_pytorch(a, b):
    # Same pattern: copy to the GPU, multiply, copy the result back to the host
    a_gpu = torch.tensor(a, device='cuda')
    b_gpu = torch.tensor(b, device='cuda')
    result_gpu = torch.matmul(a_gpu, b_gpu)
    return result_gpu.cpu().numpy()
# Generate synthetic data for testing
def generate_data(size):
    a = np.random.rand(size, size).astype(np.float32)
    b = np.random.rand(size, size).astype(np.float32)
    return a, b

# Test the implementations and measure performance
def test_performance(size):
    a, b = generate_data(size)

    # CPU-based numpy implementation
    start = time.time()
    result_cpu = matrix_multiplication(a, b)
    cpu_time = time.time() - start

    # GPU-based cupy implementation
    start = time.time()
    result_cupy = gpu_matrix_multiplication_cupy(a, b)
    cupy_time = time.time() - start

    # GPU-based pytorch implementation
    start = time.time()
    result_torch = gpu_matrix_multiplication_pytorch(a, b)
    torch_time = time.time() - start

    return cpu_time, cupy_time, torch_time

# Analyze performance scaling with problem size
sizes = [128, 256, 512, 1024, 2048]
cpu_times, cupy_times, torch_times = [], [], []
for size in sizes:
    cpu_time, cupy_time, torch_time = test_performance(size)
    cpu_times.append(cpu_time)
    cupy_times.append(cupy_time)
    torch_times.append(torch_time)
# Print times for each matrix size
# (loop variables are named *_t so they do not shadow the imported torch module)
print("Performance Times:")
for size, cpu_t, cupy_t, torch_t in zip(sizes, cpu_times, cupy_times, torch_times):
    print(f"Matrix Size {size}x{size}:")
    print(f" CPU (numpy): {cpu_t:.6f} seconds")
    print(f" GPU (cupy): {cupy_t:.6f} seconds")
    print(f" GPU (pytorch): {torch_t:.6f} seconds")
    print()
Performance Times:
Matrix Size 128x128:
CPU (numpy): 0.004375 seconds
GPU (cupy): 0.297219 seconds
GPU (pytorch): 0.004101 seconds
Matrix Size 256x256:
CPU (numpy): 0.006716 seconds
GPU (cupy): 0.032556 seconds
GPU (pytorch): 0.002549 seconds
Matrix Size 512x512:
CPU (numpy): 0.005299 seconds
GPU (cupy): 0.015380 seconds
GPU (pytorch): 0.002463 seconds
Matrix Size 1024x1024:
CPU (numpy): 0.041209 seconds
GPU (cupy): 0.012589 seconds
GPU (pytorch): 0.010114 seconds
Matrix Size 2048x2048:
CPU (numpy): 0.483878 seconds
GPU (cupy): 0.107040 seconds
GPU (pytorch): 0.082670 seconds
# Display results
import matplotlib.pyplot as plt
plt.plot(sizes, cpu_times, label='CPU (numpy)', marker='o')
plt.plot(sizes, cupy_times, label='GPU (cupy)', marker='o')
plt.plot(sizes, torch_times, label='GPU (pytorch)', marker='o')
plt.xlabel('Matrix Size')
plt.ylabel('Time (seconds)')
plt.title('Performance Comparison: CPU vs GPU (cupy/pytorch)')
plt.legend()
plt.grid()
plt.show()
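Analysis
Small matrices (N ≤ 256)
- CuPy is far slower than the CPU here: 0.297 s vs. 0.004 s at N=128 and 0.033 s vs. 0.007 s at N=256. PyTorch is already on par with or slightly ahead of NumPy (0.0041 s vs. 0.0044 s at N=128).
- The GPU pays fixed costs (initialization plus the host-to-device and device-to-host copies, which these timings include), so small inputs see little or no gain.
- CuPy's 0.297 s at N=128 is its first call in the run, so it most likely reflects one-time startup cost rather than steady-state performance.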
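One way to keep one-time startup cost out of the numbers is to run a few untimed warm-up calls and report the median of several timed calls. The sketch below is a minimal illustration, not the method used for the results in this post; the names `benchmark`, `warmup`, `repeats`, and `sync` are hypothetical. For GPU work, `sync` could be `cp.cuda.Stream.null.synchronize` (CuPy) or `torch.cuda.synchronize` (PyTorch); here a plain CPU workload stands in so the sketch runs without a GPU.

```python
import statistics
import time


def benchmark(fn, *, warmup=2, repeats=5, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    warmup  -- untimed calls that absorb one-time startup cost
    repeats -- timed calls; the median damps outliers
    sync    -- optional callable that blocks until queued device work is done
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # finish warm-up work before any timed call starts

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # include asynchronous device work in this sample
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in CPU workload (an assumption for illustration only).
elapsed = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"median time: {elapsed:.6f} s")
```

With this pattern the first, slow call lands in the warm-up phase instead of in the reported time.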
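Medium matrices (256 < N ≤ 1024)
- The GPU starts to overtake the CPU in this range: PyTorch leads at N=512 (0.0025 s vs. 0.0053 s), while CuPy is still slower there (0.0154 s) and only pulls ahead at N=1024 (0.0126 s vs. 0.0412 s).
- By N=1024 the gap between CuPy and PyTorch is small (0.0126 s vs. 0.0101 s), as large-scale parallelism begins to outweigh the fixed overhead.
Large matrices (N > 1024)
- The GPU advantage widens: at N=2048, CuPy is about 4.5x and PyTorch about 5.9x faster than NumPy.
- CPU time grows much faster than linearly (about 11.7x from N=1024 to N=2048, in line with the O(N^3) cost of dense matrix multiplication), while GPU time grows more slowly over the same step (about 8.5x for CuPy and 8.2x for PyTorch).
- CuPy and PyTorch perform similarly at large sizes (0.107 s vs. 0.083 s at N=2048), which shows the efficiency of GPU parallel computation.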
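The speedups and growth rates can be checked directly from the printed timings. The sketch below copies the numbers from the output above; the helper names `speedups` and `scaling_exponent` are hypothetical. Each timing is a single run, so the derived values are rough estimates at best.

```python
import math

# Times in seconds, copied from the printed results above.
sizes = [128, 256, 512, 1024, 2048]
times = {
    "numpy": [0.004375, 0.006716, 0.005299, 0.041209, 0.483878],
    "cupy": [0.297219, 0.032556, 0.015380, 0.012589, 0.107040],
    "pytorch": [0.004101, 0.002549, 0.002463, 0.010114, 0.082670],
}


def speedups(cpu, gpu):
    """CPU time divided by GPU time for each matrix size."""
    return [c / g for c, g in zip(cpu, gpu)]


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k in t ~ N**k, estimated from two (size, time) points."""
    return math.log(t2 / t1) / math.log(n2 / n1)


for name in ("cupy", "pytorch"):
    ratios = speedups(times["numpy"], times[name])
    print(name, [f"{r:.2f}x" for r in ratios])

for name, t in times.items():
    k = scaling_exponent(sizes[-2], t[-2], sizes[-1], t[-1])
    print(f"{name}: time ~ N^{k:.2f} between N=1024 and N=2048")
```

A speedup below 1 means the GPU path was slower than NumPy at that size, which is the case for CuPy at the smaller sizes.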
# CPU version of the neural network
class NeuralNetworkCPU:
    def __init__(self, input_size, hidden_size):
        scale = np.sqrt(2.0 / (input_size + hidden_size))
        self.W1 = np.random.normal(0, scale, (input_size, hidden_size))
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.normal(0, scale, (hidden_size, 1))
        self.b2 = np.zeros(1)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]
        delta2 = self.a2 - y.reshape(-1, 1)
        delta1 = np.dot(delta2, self.W2.T)
        delta1[self.z1 <= 0] = 0
        self.W2 -= learning_rate * np.dot(self.a1.T, delta2) / m
        self.b2 -= learning_rate * np.sum(delta2, axis=0) / m
        self.W1 -= learning_rate * np.dot(X.T, delta1) / m
        self.b1 -= learning_rate * np.sum(delta1, axis=0) / m

    def train(self, X, y, epochs=10):
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -np.mean(y.reshape(-1, 1) * np.log(output + 1e-7) +
                            (1 - y.reshape(-1, 1)) * np.log(1 - output + 1e-7))
            self.backward(X, y)
            if (epoch + 1) % 2 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss:.4f}')
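The backward pass starts from `delta2 = self.a2 - y`, which is the gradient of the binary cross-entropy loss with respect to the pre-activation `z2` when the output activation is a sigmoid. The scalar sketch below checks that identity with a central finite difference; the helper names `bce_loss` and `numerical_grad` are hypothetical and not part of the classes in this post.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(z, y):
    """Binary cross-entropy of sigmoid(z) against label y (0 or 1)."""
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))


def numerical_grad(z, y, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dz."""
    return (bce_loss(z + eps, y) - bce_loss(z - eps, y)) / (2 * eps)


for z, y in [(-2.0, 0), (0.5, 1), (3.0, 0)]:
    analytic = sigmoid(z) - y  # same form as delta2 = a2 - y
    numeric = numerical_grad(z, y)
    print(f"z={z:+.1f} y={y}: analytic={analytic:+.6f} numeric={numeric:+.6f}")
```

The two columns should agree to several decimal places, which is why the class can skip the explicit sigmoid derivative in `backward`.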
# GPU version of the neural network
class NeuralNetworkGPU:
    def __init__(self, input_size, hidden_size):
        scale = cp.sqrt(2.0 / (input_size + hidden_size))
        self.W1 = cp.random.normal(0, scale, (input_size, hidden_size))
        self.b1 = cp.zeros(hidden_size)
        self.W2 = cp.random.normal(0, scale, (hidden_size, 1))
        self.b2 = cp.zeros(1)

    def sigmoid(self, x):
        return 1 / (1 + cp.exp(-x))

    def forward(self, X):
        self.z1 = cp.dot(X, self.W1) + self.b1
        self.a1 = cp.maximum(0, self.z1)  # ReLU
        self.z2 = cp.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]
        delta2 = self.a2 - y.reshape(-1, 1)
        delta1 = cp.dot(delta2, self.W2.T)
        delta1[self.z1 <= 0] = 0
        self.W2 -= learning_rate * cp.dot(self.a1.T, delta2) / m
        self.b2 -= learning_rate * cp.sum(delta2, axis=0) / m
        self.W1 -= learning_rate * cp.dot(X.T, delta1) / m
        self.b1 -= learning_rate * cp.sum(delta1, axis=0) / m

    def train(self, X, y, epochs=10):
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -cp.mean(y.reshape(-1, 1) * cp.log(output + 1e-7) +
                            (1 - y.reshape(-1, 1)) * cp.log(1 - output + 1e-7))
            self.backward(X, y)
            if (epoch + 1) % 2 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss:.4f}')
# Generate the dataset
print("\nGenerating dataset...")
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X = X.astype(np.float32)
y = y.astype(np.float32)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print("Dataset generated successfully!")
Generating dataset...
Dataset generated successfully!
hidden_sizes = [64, 128, 256]
results = {}
for hidden_size in hidden_sizes:
    print(f"\n{'='*50}")
    print(f"Testing with hidden size: {hidden_size}")
    print('='*50)

    # Train on CPU (NumPy)
    print("\nTraining on CPU...")
    start_time = time.time()
    model_cpu = NeuralNetworkCPU(X_train.shape[1], hidden_size)
    model_cpu.train(X_train, y_train)
    cpu_time = time.time() - start_time
    print(f"\nTraining completed in {cpu_time:.4f} seconds")

    # Train on GPU (CuPy); the host-to-device copy happens before the timer starts
    print("\nTraining on GPU...")
    X_gpu = cp.array(X_train)
    y_gpu = cp.array(y_train)
    start_time = time.time()
    model_gpu = NeuralNetworkGPU(X_gpu.shape[1], hidden_size)
    model_gpu.train(X_gpu, y_gpu)
    cp.cuda.Stream.null.synchronize()  # wait for queued GPU work before stopping the timer
    gpu_time = time.time() - start_time
    print(f"\nTraining completed in {gpu_time:.4f} seconds")

    # Compute the speedup
    speedup = cpu_time / gpu_time
    print(f"\nResults for hidden size {hidden_size}:")
    print(f"CPU Training time: {cpu_time:.4f} seconds")
    print(f"GPU Training time: {gpu_time:.4f} seconds")
    print(f"GPU Speedup: {speedup:.2f}x")
    results[hidden_size] = {
        'cpu_time': cpu_time,
        'gpu_time': gpu_time,
        'speedup': speedup
    }
==================================================
Testing with hidden size: 64
==================================================
Training on CPU...
Epoch [2/10], Loss: 0.6713
Epoch [4/10], Loss: 0.6660
Epoch [6/10], Loss: 0.6608
Epoch [8/10], Loss: 0.6557
Epoch [10/10], Loss: 0.6508
Training completed in 0.4558 seconds
Training on GPU...
Epoch [2/10], Loss: 0.8242
Epoch [4/10], Loss: 0.8115
Epoch [6/10], Loss: 0.7993
Epoch [8/10], Loss: 0.7875
Epoch [10/10], Loss: 0.7761
Training completed in 0.9028 seconds
Results for hidden size 64:
CPU Training time: 0.4558 seconds
GPU Training time: 0.9028 seconds
GPU Speedup: 0.50x
==================================================
Testing with hidden size: 128
==================================================
Training on CPU...
Epoch [2/10], Loss: 0.7127
Epoch [4/10], Loss: 0.7037
Epoch [6/10], Loss: 0.6950
Epoch [8/10], Loss: 0.6866
Epoch [10/10], Loss: 0.6785
Training completed in 0.9032 seconds
Training on GPU...
Epoch [2/10], Loss: 0.6536
Epoch [4/10], Loss: 0.6472
Epoch [6/10], Loss: 0.6410
Epoch [8/10], Loss: 0.6350
Epoch [10/10], Loss: 0.6292
Training completed in 0.1932 seconds
Results for hidden size 128:
CPU Training time: 0.9032 seconds
GPU Training time: 0.1932 seconds
GPU Speedup: 4.67x
==================================================
Testing with hidden size: 256
==================================================
Training on CPU...
Epoch [2/10], Loss: 0.7342
Epoch [4/10], Loss: 0.7246
Epoch [6/10], Loss: 0.7153
Epoch [8/10], Loss: 0.7064
Epoch [10/10], Loss: 0.6977
Training completed in 1.3714 seconds
Training on GPU...
Epoch [2/10], Loss: 0.7127
Epoch [4/10], Loss: 0.7037
Epoch [6/10], Loss: 0.6950
Epoch [8/10], Loss: 0.6866
Epoch [10/10], Loss: 0.6784
Training completed in 0.1177 seconds
Results for hidden size 256:
CPU Training time: 1.3714 seconds
GPU Training time: 0.1177 seconds
GPU Speedup: 11.65x
# Final results summary
print("\n" + "="*50)
print("Final Performance Summary")
print("="*50)
for hidden_size, result in results.items():
    print(f"\nHidden Size: {hidden_size}")
    print(f"CPU Time: {result['cpu_time']:.4f} s")
    print(f"GPU Time: {result['gpu_time']:.4f} s")
    print(f"Speedup: {result['speedup']:.2f}x")
==================================================
Final Performance Summary
==================================================
Hidden Size: 64
CPU Time: 0.4558 s
GPU Time: 0.9028 s
Speedup: 0.50x
Hidden Size: 128
CPU Time: 0.9032 s
GPU Time: 0.1932 s
Speedup: 4.67x
Hidden Size: 256
CPU Time: 1.3714 s
GPU Time: 0.1177 s
Speedup: 11.65x