CPU (NumPy) vs GPU (CuPy and PyTorch) Performance Comparison

Overview

A comparison test of CPU (NumPy) against GPU (CuPy and PyTorch) implementations.

 

Contents

  1. Matrix multiplication performance comparison:
    • Compare NumPy (CPU), CuPy (GPU), and PyTorch (GPU) implementations
    • Test across a range of matrix sizes: [128, 256, 512, 1024, 2048]
    • Measure each implementation's execution time and its speedup over the CPU
    • Visualize the execution times
  2. Neural network training performance comparison:
    • Implement a simple binary-classification neural network
    • Compare CPU-based and GPU-based training times
    • Test with several hidden-layer sizes (64, 128, 256)
    • Measure the training time and GPU speedup for each configuration
  3. Implementation and test environment: Colab Pro (GPU: Tesla T4)
# Check whether a GPU is available
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Current GPU device: {torch.cuda.get_device_name()}")

PyTorch version: 2.5.1+cu121
CUDA available: True
Current GPU device: Tesla T4
import cupy as cp
print(f"CuPy version: {cp.__version__}")

CuPy version: 12.2.0
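If CuPy is not already available in the runtime, it can usually be installed with the wheel matching the environment's CUDA toolkit; the exact package below is an assumption for a CUDA 12.x Colab runtime, not a step from the original post:

# Assumed install command for a CUDA 12.x runtime (hypothetical for this environment)
!pip install cupy-cuda12x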
# Import required libraries
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import time

# Define a function to perform matrix multiplication (practical scenario)
def matrix_multiplication(a, b):
    return np.dot(a, b)
    
def gpu_matrix_multiplication_cupy(a, b):
    a_gpu = cp.array(a)
    b_gpu = cp.array(b)
    result_gpu = cp.dot(a_gpu, b_gpu)
    return cp.asnumpy(result_gpu)
    
def gpu_matrix_multiplication_pytorch(a, b):
    a_gpu = torch.tensor(a, device='cuda')
    b_gpu = torch.tensor(b, device='cuda')
    result_gpu = torch.matmul(a_gpu, b_gpu)
    return result_gpu.cpu().numpy()
 
# Generate synthetic data for testing
def generate_data(size):
    a = np.random.rand(size, size).astype(np.float32)
    b = np.random.rand(size, size).astype(np.float32)
    return a, b
    
# Test the implementations and measure performance
def test_performance(size):
    a, b = generate_data(size)

    # CPU-based numpy implementation
    start = time.time()
    result_cpu = matrix_multiplication(a, b)
    cpu_time = time.time() - start

    # GPU-based cupy implementation
    start = time.time()
    result_cupy = gpu_matrix_multiplication_cupy(a, b)
    cupy_time = time.time() - start

    # GPU-based pytorch implementation
    start = time.time()
    result_torch = gpu_matrix_multiplication_pytorch(a, b)
    torch_time = time.time() - start

    return cpu_time, cupy_time, torch_time
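One caveat about `test_performance`: the GPU timings include host-to-device and device-to-host copies, and the first CuPy call also pays one-time initialization costs. As a minimal sketch (an assumed variant, not part of the original benchmark), compute-only GPU time can be measured with a warm-up run and explicit synchronization:

# Sketch: compute-only GPU timing with warm-up and explicit sync (assumed variant, not in the original)
def gpu_compute_only_time(size):
    a, b = generate_data(size)

    # Move data to the GPU once, outside the timed region
    a_t = torch.tensor(a, device='cuda')
    b_t = torch.tensor(b, device='cuda')

    # Warm-up run so one-time initialization is not counted
    torch.matmul(a_t, b_t)
    torch.cuda.synchronize()

    start = time.time()
    torch.matmul(a_t, b_t)
    torch.cuda.synchronize()  # wait for the kernel to finish before stopping the clock
    return time.time() - start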
    
# Analyze performance scaling with problem size
sizes = [128, 256, 512, 1024, 2048]
cpu_times, cupy_times, torch_times = [], [], []

for size in sizes:
    cpu_time, cupy_time, torch_time = test_performance(size)
    cpu_times.append(cpu_time)
    cupy_times.append(cupy_time)
    torch_times.append(torch_time)
    
# Print times for each matrix size
print("Performance Times:")
# Note: loop variables renamed so they don't shadow the imported torch/cupy modules
for size, cpu_t, cupy_t, torch_t in zip(sizes, cpu_times, cupy_times, torch_times):
    print(f"Matrix Size {size}x{size}:")
    print(f"  CPU (numpy): {cpu_t:.6f} seconds")
    print(f"  GPU (cupy): {cupy_t:.6f} seconds")
    print(f"  GPU (pytorch): {torch_t:.6f} seconds")
    print()
Performance Times:
Matrix Size 128x128:
  CPU (numpy): 0.004375 seconds
  GPU (cupy): 0.297219 seconds
  GPU (pytorch): 0.004101 seconds

Matrix Size 256x256:
  CPU (numpy): 0.006716 seconds
  GPU (cupy): 0.032556 seconds
  GPU (pytorch): 0.002549 seconds

Matrix Size 512x512:
  CPU (numpy): 0.005299 seconds
  GPU (cupy): 0.015380 seconds
  GPU (pytorch): 0.002463 seconds

Matrix Size 1024x1024:
  CPU (numpy): 0.041209 seconds
  GPU (cupy): 0.012589 seconds
  GPU (pytorch): 0.010114 seconds

Matrix Size 2048x2048:
  CPU (numpy): 0.483878 seconds
  GPU (cupy): 0.107040 seconds
  GPU (pytorch): 0.082670 seconds
# Display results
import matplotlib.pyplot as plt

plt.plot(sizes, cpu_times, label='CPU (numpy)', marker='o')
plt.plot(sizes, cupy_times, label='GPU (cupy)', marker='o')
plt.plot(sizes, torch_times, label='GPU (pytorch)', marker='o')
plt.xlabel('Matrix Size')
plt.ylabel('Time (seconds)')
plt.title('Performance Comparison: CPU vs GPU (cupy/pytorch)')
plt.legend()
plt.grid()
plt.show()
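The outline above also calls for reporting each implementation's speedup over the CPU; a minimal sketch of that calculation from the timing lists already collected (this loop is an addition, not in the original code):

# Sketch: CPU-relative speedups derived from the timing lists above
for size, cpu_t, cupy_t, torch_t in zip(sizes, cpu_times, cupy_times, torch_times):
    print(f"Matrix Size {size}x{size}: "
          f"CuPy speedup: {cpu_t / cupy_t:.2f}x, "
          f"PyTorch speedup: {cpu_t / torch_t:.2f}x")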

Analysis

Small matrix sizes (N ≤ 256)

  • CPU (NumPy) is as fast as or faster than the GPU
  • Because the GPU carries initialization overhead, performance on small inputs is actually worse
  • CuPy shows a larger initial slowdown than PyTorch at small sizes

Medium matrix sizes (256 < N ≤ 1024)

  • The GPU implementations (CuPy, PyTorch) begin to outperform the CPU
  • The performance gap between CuPy and PyTorch is negligible, and the GPU's large-scale parallelism gradually becomes more effective

Large matrix sizes (N > 1024)

  • The GPU's advantage widens
  • CPU time climbs steeply with matrix size (matrix multiplication scales as O(N³)), while GPU time rises comparatively gently
  • CuPy and PyTorch perform similarly at large sizes, demonstrating the efficiency of parallel GPU computation
# CPU version of the neural network
class NeuralNetworkCPU:
    def __init__(self, input_size, hidden_size):
        scale = np.sqrt(2.0 / (input_size + hidden_size))
        self.W1 = np.random.normal(0, scale, (input_size, hidden_size))
        self.b1 = np.zeros(hidden_size)
        self.W2 = np.random.normal(0, scale, (hidden_size, 1))
        self.b2 = np.zeros(1)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)  # ReLU
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]
        # For a sigmoid output with binary cross-entropy, the output-layer
        # gradient simplifies to (prediction - target)
        delta2 = self.a2 - y.reshape(-1, 1)
        # Backpropagate through W2, then zero the gradient where ReLU was inactive
        delta1 = np.dot(delta2, self.W2.T)
        delta1[self.z1 <= 0] = 0
        
        self.W2 -= learning_rate * np.dot(self.a1.T, delta2) / m
        self.b2 -= learning_rate * np.sum(delta2, axis=0) / m
        self.W1 -= learning_rate * np.dot(X.T, delta1) / m
        self.b1 -= learning_rate * np.sum(delta1, axis=0) / m
    
    def train(self, X, y, epochs=10):
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -np.mean(y.reshape(-1, 1) * np.log(output + 1e-7) + 
                          (1 - y.reshape(-1, 1)) * np.log(1 - output + 1e-7))
            self.backward(X, y)
            
            if (epoch + 1) % 2 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss:.4f}')
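One numerical caveat: `1 / (1 + np.exp(-x))` overflows for large negative inputs and emits a runtime warning. A numerically stable variant (an assumed alternative, not used in the classes here) splits the computation by sign:

# Sketch: numerically stable sigmoid (assumed alternative; the classes here use the direct form)
def stable_sigmoid(x):
    out = np.empty_like(x)
    pos = x >= 0
    out[pos] = 1 / (1 + np.exp(-x[pos]))   # exp of non-positive values cannot overflow
    exp_x = np.exp(x[~pos])                # exp of negative values cannot overflow
    out[~pos] = exp_x / (1 + exp_x)
    return out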
    
# GPU version of the neural network (CuPy)
class NeuralNetworkGPU:
    def __init__(self, input_size, hidden_size):
        scale = cp.sqrt(2.0 / (input_size + hidden_size))
        self.W1 = cp.random.normal(0, scale, (input_size, hidden_size))
        self.b1 = cp.zeros(hidden_size)
        self.W2 = cp.random.normal(0, scale, (hidden_size, 1))
        self.b2 = cp.zeros(1)
    
    def sigmoid(self, x):
        return 1 / (1 + cp.exp(-x))
    
    def forward(self, X):
        self.z1 = cp.dot(X, self.W1) + self.b1
        self.a1 = cp.maximum(0, self.z1)  # ReLU
        self.z2 = cp.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]
        delta2 = self.a2 - y.reshape(-1, 1)
        delta1 = cp.dot(delta2, self.W2.T)
        delta1[self.z1 <= 0] = 0
        
        self.W2 -= learning_rate * cp.dot(self.a1.T, delta2) / m
        self.b2 -= learning_rate * cp.sum(delta2, axis=0) / m
        self.W1 -= learning_rate * cp.dot(X.T, delta1) / m
        self.b1 -= learning_rate * cp.sum(delta1, axis=0) / m
    
    def train(self, X, y, epochs=10):
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -cp.mean(y.reshape(-1, 1) * cp.log(output + 1e-7) + 
                          (1 - y.reshape(-1, 1)) * cp.log(1 - output + 1e-7))
            self.backward(X, y)
            
            if (epoch + 1) % 2 == 0:
                print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss:.4f}')
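Neither class exposes an inference helper, so as a minimal sketch (a hypothetical helper, not part of the original classes), classification accuracy could be checked by thresholding the sigmoid output at 0.5:

# Sketch: accuracy from thresholded sigmoid outputs (hypothetical helper)
def accuracy(model, X, y, xp=np):
    # Works for both classes: pass xp=np for the CPU model, xp=cp for the GPU model
    probs = model.forward(X)          # shape (m, 1), sigmoid outputs in (0, 1)
    preds = (probs.ravel() >= 0.5)    # threshold at 0.5
    return float(xp.mean(preds == (y >= 0.5)))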
# Generate the dataset
print("\nGenerating dataset...")
X, y = make_classification(n_samples=10000, n_features=20, random_state=42)
X = X.astype(np.float32)
y = y.astype(np.float32)
X_train, _, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print("Dataset generated successfully!")

Generating dataset...
Dataset generated successfully!
hidden_sizes = [64, 128, 256]
results = {}

for hidden_size in hidden_sizes:
    print(f"\n{'='*50}")
    print(f"Testing with hidden size: {hidden_size}")
    print('='*50)
    
    # Train on CPU (NumPy)
    print("\nTraining on CPU...")
    start_time = time.time()
    model_cpu = NeuralNetworkCPU(X_train.shape[1], hidden_size)
    model_cpu.train(X_train, y_train)
    cpu_time = time.time() - start_time
    print(f"\nTraining completed in {cpu_time:.4f} seconds")
    
    # Train on GPU (CuPy)
    print("\nTraining on GPU...")
    X_gpu = cp.array(X_train)
    y_gpu = cp.array(y_train)
    start_time = time.time()
    model_gpu = NeuralNetworkGPU(X_gpu.shape[1], hidden_size)
    model_gpu.train(X_gpu, y_gpu)
    gpu_time = time.time() - start_time
    print(f"\nTraining completed in {gpu_time:.4f} seconds")
    
    # Compute the speedup
    speedup = cpu_time / gpu_time
    
    print(f"\nResults for hidden size {hidden_size}:")
    print(f"CPU Training time: {cpu_time:.4f} seconds")
    print(f"GPU Training time: {gpu_time:.4f} seconds")
    print(f"GPU Speedup: {speedup:.2f}x")
    
    results[hidden_size] = {
        'cpu_time': cpu_time,
        'gpu_time': gpu_time,
        'speedup': speedup
    }

==================================================
Testing with hidden size: 64
==================================================

Training on CPU...
Epoch [2/10], Loss: 0.6713
Epoch [4/10], Loss: 0.6660
Epoch [6/10], Loss: 0.6608
Epoch [8/10], Loss: 0.6557
Epoch [10/10], Loss: 0.6508

Training completed in 0.4558 seconds

Training on GPU...
Epoch [2/10], Loss: 0.8242
Epoch [4/10], Loss: 0.8115
Epoch [6/10], Loss: 0.7993
Epoch [8/10], Loss: 0.7875
Epoch [10/10], Loss: 0.7761

Training completed in 0.9028 seconds

Results for hidden size 64:
CPU Training time: 0.4558 seconds
GPU Training time: 0.9028 seconds
GPU Speedup: 0.50x

==================================================
Testing with hidden size: 128
==================================================

Training on CPU...
Epoch [2/10], Loss: 0.7127
Epoch [4/10], Loss: 0.7037
Epoch [6/10], Loss: 0.6950
Epoch [8/10], Loss: 0.6866
Epoch [10/10], Loss: 0.6785

Training completed in 0.9032 seconds

Training on GPU...
Epoch [2/10], Loss: 0.6536
Epoch [4/10], Loss: 0.6472
Epoch [6/10], Loss: 0.6410
Epoch [8/10], Loss: 0.6350
Epoch [10/10], Loss: 0.6292

Training completed in 0.1932 seconds

Results for hidden size 128:
CPU Training time: 0.9032 seconds
GPU Training time: 0.1932 seconds
GPU Speedup: 4.67x

==================================================
Testing with hidden size: 256
==================================================

Training on CPU...
Epoch [2/10], Loss: 0.7342
Epoch [4/10], Loss: 0.7246
Epoch [6/10], Loss: 0.7153
Epoch [8/10], Loss: 0.7064
Epoch [10/10], Loss: 0.6977

Training completed in 1.3714 seconds

Training on GPU...
Epoch [2/10], Loss: 0.7127
Epoch [4/10], Loss: 0.7037
Epoch [6/10], Loss: 0.6950
Epoch [8/10], Loss: 0.6866
Epoch [10/10], Loss: 0.6784

Training completed in 0.1177 seconds

Results for hidden size 256:
CPU Training time: 1.3714 seconds
GPU Training time: 0.1177 seconds
GPU Speedup: 11.65x

# Summary of final results
print("\n" + "="*50)
print("Final Performance Summary")
print("="*50)
for hidden_size, result in results.items():
    print(f"\nHidden Size: {hidden_size}")
    print(f"CPU Time: {result['cpu_time']:.4f} s")
    print(f"GPU Time: {result['gpu_time']:.4f} s")
    print(f"Speedup: {result['speedup']:.2f}x")

==================================================
Final Performance Summary
==================================================

Hidden Size: 64
CPU Time: 0.4558 s
GPU Time: 0.9028 s
Speedup: 0.50x

Hidden Size: 128
CPU Time: 0.9032 s
GPU Time: 0.1932 s
Speedup: 4.67x

Hidden Size: 256
CPU Time: 1.3714 s
GPU Time: 0.1177 s
Speedup: 11.65x
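As with the matrix multiplication test, these results can be visualized; below is a minimal sketch using the `results` dictionary collected above (the chart itself is an addition, not in the original post). The 0.50x slowdown at hidden size 64 plausibly reflects the same one-time GPU initialization overhead noted earlier, since it was the first CuPy training run in the session.

# Sketch: CPU vs GPU training time per hidden size, derived from `results`
import matplotlib.pyplot as plt
import numpy as np

labels = list(results.keys())
x = np.arange(len(labels))
width = 0.35

plt.bar(x - width/2, [results[h]['cpu_time'] for h in labels], width, label='CPU (numpy)')
plt.bar(x + width/2, [results[h]['gpu_time'] for h in labels], width, label='GPU (cupy)')
plt.xticks(x, labels)
plt.xlabel('Hidden Layer Size')
plt.ylabel('Training Time (seconds)')
plt.title('Neural Network Training: CPU vs GPU')
plt.legend()
plt.show()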
