Unveiling the Inner Workings of LLMs: A Singular Value Perspective | by Louis Owen | June 2024 | Intelligenza-Artificiale

Now let's get to the core of this article: analyzing the (Q, K, V, O) matrices of the Llama-3-8B-Instruct model through their singular values!

The Code

First, let's import all the packages needed for this analysis.

import transformers
import torch
import numpy as np
from transformers import AutoConfig, LlamaModel
from safetensors import safe_open
import os
import matplotlib.pyplot as plt

Next, we download the model and save it to our local /tmp directory.

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
!huggingface-cli download {MODEL_ID} --quiet --local-dir /tmp/{MODEL_ID}

If you are GPU-rich, the following code might not be relevant to you. However, if you are GPU-poor like me, it is really useful for loading only specific layers of the Llama-3-8B model.

def load_specific_layers_safetensors(model, model_name, layer_to_load):
    # Collect only the weights of the requested layer from every shard
    state_dict = {}
    files = [f for f in os.listdir(model_name) if f.endswith('.safetensors')]
    for file in files:
        filepath = os.path.join(model_name, file)
        with safe_open(filepath, framework="pt") as f:
            for key in f.keys():
                if f"layers.{layer_to_load}." in key:
                    # Rename to layer 0, the only layer in the reduced model
                    new_key = key.replace(f"model.layers.{layer_to_load}.", 'layers.0.')
                    state_dict[new_key] = f.get_tensor(key)

    missing_keys, unexpected_keys = model.load_state_dict(state_dict, strict=False)
    if missing_keys:
        print(f"Missing keys: {missing_keys}")
    if unexpected_keys:
        print(f"Unexpected keys: {unexpected_keys}")
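To make the key renaming concrete, here is a minimal sketch of what happens to the checkpoint key names. It assumes the keys follow the usual Llama layout (for example `model.layers.<i>.self_attn.q_proj.weight`); `remap_layer_keys` and the example keys are hypothetical and exist only for illustration. It matches on a prefix, which is slightly stricter than the substring check in the function above.

```python
def remap_layer_keys(keys, layer_to_load):
    """Keep only the keys of one layer and rename them to layer index 0."""
    prefix = f"model.layers.{layer_to_load}."
    remapped = {}
    for key in keys:
        if key.startswith(prefix):
            # Strip the original prefix and attach the single-layer prefix
            remapped[key] = "layers.0." + key[len(prefix):]
    return remapped


# Hypothetical key names, assumed to mirror a Llama checkpoint
example_keys = [
    "model.embed_tokens.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.11.self_attn.q_proj.weight",
    "model.layers.1.mlp.up_proj.weight",
]

mapping = remap_layer_keys(example_keys, 1)
for old, new in mapping.items():
    print(old, "->", new)
```

The trailing dot in the prefix is what keeps layer 1 from also matching layer 11. Since only one decoder layer is loaded, entries such as the embedding and final norm weights will most likely be printed as missing keys, which is expected with strict=False.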

We do this because the free tier of the Google Colab GPU is not enough to load Llama-3-8B, even in fp16 precision. Moreover, this analysis requires us to work in fp32 precision because of how np.linalg.svd is built. Next, we can define the main function that returns the singular values for a given matrix_type, layer_number, and head_number.
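Before the main function, a short aside on the SVD call it relies on. This is a minimal sketch on a random matrix, not on real model weights: the shapes and the float16 stand-in are assumptions chosen for illustration. NumPy's LAPACK-backed routines work in single or double precision, so half-precision weights are upcast first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one head's weight slice stored in half precision (assumed shape)
w_half = rng.standard_normal((8, 32)).astype(np.float16)

# Upcast to float32 before the decomposition
w = w_half.astype(np.float32)

# compute_uv=False skips U and V and returns only the singular values,
# sorted in descending order
singular_values = np.linalg.svd(w, compute_uv=False)

# Cross-check: singular values are the square roots of the eigenvalues of W W^T
eigenvalues = np.linalg.eigvalsh(w @ w.T)[::-1]
reference = np.sqrt(np.clip(eigenvalues, 0.0, None))

print(singular_values.shape)
print(np.allclose(singular_values, reference, rtol=1e-3, atol=1e-3))
```

A matrix of shape (m, n) has min(m, n) singular values, so a per-head slice yields as many values as the head dimension.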

def get_singular_values(model_path, matrix_type, layer_number, head_number):
    """
    Computes the singular values of the specified matrix in the Llama-3 model.

    Parameters:
    model_path (str): Path to the model
    matrix_type (str): Type of matrix ('q', 'k', 'v', 'o')
    layer_number (int): Layer number (0 to 31)
    head_number (int): Head number (0 to 31)

    Returns:
    list: Singular values in descending order
    """
    assert matrix_type in ('q', 'k', 'v', 'o'), "Invalid matrix type"
    assert 0 <= layer_number < 32, "Invalid layer number"
    assert 0 <= head_number < 32, "Invalid head number"

    # Load the model only for that specific layer since we have limited RAM even after using fp16
    config = AutoConfig.from_pretrained(model_path)
    config.num_hidden_layers = 1
    model = LlamaModel(config)
    load_specific_layers_safetensors(model, model_path, layer_number)

    # Access the specified layer
    # Always index 0 since we have loaded for the specific layer
    layer = model.layers[0]

    # Determine the size of each head
    # Read from the config rather than the attention module, whose attributes vary across transformers versions
    num_heads = config.num_attention_heads
    head_dim = config.hidden_size // num_heads

    # Access the specified matrix, stored as (d_out, d_in)
    weight_matrix = getattr(layer.self_attn, f"{matrix_type}_proj").weight.detach().numpy()

    if matrix_type in ('q', 'o'):
        start = head_number * head_dim
        end = (head_number + 1) * head_dim
    else:  # 'k', 'v' matrices
        # Adjust the head_number based on num_key_value_heads
        # This is done since llama3-8b uses Grouped Query Attention
        num_key_value_groups = num_heads // config.num_key_value_heads
        head_number_kv = head_number // num_key_value_groups
        start = head_number_kv * head_dim
        end = (head_number_kv + 1) * head_dim

    # Extract the weights for the specified head
    if matrix_type in ('q', 'k', 'v'):
        weight_matrix = weight_matrix[start:end, :]
    else:  # 'o' matrix
        weight_matrix = weight_matrix[:, start:end]

    # Compute singular values
    singular_values = np.linalg.svd(weight_matrix, compute_uv=False)

    del model, config

    return list(singular_values)
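The head-index arithmetic for the K and V matrices can be sketched in isolation. The default values below (32 attention heads, 8 key/value heads, head dimension 128) are assumptions taken from the published Llama-3-8B configuration, and `kv_head_slice` is a hypothetical helper written for illustration only.

```python
def kv_head_slice(head_number, num_heads=32, num_key_value_heads=8, head_dim=128):
    """Return (kv_head_index, start_row, end_row) for a given query head."""
    # With Grouped Query Attention, consecutive query heads share one K/V head
    num_key_value_groups = num_heads // num_key_value_heads
    kv_head = head_number // num_key_value_groups
    return kv_head, kv_head * head_dim, (kv_head + 1) * head_dim


for head in (0, 3, 4, 31):
    print(head, kv_head_slice(head))
```

Under these assumptions, query heads 0 to 3 all read the same K/V rows, so their K and V singular values are identical. Every slice passed to np.linalg.svd would have shape (128, 4096) for Q, K, V or (4096, 128) for O, so each call should yield 128 singular values.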

It is worth noting that we can extract the weights for the specified head of the Q, K, and V matrices with row-wise slicing because of how they are implemented by HuggingFace.

Implementation of the Q, K, V matrices in HuggingFace. Note that in PyTorch the matrix has shape (d_out, d_in). Source: image by the author.
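The row-wise slicing can be checked numerically on a toy example. This is a minimal NumPy sketch with small made-up dimensions, assuming the projection output is split into heads by reshaping its last axis into (num_heads, head_dim); it is not the HuggingFace implementation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 4, 3, 12  # toy sizes chosen for illustration

# PyTorch-style linear weight with shape (d_out, d_in)
w_q = rng.standard_normal((num_heads * head_dim, d_model))
x = rng.standard_normal((5, d_model))  # 5 token embeddings

# Full projection, then split the output features into heads
q_all = x @ w_q.T                                # (5, num_heads * head_dim)
q_heads = q_all.reshape(5, num_heads, head_dim)  # (5, num_heads, head_dim)

# Projection of a single head using only its rows of the weight matrix
head = 2
w_q_head = w_q[head * head_dim:(head + 1) * head_dim, :]  # (head_dim, d_model)
q_head = x @ w_q_head.T

print(np.allclose(q_head, q_heads[:, head, :]))
```

Because each output feature depends on exactly one row of the weight, the rows belonging to a head fully determine that head's queries (and likewise its keys and values).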

As for the O matrix, we can use column-wise slicing to extract the weights for the specified head, thanks to linear algebra! The details are shown in the following figure.
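The same identity can also be verified numerically. The sketch below uses toy dimensions and random per-head attention outputs, both assumptions for illustration: applying the output projection to the concatenated heads matches summing each head's contribution through its own column block of the O weight.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 4, 3, 12  # toy sizes chosen for illustration

w_o = rng.standard_normal((d_model, num_heads * head_dim))  # (d_out, d_in)
attn = rng.standard_normal((5, num_heads, head_dim))        # per-head outputs

# Standard path: concatenate the heads, then apply the output projection
out_full = attn.reshape(5, num_heads * head_dim) @ w_o.T

# Equivalent path: each head multiplies its own column block, then sum
out_sum = np.zeros((5, d_model))
for h in range(num_heads):
    w_o_head = w_o[:, h * head_dim:(h + 1) * head_dim]  # (d_model, head_dim)
    out_sum += attn[:, h, :] @ w_o_head.T

print(np.allclose(out_full, out_sum))
```

This is block matrix multiplication: the input features of O are ordered head by head, so each head only ever touches its own columns.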