Come migliorare i LLM con RAG. Un'introduzione adatta ai principianti con… | di Shaw Talebi | Marzo 2024 | Intelligenza-Artificiale

Indice contenuti

Importazioni

Iniziamo installando e importando le librerie Python necessarie.

!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
# if not running on Colab ensure transformers is installed too

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

Impostazione della base di conoscenza

Possiamo configurare la nostra base di conoscenza definendo il nostro modello di incorporamento, la dimensione dei blocchi e la sovrapposizione dei blocchi. Qui utilizziamo il parametro ~33M bge-small-en-v1.5 modello di incorporamento di BAAI, disponibile sull'hub Hugging Face. Sono disponibili altre opzioni del modello di incorporamento classifica per l'incorporamento del testo.

# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")Settings.llm = None # we won't use LlamaIndex to set-up LLM
Settings.chunk_size = 256
Settings.chunk_overlap = 25

Successivamente, carichiamo i nostri documenti di origine. Qui, ho una cartella chiamata “articoli“, che contiene le versioni PDF di 3 articoli Medium su cui ho scritto code grasse. Se lo esegui in Colab, devi scaricare la cartella degli articoli dal file Deposito GitHub e caricalo manualmente nel tuo ambiente Colab.

Per ogni file in questa cartella, la funzione seguente leggerà il testo dal PDF, lo dividerà in blocchi (in base alle impostazioni definite in precedenza) e memorizzerà ciascun blocco in un elenco chiamato documenti.

documents = SimpleDirectoryReader("articles").load_data()

Poiché i blog sono stati scaricati direttamente come PDF da Medium, assomigliano più a una pagina Web che a un articolo ben formattato. Pertanto, alcuni blocchi potrebbero includere testo non correlato all'articolo, ad esempio intestazioni di pagine Web e consigli sugli articoli di Medium.

Nel blocco di codice seguente, perfeziono i pezzi nei documenti, rimuovendo la maggior parte dei pezzi prima o dopo la parte centrale di un articolo.

print(len(documents)) # prints: 71
for doc in documents:
if "Member-only story" in doc.text:
documents.remove(doc)
continueif "The Data Entrepreneurs" in doc.text:
documents.remove(doc)
if " min read" in doc.text:
documents.remove(doc)
print(len(documents)) # prints: 61

Infine, possiamo memorizzare i pezzi raffinati in un database vettoriale.

index = VectorStoreIndex.from_documents(documents)

Configurazione del recuperatore

Con la nostra base di conoscenza in atto, possiamo creare un retriever utilizzando LlamaIndex VectorIndexRetreiver(), che restituisce i primi 3 blocchi più simili a una query utente.

# set number of docs to retreive
top_k = 3# configure retriever
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=top_k,
)

Successivamente, definiamo un motore di query che utilizza il retriever e la query per restituire un insieme di blocchi rilevanti.

# assemble query engine
query_engine = RetrieverQueryEngine(
retriever=retriever,
node_postprocessors=(SimilarityPostprocessor(similarity_cutoff=0.5)),
)

Utilizza il motore di query

Ora, con la nostra base di conoscenza e il sistema di recupero impostati, usiamoli per restituire blocchi rilevanti per una query. Qui passeremo alla stessa domanda tecnica che abbiamo posto a ShawGPT (il risponditore dei commenti di YouTube) dell'articolo precedente.

query = "What is fat-tailedness?"
response = query_engine.query(query)

Il motore di query restituisce un oggetto di risposta contenente il testo, i metadati e gli indici dei blocchi rilevanti. Il blocco di codice seguente restituisce una versione più leggibile di queste informazioni.

# reformat response
context = "Context:\n"
for i in range(top_k):
context = context + response.source_nodes(i).text + "\n\n"print(context)

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma (2).
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness â measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that âfat-tailednessâ is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 â 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape (2).print("mean kappa_1n = " + str(np.mean(kappa_dict(filename))))
print("")
Mean Îº (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results indicate Medium followers are the most fat-tailed,
followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref (3) to better understand each
Îº value. Namely, these values are comparable to a Pareto distribution with Î±
between 2 and 3.
Although each heuristic told a slightly different story, all signs point toward
Medium followers gained being the most fat-tailed of the 3 datasets.
Conclusion
While binary labeling data as fat-tailed (or not) may be tempting, fat-
tailedness lives on a spectrum. Here, we broke down 4 heuristics for
quantifying how fat-tailed data are.
Pareto, Power Laws, and Fat Tails
What they donât teach you in statistics
towardsdatascience.com
Although Pareto (and more generally power law) distributions give us a
salient example of fat tails, this is a more general notion that lives on a
spectrum ranging from thin-tailed (i.e. a Gaussian) to very fat-tailed (i.e.
Pareto 80 â 20).
The spectrum of Fat-tailedness. Image by author.
This view of fat-tailedness provides us with a more flexible and precise way of
categorizing data than simply labeling it as a Power Law (or not). However,
this begs the question: how do we define fat-tailedness?
4 Ways to Quantify Fat Tails

Aggiunta di RAG a LLM

Iniziamo scaricando il file modello messo a punto dall'hub Hugging Face.

# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizermodel_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
device_map="auto",
trust_remote_code=False,
revision="main")
config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Come riferimento, possiamo vedere come il modello risponde alla domanda tecnica senza alcun contesto dagli articoli. A tale scopo, creiamo un modello di prompt utilizzando una funzione lambda, che accetta un commento del visualizzatore e restituisce un prompt per LLM. Per ulteriori dettagli sulla provenienza di questo messaggio, vedere il articolo precedente di questa serie.

# prompt (no context)
intstructions_string = f"""ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. It reacts to feedback aptly and ends \
responses with its signature 'âShawGPT'.ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.
Please respond to the following comment.
"""
prompt_template = lambda comment: f'''(INST) {intstructions_string} \n{comment} \n(/INST)'''
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)
print(prompt)

(INST) ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature 'âShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.Please respond to the following comment.
What is fat-tailedness? 
(/INST)

Successivamente, possiamo passare questo prompt al modello utilizzando il codice seguente.

model.eval()inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs("input_ids").to("cuda"), 
max_new_tokens=280)
print(tokenizer.batch_decode(outputs)(0))

Ecco la risposta del modello (senza contesto).

Great question!Fat-tailedness is a statistical property of a distribution. In simple terms, 
it refers to the presence of extreme outliers or heavy tails in the 
distribution.
For instance, consider the distribution of heights in a population. A normal 
distribution would have most people clustered around an average height with a 
few people deviating slightly from the mean. However, in a fat-tailed 
distribution, you would observe a larger number of people being significantly 
taller or shorter than the average.
In finance, fat-tailedness is often used to describe the presence of extreme 
market moves or outliers in financial returns. It's an important concept in 
risk management and financial modeling.
I hope this explanation helps! Let me know if you have any questions.
âShawGPT

Sebbene lo stile e la formattazione della risposta siano ottimi, la spiegazione del modello della coda grassa differisce da come l'ho definita nel mio video E serie di blog.

Vediamo cosa succede alla risposta del modello quando includiamo il contesto appropriato. Per fare ciò, creiamo un altro modello di prompt, che può anche prendere in contesto dal sistema RAG.

# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""(INST)ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature 'âShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.{context}
Please respond to the following comment. Use the context above if it is helpful.
{comment}
(/INST)
"""

Successivamente, passiamo il prompt con il contesto dal motore di query e il commento della vista al modello.

prompt = prompt_template_w_context(context, comment)inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs("input_ids").to("cuda"), max_new_tokens=280)
print(tokenizer.batch_decode(outputs)(0))

Ecco la nuova risposta (con contesto).

Great question!Fat-tailedness is a property of a distribution that describes how much the 
presence of extreme values (rare events) affects the overall statistics of the 
distribution.
In the context of the video, the author discusses how Mediocristan and 
Extremistan distributions differ in terms of fat-tailedness. Mediocristan 
distributions have a relatively small impact from extreme values, while 
Extremistan distributions are heavily influenced by extreme values.
The author also mentions that fat-tailedness is a more general concept than 
specific distributions like Pareto or Power Laws. Instead, it lives on a 
spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20).
I hope that helps clarify things a bit! Let me know if you have any questions.
âShawGPT

Questo fa un lavoro molto migliore nel catturare la mia spiegazione delle code grasse rispetto alla risposta senza contesto e richiama anche i concetti di nicchia di Mediocristan ed Extremistan.

Qui ho fornito un'introduzione a RAG adatta ai principianti e ho condiviso un esempio concreto di come implementarlo utilizzando LlamaIndex. RAG ci consente di migliorare un sistema LLM con conoscenze aggiornabili e specifiche del dominio.

Mentre gran parte del recente clamore sull'intelligenza artificiale è incentrato sulla creazione di assistenti AI, un'innovazione potente (anche se meno popolare) è arrivata dagli incorporamenti di testo (ovvero le cose che usavamo per recuperare). Nel prossimo articolo di questa serie, esplorerò incorporamenti di testo in modo più dettagliato, compreso il modo in cui possono essere utilizzati ricerca semantica E compiti di classificazione.

Altro sui LLM ðŸ'‡