Extracting Text from PDF Files with Python: A Comprehensive Guide | by George Stavrakis | September 2023 | Intelligenza-Artificiale

Now that all the components of the code are ready, let's put them together into a fully functional script. You can copy the code from here, or you can find it together with the example PDF in my GitHub repository.

import os

import pdfplumber
import PyPDF2
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTRect, LTTextContainer

# text_extraction, crop_image, convert_to_images, image_to_text,
# extract_table and table_converter are the helper components referred to above.

# Find the PDF path
pdf_path = 'OFFER 3.pdf'

# Create a PDF file object
pdfFileObj = open(pdf_path, 'rb')
# Create a PDF reader object
pdfReaded = PyPDF2.PdfReader(pdfFileObj)

# Create the dictionary that will hold the extracted text of each page
text_per_page = {}
# We extract the pages from the PDF
for pagenum, page in enumerate(extract_pages(pdf_path)):

    # Initialize the variables needed for the text extraction from the page
    pageObj = pdfReaded.pages[pagenum]
    page_text = []
    line_format = []
    text_from_images = []
    text_from_tables = []
    page_content = []
    # Initialize the number of the examined tables
    table_num = 0
    first_element = True
    table_extraction_flag = False
    # Open the pdf file
    pdf = pdfplumber.open(pdf_path)
    # Find the examined page
    page_tables = pdf.pages[pagenum]
    # Find the tables on the page
    tables = page_tables.find_tables()

    # Find all the elements
    page_elements = [(element.y1, element) for element in page._objs]
    # Sort all the elements as they appear in the page (top to bottom)
    page_elements.sort(key=lambda a: a[0], reverse=True)

    # Go through the elements that compose the page
    for i, component in enumerate(page_elements):
        # Extract the position of the top side of the element in the PDF
        pos = component[0]
        # Extract the element of the page layout
        element = component[1]

        # Check if the element is a text element
        if isinstance(element, LTTextContainer):
            # Check if the text appeared in a table
            if table_extraction_flag == False:
                # Use the function to extract the text and format for each text element
                (line_text, format_per_line) = text_extraction(element)
                # Append the text of each line to the page text
                page_text.append(line_text)
                # Append the format for each line containing text
                line_format.append(format_per_line)
                page_content.append(line_text)
            else:
                # Omit the text that appeared in a table
                pass

        # Check the elements for images
        if isinstance(element, LTFigure):
            # Crop the image from the PDF
            crop_image(element, pageObj)
            # Convert the cropped pdf to an image
            convert_to_images('cropped_image.pdf')
            # Extract the text from the image
            image_text = image_to_text('PDF_image.png')
            text_from_images.append(image_text)
            page_content.append(image_text)
            # Add a placeholder in the text and format lists
            page_text.append('image')
            line_format.append('image')

        # Check the elements for tables
        if isinstance(element, LTRect):
            # If this is the first rectangular element of a table
            if first_element == True and (table_num + 1) <= len(tables):
                # Find the bounding box of the table
                lower_side = page.bbox[3] - tables[table_num].bbox[3]
                upper_side = element.y1
                # Extract the information from the table
                table = extract_table(pdf_path, pagenum, table_num)
                # Convert the table information into a structured string format
                table_string = table_converter(table)
                # Append the table string into a list
                text_from_tables.append(table_string)
                page_content.append(table_string)
                # Set the flag to True to avoid extracting the content again
                table_extraction_flag = True
                # The next rectangle is no longer the first one
                first_element = False
                # Add a placeholder in the text and format lists
                page_text.append('table')
                line_format.append('table')

            # Check if we already extracted the tables from the page
            if element.y0 >= lower_side and element.y1 <= upper_side:
                pass
            # The bounds check avoids an IndexError when the rectangle is the last element
            elif i + 1 >= len(page_elements) or not isinstance(page_elements[i + 1][1], LTRect):
                table_extraction_flag = False
                first_element = True
                table_num += 1

    # Create the key of the dictionary
    dctkey = 'Page_' + str(pagenum)
    # Add the list of lists as the value of the page key
    text_per_page[dctkey] = [page_text, line_format, text_from_images, text_from_tables, page_content]

# Close the pdf file object
pdfFileObj.close()

# Delete the additional files created
os.remove('cropped_image.pdf')
os.remove('PDF_image.png')

# Display the content of the page
result = ''.join(text_per_page['Page_0'][4])
print(result)

The script above will:

Import the necessary libraries.

Open the PDF file using the PyPDF2 library.

Extract each page of the PDF and repeat the following steps.

Check whether there are any tables on the page and build a list of them using pdfplumber.

Find all the elements nested in the page and sort them in the order they appear in its layout.

Then for each element:

Check whether it is a text container that does not appear inside a table element. If so, use the text_extraction() function to extract the text along with its format; otherwise, skip this text.

Check whether it is an image. If so, use the crop_image() function to crop the image component from the PDF, convert it into an image file using convert_to_images(), and extract text from it using OCR with the image_to_text() function.

Check whether it is a rectangular element. In this case, we examine whether the first rect is part of a table on the page and, if so, we move to the following steps:

Find the bounding box of the table, so that its text is not extracted again with the text_extraction() function.
Extract the content of the table and convert it into a string.
Then set a boolean flag to indicate that we are extracting text from a table.
This process ends after the last LTRect that falls inside the bounding box of the table, when the next element in the layout is not a rectangular object. (All the other objects that compose the table are skipped.)
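The bounding-box check is easier to follow once the two coordinate systems are lined up: pdfplumber reports a table's bbox as (x0, top, x1, bottom) measured from the top of the page, while pdfminer elements carry y0 and y1 measured from the bottom. The sketch below is a simplified illustration, not the script's exact logic: it uses a hypothetical Box stand-in instead of real layout objects, derives both table edges from the page height (the script takes the upper edge from the first LTRect instead), and filters text by position rather than with a flag.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a layout element (bottom-origin coordinates)."""
    kind: str   # "text" or "rect"
    y0: float   # bottom edge
    y1: float   # top edge


def table_vertical_range(page_height, table_top, table_bottom):
    """Convert top-origin table edges into bottom-origin (lower, upper) limits."""
    return page_height - table_bottom, page_height - table_top


def is_inside(box, lower, upper):
    """Same comparison the script applies to each rectangle."""
    return box.y0 >= lower and box.y1 <= upper


page_height = 800.0
lower, upper = table_vertical_range(page_height, table_top=100.0, table_bottom=300.0)

elements = [
    Box("text", 720.0, 740.0),  # heading above the table
    Box("rect", 500.0, 700.0),  # table border
    Box("text", 650.0, 670.0),  # text inside a table cell
    Box("text", 450.0, 470.0),  # paragraph below the table
]

# Keep only the text that lies outside the table's vertical range
kept = [b for b in elements if b.kind == "text" and not is_inside(b, lower, upper)]
print(lower, upper, kept)
```

Filtering by position avoids depending on the order in which rectangles and text appear, at the cost of ignoring the horizontal extent of the table.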

The outputs of the process are stored in 5 lists per iteration, named:

page_text: contains the text coming from the text containers in the PDF (a placeholder is inserted when the text was extracted from another kind of element)
line_format: contains the formats of the texts extracted above (a placeholder is inserted when the text was extracted from another kind of element)
text_from_images: contains the texts extracted from the images on the page
text_from_tables: contains the table-like string with the contents of the tables
page_content: contains all the text rendered on the page as a list of elements

All the lists are stored in a dictionary under a key that represents the number of the page examined each time.
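Since the five lists are stored by position, reading them back by index (for example [4] for page_content) can be hard to follow. As a minimal sketch, assuming the layout described above, the helper below maps the positions to names. The sample values, the FIELDS tuple and the page_record function are invented for illustration and are not part of the original script.

```python
# Hypothetical sample that mirrors the structure of text_per_page
text_per_page = {
    "Page_0": [
        ["Offer summary\n", "table", "image"],         # page_text
        [["Helvetica-Bold", 14.0], "table", "image"],  # line_format
        ["Scanned stamp text\n"],                      # text_from_images
        ["|Item|Price|\n|Pen|2|\n"],                   # text_from_tables
        ["Offer summary\n", "|Item|Price|\n|Pen|2|\n", "Scanned stamp text\n"],  # page_content
    ]
}

# Order in which the script stores the lists for each page
FIELDS = ("page_text", "line_format", "text_from_images",
          "text_from_tables", "page_content")


def page_record(pages, pagenum):
    """Return the five lists of one page as a name -> list mapping."""
    return dict(zip(FIELDS, pages[f"Page_{pagenum}"]))


record = page_record(text_per_page, 0)
print(record["text_from_tables"])
# page_text and line_format stay aligned thanks to the placeholders
print(len(record["page_text"]) == len(record["line_format"]))
```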

Afterwards, we close the PDF file.

Then we delete all the additional files created during the process.

Finally, we can display the content of the page by joining the elements of the page_content list.
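The same join can be extended from one page to the whole document. The sketch below assumes the 'Page_<n>' key format and the position of page_content (index 4) used by the script. The join_page and join_document helpers and the sample data are hypothetical. Sorting on the numeric part of the key keeps 'Page_10' after 'Page_2' even if the dictionary was not filled in page order.

```python
def join_page(pages, pagenum):
    """Join the page_content list (index 4) of a single page."""
    return "".join(pages[f"Page_{pagenum}"][4])


def join_document(pages):
    """Join page_content across all pages, ordered by page number."""
    ordered_keys = sorted(pages, key=lambda key: int(key.split("_")[1]))
    return "".join("".join(pages[key][4]) for key in ordered_keys)


# Hypothetical sample: only page_content is filled in
sample = {
    "Page_10": [[], [], [], [], ["Closing remarks\n"]],
    "Page_0": [[], [], [], [], ["Title\n", "Intro paragraph\n"]],
    "Page_2": [[], [], [], [], ["|A|B|\n"]],
}

print(join_page(sample, 0))
print(join_document(sample))
```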

Source: towardsdatascience.com