Summarizing PDF Content Using LLMs

In this tutorial, we'll use Python and the GPT-3.5 model, served through MindsDB Inference Endpoints, to summarize text from a PDF. We'll walk through extracting the text from the PDF and then summarizing it. You'll also see how to experiment with a different Large Language Model (LLM) by changing only the model name in your script.

Prerequisites:

  • Install Python 3 (for example, from python.org); pip is included with it.
  • Install PyPDF2, requests, and the OpenAI SDK by running pip install PyPDF2 requests openai.
  • Create a Python file (for example, minds_test.py) with the code content presented below.

Step 1: Extract Text from a PDF File

In this initial step, you will create a Python function extract_text_from_pdf that downloads a PDF from a URL, reads each page, and concatenates the extracted text. The summarization step in Step 2 takes this text as its input.

import io

import PyPDF2
import requests
from PyPDF2.errors import PdfReadError  # available in PyPDF2 2.0 and later

def extract_text_from_pdf(pdf_path):
    """
    Downloads a PDF file and extracts its text.
    
    :param pdf_path: URL of the PDF file.
    :return: Extracted text as a string or None if an error occurs.
    """
    try:
        response = requests.get(pdf_path, timeout=30)
        if response.status_code != 200:
            print(f"Failed to download PDF. Status code: {response.status_code}")
            return None
        # Wrap the downloaded bytes in a file-like object for PyPDF2.
        pdf_reader = PyPDF2.PdfReader(io.BytesIO(response.content))
        # Pages without extractable text (e.g. scanned images) contribute nothing.
        return "".join(page.extract_text() or "" for page in pdf_reader.pages)
    except requests.RequestException as e:
        print(f"An error occurred while downloading the PDF file: {e}")
        return None
    except PdfReadError as e:
        print(f"An error occurred while reading the PDF file: {e}")
        return None
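
The function above only handles PDFs fetched over HTTP. If your PDF is on disk instead, one option is a small loader that accepts either a URL or a local path. The following is a minimal sketch using only the standard library; load_pdf_bytes is a hypothetical helper name and not part of the original tutorial.

```python
import io
import urllib.request
from urllib.parse import urlparse


def load_pdf_bytes(source):
    """Return a file-like object for a PDF given an http(s) URL or a local path.

    Hypothetical helper, not part of the original tutorial.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        # Remote file: download the whole body into memory.
        with urllib.request.urlopen(source, timeout=30) as response:
            data = response.read()
    else:
        # Anything else is treated as a path on the local file system.
        with open(source, "rb") as f:
            data = f.read()
    return io.BytesIO(data)
```

The returned object can be passed to PyPDF2.PdfReader in the same way as the in-memory buffer used in extract_text_from_pdf.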

Step 2: Summarize the Extracted Text Using MindsDB Serve

Next, define a function summarize_text_from_pdf that calls the MindsDB Serve API through the OpenAI SDK. It uses the GPT-3.5 model by default; adjust the summary_prompt parameter for other kinds of summarization tasks.

def summarize_text_from_pdf(MINDS_API_KEY, pdf_text, model="gpt-3.5-turbo", summary_prompt="Summarize the following text."):
    """
    Summarizes the text extracted from a PDF using the MindsDB Serve API.
    
    :param MINDS_API_KEY: API key for authentication with MindsDB Serve API.
    :param pdf_text: Text content extracted from the PDF to summarize.
    :param model: Model to be used.
    :param summary_prompt: Prompt to use for text summarization.
    :return: Summary of the text or an error message.
    """
    from openai import OpenAI, OpenAIError
    if not pdf_text:
        return "No text available to summarize."
        
    try:
        client = OpenAI(
            api_key=MINDS_API_KEY,
            base_url="https://mdb.ai"
        )
        chat_completion_gpt = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": summary_prompt,
                },
                {"role": "user", "content": pdf_text}
            ],
            model=model,
        )
        return chat_completion_gpt.choices[0].message.content
    except OpenAIError as e:
        print(f"An error occurred with the MindsDB Serve API: {e}")
        return "Error in text summarization."
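
Language models accept a limited amount of input, so a long PDF may not fit in a single request. One common workaround is to split the text into smaller pieces and summarize each piece separately. The sketch below shows one way to do the splitting; split_into_chunks is a hypothetical helper, and the default of 8000 characters is an arbitrary assumption (model limits are measured in tokens, not characters).

```python
def split_into_chunks(text, max_chars=8000):
    """Split text into chunks of at most max_chars, breaking on whitespace.

    Hypothetical helper. Runs of whitespace (including newlines) are
    collapsed to single spaces in the output.
    """
    chunks = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is cut into fixed-size pieces.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this word would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to summarize_text_from_pdf, and the partial summaries joined and summarized once more to produce a single result.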

Step 3: Perform the Summarization

The __main__ block below orchestrates the text extraction and summarization process. Before running it, set your Minds API key in MINDS_API_KEY and the URL of the PDF in pdf_path. After extracting text from the PDF, the script summarizes the content and prints the result.

import requests
import io
import PyPDF2
if __name__ == "__main__":
    MINDS_API_KEY = "xyz"  # ADD YOUR MINDS API KEY HERE
    SUMMARY_PROMPT = "Summarize the following text." # Change the summarization prompt as needed
    MODEL = "gpt-3.5-turbo" # Change the LLM here
    pdf_path = "https://www.princexml.com/samples/invoice/invoicesample.pdf" # Change this URL to summarize a different PDF
    pdf_text = extract_text_from_pdf(pdf_path)
    if pdf_text:
        summary = summarize_text_from_pdf(MINDS_API_KEY, pdf_text, MODEL, SUMMARY_PROMPT)
        print(summary)
    else:
        print("Failed to extract text from the PDF.")
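
Hard-coding the API key is convenient for a quick test, but it is easy to commit the key to version control by accident. A minimal sketch of reading it from an environment variable instead follows; get_api_key is a hypothetical helper, and the environment variable name MINDS_API_KEY is an assumption chosen to match the variable in the script.

```python
import os


def get_api_key(env_var="MINDS_API_KEY"):
    """Read the API key from an environment variable; raise if it is missing.

    Hypothetical helper; the variable name is an assumption.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the script."
        )
    return key
```

In the __main__ block, the hard-coded value could then be replaced with MINDS_API_KEY = get_api_key().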

Step 4: Run the Script

Execute the script in your terminal or command prompt.

python minds_test.py

The script will print the summary of the extracted text to the console. You can modify the script to save the summary to a file or perform additional processing as needed.
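
As one illustration of saving the summary to a file, here is a minimal sketch using the standard library. The helper name save_summary and the default file name summary.txt are assumptions, not part of the original script.

```python
from pathlib import Path


def save_summary(summary, output_path="summary.txt"):
    """Write the summary to a UTF-8 text file and return its path.

    Hypothetical helper; overwrites the file if it already exists.
    """
    path = Path(output_path)
    path.write_text(summary, encoding="utf-8")
    return path
```

Calling save_summary(summary) right after print(summary) in the __main__ block would keep a copy of the result next to the script.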

Step 5: Switching Between Different Language Models

Refer to the supported language models list to choose an alternative model. For example, to use the Gemma-7B model from Google, change the MODEL variable in Step 3 to MODEL = "gemma-7b".
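
To compare several models on the same text, the model name can be looped over rather than edited by hand. The sketch below assumes a hypothetical helper compare_models that takes any callable of the form summarize(text, model), so it can wrap summarize_text_from_pdf without depending on it directly.

```python
def compare_models(summarize, pdf_text, models):
    """Run the same text through several models and collect the summaries.

    Hypothetical helper. `summarize` is any callable taking (text, model)
    and returning a string.
    """
    results = {}
    for model in models:
        results[model] = summarize(pdf_text, model)
    return results


# Example wiring with the tutorial's function (requires a valid API key):
# summaries = compare_models(
#     lambda text, model: summarize_text_from_pdf(MINDS_API_KEY, text, model),
#     pdf_text,
#     ["gpt-3.5-turbo", "gemma-7b"],
# )
```

The result is a dictionary keyed by model name, which makes it straightforward to print the summaries side by side.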
