Wednesday, March 12, 2025

How to Extract Specific Pages from a PDF Using Python


Managing large PDF files can be a challenge, especially when you only need to work with a specific set of pages. Fortunately, Python provides powerful libraries to manipulate PDFs with ease. In this post, we’ll explore how to extract specific pages from a PDF using PyMuPDF (fitz).

Why Extract Specific Pages?

There are several reasons why you might want to extract pages from a PDF:

  • To separate chapters from an eBook.

  • To share only the relevant sections of a document.

  • To split large PDFs into smaller, manageable parts.

Installing the Required Library

Before we begin, make sure you have PyMuPDF installed. If not, install it using pip:

pip install pymupdf

Python Code to Extract Pages from a PDF

Below is a Python script that extracts pages 3 to 40 (counted from 1, both ends included) from an input PDF and saves them as a new file.

import fitz  # PyMuPDF

def extract_pdf_pages(input_pdf, output_pdf, start_page=3, end_page=40):
    doc = fitz.open(input_pdf)  # Open the input PDF
    new_doc = fitz.open()  # Create a new empty PDF

    total_pages = len(doc)
    # Ensure we don't exceed total pages
    for i in range(start_page - 1, min(end_page, total_pages)):  # Adjust for zero-based index
        new_doc.insert_pdf(doc, from_page=i, to_page=i)

    new_doc.save(output_pdf)  # Save the extracted pages
    new_doc.close()
    doc.close()

    # Report the last page actually copied, which may be less than end_page
    print(f"Pages {start_page} to {min(end_page, total_pages)} saved as '{output_pdf}'.")

# Example usage
input_pdf = "input.pdf"  # Replace with your actual input file
output_pdf = "output.pdf"
extract_pdf_pages(input_pdf, output_pdf)

Understanding the Code

  • fitz.open(input_pdf): Opens the existing PDF.

  • fitz.open(): Creates a new empty PDF.

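  • for i in range(start_page - 1, min(end_page, total_pages)): Iterates through the requested pages. Subtracting 1 converts the 1-based start page to PyMuPDF's zero-based index, and min() keeps the loop from running past the last page of the document.

  • new_doc.insert_pdf(doc, from_page=i, to_page=i): Copies page i from the source document into the new PDF. Both from_page and to_page are zero-based and inclusive, so each call copies exactly one page.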

  • new_doc.save(output_pdf): Saves the extracted pages into a new PDF file.
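Since from_page and to_page are inclusive, a contiguous range can also be copied with a single insert_pdf call instead of a loop. Here is a minimal sketch of that variant. The helper names resolve_page_range and extract_page_range are hypothetical (they are not part of PyMuPDF), and raising an error for an empty range is an assumption about how you would want that case handled.

```python
def resolve_page_range(start_page, end_page, total_pages):
    """Convert a 1-based inclusive page range to zero-based inclusive indices.

    The end is clamped to the document length; an empty range raises ValueError.
    """
    if start_page < 1:
        raise ValueError("start_page must be 1 or greater")
    last_page = min(end_page, total_pages)
    if start_page > last_page:
        raise ValueError(
            f"No pages to extract: range {start_page}-{end_page}, "
            f"document has {total_pages} page(s)"
        )
    return start_page - 1, last_page - 1


def extract_page_range(input_pdf, output_pdf, start_page, end_page):
    """Copy a contiguous page range into a new PDF with one insert_pdf call."""
    import fitz  # PyMuPDF; imported here so resolve_page_range works without it

    # Both documents are closed automatically when the with block ends
    with fitz.open(input_pdf) as doc, fitz.open() as new_doc:
        first, last = resolve_page_range(start_page, end_page, len(doc))
        new_doc.insert_pdf(doc, from_page=first, to_page=last)
        new_doc.save(output_pdf)
    return last - first + 1  # number of pages written
```

Calling extract_page_range("input.pdf", "output.pdf", 3, 40) should produce the same output file as the loop version, and it fails early with a clear message if the requested range lies entirely past the end of the document.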

Example Scenario

Suppose you have a PDF book and only need pages 3 to 40. Run the script, and it will generate a new PDF named output.pdf containing just those pages. This kind of scenario comes up often when preparing documents for RAG (retrieval-augmented generation).
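If you need the whole book in smaller pieces rather than one range, the same insert_pdf call can be applied to fixed-size batches of pages. The sketch below is one possible approach. The names page_batches and split_pdf, the default batch size of 20, and the output file naming pattern are all assumptions made for illustration.

```python
def page_batches(total_pages, batch_size):
    """Yield (from_page, to_page) pairs as zero-based, inclusive indices."""
    if batch_size < 1:
        raise ValueError("batch_size must be 1 or greater")
    for first in range(0, total_pages, batch_size):
        # The last batch may be shorter than batch_size
        yield first, min(first + batch_size, total_pages) - 1


def split_pdf(input_pdf, output_prefix, batch_size=20):
    """Write each batch to '<output_prefix>_<n>.pdf' and return the paths."""
    import fitz  # PyMuPDF; imported lazily so page_batches has no dependency

    paths = []
    with fitz.open(input_pdf) as doc:
        batches = page_batches(len(doc), batch_size)
        for n, (first, last) in enumerate(batches, start=1):
            path = f"{output_prefix}_{n}.pdf"
            with fitz.open() as part:  # New empty PDF for this batch
                part.insert_pdf(doc, from_page=first, to_page=last)
                part.save(path)
            paths.append(path)
    return paths
```

For a 45-page document, list(page_batches(45, 20)) gives [(0, 19), (20, 39), (40, 44)], so split_pdf would write three files, the last one holding the remaining five pages.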

Conclusion

This simple script allows you to extract specific pages from a PDF efficiently using Python. Whether you're working with large reports, eBooks, or scanned documents, this method can save you time and effort.

Try it out and let us know how it works for you! 

