Managing large PDF files can be a challenge, especially when you only need to work with a specific set of pages. Fortunately, Python has powerful libraries for manipulating PDFs. In this post, we’ll explore how to extract specific pages from a PDF using PyMuPDF (imported as fitz).
Why Extract Specific Pages?
There are several reasons why you might want to extract pages from a PDF:
To separate chapters from an eBook.
To share only the relevant sections of a document.
To split large PDFs into smaller, manageable parts.
Installing the Required Library
Before we begin, make sure you have PyMuPDF installed. If not, install it using pip:
pip install pymupdf
Python Code to Extract Pages from a PDF
Below is a Python script that extracts pages 3 to 40 from an input PDF and saves them as a new file.
import fitz  # PyMuPDF

def extract_pdf_pages(input_pdf, output_pdf, start_page=3, end_page=40):
    doc = fitz.open(input_pdf)  # Open the input PDF
    new_doc = fitz.open()  # Create a new empty PDF
    total_pages = len(doc)

    # Ensure we don't exceed total pages
    for i in range(start_page - 1, min(end_page, total_pages)):  # Adjust for zero-based index
        new_doc.insert_pdf(doc, from_page=i, to_page=i)

    new_doc.save(output_pdf)  # Save the extracted pages
    new_doc.close()
    doc.close()
    # Report the last page actually copied, which may be less than end_page
    print(f"Pages {start_page} to {min(end_page, total_pages)} saved as '{output_pdf}'.")

# Example usage
input_pdf = "input.pdf"  # Replace with your actual input file
output_pdf = "output.pdf"
extract_pdf_pages(input_pdf, output_pdf)
Understanding the Code
fitz.open(input_pdf): Opens the existing PDF.
fitz.open(): Called with no arguments, creates a new empty PDF.
for i in range(start_page - 1, min(end_page, total_pages)): Iterates through the requested pages. Subtracting 1 converts the 1-based start page to PyMuPDF's zero-based index, and min() ensures we don’t exceed the total number of pages.
new_doc.insert_pdf(doc, from_page=i, to_page=i): Copies page i from the input PDF into the new document.
new_doc.save(output_pdf): Saves the extracted pages into a new PDF file.
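The loop in the script copies one page at a time. As a minimal sketch, assuming the pages you want are contiguous, insert_pdf can also copy the whole range in a single call, because its from_page and to_page arguments are zero-based and inclusive. The names resolve_page_range and extract_page_range are hypothetical, chosen only for illustration. The range check also rejects a start page beyond the end of the document, a case in which the new document would otherwise be empty when saved.

```python
def resolve_page_range(start_page, end_page, total_pages):
    """Convert a 1-based inclusive page range to zero-based inclusive indices.

    Hypothetical helper: clips end_page to the document length and raises
    ValueError if the range is empty or starts past the last page.
    """
    if start_page < 1 or end_page < start_page:
        raise ValueError(f"Invalid page range: {start_page} to {end_page}")
    if start_page > total_pages:
        raise ValueError(
            f"start_page {start_page} exceeds the document's {total_pages} pages"
        )
    last_page = min(end_page, total_pages)
    return start_page - 1, last_page - 1


def extract_page_range(input_pdf, output_pdf, start_page=3, end_page=40):
    """Copy a contiguous page range into a new PDF with one insert_pdf call."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(input_pdf)
    try:
        first, last = resolve_page_range(start_page, end_page, len(doc))
        new_doc = fitz.open()  # New empty PDF
        new_doc.insert_pdf(doc, from_page=first, to_page=last)
        new_doc.save(output_pdf)
        new_doc.close()
    finally:
        doc.close()
    # Return the 1-based range that was actually written
    return first + 1, last + 1
```

Calling extract_page_range("input.pdf", "output.pdf") would write the same pages as the original script, and the returned tuple tells you which pages were actually copied if the document is shorter than end_page.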
Example Scenario
Suppose you have a PDF book and you only need pages 3 to 40. Run the script, and it will generate a new PDF named output.pdf containing just those pages. This kind of scenario comes up often when preparing documents for retrieval-augmented generation (RAG).
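The same approach extends to splitting a large PDF into several smaller files. The sketch below is one possible way to do it, assuming fixed-size parts. The names split_ranges and split_pdf, and the part_001.pdf naming pattern, are hypothetical and not part of PyMuPDF.

```python
def split_ranges(total_pages, pages_per_part):
    """Return 1-based inclusive (start, end) page ranges covering the document."""
    if pages_per_part < 1:
        raise ValueError("pages_per_part must be at least 1")
    return [
        (start, min(start + pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_part)
    ]


def split_pdf(input_pdf, output_pattern="part_{:03d}.pdf", pages_per_part=40):
    """Write each range from split_ranges to its own PDF file."""
    import fitz  # PyMuPDF

    doc = fitz.open(input_pdf)
    outputs = []
    try:
        ranges = split_ranges(len(doc), pages_per_part)
        for number, (start, end) in enumerate(ranges, 1):
            part = fitz.open()  # New empty PDF for this part
            # insert_pdf uses zero-based, inclusive page indices
            part.insert_pdf(doc, from_page=start - 1, to_page=end - 1)
            name = output_pattern.format(number)  # Assumed naming scheme
            part.save(name)
            part.close()
            outputs.append(name)
    finally:
        doc.close()
    return outputs
```

For a 100-page document and 40 pages per part, split_ranges yields three ranges, with the last part holding the remaining 20 pages.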
Conclusion
This simple script allows you to extract specific pages from a PDF efficiently using Python. Whether you're working with large reports, eBooks, or scanned documents, this method can save you time and effort.
Try it out and let us know how it works for you!