Wednesday, March 12, 2025

Extracting Tamil Text from Scanned PDFs Using Python


Extracting text from scanned documents is a common digitization task, and it takes some extra setup for regional languages like Tamil. Optical Character Recognition (OCR) converts scanned images into editable text. In this post, we walk through a Python script that extracts Tamil text from scanned PDFs using pdfplumber and pytesseract.


Prerequisites

To run this script, you need to install the following Python libraries:

pip install pdfplumber pytesseract

You also need to install Tesseract-OCR, the open-source OCR engine that pytesseract calls. Windows users can download it from the Tesseract GitHub page. If Tesseract is not on your system's PATH, specify its location in the script:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
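
If the same script has to run on machines where Tesseract may or may not be on the PATH, the choice can be made at runtime. The following is a minimal sketch using only the standard library; find_tesseract is a hypothetical helper name, and the fallback is simply the Windows path quoted above, so adjust it to your own installation.

```python
import shutil

# Install path quoted in this post; adjust it if Tesseract lives elsewhere.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=WINDOWS_DEFAULT):
    """Return the tesseract executable found on PATH, or the fallback path."""
    return shutil.which("tesseract") or fallback


print(find_tesseract())
# In the OCR script, the result would be assigned like this:
# pytesseract.pytesseract.tesseract_cmd = find_tesseract()
```

The helper does not check that the fallback path exists, so a wrong fallback will only surface when pytesseract tries to run Tesseract.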

For accurate Tamil text extraction, ensure you have the Tamil language pack installed:

tesseract --list-langs  # Check installed languages
tesseract --list-langs | grep tam  # Verify Tamil support (prints "tam" if installed)

If "tam" is not listed, install the Tamil language data. On Debian or Ubuntu:

sudo apt install tesseract-ocr-tam
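
The same check can be done from Python before starting a long OCR run. This is a sketch, not part of the original script: missing_languages and check_tesseract_languages are hypothetical helper names, and pytesseract.get_languages is assumed to be available (it exists in recent pytesseract releases).

```python
def missing_languages(required, installed):
    """Return the required Tesseract language codes that are not installed."""
    installed_set = set(installed)
    return [code for code in required if code not in installed_set]


def check_tesseract_languages(required=("tam",)):
    """Ask Tesseract which languages it has and report the missing ones."""
    # Imported here so the pure helper above can be used without pytesseract.
    import pytesseract

    installed = pytesseract.get_languages(config="")
    return missing_languages(required, installed)


# Example with a hard-coded list instead of a live Tesseract query:
print(missing_languages(["tam", "eng"], ["eng", "osd"]))  # ['tam']
```

In the real script you would call check_tesseract_languages(["tam"]) and stop with a clear message if the returned list is not empty.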



Understanding the Script

The script performs the following tasks:

  1. Opens the scanned PDF using pdfplumber.

  2. Renders each page within a given page range as an image.

  3. Performs OCR on each image using pytesseract with Tamil language support.

  4. Writes the extracted text to a text file.

Here’s the complete script:

import pdfplumber
import pytesseract

# If Tesseract is not in system PATH, set its location (Windows users)
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def extract_tamil_text_from_scanned_pdf(pdf_path, output_file, start_page=4, end_page=34):
    tamil_text = ""
    with pdfplumber.open(pdf_path) as pdf:
        total_pages = len(pdf.pages)
        for i in range(start_page - 1, min(end_page, total_pages)):  # Adjust for zero-based index
            page = pdf.pages[i]
            image = page.to_image().original  # Convert PDF page to a PIL Image
            text = pytesseract.image_to_string(image, lang="tam")  # OCR with Tamil language
            tamil_text += text + "\n"

    with open(output_file, "w", encoding="utf-8") as file:
        file.write(tamil_text)

# Example usage
pdf_path = "input.pdf"  # Replace with actual PDF file
output_file = "output.txt"
extract_tamil_text_from_scanned_pdf(pdf_path, output_file)

print(f"OCR-extracted Tamil text from pages 4 to 34 has been written to '{output_file}'.")

How It Works

  • pdfplumber.open(pdf_path): Opens the PDF file.

  • pdf.pages[i].to_image().original: Converts each page into an image.

  • pytesseract.image_to_string(image, lang="tam"): Extracts Tamil text from the image.

  • with open(output_file, "w", encoding="utf-8"): Saves the extracted text to a file.
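
The page-range arithmetic in the script is easy to get wrong by one, so it can help to check it on its own. The sketch below reproduces the script's range(start_page - 1, min(end_page, total_pages)) expression in a hypothetical helper named page_indices; the guard against start_page below 1 is an addition that is not in the original script.

```python
def page_indices(start_page, end_page, total_pages):
    """Zero-based indices for the 1-based, inclusive range start_page..end_page."""
    if start_page < 1:
        # Without this guard, start_page=0 would yield index -1 (the last page).
        raise ValueError("start_page must be 1 or greater")
    # end_page is clamped so a short PDF does not raise IndexError.
    return list(range(start_page - 1, min(end_page, total_pages)))


print(page_indices(4, 34, 10))  # [3, 4, 5, 6, 7, 8, 9]
print(page_indices(4, 34, 2))   # []
```

With the default range of pages 4 to 34, a 10-page PDF yields pages 4 through 10, and a PDF with fewer than 4 pages yields nothing, so the output file would be empty.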


Customizing the Script

  • Changing the Page Range: Modify start_page and end_page to extract text from different pages.

  • Extracting Multiple Languages: Change the lang parameter to extract text in multiple languages, e.g., lang="tam+eng".

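  • Enhancing OCR Accuracy: Render pages at a higher resolution (pdfplumber's to_image() defaults to 72 DPI; try to_image(resolution=300)) and preprocess images (e.g., binarization, resizing) before passing them to Tesseract for better results.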
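
As an illustration of the preprocessing idea, here is a minimal sketch that takes the PIL image produced by pdfplumber and applies grayscale conversion, enlargement, and a simple global threshold. The function name preprocess_for_ocr and the default values for scale and threshold are assumptions chosen for illustration, not tuned settings; suitable values depend on the scan quality.

```python
def preprocess_for_ocr(image, scale=2, threshold=160):
    """Return a grayscale, enlarged, black-and-white copy of a PIL image.

    `scale` is an integer enlargement factor; `threshold` is a 0-255 cutoff.
    """
    gray = image.convert("L")  # 8-bit grayscale
    width, height = gray.size
    enlarged = gray.resize((width * scale, height * scale))
    # Global threshold: pixels brighter than `threshold` become white, the rest black.
    return enlarged.point(lambda value: 255 if value > threshold else 0)


# In the OCR script, the call would look like this:
# text = pytesseract.image_to_string(preprocess_for_ocr(image), lang="tam")
```

A single global threshold works reasonably on clean, evenly lit scans. Pages with shadows or stains may need a different approach, so compare the OCR output with and without preprocessing on a few sample pages.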


Conclusion

This Python script provides an efficient way to extract Tamil text from scanned PDFs using Tesseract OCR and pdfplumber. Whether you’re digitizing old Tamil manuscripts, processing government documents, or extracting text from academic papers, this approach can save time and effort.

Try it out and let us know your experience in the comments!
