In today’s digital world, extracting text from scanned documents is a crucial task, especially when dealing with regional languages like Tamil. Optical Character Recognition (OCR) technology helps in converting scanned images into editable text. In this post, we will walk through a Python script that extracts Tamil text from scanned PDFs using pdfplumber and pytesseract.
Prerequisites
To run this script, you need to install the following Python libraries:
pip install pdfplumber pytesseract
Additionally, you need to install Tesseract-OCR, an open-source OCR engine. For Windows users, you can download it from Tesseract GitHub. If Tesseract is not added to your system's PATH, specify its location in the script:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
For accurate Tamil text extraction, ensure you have the Tamil language pack installed:
tesseract --list-langs # Lists installed languages; tam should appear in the output
If tam is not listed, install the Tamil language data. On Debian/Ubuntu:
sudo apt install tesseract-ocr-tam
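The same check can be made from Python before running OCR, so a missing language pack fails early with a clear message. This is a minimal sketch: it relies on pytesseract.get_languages(), and the helper names missing_languages and check_languages are hypothetical, chosen here for illustration.

```python
def missing_languages(requested, installed):
    """Return the codes in a Tesseract lang string (e.g. "tam+eng")
    that do not appear in the list of installed languages."""
    return [code for code in requested.split("+") if code not in installed]


def check_languages(requested="tam"):
    """Raise an error if Tesseract lacks any of the requested languages."""
    # Imported here so missing_languages can be used without pytesseract.
    import pytesseract

    installed = pytesseract.get_languages(config="")
    missing = missing_languages(requested, installed)
    if missing:
        raise RuntimeError(
            "Tesseract language data not installed: " + ", ".join(missing)
        )
```

Calling check_languages("tam") once at the start of the script is enough; it returns silently when Tamil is available.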
Understanding the Script
The script performs the following tasks:
Opens the scanned PDF using pdfplumber.
Converts each page within a given page range into an image.
Performs OCR on each image using pytesseract with Tamil language support.
Writes the extracted text to a text file.
Here’s the complete script:
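The sketch below is one way to assemble these steps, built only from the calls described in this post. The function names extract_tamil_text and page_indices, the page header written to the output file, and the treatment of start_page and end_page as a 1-based, inclusive range are assumptions made for illustration.

```python
def page_indices(start_page, end_page, total_pages):
    """Zero-based indices for the range start_page..end_page.

    Assumption: page numbers are 1-based and inclusive. The range is
    clamped to the pages that actually exist in the PDF.
    """
    first = max(start_page, 1)
    last = min(end_page, total_pages)
    return list(range(first - 1, last))


def extract_tamil_text(pdf_path, output_file, start_page, end_page, lang="tam"):
    """OCR the given page range of a scanned PDF into a UTF-8 text file."""
    # Third-party imports are kept inside the function so that
    # page_indices can be used without the OCR stack installed.
    import pdfplumber
    import pytesseract

    # Uncomment on Windows if Tesseract is not on PATH:
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    with pdfplumber.open(pdf_path) as pdf, \
            open(output_file, "w", encoding="utf-8") as out:
        for i in page_indices(start_page, end_page, len(pdf.pages)):
            # Render the page and take the underlying PIL image.
            image = pdf.pages[i].to_image().original
            text = pytesseract.image_to_string(image, lang=lang)
            out.write(f"--- Page {i + 1} ---\n{text}\n")


# Example call (file names are placeholders):
# extract_tamil_text("scanned.pdf", "output.txt", start_page=1, end_page=5)
```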
How It Works
pdfplumber.open(pdf_path): Opens the PDF file.
pdf.pages[i].to_image().original: Converts each page into an image.
pytesseract.image_to_string(image, lang="tam"): Extracts Tamil text from the image.
with open(output_file, "w", encoding="utf-8"): Saves the extracted text to a file.
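One detail worth knowing about the image conversion step: to_image() accepts a resolution argument and renders at 72 DPI when none is given, which is often too coarse for OCR, since Tesseract generally performs better at around 300 DPI. The sketch below shows how the rendered size grows with resolution and how a higher value could be passed. The helper names rendered_size and page_to_image are hypothetical, and 300 is a starting point to tune rather than a fixed rule.

```python
def rendered_size(width_pt, height_pt, resolution=72):
    """Approximate pixel size of a page rendered at `resolution` DPI.

    PDF dimensions are in points (1/72 inch), so
    pixels = points * DPI / 72.
    """
    return (
        round(width_pt * resolution / 72),
        round(height_pt * resolution / 72),
    )


def page_to_image(page, resolution=300):
    """Render a pdfplumber page to a PIL image at the given DPI."""
    return page.to_image(resolution=resolution).original
```

For a US Letter page (612 x 792 points), this works out to roughly 612 x 792 pixels at the default and 2550 x 3300 pixels at 300 DPI, so each character gets about four times as many pixels in each direction.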
Customizing the Script
Changing the Page Range: Modify start_page and end_page to extract text from different pages.
Extracting Multiple Languages: Change the lang parameter to extract text in multiple languages, e.g., lang="tam+eng".
Enhancing OCR Accuracy: Preprocess images (e.g., binarization, resizing) before passing them to Tesseract for better results.
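As an illustration of the preprocessing idea, the sketch below converts a page image to grayscale, enlarges it, and applies a fixed threshold using Pillow, which is installed as a dependency of pytesseract. The function names, the scale factor of 2.0, and the threshold of 160 are assumptions to adjust per document, not recommended values.

```python
def scaled_size(size, factor):
    """Return (width, height) scaled by factor, at least 1 pixel each."""
    width, height = size
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def preprocess_for_ocr(image, scale=2.0, threshold=160):
    """Grayscale, upscale, and binarize a PIL image before OCR."""
    # Imported here so scaled_size can be used without Pillow.
    from PIL import ImageOps

    gray = ImageOps.grayscale(image)
    gray = gray.resize(scaled_size(gray.size, scale))
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)
```

The result can be passed straight to OCR, for example pytesseract.image_to_string(preprocess_for_ocr(image), lang="tam"). It is worth comparing the output with and without preprocessing on a few pages, since a fixed threshold can hurt scans with uneven lighting.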
Conclusion
This Python script provides an efficient way to extract Tamil text from scanned PDFs using Tesseract OCR and pdfplumber. Whether you’re digitizing old Tamil manuscripts, processing government documents, or extracting text from academic papers, this approach can save time and effort.
Try it out and let us know your experience in the comments!