Kurdish libraries contain numerous historical publications that have suffered from damage and non-standard fonts, complicating the extraction of text. The study presents the use of Tesseract version 5.0, an open-source OCR framework, to enhance processing capabilities for Kurdish's limited resources. A dataset comprising 1233 images from publications pre-dating 1950 was developed and utilized for training. Various evaluation methods were executed, highlighting the potential of the model while acknowledging ongoing challenges faced with low-resource languages and the quality of historical documents.
Kurdish libraries contain many historical publications with issues like damage and non-standard fonts, making the extraction of text through OCR challenging and time-consuming.
The study utilizes Tesseract 5.0, an open-source OCR framework, to process historical Kurdish documents for better language resource development.
A unique dataset of 1233 images of lines from publications printed before 1950 was created, facilitating the training of an OCR model for Kurdish language documents.
Tesseract's evaluation method demonstrated effective processing capabilities, yet challenges and limitations in dealing with low-resource languages persist.
#kurdish-language #optical-character-recognition #historical-documents #tesseract-ocr #data-processing
Collection
[
|
...
]