IAES International Journal of Artificial Intelligence (IJ-AI)

Received Dec 17, 2021; Revised Aug 28, 2022; Accepted Sep 27, 2022

Manually processing invoices that exist as scanned photocopies is time-consuming. There is a need to automate the extraction of data from invoices that share a similar format. In this paper we investigate and analyse various image processing and text extraction techniques to improve the results of the optical character recognition (OCR) engine that extracts the text from the invoice. This paper also proposes the design and implementation of a web-enabled invoice processing system (IPS). The IPS consists of an annotation tool and an extraction tool. The annotation tool is used to mark the fields of interest that are to be extracted from the invoice. The extraction tool uses open source computer vision library (OpenCV) algorithms to detect text. The proposed system was tested on more than 25 types of invoices, with the average accuracy score lying between 85% and 95%. Finally, for ease of use, a web application was developed that also presents the results in a structured format. The entire system is designed to provide flexibility and to automate the extraction of details of interest from invoices.


INTRODUCTION
Large and medium scale companies deal with a large number of invoices on a daily basis. They need to keep track of the products sold through these invoices, the cash flow generated, the taxes paid and other product-related analytics that are useful for their business. Companies employ dedicated data entry staff who manually enter these details into systems or databases, which is laborious and time-consuming. This process can be automated with a system that extracts the data from the invoices and updates the databases with the product details. In general, invoices follow a standard template consisting of the seller's name and address and the consumer's name and address, followed by an invoice number whose format varies by company. Then there is a table with product details, followed by transactional details such as tax and total amount. This regularity makes it possible to design a system that gives the user flexibility in selecting the fields of interest, based on their use case. This paper describes such a system, which exploits this common structure of invoices to automate their processing and save many hours of human effort. There are two parts to the system: a backend, in which an open source computer vision library (OpenCV) [1] based framework detects text in the invoice and an optical character recognition (OCR) engine extracts it, and a web application that serves as a tool to upload invoices, view the structured details and store them in the database. Another useful application of text detection and recognition is described in [2], but that system is restricted to detecting text regions on objects, which are less densely clustered than the text present in documents and invoices. Nevertheless, that paper presents some useful approaches for problems in this domain.

Int J Artif Intell, ISSN: 2252-8938. Automated invoice data extraction using image processing (Akanksh Aparna Manjunath)

RELATED WORK
The efficient and accurate scene text detector (EAST) is a deep learning-based architecture used to detect text in images captured in outdoor environments [3]. When the architecture was applied to text in documents, the results were not promising. To automate reading information from invoices and either storing it in databases or producing JavaScript Object Notation (JSON) template files, character recognition engines, called OCR engines, are required [4]. However, before such software is used, it is important to preprocess the invoices so that the accuracy of text extraction increases. The preprocessing steps discussed there include thresholding, binarization and morphological transformations, which are used to remove background noise and obtain accurate results. OpenCV is a popular framework among computer vision enthusiasts and has also been used for text detection. Almost all invoices contain tabular data; [5] explains the use of OpenCV methods for extracting tabular data, as well as metadata about the table, from documents stored in portable document format (PDF), and gives a sound explanation of the algorithm used to detect tables.
The fundamentals of digital image processing [6] covers image processing operations such as image enhancement and restoration, morphological processing, segmentation and object recognition, followed by representation and description. Suitable operations from this set were used while preprocessing the invoice. The operations in MATLAB [7] gave a clear picture of the transformation that each operation applied to the invoice.
One of the recent technologies in the field of computer vision is the artificial neural network, and [8] explains the use of one such deep network for text detection. It presents a neural network architecture based on a spatial transformer network (STN-OCR), in which a single neural network is trained to both detect and recognize text in images. It uses semi-supervised neural networks for scene text recognition, which are further optimized. The model was tested against benchmark detection datasets [9] and gave promising accuracy scores. The work in [10] details the difficulty of developing a custom dataset and of finding varied, high-quality datasets to annotate and use for training and testing; hence publicly available datasets were used as far as possible. That paper also describes an effort to automate invoice processing using feature extraction, but it does not consider the tables present in invoices or the items in those tables, which may also need to be processed and extracted. It was observed that although the network architecture is simple, the system is not easy to train: successful training requires extensive pretraining on easier subtasks before the model can converge on the real task. Marinai et al. [11] extend the STN architecture with an attention model and use a convolutional recurrent neural network (CRNN) instead of traditional convolutional networks. A similar neural network architecture, used to recognize characters, is described in [12]. It also requires large volumes of training data in order to be fine-tuned to recognize special characters. A popular algorithm for object recognition is you only look once (YOLO), which has also been used to recognize the text regions present in a document [13]. However, the YOLO-based detection module gave unsatisfactory results compared with the OpenCV-based method, failing to detect many text regions. Treating the information extraction task as an image segmentation problem loses the text semantics, which can further complicate the processing of unstructured documents [10]. The availability of a dataset for training a segmentation network is another problem when training networks to extract relevant information from structured documents. Baviskar et al. [14] provide a well-annotated dataset that helps in training a network to recognize varying invoices of the same format. However, a model that must recognize a wide range of invoices has to be trained on a correspondingly large number of invoices.
To detect the text present in the document, a recent algorithm described in [15] was considered. The algorithm uses neural networks to group together the characters that make up a region of text. However, it then becomes difficult to categorize the text, which is what our use case requires. This has been mitigated with an annotation tool developed as part of the web application. For text recognition, popular neural network-based algorithms such as [16] and [17] were also considered; while they gave promising results during trials, their performance in recognizing new characters was underwhelming. Many OCR engines are currently available, Tesseract being the most popular one [18]. Tesseract is an open source OCR engine that follows a traditional step-by-step pipeline for image-to-text conversion. Outlines of the components are detected and grouped together to form blobs. These blobs are organized into text lines, which are broken down according to the character spacing. A two-pass recognition step is then performed on these cells in order to identify words that may have been skipped in the initial pass. Soille [19] describes in detail the various morphological transformations and how they work; these have been used in the image preprocessing module of the proposed system.

ISSN: 2252-8938. Int J Artif Intell, Vol. 12, No. 2, June 2023: 514-521
To improve the accuracy of the Tesseract engine and to recognize new symbols and characters [20], the Tesseract base model was trained with custom data files. The methodologies discussed in that paper were followed and new data and config files were created. It provided insight into Tesseract training, which helped improve the Tesseract accuracy [21]. Some preprocessing steps for improving OCR accuracy are also suggested in [22]. Characters and words are separated from the background using binarization, noise reduction, skew correction and slant removal. The image is segmented into text lines and words using text line detection and word extraction. Structural elements such as vertical and horizontal lines are also extracted as features. Algorithms such as k-nearest neighbour (KNN), support vector machine (SVM), Naive Bayes and neural networks are used to classify characters of scripts such as Latin and Devanagari.

METHOD AND IMPLEMENTATION
Based on the methods above that best fit our use case, the proposed system consists of two main modules. The first is a frontend web application, which is used to mark the annotations, create template files and obtain the invoice data in a structured format. The second is the backend, which consists of an extraction tool with submodules for image preprocessing, text detection and text recognition. The overall architecture of the system, shown in Figure 1, therefore contains two main sections: the annotation tool, provided as part of the frontend and used to mark the fields of interest, and the core extraction part, which runs at the backend and uses OCR to obtain fields and tables from the invoice after the preprocessing steps. The web application is used to mark and annotate the fields of interest by drawing bounding boxes around them. Further details of the frontend are described in the following section. Before the extraction algorithm is introduced, a few preprocessing steps to be performed on the invoice are discussed. The extraction part concludes with a note on improving Tesseract accuracy by training the base model and on extending its functionality by increasing the number of characters it can recognize.

Web application
The web application was built using basic hypertext markup language (HTML) and JavaScript. Django [23], a popular Python web framework, was used in the backend to transform user-uploaded files into the format required for processing. Poppler was used to extract the pages from the PDF and render them on the HTML canvas. As a conservative estimate, a size of 900 × 1,200 pixels was assumed to give the best result, and the canvas size was set to the same value. The image is scaled to match the canvas, so the annotation coordinates are expressed with respect to the canvas. During extraction, the scaling factor is used to map the canvas coordinates to the corresponding image coordinates so as to ensure accurate extraction. This was done to avoid downscaling the image itself and losing clarity and quality, which would make it difficult to annotate an image that already has a low resolution. JavaScript is mainly responsible for registering the user's clicks on the canvas and drawing rectangles, each associated with a unique identifier (ID). These rectangles are called annotations. Because certain customizations were needed for flexibility during annotation, a custom annotation tool was developed using JavaScript and HTML. It also provides an option to upload and edit a previously created template, a feature that is not included in publicly available annotation tools. Once the fields of interest have been marked, the invoice and the fields of interest, registered as coordinates, are used to extract the details from the invoice. The steps are divided as follows. Image preparation, or preprocessing, normalizes the invoice to obtain more accurate results. It is followed by a table detection step, which detects the table of interest to the user. Once all the text has been detected, it is extracted by a custom-trained Tesseract OCR engine in the text recognition phase.
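The mapping from canvas coordinates back to image coordinates can be illustrated with a minimal sketch. The `Box` type, the function name `canvas_to_image` and the example image size are assumptions made for illustration; they are not taken from the system's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle: top-left corner plus width and height."""
    x: float
    y: float
    width: float
    height: float


def canvas_to_image(box, canvas_size, image_size):
    """Map an annotation drawn on the canvas to original-image pixels.

    canvas_size and image_size are (width, height) tuples. The scale is
    computed per axis in case the aspect ratios differ slightly.
    """
    canvas_w, canvas_h = canvas_size
    image_w, image_h = image_size
    scale_x = image_w / canvas_w
    scale_y = image_h / canvas_h
    return Box(box.x * scale_x, box.y * scale_y,
               box.width * scale_x, box.height * scale_y)


# Hypothetical example: a 900 x 1200 canvas showing an 1800 x 2400 page image.
annotation = Box(x=90, y=120, width=450, height=60)
mapped = canvas_to_image(annotation, (900, 1200), (1800, 2400))
```

Storing annotations in canvas units and converting only at extraction time keeps a template independent of the resolution of any particular scan.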

Image preprocessing
The invoice has to be preprocessed to eliminate background noise and thereby improve the accuracy of detection and, later, recognition. The steps performed as part of the preprocessing are as follows. The first step is to remove noise using OpenCV's denoising functions, so that the noise does not affect the subsequent transformations. Denoising is followed by two morphological operations performed in this order: erosion and dilation. Erosion removes white noise, but it also shrinks the objects in focus, so dilation is performed afterwards to restore the size of the words. A sample invoice, shown in Figure 2, is used to illustrate these operations, displaying the invoice before and after processing.
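The effect of erosion followed by dilation can be sketched in pure Python on a tiny binary grid, where 1 is a white (foreground) pixel. This is an illustrative stand-in under stated assumptions, not the system's code: in an OpenCV pipeline the same operations are available as `cv2.erode` and `cv2.dilate`, and the 3 × 3 window and the sample grid below are assumed values.

```python
def _morph(grid, size, combine):
    """Apply `combine` (min or max) over a size x size window around each pixel.

    Windows are clipped at the image border, so pixels outside the image
    do not influence the result.
    """
    height, width = len(grid), len(grid[0])
    radius = size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                grid[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(combine(window))
        result.append(row)
    return result


def erode(grid, size=3):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return _morph(grid, size, min)


def dilate(grid, size=3):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return _morph(grid, size, max)


# 7 x 7 image: a 3 x 3 white block (the "object") plus one isolated speck.
grid = [[0] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        grid[y][x] = 1
grid[0][6] = 1  # white noise

opened = dilate(erode(grid))  # erosion removes the speck, dilation restores the block
```

Erosion alone would reduce the block to its single centre pixel; the dilation that follows grows it back to its original 3 × 3 extent while the speck stays removed.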

Table detection

Bordered table detection
The horizontal and vertical lines in the image are identified using OpenCV functions. Superimposing the identified horizontal and vertical lines reveals the structure of the table. After the table structure is identified, morphological operations such as dilation and erosion are applied to obtain the mask of the table region. Canny edge detection is then applied to the mask to identify the edges of the rectangular regions, and the contours are found. These contours are cropped from the original image.
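The idea of superimposing horizontal and vertical lines to recover cells can be sketched without OpenCV. Note that this sketch swaps in a simpler technique, projection profiles on a binary grid, in place of the morphological operations, Canny edge detection and contour finding described above. It assumes the grid is already cropped to the table, that 1 marks an ink pixel, and that ruling lines span almost the full width or height; the function names and the `min_fill` threshold are hypothetical.

```python
def ruling_lines(grid, min_fill=0.9):
    """Return (rows, cols) whose share of ink pixels is at least min_fill."""
    height, width = len(grid), len(grid[0])
    rows = [y for y in range(height) if sum(grid[y]) >= min_fill * width]
    cols = [
        x for x in range(width)
        if sum(grid[y][x] for y in range(height)) >= min_fill * height
    ]
    return rows, cols


def cells_from_lines(rows, cols):
    """Intersect consecutive lines to get cell interiors.

    Each cell is (left, top, right, bottom), inclusive, excluding the lines.
    """
    cells = []
    for top, bottom in zip(rows, rows[1:]):
        for left, right in zip(cols, cols[1:]):
            if bottom - top > 1 and right - left > 1:
                cells.append((left + 1, top + 1, right - 1, bottom - 1))
    return cells


# Synthetic 2 x 2 table: horizontal lines at y = 0, 3, 6; vertical at x = 0, 4, 8.
grid = [
    [1 if (y in (0, 3, 6) or x in (0, 4, 8)) else 0 for x in range(9)]
    for y in range(7)
]
rows, cols = ruling_lines(grid)
cells = cells_from_lines(rows, cols)
```

With OpenCV, line extraction of this kind is commonly done with `cv2.getStructuringElement` and `cv2.morphologyEx` using long, thin kernels, which tolerates broken or slightly skewed lines better than a plain projection.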

Unbordered table detection
The table structure is identified using a mix of template matching and an OpenCV-based algorithm. The start of the table is specified in a JSON template file, which can be created using the user interface (UI) provided by the website. The number of columns in the table is also taken as an input parameter. Morphological operations such as dilation are performed on the image to convert the regions of text into thick black boxes. Contours are then detected from these boxes; these contours are the regions of text in the image. The contours lying along the same horizontal line are grouped together. If the number of text regions along a line matches the number of columns in the table, the line is considered a row of the table, and each region in the row is considered a cell of the table. This approach has some limitations:
− The text should be in dark colors and the background should be white. Otherwise, the text region identification would fail.
− If a line of text regions is not part of the table but has the same number of regions as the table has columns, it would be considered part of the table.
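The row-grouping rule described above can be sketched as follows, starting from text-region bounding boxes that have already been detected. The box layout `(x, y, width, height)`, the function name `group_rows`, the `table_top` parameter (standing in for the table start read from the template) and the pixel tolerance are assumptions for illustration.

```python
def group_rows(boxes, n_columns, table_top=0, y_tolerance=5):
    """Group text-region boxes into table rows.

    boxes: iterable of (x, y, width, height) in image pixels.
    A line of boxes counts as a table row only if it contains exactly
    n_columns boxes. Boxes above table_top are ignored.
    """
    def center_y(box):
        return box[1] + box[3] / 2

    lines = []
    candidates = [b for b in boxes if b[1] >= table_top]
    for box in sorted(candidates, key=center_y):
        # Join the current line if the vertical centers are close enough.
        if lines and abs(center_y(box) - lines[-1]["cy"]) <= y_tolerance:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"cy": center_y(box), "boxes": [box]})

    # Keep only lines whose box count matches the column count,
    # with cells ordered left to right.
    return [
        sorted(line["boxes"], key=lambda b: b[0])
        for line in lines
        if len(line["boxes"]) == n_columns
    ]


# Hypothetical regions: one wide title line and two three-column rows.
boxes = [
    (10, 40, 300, 20),                                          # title, not a row
    (10, 100, 80, 20), (120, 102, 40, 20), (200, 98, 60, 20),   # row 1
    (10, 140, 80, 20), (120, 141, 40, 20), (200, 139, 60, 20),  # row 2
]
rows = group_rows(boxes, n_columns=3)
```

The second limitation listed above is visible in this rule: any line outside the table that happens to contain exactly `n_columns` regions would also be accepted as a row.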

Text recognition
To recognize the characters present in the previously detected text regions, OCR engines such as Tesseract were used. A comparison of the OCR engines considered is given below. Tesseract is a popular OCR engine whose task is to recognize the characters present in an image or text file. The Tesseract engine is used through a Python wrapper called PyTesseract [24], which is installed as a dependency in the Django platform. We evaluated a few open source OCR engines, namely OCRopus, PaddleOCR and Tesseract. OCRopus is a tool for extracting text from scanned documents; its basic pipeline includes binarization, segmentation and text recognition. It gave good results on normal English text but was not able to recognize certain special characters such as the rupee symbol. We could not find good methods to train the existing model on special characters, and therefore could not use OCRopus in our implementation. Tesseract was chosen over the other available OCR engines mainly because it has good documentation and offers an option to train the existing model on special characters.
While testing the performance of the system, a few characters such as the Indian rupee or the pound symbol were not recognized with the default trained data file provided with Tesseract. To rectify this issue, the base Tesseract model has to be trained to recognize the special characters of interest. This training has to be done with datasets that include properly formatted files containing the required special characters with annotated bounding boxes [25]. Open source pretrained Tesseract files, trained on special characters such as the Indian rupee symbol, were used in the recognition pipeline.
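As an illustration of what an annotated bounding box for training can look like, the sketch below writes one line in the classic Tesseract box-file layout, `symbol left bottom right top page`, whose coordinates are measured from the bottom-left corner of the image. Whether the datasets referred to above use exactly this layout is an assumption, and the helper name `to_box_line` is hypothetical.

```python
def to_box_line(symbol, x, y, width, height, image_height, page=0):
    """Convert a top-left-origin annotation into a box-file line.

    (x, y) is the top-left corner of the character in image pixels.
    Box files measure from the bottom-left corner, so the y axis is flipped.
    """
    left = x
    right = x + width
    top = image_height - y
    bottom = image_height - (y + height)
    return f"{symbol} {left} {bottom} {right} {top} {page}"


# Hypothetical rupee symbol annotated on an image that is 1000 pixels tall.
line = to_box_line("₹", x=100, y=50, width=20, height=30, image_height=1000)
```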

RESULTS AND DISCUSSION
The extraction step involves two parts: extracting the product details inside the table, and extracting specific fields such as the company name, the address and the total amount on the bill. The results for each part are presented in this section. For demonstration purposes, a publicly available invoice is chosen [9]. Note that the invoice in Figure 3(a) can be an image taken by a smartphone or a computer-generated invoice. The image in Figure 3(b) is obtained after performing the image preprocessing operations; the white regions in Figure 3(b) are the detected contours. The OCR engine is then used to recognize the text in each of these cells, and for extracting data from tables only the part of the image containing the table is considered. By default, Tesseract expects a page of text when it segments an image. However, as OCR was to be applied to a small region, different page segmentation modes (PSM) were tried. PSM 4 and PSM 6 gave the best results, and PSM 6 was used to extract text from the annotations because they resemble a uniform block of text, which is the layout Tesseract expects with PSM 6. The output of running Tesseract on the table is shown in Figure 4.
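Selecting a page segmentation mode through PyTesseract can be sketched as below. The helper names `build_config` and `recognize_region` are assumptions; `--psm` is Tesseract's command-line flag for the mode, and `pytesseract.image_to_string` accepts it through its `config` argument. PyTesseract is imported lazily so the configuration helper can be used on its own.

```python
def build_config(psm=6):
    """Return the Tesseract config string for a page segmentation mode.

    PSM 6 assumes a single uniform block of text; PSM 4 assumes a single
    column of text of variable sizes.
    """
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def recognize_region(region_image, psm=6, lang="eng"):
    """Run Tesseract on a cropped region (a PIL image or NumPy array).

    Requires the pytesseract package and a Tesseract installation.
    """
    import pytesseract  # imported here so build_config works without it

    return pytesseract.image_to_string(
        region_image, lang=lang, config=build_config(psm)
    )
```

A call such as `recognize_region(cropped_annotation)` would then apply PSM 6 to a single annotated field, where `cropped_annotation` is a hypothetical crop of the invoice image.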
Finally, the web application displays the results to the user who uploaded the invoice, as can be seen in Figure 5(a) and Figure 5(b). Figure 5(a) shows the actual image that was extracted from the PDF. Figure 5(b) shows the tables extracted from these images. The system was tested on more than 25 invoices, which were captured against varying backgrounds and under varying lighting. A few invoices were computer generated, whereas a few were photographed with smartphones. Based on the confidence score given by Tesseract, the accuracy of extracting products from tables was found to be between 80% and 95%.
The accuracy of the Tesseract engine is obtained in the form of a confidence score, which is output by the Python wrapper (PyTesseract) for Tesseract. When the image_to_data function is called, the wrapper returns a list of bounding boxes, the confidence score (accuracy) for each bounding box and the detected text within each box. When the extraction results were checked manually, the accuracy score was found to be more than 85% on average. Text extraction on the entire invoice gave similar results, with the average accuracy being around 85%. It was observed that accuracy was significantly higher for computer-generated invoices and was on the lower end for invoices that were not generated by machines. The code repository, with details of its contents as well as the steps to set up, debug and run the program, is available at [26]. Following these steps leads to a web application that offers the features discussed in this paper.
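Averaging the per-box confidence scores can be sketched as follows. The input is assumed to be the dictionary form of the `image_to_data` result (PyTesseract's `Output.DICT`), with parallel `text` and `conf` lists in which structural entries carry a confidence of -1. The function name and the sample values are illustrative and are not measured results.

```python
def mean_confidence(data):
    """Average Tesseract's word confidences, skipping non-word entries.

    data: dict with parallel "text" and "conf" lists. Confidences may be
    strings or numbers depending on the PyTesseract version, so they are
    converted with float().
    """
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        value = float(conf)
        if value >= 0 and text.strip():
            scores.append(value)
    return sum(scores) / len(scores) if scores else 0.0


# Hand-made sample in the same shape as the wrapper's output.
sample = {
    "text": ["", "Invoice", "No.", " ", "1234"],
    "conf": ["-1", "96", "91", "-1", "83"],
}
average = mean_confidence(sample)
```

Because the confidence score is the engine's own estimate, it can differ from the accuracy obtained by manual checking, which is why both figures are reported above.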

CONCLUSION
Computer vision has found applications in many fields. In this paper, one such application, in the field of text recognition and extraction, was presented. The system involves a web application at the frontend, and OpenCV and a fine-tuned Tesseract OCR engine performing text extraction at the backend. The algorithms used for image preprocessing of the invoices were also discussed. With the help of OpenCV, background noise is removed and the clarity of the images is improved. Image preprocessing techniques were applied to improve the image quality and obtain accurate results. The list of products is extracted separately, and the remaining fields are extracted based on the user's needs. After fine-tuning of the base Tesseract model, the system is able to recognize special characters and symbols with an accuracy of 90%. The system was tested on more than 25 types of invoices and gave promising accuracy scores, with the average score lying between 85% and 95%.

Figure 1. The overall architecture of the system


Figure 2. Picture of original image and image after performing preprocessing operations

Figure 3. Sample invoice under consideration for (a) demonstration purposes and (b) the image obtained after converting to grayscale, performing binarization, applying morphology gradient, thresholding using Otsu binarization, finding and drawing the contours

Figure 4. The tables recognized and the text extracted from the regions of interest in the template