OPTICAL CHARACTER RECOGNITION (OCR)

Sudarshan S
4 min read · Aug 19, 2022

In this blog post, we'll discuss one of the most important applications of computer vision: OCR.

Optical Character Recognition (OCR) is a method for extracting text from images and turning it into machine-readable text that a program can access and manipulate, for example as a string variable in any programming language. With OCR, an image can be converted into a text file, with its constituent parts stored as text data.

OCR software separates letters from the image, assembles them into words, and then assembles the words into sentences, so that the original content can be retrieved and edited. It also eliminates the need for manual data entry. OCR systems combine hardware and software to convert printed documents into machine-readable text.

Hardware, such as an optical scanner or a specialised circuit board, copies or reads the text; software then typically handles the more advanced processing. OCR software can also incorporate artificial intelligence (AI) to implement more sophisticated Intelligent Character Recognition (ICR) techniques, such as identifying languages or handwriting styles. The main reason to use OCR is to let users format, edit, and search documents.

Dataset used

MNIST (“Modified National Institute of Standards and Technology”) is the de facto “Hello World” dataset of computer vision. Since its release in 1999, this well-known collection of handwritten digit images has served as a standard benchmark for classification algorithms. Even as new machine learning techniques emerge, MNIST remains a reliable resource.

Sample from MNIST dataset

Example with code

The dataset consists of 60,000 training images and 10,000 test images, already split into training and test sets. x_train and x_test hold the grayscale pixel values, while y_train and y_test hold the labels, the digits 0 through 9. Checking the shape of the data tells you whether it can be fed to a CNN. The shape of x_train is (60000, 28, 28): 60,000 images, each 28×28 pixels. The Keras convolution layers, however, expect a 4-dimensional array (samples, height, width, channels), whereas we only have a 3-dimensional NumPy array, so we reshape it to add a single channel axis:

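from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Add a channel axis: (N, 28, 28) -> (N, 28, 28, 1)
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)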

After reshaping, we convert the 4-dimensional NumPy arrays to float32 so that the division in the next step produces floating-point values:

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

Next comes normalisation, a step we always perform for our neural network models. Dividing by 255, the maximum pixel value, scales every pixel into the range 0 to 1:

x_train /= 255
x_test /= 255
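
To make the three preprocessing steps above concrete, here is a minimal NumPy-only sketch that applies them to a small stand-in batch. The helper `preprocess` and the array `fake_batch` are hypothetical names introduced for illustration; they are not part of the original code, and the pixel values are random rather than real MNIST digits.

```python
import numpy as np

def preprocess(images):
    """Turn (N, 28, 28) uint8 images into (N, 28, 28, 1) float32 in [0, 1]."""
    images = images.reshape(images.shape[0], 28, 28, 1)  # add a channel axis
    images = images.astype('float32')                    # integers -> floats
    return images / 255                                  # scale 0..255 to 0..1

# Hypothetical stand-in for a few MNIST images (random pixel values)
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

out = preprocess(fake_batch)
print(out.shape, out.dtype, out.min(), out.max())
```

The cast to float32 has to come before the in-place `/=`: NumPy cannot store the result of a true division back into an integer array, so dividing the original uint8 data in place would raise a casting error.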
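
With the data normalised, we can define the model:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential()
model.add(Conv2D(28, kernel_size=(3, 3), input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(128, activation=tf.nn.relu))
model.add(Dropout(0.2))
model.add(Dense(10, activation=tf.nn.softmax))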

The model is built with the Keras API. Import the Sequential model from Keras first, then add the Conv2D, MaxPooling2D, Flatten, Dropout, and Dense layers.
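
To see how the image shape changes as it passes through these layers, here is a small plain-Python sketch of the shape arithmetic. It assumes the Keras defaults that the model code does not set explicitly: no padding and a stride of 1 for Conv2D, and a pooling stride equal to the pool size. The helper names `conv2d_output` and `maxpool_output` are hypothetical.

```python
def conv2d_output(size, kernel, stride=1):
    """Spatial size after a convolution with no padding."""
    return (size - kernel) // stride + 1

def maxpool_output(size, pool):
    """Spatial size after non-overlapping max pooling with no padding."""
    return size // pool

side = conv2d_output(28, 3)      # 28x28 input, 3x3 kernel -> 26
side = maxpool_output(side, 2)   # 2x2 pooling -> 13
filters = 28                     # number of Conv2D filters in the model
flat = side * side * filters     # length of the vector produced by Flatten
print(side, flat)                # 13 4732
```

Under these assumptions, the first Dense layer receives a vector of 4,732 values per image.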

Dropout layers reduce overfitting by ignoring a random subset of neurons during training. The Flatten layer converts the multi-dimensional feature maps into a 1D array before they are passed to the fully connected (Dense) layers.
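
The two ideas can be illustrated with a short NumPy sketch. This is not how Keras implements these layers internally; it is a simplified illustration, and the `dropout` function is a hypothetical helper. It uses "inverted" dropout, where the values that are kept are scaled by 1 / (1 - rate) so the expected sum stays the same, which matches the documented training-time behaviour of the Keras Dropout layer. The (13, 13, 28) shape assumes default padding and strides in the convolution and pooling layers.

```python
import numpy as np

rng = np.random.default_rng(42)

# Flatten: a (13, 13, 28) feature map becomes one vector of 13*13*28 values
feature_map = rng.random((13, 13, 28))
flat = feature_map.reshape(-1)

def dropout(x, rate, rng):
    """Inverted dropout: zero out roughly `rate` of the values, rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate
    return np.where(keep_mask, x / (1.0 - rate), 0.0)

activations = np.ones(1000)
dropped = dropout(activations, rate=0.2, rng=rng)

print(flat.shape)              # (4732,)
print((dropped == 0).mean())   # fraction of zeroed values, close to 0.2
```

Dropout only applies during training; at prediction time the layer passes its input through unchanged.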

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x=x_train, y=y_train, epochs=10)

model.evaluate(x_test, y_test)

Evaluation result for test data

model.evaluate(x_train, y_train)

Evaluation result for training data

To check whether the model predicts correctly, we can run the following code. Here we take an arbitrary sample from the test data (index 2853) and check the prediction for it; the model correctly predicts the character in the image.

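import matplotlib.pyplot as plt

image_index = 2853
# Show the test image, then predict on a batch containing only that image
plt.imshow(x_test[image_index].reshape(28, 28), cmap='Greys')
plt.show()
pred = model.predict(x_test[image_index].reshape(1, 28, 28, 1))
print(pred.argmax())  # index of the highest probability = predicted digit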

Output

The predicted output with the input image
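
The `argmax()` call in the prediction code works because the final softmax layer outputs one probability per digit, so the index of the largest value is the predicted digit. Here is a minimal NumPy sketch of that step; the `softmax` helper and the `logits` values are made up for illustration and do not come from the trained model.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores for one image, one score per digit 0..9
logits = np.array([[0.1, 0.3, 0.2, 4.0, 0.0, 1.2, 0.5, 0.4, 0.9, 0.6]])
probs = softmax(logits)   # shape (1, 10); the row sums to 1

predicted_digit = int(probs.argmax(axis=-1)[0])
print(predicted_digit)    # 3
```

For a batch of one image, `pred.argmax()` without an axis gives the same answer. For larger batches, pass `axis=-1` to get one prediction per image.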

That covers the basics of Optical Character Recognition (OCR). It is very useful for scanning text with a camera and converting it into text for further processing, and for converting physical documents into PDF files.

Happy learning!!

Sudarshan S

Tech enthusiast | Developer | Machine learning | Data science | Cybersecurity