Is it easy to detect whether an image contains a dog or a human? And if it is a dog, which breed is it?

Deep learning helps us answer!

Project Definition

This post is about my Udacity Data Science Nanodegree capstone project. The project I chose was to use Convolutional Neural Networks to identify dog breeds.

A Convolutional Neural Network (CNN) is a deep learning algorithm that takes images as input and learns which features (aspects or objects) matter most for telling one image from another and assigning it to its main class.

The main problem is to classify images with the best possible performance: determine whether an image contains a human or a dog and, for a dog, which breed it is. This task is not easy when the number of images is large and the breeds look alike.

How can we solve this problem? Deep learning is one of the best alternatives for classifying images and detecting whether an image contains a human or a dog. It can also identify the breed when there is a dog in the image, which is hard when the breeds are very similar.

Deep learning techniques have become much easier to use in recent years. Techniques such as transfer learning improve model performance with the help of a pre-trained CNN. The idea is simple: a model that has already been trained on a large dataset can be reused to classify similar images.

These techniques are suitable because they let us obtain good performance within the limitations we have. With them we can build an algorithm that tells whether an image contains a human or a dog. The algorithm has three parts:

  1. Human Face Detector.
  2. Dog Detector.
  3. Dog Breed Classifier.

To evaluate performance, the main metrics were classification accuracy and F-score. Classification accuracy is the ratio of correct predictions to the total number of input samples. It works well for simple classification tasks, but it is not suitable when the classes do not have an equal number of samples.

accuracy = 100 * count(predictions == test_targets) / (total predictions)
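
As a minimal sketch, the accuracy formula above can be written in plain Python. The function name `accuracy_percent` and the sample labels are illustrative assumptions, not part of the project code.

```python
def accuracy_percent(predictions, test_targets):
    """Percentage of predictions that match the true labels."""
    if len(predictions) != len(test_targets):
        raise ValueError("predictions and test_targets must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    correct = sum(p == t for p, t in zip(predictions, test_targets))
    return 100 * correct / len(predictions)


# Hypothetical breed indices for four test images.
print(accuracy_percent([3, 7, 7, 1], [3, 7, 2, 1]))  # 75.0
```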

To account for misclassification under class imbalance, especially in the dog breed classifier, the F-score was also used, as is common in deep learning. It is the harmonic mean of precision and recall, so it is high only when both are high. Precision and recall are also the two axes of the precision-recall (PR) curve, a two-dimensional graph with precision on the y-axis and recall on the x-axis. The F-score is most often used when learning from imbalanced data.

F = 2 * precision * recall / (precision + recall)
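
The post does not say how the F-score was aggregated over the 133 breeds. One common choice, assumed here purely for illustration, is to compute it per class and take the plain average (a macro average). The function names below are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f_score(predictions, targets):
    """Average the per-class F-score over every class seen in targets."""
    pairs = list(zip(predictions, targets))
    scores = []
    for c in sorted(set(targets)):
        tp = sum(p == c and t == c for p, t in pairs)  # correctly predicted as c
        fp = sum(p == c and t != c for p, t in pairs)  # wrongly predicted as c
        fn = sum(p != c and t == c for p, t in pairs)  # missed examples of c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(f_score(precision, recall))
    return sum(scores) / len(scores)


# Hypothetical labels for four images and two classes.
print(round(macro_f_score([0, 0, 1, 1], [0, 1, 1, 1]), 4))  # 0.7333
```

Because every class counts equally in a macro average, a rare breed that is always missed pulls the score down even when overall accuracy looks fine.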

Analysis

The dataset used in this project has 8,351 images in total, covering 133 different breeds. It was divided into three sets: train, validation, and test.

The 133 breeds in the dataset are imbalanced. Some breeds, such as Border collie or Basset hound, are better represented than others.

An aspect that matters in image classification is the shape of the images. To train the network, the images must be resized to the square input that the architecture expects, in this case 224×224 pixels. Resizing may degrade image quality or lead to noisy labels. Upscaling is more damaging than downscaling because it affects the accuracy of the model. In the training set, 46 images need upscaling.
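
A quick way to inspect the imbalance is to count the images per breed. The sketch below assumes that each image sits in a folder named after its breed; the paths shown are hypothetical and the real folder names may differ.

```python
from collections import Counter
from pathlib import PurePosixPath


def images_per_breed(image_paths):
    """Count images per breed, taking the parent folder name as the breed label."""
    return Counter(PurePosixPath(p).parent.name for p in image_paths)


# Hypothetical paths, only to show the call.
paths = [
    "train/Border_collie/a.jpg",
    "train/Border_collie/b.jpg",
    "train/Basset_hound/c.jpg",
]
counts = images_per_breed(paths)
print(counts.most_common())  # [('Border_collie', 2), ('Basset_hound', 1)]
```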

Figure: shapes of the images (height × width)
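
To see how many images would need upscaling, each image's height and width can be compared with the 224-pixel target. This sketch assumes an image needs upscaling when either side is shorter than the target; the shapes listed are made up, and in practice they would be read from the image files.

```python
TARGET_SIZE = 224  # input side length expected by the network


def needs_upscaling(height, width, target=TARGET_SIZE):
    """True when either side is smaller than the target, so resizing must enlarge it."""
    return height < target or width < target


# Hypothetical (height, width) pairs.
shapes = [(500, 375), (200, 300), (224, 224), (120, 100)]
small = [s for s in shapes if needs_upscaling(*s)]
print(len(small))  # 2
```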

Methodology

The images must be preprocessed before the models can use them. The tasks include converting the image to grayscale and resizing it, which improves the models in terms of time and compute resources.

The project is divided into three main parts because the solution must detect whether an image contains a human or a dog and, if there is a dog, which breed it is. Each part follows a different process, and they were joined into a single algorithm.

Human Face Detector.

This detector uses OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. OpenCV provides many pre-trained face detectors. This one has been trained on many images with faces (positives) and without faces (negatives). Its detectMultiScale function finds the coordinates of all the faces and returns them as a list of rectangles. The function I built returns True if that list has at least one element.

This model was evaluated and performed well, but it can be improved: of the 100 dog images it classified, 11 were misclassified as containing a human face.
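
A minimal sketch of such a detector is shown below. It assumes the `opencv-python` package, which ships its cascade files under `cv2.data.haarcascades`; the choice of `haarcascade_frontalface_alt.xml` and the helper `count_detections` are assumptions for illustration, not necessarily what the project used.

```python
def face_detector(img_path, cascade_file="haarcascade_frontalface_alt.xml"):
    """Return True if the Haar cascade finds at least one face in the image."""
    import cv2  # imported here so the helper below works without OpenCV installed

    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + cascade_file)
    img = cv2.imread(img_path)
    if img is None:
        raise FileNotFoundError(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # the cascade works on grayscale
    faces = face_cascade.detectMultiScale(gray)   # one (x, y, w, h) rectangle per face
    return len(faces) > 0


def count_detections(detector, image_paths):
    """Number of images for which the detector reports a face."""
    return sum(1 for path in image_paths if detector(path))
```

Calling `count_detections(face_detector, dog_paths)` on a list of 100 dog image paths (here a hypothetical `dog_paths`) would give the number of dogs wrongly flagged as humans.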

Dog Detector

The dog detector was evaluated and performed well: of the 100 images of human faces it classified, none were misclassified.

Dog Breeds classifier

To identify the breed, a CNN was built. This is a type of deep learning algorithm used for classification tasks. Each layer of the network specializes in extracting different kinds of information; for example, the first layers detect lines or curves. The training images were normalized by dividing each pixel value by 255.

The CNN was composed of 4 layers. The input was an image preprocessed into a 224 by 224 matrix with three RGB channels, giving an input shape of (224,224,3):

First layer: Its purpose is to identify low-level features such as edges in the image. It was composed of a two-dimensional convolution layer with 16 output filters, a kernel size of 2, padding that keeps the spatial size unchanged, and a ReLU activation function. It was followed by a two-dimensional MaxPooling layer that reduces the spatial dimensions of the output volume from (224,224,16) to (112,112,16).

Second layer: This layer was similar to the first one, but the number of output filters changed to 32.

Third layer: This layer was also similar to the first one, but the number of output filters changed to 64, and the output shape is (28,28,64).

Fourth layer: This layer was composed of a two-dimensional GlobalAveragePooling layer, which averages each feature map over the whole image and reduces the output to a vector with one value per output filter (64), and a Dense layer with 133 nodes and a softmax activation that gives a probability for each dog breed.

The model was trained and tested. Its performance was poor: 5.9% accuracy and an F-score of 0.06. This type of algorithm needs a large training dataset and substantial compute resources, and both were limited.
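
The shapes quoted for the four layers can be checked with a little arithmetic. The sketch below is not the project's code; it assumes 'same' padding in every convolution, 2×2 max pooling, and 16, 32, and 64 filters, and it also counts the trainable parameters those assumptions imply.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * in_channels * filters + filters


def trace_scratch_cnn(side=224, channels=3, filters=(16, 32, 64), kernel=2, n_breeds=133):
    """Return the output shape after each block and the total parameter count."""
    shapes, params = [], 0
    for f in filters:
        params += conv_params(kernel, channels, f)  # 'same' padding keeps the side length
        side //= 2                                  # 2x2 max pooling halves each side
        channels = f
        shapes.append((side, side, channels))
    shapes.append((channels,))                      # global average pooling
    params += channels * n_breeds + n_breeds        # dense softmax layer
    shapes.append((n_breeds,))
    return shapes, params


shapes, total = trace_scratch_cnn()
print(shapes)  # [(112, 112, 16), (56, 56, 32), (28, 28, 64), (64,), (133,)]
print(total)   # 19189
```

Under these assumptions the whole network has fewer than twenty thousand parameters, which is tiny for a 133-class image problem.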

Human Face Detector.

Another option for detecting faces is the HoG face detector in Dlib, a widely used face detection model based on HoG features and an SVM. With this model, 6 of the 100 dog images classified were misclassified, so it performed better than Haar cascades.

Dog Breeds classifier

This part of the algorithm was improved with transfer learning. Four networks available in Keras were tried, and the best one was selected: VGG-19, ResNet-50, Inception, and Xception. This technique uses less compute and time, and it also helps to improve models when the dataset is small.

The deployed architecture was simple because these models are pre-trained, which means fewer parameters have to be trained. The architecture takes the features extracted by each pre-trained model and adds only a two-dimensional GlobalAveragePooling layer, which averages each feature map and reduces the output to one value per output filter, and a Dense layer with 133 nodes and a softmax activation that gives a probability for each dog breed.

All of these models performed better than the previous one. ResNet-50, InceptionV3, and Xception reached an accuracy above 80% and an F-score above 0.8. Xception was selected as the dog breed classifier.
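
A head like the one described could be sketched in Keras as follows. This is an assumption-laden illustration: it presumes TensorFlow is installed, that the pre-trained network's features were computed beforehand, and that the optimizer and loss shown are acceptable; none of these details are confirmed by the post. The helper `head_parameter_count` is hypothetical and only shows how few weights are trained.

```python
def build_transfer_head(bottleneck_shape, n_breeds=133):
    """Small classifier trained on top of features from a frozen pre-trained CNN."""
    from tensorflow.keras import layers, models  # lazy import: TensorFlow is optional here

    model = models.Sequential([
        layers.Input(shape=bottleneck_shape),          # e.g. (7, 7, 2048), an assumed shape
        layers.GlobalAveragePooling2D(),               # one average per feature map
        layers.Dense(n_breeds, activation="softmax"),  # one probability per breed
    ])
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


def head_parameter_count(channels, n_breeds=133):
    """Trainable weights and biases of the Dense layer that follows the pooling."""
    return channels * n_breeds + n_breeds


print(head_parameter_count(2048))  # 272517
```

With 2048 feature maps, as in the final block of Xception, only the Dense layer is trained, so the number of trainable parameters stays small compared with the pre-trained network itself.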

Results

Finally, we can build a function that joins the three parts into an algorithm that detects whether an image contains a human or a dog and, for a dog, which breed it is.

The Xception model is robust: under k-fold cross-validation, the accuracy was stable and did not fluctuate much. This indicates that the model is robust against small perturbations in the training data. The mean accuracy is 85.13% (+/- 1.08%).

With 10 folds, the cross-validation accuracy fluctuated between 83% and 87%.

The performance of the final algorithm is good. However, it classified some images incorrectly, such as a drawing of a human, and some breeds were misclassified.

A sample of 80 images was taken from test_files to test the dog breed classifier. 77.5% of the images were classified correctly, and 22.5% were misclassified. The breed with the most misclassified images was the English cocker spaniel, which is similar to the Boykin spaniel.

The solution answers the proposed problem: it can estimate whether an image contains a human face, a dog, or neither and, if it is a dog, which breed it is. However, it might not be optimal in a real-world application because its F-score is lower than that of other deep learning (CNN) deployments, where this metric reaches values above 0.9.
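
The function that joins the three parts could look roughly like this. The name `describe_image`, the order of the checks (dog first, then human), and the returned messages are assumptions; the detectors and the breed predictor are passed in as callables so the sketch runs on its own.

```python
def describe_image(img_path, dog_detector, face_detector, predict_breed):
    """Combine the three parts: dog check, human check, then breed prediction."""
    if dog_detector(img_path):
        return f"Dog detected. Predicted breed: {predict_breed(img_path)}"
    if face_detector(img_path):
        return "Human detected."
    return "Neither a dog nor a human was detected."


# Stand-in callables instead of the real models.
print(describe_image("rex.jpg", lambda p: True, lambda p: False, lambda p: "Boykin spaniel"))
# Dog detected. Predicted breed: Boykin spaniel
```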
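
For readers unfamiliar with the procedure, here is a minimal sketch of how k folds can be formed and how a "mean (+/- spread)" figure is obtained. It uses consecutive, unshuffled folds and the population standard deviation, both of which are assumptions; the accuracies in the example are made up and are not the project's results.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for k consecutive folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def summarize(scores):
    """Mean and population standard deviation of per-fold accuracies."""
    return mean(scores), pstdev(scores)


# Made-up accuracies, only to show the call.
m, s = summarize([84.0, 86.0, 85.0, 85.0])
print(f"{m:.2f}% (+/- {s:.2f}%)")  # 85.00% (+/- 0.71%)
```

Each sample lands in the validation fold exactly once, so every image contributes to the reported spread.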

Conclusion

Deep learning is a field that becomes more important each year. It is interesting how much flexibility it offers for problems like image classification. Techniques such as transfer learning let us build and deploy models with the help of pre-trained models, which transfer their knowledge to a smaller dataset. This is possible because the convolutional layers extract general, low-level features that apply across images.

The performance of the algorithm was good, but it could improve. The following points could be considered:

  • Provide a larger training set.
  • Use a human face detector that does not require the face to be well oriented. OpenCV-DNN and HoG methods are recommended for this task.
  • Use more images of dogs in training.
  • Increase the number of epochs and use techniques that avoid overfitting, such as early stopping, where training stops once the validation accuracy has not improved for a few epochs.
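
The early stopping rule in the last point can be sketched in a few lines. Keras offers an `EarlyStopping` callback with `monitor` and `patience` parameters for this purpose; the plain-Python function below is only an illustration of the logic, and its name and the sample accuracies are made up.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the 0-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best:
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up held-out accuracies per epoch.
print(early_stopping_epoch([0.60, 0.70, 0.72, 0.71, 0.72, 0.70], patience=3))  # 5
```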
