Obesity is one of the major health problems in developed countries and it is a contributing factor for many diseases. Keeping a record of daily meal intake is an effective way of tackling obesity and overweight. This can be done by developing apps that automatically recommend a short list of the most probable foods by analyzing a photo taken of the food. In addition, uncontrolled food intake can lead to micronutrient deficiency, which shortens life expectancy. Tracking our daily meals is therefore one way to cope with vitamin deficiency and tackle obesity at the same time.
Instead of using conventional dietary record methods, we can develop a system that accurately estimates the calorie intake of the user by analyzing images of foods. After finding the category of a food, we can retrieve its nutrition facts and estimate its calorie content. By tracking this information over time, the system can recommend that the user change their eating habits or prepare a particular food.
Recognizing foods from images poses several challenges. On the one hand, foods are highly deformable objects; the visual pattern of the same food, on the same plate and under the same ambient light, can change significantly. On the other hand, occlusion of some ingredients by others makes food recognition more difficult. Food presentation and cooking style are other factors that make recognition harder, and different seasoning and chopping may change the appearance of the same dish. From a machine learning perspective, these factors give rise to the intra-class variation problem in food classification. Another challenge is inter-class similarity. For instance, pasta with meat and pasta with vegetables may look visually similar, and misclassifying one as the other leads to a miscalculation of their calories.
There are powerful off-the-shelf machine learning models, such as support vector machines and random forests, that are able to learn complex decision boundaries. In order to use these models, we first need to extract a compact and informative feature vector. For this purpose, hand-crafted features such as color histograms, HOG, SIFT, LBP, Gabor filters or Bags of Features (BoF) can be extracted from an image. After extracting features, we train a classifier using one of the models mentioned above.
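As a concrete illustration, the following is a minimal sketch of this classical pipeline using scikit-image and scikit-learn: a color histogram and a HOG descriptor are concatenated into a feature vector and an SVM is trained on top. The image size, bin counts and SVM hyperparameters are illustrative assumptions, not values used in the thesis.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def extract_features(image):
    """Hand-crafted feature vector: color histogram + HOG."""
    image = resize(image, (128, 128), anti_aliasing=True)  # floats in [0, 1]
    # Color histogram: 8 bins per RGB channel, concatenated and normalized.
    hist = np.concatenate(
        [np.histogram(image[..., c], bins=8, range=(0, 1))[0] for c in range(3)]
    ).astype(float)
    # HOG descriptor computed on the grayscale image.
    hog_desc = hog(rgb2gray(image), orientations=9,
                   pixels_per_cell=(16, 16), cells_per_block=(2, 2))
    return np.concatenate([hist / hist.sum(), hog_desc])

# Hypothetical usage: `train_images` and `train_labels` are assumed to be loaded elsewhere.
# X = np.stack([extract_features(im) for im in train_images])
# clf = SVC(kernel="rbf", C=10.0).fit(X, train_labels)
```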
Recent studies have shown that hand-crafted feature extraction methods are not adequately accurate for classifying foods. This is mainly due to the fact that different classes of food may overlap in the feature space. In other words, with hand-crafted features, distinct classes of food might not be separable even by a non-linear decision boundary (i.e., two or more distinct classes might overlap in the feature space). To train a model with higher accuracy, we need a more powerful representation of food images. Recently, Convolutional Neural Networks (ConvNets) have achieved impressive results on the ImageNet dataset. A ConvNet usually consists of several convolution and pooling layers, which give it more representational power than the hand-crafted approaches.
Given a dataset of adequate size and diversity, it is possible to find an architecture with an admissible accuracy on this dataset. The network can be found manually, by trying various architectures, or by using automatic architecture search algorithms. Although finding an architecture is still an interesting topic in this field, recent networks have been built by composing several micro-architectures (building blocks). It has also been shown that networks with considerably different architectures produce comparable results.
In this thesis, we focus our attention on the more demanding issues of neural networks that are important for developing practical applications. Specifically, we study knowledge adaptation, domain adaptation, active learning and knowledge distillation. Throughout the thesis, we will also study a few other topics and propose new techniques.
Knowledge adaptation is a technique that utilizes the current knowledge of a network and adapts it to perform a new task. It is a branch of the broader topic of transfer learning. Given a pre-trained network, the common approach is to keep the early layers of the network unchanged and finetune the late layers on the new task. This approach is inspired by the observation that early layers of a network learn general features, whereas late layers learn task-specific features.
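A minimal sketch of this common recipe in PyTorch is given below, using torchvision's ResNet-18 as an illustrative pre-trained network; the choice of which stage to unfreeze and the number of food classes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

num_food_classes = 101  # assumed number of food categories

model = models.resnet18(pretrained=True)

# Freeze all layers first (early layers keep their general features).
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last residual stage (task-specific features) and replace the head.
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, num_food_classes)  # trainable by default

# Only the trainable parameters are given to the optimizer.
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
```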
Randomly picking the layers to be trained and the layers to be kept unchanged is unlikely to be an accurate solution. Alternatively, we show how the Fisher Information Matrix can be used to find which layers must be finetuned during knowledge adaptation. Then, we will propose a multi-task loss function to further improve the results.
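As an illustration of the underlying idea, the sketch below computes a common empirical approximation of the diagonal Fisher information and aggregates it per layer; layers with larger values are more sensitive to the task and are candidates for finetuning. This is a generic sketch, not the exact procedure proposed in the thesis.

```python
from collections import defaultdict
import torch
import torch.nn.functional as F

def layerwise_fisher(model, data_loader, device="cpu"):
    """Empirical diagonal Fisher information, summed per layer."""
    fisher = defaultdict(float)
    model.eval()
    for images, labels in data_loader:
        model.zero_grad()
        log_probs = F.log_softmax(model(images.to(device)), dim=1)
        F.nll_loss(log_probs, labels.to(device)).backward()
        for name, param in model.named_parameters():
            if param.grad is not None:
                layer = name.rsplit(".", 1)[0]            # group parameters by layer
                fisher[layer] += param.grad.pow(2).sum().item()
    return fisher  # larger values suggest layers worth finetuning
```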
One issue with neural networks is that they do not inherently compute the uncertainty of their predictions. This is important in practical applications, since we mainly want to report predictions made with high certainty. In other words, if the network is uncertain about the predicted class, we should not report it to the user.
We will explain different techniques for computing the uncertainty of a prediction from a single forward pass. We will also show a technique called Monte Carlo dropout for estimating the uncertainty. Then, we will propose a method based on multi-crop evaluation that improves the results and approximates the uncertainty simultaneously.
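A minimal sketch of Monte Carlo dropout is shown below: dropout layers are kept active at test time, the prediction is averaged over several stochastic forward passes, and the entropy of the averaged prediction is used as an uncertainty proxy. The number of passes and the use of entropy are illustrative choices.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predict(model, x, n_samples=20):
    model.eval()
    # Re-enable only the dropout layers so they stay stochastic at test time.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)                       # averaged prediction
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)
    return mean_probs, entropy                           # higher entropy = less certain
```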
Whereas knowledge adaptation assumes that the pre-trained network is going to learn a different task using labeled data, unsupervised domain adaptation assumes that the network is going to perform the same task but the domains of the source and target datasets are different and, in addition, the target dataset is unlabeled. Domain adaptation is another branch of transfer learning; it studies how to reduce the divergence between two domains or how to learn an accurate mapping under domain shift. We will explain the most commonly used techniques in this field, and we will propose a technique which is computationally more efficient than the state-of-the-art methods and performs better on the task of food classification. Our method is a variant of self-training in which we take into account the uncertainty of the prediction and use an ensemble of predictions rather than a single prediction.
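The sketch below conveys the general flavor of such uncertainty-aware self-training: target images are pseudo-labeled with an ensemble of predictions over stochastic augmentations, and only confident (low-entropy) predictions are kept for retraining. The augmentation, threshold and number of passes are assumptions for illustration, not the exact settings of the proposed method.

```python
import torch
import torch.nn.functional as F

def augment(images):
    # Assumed stochastic augmentation: random horizontal flip of the batch.
    return torch.flip(images, dims=[3]) if torch.rand(1).item() < 0.5 else images

def pseudo_label_target(model, target_loader, n_passes=10, max_entropy=0.5):
    """Pseudo-label unlabeled target images, keeping only confident predictions."""
    kept_images, kept_labels = [], []
    model.eval()
    with torch.no_grad():
        for images in target_loader:                     # target labels are unavailable
            probs = torch.stack(
                [F.softmax(model(augment(images)), dim=1) for _ in range(n_passes)]
            ).mean(dim=0)                                # ensemble of predictions
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            keep = entropy < max_entropy                 # discard uncertain samples
            kept_images.append(images[keep])
            kept_labels.append(probs[keep].argmax(dim=1))
    return torch.cat(kept_images), torch.cat(kept_labels)

# The pseudo-labeled target samples are then mixed with the labeled source set
# and the network is retrained; the procedure can be iterated.
```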
One way to improve the results is to exploit large amounts of unlabeled data. Previous results on semi-supervised learning have shown that accuracy improves when unlabeled data is used together with labeled data. Another practical way to use unlabeled data is through a technique called active learning. Given an annotation budget, the major goal of active learning is to pick samples from a pool of unlabeled samples and ask a human to annotate them; the number of selected samples cannot exceed the budget. The newly annotated samples are added to the dataset and used to improve the network. The main question in active learning is how to pick samples so that they work better than random sampling. We will study various techniques for sample selection and perform a complete experiment on the task of food classification. We will show that informativeness measures work better than random selection.
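As an illustration, the sketch below implements pool-based selection with an entropy informativeness measure; a random baseline would simply permute the pool and take the first `budget` samples. The loader is assumed to iterate over the unlabeled pool in a fixed order.

```python
import torch
import torch.nn.functional as F

def select_for_annotation(model, unlabeled_loader, budget):
    """Return the indices of the `budget` most uncertain unlabeled samples."""
    scores, indices = [], []
    model.eval()
    with torch.no_grad():
        for batch_idx, images in enumerate(unlabeled_loader):
            probs = F.softmax(model(images), dim=1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
            scores.append(entropy)
            indices.append(torch.arange(len(images))
                           + batch_idx * unlabeled_loader.batch_size)
    scores, indices = torch.cat(scores), torch.cat(indices)
    # Entropy-based selection; random selection would instead use
    # torch.randperm(len(indices))[:budget].
    return indices[scores.argsort(descending=True)[:budget]]
```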
The last part of the thesis deals with the problem of making our models compact. Our ultimate goal is to deploy the model on an embedded system or a smartphone with limited computational resources. Hence, it is essential to make the network as compact as possible in order to reduce the inference time and use the resources efficiently. We will define a loss function that uses labeled and unlabeled data to train a smaller network (the mentee) with the help of a bigger network (the mentor). The first term in our loss function computes the cross-entropy between the normalized smoothed logits produced by the mentor and the normalized logits produced by the mentee. The second term computes the mean squared error between the logits produced by the two networks on unlabeled images.
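A sketch of such a two-term loss is given below. The temperature and the weighting between the two terms are illustrative assumptions, and applying the temperature only to the mentor's logits follows the literal description above rather than a confirmed implementation detail.

```python
import torch
import torch.nn.functional as F

def mentee_loss(mentor_logits_l, mentee_logits_l,    # logits on labeled images
                mentor_logits_u, mentee_logits_u,    # logits on unlabeled images
                temperature=4.0, alpha=1.0):
    # Term 1: cross-entropy between the mentor's smoothed (softened) distribution
    # and the mentee's distribution on labeled images.
    soft_targets = F.softmax(mentor_logits_l / temperature, dim=1)
    log_mentee = F.log_softmax(mentee_logits_l, dim=1)
    ce_term = -(soft_targets * log_mentee).sum(dim=1).mean()
    # Term 2: mean squared error between the raw logits of the two networks
    # on unlabeled images.
    mse_term = F.mse_loss(mentee_logits_u, mentor_logits_u)
    return ce_term + alpha * mse_term
```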
Then, we will propose a new network architecture with lower complexity. Our network is composed of successive convolution layers followed by several fire-residual modules, an expansion block and a classification layer. The first convolution layers quickly reduce the spatial size of the input, the fire-residual modules perform the complex transformations, and the expansion block makes it possible to have rectangular receptive fields. We simply use the softened logits as ground truth and use the cross-entropy to transfer the knowledge of the mentor to the mentee. Our experiments show that not only is the mentee able to imitate the mentor, it is also able to slightly improve the results.
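For reference, the sketch below shows a fire module wrapped in a residual connection, in the spirit of the fire-residual blocks mentioned above (the fire module follows SqueezeNet); the exact channel counts and the expansion block of the proposed architecture are not reproduced here.

```python
import torch
import torch.nn as nn

class FireResidual(nn.Module):
    def __init__(self, channels, squeeze_channels, expand_channels):
        super().__init__()
        assert 2 * expand_channels == channels, "skip connection needs matching channels"
        self.squeeze = nn.Sequential(
            nn.Conv2d(channels, squeeze_channels, kernel_size=1),
            nn.ReLU(inplace=True))
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=3,
                                   padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.squeeze(x)
        out = torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)
        return self.relu(out + x)   # residual connection around the fire module
```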