Research Summary #2
A short summary of the paper: FathomNet: An underwater image training database for ocean exploration and discovery
link: https://arxiv.org/pdf/2007.00114
What have they written about?
They describe some initial exploration and experimentation performed to develop the FathomNet database, a large database of ocean imagery.
Primarily, the goal was to try out methods for developing a localisation-annotated dataset from the existing seed data that they had.
The main technical problem they worked on was understanding how a dataset of single-labelled images could be used to produce box-annotated images that enable object detection tasks.
The Dataset
Their seed dataset comprised more than 26,000 hours of high-resolution deep-sea video from ROVs. This was stored, annotated, and maintained as part of the Video Annotation and Reference System (VARS) at MBARI.
For initial exploration they selected the 18 most frequently occurring midwater genera and the 17 most frequently occurring benthic genera.
This data consisted of video frames, each given a label according to the species most dominantly observed in it. Often the label was even a species that was not the most dominant one in the frame. Images also frequently had multiple instances or classes of interest in a single frame, but still only a single label was given.
The data consisted of iconic and non-iconic views of organisms.
Iconic: Natural images in which only a single object of interest exists, with the image usually zoomed in to include only the class of interest.
Non-iconic view: Much more natural images in which multiple objects of interest might exist and the class of interest won't be particularly zoomed in on. Much more representative of inference and real deployment settings.
ML Experimentation
They experimented with ML algorithms to speed up their dataset generation/bounding-box annotation process. Three types of models were created: Image Classification, Weakly Supervised Localisation, and Object Detection.
Image Classification
Trained a ResNet50 model pre-trained on ImageNet, separately for benthic and midwater imagery.
Midwater model: 15 species, ~1,000 images per genus. To extend to multi-label classification, they took the top-3 or top-N categories and used those classes to tag an image (a rough sketch of this setup follows below). Top-1 accuracy was 85.7% and top-3 was 92.9%. Using this model on actual midwater transect data showed very poor performance, since the model was trained on iconic zoomed-in shots of species while transect data was full of non-iconic imagery better suited to an object detection task.
Benthic model: 15 classes, ~33,000 images; top-1 and top-3 accuracies were ~72% and ~93%. The large jump between the two makes sense because benthic data is more noisily labelled: many images contain multiple classes/concepts but receive only a single label.
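A minimal sketch, assuming PyTorch/torchvision (the paper doesn't specify their framework), of what an ImageNet-pretrained ResNet50 with a swapped head plus top-k tagging could look like. The image path, the helper name `topk_tags`, and the preprocessing values are my own illustration, not the authors' code.

```python
# Hypothetical sketch, not the authors' pipeline: fine-tune an
# ImageNet-pretrained ResNet50 head for N genera and tag frames
# with the top-k predicted categories.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

NUM_GENERA = 15  # e.g. the 15 genera used in the midwater model

# Start from ImageNet weights and replace the classification head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_GENERA)

# Standard ImageNet preprocessing (assumed; the paper doesn't say).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def topk_tags(image_path: str, k: int = 3) -> list[int]:
    """Return indices of the k most probable genera for one frame."""
    model.eval()
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return probs.topk(k, dim=1).indices.squeeze(0).tolist()
```

(Training the swapped head on the ~1,000 images per genus would happen before `topk_tags` is useful; that loop is omitted here.)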
Weakly Supervised Localisation
A very interesting approach: combining Grad-CAM++ with a trained ResNet model to get localisation outputs.
They were testing whether object proposals could be generated from single-labelled imagery. The heatmap produced by Grad-CAM++ can be used to draw rough bounding boxes around the areas most important to a particular classification output.
The trained ResNet50 benthic model was used.
Two ways of searching were used: dominant class search and class-specific search.
Dominant class search: Grad-CAM++ produces maps for whatever class the ResNet outputs. This is the normal way Grad-CAM-style methods work (at least from what I know).
Class-specific search: Take a specific class of interest and search for the activations for that class, presumably by computing the map against that class's logit rather than the predicted one. Not 100% sure how this would work.
Overall, the approach didn't work too well but did help in identifying a few instances in an image. Detections degraded as the number of instances of a class in the image increased. They had other priorities to work on, so they didn't explore this approach much further. A sketch of the heatmap-to-box step follows below.
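The paper doesn't show the box-extraction code, but the heatmap-to-box step could plausibly look like this sketch: threshold the class-activation map and take one box per connected blob. The threshold value and the use of scipy's connected-component labelling are my assumptions.

```python
# My reconstruction, not the paper's code: turn a class-activation
# heatmap (e.g. from Grad-CAM++) into rough bounding boxes.
import numpy as np
from scipy import ndimage

def heatmap_to_boxes(heatmap: np.ndarray, thresh: float = 0.5):
    """Threshold a heatmap and return one (x0, y0, x1, y1) box per blob.

    For class-specific search, the heatmap would presumably be computed
    against the logit of the class of interest rather than the
    predicted (dominant) class.
    """
    mask = heatmap >= thresh * heatmap.max()
    labelled, n_blobs = ndimage.label(mask)  # connected components
    boxes = []
    for blob_id in range(1, n_blobs + 1):
        ys, xs = np.where(labelled == blob_id)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))
    return boxes
```

If this is roughly what they did, it also hints at why detections degrade with more instances: activations from nearby individuals merge into a single above-threshold blob, so separate objects end up sharing one fused box.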
Object Detection
Used RetinaNet with a ResNet50 backbone.
Annotated over 3K benthic images to end up with 23K bounding-box annotations.
The class distribution is long-tailed and strongly right-skewed.
Two possible approaches:
Club all classes together into a single 'Object' class, detect objects in images, and let expert humans label those objects (see the sketch after this list).
Identify hierarchies of classes, operate at a limited level of the hierarchy to predict at that level, and push this into a human-in-the-loop framework to help expert/non-expert humans.
Both are very much WIP; not much has been done in these areas yet.
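The first approach could be mocked up as below: run a detector and deliberately throw away the class labels, so every detection becomes an 'Object' proposal for a human annotator. This sketch uses torchvision's COCO-pretrained RetinaNet as a stand-in for their fine-tuned model; the function name and score threshold are mine.

```python
# Hypothetical sketch of approach 1, not MBARI's pipeline: class-agnostic
# detection, surfacing raw boxes as 'Object' proposals for human labelling.
import torch
from torchvision.models.detection import retinanet_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained stand-in; their model was trained on benthic imagery.
detector = retinanet_resnet50_fpn(weights="DEFAULT").eval()

def object_proposals(image_path: str, score_thresh: float = 0.4):
    """Return (box, score) pairs with class labels deliberately discarded."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = detector([img])[0]
    keep = out["scores"] >= score_thresh
    # Every surviving detection is just an 'Object' for the human to label.
    return list(zip(out["boxes"][keep].tolist(),
                    out["scores"][keep].tolist()))
```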
Some Learnings and Discussions
Automated image and video annotation should start at a higher taxonomic category instead of a lower one. This is sort of obvious: you will have more examples per category if you start higher, and the model will also learn better since the shapes and features of organisms differ a lot from one high-level taxon to another. At a lower (finer) taxonomic level, however, the model is more likely to get confused, since individuals of different classes at that level share more similarities than classes do at a higher level. A toy illustration of rolling labels up to a higher rank follows below.
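Purely as an illustration (the genus-to-class mapping here is mine, not the paper's), 'starting higher' just means rolling fine-grained labels up to a coarser rank before training, which pools examples into fewer, better-populated categories.

```python
# Illustrative only: roll genus-level labels up to a coarser taxonomic
# rank before training; unknown genera pass through unchanged.
GENUS_TO_CLASS = {
    "Sebastes": "Actinopterygii",    # rockfishes -> ray-finned fishes
    "Chionoecetes": "Malacostraca",  # tanner crabs -> malacostracans
    "Aegina": "Hydrozoa",            # narcomedusae -> hydrozoans
}

def roll_up(genus: str) -> str:
    return GENUS_TO_CLASS.get(genus, genus)

labels = ["Sebastes", "Aegina", "Chionoecetes", "Sebastes"]
coarse = [roll_up(g) for g in labels]
# -> ['Actinopterygii', 'Hydrozoa', 'Malacostraca', 'Actinopterygii']
```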
A promising path toward better and quicker annotation seems to be few-shot learning. They want to try using models from this paradigm to quickly train models in a human-in-the-loop setting. Hopefully they have detailed these experiments in some other paper.