Boulder detection
tions for mapping geogenic reefs (Heinicke et al.,
in press), used to characterise geogenic reefs over
larger areas. The agreement between the human
experts is calculated using the F1 score of the re-
sulting confusion matrix. An F1 score of 1.0 indi-
cates perfect agreement, while the lowest value is
0, when either precision or recall is 0. The F1 score
is calculated from the confusion matrix by F1 = 2 ×
(precision × recall) / (precision + recall). Values for
each class (no boulders, one to five boulders and
more than five boulders) were averaged.
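As a concrete illustration, the per-class F1 scores and their average can be computed as follows (a minimal sketch; the precision/recall values in the example are invented for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = 2 * (precision * recall) / (precision + recall); 0 if both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_class: list[tuple[float, float]]) -> float:
    """Average the per-class F1 scores (macro average over the three classes:
    no boulders, one to five boulders, more than five boulders)."""
    return sum(f1_score(p, r) for p, r in per_class) / len(per_class)

# hypothetical (precision, recall) pairs for the three annotation classes
print(macro_f1([(1.0, 1.0), (0.8, 0.5), (0.6, 0.9)]))
```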
2.4 Automatic boulder count
2.4.1 Neural network
Artificial neural networks are composed of series
of interconnected layers of artificial neurons. In
a trained neural network, input signals are trans-
formed by changing weights at each connection,
until the last layer of the network reports the re-
sult of the computation. Convolutional neural
networks (CNNs) are a subset of neural networks and
were developed for image classification with over-
whelming success. While the architecture of CNNs
varies, all include a series of convolutional layers
that operate by convolving a small part, often 3 × 3
pixels, of the underlying image (or the output of an
earlier layer in the network) with weights initialised
at random. This assumes that pixels in close vicin-
ity are more likely to form patterns significant for
the image context than pixels at greater distance.
The weights are adjusted during model training
with annotated images to minimise a loss function.
Loss functions compare the predictions of the
neural network to the annotations. To allow
CNNs to learn non-linear features, activation func-
tions change the output of layers in the network,
while regular downsampling of the image size al-
lows the network to learn features at larger scales.
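The building blocks described above — a small convolution with randomly initialised weights, a non-linear activation and downsampling — can be sketched in a few lines of NumPy (illustrative only; real CNNs stack many such layers and learn the weights by backpropagation against a loss function):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution (strictly cross-correlation, as in most
    deep-learning frameworks) of a single-channel image with a kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((8, 8))            # toy single-channel image patch
kernel = rng.standard_normal((3, 3))  # 3 x 3 weights initialised at random
feature = np.maximum(conv2d(image, kernel), 0.0)  # ReLU activation
# 2 x 2 max-pooling: the downsampling that lets deeper layers see larger scales
pooled = feature[:6, :6].reshape(3, 2, 3, 2).max(axis=(1, 3))
print(image.shape, feature.shape, pooled.shape)
```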
The automated boulder count was done using
the YOLO (You Only Look Once) framework, de-
veloped by Joseph Redmon (Redmon et al. 2015),
with the current implementation available under a
permissive license on GitHub (https://github.com/
AlexeyAB/darknet). Lary et al. (2016) and Schmid-
huber (2015) give a detailed description of convo-
lutional neural networks and their application for
image interpretation.
The YOLO network was developed for object
detection. To identify and locate different objects
in images is more complicated than the classifica-
tion of entire images and requires a different net-
work architecture. YOLO is a one-stage detector,
meaning it analyses images in one pass (hence the
abbreviation, You Only Look Once) while keeping
high accuracy. One-stage detectors are a faster
approach compared to other object detection
frameworks that rely on multiple stages for object
detection in images. The YOLO architecture is de-
scribed by Bochkovskiy et al. (2020). In principle, it
uses a series of different convolutional layers (the
backbone and neck) to extract object features and
divide the input image into grids at three different
resolutions. For each grid cell at each resolution,
it predicts the probability that the cell includes a
learned object within anchor boxes of predefined
size. These probabilities and the corresponding
bounding box coordinates are the output of the
trained model. YOLO networks are available in dif-
ferent configurations of the backbone, of which
we here utilise the standard configuration of YOLO
version 4.
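The per-cell decoding step described above can be sketched as follows (a simplified, single-anchor version of the YOLO decoding scheme; the argument names and anchor sizes are illustrative, not taken from this study):

```python
import math

def decode_cell(tx, ty, tw, th, obj,
                cell_x, cell_y, grid_size,
                anchor_w, anchor_h):
    """Decode the raw prediction of one grid cell into a bounding box.

    Simplified single-anchor sketch of the YOLO decoding scheme:
    the cell predicts an offset within itself for the box centre and
    a scaling of a predefined anchor box for the box size.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    # box centre: cell offset plus squashed prediction, normalised to [0, 1]
    bx = (cell_x + sigmoid(tx)) / grid_size
    by = (cell_y + sigmoid(ty)) / grid_size
    # box size: predefined anchor scaled by the exponentiated prediction
    bw = anchor_w * math.exp(tw)
    bh = anchor_h * math.exp(th)
    # objectness: probability that the cell contains a learned object
    return bx, by, bw, bh, sigmoid(obj)
```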
2.4.2 Model training and application
To create the training data sets, a human inter-
preter identified bounding boxes of boulders in
training areas in QGIS 3.16. Boulders were required
to have a shadow. The boulders were exported as
an SQLite database. The training database for the
SSS model includes 13,847 boulder instances. A
model was trained on a data set with an empha-
sis on small boulders comprising only a few pixels;
this data set comprises 4,070 entries. The MBES
training database was only started with the inves-
tigation site reported here (Fig. 2). It is not possible
to use the same training data sets for MBES and
SSS models, since the position accuracy of the
side-scan sonar is not good enough to co-locate
features of only a few pixels in size. Therefore, the
MBES training data set comprises 2,654 instances
of boulders (Fig. 2), with typical sizes of 3 × 3 to
3 × 15 pixels including shadows. The training mo-
saics were cut into small georeferenced images of
64 × 64 pixels (corresponding to approximately
16 m × 16 m in this study), overlapping by six pix-
els to minimise the number of training boulders
that are cut by image boundaries. In the following,
the pixel coordinates of the annotated examples
were calculated and used as an input for training.
Besides the annotated boulder examples, 182 ex-
amples of empty images (defined as containing no
boulders) were selected for the MBES data set and
2,349 examples of empty images for the SSS data
set.
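The tiling scheme described above can be sketched as follows (a minimal version; how the trailing margin of a mosaic is handled is an assumption — here it is simply dropped):

```python
def tile_origins(width, height, tile=64, overlap=6):
    """Upper-left pixel corners of tile x tile image tiles that overlap
    by `overlap` pixels, as used to cut the training mosaics into
    64 x 64 pixel images (sketch; trailing remainder pixels are dropped)."""
    step = tile - overlap  # 58-pixel stride between neighbouring tiles
    xs = range(0, max(width - tile, 0) + 1, step)
    ys = range(0, max(height - tile, 0) + 1, step)
    return [(x, y) for y in ys for x in xs]

print(len(tile_origins(640, 640)))
```

For a 640 × 640 pixel mosaic this yields 10 × 10 tile origins at a 58-pixel stride.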
For training, we used the YOLO network ver-
sion 4, in contrast to earlier case studies that used
the RetinaNet framework (Lin et al.
2017). We adhered to suggestions published on
the project's GitHub page and changed the de-
fault configuration of the YOLO network. There-
fore, the maximum number of training batches
was reduced to 6,000 for MBES models and 24,000
for SSS models, the number of classes reduced to
one, and the filter number of the convolutional
layers before the object detection layers reduced
to 18. Images were magnified to 512 × 512 pixels
before training.
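In the darknet .cfg format used by this implementation, the settings just listed correspond roughly to the following fragment (a sketch; the section layout follows the darknet configuration format, and only the lines changed from the defaults are shown):

```ini
[net]
width=512
height=512
max_batches=6000   # MBES models; 24000 for SSS models

[convolutional]
filters=18         # layer before each detection layer: (classes + 5) * 3

[yolo]
classes=1          # single class: boulder
```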
Random variations in hue, expo-
sure and saturation applied to the image were re-
duced from their standard settings to 0.1. The size
of the input image was changed by 40 % every
ten batches at random, and the size and aspect