Reproducing DECODE, a U-Net that enables fast and dense single-molecule localization with high accuracy

Mahir Sabanoglu
Apr 16, 2021

DECODE is a promising computational tool for localizing single emitters at high density in 3D from 2D image frames. For our own educational purposes, we have tried to reproduce DECODE as presented in the paper by Artur Speiser et al., “Deep learning enables fast and dense single-molecule localization with high accuracy”. This was done primarily using the PyTorch library in Python. DECODE distinguishes itself from competing networks such as DeepSTORM3D and CSpline by reduced imaging time and improved accuracy. Its key characteristics are the two stacked U-Nets in the network’s architecture and the use of three consecutive image frames as input for the decision-making process.


Architecture as described in the paper The architecture, shown in Figure 1 below, consists of two U-Nets stacked in series. The first stage is called the frame analysis module and the second stage the temporal context module; both consist of two down- and two upsampling stages, starting with 48 filters. In each downsampling stage the resolution is halved and the number of filters is doubled. In each upsampling stage it is the other way around: the resolution is doubled (using nearest-neighbor interpolation) and the number of filters is halved. The convolutions in the downsampling stages have a 3x3 kernel size.

The first stage, the frame analysis module, consists of three U-Nets stacked in parallel with shared parameters, each processing one of the three image frames. Their outputs are concatenated and used directly as input for the temporal context module.
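A minimal PyTorch sketch of this weight sharing (our own naming, not the authors’ code): a single module instance is applied to each of the three frames, so the parameters are shared automatically, and the outputs are concatenated along the channel dimension. A plain convolution stands in for the full per-frame U-Net here.

```python
import torch
import torch.nn as nn

class FrameAnalysis(nn.Module):
    """Apply one shared sub-network to each of three consecutive frames."""
    def __init__(self, shared_net):
        super().__init__()
        self.shared_net = shared_net  # one instance => shared parameters

    def forward(self, x):  # x: [batch, 3, 40, 40], one channel per frame
        feats = [self.shared_net(x[:, i:i + 1]) for i in range(3)]
        return torch.cat(feats, dim=1)  # concatenate along channels

# Stand-in for the per-frame U-Net: 1 input channel -> 48 feature maps.
module = FrameAnalysis(nn.Conv2d(1, 48, kernel_size=3, padding=1))
out = module(torch.randn(2, 3, 40, 40))  # -> [2, 144, 40, 40]
```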

Each hidden layer has an Exponential Linear Unit (ELU) as activation function. As can be seen from the architecture, the network ends triple-headed with dimensions [1, 40, 40], [4, 40, 40] and [4, 40, 40]. These heads output, respectively, (1) the Bernoulli probability map p that an emitter was detected near each pixel, (2) the coordinates of the detected emitter dx, dy, dz and a nonnegative emitter brightness N, and (3) the uncertainties associated with each of these predictions: sigma_x, sigma_y, sigma_z, sigma_N. As final activation function, a logistic sigmoid nonlinearity is used for each nonnegative output and a hyperbolic tangent nonlinearity for the others.
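The three heads and their final nonlinearities can be sketched as follows (a hypothetical layout of ours; the actual DECODE code may wire this differently). Each head is a 1x1 convolution on the temporal context module’s 48 feature maps; sigmoids are applied to the nonnegative outputs (p, N and the sigmas) and tanh to the coordinate offsets.

```python
import torch
import torch.nn as nn

class TripleHead(nn.Module):
    """Sketch of DECODE's three output heads on 48 feature channels."""
    def __init__(self, in_ch=48):
        super().__init__()
        self.p_head = nn.Conv2d(in_ch, 1, kernel_size=1)      # detection probability
        self.coord_head = nn.Conv2d(in_ch, 4, kernel_size=1)  # dx, dy, dz, N
        self.sigma_head = nn.Conv2d(in_ch, 4, kernel_size=1)  # uncertainties

    def forward(self, feats):  # feats: [batch, 48, 40, 40]
        p = torch.sigmoid(self.p_head(feats))          # nonnegative -> sigmoid
        raw = self.coord_head(feats)
        dxyz = torch.tanh(raw[:, :3])                  # coordinate offsets -> tanh
        n = torch.sigmoid(raw[:, 3:])                  # brightness nonnegative
        sigma = torch.sigmoid(self.sigma_head(feats))  # uncertainties nonnegative
        return p, torch.cat([dxyz, n], dim=1), sigma

heads = TripleHead()
p, coords, sigma = heads(torch.randn(2, 48, 40, 40))
```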

Figure 1: DECODE’s architecture¹

Architecture consensus Some parts of the architecture were not described in the paper and others we could not implement successfully, so we had to make some design decisions of our own. These included:

  • ‘stride=None’ and ‘padding=1’ were used in every down-convolution
  • ‘kernel_size=2’ and ‘stride=2’ were used for the up-convolution
  • ConvTranspose2d was used for upsampling instead of nearest-neighbor interpolation
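Under these choices, one down- and one up-sampling stage can be sketched in PyTorch roughly as follows (our interpretation; the filter counts are illustrative). Note that for `nn.MaxPool2d`, `stride=None` simply defaults to the kernel size.

```python
import torch
import torch.nn as nn

def down_stage(in_ch, out_ch):
    """Halve the resolution and double the filters (3x3 conv, padding=1)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ELU(),
        nn.MaxPool2d(kernel_size=2, stride=None),  # stride=None -> stride=kernel_size
    )

def up_stage(in_ch, out_ch):
    """Double the resolution and halve the filters via transposed convolution."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
        nn.ELU(),
    )

x = torch.randn(1, 48, 40, 40)
down = down_stage(48, 96)(x)   # -> [1, 96, 20, 20]
up = up_stage(96, 48)(down)    # -> [1, 48, 40, 40]
```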

Data-set The data-set used to train DECODE is artificially generated and should, according to the paper, be easily accessible. However, it was not directly downloadable from their GitHub. To obtain the data we had to dive into their code, from which we eventually were able to generate the input images and target values and save them locally.

From this data-set we found that the format was not the size we expected, as it did not align directly with the output of DECODE. It consisted of ‘x_in[9999, 3, 40, 40]’ (the three consecutive time frames for 9999 samples) and a dict called ‘tar’, which was an assembly of the target data-sets for the three different heads (we named them ‘tar_0’, ‘tar_1’ and ‘tar_2’). Their sizes were respectively [9999, 250, 4], [9999, 250] and [9999, 40, 40]. After analyzing the targets we found that the second dimension of ‘tar_0’ is a placeholder for the emitters that may be present, as the generator used limits the number of emitters to 250, while the last dimension contains the N and the x-y-z coordinates of each emitter (as floating points with respect to the pixel grid). ‘tar_1’ is a vector of length 250 containing boolean values, where the number of ‘True’ values corresponds to the true number of present emitters. ‘tar_2’ contains per sample a [40, 40] tensor whose entries all hold the same, seemingly random floating point, which differs between samples. To be honest, we could not wrap our heads around the usage of ‘tar_1’ and ‘tar_2’; we scanned through their implementation of the two but could not figure it out.

To visualize the data, Figure 3 below depicts three consecutive time frames (extracted from ‘x_in’) overlaid with a red scatter plot marking the true positions of the emitters, where the marker size corresponds to the true N (extracted from ‘tar_0’).

Figure 3: Data visualization

Mutation of the network’s architecture As we could not figure out how to implement ‘tar_1’ and ‘tar_2’, we chose to mutate the architecture into a single-headed version with a logistic sigmoid nonlinearity as the final activation function, as shown in Figure 4; it predicts ‘a[40, 40]’.

Figure 4: Single-headed mutation of DECODE

To eventually be able to train the network, it is key that the ground-truth data-set complies with the dimensions of ‘a’. We therefore chose to transform ‘tar_0’ into a binary-grid representation of size [40, 40], with entries valued 1 where an emitter is localized and 0 elsewhere, whereby we neglect N for now for the sake of simplicity. With this transformation we lose some information, and thereby precision, as the coordinates are no longer represented by floating points but as integers referring to the pixel coordinate system.
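A sketch of this transformation (the function name and the column order of ‘tar_0’ are our assumptions; the exact layout should be checked against the generated data-set): the floating-point x-y coordinates of each real emitter are truncated to integer pixel indices and marked with a 1 in the grid.

```python
import torch

def to_binary_grid(tar_0_sample, mask, size=40):
    """Build a [size, size] binary grid from one sample's emitter list.

    tar_0_sample: [250, 4] tensor; we assume the columns hold N, x, y, z
    (this column order is an assumption on our side).
    mask: [250] boolean vector marking the emitters actually present.
    """
    grid = torch.zeros(size, size)
    for row in tar_0_sample[mask]:
        x, y = int(row[1]), int(row[2])  # floats truncated to pixel indices
        if 0 <= x < size and 0 <= y < size:
            grid[y, x] = 1.0
    return grid

# Tiny example: two emitters out of a 250-slot placeholder.
sample = torch.zeros(250, 4)
sample[0] = torch.tensor([1000.0, 3.7, 5.2, 0.0])    # N, x, y, z
sample[1] = torch.tensor([800.0, 10.1, 20.9, 0.0])
mask = torch.zeros(250, dtype=torch.bool)
mask[:2] = True
grid = to_binary_grid(sample, mask)  # two pixels set to 1, precision lost
```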

We named this data-set ‘y_binary’. Our GitHub repository includes the data-sets ‘x_in_small’, ‘tar_0_small’ and ‘y_binary_small’ (smaller data-sets of 1000 samples each, to satisfy GitHub’s file size limit of 25 MB). Figure 5 shows a visual representation of the binary grid.

Figure 5: Binary grid representation

Optimizer We used the same optimizer as the original DECODE, which is AdamW, with a learning rate of 6e-4 and a weight decay of 0.1.

Loss function Since we are no longer training the network to produce the same output, we implemented a different loss function. As the network can be seen as a binary classifier per pixel, representing either the presence or absence of an emitter, we chose Binary Cross-Entropy (BCE) as the loss function. Another reason to choose BCE is that a wrongly predicted class -in our case the absence or presence of an emitter- is penalized strongly due to the steep gradients of the loss, which leads to faster convergence.
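The optimizer and loss described above can be set up as follows; the tiny stand-in model is ours, just to make the snippet self-contained.

```python
import torch
import torch.nn as nn

# Stand-in model: 3 input frames -> per-pixel probability in [0, 1].
model = nn.Sequential(
    nn.Conv2d(3, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),
)

# Same optimizer settings as DECODE: AdamW, lr=6e-4, weight decay 0.1.
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1)
criterion = nn.BCELoss()  # per-pixel binary cross-entropy

# One training step on dummy data shaped like our data-set.
x = torch.randn(4, 3, 40, 40)                     # batch of 'x_in' samples
y = torch.randint(0, 2, (4, 1, 40, 40)).float()   # batch of 'y_binary' targets

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```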

Results The first implementation of the training loop was fairly simple. We used a small sub-sample set to prevent very long training times, at the cost of not having a generalizable network, but this way we could get an insight into whether our network would train at all. To verify whether the network would converge towards a solution, we fed samples from the training set to the previously trained network. Unfortunately, we could not see any results, as the visualized output was completely black. We tried several times with different numbers of epochs and sample sizes, but even with 1000 epochs on 1 sample we only managed to obtain a resemblance of the ground truth one single time. Different optimizers and batch sizes did not produce the desired output either.

For a long period we thought that our network somehow did not train, but we eventually concluded that its parameters must have changed, as the prediction from a freshly initialized network was clearly different from that of a trained network. This led us to investigate how a prediction changes during the training loop, which we did by plotting a prediction intermittently during training. The results are shown in Figure 6, in which the top-left plot is the output of the initialized network and the bottom-right the final prediction of the training run. It is visible that the network tries to lower the loss -as it is clearly updating its weights- but it keeps getting stuck in the wrong configuration.

Figure 6: Intermediate predictions during the training loop, using 1 sample

We did the same with a training loop in which 10 samples were used. These results, shown in Figure 7, are perhaps even more interesting, as the outcomes contain many different patterns.

Figure 7: Intermediate predictions during the training loop, using 10 samples

From these results we conclude that our network is very prone to getting stuck in local minima of the loss function, from which the optimizer AdamW was unable to ‘escape’, despite the stochasticity of its gradient estimates. This finding supports our hypothesis that the augmented data-set -the binary grid representation- is poorly suited for our purposes in combination with an architecture that was initially designed to make much more complex and different predictions.

Discussion Although the results were not what we had hoped for, they certainly contributed to our own educational purposes in a positive manner. We are excited to continue learning about deep learning and how to properly implement it in practice, as we will pursue the course Computer Vision. We learned the hard way that our augmented data-set does not suffice to train the network, and we strongly encourage making a better-suited data augmentation. A better augmentation could, for example, add to the binary grid a metric that defines the distance to the closest emitter. This way, some gradient is introduced into the ground truth, which could possibly prevent the network from converging to a local minimum.

Conclusion To conclude, we were able to reproduce the DECODE architecture, of which we made a single-headed mutation later in the project. We also managed to create a data-set that is publicly available in our GitHub repository; it contains the input images ‘x_in’ and the targets ‘tar_0’, ‘tar_1’ and ‘tar_2’. We additionally made a simplified data augmentation called ‘y_binary’, but this data-set is better left untouched, as it is clearly not suited as a target for our network. The optimizer used was the same as DECODE’s, but the loss function differs to better suit the purpose of predicting binary values. Unfortunately, this resulted in a working network that is highly prone to converging to local minima of the loss function. We therefore strongly recommend making a better data augmentation, as this could possibly result in a network with some performance.

Acknowledgment We want to especially thank our teaching assistant Joris, as the meetings with him were always helpful and he really supported us in continuing, even though the network’s structure was quite complex and the results were not promising. Besides Joris, we want to thank our teacher for this course, as he was able to explain complex material in an understandable manner.

[1]: A. Speiser et al., “Deep learning enables fast and dense single-molecule localization with high accuracy” (October 2020).
