Monocular Navigation in Duckietown Using LEDNet Architecture


Project Resources

Project highlights

Here is a visual tour of the authors’ work on implementing monocular navigation using the LEDNet architecture in Duckietown*.

*Images from “Monocular Robot Navigation with Self-Supervised Pretrained Vision Transformers,” M. Saavedra-Ruiz, S. Morin, L. Paull. arXiv: https://arxiv.org/pdf/2203.03682

Why monocular navigation?

Image sensors are ubiquitous thanks to their well-known sensory traits (e.g., distance measurement, robustness, accessibility, and variety of form factors). Achieving autonomy with monocular vision, i.e., using only one image sensor, is desirable, and much work has gone into approaches to this task. Duckietown’s first Duckiebot, the DB17, was designed with a camera as its only sensor to highlight the importance of this challenge!

But images, due to the integrative nature of image sensors and the physics of the image formation process, are subject to motion blur, occlusions, and sensitivity to environmental lighting conditions. These effects challenge the ability of “traditional” computer vision algorithms to extract reliable information.

In this work, the author uses “LEDNet” to mitigate some of the known limitations of image sensors for use in autonomy. LEDNet’s high-resolution encoder-decoder architecture enables lane-following and obstacle detection. The model processes images at high frame rates, allowing timely recognition of turns, bends, and obstacles for decision-making. The higher resolution improves both the ability to differentiate road markings from obstacles and classification accuracy.
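The paper’s actual LEDNet definition is not reproduced here, but the encoder-decoder idea can be illustrated with a minimal PyTorch sketch: strided convolutions shrink the image into a compact representation, and transposed convolutions expand it back to a full-resolution, per-pixel class map (layer sizes and the class set below are invented for this example, not taken from the paper):

```python
import torch
import torch.nn as nn


class TinySegNet(nn.Module):
    """Illustrative encoder-decoder segmentation net (NOT the authors' LEDNet).

    The encoder halves the spatial resolution twice with strided convolutions;
    the decoder upsamples back to the input resolution and emits per-pixel
    logits for hypothetical classes, e.g. background / lane marking / obstacle.
    """

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, num_classes, kernel_size=2, stride=2),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


model = TinySegNet()
# A dummy 64x128 RGB frame: the output has one logit map per class,
# at the same spatial resolution as the input.
logits = model(torch.zeros(1, 3, 64, 128))
```

Because the output resolution matches the input, every pixel gets a class decision, which is what makes the road-marking vs. obstacle distinction possible downstream.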

LEDNet’s obstacle-avoidance algorithm can classify and detect obstacles even at higher speeds. Unlike Vision Transformer (ViT) models, LEDNet avoids missing parts of obstacles, reducing the risk of robot collisions.

The model handles small obstacles by identifying them early and navigating around them. In the simulated Duckietown environment, LEDNet outperforms other models in lane-following and obstacle-detection tasks.

LEDNet uses “real-time” image segmentation to provide the Duckiebot with the information needed for steering decisions. While the study was conducted in simulation, the model’s performance suggests it could transfer to real-world scenarios with consistent lighting and predictable obstacles.
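The article does not spell out how segmentation output becomes a steering command, but a simple (hypothetical) post-processing step illustrates the idea: steer toward the horizontal centroid of the pixels the network labeled as drivable. The function name, gain, and mask layout below are assumptions for this sketch, not the authors’ controller:

```python
import numpy as np


def steering_from_mask(lane_mask: np.ndarray, gain: float = 2.0) -> float:
    """Map a binary drivable-area mask to a steering command in [-1, 1].

    Hypothetical controller: negative values steer left, positive steer
    right, proportional to how far the drivable-area centroid sits from
    the horizontal center of the image.
    """
    _, w = lane_mask.shape
    ys, xs = np.nonzero(lane_mask)
    if xs.size == 0:
        return 0.0  # nothing drivable detected: default to straight ahead

    # Normalized offset of the drivable-area centroid from image center.
    half_width = (w - 1) / 2.0
    offset = (xs.mean() - half_width) / half_width
    return float(np.clip(gain * offset, -1.0, 1.0))


# Drivable area concentrated in the right half of a tiny 4x8 mask:
# the centroid is right of center, so the command steers right (positive).
mask = np.zeros((4, 8), dtype=np.uint8)
mask[:, 5:] = 1
cmd = steering_from_mask(mask)
```

Running this per frame on the segmentation output is what makes the high frame rate matter: each new mask immediately updates the steering command.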

The next step is to try it out!

Monocular Navigation in Duckietown Using LEDNet Architecture - the challenges

In implementing monocular navigation in this project, the author faced several challenges: 

  1. Computational demands: LEDNet’s high-resolution processing requires substantial computational resources, particularly when handling real-time image segmentation and obstacle detection at high frame rates.

  2. Limited handling of complex environments: The lane-following and obstacle-avoidance algorithm used in this study does not handle crossroads or junctions, limiting the model’s ability to navigate complex road structures.

  3. Simulation vs. real-world application: The study relies on a simulated environment where lighting, obstacle behavior, and road conditions are consistent. Implementing the system in the real world introduces variability in these factors, which may degrade the model’s performance.

  4. Small obstacle detection: While LEDNet performs well in detecting small obstacles compared to ViT, detection of small obstacles still depends on image resolution and segmentation quality.

Project Report

Project Author

Angelo Broere is currently working as an Oproepkracht at Compressor Parts Service, Netherlands.

Learn more

Duckietown is a modular, customizable and state-of-the-art platform for creating and disseminating robotics and AI learning experiences.

It is designed to teach, learn, and do research: from exploring the fundamentals of computer science and automation to pushing the boundaries of knowledge.