AutoDuck - Dual-Mode Autonomous Navigation in Duckietown

Project Resources

Dual-Mode Autonomous Navigation in Duckietown using VLMs - the objectives

This project aims to implement an autonomous navigation system on the Duckiebot DB21J platform within Duckietown, enabling vision-based control and decision-making using a Vision Language Model (VLM).

The system integrates calibrated camera intrinsics and extrinsics, motor gain and trim calibration, ROS nodes for perception and control, AprilTag-based semantic localization, stop line detection, a lane filter for lateral pose estimation, a finite state machine for state transitions, PID controllers for velocity and steering regulation, and quantized Qwen 2.5 models for multimodal inference on embedded hardware.
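
Such a finite state machine can be sketched in a few lines. The following is a minimal illustration only; the state names and transition triggers are assumptions for the sketch, not the project's actual implementation.

```python
# Minimal navigation FSM sketch; states and triggers are illustrative assumptions.
from enum import Enum, auto


class NavState(Enum):
    LANE_FOLLOWING = auto()
    INTERSECTION_STOP = auto()
    TURNING = auto()
    RECOVERY = auto()


class NavigationFSM:
    def __init__(self):
        self.state = NavState.LANE_FOLLOWING

    def step(self, stop_line_detected: bool, turn_done: bool, lane_lost: bool) -> NavState:
        if self.state is NavState.LANE_FOLLOWING:
            if lane_lost:
                self.state = NavState.RECOVERY
            elif stop_line_detected:
                self.state = NavState.INTERSECTION_STOP
        elif self.state is NavState.INTERSECTION_STOP:
            # A VLM query or AprilTag reading would select the turn here.
            self.state = NavState.TURNING
        elif self.state is NavState.TURNING:
            if turn_done:
                self.state = NavState.LANE_FOLLOWING
        elif self.state is NavState.RECOVERY:
            if not lane_lost:
                self.state = NavState.LANE_FOLLOWING
        return self.state
```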

The work establishes a reproducible pipeline for benchmarking navigation algorithms, enabling analysis of trade-offs between model size, inference latency, memory limits, communication overhead, and control cycle timing in real-time robotic systems.

VLM in Duckietown - visual project highlights

The technical approach and challenges

At the technical level, the approach involves the following:

The method integrates calibrated camera intrinsics and extrinsics for distortion correction and frame alignment, motor gain and trim calibration for odometry consistency, and ROS-based perception nodes for lane filtering, stop line detection, obstacle recognition, and AprilTag-based pose estimation. Control nodes implement PID regulators for velocity and steering, parameterized turning primitives, and synchronized execution through a finite state machine that coordinates lane following, intersection stopping, turning maneuvers, and recovery states. Sensor fusion combines camera streams, encoder feedback, and ToF measurements for robust decision inputs.
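
As a concrete illustration of the control side, here is a minimal PID regulator of the kind described. The gains, time step, and combined error signal (lateral offset d plus weighted heading error phi) are placeholder assumptions, not the project's tuned values.

```python
# Minimal PID sketch for steering regulation; gains and dt are placeholders.
class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error: float) -> float:
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Example: the lane filter outputs lateral offset d and heading error phi;
# a combined error signal of this form is a common choice in Duckietown-style
# lane following (the 0.5 weight is an assumption).
steering = PID(kp=5.0, ki=0.2, kd=0.1, dt=0.05)
d, phi = 0.04, -0.1                      # metres, radians (made-up readings)
omega = steering.update(d + 0.5 * phi)   # angular velocity command
print(f"omega = {omega:.3f} rad/s")
```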

Quantized Qwen 2.5 vision-language models were deployed with llama.cpp, configured with a reduced context window and batch size to match GPU memory limits. The models were evaluated on trajectory planning and visual reasoning tasks, with both 7B and 3B variants tested under different quantization schemes. Integration required Docker containerization for portability and ROSBridge for monitoring and remote interaction.
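
A hedged sketch of what such a deployment can look like with the llama-cpp-python bindings is shown below. The model filename, parameter values, and prompt are assumptions, and a vision-language variant would additionally require the matching multimodal projector file, omitted here for brevity.

```python
# Sketch: loading a quantized GGUF model with a reduced context window and
# batch size to fit tight GPU memory. All values are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-3b-instruct-q4_k_m.gguf",  # hypothetical filename
    n_ctx=1024,        # reduced context window
    n_batch=128,       # reduced batch size
    n_gpu_layers=20,   # partial offload to the embedded GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "The lane ahead curves left. Should the robot "
                          "turn left, turn right, or go straight?"}],
    max_tokens=16,
)
print(out["choices"][0]["message"]["content"])
```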

Challenges included GPU memory capacity restricting larger model execution, inference latency exceeding the 100 ms control cycle requirement, CUDA feature mismatches across builds, and instability in container runtimes on the NVIDIA Jetson Nano platform. These issues necessitated systematic parameter tuning of the controllers, quantization of the VLMs to the GGUF format, pruning strategies to reduce computational load, and hybrid offloading of visual reasoning to external compute nodes while keeping low-level perception and control local. Additional constraints involved balancing message-passing overhead in ROS, synchronization delays between perception and control nodes, and variability in inference reproducibility across hardware builds.
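
The hybrid offloading pattern can be approximated over ROSBridge with the roslibpy client, as sketched below. The host address, topic names, and message flow are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: shipping camera frames to an external compute node via ROSBridge and
# receiving high-level VLM decisions back; low-level control stays on-board.
import base64
import roslibpy

client = roslibpy.Ros(host='192.168.1.42', port=9090)  # hypothetical offload server
client.run()

images = roslibpy.Topic(client, '/duckiebot/camera/image/compressed',
                        'sensor_msgs/CompressedImage')
decisions = roslibpy.Topic(client, '/duckiebot/vlm/decision', 'std_msgs/String')

def on_decision(msg):
    # The remote node publishes a high-level decision string.
    print('VLM decision:', msg['data'])

decisions.subscribe(on_decision)

with open('frame.jpg', 'rb') as f:  # placeholder for a live camera frame
    images.publish(roslibpy.Message({
        'format': 'jpeg',
        'data': base64.b64encode(f.read()).decode('ascii'),  # rosbridge expects base64
    }))
```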

Report and Presentation

VLM in Duckietown: Authors

Sahil Virani is a student at Technical University of Munich, Germany.

Esmir Kico is a student at Technical University of Munich, Germany.

Visual Obstacle Detection using Inverse Perspective Mapping

Project Resources

Project highlights

Here is a visual tour of the authors’ work on implementing visual obstacle detection in Duckietown.

Visual Obstacle Detection: objective and importance

This project aims to develop a visual obstacle detection system using inverse perspective mapping, with the goal of enabling autonomous systems to detect obstacles in real time using images from a monocular RGB camera. It focuses on identifying specific obstacles, such as yellow Duckies and orange cones, in Duckietown.

The system ensures safe navigation by avoiding obstacles within the vehicle’s lane or stopping when avoidance is not feasible. It does not utilize learning algorithms, prioritizing a hard-coded approach due to hardware constraints. The objective includes enhancing obstacle detection reliability under varying illumination and object properties.

It is intended to simulate realistic scenarios for autonomous driving systems. The key evaluation metrics are detection accuracy, false positives, and missed obstacles under diverse conditions.

The method and the challenges of visual obstacle detection using Inverse Perspective Mapping

The system processes images from a monocular RGB camera by applying inverse perspective mapping to generate a bird's-eye view, under the assumption that all pixels lie on the ground plane; obstacles violate this assumption and appear distorted in the transformed view, which simplifies their detection. Obstacle detection involves HSV color filtering, image segmentation, and classification using eigenvalue analysis. The reaction strategies include trajectory planning or stopping, based on the detected obstacle's position and lane constraints.
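
A minimal sketch of this pipeline using OpenCV is given below. The homography points, HSV thresholds, and area threshold are illustrative placeholders, not the authors' calibration.

```python
# Sketch: inverse perspective mapping to a bird's-eye view, then HSV filtering
# and connected-component segmentation. All numeric values are assumptions.
import cv2
import numpy as np

img = cv2.imread('frame.jpg')            # placeholder camera frame
if img is None:                          # fall back to a blank frame for the sketch
    img = np.zeros((480, 640, 3), np.uint8)

# Ground-plane homography: four image points mapped to a top-down grid.
src = np.float32([[120, 300], [520, 300], [0, 480], [640, 480]])
dst = np.float32([[0, 0], [400, 0], [0, 400], [400, 400]])
H = cv2.getPerspectiveTransform(src, dst)
bev = cv2.warpPerspective(img, H, (400, 400))   # assumes all pixels on ground

# HSV filtering for duckie-yellow (orange cones would use a second range).
hsv = cv2.cvtColor(bev, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, (20, 100, 100), (35, 255, 255))

# Segmentation: connected components give candidate obstacle blobs; each blob
# could then be classified via eigenvalue (PCA) analysis of its pixel spread.
n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] > 50:  # reject speckle noise
        print('candidate obstacle at BEV pixel', centroids[i])
```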

Computational efficiency is a significant challenge due to the hardware limitations of the Raspberry Pi, necessitating the avoidance of real-time re-computation of color corrections. Variability in lighting and motion blur impact detection reliability, while accurate calibration of camera parameters is essential for precise 3D obstacle localization. Integration of avoidance strategies faces additional challenges due to inaccuracies in pose estimation and trajectory planning.
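
One way to avoid per-frame re-computation is to precompute a lookup table once at startup. The sketch below uses a gamma-correction LUT as a stand-in for the authors' color correction; the gamma value is an assumption.

```python
# Sketch: precompute a gamma LUT once so the per-frame cost is a single
# cv2.LUT call instead of recomputing the correction on the Raspberry Pi.
import cv2
import numpy as np

gamma = 1.6  # illustrative assumption
lut = np.clip(((np.arange(256) / 255.0) ** (1.0 / gamma)) * 255.0,
              0, 255).astype(np.uint8)   # computed once at startup

frame = cv2.imread('frame.jpg')
if frame is None:                         # blank fallback for the sketch
    frame = np.zeros((480, 640, 3), np.uint8)
corrected = cv2.LUT(frame, lut)           # cheap per-frame application
```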

Visual Obstacle Detection using Inverse Perspective Mapping: Full Report

Visual Obstacle Detection using Inverse Perspective Mapping: Authors

Julian Nubert is currently a Research Assistant & Doctoral Candidate at the Max Planck Institute for Intelligent Systems, Germany.

Niklas Funk is a PhD student at Technische Universität Darmstadt, Germany.

Fabio Meier is currently working as the Head of Operational Data Intelligence at Sensirion Connected Solutions, Switzerland.

Fabrice Oehler is working as a Software Engineer at Sensirion, Switzerland.

Intersection Navigation in Duckietown Using 3D Image Features

Project Resources

Project highlights

Here is a visual tour of the authors’ work on implementing intersection navigation using 3D image features in Duckietown.

Intersection Navigation in Duckietown: Advancing with 3D Image Features

Intersection navigation in Duckietown using 3D image features is an approach intended to improve autonomous intersection navigation, enhancing decision-making and path planning in complex Duckietown environments, i.e., those made of several road loops and intersections.

The traditional approach to intersection navigation in Duckietown is naive: (a) stop at the red line before the intersection; (b) read the AprilTag-equipped traffic signs (providing information on the shape of, and coordination mechanism at, the intersection); (c) decide which direction to take; (d) coordinate with other vehicles at the intersection to avoid collisions; (e) navigate through the intersection. This last step is performed in an open-loop fashion, leveraging the known appearance specifications of intersections in Duckietown.

By incorporating 3D image features, extrapolated from the Duckietown road lines, into the perception pipeline, Duckiebots can estimate their pose while crossing the intersection, therefore closing the loop and improving navigation accuracy, in addition to facilitating the development of new strategies for intersection navigation, such as real-time path optimization.
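
A minimal sketch of the underlying idea, projecting detected road-line pixels onto the ground plane with a homography and recovering a heading estimate, is shown below. The homography matrix and the pixel detections are made-up placeholders.

```python
# Sketch: close the loop at an intersection by projecting line detections to
# the ground plane and fitting their direction. Values are placeholders.
import numpy as np

H = np.array([[0.002, 0.0,   -0.3],
              [0.0,   0.002, -0.5],
              [0.0,   0.004,  1.0]])  # hypothetical image->ground homography

line_pixels = np.array([[310, 420], [318, 390], [327, 360], [335, 330]], float)

# Homogeneous projection of each pixel onto the ground plane (metres).
pts = np.c_[line_pixels, np.ones(len(line_pixels))] @ H.T
ground = pts[:, :2] / pts[:, 2:3]

# The first principal direction of the projected points gives the line's
# orientation; its angle serves as a heading estimate relative to the road.
direction = np.linalg.svd(ground - ground.mean(axis=0))[2][0]
heading = np.arctan2(direction[1], direction[0])
print(f'heading relative to the line: {heading:.3f} rad')
```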

Combining 3D image features with methods such as Bird's Eye View (BEV) transformations allows for comprehensive representations of the intersection. The integration of these techniques improves the accuracy of stop line detection and obstacle avoidance, contributes to advancing autonomous navigation algorithms, and supports real-world deployment scenarios.

An AI-generated representation of Duckietown intersection navigation challenges.

The method and the challenges of intersection navigation using 3D features

The thesis involves implementing the MILE model (Model-based Imitation LEarning for urban driving), trained on the CARLA simulator, into the Duckietown environment to evaluate its performance in navigating unprotected intersections.

Experiments were conducted using the Gym-Duckietown simulator, where Duckiebots navigated a 4-way intersection across multiple trajectories. Metrics such as success rate, drivable area compliance, and ride comfort were used to assess performance.
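
The definitions below are plausible reconstructions of the quantities named above, included only to make the metrics concrete; they are not the thesis' exact formulas.

```python
# Sketch of the evaluation metrics; definitions are assumptions for illustration.
import numpy as np

def success_rate(outcomes):
    """Fraction of intersection crossings that reached the target lane."""
    return float(np.mean(outcomes))

def drivable_area_compliance(on_road_flags):
    """Fraction of trajectory samples that stayed on the drivable surface."""
    return float(np.mean(on_road_flags))

def ride_comfort(velocities, dt):
    """Mean absolute acceleration as a simple comfort proxy (lower is better)."""
    return float(np.mean(np.abs(np.diff(velocities) / dt)))

print(success_rate([1, 1, 0, 1]),
      drivable_area_compliance([1, 1, 1, 0, 1]),
      ride_comfort([0.0, 0.2, 0.4, 0.4, 0.3], dt=0.1))
```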

The findings indicate that while the MILE model achieved state-of-the-art performance in the CARLA simulator, its generalization to the Duckietown environment without additional training was limited, as might be expected given the sim2real gap.

The BEVs generated by MILE were not sufficiently representative of the actual road surface in Duckietown, leading to suboptimal navigation performance. In contrast, the homographic BEV method, despite its assumption of a flat world plane, provided more accurate representations for intersection navigation in this context.

As for most approaches in robotics, there are limitations and trade-offs to analyze.

Here are some technical challenges of the proposed approach:

  • Generalization across environments: one of the challenges is ensuring that the 3D image feature representation generalizes well across different simulation environments, such as Duckietown and CARLA. The differences in scale, road structures, and dynamics between simulators can impact the performance of the navigation system.
  • Accuracy of BEV representations: the transformation of camera images into Bird’s Eye View (BEV) representations suffers reduced accuracy, especially when dealing with low-resolution or distorted input data.
  • Real-time processing: the integration of 3D image features for navigation requires substantially more computational resources than 2D features. Achieving near real-time processing speeds for navigation tasks such as intersection navigation is challenging.

Intersection Navigation in Duckietown Using 3D Image Features: Full Report

Intersection Navigation in Duckietown Using 3D Image Features: Authors

Jasper Mulder is currently working as a Junior Outdoor Expert at Bever, Netherlands.

Learn more

Duckietown is a modular, customizable, and state-of-the-art platform for creating and disseminating robotics and AI learning experiences.

Duckietown is designed to teach, learn, and do research: from exploring the fundamentals of computer science and automation to pushing the boundaries of knowledge.

These spotlight projects are shared to exemplify Duckietown’s value for hands-on learning in robotics and AI, enabling students to apply theoretical concepts to practical challenges in autonomous robotics, boosting competence and job prospects.