AutoDuck - Dual-Mode Autonomous Navigation in Duckietown

Autoduck: VLM-based Autonomous Navigation in Duckietown

Dual-Mode Autonomous Navigation in Duckietown using VLM - the objectives

This project implements an autonomous navigation system on the Duckiebot DB21J platform in Duckietown, enabling vision-based control and decision-making with a Vision Language Model (VLM).

The system integrates calibrated camera intrinsics and extrinsics, motor gain and trim calibration, and ROS nodes for perception and control. On top of these, it combines AprilTag-based semantic localization, stop line detection, a lane filter for lateral pose estimation, a finite state machine for state transitions, PID controllers for velocity and steering regulation, and quantized Qwen 2.5 models for multimodal inference on embedded hardware.

The work establishes a reproducible pipeline for benchmarking navigation algorithms, enabling analysis of trade-offs between model size, inference latency, memory limits, communication overhead, and control cycle timing in real-time robotic systems.

VLM in Duckietown - visual project highlights

The technical approach and challenges

At the technical level, this approach involves the following.

The method integrates calibrated camera intrinsics and extrinsics for distortion correction and frame alignment, motor gain and trim calibration for odometry consistency, and ROS-based perception nodes for lane filtering, stop line detection, obstacle recognition, and AprilTag-based pose estimation. Control nodes implement PID regulators for velocity and steering, parameterized turning primitives, and synchronized execution through a finite state machine that coordinates lane following, intersection stopping, turning maneuvers, and recovery states. Sensor fusion combines camera streams, encoder feedback, and ToF measurements for robust decision inputs.
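To give a flavor of the PID regulators mentioned above, here is a minimal Python sketch of a controller driven by the lane filter's error signal. The gains, saturation limit, and the 20 Hz rate are illustrative values, not the ones used in the project:

```python
class PID:
    """Minimal PID regulator of the kind used for velocity and steering."""

    def __init__(self, kp, ki, kd, limit=8.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = limit          # actuator saturation bound
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        u = self.kp * error + self.ki * self.integral + self.kd * deriv
        return max(-self.limit, min(self.limit, u))  # clamp the command


# Example: a steering command from a lateral-offset error of 0.1 m at 20 Hz.
# Gains are hypothetical, chosen only for illustration.
steering = PID(kp=5.0, ki=0.2, kd=0.1)
omega = steering.step(error=0.1, dt=0.05)   # angular velocity command (rad/s)
```

In the project, two such regulators run side by side, one for velocity and one for steering, sequenced by the finite state machine.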

Quantized Qwen 2.5 vision-language models were deployed with llama.cpp, configured with a reduced context window and batch size to fit within GPU memory limits. The models were evaluated on trajectory planning and visual reasoning tasks, with both 7B and 3B variants tested under different quantization schemes. Integration required Docker containerization for portability and ROSBridge for monitoring and remote interaction.
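To see why shrinking the context window matters on memory-constrained hardware like the Jetson Nano, consider a back-of-the-envelope estimate of the f16 KV-cache footprint. The model dimensions below (28 layers, 4 grouped-query KV heads of size 128) are assumed for Qwen2.5-7B and are not taken from the project:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_val=2):
    # One K and one V entry per layer per context position; f16 = 2 bytes.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_val


# Assumed Qwen2.5-7B dimensions (GQA: 4 KV heads of size 128, 28 layers).
full  = kv_cache_bytes(28, 4, 128, n_ctx=32768)
small = kv_cache_bytes(28, 4, 128, n_ctx=2048)
print(f"n_ctx=32768: {full / 2**20:.0f} MiB, n_ctx=2048: {small / 2**20:.0f} MiB")
```

Under these assumptions, cutting the context from 32k to 2k tokens shrinks the KV cache by a factor of 16, from roughly 1.8 GiB to about 112 MiB, which is the difference between fitting and not fitting alongside the weights.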

Challenges included GPU memory capacity restricting larger model execution, inference latency exceeding 100 ms control cycle requirements, CUDA feature mismatches across builds, and instability in container runtimes on the NVIDIA Jetson Nano platform. These issues necessitated systematic parameter tuning of controllers, quantization of VLMs to GGUF formats, pruning strategies to reduce computation load, and hybrid offloading of visual reasoning to external compute nodes while maintaining low-level perception and control locally. Additional constraints involved balancing message-passing overhead in ROS, synchronization delays between perception and control nodes, and variability in inference reproducibility across different hardware builds.

VLM in Duckietown: Authors

Sahil Virani is a student at Technical University of Munich, Germany.

Esmir Kico is a student at Technical University of Munich, Germany.

Learn more

Duckietown is a modular, customizable, and state-of-the-art platform for creating and disseminating robotics and AI learning experiences.

Duckietown is designed to teach, learn, and do research: from exploring the fundamentals of computer science and automation to pushing the boundaries of knowledge.

These spotlight projects are shared to exemplify Duckietown’s value for hands-on learning in robotics and AI, enabling students to apply theoretical concepts to practical challenges in autonomous robotics, boosting competence and job prospects.

Visual Control for Autonomous Navigation in Duckietown

General Information

This research presents a visual control framework for autonomous navigation in Duckietown using only onboard camera feedback. The system models the Duckiebot as a unicycle with constant driving velocity and uses steering velocity as the control input. Virtual guidelines are extracted from the lane boundaries to compute two visual features: the middle point and the vanishing point on the image plane.
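The unicycle model mentioned above can be sketched in a few lines of Python (simple Euler integration with illustrative values; this is not the authors' code):

```python
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """One Euler step of the unicycle model: constant driving speed v,
    with the steering (angular) velocity omega as the only control input."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)


# Drive straight for 1 s at 0.2 m/s in 20 ms steps (values are illustrative).
pose = (0.0, 0.0, 0.0)
for _ in range(50):
    pose = unicycle_step(*pose, v=0.2, omega=0.0, dt=0.02)
```

Because v is held constant, the controller only has to choose omega at each step, which is exactly what the visual control law computes from the two image features.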

The controller drives these features to the image center using a mathematically derived control law. The visual features are obtained from the camera feed using a multi-stage image processing pipeline implemented in OpenCV. The pipeline includes frame denoising, grayscale conversion, edge detection using the Canny edge detection algorithm, region of interest masking, and line detection via the Probabilistic Hough Line Transform. This setup provides robust detection of the white and yellow lane markings under varying conditions. 
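For illustration, the two features can be computed by intersecting the detected boundary lines in homogeneous image coordinates. The segment endpoints below are made up for a 640×480 frame; this is a sketch, not the authors' implementation:

```python
def cross(a, b):
    # Cross product of two homogeneous 3-vectors.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    # Homogeneous line through two image points.
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersection(l1, l2):
    # Intersection of two homogeneous lines, dehomogenized to pixels.
    x, y, w = cross(l1, l2)
    return (x / w, y / w)


# Hypothetical Hough segments for the left and right lane boundaries.
left  = line_through((100, 480), (250, 240))
right = line_through((540, 480), (390, 240))

vanishing = intersection(left, right)   # where the two guidelines meet
middle = ((100 + 540) / 2.0, 480.0)     # lane midpoint at the image bottom
```

With the example endpoints above, the vanishing point lands at (320, 128) and the middle point at (320, 480), both on the vertical centerline of the 640-pixel-wide image, which is exactly the condition the controller tries to maintain.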

A scenario-driven transition system detects red lines marking intersections and activates artificial guidelines to execute controlled turns. The visual control implementation runs as a single ROS node following a publisher-subscriber architecture, deployed both in the Duckietown Simulator (gym) and in Duckietown.
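Such a scenario-driven transition system can be sketched as a small lookup table; the state and event names below are hypothetical, chosen only to illustrate the structure:

```python
# Hypothetical states and events for the scenario-driven transition system.
TRANSITIONS = {
    ("LANE_FOLLOWING", "red_line_detected"): "AT_INTERSECTION",
    ("AT_INTERSECTION", "turn_selected"):    "TURNING",
    ("TURNING", "turn_completed"):           "LANE_FOLLOWING",
}

def next_state(state, event):
    # Events with no matching transition leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


state = "LANE_FOLLOWING"
state = next_state(state, "red_line_detected")   # enter the intersection state
```

In the TURNING state the artificial guidelines replace the detected lane boundaries as input to the same control law, so the turn is executed by the regular visual controller rather than by a separate maneuver planner.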

Highlights - Visual Control for Autonomous Navigation in Duckietown

Here is a visual tour of the implementation of visual control for autonomous navigation by the authors. For all the details, check out the full paper.

Abstract

Here is the abstract of the work, directly in the words of the authors:

This paper presents a vision-based control framework for the autonomous navigation of wheeled mobile robots in city-like environments, including both straight roads and turns. The approach leverages Computer Vision techniques and OpenCV to extract lane line features and utilizes a previously established control law to compute the necessary steering commands. 

The proposed method enables the robot to accurately follow the lanes and seamlessly handle complex maneuvers such as consecutive turns. The framework has been rigorously validated through extensive simulations and real-world experiments using physical robots equipped with the ROS framework. Experimental evaluations were conducted at the DIAG Robotics Lab at Sapienza University of Rome, Italy, demonstrating the practicality of the proposed solution in realistic settings. 

This work bridges the gap between theoretical control strategies and their practical application, offering insights into vision-based navigation systems for autonomous robotics. A video demonstration of the experiments is available at https://youtu.be/tDvpwSj8X28.

Conclusion - Visual Control for Autonomous Navigation in Duckietown

Here is the conclusion according to the authors of this paper:

This paper proposed a vision-based control framework for lane-following tasks in wheeled mobile robots, validated through both simulations and real-world experiments. The approach effectively maintains the robot's position at the center of the lane and enables safe left and right turns by relying solely on visual feedback from the onboard camera, without requiring external localization systems or pre-mapped environments.

The system’s modular design and simplicity allow for seamless integration with other robotic systems, making it versatile for diverse urban navigation scenarios. Future research will focus on enhancing the framework to handle complex scenarios, such as autonomous lane corrections, and incorporating obstacle detection and avoidance mechanisms for improved performance in dynamic, real-world environments. 

These advancements will expand the applicability of the proposed method, confirming its potential as a robust solution for autonomous navigation.

Project Authors

Shima Akbari is a PhD student in the Italian National Program in Autonomous Systems at the University of Rome Tor Vergata, Italy.

Nima Akbari is a PhD student at the University of Basel, Switzerland, working on privacy technologies for the Internet of Things.

Giuseppe Oriolo is a Full Professor of Automatic Control and Robotics at Sapienza University of Rome.

Sergio Galeani is a Full Professor at the University of Rome Tor Vergata, Italy.
