Coding Week 17: 9/23-9/30

This week, I explored several works that integrate visual question answering (VQA) with reinforcement learning (RL) to enhance autonomous driving. These approaches use LLMs and transformer-based architectures to generate questions and responses grounded in driving scenarios.

VQA in Autonomous Systems

VQA leverages visual data to answer user-generated questions. In the context of autonomous driving, VQA can help process and analyze visual data, transforming real-time scenarios into structured interactions. For example, an autonomous vehicle using VQA might analyze traffic signals and generate questions regarding potential obstacles or safe navigation paths. VQA’s role in autonomous driving can be categorized into three core areas:

The first core area is image understanding and object detection: the model's accuracy in recognizing pedestrians, vehicles, and road signs, where extensive training datasets improve its ability to differentiate objects.

The second is contextual question generation, which supports decision-making with relevant questions such as "Is the pedestrian crossing clear?" at intersections. The third is real-time adaptability: the dynamic nature of driving environments requires high processing speeds and optimized algorithms to maintain safety and responsiveness without latency. A minimal sketch of a VQA query on a single driving frame follows.
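To make the question-answering step concrete, here is a minimal sketch of querying a single driving frame with an off-the-shelf VQA model. This assumes ViLT fine-tuned on VQAv2 as a stand-in for a driving-specific model; the frame path and the question are placeholders, not from any of the papers discussed.

```python
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

# Generic VQA model used purely for illustration; a driving-specific
# model would be trained on road scenes and driving-oriented questions.
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

frame = Image.open("frame.jpg").convert("RGB")   # hypothetical camera frame
question = "Is the pedestrian crossing clear?"

inputs = processor(frame, question, return_tensors="pt")
outputs = model(**inputs)

# The model scores a fixed answer vocabulary; take the top-scoring answer.
answer_idx = outputs.logits.argmax(-1).item()
print("Answer:", model.config.id2label[answer_idx])
```

In a real pipeline this query would run per frame (or per keyframe) and the answer would feed into the planning or RL component rather than being printed.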

I delved into this paper. Its architecture for autonomous driving is relatively straightforward, combining a DDPG-based RL agent with a VQA model. Visual encoding uses a pre-trained VGG-19 followed by fully connected layers, and an LSTM handles the language side, suggesting a design that prioritizes efficiency over complexity. The pipeline relies heavily on offline processing of driving videos, with the pre-trained backbone and LSTM covering the sequence-based vision and language tasks. While effective for basic visual-language alignment, this setup may lack the sophistication needed for real-time adaptability in complex, dynamic driving scenarios because of its offline nature and relatively simple model architecture.
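Below is a rough PyTorch sketch of that kind of structure: a frozen VGG-19 visual encoder with fully connected layers, an LSTM question encoder, and a DDPG-style actor over the fused features. The layer sizes, the two-dimensional action space (steering, throttle), and the fusion scheme are my assumptions, not the paper's exact configuration; the critic and the training loop are omitted.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualEncoder(nn.Module):
    """Frozen pre-trained VGG-19 features followed by a fully connected layer
    (dimensions assumed; mirrors the simple visual-encoding idea, not the paper exactly)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        self.backbone = vgg.features              # convolutional layers only
        for p in self.backbone.parameters():
            p.requires_grad = False               # keep pre-trained weights frozen
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(512 * 7 * 7, feat_dim),
                                nn.ReLU())

    def forward(self, frames):                    # frames: (B, 3, 224, 224)
        return self.fc(self.backbone(frames))

class QuestionEncoder(nn.Module):
    """LSTM over token ids for the language side of the VQA model."""
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, tokens):                    # tokens: (B, T)
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                              # final hidden state (B, hid_dim)

class DDPGActor(nn.Module):
    """Deterministic policy head: fused visual + language features -> actions in [-1, 1]."""
    def __init__(self, feat_dim=512, hid_dim=512, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + hid_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim), nn.Tanh())

    def forward(self, visual, language):
        return self.net(torch.cat([visual, language], dim=-1))

# Forward pass on dummy data to show how the pieces connect.
vis_enc, q_enc, actor = VisualEncoder(), QuestionEncoder(), DDPGActor()
frames = torch.randn(1, 3, 224, 224)             # one video frame
tokens = torch.randint(0, 10000, (1, 12))        # one tokenized question
action = actor(vis_enc(frames), q_enc(tokens))   # e.g. (steering, throttle)
print(action.shape)                               # torch.Size([1, 2])
```

The offline nature of the original pipeline would show up around this sketch: features are precomputed from recorded driving videos and the DDPG agent is trained against them, rather than against a live simulator loop.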

LINGO-1 and LINGO-2

LINGO-1 and LINGO-2 provide questionnaire-based interactions within a controlled environment. These simulations enable analysis of vehicle responses, situational understanding, and the level of interactivity the models support.
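As a way to organize that analysis, the sketch below shows a hypothetical record format for logging questionnaire-style exchanges during a simulated drive. It is not LINGO's actual interface; the field names and the example exchanges are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DrivingQA:
    """One questionnaire-style exchange logged during a simulated drive (hypothetical format)."""
    timestamp_s: float   # simulation time of the frame
    question: str        # prompt posed to the driving model
    answer: str          # model's natural-language response
    action: str          # concurrent vehicle action, for interactivity analysis

def summarize(log: List[DrivingQA]) -> None:
    """Print exchanges so vehicle responses and situational understanding can be reviewed."""
    for qa in log:
        print(f"[{qa.timestamp_s:6.1f}s] Q: {qa.question}")
        print(f"          A: {qa.answer} (action: {qa.action})")

log = [
    DrivingQA(12.4, "Why are you slowing down?", "A pedestrian is approaching the crossing.", "brake"),
    DrivingQA(15.0, "Is it safe to proceed?", "Yes, the crossing is now clear.", "accelerate"),
]
summarize(log)
```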

Action Items

Based on meeting insights, we will create a detailed technical plan, identifying open-source solutions for reproducibility testing.