Bonding 5/21 - 5/27

Analysis of code replication and literature review

Weekly Meeting

During the May 20, 2024 meeting, we discussed beginning the project by replicating last year's model, using a simple LLM to handle different input commands, and gradually progressing toward more complex models. Key tasks going forward include conducting a literature review to define the project's specific research question, setting up necessary tools such as CARLA and Behavior Metrics, and addressing technical setup challenges.

More details can be found here: Google Doc

To-Do List during the bonding period

Code Replication

This week, I attempted to replicate parts of the project codebase (Meiqizhao's code) and ran into several challenges that required raising issues. Specifically, I opened two issues (issue1, issue2) and one PR for bugs found during the replication process. To improve reproducibility, I am currently working with Docker and plan to contribute a Docker branch later on.
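Since environment mismatches between the CARLA client and server are a common source of replication bugs, a quick connectivity check from inside the container can save debugging time. Below is a minimal sketch, assuming a CARLA server reachable on the default port 2000 (the host and port are assumptions; adjust them to your Docker network setup):

```python
# sanity_check.py: verify that the containerized client can reach the CARLA
# server and that client/server versions match (assumes default port 2000).
import carla

def check_carla(host: str = "localhost", port: int = 2000) -> None:
    client = carla.Client(host, port)
    client.set_timeout(10.0)  # fail fast if the server is not up
    world = client.get_world()
    print("Connected. Map:", world.get_map().name)
    print("Client version:", client.get_client_version())
    print("Server version:", client.get_server_version())

if __name__ == "__main__":
    check_carla()
```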

Behavior Metrics Exploration

I reviewed the Behavior Metrics repositories and related papers. Behavior Metrics provides a structured framework for quantifying the effectiveness and performance of autonomous systems in simulated scenarios. Incorporating text input for autonomous driving guidance would extend the Behavior Metrics benchmark with interactivity and interpretability. Here are some potential integration methods and benefits (a toy sketch follows the list):

  1. Expanded Testing Scenarios: It enables the creation of a broader range of test environments and situations that include verbal commands and interactions.
  2. Enhanced Textual Interpretability: Provides clarity on how the system interprets and responds to natural language inputs, which improves the system’s transparency and trustworthiness.
  3. Adapted Interaction Methods: Allows for modifications in user interaction, offering more intuitive and accessible ways for users to communicate with autonomous systems.
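To make the idea concrete, here is a minimal, hypothetical sketch of the simplest possible text-to-action layer: a keyword baseline that a real system would replace with an LLM or intent classifier. The command vocabulary and the HighLevelAction names are my own assumptions, not part of the Behavior Metrics codebase:

```python
from enum import Enum

class HighLevelAction(Enum):
    # Hypothetical action set; not taken from the Behavior Metrics API.
    LANE_FOLLOW = "lane_follow"
    TURN_LEFT = "turn_left"
    TURN_RIGHT = "turn_right"
    STOP = "stop"

# Keyword -> action lookup; a stand-in for an LLM-based intent parser.
_KEYWORDS = {
    "left": HighLevelAction.TURN_LEFT,
    "right": HighLevelAction.TURN_RIGHT,
    "stop": HighLevelAction.STOP,
    "halt": HighLevelAction.STOP,
}

def parse_command(text: str) -> HighLevelAction:
    """Map a natural-language command to a high-level driving action."""
    lowered = text.lower()
    for keyword, action in _KEYWORDS.items():
        if keyword in lowered:
            return action
    return HighLevelAction.LANE_FOLLOW  # default: keep following the lane

print(parse_command("Please turn left at the next intersection"))  # TURN_LEFT
```

A benchmark could then log both the raw command and the parsed action alongside the usual driving metrics, which is where the interpretability gain comes from.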

Literature Review and Feasibility Analysis

I conducted a review of research papers related to our project, focusing on the feasibility of replicating each study: data availability, computational requirements, and whether the methods are open-source. This analysis helps clarify the practical aspects of applying these research findings to our own work.

| Paper Title | Reproducibility | Data Volume | Technical Difficulty | GPU Requirements |
| --- | --- | --- | --- | --- |
| GPT-4V Takes the Wheel | Low: uses publicly available datasets, but not open-sourced | JAAD, WiDEVIEW | High: integrates vision and language models for dynamic behavior prediction | High: VLM processing, but requirements not stated |
| Driving with LLMs | Low: new dataset and unique architecture; code on GitHub | Custom 160k QA pairs, 10k driving scenarios (which simulator?) | Very High: novel fusion of vector modalities and LLMs | Moderate: minimum 20GB VRAM for evaluation, 40GB VRAM for training |
| LMDrive | High: dataset and models are open-sourced, but the GPU setup is complex | 64K parsed clips and 464K notice instructions | Very High: real-time, closed-loop control with LLMs in vehicles | Very High: 2-3 days for the visual encoder on 8x A100 (80GB) |
| Language Models as Trajectory Generators | High: standard dataset, clear methodology and evaluation process | Flexible data generation with PyBullet | Moderate: trajectory generation with LLMs, less complex than real-time control systems | Low: less demanding than real-time visual tasks |

Here is a summary of the preliminary analysis of different literature pieces:

  1. GPT-4V Takes the Wheel: This work utilizes publicly available datasets but is not open-sourced, which poses a significant barrier to reproducibility. Although it can serve as a conceptual reference, the lack of open access means it cannot be directly replicated.
  2. Driving with LLMs: The source code is open, but the simulator used is proprietary to Wayve, which restricts access and thus full replication of the project. The architecture and approach can still be studied as a reference.
  3. LMDrive: This project appears the most promising in terms of openness and practical usability. It is conducted on the Carla simulator platform, and pre-trained models along with the dataset are provided. Although there are no current reproducibility issues or bugs reported, the main challenge is the significant computational requirement—training requires eight A100 GPUs (80GB each). Initial testing might focus on evaluating the provided pre-trained models due to these resource demands.
  4. Language Models as Trajectory Generators: This work offers a unique perspective by applying zero-shot LLM prompting to robot manipulators, and it is the least resource-intensive approach among those listed (see the parsing sketch below). However, for real-time systems like autonomous driving, the approach would need more robust and safer control mechanisms to ensure reliability in dynamic environments.
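To give a feel for how the trajectory-generator approach works, the sketch below shows the parsing half of the loop: the LLM is prompted to emit waypoints in a fixed textual format, and the controller only has to parse that text into coordinates. The prompt wording and the hard-coded example reply are illustrative assumptions, not taken from the paper:

```python
import re

# Prompt in the spirit of "Language Models as Trajectory Generators":
# ask the model for waypoints in a fixed, easily parsed format.
PROMPT = (
    "Output a trajectory from (0, 0) to (4, 2) as waypoints, "
    "one per line, in the form (x, y)."
)

def parse_waypoints(llm_reply: str) -> list[tuple[float, float]]:
    """Extract (x, y) pairs from the model's free-text reply."""
    pattern = re.compile(r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")
    return [(float(x), float(y)) for x, y in pattern.findall(llm_reply)]

# Hard-coded stand-in for an actual LLM call, purely for illustration.
reply = "(0, 0)\n(1, 0.5)\n(2, 1)\n(3, 1.5)\n(4, 2)"
print(parse_waypoints(reply))  # [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), ...]
```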

From a feasibility standpoint, some of the reviewed literature reported very high resource requirements, such as one paper requiring eight A100 GPUs. Such demands pose substantial challenges for replication.
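A rough rule of thumb makes these numbers plausible: fp16 weights cost about 2 bytes per parameter, while full training with an Adam-style optimizer typically costs on the order of 16 bytes per parameter (weights, gradients, and optimizer states), before activations. The back-of-the-envelope sketch below applies this to a hypothetical 7B-parameter model; the per-parameter byte counts are standard approximations, not figures from the papers:

```python
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Back-of-the-envelope VRAM estimate in GB (ignores activations, KV cache)."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB

# fp16 inference: ~2 bytes per parameter for the weights alone
print(f"7B inference (fp16 weights): ~{vram_gb(7, 2):.0f} GB")           # ~14 GB
# mixed-precision training with Adam: ~16 bytes per parameter
print(f"7B training (Adam, mixed precision): ~{vram_gb(7, 16):.0f} GB")  # ~112 GB
```

This is why training, rather than evaluation, dominates the GPU budget and forces multi-GPU setups like the 8x A100 configuration above.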

The core question we need to address is: what is our objective? If the goal is to replicate existing solutions and integrate them, we need to identify the key features and define an MVP. If instead the aim is to optimize, the biggest hurdle is the training phase, particularly the GPU bottleneck. This will need to be discussed further in next week's meeting.

Moving Forward

Understanding these resource limitations and objectives will help guide the project's direction. Our next steps involve deciding whether to pursue resource optimization or to adapt our goals to the available computational resources. In the meantime, I am continuing to address the open issues and plan to conduct further literature research to deepen my understanding of the field.