Analysis of code replication and literature review
During the May 20, 2024 meeting, we discussed beginning the project by replicating last year’s model using a simple LLM to handle different input commands, then gradually progressing towards more complex models. Key tasks for moving forward include conducting a literature review to define the project’s specific research question, setting up necessary tools like CARLA and Behavior Metrics, and addressing technical setup challenges.
More details can be found here: Google Doc
This week, I attempted to replicate parts of the project codebase (Meiqizhao’s code) and encountered some challenges that required raising issues for resolution. Specifically, I opened two issues (issue1, issue2) and one PR regarding bugs found during the replication process. To improve reproducibility, I am currently working with Docker and plan to provide a Docker branch later on.
I reviewed the Behavior Metrics repos and related papers. Behavior Metrics provides a structured framework for quantifying the effectiveness and performance of autonomous systems in simulated scenarios. Incorporating text input for autonomous driving guidance would enhance the Behavior Metrics benchmark by adding interactivity and interpretability. Here are some potential integration methods and benefits:
I reviewed research papers related to our project. The focus was on assessing how feasible it would be to replicate each study, considering factors like data availability, computational requirements, and whether the methods are open-source. This analysis helps clarify the practical aspects of implementing these research findings in our work.
| Paper Title | Reproducibility | Data Volume | Technical Difficulty | GPU Requirements | 
|---|---|---|---|---|
| GPT-4V Takes the Wheel | Low: Uses publicly available datasets, but is not open-sourced | JAAD, WiDEVIEW | High: Integrates vision and language models for dynamic behavior prediction | High: VLM processing, but not specified | 
| Driving with LLMs | Low: New dataset and unique architecture, reproducibility GitHub | Custom 160k QA pairs, 10k driving scenarios. Which simulator? | Very High: Novel fusion of vector modalities and LLMs | Moderate: Minimum of 20GB VRAM for running evaluations, minimum of 40GB VRAM for training | 
| LMDrive | High: Dataset and models are open-sourced, but complexity in GPU setup | 64K parsed clips and 464K notice instructions | Very High: Real-time, closed-loop control with LLMs in vehicles | Very High: 2~3 days for the visual encoder on 8x A100 (80G) | 
| Language Models as Trajectory Generators | High: Standard dataset, clear methodology and evaluation process | Flexible data generation with Pybullet | Moderate: Focus on trajectory generation using LLMs, less complex than real-time control systems | Low: Less demanding compared to real-time visual tasks | 
Here is a summary of the preliminary analysis of different literature pieces:
From a feasibility standpoint, some of the reviewed papers have very high resource requirements; for example, LMDrive reports 2~3 days on 8x A100 (80G) GPUs for the visual encoder. Such demands pose a substantial challenge for replication.
The core question we need to address is: What is our objective? If the goal is to replicate and integrate existing solutions, we need to identify the target features and a minimum viable product (MVP). However, if our aim is to optimize, the biggest hurdle is the training phase, particularly the GPU bottlenecks during this process. This will need to be discussed further in next week’s meeting.
Understanding these resource limitations and objectives will help guide our project’s direction. Our next steps involve deciding whether to pursue resource optimization or to adapt our goals to fit the available computational resources. Additionally, I am currently addressing several open issues and plan to conduct further literature research to deepen my understanding of the field.
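To make the "fit our goals to the available resources" decision more concrete, here is a minimal sketch that checks the GPU figures from the table against a given VRAM budget. The class and function names are hypothetical, the 48 GB budget is only an example value, and the LMDrive figure is the hardware reported for training the visual encoder (8x A100 80G), not a confirmed minimum. Papers with no stated figure in the table are marked as unknown rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperRequirement:
    """GPU figures as noted in the review table (None = not stated there)."""
    title: str
    eval_vram_gb: Optional[int]
    train_vram_gb: Optional[int]  # total VRAM across all GPUs


# Values copied from the table above; nothing here is measured by us.
PAPERS = [
    PaperRequirement("GPT-4V Takes the Wheel", None, None),
    PaperRequirement("Driving with LLMs", 20, 40),
    # Reported training hardware (8x A100 80G), treated as an upper reference.
    PaperRequirement("LMDrive", None, 8 * 80),
    PaperRequirement("Language Models as Trajectory Generators", None, None),
]


def training_feasibility(paper: PaperRequirement, budget_gb: int) -> str:
    """Return 'fits', 'exceeds', or 'unknown' for a total VRAM budget in GB."""
    if paper.train_vram_gb is None:
        return "unknown"
    return "fits" if paper.train_vram_gb <= budget_gb else "exceeds"


if __name__ == "__main__":
    budget_gb = 48  # hypothetical budget, e.g. a single 48 GB card
    for paper in PAPERS:
        print(f"{paper.title}: {training_feasibility(paper, budget_gb)}")
```

A check like this does not replace the discussion about objectives, but it gives a quick way to see which papers are even candidates for replication once the actual hardware budget is known.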