Coding Week 14 (8/26-9/08)

This week, we focused on exploring several potential research directions to create an “Ideas List”, a collection of possible research concepts. In evaluating each idea against the project goals, we balanced technical feasibility with anticipated resource needs and practical challenges.

This list includes ideas focused on recent advancements, integrations with LLMs, and potential research gaps in current work. During our review, we evaluated each idea against several key factors to identify those that promise impact while remaining feasible given our technical and resource constraints. This post outlines the main points from our discussions, along with actionable insights and next steps.

Update

In addition to our research review and ideas list development, the following updates were made:

Resource Availability

We examined the viability of implementing each idea from both technical and resource-based perspectives. This included considering our current toolset and any additional resources that may be required to bring an idea to fruition. Access to high-performance computing resources has emerged as a critical consideration, as the computational demands of the more advanced LLM-based projects currently exceed our available resources. Given this, we are actively exploring options such as consumer gaming GPUs, for example the RTX 4060.
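As a quick sanity check for this kind of feasibility question, a back-of-envelope memory budget helps. The sketch below is illustrative only: the parameter counts, the 1.5x overhead factor for activations and framework buffers, and the 8 GB VRAM figure for an RTX 4060 are our own rough assumptions, not measured values.

```python
def fits_in_vram(num_params: int, bytes_per_param: int, vram_gb: float,
                 overhead_factor: float = 1.5) -> bool:
    """Rough check: do a model's weights (plus an assumed overhead for
    activations and framework buffers) fit in a GPU's VRAM?"""
    required_gb = num_params * bytes_per_param * overhead_factor / 1024**3
    return required_gb <= vram_gb

# A ResNet50-scale backbone (~25.6M params, fp32) fits comfortably in 8 GB:
print(fits_in_vram(25_600_000, 4, 8.0))        # True

# A 7B-parameter LLM in fp16 does not, even before activations:
print(fits_in_vram(7_000_000_000, 2, 8.0))     # False
```

This kind of estimate is what pushes the LLM-heavy ideas beyond a single gaming GPU without quantization or offloading.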

In autonomous driving model design, architectural choices significantly impact GPU resource utilization. Here, we use SparseDrive and LMDrive as examples to illustrate these trade-offs.

The SparseDrive model achieves computational efficiency through a sparse representation framework, which minimizes reliance on dense bird’s-eye view (BEV) features, thereby reducing resource consumption, particularly in multi-GPU setups. Specifically, SparseDrive employs ResNet50 and ResNet101 backbones and trains perception and planning tasks through a parallelized approach. On an 8x NVIDIA RTX 4090 GPU system, SparseDrive demonstrates up to sevenfold increases in training and inference speeds compared to models such as UniAD, which traditionally employ dense representations. This efficiency stems from SparseDrive’s reduced floating-point operation (FLOPs) requirements and decreased memory usage in its sparse, hierarchical planning structure, resulting in enhanced scalability and throughput with fewer GPUs.
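The memory argument behind sparse versus dense representations can be made concrete with simple arithmetic. The grid size, channel count, and query count below are illustrative assumptions, not SparseDrive’s actual configuration:

```python
def dense_bev_bytes(height: int, width: int, channels: int,
                    dtype_bytes: int = 4) -> int:
    # A dense BEV grid stores a feature vector for every cell.
    return height * width * channels * dtype_bytes

def sparse_query_bytes(num_queries: int, channels: int,
                       dtype_bytes: int = 4) -> int:
    # A sparse representation keeps features only for a fixed set of queries.
    return num_queries * channels * dtype_bytes

dense = dense_bev_bytes(200, 200, 256)     # hypothetical 200x200 BEV grid
sparse = sparse_query_bytes(900, 256)      # hypothetical 900 object/map queries
print(dense / sparse)                      # ~44x smaller feature footprint
```

Even with made-up numbers, the ratio shows why dropping dense BEV features translates directly into lower FLOPs and memory per frame.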

In contrast, LMDrive’s architecture is more resource-intensive, targeting closed-loop, language-guided autonomous driving. LMDrive incorporates multimodal encoders and additional adapters, such as Q-Formers and token adapters, to handle both visual and textual inputs. This design supports the processing of extensive multi-view camera and LiDAR data, which increases computational requirements relative to SparseDrive. The LLaMA-based language backbone adds further memory and processing demands.
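The role of a Q-Former-style adapter is essentially sequence compression: a small, fixed set of learned queries cross-attends over a much longer stream of visual tokens, so the language backbone sees a constant token budget. The following is a minimal shape-level sketch with random data and no learned projections; the token counts and feature dimension are our own illustrative choices, not LMDrive’s:

```python
import numpy as np

def cross_attention(queries: np.ndarray, keys_values: np.ndarray) -> np.ndarray:
    """Single-head cross-attention sketch: each query token attends over all
    visual tokens, producing one output per query (sequence compression)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)              # (Q, V)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over visual tokens
    return weights @ keys_values                               # (Q, d)

rng = np.random.default_rng(0)
visual_tokens = rng.standard_normal((4096, 64))   # e.g. multi-view camera + LiDAR features
learned_queries = rng.standard_normal((32, 64))   # fixed token budget for the LLM

out = cross_attention(learned_queries, visual_tokens)
print(out.shape)   # (32, 64): 4096 visual tokens compressed to 32 LLM tokens
```

The compression keeps the language backbone’s sequence length, and hence its cost, independent of how many sensor tokens arrive per frame.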

Training LMDrive requires approximately 4-6 days on 8 A100 GPUs with 80GB memory and consists of two stages: vision encoder pre-training and instruction fine-tuning, as outlined in their documentation. LMDrive’s large parameter count, coupled with the need for real-time closed-loop processing, imposes a substantial load on GPU memory; however, it achieves robustness in language-guided navigation and adaptive control.
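The 8x A100-80GB requirement is easy to motivate with standard mixed-precision training arithmetic. The byte counts below are the usual rule of thumb for Adam with fp16 weights (fp16 weights + fp16 gradients + fp32 master copy and moments), applied to a hypothetical 7B-parameter backbone; they are not LMDrive’s measured numbers, and activations add further cost on top:

```python
GB = 1024**3

def training_memory_gb(num_params: int, weight_bytes: int = 2,
                       grad_bytes: int = 2, optimizer_bytes: int = 8) -> float:
    """Back-of-envelope memory for mixed-precision Adam training:
    fp16 weights + fp16 gradients + ~8 bytes/param of fp32 optimizer state.
    Activations are excluded and depend on batch size and sequence length."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes) / GB

# A 7B-parameter backbone needs ~78 GB for weights, gradients, and optimizer
# state alone -- already close to one A100-80GB before any activations,
# which is why such training recipes shard state across many GPUs.
print(round(training_memory_gb(7_000_000_000), 1))
```

This is also why fine-tuning stages that freeze the language backbone are so much cheaper: frozen parameters drop the gradient and optimizer terms entirely.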

Action Items

To carry forward the selected ideas, we outlined specific action items. These steps are critical to ensuring that our top-priority ideas move steadily through the development pipeline. The key tasks include: