Coding week1 5/27-6/02

Weekly Meeting

In this week’s meeting, we reviewed the project’s current progress. I updated the project blog and asked for feedback. I have successfully set up CARLA on Docker, with plans to transition to a physical machine soon, and I will post a Docker installation tutorial later. We discussed several technical issues, including dependencies and model loading errors, and discussed data collection script problems in relation to graphical mode and ROS Bridge compatibility.

Open issues on GitHub were reviewed, with a particular note on the need to make a PR and future plans for Docker installation. During the open floor discussion, the team discussed the potential for reproducibility and future enhancements similar to the LMdrive model. Concerns about GPU resources and API token support from Google were raised, with plans to inquire further.

More details can be found here: Google Doc

To-Do List

Check dependencies and model loading error
Conduct a literature review about the potential direction
A CARLA docker installation documentation
Inquiry about potential GPU resources and API token support

Following previous issues

Following previous issues: issue1 and issue2, the model’s issues were resolved by changing the number of outputs, and the environment dependency issues have also been resolved. However, there are still problems with the current data collection, but an issue has not yet been raised; it is still being checked. The current data collection scripts may encounter errors or pauses, which could require manual intervention or result in delays in data collection. There are several solutions currently available:

Use a physical machine to set up checks for cameras, etc., in a graphical interface.
LMDrive has provided some scripts for data collection that can be tested and explored.

Literature review

Modules

The architectural framework of world models is structured to facilitate complex decision-making processes that closely emulate human cognitive functions. These models are comprised of several distinct but interconnected modules, each serving a crucial role in the system’s overall performance and capability:

Perception: This module serves as the interface between the external environment and the model, capturing multi-modal sensory inputs such as images, sounds, and tactile feedback. By transforming raw sensory data into a more digestible and actionable format, this module ensures that the model is well-equipped to respond to the complexities of its surroundings. It leverages advanced sensory technologies and encoding mechanisms. This module is adept not only at recognizing objects or scenes but also at interpreting the structural and contextual relationships within the sensory information. This precise environmental understanding forms the foundation for all subsequent data-driven decision-making processes.
Memory Module: This module serves a role akin to the human hippocampus, crucial for encoding, storing, and retrieving information from past and present environmental interactions. It supports both short-term and long-term memory functionalities. Short-term memory processes immediate tasks—such as remembering specific details of a traffic intersection, while long-term memory retains the outcomes of particular events or the efficacy of complex strategies over time. By managing a dynamic repository of experiences, this module enables continuous learning and adaptation, which are indispensable for responding to evolving challenges.
Action Module: This module operates based on the information processed by the Perception and Modules to formulate and execute decisions. It evaluates the current conditions along with forward-looking predictions to develop actions aimed at achieving specific objectives, such as optimizing resource use or maximizing operational efficacy. Typically, this process relies on RL, aiming to maximize rewards. The capability of this module to integrate information and implement strategic decisions is crucial for responsive interactions with the environment, ensuring that the system can adapt to changes effectively and execute tasks efficiently.
World Model Module: Situated at the heart of the system’s architecture, its primary function is to refine and enhance the system’s understanding of the environment by generating comprehensive simulations. Through these simulations, the module projects future environmental conditions, providing a foresight that is critical for strategic planning. This predictive capability allows the system to prepare for potential scenarios, offering a degree of anticipatory adaptation and flexibility.

These modules form an integrated framework that enables world models to simulate human-like cognitive processes and decision-making. This module structure not only enhances the operational capabilities of such systems but also contributes to their ability to operate independently and efficiently in a variety of real-world applications.

Architectures

The architecture of world models is designed to predict future states of environments by balancing deterministic forecasts with the uncertainty of real-world dynamics. In high-dimensional sensory input scenarios, the challenge lies in efficiently representing observed information through latent dynamical models to make compact and accurate predictions. To manage these complexities, a variety of architectures have been proposed, including the RSSM and the JEPA, as well as Transformer-based architecture.

RSSM: The RSSM stands at the forefront of this architectural evolution, designed to efficiently navigate and predict within latent spaces. By decomposing the latent state into deterministic and stochastic elements, RSSM manages the unpredictable nature of real-world environments. This model excels in continuous control tasks by learning dynamic environmental models from sensory data like pixels and formulating action plans within the encoded latent space. The RSSM features a dual-path architecture where its deterministic components provide stability and its stochastic components enhance adaptability. This structure makes RSSM ideal for scenarios that require robust yet flexible predictive capabilities.
JEPA: JEPA revolutionizes predictive modelling by focusing on a higher-level representation space rather than traditional pixel-level output generation. The architecture’s predictive process involves generating and utilizing latent variables to fill in gaps or predict missing elements in the input data. This architecture marks a paradigm shift by abstracting inputs and targets through dual encoders into representations and leveraging a latent variable for prediction. JEPA excels in filtering out noise and irrelevancies, concentrating on essential data elements. Its use of self-supervised learning enables pre-training on unlabeled datasets, refining predictive accuracy for both visual and non-visual tasks.
Transformer-based Architecture: Leveraging the attention mechanism inherent in Transformer architectures, these models provide a framework for handling memory-intensive tasks. Transformer-based world models like the Spatial Temporal Patchwise Transformer (STPT) and the Transformer State Space Model (TSSM) focus on different segments of input data simultaneously. These models excel at managing intricate temporal and spatial dependencies. This capability enables effective management and prediction of dynamic environmental interactions through their advanced memory access and dependency tracking.

Please note that there is also a section of review content that will be part of a paper to be submitted and will be expected to make public next week.

GPU resource

Assessing the risks, it’s clear that we should rely on external sources like university clusters, especially when I need consistent access to high-performance GPUs such as the NVIDIA A100. But I’ve faced challenges with availability. While I can access 30 series and A4000 GPUs, I’m also exploring potential access through a university cluster. Unfortunately, GSOC has confirmed they cannot provide my GPU resources. I’m also considering GPU resources from the University of Edinburgh, though this might require queuing for access. Sergio mentioned that we have access to a powerful GPU cluster at his university, and they might give me access when needed.

Docker installation

We have updated the step-by-step tutorial for installing CARLA based on Docker. Unlike the official website, this version includes more detailed troubleshooting steps. More details can be find in this post.