Recently, Tsinghua University master's student Xu Dongyang and his team proposed a module called LVAFusion, which aims to fuse multimodal information more efficiently and accurately and thereby advance autonomous driving technology.
Autonomous vehicles on the road should be able to learn from skilled human drivers, who can quickly locate the key areas of most scenes.
To improve the interpretability of end-to-end autonomous driving models, the team introduced the attention mechanism of human drivers for the first time.
The model predicts the driver's attention area in the current scene and uses it as a mask to reweight the original images, enabling autonomous vehicles to locate and anticipate potential risk factors the way experienced human drivers do.
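The article does not spell out the exact masking formula, but the idea can be sketched. In this minimal PyTorch sketch, a predicted attention heatmap reweights the camera image so attended regions dominate downstream features; the function name, the blending scheme, and the `floor` parameter are all illustrative assumptions, not the team's actual design:

```python
import torch

def apply_driver_attention(image: torch.Tensor,
                           attention_map: torch.Tensor,
                           floor: float = 0.5) -> torch.Tensor:
    """Reweight a camera image with a predicted driver-attention map.

    image:         (B, 3, H, W) camera frames
    attention_map: (B, 1, H, W) predicted attention heatmap in [0, 1]
    floor:         minimum weight, so unattended regions are down-weighted
                   rather than erased (an illustrative design choice)
    """
    weights = floor + (1.0 - floor) * attention_map  # broadcasts over channels
    return image * weights
```

Keeping a nonzero floor avoids blinding the model to regions the attention predictor happens to miss.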
Predicting the driver's visual attention area not only provides finer-grained perceptual features for downstream decision-making tasks, improving safety; it also brings the scene-understanding process closer to human cognition, improving interpretability.

In terms of potential applications:
- The LVAFusion module developed in this work can be used on vehicles equipped with LiDAR and is expected to enhance the perception-fusion capabilities of multimodal large models.
- The current model can be combined with existing multimodal large models.
- For example, the driver-attention mechanism can produce real-time output, allowing passengers to see which regions the large model currently assigns greater weight.
- If passengers find the weighting unreasonable, they can tell the end-to-end model verbally, enabling automatic adjustment and continuous learning and optimization.

What Are the Advantages of End-to-End Autonomous Driving?
Autonomous driving comprises key stages such as environmental perception, localization, prediction, decision-making, planning, and vehicle control; coordinating these modules enables real-time perception of the surroundings and safe navigation.
However, this architecture involves a huge amount of code, complex post-processing logic, and high maintenance costs later on.

It is also prone to error accumulation in practice: if a pedestrian suddenly appears ahead and the perception module misses it, the downstream prediction and decision-making modules never receive the pedestrian's information, which may lead to danger.

End-to-end autonomous driving is expected to solve this problem. It refers to using deep learning models to convert raw input data (such as camera images and LiDAR point clouds) directly into control commands (such as steering-wheel angle, throttle, and brake).
This approach attempts to simplify the traditional multi-module autonomous driving system by treating the entire driving task as a mapping problem from perception to action.
The key advantage of end-to-end learning is that it can reduce the complexity of the system and has the potential to improve generalization ability, as the model can be trained to directly handle a variety of different driving situations.
Moreover, multimodal end-to-end autonomous driving, by integrating data from various sensors such as cameras, LiDAR, and radar, is expected to enhance the system's understanding and response to complex environments, improving the accuracy and robustness of decision-making, thereby enhancing the safety and reliability of autonomous vehicles.
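To make the "mapping from perception to action" concrete, here is a deliberately tiny PyTorch sketch of an end-to-end policy. The architecture is purely illustrative, not the team's model: camera and LiDAR features are encoded, fused, and mapped straight to steering, throttle, and brake.

```python
import torch
import torch.nn as nn

class EndToEndDriver(nn.Module):
    """Toy end-to-end policy: camera image + LiDAR BEV grid -> controls."""

    def __init__(self):
        super().__init__()
        # Separate encoders per modality (kept tiny for illustration).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lidar_encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head maps the fused feature straight to control commands.
        self.head = nn.Linear(64, 3)  # steering, throttle, brake

    def forward(self, image, lidar_bev):
        fused = torch.cat([self.image_encoder(image),
                           self.lidar_encoder(lidar_bev)], dim=-1)
        steer, throttle, brake = self.head(fused).unbind(dim=-1)
        # Squash each output into its physical range.
        return torch.tanh(steer), torch.sigmoid(throttle), torch.sigmoid(brake)
```

In a real system the encoders would be full backbones and the head would be trained by imitation or reinforcement learning, but the input-to-control mapping is the same shape.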
However, end-to-end autonomous driving relies on black-box deep learning models, so improving both driving performance and interpretability remains an urgent pain point.

Many end-to-end methods for autonomous driving already exist. After analyzing their model structures in detail, Xu Dongyang and his team found that previous work had not fully exploited multimodal information.
Camera images carry rich semantic information but lack depth, while LiDAR provides excellent distance information; the two are therefore highly complementary.
However, most existing end-to-end methods extract each modality's features separately with a backbone network and then concatenate them in a high-dimensional space, or fuse the multimodal information with a Transformer.
In the Transformer case, the query is randomly initialized, so the attention-based fusion may fail to exploit the prior knowledge hidden in the multimodal features.
This can misalign the same key object across modalities, ultimately slowing convergence and leaving the model's learning suboptimal.
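The article implies that LVAFusion derives its fusion query from the multimodal features themselves rather than from random initialization. A minimal PyTorch sketch of that general idea follows; the class name and the mean-pooling choice are assumptions, not the team's actual design:

```python
import torch
import torch.nn as nn

class FeatureInitializedFusion(nn.Module):
    """Cross-attention fusion whose query comes from the modal features
    themselves (here, global average pooling) instead of random init."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_tokens, lidar_tokens):
        # img_tokens:   (B, N_img, dim) camera features
        # lidar_tokens: (B, N_pts, dim) LiDAR features
        tokens = torch.cat([img_tokens, lidar_tokens], dim=1)
        # Initialize the query from pooled multimodal features, so the
        # attention starts from priors already present in both modalities.
        query = tokens.mean(dim=1, keepdim=True)      # (B, 1, dim)
        fused, _ = self.attn(query, tokens, tokens)   # (B, 1, dim)
        return fused.squeeze(1)
```

Starting the query from the data rather than from noise gives the attention a head start on aligning the same object across modalities, which is the failure mode described above.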
Coding and Experimenting on a Snowy Winter Night in Zhongguancun

As his expertise accumulated and end-to-end autonomous driving developed, Xu Dongyang noticed some shortcomings in the field while reading the literature.
For example, whether multimodal information was being fully fused had not been explored enough, nor had how to improve a model's interpretability while preserving its accuracy. After some investigation, Xu Dongyang chose end-to-end autonomous driving as his research topic.
End-to-end autonomous driving is a large system that spans modules such as perception, tracking, prediction, decision-making, planning, and control, so the method had to connect all of these modules effectively.

Once the method was determined, a large amount of data had to be collected, since end-to-end models are based on deep learning and require large training sets.
It was also necessary to determine the model's inputs and outputs, and to collect data under various weather and operating conditions on the autonomous driving simulation platform CARLA, checking the integrity of the data along the way.
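Cycling through weather presets during collection might look like the following sketch. It uses the standard CARLA Python client API, though the server address and the particular preset list are illustrative assumptions:

```python
import carla

# Connect to a locally running CARLA server (address/port are assumptions).
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# A few built-in weather presets to diversify the collected data.
presets = [
    carla.WeatherParameters.ClearNoon,
    carla.WeatherParameters.WetCloudyNoon,
    carla.WeatherParameters.HardRainNoon,
    carla.WeatherParameters.ClearSunset,
]

for weather in presets:
    world.set_weather(weather)
    # ... spawn the ego vehicle and sensors, then record a driving episode
    # under this weather before moving on to the next preset.
```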
After data collection was complete, the next step was to analyze whether the model's structural design could actually support the task.
During the experiments, Xu Dongyang mistakenly loaded the wrong set of pre-trained weights. Because the weight shapes matched, the system reported no error, yet the experimental results remained unsatisfactory.
After extensive model debugging, the problem still had not been found. One night, while walking in Zhongguancun, Xu Dongyang suddenly realized he had not checked the training code: could the problem lie in the training process? He immediately ran back to his computer, checked the training pipeline, and finally traced the problem to the import of the pre-trained weights.
After the adjustment, the experimental results were very much in line with expectations. "This kind of discovery brings not only an understanding of the problem, but also a profound sense of satisfaction and achievement," said Xu Dongyang.
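The episode illustrates a silent failure mode: a wrong checkpoint whose tensor shapes happen to match will load without any error. A small guard like the following sketch can make such mistakes loud; the function name and the fingerprint idea are illustrative, not taken from the team's code:

```python
import torch

def load_pretrained_safely(model: torch.nn.Module, checkpoint_path: str):
    """Load pretrained weights and fail loudly on any key mismatch.

    Note: a wrong-but-shape-compatible checkpoint loads cleanly, so the
    key check alone is not enough; the fingerprint below gives a value
    to compare against the expected checkpoint for manual verification.
    """
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    if missing or unexpected:
        raise RuntimeError(f"missing={missing}, unexpected={unexpected}")
    # Fingerprint: changes whenever the loaded weights change, which helps
    # detect loading the wrong (but shape-matching) checkpoint.
    fingerprint = sum(p.double().sum().item() for p in model.parameters())
    print(f"Loaded {checkpoint_path}; parameter fingerprint = {fingerprint:.6f}")
```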
Because training took a long time, Xu Dongyang would submit multiple jobs to the training cluster every night. One night, because too many experiments had been submitted, some jobs were stopped for priority reasons.
When he checked the next day, he found that some experimental results were missing, so he had to carefully analyze the results again and resubmit the missing experiments.
Through this complicated process, he finally completed the research. The resulting paper was published on arXiv under the title "M2DA: Multi-Modal Fusion Transformer Incorporating Driver Attention for Autonomous Driving" [1].

Next, the research group will focus on further optimizing the model, expanding application scenarios, and enhancing the robustness and security of the system.
Specifically:
Firstly, deepening multimodal fusion technology is necessary.
The team will continue to explore more efficient algorithms for integrating data from different sensors, for instance using graph networks to match different modalities, with a particular focus on traffic scenarios in highly dynamic and complex environments.
Secondly, enhancing the driver-attention model is essential. Further research into the mechanism for simulating driver attention is needed, exploring how to more accurately predict and simulate the focus of human drivers' attention and how those focal points influence driving decisions.
Thirdly, safety and robustness need to be verified.
The existing model will be deployed on a small vehicle in the physical world, and more physical experiments will verify its performance under real-world conditions.
This will extend the research to a wider and more diverse range of driving scenarios and environmental conditions, such as adverse weather and night driving, thereby verifying and improving the system's universality and adaptability.
Finally, research on human-computer interaction should be carried out.
This means exploring how to integrate the technology more closely with human-computer interaction, for example by giving drivers more intuitive risk warnings and decision support, thereby enhancing the interaction between autonomous vehicles and human drivers.
Through these follow-up research plans, Xu Dongyang hopes not only to improve the performance of autonomous driving technology but also to ensure a better understanding of human driving behavior, laying the foundation for the realization of safer and smarter autonomous driving technology.