To understand the concept of "embodied intelligence," it is helpful to start with the term "embodied." "Embodied" does not simply mean "having a body"; rather, its core lies in the concept of the "body." In 1945, the French philosopher Maurice Merleau-Ponty introduced the concept of embodiment in his book "Phenomenology of Perception."
He argued that bodily experience forms the basis of how humans interact with and understand the world. "Embodied" means being immersed in reality: participating in a definite environment, committing to certain projects, and continuously engaging with them. It is this embeddedness that makes the body the foundation of human cognition of the world.
"Having a body is having a general device, having a map of all types of perception." Coincidentally, it was also during this period that the British computer scientist Alan Turing, in his paper "Computing Machinery and Intelligence," proposed the idea of a machine that could interact with its environment through sensors and learn on its own, the original seed of today's "embodied intelligence" [1].
Accordingly, "embodied intelligence" can be understood as robots of various forms that combine intelligent software with perceptual hardware. Like humans, they are immersed in real environments and continuously drive their own "evolution" through interaction with those environments.
Traditional AI relies on built-in models to represent the world and then constructs behaviors on top of these representations. This approach depends heavily on manual data annotation: it lacks flexibility in changing situations and cannot understand task-relevant factors that were never annotated. Because traditional AI lacks generalization capability, developers must meticulously define every possible behavioral state and situation and collect corresponding training data. This inevitably leads to an exponential increase in task complexity, making it extremely difficult, if not impossible, to pre-train for every minor variation.
The Transformer architecture underlying large models, by contrast, gives the model efficient parallel computation and flexibility: it can handle large-scale datasets and quickly adapt to different task scenarios by fine-tuning pre-trained models, while its layered structure enables deep abstraction and analysis of complex data.
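The parallelism described above can be seen in the Transformer's core operation, scaled dot-product attention: every position in a sequence attends to every other position in a single pair of matrix multiplications, rather than step by step. A minimal NumPy sketch with random weights, purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Every position attends to every other position at once,
    so the whole sequence is processed in parallel."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                          # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.standard_normal((seq_len, d_model))     # 4 token embeddings
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per position
```

The key point is that the `(seq, seq)` score matrix is computed in one shot for all positions, which is what makes large-scale pre-training on GPUs tractable.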
Therefore, the introduction of the Transformer architecture has brought a paradigm shift to the field of embodied intelligence, making truly intelligent embodied robots possible. This transformation can be likened to the evolution from feature phones to smartphones, with the advantages of embodied intelligent robots lying in their interactivity and versatility, that is, the ability to achieve natural interaction in open scenarios.
Let us first review the development of embodied intelligence based on large models. The first generation of models relied mainly on large language models (LLMs) and vision-language models (VLMs) to handle interactions with the physical world.
However, these models were limited to indirect interaction with the real world through visual question answering, lacking the ability to understand complex environments and interact in real time. With technological advances, second-generation models, represented by Google's PaLM-E and RT-2, attempt to integrate large language models with vision Transformers, tying natural language more closely to the real world.
Even so, challenges remain in building a four-dimensional world model (3D space plus time), effectively predicting future behavior, and reasoning flexibly in complex interactive scenarios.
From Functional to Intelligent
In general, limited by the state of the technology, the concept of embodied intelligence long remained underdeveloped. Only with the recent explosion of large-model technology has researchers' enthusiasm for exploring embodied intelligence been reignited. Chen Junbo is one of the participants in this surge. Holding a Ph.D. from the Department of Computer Science at Zhejiang University, he has accumulated considerable experience in artificial intelligence, particularly in research directions closely related to embodied intelligence such as autonomous driving, where he led the development of projects like the "Xiaomanlv" unmanned logistics robot.
Upon discovering new opportunities for the development of embodied intelligence, Chen Junbo realized that to explore a broader application space, a new platform was needed.
Therefore, he resigned from his position as the head of the autonomous driving department at Alibaba's DAMO Academy and founded Youlu Robotics in February 2023.
The embodied intelligence large model LPLM (Large Physical Language Model) developed by Chen Junbo and his team serves as an end-to-end embodied intelligence solution. It breaks through the limitations of traditional deep learning, which relies on closed sets and manually annotated data, thanks to the predictive learning strategy used by its decoder.
Specifically, it automatically infers complex time-series patterns from observed data, thereby understanding and predicting the implicit dynamics within the data. In this way, any given segment of data can be automatically annotated on the basis of the data that already exists. This self-annotation mechanism greatly improves the efficiency and quality of learning from unannotated data, because it allows the model to continuously correct and optimize its understanding and representations through its own predictions, adapting to the dynamic changes of the real world.
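The self-annotation idea can be illustrated with a toy next-step predictor: the "label" for each window of observations is simply the observation that actually followed, so an unannotated sensor stream supervises itself. LPLM's real decoder is not public; the linear model below is only a stand-in to demonstrate the general predictive-learning principle:

```python
import numpy as np

rng = np.random.default_rng(1)
# Unannotated "sensor stream": a noisy sine wave.
series = np.sin(np.linspace(0, 8 * np.pi, 200)) + 0.05 * rng.standard_normal(200)

window = 10
# Each input is a window of past values; each "annotation" is the
# value that actually followed - generated from the data itself,
# with no human labeling involved.
X = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

# Fit a least-squares next-step predictor (stand-in for the decoder).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w
mse = np.mean((pred - y) ** 2)
print(f"next-step MSE: {mse:.4f}")  # far below the variance of the signal
```

Every new segment of data immediately yields fresh (input, label) pairs, which is why this style of training scales with raw data rather than with annotation budgets.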
Taking autonomous driving as an example: in complex game-theoretic scenarios such as overtaking, LPLM can not only predict the behavioral intentions of other road users but also formulate the best course of action on that basis, such as a safe left pass or yielding, ensuring driving safety while improving traffic efficiency.
In addition, LPLM strengthens the ability to understand and execute natural-language commands. Chen Junbo gave an example of why this matters: "Why can't the various robo-taxis available today replace drivers? One reason is that when we give vague location information, they cannot accurately interpret the natural language."
By introducing a 3D Grounding mechanism that goes beyond the original two-dimensional Visual Grounding method, LPLM can more accurately locate objects. At the same time, the LPLM model has significantly improved its grasp of the complexity of the physical world through deep abstraction and fine modeling.
It refines information about the physical environment to the same level as the internal features of large language models, performs explicit logical mapping, and constructs a comprehensive, detailed representation of the environment by integrating multimodal data such as point clouds, images, sound, and text. These varied data sources provide rich environmental information, from three-dimensional shapes and spatial positions, to visual features, to contextual instructions, giving the model a comprehensive view of the world. This enables the model to understand and respond to imprecise or ambiguous instructions, significantly improving the adaptability and execution efficiency of embodied intelligent systems.
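As a rough illustration of this kind of multimodal fusion, assuming nothing about LPLM's actual architecture, each modality can be projected into a single shared embedding space and then combined; every dimension and name below is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_projection(in_dim, out_dim):
    # Random linear map standing in for a trained per-modality encoder.
    return rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)

d_shared = 16  # dimensionality of the shared embedding space (assumed)
proj = {
    "point_cloud": make_projection(6, d_shared),   # e.g. x, y, z + a normal
    "image": make_projection(32, d_shared),        # visual feature vector
    "text": make_projection(24, d_shared),         # instruction embedding
}

obs = {
    "point_cloud": rng.standard_normal(6),
    "image": rng.standard_normal(32),
    "text": rng.standard_normal(24),
}

# Project every modality into the same space, then fuse by averaging.
embeddings = [obs[m] @ proj[m] for m in proj]
fused = np.mean(embeddings, axis=0)
print(fused.shape)  # one shared-space vector for downstream reasoning
```

Because everything lands in one space, a vague instruction in the text channel can be grounded against the geometry in the point-cloud channel, which is the essence of the 3D grounding described above.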
Creating a "Universal Grammar" for Embodied Intelligence
Chen Junbo stated that the defining characteristic of embodied intelligence is the ability to autonomously perceive, think, and learn like a human. Humanoid robots, whose behavior resembles ours, have therefore naturally become one of the most closely watched directions in embodied intelligence, with Tesla's Optimus and Xiaomi's CyberOne as representative products.
However, embodied intelligence is far more than just humanoid robots, especially in industrial and logistics scenarios, where the combination with various types of equipment is where its broader value lies. Based on this, Chen Junbo and his team have created a universal "brain."
What they have endowed this brain with is a Chomskyan "universal grammar" for intelligent devices, intended to provide a universal cognitive structure and behavioral rules for robots of various forms. Such generalization, however, is not straightforward: because of differences in sensor models, the distribution of observed data, and interaction capabilities, the implicit knowledge one robot gains by exploring objects cannot be directly used by another robot with a different form [3].
Thanks to LPLM's ability to understand the three-dimensional, and even four-dimensional, world, the model can extract much common information from data. Through abstraction, projection, and transfer, it serves as a base model suitable for various robots, making general use possible.
At present, Chen Junbo and his team have launched an intelligent cleaning robot, which has been tested in iconic locations such as the Liangzhu Ancient City Ruins in Hangzhou and the Shanghai Center Building.
Chen Junbo said that the main reason for choosing cleaning and logistics robots as a breakthrough is that the field of embodied intelligence is currently in the pioneering stage of "from nothing to something."
If the concept of a general-purpose intelligent robot were introduced from the start, many potential customers might resist out of unfamiliarity with the technology, uncertainty about how to use it, and an unclear sense of its potential. He and his team have therefore demonstrated the capabilities of general intelligent models more intuitively through practical cases of intelligent cleaning, promoting the technology's spread into broader fields and working toward the vision of generality.
After launching the intelligent cleaning robot, Chen Junbo plans to extend this core technology, the intelligent "brain," to more traditional machinery such as excavators and loaders, achieving a wider range of intelligent upgrades.
However, breaking through the limitations of traditional machine learning, which relies on human programming and modular integration, cannot be achieved by growing data volume alone.
For Chen Junbo, then, LPLM's potential to self-evolve when driven by large-scale data has yet to be fully explored. He added: "In the field of embodied intelligence, what matters is not only the innovation of the technology itself, but, more importantly, how to apply this intelligent technology to different industries in an appropriate way."
To this end, he and his team are solving the specific problems of each application scenario one by one, while promoting the rapid adoption and industrialization of the technology through a sustainable business model. Going forward, they will continue working to realize a scaling law for the physical world: by expanding the coverage of data collection and application, they aim to form a positive cycle between data growth and technological progress, with embodied intelligence driving further transformation of traditional industries.