Scientists establish a new evaluation benchmark to assess the interactive data analysis capabilities of large language models

In the era of big data, automatic data analysis has become an indispensable tool for people across a wide range of technical backgrounds.

Large language models, exemplified by GPT-4, can already understand natural language queries and generate the corresponding code or analysis, making automatic data analysis increasingly feasible.

For instance, the success of Devin has sparked widespread interest in automatic data analysis based on large language models.

Existing datasets such as Text2Analysis and BIRD-SQL have, to some extent, measured the capabilities of large language models in handling complex data science or data analysis tasks.

However, real-world data analysis often involves complex multi-turn human-computer interaction, because human queries frequently contain ambiguity. For example, the word "noteworthy" in "Please list three noteworthy opponents" can be interpreted in multiple ways.

In addition, effective data analysis requires not only generating correct code or answers, but also the ability to adjust to user feedback and to interpret results in depth so they can support decision-making.


Given the importance of interactivity in data analysis, Li Jinyang, a PhD student at the University of Hong Kong, and his team initiated a research project to establish an interactive data analysis agent.

In the study, the research team closely examined logs of users interacting with ChatGPT and distilled six key agent behaviors.

After turning these observations into research questions, they needed a dataset to support the study. Finding that existing datasets could not meet their needs, they set out to build their own.

Although generating such data is relatively cheap and requires little manpower, developing the evaluation methods required them to verify each case individually, because the correctness of a data analysis does not depend solely on whether execution results match.

For instance, in a classifier-generation task, even if the outputs of the reference code and the predicted code differ, the prediction should still count as a success as long as it is sound, or even performs better, by common-sense standards.

Therefore, they designed independent evaluation code for almost every problem to avoid false negatives.
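To make this concrete, here is a minimal sketch of what such a per-problem check could look like for a classifier task; the function name, tolerance, and inputs are illustrative assumptions, not the team's actual evaluation code.

```python
# Hypothetical per-problem evaluation for a classifier-generation task: instead of
# requiring byte-identical outputs, accept the candidate if its held-out accuracy
# matches or beats the reference within a small tolerance.
from sklearn.metrics import accuracy_score

def evaluate_classifier_task(reference_preds, candidate_preds, y_test, tolerance=0.02):
    """Return True if the model-generated classifier is at least as good as the reference."""
    ref_acc = accuracy_score(y_test, reference_preds)
    cand_acc = accuracy_score(y_test, candidate_preds)
    # A candidate that disagrees with the reference but performs as well (or better)
    # still counts as a success, which avoids a false negative.
    return cand_acc >= ref_acc - tolerance
```

The point of such a design is to compare task-level quality rather than exact output equality, which is what keeps a correct-but-different solution from being marked wrong.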

At the same time, they also approached the problem from the model's perspective, asking why the model makes such mistakes and how they can be avoided.

In addition, they explored methods for making the model pay more attention to the historical information that is valuable for the current question, so as to meet user expectations as fully as possible.

In the study, they used a variety of large language models for this task, and each exhibited a distinct "personality," or set of habits, when solving complex problems.

 

GPT-4-Turbo in particular, the team found, displays a "performative" personality, which showed up across a variety of test settings.

 

For instance, in the code generation task, GPT-4-Turbo tends to produce longer code, sometimes even creatively defining custom functions before calling them, as if to "show off."

 

In private-library scenarios, GPT-4-Turbo calls user-defined functions more frequently, and somewhat ostentatiously.

 

The most interesting example appears in the Action mode, where the model needs to ask the user to clarify ambiguous conditions. Faced with a question like "How many have a good credit history among all accounts?", other models might simply ask, "What is a good credit history?" GPT-4-Turbo goes a step further and poses a hypothesis: "A good credit history means their credit column contains 'good credit', right?"

This indicates that GPT-4-Turbo actively thinks and forms hypotheses before asking questions. While this approach lets it demonstrate its intelligence and flair for showing off, it also carries risks.

If the hypothesis is wrong and the user simply answers no, GPT-4-Turbo misses the opportunity to understand what the ambiguous condition actually means.

Although this "personality" trait can introduce errors on complex tasks, such as over-calling user-provided code and causing execution failures, or issuing clarification requests based on incorrect assumptions, it also enriches the human-computer interaction experience.

Researchers are gradually realizing that, to improve the efficiency and reliability of human-computer interaction, users need to adapt to, and even imitate, these characteristics of the model. This process of mutual adaptation and learning not only improves the quality of the interaction but also deepens people's understanding of the agent's "personality" and of how interaction shapes outcomes, allowing the model to produce results closer to what users expect.

Overall, the lack of a benchmark for interactive data analysis was one of the biggest obstacles this research faced. To address it, the team drew inspiration from the "Stanford Town" project and created "DECISION COMPANY."

"DECISION COMPANY" is the first multi-agent sandbox environment in the field of data analysis, comprising client, data scientist, administrator, and AI chatbot agents. With it, researchers can simulate the interaction between data scientists and chatbot agents.

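As a rough, hypothetical illustration of how such a role-based sandbox could be wired together (the agent roles follow the description above, but the class, the `call_llm` placeholder, and the single-round flow are assumptions rather than the project's actual implementation):

```python
# Minimal sketch of a role-based multi-agent sandbox. `call_llm` is a stand-in for
# any chat-completion client; the real DECISION COMPANY pipeline is more elaborate.
from dataclasses import dataclass

def call_llm(system_prompt: str, message: str) -> str:
    # Placeholder: swap in a real LLM client here.
    return f"<{system_prompt}> reply to: {message[:60]}"

@dataclass
class Agent:
    name: str
    persona: str  # "client", "data scientist", "administrator", or "AI chatbot"

    def respond(self, message: str) -> str:
        return call_llm(f"You are {self.name}, acting as a {self.persona}.", message)

def simulate_round(client: Agent, scientist: Agent, chatbot: Agent, admin: Agent, request: str):
    """One simulated analysis round: client need -> analyst query -> chatbot answer -> admin review."""
    need = client.respond(f"Describe your analysis need: {request}")
    query = scientist.respond(f"Turn this client need into a concrete data analysis question:\n{need}")
    answer = chatbot.respond(f"Answer with analysis code or a clarification question:\n{query}")
    review = admin.respond(f"Check this exchange for quality and ambiguity:\n{query}\n{answer}")
    return need, query, answer, review
```
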
Based on this environment, they developed the Tapilot-Crossing benchmark, which covers a range of modes, from routine code generation to handling ambiguous questions and integrating private code libraries, and can comprehensively evaluate a model's interactive data analysis capabilities.

This benchmark includes not only code generation tasks but also multiple-choice tasks that require the model to understand, summarize, and reason about the results of code execution in order to provide useful insights.

Although Tapilot-Crossing is a large-scale, comprehensive test set, it cost less than 100 US dollars to construct, demonstrating the potential of virtual multi-agent environments for generating complex, high-quality datasets.
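To make these task formats concrete, a single benchmark instance might look roughly like the sketch below; the field names and values are illustrative assumptions, not the actual Tapilot-Crossing schema.

```python
# Illustrative shape of one interactive-analysis instance (hypothetical fields).
example_instance = {
    "mode": "clarification",            # e.g. "code_generation", "clarification", "private_lib", "multiple_choice"
    "tables": ["credit_accounts.csv"],  # data the agent may read
    "history": [                        # prior turns the model must condition on
        {"role": "user", "content": "How many have a good credit history among all accounts?"},
        {"role": "agent", "content": "Does 'good credit history' mean the credit column contains 'good credit'?"},
        {"role": "user", "content": "Yes."},
    ],
    "expected_output": "code",          # or "choice" for result-understanding questions
    "reference": "df[df['credit'] == 'good credit'].shape[0]",
    "evaluator": "eval_scripts/credit_count.py",  # per-instance evaluation logic
}
```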

However, the researchers' experiments showed that even GPT-4-32k, equipped with effective tools and reasoning strategies, performed poorly on this benchmark (below 30%), revealing the limitations of large language models in interactive scenarios.

In the experiments, they found that these models rarely reflect on information from previous successful interactions. When faced with similar questions or related conditions, the models either keep asking redundant questions or ignore those questions and conditions altogether.

Therefore, the research team proposed AIR, a dynamically transferable interaction-reflection strategy, to improve the models' interactive performance.

During the interaction, the model learns from successful historical cases, and the results show that the AIR strategy significantly improves the model's understanding and execution of user instructions.

More broadly, compared with existing academic datasets in data science and data analysis, this dataset effectively narrows the gap between academic research and practical application.

The dataset covers not only explicit user questions but also scenarios involving ambiguous questions and user-defined functions, together with a comprehensive assessment of the interaction behavior of data analysis agents.

In addition, the dataset captures the multi-goal nature of user instructions in real scenarios and sets a new high for average code length per turn, bringing it closer to real-world data analysis code generation.

In the research, the team also proposed the CSE metric, exploring a new evaluation method that is cost-efficient and better reflects a model's ability to generate long code, opening up new avenues for long-code generation and evaluation.

This is similar to the final big question on the math section of the college entrance exam: even if a student gets the final result wrong, getting some of the steps right still earns substantial partial credit.
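In that spirit, the sketch below shows one way step-wise partial credit could be computed for generated analysis code. It is only an illustration of the idea; the checkpoint names and values are made up, and this is not the paper's actual CSE definition.

```python
# Award partial credit by checking intermediate "checkpoints" of the reference
# solution instead of only the final answer (hypothetical scoring sketch).
def partial_credit(reference_steps, candidate_namespace):
    """reference_steps: list of (variable_name, expected_value) checkpoints.
    candidate_namespace: variables left behind after exec'ing the candidate code."""
    hits = sum(
        1
        for name, expected in reference_steps
        if name in candidate_namespace and candidate_namespace[name] == expected
    )
    return hits / len(reference_steps)  # fraction of intermediate steps reproduced

# A candidate that cleans and groups the data correctly but fumbles the final
# aggregation still earns two thirds of the credit.
score = partial_credit(
    [("n_rows_after_cleaning", 980), ("n_groups", 12), ("top_group_mean", 4.7)],
    {"n_rows_after_cleaning": 980, "n_groups": 12, "top_group_mean": 5.1},
)
assert abs(score - 2 / 3) < 1e-9
```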

 

In the study, the team also introduced an economical and efficient method for generating benchmarks, aiming to minimize manpower and cost while ensuring data quality. This method also helps avoid data contamination, providing a more reliable basis for evaluating agent performance.

 

At the same time, the AIR strategy proposed by the researchers relies on a simple and effective reflection mechanism to alleviate problems such as users having to restate their needs repeatedly when working with interactive intelligent systems.

 

The strategy improves the interactive experience by analyzing the previous round of interaction and learning user preferences, without requiring additional training or retrieval from example libraries.
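A minimal sketch of what such a reflect-then-condition step could look like, assuming a generic `call_llm` chat client; it illustrates the general idea rather than the paper's exact AIR implementation.

```python
# Distill the previous successful turn into short rules, then condition the next
# prompt on those rules instead of replaying the raw conversation history.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat-completion client here.
    return "Prefer concise pandas code; treat 'good credit history' as credit == 'good credit'."

def reflect_on_turn(user_query: str, generated_code: str, user_feedback: str) -> str:
    """Ask the model to summarize what made the previous turn acceptable to the user."""
    return call_llm(
        "Summarize, as short imperative rules, the user preferences and logic that made "
        f"this answer acceptable:\nQuery: {user_query}\nCode: {generated_code}\nFeedback: {user_feedback}"
    )

def next_turn_prompt(learned_rules: str, new_query: str) -> str:
    """Build the next prompt around the distilled rules."""
    return f"Follow these preferences learned from earlier turns:\n{learned_rules}\n\nNew request: {new_query}"

# Usage: after a turn the user accepted, reflect once and reuse the rules next turn.
rules = reflect_on_turn(
    "How many accounts have a good credit history?",
    "df[df['credit'] == 'good credit'].shape[0]",
    "Yes, that's exactly what I meant.",
)
prompt = next_turn_prompt(rules, "Now break that count down by region.")
```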

 

This strategy is expected to find wide application in interactive intelligent systems that involve reasoning.

Recently, the related paper, titled "Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents," was published on arXiv[1], with Li Jinyang as the first author.

Next, the researchers plan to introduce more data analysis languages. The current study focuses mainly on tabular data analysis and the Python language.

However, they found that relational databases and SQL occupy an indispensable position in data analysis, so it is necessary to bring these elements into the scope of the research.

In addition, they plan to improve the evaluation method for long code generation. The team realized that, under the current evaluation scheme, two pieces of code can produce identical execution results yet differ in actual quality.

Therefore, they hope to develop finer-grained yet economical "soft" evaluation criteria that better distinguish the actual quality and potential value of code, so that even seemingly identical results accurately reflect a piece of code's true capability.
