Scientists launch a large-model math dataset covering Olympiad-level problems, which scores intermediate reasoning steps rather than only final answers

Zhao Zilong earned his Ph.D. from Grenoble Alpes University in France and later conducted postdoctoral research at Delft University of Technology in the Netherlands and the Technical University of Munich in Germany. He currently serves as a research manager at the National University of Singapore.

During his time at the Technical University of Munich, he and his colleagues carried out a research project that enhanced the ability of large models to solve complex mathematical problems.

In the process, they improved not only the algorithm's reasoning speed but also the quality of the intermediate results it searches. Their newly launched dataset, TriMaster100, is also better suited to evaluating algorithms on complex mathematical problems.

At present, Zhao Zilong's collaborators are building on this result to develop an AI-based math tutor.


Solving Olympiad math problems with large models

The topic of this study dates back to February 2023. At that time, some research teams had already begun using large models for logical and mathematical reasoning, and Zhao Zilong and his collaborators believed this direction was very promising.

He said the example that impressed him most was a mathematical reasoning problem on the OpenAI website: simplify tan 100° + 4 sin 100°. According to OpenAI, the probability of ChatGPT solving this problem is about 0.1%.
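
For reference, the expression does have a closed-form answer. A standard identity-based derivation (reconstructed here for illustration, not taken from the team's work) runs:

```latex
\begin{align*}
\tan 100^\circ + 4\sin 100^\circ
  &= \frac{\sin 100^\circ + 4\sin 100^\circ \cos 100^\circ}{\cos 100^\circ}
   = \frac{\sin 100^\circ + 2\sin 200^\circ}{\cos 100^\circ}
   && \text{(since } 2\sin x\cos x = \sin 2x\text{)} \\
  &= \frac{\sin 80^\circ - 2\sin 20^\circ}{-\sin 10^\circ}
   = \frac{\tfrac{\sqrt{3}}{2}\cos 20^\circ - \tfrac{3}{2}\sin 20^\circ}{-\sin 10^\circ}
   && \text{(expand } \sin 80^\circ = \sin(60^\circ + 20^\circ)\text{)} \\
  &= \frac{\sqrt{3}\,\cos 80^\circ}{-\sin 10^\circ}
   = \frac{\sqrt{3}\,\sin 10^\circ}{-\sin 10^\circ}
   = -\sqrt{3}.
\end{align*}
```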

He was very curious how well ChatGPT could actually answer the question, so he experimented with different prompts and fed ChatGPT different intermediate results to see whether he could improve the success rate.

The results showed that when prompts were supplied step by step, the solve probability was far greater than 0.1%. Zhao Zilong and his collaborators then began to model the existing methods and tried them on other mathematical problems.
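
A minimal sketch of this kind of experiment is below. It is hypothetical: the model name, prompts, hint texts, and answer check are assumptions rather than the team's actual code, and it assumes the OpenAI Python client (v1+) with an API key in the environment.

```python
# Hypothetical sketch: measure how hints change the solve rate.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROBLEM = "Simplify tan 100° + 4 sin 100°."
HINTS = [                       # intermediate results fed in one at a time
    "Rewrite tan 100° as sin 100° / cos 100°.",
    "Use the identity 2 sin x cos x = sin 2x.",
]

def solve_rate(hints, trials=20):
    """Fraction of trials whose final answer contains -sqrt(3)."""
    hint_text = "".join(f"\nHint: {h}" for h in hints)
    solved = 0
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": PROBLEM + hint_text + "\nShow each step."}],
            temperature=1.0,
        )
        answer = resp.choices[0].message.content
        if "-\\sqrt{3}" in answer or "-sqrt(3)" in answer:
            solved += 1
    return solved / trials

# Solve rate with no hints vs. with progressively more hints.
for k in range(len(HINTS) + 1):
    print(k, "hints ->", solve_rate(HINTS[:k]))
```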

Small-scale tests showed that, compared with other industry-leading results, the method indeed improved the problem-solving probability, so they moved on to large-scale testing. During this process, they found that accuracy alone, as the final test metric, could not fully reflect the algorithm's advantages.

Since requests to large models cost money, an algorithm that uses a large model to solve mathematical problems sets an upper limit on the number of requests it may make.

As a result, some complex mathematical problems cannot be fully solved within the limited number of requests. The same conclusion applies to other algorithms that use large models for mathematical reasoning. New evaluation criteria are therefore needed for large-model mathematical reasoning algorithms.
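
Such a request cap can be sketched as below. This is an assumed design for illustration, not the paper's code: `llm` is any callable wrapping the model, and the done-marker is a placeholder.

```python
# Minimal sketch: cap the number of LLM requests spent on one problem.
class RequestBudget:
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self) -> bool:
        """Return True if another request is allowed, else False."""
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

def solve(problem: str, llm, budget: RequestBudget):
    """Query the model step by step until it finishes or the budget runs out."""
    state = problem
    while budget.charge():
        step = llm(state)             # one LLM request
        if step.endswith("QED"):      # assumed done-marker
            return state + "\n" + step
        state = state + "\n" + step
    return None                       # budget exhausted: problem only partially solved
```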

In fact, research teams have already used large models for mathematical reasoning and achieved good results on simple mathematical datasets.

In this study, Zhao Zilong and his colleagues wanted to tackle reasoning over complex mathematical problems. The project also received support from German national research funding: Zhao Zilong's postdoctoral supervisor, Professor Enkelejda Kasneci of the Technical University of Munich, wants to build an open-source project that uses large models to solve mathematical problems of at least high-school level.

That way, in economically underdeveloped areas, any student with internet access could seek help from an online tutor at any time, greatly improving the fairness of education.

Currently, there are two main ways to improve the mathematical reasoning ability of large models:

First, fine-tuning the model on mathematical datasets to strengthen its inherent logical reasoning ability. Second, prompt engineering: without changing the large model itself, designing its input so that its output better meets the requirements.

Zhao Zilong and his colleagues believed that the training resources available in industry far exceed those in academia, making it hard for them to compete on model fine-tuning, so they decided to start with the second method.

Previously, an algorithm known as tree-of-thought (ToT) had been presented at the Conference on Neural Information Processing Systems (NeurIPS).

When reasoning, ToT generates a few candidate options for the next step at a time, uses a large model to evaluate them, selects the most likely next step, and then continues generating possible next steps until the problem is solved.
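
That loop can be sketched as follows. This is a simplified, greedy single-path variant for illustration only: `generate` and `evaluate` are assumed LLM wrappers, and the real ToT algorithm maintains a search tree with backtracking rather than giving up at a dead end.

```python
# Simplified greedy sketch of the tree-of-thought loop described above.
def tree_of_thought(problem, generate, evaluate, max_depth=10, k=3):
    state = problem                                # partial reasoning chain
    for _ in range(max_depth):
        candidates = generate(state, n=k)          # propose k possible next steps
        scored = [(evaluate(state, c), c) for c in candidates]  # LLM-rated
        best_score, best_step = max(scored)        # keep the most promising step
        if best_score <= 0:                        # dead end: real ToT backtracks here
            return None
        state = state + "\n" + best_step
        if "final answer" in best_step.lower():    # assumed termination marker
            return state
    return None
```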

In their research, the team found that ToT advances only one step along the reasoning chain at a time. This can still solve the problem, but it is very slow.

For complex problems with a long logic chain, ToT takes a long time to reach an answer, and only if no mistakes are made along the way; otherwise it must backtrack to a previous state.

Based on this, they began to ask: with the same number of large-model requests, can the problem be solved faster? To this end, the team proposed an algorithm called SSC-CoT (stepwise self-consistent chain-of-thought).

Here, CoT (chain-of-thought) also originates from a classic NeurIPS paper. That paper's main contribution: when using a large model for reasoning, have the model output the intermediate steps along with the result, rather than the result alone, which greatly improves the correctness of the reasoning.
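
In prompt form, the difference is simply asking for the steps, not just the answer. A minimal illustration (the wording is an assumption, not a quoted prompt):

```python
# Direct prompting vs. chain-of-thought prompting (illustrative wording).
direct_prompt = "Simplify tan 100° + 4 sin 100°. Give only the final result."

# CoT: ask the model to write out intermediate steps before the answer.
cot_prompt = (
    "Simplify tan 100° + 4 sin 100°.\n"
    "Reason step by step: state every identity you use and each "
    "intermediate result, then give the final answer."
)
```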

Zhao Zilong and his colleagues found that a large model doing mathematical reasoning may not produce the correct reasoning and the correct answer in a single attempt. Across different attempts, however, there are always some correct intermediate steps.

They therefore reasoned that every piece of reasoning, that is, every chain of thought, can be broken down into intermediate steps, and that the same intermediate steps can then be found across different reasoning attempts. These recurring intermediate steps are likely to be correct steps toward solving the problem.

After all, a problem can have countless wrong answers, but the correct answers are limited. If different reasoning attempts share the same intermediate result, that result is likely a useful one for solving the problem. These intermediate results are then externally validated, and subsequent reasoning can continue downward from them until the final answer is obtained.
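
A minimal sketch of this "intersection of chains" idea, assuming plain-text chains with one step per line; the normalisation and the external verifier here are placeholders, not the paper's implementation:

```python
# Sample several chains, split each into steps, and treat steps that
# recur across chains as likely-correct anchors for further reasoning.
from collections import Counter

def normalise(step: str) -> str:
    """Crude canonical form so equivalent steps compare equal."""
    return " ".join(step.lower().split())

def common_steps(chains, min_count=2):
    """Intermediate steps appearing in at least `min_count` chains."""
    counts = Counter()
    for chain in chains:
        steps = {normalise(s) for s in chain.split("\n") if s.strip()}
        counts.update(steps)          # each chain votes once per step
    return [s for s, c in counts.items() if c >= min_count]

# chains  = [llm(problem) for _ in range(5)]              # sampled attempts
# anchors = [s for s in common_steps(chains) if verify(s)] # external check
# ...then continue reasoning downward from the verified anchors.
```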

TriMaster100: a dataset of 100 trigonometric function problems

Having found that SSC-CoT can indeed identify more correct intermediate results, they decided to build a new dataset of their own: TriMaster100, containing 100 trigonometric function problems ranging in difficulty from high school to Olympiad level.

The reason: existing mathematical reasoning datasets, without exception, focus solely on the final result. Even when a dataset includes the intermediate reasoning process, the lack of step-by-step scoring of intermediate results means existing algorithms struggle to fully answer harder mathematical reasoning questions, so on accuracy alone these algorithms show almost no difference.

If two students complete 90% and 10% of a solution respectively, their scores should differ. This matches the exam experience familiar to many Chinese students: correct intermediate steps also earn partial credit.

Besides accuracy, the TriMaster100 dataset can compute each algorithm's specific score on each problem and then a final total score, making it a better way to evaluate mathematical reasoning models.
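
Conceptually, the scoring might look like the sketch below; the matching rule and data layout are illustrative assumptions, not TriMaster100's actual format:

```python
# Hypothetical per-step scoring: award credit for each reference
# intermediate step that the model's solution reaches.
def score_solution(model_steps, reference_steps):
    """Fraction of reference intermediate steps present in the solution."""
    reached = sum(1 for ref in reference_steps
                  if any(ref in step for step in model_steps))
    return reached / len(reference_steps)

reference = ["sin 100 / cos 100", "2 sin x cos x = sin 2x", "-sqrt(3)"]
partial   = ["rewrite as sin 100 / cos 100", "stuck here"]
print(score_solution(partial, reference))   # 1/3: partial credit, not zero
```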

The dataset targets trigonometric function problems for two reasons. First, trigonometric reasoning is relatively abstract, and some scholars have noted that high-school students find trigonometric problems difficult. Second, trigonometric transformations are clearly delineated, which makes it easier to score intermediate results step by step.

For the TriMaster100 dataset, the team also built a trigonometric function knowledge graph. In experiments, they found that supplying relevant knowledge retrieved from the graph effectively improves the reasoning level of large models. When solving a mathematical problem, if advanced theorems can be provided to the large model as hints, the model does not need to reason from scratch, which naturally improves its reasoning efficiency.
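
A toy version of this retrieval step might look as follows; the flat dictionary here is a stand-in for a real knowledge graph, and its contents are illustrative, not the team's actual graph:

```python
# Sketch: look up trigonometric identities relevant to a problem and
# prepend them to the prompt as hints (illustrative contents only).
KNOWLEDGE_GRAPH = {
    "tan": ["tan x = sin x / cos x"],
    "double angle": ["sin 2x = 2 sin x cos x", "cos 2x = 1 - 2 sin^2 x"],
    "sum": ["sin(a+b) = sin a cos b + cos a sin b"],
}

def retrieve_hints(problem: str):
    """Collect identities whose key concept appears in the problem text."""
    text = problem.lower()
    hints = []
    for concept, identities in KNOWLEDGE_GRAPH.items():
        if concept in text:
            hints.extend(identities)
    return hints

problem = "Simplify tan 100° + 4 sin 100°."
prompt = problem + "\nUseful identities:\n" + "\n".join(retrieve_hints(problem))
```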

With that, this phase of the research drew to a close. Recently, the related paper was posted on arXiv under the title "Stepwise Self-Consistent Mathematical Reasoning with Large Language Models" [1].

Zhao Zilong is the first author, and Professor Enkelejda Kasneci of the Technical University of Munich in Germany is the corresponding author.

In the future, to build a platform that can truly be commercialized, the team still needs to iterate on visualization. When students use it, the goal is not just to provide answers but to encourage students to think for themselves.

For example, when solving problems, students can first give their own reasoning, or choose a possible direction from the options the platform provides. The student's choice may of course be wrong, in which case the platform should ideally offer an explanation. That is, when solving math problems, students want to know not only the correct answer but also where their own method went wrong.

If the large model can then give a reasonable explanation of the wrong approach, it offers students a very good learning experience.

Currently, Zhao Zilong's collaborators are cooperating with a German online education institution, using its students' learning data for further research.
