Peking University and the Qianwen team launch a math-specific version of CriticGPT that "finds faults" to help large models improve faster

Qubits
2024/07/07 21:28

QbitAI | Official account: QbitAI

Criticism not only makes people improve, but also improves the ability of large models.

OpenAI used this idea to build CriticGPT, a model that finds faults. Coincidentally, just a few days before CriticGPT was released, Peking University, together with the Qianwen team and others, designed a "math-specific version" of CriticGPT along similar lines.

In a training-free setting, where the verifier assists the model only at inference time, accuracy on the GSM8K dataset rises from 86.6% to 88.2%.

The core idea of CriticGPT is to deliberately insert bugs into code, annotate them in detail, and then use the resulting data to train a model that can debug.

The Peking University team found that this method is useful not only for code but also for helping language models solve mathematical problems.

So the team took a similar approach, replaced code with math problems, and launched the "mathematical version of CriticGPT": Math-Minos.

In the field of mathematical reasoning, verifying the correctness of a solution is a critical step in ensuring the quality of reasoning.

However, most existing mathematical verifiers are trained on binary classification labels alone. Such labels cannot explain why a solution is right or wrong, so they provide too weak a supervision signal for training the verifier.

Math-Minos overcomes this limitation by providing deeper explanations that greatly enrich the verifier's training signal.

It introduces step-by-step natural language feedback as rationale labels, which not only state whether a solution is right or wrong but also analyze the cause of each error step by step.

To obtain natural language feedback, the research team initially used GPT-4 to generate training data, but experiments showed that even GPT-4 makes a certain proportion of errors when evaluating mathematical reasoning step by step.

To mitigate this problem, the researchers simplified GPT-4's task by including step-level binary classification labels in the prompt, allowing GPT-4 to generate more accurate assessments.
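As a rough illustration of this idea, the sketch below builds a prompt in which every step already carries its correct/incorrect label, so the language model only has to explain the judgment rather than make it. The function name `build_feedback_prompt` and the prompt wording are hypothetical; the article does not give the team's actual prompt.

```python
def build_feedback_prompt(question, steps, step_labels):
    """Build a prompt that asks a model to explain pre-labeled solution steps.

    question:    the math problem as a string
    steps:       list of solution steps (strings)
    step_labels: list of bools, True if the matching step is correct
    The prompt wording is an assumption made for illustration only.
    """
    if len(steps) != len(step_labels):
        raise ValueError("steps and step_labels must have the same length")
    lines = [
        "Question: " + question,
        "Each step of the solution below is already marked correct or incorrect.",
        "Explain, step by step, why each mark is justified.",
        "",
    ]
    for i, (step, ok) in enumerate(zip(steps, step_labels), start=1):
        tag = "correct" if ok else "incorrect"
        lines.append(f"Step {i} [{tag}]: {step}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_feedback_prompt(
        "What is 2 + 3 * 4?",
        ["3 * 4 = 12", "2 + 12 = 15"],
        [True, False],
    ))
```

Because the labels are supplied, the model's job shrinks from "find the error" to "explain the known error", which is the simplification the paragraph above describes.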

Training then proceeds in two stages. In the first stage, the model is fine-tuned on the natural language feedback with supervised fine-tuning, which effectively improves its evaluation ability.

In the second stage, standard ORM (Outcome Reward Model) and PRM (Process Reward Model) training yields a verifier that is efficient at inference time. This design has two benefits.

First, two-stage training decouples the binary classification data from the supervised fine-tuning data.

Because the binary supervision signal is sparse, the data used for binary classification training is usually far larger than the data used for supervised fine-tuning, and the study finds that only a small amount of supervised fine-tuning data is needed to greatly improve the model's evaluation ability.

Second, the verifier does not need to generate natural language feedback explicitly when it verifies a solution, which makes inference more efficient.
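The efficiency point can be made concrete with a small sketch. At inference time a verifier only has to return numbers: an ORM gives one score per solution, while a PRM gives one score per step that must be reduced to a single solution score. The article does not say how Math-Minos aggregates step scores; taking the minimum or the product, as below, are common choices assumed here, and the function name `prm_solution_score` is hypothetical.

```python
import math


def prm_solution_score(step_scores, how="min"):
    """Reduce per-step PRM scores (each in [0, 1]) to one solution score.

    how="min"  : the solution is only as good as its weakest step
    how="prod" : treat step scores as independent probabilities
    Both rules are assumptions, not the method confirmed by the article.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.8]
    print(prm_solution_score(scores, "min"))
    print(prm_solution_score(scores, "prod"))
```

No text is generated in this path, so scoring a candidate costs a single forward pass through the verifier.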

Overall, the researchers added 30K natural language feedback examples during training, which improved the mathematical ability of the Mistral-7B verifier. Under the Best-of-256 experimental setup:

Under the ORM setting, Math-Minos improved the accuracy of Mistral-7B from 86.2% to 87.3% on the GSM8K dataset and from 35.9% to 37.4% on the MATH dataset.

Under the PRM setting, Math-Minos improved the accuracy of Mistral-7B from 87.1% to 87.6% on the GSM8K dataset and from 36.7% to 37.8% on the MATH dataset.

In combination with Self-Consistency, Math-Minos improved the accuracy of Mistral-7B from 87.1% to 88.2% on the GSM8K dataset and from 37.8% to 38.6% on the MATH dataset.
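For readers unfamiliar with these evaluation setups, here is a minimal sketch. Best-of-N samples N candidate solutions and keeps the one the verifier scores highest; combining a verifier with Self-Consistency is often done by summing verifier scores over candidates that share a final answer and picking the answer with the largest total. That weighted-voting rule is an assumption here, since the article does not spell out the exact combination, and the function names are hypothetical.

```python
from collections import defaultdict


def best_of_n(samples):
    """samples: list of (final_answer, verifier_score) pairs.

    Returns the final answer of the single highest-scored candidate.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return max(samples, key=lambda pair: pair[1])[0]


def verifier_weighted_vote(samples):
    """Self-Consistency weighted by verifier score (assumed combination rule).

    Scores of candidates with the same final answer are summed, and the
    answer with the largest total wins.
    """
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    toy = [("42", 0.9), ("41", 0.5), ("41", 0.6)]  # made-up scores
    print(best_of_n(toy))               # picks the top single candidate
    print(verifier_weighted_vote(toy))  # picks the answer with most total score
```

The toy input shows how the two rules can disagree: one confident candidate wins Best-of-N, while two moderately scored candidates that agree can win the weighted vote.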

Math-Minos shows superior performance in both ORM and PRM task settings, especially in ORM settings, where the improvement is even more significant.

In addition, the research team analyzed in depth the step-level errors made by the generator, categorizing them into five types: extraneous errors, cumulative errors, calculation errors, logical errors, and other errors.

The analysis shows that steps in multi-step reasoning can go wrong for many reasons and that the model makes errors of every type, which further underlines the importance of natural language feedback for guiding model learning.

Experiments have found that cumulative errors (i.e., errors in one step are likely to lead directly to errors in all subsequent steps) account for the highest proportion of all error types on both datasets.
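To make the notion of a cumulative error concrete, the sketch below takes step-level correctness labels and separates the first wrong step from the wrong steps that follow it. Treating every later wrong step as a candidate cumulative error is a simplification assumed here for illustration; the article does not describe how the team assigned error types, and the function name is hypothetical.

```python
def split_root_and_followup_errors(step_labels):
    """step_labels: list of bools, True if the step is correct.

    Returns (root_index, followup_indices), using 0-based indices:
      root_index       : index of the first wrong step, or None if all correct
      followup_indices : indices of wrong steps after the first one, which are
                         candidates for cumulative errors (a simplification)
    """
    wrong = [i for i, ok in enumerate(step_labels) if not ok]
    if not wrong:
        return None, []
    return wrong[0], wrong[1:]


if __name__ == "__main__":
    # Step 2 goes wrong and the two steps after it inherit the mistake.
    print(split_root_and_followup_errors([True, False, False, False]))
```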

The error distribution also differs between datasets: the relatively simple GSM8K has more calculation errors, while the more difficult MATH dataset has more logical errors.

By constructing a meta-evaluation set, the research team evaluated the verifier's ability to accurately judge the final answer without the influence of the generator.

The results show that, throughout training, Math-Minos consistently outperforms the traditional ORM on the meta-evaluation, converging faster and judging more accurately.

At the same time, the experimental results also show that Math-Minos has strong potential for scaling up.
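A meta-evaluation of this kind can be pictured as a plain classification check: the verifier scores a fixed set of solutions whose correctness is known, and its judgments are compared with the ground truth. The sketch below assumes scores in [0, 1] and a 0.5 decision threshold; both the threshold and the function name are assumptions, since the article does not give the metric's exact definition.

```python
def meta_eval_accuracy(scores, labels, threshold=0.5):
    """Fraction of solutions the verifier judges correctly.

    scores:    verifier scores in [0, 1], one per solution
    labels:    ground-truth bools, True if the final answer is correct
    threshold: score at or above which the verifier says "correct" (assumed)
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one example")
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(scores)


if __name__ == "__main__":
    # Made-up scores for illustration only.
    print(meta_eval_accuracy([0.9, 0.2, 0.7, 0.4], [True, False, False, False]))
```

Because the solution set is fixed in advance, this number reflects the verifier alone and does not depend on how good the generator happens to be.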

In conclusion, Math-Minos not only improves the performance of mathematical verifiers but also offers a new training paradigm for the field of natural language processing.

The research team hopes that this work will inspire future research to explore how natural language feedback can be integrated with classification-based verifiers, advancing the capabilities of large language models on complex reasoning tasks.
