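More likely, the large language model had already seen these problems during training; unfortunately, the paper does not analyze this possibility.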
The UCLA team, Taylor Webb, Keith Holyoak, and Hongjing Lu, relied on a large collection of ways that past research has tested humans’ ability to reason via analogy. The classic form of this is the completion of a comparison—think “cold is to ice as hot is to ____”—where you have to select the best completion from a set of options.
Related tests involve figuring out the rules behind transformations of a series of letters. So, for example, if the series a b c d is transformed to a b c e, then the rule is to replace the last letter of the series with its alphabetical successor. The participant’s understanding of the rule is tested by asking them to use the rule to transform a different set of letters. Similar tests with numbers can involve complex rules, such as “only even numbers in order, but can be ascending or descending.”
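To make the letter-string task concrete, here is a minimal Python sketch of the rule from the example above: replace the last letter of a series with its alphabetical successor, then apply the same rule to a different series. The function names `successor` and `replace_last_with_successor` and the probe series `i j k l` are illustrative assumptions, not taken from the paper or the article, and raising an error at `z` is just one possible choice.

```python
import string


def successor(letter: str) -> str:
    """Return the next lowercase letter (hypothetical helper for illustration)."""
    alphabet = string.ascii_lowercase
    idx = alphabet.index(letter)
    if idx == len(alphabet) - 1:
        # Assumption: 'z' is treated as having no successor rather than wrapping.
        raise ValueError("'z' has no alphabetical successor")
    return alphabet[idx + 1]


def replace_last_with_successor(series: list[str]) -> list[str]:
    """Apply the rule: keep every letter except the last, which is advanced by one."""
    return series[:-1] + [successor(series[-1])]


# The source transformation from the text: a b c d -> a b c e
print(replace_last_with_successor(["a", "b", "c", "d"]))

# Applying the same rule to a different series (assumed probe, not from the source)
print(replace_last_with_successor(["i", "j", "k", "l"]))
```

In the actual test the participant is not given the rule; they have to infer it from the first pair of series and then produce the second transformation themselves.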
On all of these tests, GPT-3 consistently outperformed undergrads, although the margins varied depending on the specific test involved. The researchers also found that the software could develop rules based on a series of numbers, and then apply them to a different domain, such as descriptions of temperatures like “warm” and “chilly.” They conclude that “these results suggest that GPT-3 has developed an abstract notion of successorship that can be flexibly generalized between different domains.”
Source: GPT-3 aces tests of reasoning by analogy | Ars Technica