Better Thinking or a Bigger Model? Thinking–Answering Shuffles with Qwen3 on GPQA

Authors

  • Edvard A. Khalafyan, Moscow Institute of Physics and Technology

DOI:

https://doi.org/10.51408/1963-0136

Keywords:

Chain-of-thought (CoT), cross-model reasoning transfer, Qwen3, GPQA benchmark, LLM token entropy

Abstract

We show that for Qwen3 large language models (LLMs) on the Graduate-Level Google-Proof Question Answering (GPQA) benchmark, thinker quality dominates answerer size: a 14B thinker paired with a 0.6B answerer reaches 54.24% accuracy, close to the 14B→14B diagonal (59.15%), whereas a 0.6B thinker reduces a 14B answerer to 20.54%. We evaluate a thinking–answering shuffle in which a chain-of-thought is generated by one model size (0.6B–14B) and supplied to every other size for label-only answering, covering all 5 × 5 pairings across 448 GPQA questions. Accuracy rises monotonically with thinker size, while answerer size has only a modest effect. Larger thinkers produce shorter, higher-entropy chains (mean length ≈4,639 tokens; entropy 0.416) than smaller thinkers (≈14,566 tokens; 0.404), and these properties correlate with better cross-model transfer. Implication: cache thoughts with a strong LLM and execute answers with a small LLM to approach best-diagonal accuracy at lower cost.
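To make the protocol concrete, the sketch below implements one cell of the 5 × 5 grid: a thinker model writes the chain-of-thought, an answerer model sees the question plus that chain and must reply with only the option letter, and a mean next-token entropy is recorded for the thinker's generation. This is a minimal sketch assuming the Hugging Face transformers library and the public Qwen/Qwen3-14B and Qwen/Qwen3-0.6B checkpoints; the prompt wording, greedy decoding, and base-2 entropy estimate are illustrative choices, not the authors' exact setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

THINKER_ID = "Qwen/Qwen3-14B"    # generates the chain-of-thought
ANSWERER_ID = "Qwen/Qwen3-0.6B"  # consumes it and emits only the label

def load(model_id):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
    return tok, model

def generate(tok, model, user_message, max_new_tokens):
    # Chat-template generation that also returns output length and mean next-token entropy (in bits).
    prompt = tok.apply_chat_template([{"role": "user", "content": user_message}],
                                     tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False,
                         return_dict_in_generate=True, output_scores=True)
    new_tokens = out.sequences[0, inputs.input_ids.shape[1]:]
    entropies = []
    for step_logits in out.scores:                      # one logit vector per generated token
        p = torch.softmax(step_logits[0].float(), dim=-1)
        entropies.append(-(p * torch.log2(p.clamp_min(1e-12))).sum().item())
    text = tok.decode(new_tokens, skip_special_tokens=True)
    return text, len(new_tokens), sum(entropies) / max(len(entropies), 1)

question = "..."  # one GPQA question with its four answer options (A-D)

thinker_tok, thinker = load(THINKER_ID)
cot, cot_len, cot_entropy = generate(
    thinker_tok, thinker,
    question + "\nThink step by step before choosing an option.",
    max_new_tokens=16384)

answerer_tok, answerer = load(ANSWERER_ID)
label, _, _ = generate(
    answerer_tok, answerer,
    question + "\n\nReasoning provided by another model:\n" + cot
    + "\n\nReply with only the letter of the correct option.",
    max_new_tokens=8)

print(f"CoT length = {cot_len} tokens, mean entropy = {cot_entropy:.3f}, answer = {label.strip()}")

Sweeping THINKER_ID and ANSWERER_ID over the five Qwen3 sizes and repeating for all 448 GPQA questions would reproduce the full shuffle grid described in the abstract.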

Published

2025-12-01

How to Cite

Khalafyan, E. A. (2025). Better Thinking or a Bigger Model? Thinking–Answering Shuffles with Qwen3 on GPQA. Mathematical Problems of Computer Science, 64, 17–28. https://doi.org/10.51408/1963-0136