from Diary 2023-12-04 Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine https://arxiv.org/abs/2311.16452

y_matsuwitter I see a lot of discourse saying it's good to build small, specialized LLMs, but research shows that GPT-4 outperforms specialized models, perhaps because of its reasoning ability. It is not necessarily correct to say a model is strong because it is specialized for Japanese or for medicine. Of course, a specialized model lowers computational cost and is not subject to the operator's rules and biases, so it may be better in those respects.

Article Summary General-purpose foundation models such as GPT-4 have shown remarkable ability across a variety of tasks without domain-specific training. However, the prevailing assumption is that domain-specific competence requires training models on specialized knowledge. In this study, we conducted an exploratory study of prompt engineering for GPT-4 on medical benchmarks and showed that it can reach top-level specialist performance without domain-specific fine-tuning.

https://arxiv.org/abs/2311.16452
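The prompt-engineering strategy the paper describes (Medprompt) combines dynamic few-shot example selection, self-generated chain-of-thought exemplars, and choice-shuffling ensembling. Below is a minimal sketch of how those pieces fit together; `embed`, `call_gpt4`, and the exemplar store are hypothetical placeholders I introduce for illustration, not the authors' code.

```python
import random
from collections import Counter

# Hypothetical placeholders -- plug in a real embedding model and GPT-4 client here.
def embed(text: str) -> list[float]:
    raise NotImplementedError("embedding model goes here")

def call_gpt4(prompt: str) -> str:
    raise NotImplementedError("LLM completion call goes here")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def knn_few_shot(question, train_set, k=5):
    """Dynamic few-shot selection: pick the k training questions whose
    embeddings are closest to the test question."""
    q = embed(question)
    return sorted(train_set,
                  key=lambda ex: cosine(q, embed(ex["question"])),
                  reverse=True)[:k]

def build_prompt(question, options, exemplars):
    """Each exemplar carries a previously model-generated chain-of-thought."""
    parts = []
    for ex in exemplars:
        parts.append(f"Q: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}\n")
    parts.append(f"Q: {question}\nOptions: {', '.join(options)}\nReasoning:")
    return "\n".join(parts)

def parse_choice(reply, options):
    """Naive answer extraction: return the first option mentioned in the reply."""
    for opt in options:
        if opt in reply:
            return opt
    return None

def answer_with_ensemble(question, options, train_set, n_votes=5):
    """Choice-shuffling ensemble: re-ask with the options in a different
    order each time and take a majority vote over the parsed answers."""
    exemplars = knn_few_shot(question, train_set)
    votes = []
    for _ in range(n_votes):
        shuffled = random.sample(options, len(options))
        reply = call_gpt4(build_prompt(question, shuffled, exemplars))
        votes.append(parse_choice(reply, options))
    return Counter(v for v in votes if v is not None).most_common(1)[0][0]
```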

icoxfog417 Since there is no guarantee that GPT-4 has not seen the medical papers used to train the specialized models (the GPT-4 Technical Report does not disclose its training data), I don't think we know whether this conclusion applies to generalist models in general. Larger models generally need more training data, so I think there is a non-zero possibility that GPT-4 has already been trained on that material.

y_matsuwitter There are multiple points of contention.
・Is GPT-4 more powerful than specialized models?
・Can a generalist model achieve high accuracy through its reasoning ability?
・Does a generalist model surpass specialized models by combining reasoning ability with the domain knowledge it picked up during training?
This paper addresses the first; the second is as you say; and if the third holds, it seems possible that large-scale foundation models will each come to acquire high capability in every domain in the future.

odashi_t It isn't a fair comparison to begin with, since none of the prompt-engineering techniques used to make GPT-4 win were applied to Med-PaLM. I agree. This is a problem that comes before any question of training-data contamination, and it is a bad paper.


This page is auto-translated from [/nishio/Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine](https://scrapbox.io/nishio/Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine) using DeepL. If you see something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.