In the dynamic world of artificial intelligence (AI), OpenAI’s ChatGPT has emerged as a key player. However, recent research indicates a discernible shift in its performance over time. This article delves into the findings of this study, providing a comprehensive understanding of the implications for businesses leveraging AI technologies.
Researchers from Stanford University and UC Berkeley have been meticulously tracking the performance of ChatGPT, specifically GPT-3.5 and GPT-4. Their findings reveal significant fluctuations in the quality of these models over just a few months. Because OpenAI often rolls out model updates unannounced, these shifts have sparked discussions among users on platforms like Twitter and Facebook, with many noting a perceived downgrade in quality.
The researchers embarked on this study to understand how these performance changes affect the integration of ChatGPT into larger workflows. They highlighted the importance of benchmarking, since it reveals whether an update improves some capabilities of a language model while degrading others. The study focused on four key tasks: solving math problems, answering sensitive questions, generating code, and visual reasoning. A minimal benchmarking harness along these lines is sketched below.
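As an illustration, running a fixed task suite against pinned model snapshots makes this kind of drift measurable. The sketch below assumes the official `openai` Python SDK (v1+); the prompts, pass/fail checkers, and snapshot names are hypothetical stand-ins for the paper's far larger evaluation sets, not its actual code.

```python
# Minimal drift-benchmark sketch. Tasks and checkers are illustrative
# examples, not the paper's actual evaluation suite.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each task pairs a prompt with a checker that scores a single response.
TASKS = [
    ("Is 17077 a prime number? Answer [Yes] or [No].",
     lambda out: "[yes]" in out.lower()),
    ("Write a Python function that reverses a string.",
     lambda out: "def " in out),
]

def benchmark(model: str) -> float:
    """Return the fraction of tasks a model snapshot passes."""
    passed = 0
    for prompt, check in TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep runs comparable across dates
        )
        if check(resp.choices[0].message.content):
            passed += 1
    return passed / len(TASKS)

# Pin dated snapshots so score changes reflect the model, not your code.
for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    print(snapshot, benchmark(snapshot))
```

Pinning dated snapshots is the key design choice: if scores move between two runs of the same suite, the change came from the model, not the harness.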
In the realm of math problem-solving, the researchers noted a steep decline in GPT-4's accuracy between March and June 2023. They also observed changes in how GPT-3.5 and GPT-4 handled sensitive questions over the same period: GPT-4 answered fewer of them, while GPT-3.5 answered slightly more. In both cases, the models' responses became terser, offering less explanation.
The code generation performance of the models also changed dramatically. The share of directly executable code generated by GPT-4 dropped from 52.0% in March to 10.0% in June, and GPT-3.5 fell from 22.0% to 2.0%. The researchers attributed this decline to the models adding non-code text to their output, notably markdown code fences (triple backticks), which some users argue make the code easier to read but which cause it to fail direct-execution checks, as the sketch below illustrates.
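To make that failure mode concrete, here is a small sketch of what "directly executable" means and how a tolerant harness can recover the code. The fence-stripping heuristic is our own illustration, not the paper's evaluation code.

```python
# Sketch: markdown-fenced output fails a direct-execution check, but a
# simple fence-stripping step recovers the code. Illustrative only.
import re

raw_output = """```python
def add(a, b):
    return a + b
```"""

# The raw response is not valid Python: the leading ``` fence breaks parsing.
try:
    compile(raw_output, "<generated>", "exec")
except SyntaxError:
    print("raw output is not directly executable")

def strip_fences(text: str) -> str:
    """Remove a surrounding markdown code fence, if one is present."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text

compile(strip_fences(raw_output), "<generated>", "exec")  # parses cleanly
print("stripped output compiles")
```

In other words, the underlying code may still be correct; a strict "run the output verbatim" benchmark simply counts the formatting change as a failure.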
In terms of visual reasoning, both models improved slightly, by about 2%, though absolute scores remained low: 27.4% for GPT-4 and 12.2% for GPT-3.5. Despite the improvement, the researchers noted that the models did not uniformly produce better generations over time, underscoring the need for fine-grained drift monitoring, especially in critical applications.
These findings underscore the need for businesses relying on OpenAI's technologies to regularly assess the quality and stability of the models they build on. As AI continues to evolve, staying abreast of such changes helps businesses optimize their use of these technologies and maintain reliable workflows. The researchers concluded that, given the instability of the output, companies that depend on OpenAI should institute regular quality assessments to catch unexpected changes; a minimal monitoring sketch follows.
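One lightweight way to act on that advice is to score each model on a fixed suite on a schedule and alert on regressions. The sketch below reuses the hypothetical `benchmark()` function from the earlier example; the file name and threshold are arbitrary placeholders.

```python
# Drift-monitor sketch: compare today's benchmark score against a stored
# baseline and flag regressions. Threshold and file name are placeholders.
import json
import pathlib

HISTORY = pathlib.Path("baseline_scores.json")
THRESHOLD = 0.05  # flag drops larger than 5 percentage points

def check_drift(model: str, score: float) -> None:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
    previous = history.get(model)
    if previous is not None and previous - score > THRESHOLD:
        print(f"ALERT: {model} dropped from {previous:.2f} to {score:.2f}")
    history[model] = score
    HISTORY.write_text(json.dumps(history, indent=2))

# Run on a schedule (e.g. a daily CI job):
# check_drift("gpt-4", benchmark("gpt-4"))
```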
Contact us today: https://synergy11.marketing/contact/
Original research: https://arxiv.org/pdf/2307.09009.pdf