for complex reasoning problems, which Sam had long called "strawberry". The model can "think" longer before answering a question, and the longer it thinks, the higher the quality of its reasoning.

Principle: chain-of-thought learning internalized through reinforcement learning. By decomposing the problem within the chain of thought, the model can continuously verify and correct itself (a toy sketch of this loop follows below).

Performance: o1 improves significantly on tasks such as doctoral-level problems in programming, mathematics, physics, and chemistry, but on tasks such as writing it is still not as good as GPT-4o.

Composition: the o1 series includes o1, o1-preview, and o1-mini.
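To make the "think longer, verify, correct" principle concrete, here is a minimal toy sketch, not OpenAI's actual mechanism: a weak proposer guesses factor pairs of a number, a cheap verifier checks each guess, and a larger thinking budget raises the chance of returning a verified answer. The factoring task and the `propose`, `verify`, and `think` helpers are invented purely for illustration.

```python
# Toy illustration (not o1's implementation) of decomposed reasoning with
# verification and correction: "thinking" is a budgeted propose-verify-retry
# loop, and more budget means a higher chance of a verified answer.
import random

def propose(n: int) -> tuple[int, int]:
    """A weak 'policy': blindly guess a factor pair of n."""
    a = random.randint(2, n - 1)
    return a, n // a

def verify(n: int, candidate: tuple[int, int]) -> bool:
    """A cheap verifier: check a candidate step before accepting it."""
    a, b = candidate
    return a * b == n and a > 1 and b > 1

def think(n: int, budget: int) -> tuple[int, int] | None:
    """Propose, verify, and correct (retry) until the budget runs out."""
    for _ in range(budget):
        candidate = propose(n)
        if verify(n, candidate):
            return candidate          # accepted only after verification
    return None                       # budget exhausted, no verified answer

if __name__ == "__main__":
    n = 101 * 103                     # a small composite number to "reason" about
    for budget in (10, 100, 1_000, 10_000):
        hits = sum(think(n, budget) is not None for _ in range(20))
        print(f"budget={budget:>6}: solved {hits}/20 trials")
```

Running it shows the solve rate climbing with the thinking budget, which is the informal point behind "the longer it thinks, the higher the quality."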
o1 itself has not yet been publicly released, but o1-preview is already available to paying users and API users. o1-mini is faster and more cost-effective.

Impact: a new scaling law has emerged, in which quality also scales with inference-time thinking rather than only with training compute.

Ilya summarizes reinforcement learning in one sentence: let the AI try a new task along random paths; if the result exceeds expectations, update the weights of the neural network so that the AI remembers to rely on that successful path more, then start the next attempt (sketched in code below).

Self-play: the essence is to use AI's effectively unlimited computing power to make up for the shortage of data.
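Ilya's one-sentence recipe maps directly onto a simple policy-gradient-style loop. The sketch below is a toy bandit of this author's own making, not any production system; `PATHS`, `TRUE_REWARD`, and the learning rates are made-up placeholders. It samples a random path, compares its reward with a running expectation, and nudges the weights toward paths that exceeded it.

```python
# A minimal sketch of the one-sentence RL recipe above: try a random path,
# and when its reward exceeds expectations, nudge the policy weights so that
# path is chosen more often on the next attempt.
import math
import random

PATHS = ["path_a", "path_b", "path_c"]                        # hypothetical ways to attempt a task
TRUE_REWARD = {"path_a": 0.2, "path_b": 0.5, "path_c": 0.9}   # unknown to the learner

logits = {p: 0.0 for p in PATHS}    # the "weights of the neural network"
baseline = 0.0                      # running estimate of expected reward
LR, BASELINE_LR = 0.5, 0.1

def sample_path() -> str:
    """Softmax sampling: early on this is close to a uniform random try."""
    weights = [math.exp(logits[p]) for p in PATHS]
    return random.choices(PATHS, weights=weights)[0]

for step in range(2000):
    path = sample_path()                                   # try the task via a (mostly) random path
    reward = TRUE_REWARD[path] + random.gauss(0, 0.1)      # noisy outcome of the attempt
    advantage = reward - baseline                          # did it exceed expectations?
    logits[path] += LR * advantage                         # remember successful paths more
    baseline += BASELINE_LR * (reward - baseline)          # update the expectation itself

print({p: round(l, 2) for p, l in logits.items()})         # path_c should end up dominant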
Critic model: by decomposing the reasoning process and using an additional, stronger and more specialized critic model, supervision of the reasoning process can be extended to more complex problems.

Technical assumptions: the model can iteratively bootstrap its ability to generate reasonable reasoning and integrate that reasoning into training, and, where directly producing an answer is unacceptable, a more economical search can be used; through this process the model learns to reason, similar to an extended version of STaR.

Reverse engineering: the system appears to consist of a synthetic data generator, a reward function, a policy optimizer, and other modules.
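Assuming that reverse-engineered picture, the modules could be wired together roughly as below. This is a speculative skeleton on a toy addition task, not a confirmed o1 design: `synthetic_data_generator`, `critic_reward`, and `policy_optimizer` are hypothetical names illustrating a STaR-like bootstrap in which a step-level critic filters reasoning traces for the next training round.

```python
# Speculative skeleton: generator proposes step-by-step traces, a critic /
# reward function scores each step (process supervision), and the "optimizer"
# keeps only well-scored traces as data for the next bootstrap round.
import random
from dataclasses import dataclass

@dataclass
class Trace:
    problem: tuple[int, int]      # toy problem: add two integers
    steps: list[int]              # intermediate partial results ("reasoning steps")
    answer: int

def synthetic_data_generator(n_problems: int) -> list[Trace]:
    """Generate noisy reasoning traces; some steps are deliberately wrong."""
    traces = []
    for _ in range(n_problems):
        a, b = random.randint(1, 50), random.randint(1, 50)
        noise = random.choice([0, 0, 0, random.randint(1, 5)])   # occasional error
        steps = [a, a + b + noise]
        traces.append(Trace((a, b), steps, steps[-1]))
    return traces

def critic_reward(trace: Trace) -> float:
    """Process supervision: score each step, not just the final answer."""
    a, b = trace.problem
    expected = [a, a + b]
    correct_steps = sum(s == e for s, e in zip(trace.steps, expected))
    return correct_steps / len(expected)

def policy_optimizer(traces: list[Trace], threshold: float = 1.0) -> list[Trace]:
    """Keep only fully verified traces as the next round's training data."""
    return [t for t in traces if critic_reward(t) >= threshold]

if __name__ == "__main__":
    generated = synthetic_data_generator(1000)
    kept = policy_optimizer(generated)
    print(f"kept {len(kept)}/{len(generated)} traces for the next bootstrap round")
```

In a real system the kept traces would be fed back into further training and the loop repeated, which is what the iterative-bootstrap assumption above refers to.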