The Costs of Building Large AI Models and DeepSeek's Innovations
Leading artificial intelligence systems such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude have generated significant public interest by delivering fluent responses in multiple languages. These companies have captured attention not only for their groundbreaking AI capabilities but also for the enormous sums they spend developing their models.
A Chinese startup named DeepSeek has challenged the prevailing belief regarding the costs associated with creating top-tier AI technologies. Their impressive advancements have sparked skepticism about the vast expenditures made by major AI firms.
As someone who studies machine learning, I recognize that DeepSeek’s impressive entry into the field does not stem from a singular technological marvel, but rather from a classic approach: driving efficiencies. This is particularly important in a domain renowned for its substantial demands on computing resources.
Understanding the Costs
The process of developing powerful AI systems begins with constructing a large language model, which predicts the next word based on prior words. For instance, given the phrase “The theory of relativity was discovered by Albert,” a well-trained model might predict “Einstein” as the next word. These large language models undergo a phase called pretraining to improve their prediction accuracy.
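To make that prediction step concrete, here is a minimal sketch using the open-source Hugging Face transformers library and the small GPT-2 checkpoint. The model choice is just an illustrative assumption; production systems are vastly larger, but they predict the next token the same way.

```python
# Minimal next-word prediction sketch. Assumes the Hugging Face `transformers`
# library and the small `gpt2` checkpoint (illustrative only; production models
# are far larger, but the mechanics are the same).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The theory of relativity was discovered by Albert"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # a score for every vocabulary token at each position

next_token_scores = logits[0, -1]         # scores for the token that would follow the prompt
top5 = torch.topk(next_token_scores, k=5)
for score, token_id in zip(top5.values, top5.indices):
    print(tokenizer.decode(int(token_id)), float(score))   # " Einstein" should rank near the top
```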
Pretraining is a resource-intensive process, requiring vast amounts of data and computational power. Companies typically gather the data by crawling the web and scanning books. Most of the computation runs on graphics processing units, or GPUs, because computer graphics and the neural networks behind large language models both rest on the same branch of mathematics: linear algebra. Large language models contain hundreds of billions of parameters, the numerical values that are adjusted during pretraining.
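The connection to graphics is easiest to see in code: the core operation inside a language-model layer is a matrix multiplication, the same linear-algebra workload GPUs were built to accelerate. The sizes below are illustrative assumptions, far smaller than a real model.

```python
# The heart of a language-model layer is a matrix multiplication, the same
# linear-algebra workload GPUs accelerate for graphics. Sizes are illustrative.
import torch

token_embeddings = torch.randn(32, 4096)    # 32 tokens, each represented by 4,096 numbers
weight_matrix = torch.randn(4096, 4096)     # one layer's worth of learned parameters

device = "cuda" if torch.cuda.is_available() else "cpu"   # use a GPU when one is present
output = token_embeddings.to(device) @ weight_matrix.to(device)
print(output.shape, device)
```

Stacking hundreds of such layers, each with its own weight matrices, is how the parameter count climbs into the hundreds of billions.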
Because of these expansive resource needs, large language models also consume a significant amount of energy. Yet pretraining alone is not enough to produce a consumer-ready product like ChatGPT: a pretrained model typically struggles to follow human instructions and can reproduce harmful or inappropriate content it absorbed from the web.
To resolve this, additional training stages are necessary. One stage is instruction tuning, where models are shown examples of how they should respond to human directives. Following this, reinforcement learning from human feedback takes place, where human reviewers evaluate multiple model outputs to select the preferred ones.
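Reinforcement learning from human feedback usually begins by training a reward model on those human comparisons. One standard formulation, sketched below, simply pushes the score of the preferred response above the rejected one; individual labs vary in the details, and this is not a description of any specific company's pipeline.

```python
# A common pairwise preference loss for training a reward model from human
# comparisons (one standard formulation; labs differ in the details).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Encourage the human-preferred response to score higher than the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to two pairs of candidate responses.
print(preference_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.9])))
```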
When evaluating the expenses involved in AI model development, a multitude of costs comes into play: employing top-tier AI specialists, establishing a data center filled with GPUs, accumulating data for pretraining, and executing pretraining itself. Moreover, further expenses arise from data collection during instruction tuning and reinforcement learning phases.
In total, the costs associated with building a cutting-edge AI model can reach up to $100 million. A large portion of this budget is consumed by GPU training.
Once the model is finalized, expenses continue to accrue every time the model answers a user query, using what is called test-time or inference-time compute. This stage also relies on GPUs. In December 2024, for instance, OpenAI reported that their latest model, o1, performed better on logical reasoning tasks as test-time compute increased.
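OpenAI has not disclosed exactly how o1 spends its extra compute, but a simple way to see why more inference-time computation can help is to sample several candidate answers and keep the most common one. The sketch below assumes a hypothetical generate_answer function standing in for a call to a language model.

```python
# One simple way to spend more compute at inference time: sample several answers
# and take a majority vote. `generate_answer` is a hypothetical stand-in for a
# language-model call; this is not a description of o1's internal mechanism.
from collections import Counter
from typing import Callable

def answer_with_more_compute(question: str,
                             generate_answer: Callable[[str], str],
                             num_samples: int = 8) -> str:
    candidates = [generate_answer(question) for _ in range(num_samples)]
    return Counter(candidates).most_common(1)[0][0]   # most frequent answer wins
```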
Reducing Resource Consumption
Until now, it appeared that ever-greater investment in computing resources during both training and inference was the hallmark of building leading AI models. The arrival of DeepSeek has disrupted this narrative and sent ripples through the technology investment community.
Their V-series models, culminating in the V3 model, have leveraged a series of optimizations that significantly reduce the costs of training cutting-edge AI models. According to their technical documentation, they spent less than $6 million to train V3. Although this figure excludes expenses related to hiring, research, experimentation, and data collection, it remains impressive given that it competes with models created with significantly higher investments.
The drop in costs comes from a collection of smart engineering decisions rather than a single breakthrough. These include representing model weights with fewer bits, rethinking the neural network architecture, and reducing the overhead of shuttling data among GPUs.
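The "fewer bits" idea is the easiest to illustrate: mapping 32-bit weights to 8-bit values plus a scale factor cuts memory and data movement by roughly a factor of four. The integer scheme below is only a sketch of the general idea, not a description of DeepSeek's exact low-precision recipe.

```python
# A sketch of "fewer bits per weight": map 32-bit floats to 8-bit integers plus a
# scale factor. This illustrates the memory/bandwidth savings in general terms;
# it is not DeepSeek's exact low-precision scheme.
import torch

weights = torch.randn(4096, 4096)                  # 32 bits per weight
scale = weights.abs().max() / 127.0
weights_int8 = torch.clamp((weights / scale).round(), -127, 127).to(torch.int8)   # 8 bits per weight

reconstructed = weights_int8.float() * scale       # approximate values used at compute time
print("storage shrinks 4x; max rounding error:", (weights - reconstructed).abs().max().item())
```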
Notably, due to U.S. export restrictions, DeepSeek's team did not have access to high-performance GPUs such as the Nvidia H100. Instead, they employed Nvidia H800 GPUs, which are purposely limited in performance to comply with these regulations. This limitation seemingly spurred even greater creativity among DeepSeek’s engineers.
DeepSeek has also innovated cost-effective strategies for inference, thereby minimizing the expenses of deploying their model. Additionally, they introduced a model named R1, which matches OpenAI’s o1 model in reasoning capabilities.
DeepSeek has made the weights of both V3 and R1 publicly accessible, so anyone can download them and build on or customize the models. They have also released the models under the permissive MIT license, which allows personal, academic, or commercial use with minimal restrictions.
Resetting Expectations
DeepSeek has fundamentally transformed the landscape surrounding large AI models. Their open-weight models, trained with impressive efficiency, are now comparable to more costly, proprietary models that require paid subscriptions.
As the research community and investors assess this new reality, they will need time to recalibrate their perspectives on the costs and values of AI technology.