The artificial intelligence (AI) industry is confronting a critical shortage of high-quality training data, a constraint that may already be shaping the next generation of AI systems, Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, has said.
“We've already run out of data,” Raphael said, noting that this deficit is forcing companies to rely increasingly on synthetic data: machine-generated text, images, and code. Raphael made the assertion on the bank's "Exchanges" podcast, confirming a growing industry suspicion that the readily available data on the open web has been exhausted.
The risk of low-quality output
While synthetic data offers a limitless supply, Raphael cautioned that this reliance carries significant risk, potentially overwhelming models with low-quality output, or “AI slop.” He pointed to China's DeepSeek as a case study, hypothesising that its development costs may reflect training conducted on the output of existing models rather than entirely new, human-created data.
“I think the real interesting thing is going to be how previous models then shape what the next iteration of the world is going to look like,” he said.
His comments align with similar warnings, including one from OpenAI co-founder Ilya Sutskever earlier this year, who suggested that the era of rapid AI development could “unquestionably end” once all useful online data is consumed.
Despite the global data crunch, Raphael does not believe the lack of fresh, open-internet data will be a “massive constraint” for corporations. He argued that enterprises are sitting on vast, untapped reserves of proprietary data, such as trading flows and client interactions.
"From an enterprise perspective, I think there's still a lot of juice I'd say to be squeezed in that," he remarked.