The hottest new idea in AI? Chatbots that look like they think.


Chinese start-up DeepSeek recently displaced ChatGPT as the top-ranking artificial intelligence app, in part by dazzling the public with a free version of the hottest idea in AI - a chatbot that “thinks” before answering a user’s question.

The app’s “DeepThink” mode responds to every query with the text, “Thinking…,” followed by a string of updates that read like the chatbot talking to itself as it figures out its final answer. The monologue unspools with folksy flourishes like, “Wait,” “Hmm,” or “Aha.”


Chatbots that yap to themselves before answering are now spreading as American rivals race to outdo DeepSeek’s viral moment. This style of AI assistant can be more accurate on some tasks, but also mimics humans in ways that can hide its limitations.


The self-talk technique, sometimes dubbed “reasoning,” became trendy in top artificial intelligence labs late last year, after OpenAI and Google released AI tools that scored higher on math and coding tests by monologuing through problems, step by step.

At first, this new type of assistant wasn’t available to the masses: OpenAI released a system called o1 in December that cost $200 a month and kept its inner workings secret. When DeepSeek launched its “thinking” app for free and also shared the R1 reasoning model behind it, a developer frenzy ensued.

“People are excited to throw this new approach at every possible thing,” said Nathan Lambert, an AI researcher for the nonprofit Allen Institute for AI.

In the two weeks since DeepSeek’s rise tanked U.S. tech stocks, OpenAI made some of its reasoning technology free inside ChatGPT and launched a new tool built on it called Deep Research that searches the web to compile reports.

Google on Wednesday made its competing product, Gemini 2.0 Flash Thinking Experimental, available to consumers for the first time, free, via its AI app Gemini.


The same day, Amazon’s cloud computing division said it was betting on “automated reasoning” to build trust with users. The next day, OpenAI’s ChatGPT started showing users polished translations of its raw “chains of thought,” in a similar way to DeepSeek. (Amazon founder Jeff Bezos owns The Washington Post.)

U.S. companies will soon spend “hundreds of millions to billions” of dollars trying to supercharge this approach to AI reasoning, Dario Amodei, chief executive of Anthropic, the maker of the chatbot Claude, predicted in an essay on the implications of DeepSeek’s debut on U.S.-China competition.

The flood of investment and activity has boosted the tech industry’s hopes of building software as capable and adaptable as humans, from a tactic first proven on math and coding problems. “We are now confident we know how to build AGI,” or artificial general intelligence, OpenAI chief executive Sam Altman wrote in a blog post last month.

Google’s vice president for its Gemini app, Sissie Hsiao, said in a statement that reasoning models represent a paradigm shift. “They demystify how generative AI works - making it more understandable and trustworthy by showing their ‘thoughts,’” while also helping with more complex tasks, she said.

“As we introduce reasoning models to more people, we want to build a deeper understanding of their capabilities and how they work” to create better products, OpenAI spokesperson Niko Felix said in a statement. “Users have told us that understanding how the model reasons through a response not only supports more informed decision-making but also helps build trust in its answers.”


- - -

Hitting a wall

Silicon Valley’s obsession with reasoning began with the hunt for the next leap forward in language models, the technology that powers ChatGPT.

The attention OpenAI previously won helped rally the tech sector around a simple paradigm for smarter machines: Pump more data and computing power into larger and larger AI models to make them more capable.

But in recent years that dependable formula started to plateau. Language models were no longer improving as fast on industry benchmarks for math, science and logic. And most of the readily available data on the internet had already been scraped.

In response, labs at companies like Google, OpenAI and Anthropic started to focus on squeezing better performance from AI models they had already created.

One promising trick involved directing language models to break down a problem into steps called “chains of thought” instead of answering in a single try - part of the reasoning technique used by DeepSeek and others. This forces an AI model to spend more time and processing power answering a query.
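As a rough sketch of the idea (not any lab’s actual code), the difference comes down to how the model is prompted; the ask_model function below is a hypothetical stand-in for whatever language-model API a developer happens to use.

```python
# Hypothetical sketch of chain-of-thought prompting. ask_model() is a
# placeholder for a real language-model API call, not any specific product.

def ask_model(prompt: str) -> str:
    """Placeholder: in practice, send the prompt to a language-model API."""
    return "(model reply would appear here)"

question = "A train leaves at 2:40 p.m. and arrives at 5:05 p.m. How long is the trip?"

# Direct prompting: the model answers in a single try.
direct_answer = ask_model(question)

# Chain-of-thought prompting: the model is told to write out its steps first,
# which makes it spend more time and tokens (i.e., computing power) per query.
cot_answer = ask_model(
    "Work through the problem step by step, then state the final answer.\n\n" + question
)
```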


The strategy paid off - especially when paired with a technique called reinforcement learning, which has enabled computers to master games such as Go. It involves steering how AI systems behave by rewarding the right response over numerous instances of trial and error.
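A bare-bones sketch of that loop, with hypothetical generate() and check_answer() placeholders: the model tries a problem many times, and only the attempts whose final answer checks out earn a reward, which a real training system would then use to adjust the model.

```python
# Toy sketch of reinforcement learning with a verifiable reward. generate()
# and check_answer() are hypothetical placeholders, not a real training system.

def generate(problem: str) -> str:
    """Placeholder: sample one attempted solution from the model."""
    return "step 1 ... step 2 ... final answer: 42"

def check_answer(problem: str, attempt: str) -> bool:
    """Placeholder: verify the final answer, e.g. with a math checker or unit tests."""
    return attempt.strip().endswith("42")

def collect_rewards(problem: str, tries: int = 16) -> list[tuple[str, float]]:
    """Run many trials; correct attempts get reward 1.0, the rest get 0.0.
    A real system would use these rewards to nudge the model's weights."""
    results = []
    for _ in range(tries):
        attempt = generate(problem)
        results.append((attempt, 1.0 if check_answer(problem, attempt) else 0.0))
    return results
```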

This framework lends itself to domains like math, logic and coding, where computers can verify if the final answer is correct. Still, companies lacked data that showed how humans reasoned their way through problems.

At first, they tried hiring human contractors to write down the steps they took when answering questions, a method that proved slow and expensive.

But as AI technology improved, it could reliably generate copious examples that mimicked human-written “chains of thought.” Gradually, researchers were able to remove people from the loop.
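A minimal sketch of that people-out-of-the-loop step, again with placeholder functions: one model drafts step-by-step solutions, a checker keeps only those whose final answers verify, and the survivors become training examples.

```python
# Hypothetical sketch of generating synthetic "chain of thought" training data.
# draft_solution() and answer_is_correct() stand in for a real model and checker.

def draft_solution(question: str) -> str:
    """Placeholder: ask a model to write out its reasoning steps and an answer."""
    return "reasoning steps ... final answer: ..."

def answer_is_correct(question: str, solution: str) -> bool:
    """Placeholder: compare the final answer to a known result or run tests."""
    return False

def build_dataset(questions: list[str], samples_per_question: int = 8) -> list[dict]:
    """Sample many candidate solutions per question and keep only verified ones."""
    dataset = []
    for q in questions:
        for _ in range(samples_per_question):
            sol = draft_solution(q)
            if answer_is_correct(q, sol):
                dataset.append({"question": q, "solution": sol})
    return dataset
```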

In a technical report published in January, DeepSeek claimed that one of its earlier reasoning models, called R1-Zero, began to show long “chains of thought” just from researchers increasing the number of rounds of trial and error it performed, without any specially created data.

“You’re effectively setting up a sandbox where the model changes its behavior on its own,” Lambert said.

Some observers argue the excitement over this new direction in AI has overshadowed discussion of its limits.

It’s still an open question whether “chains of thought” reflect how an AI system actually processes information, said Subbarao Kambhampati, a computer science professor at Arizona State University.

His recent research suggests that AI models’ reasoning skills can fall apart if challenged on tests for real-world applications like planning and scheduling.

What’s more, he said, the labs building these models tend to focus on the accuracy of the final answers, not whether the reasoning is sound - a quality that’s difficult to measure.

For example, DeepSeek’s technical paper for R1 noted that an earlier version of its model provided more accurate final answers when its chains of thought mixed text in both Chinese and English. Yet its researchers opted for a model that yammered to itself in English because it was more pleasing to users.

Kambhampati argues that companies should allow chatbots to “mumble to themselves” in whatever way produces the most accurate answers, rather than try to make their “chains of thought” more pleasing to humans. “It’s better off getting rid of that anthropomorphization. It doesn’t matter,” he said.

The AI industry appears to be headed in a different direction. Reasoning models widely released since Silicon Valley’s DeepSeek shock include design features that, like those in the Chinese app, encourage consumers to believe the software’s “thoughts” show it reasoning like a human.

On the ChatGPT homepage, a “Reason” mode button features prominently in the chat box. In a post on X, Altman called “chain of thought” a feature where the AI “shows its thinking.”

“To an everyday user it feels like gaining insight into how an algorithm works,” said Sara Hooker, head of the research lab Cohere for AI. But it’s a way to boost performance, not peek under the hood, she said.

Ethan Mollick, a professor who studies AI at the Wharton School of the University of Pennsylvania, said seeing a chatbot’s supposed inner monologue can trigger empathy.

Compared with ChatGPT’s flatter tone, responses from DeepSeek’s R1 seemed “neurotically friendly and desperate to please you,” he said.

“We’re kind of seeing this weird world where the hardcore computer science is aligning with marketing - it’s not clear if even the creators know which is which.”
