Ai2's Molmo shows open source can meet, and beat, closed multimodal models

Devin Coldewey

Updated 25 September 2024 at 2:07 pm·6-min read

The common wisdom is that companies like Google, OpenAI, and Anthropic, with bottomless cash reserves and hundreds of top-tier researchers, are the only ones that can make a state-of-the-art foundation model. But as one among them famously noted, they "have no moat" — and Ai2 showed that today with the release of Molmo, a multimodal AI model that matches their best while also being small, free, and truly open source.

To be clear, Molmo (multimodal open language model) is a visual understanding engine, not a full-service chatbot like ChatGPT. It doesn't have an API, it's not ready for enterprise integration, and it doesn't search the web for you or for its own purposes. You can think of it as the part of those models that sees an image, understands it, and can describe or answer questions about it.

Molmo (coming in 72B, 7B, and 1B-parameter variants), like other multimodal models, is capable of identifying and answering questions about almost any everyday situation or object. How do you work this coffee maker? How many dogs in this picture have their tongues out? Which options on this menu are vegan? What are the variables in this diagram? It's the kind of visual understanding task we've seen demonstrated with varying levels of success and latency for years.

What's different is not necessarily Molmo's capabilities (which you can see in the demo below, or test here), but how it achieves them.

https://youtu.be/tsbVZ3rDA2w

Visual understanding is a broad domain, of course, spanning things like counting sheep in a field to guessing a person's emotional state to summarizing a menu. As such it's difficult to describe, let alone test quantitatively, but as Ai2 CEO Ali Farhadi explained at a demo event at the research organization's HQ in Seattle, you can at least show that two models are similar in their capabilities.

"One thing that we're showing today is that open is equal to closed," he said, "And small is now equal to big." (He clarified that he meant ==, meaning equivalency, not identity; a fine distinction some will appreciate.)

One near constant in AI development has been "bigger is better." More training data, more parameters in the resulting model, and more computing power to create and operate them. But at some point you quite literally can't make them any bigger: There isn't enough data to do so, or the compute costs and times get so high it becomes self-defeating. You simply have to make do with what you have, or even better, do more with less.

Farhadi explained that Molmo, though it performs on par with the likes of GPT-4o, Gemini 1.5 Pro, and Claude-3.5 Sonnet, weighs in at (according to best estimates) about a tenth their size. And it approaches their level of capability with a model that's a tenth of that.

"There are a dozen different benchmarks that people evaluate on. I don't like this game, scientifically… but I had to show people a number," he explained. "Our biggest model is a small model, 72B, it's outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is really better than them or not? I don't know. But at least to us, it means that this is playing the same game."

If you want to try to stump it, feel free to check out the public demo, which works on mobile too. (If you don't want to log in, you can refresh or scroll up and "edit" the original prompt to replace the image.)

The secret is using less, but better quality, data. Instead of training on a library of billions of images that can't possibly all be quality controlled, described, or deduplicated, Ai2 curated and annotated a set of just 600,000. Obviously that's still a lot, but compared with six billion it's a drop in the bucket -- a fraction of a percent. While this leaves off a bit of long tail stuff, their selection process and interesting annotation method gives them very high quality descriptions.

Interested in how? Well, they show people an image and tell them to describe it — out loud. Turns out people talk about stuff differently from how they write about it, and this produces not just accurate but also conversational and useful results. The resulting image descriptions Molmo produces are rich and practical.

That is best demonstrated by its new, and for at least a few days unique, ability to "point" at the relevant parts of the images. When asked to count the dogs in a photo (33), it put a dot on each of their faces. When asked to count the tongues, it put a dot on each tongue. This specificity lets it do all kinds of new zero-shot actions. And importantly, it works on web interfaces as well: Without looking at the website's code, the model understands how to navigate a page, submit a form, and so on. (Rabbit recently showed off something similar for its r1, for release next week.)

So why does all this matter? Models come out practically every day. Google just announced some. OpenAI has a demo day coming up. Perplexity is constantly teasing something or another. Meta is hyping up Llama version whatever.

Well, Molmo is completely free and open source, as well as being small enough that it can run locally. No API, no subscription, no water-cooled GPU cluster needed. The intent of creating and releasing the model is to empower developers and creators to make AI-powered apps, services, and experiences without needing to seek permission from (and pay) one of the world's largest tech companies.

"We're targeting, researchers, developers, app developers, people who don't know how to deal with these [large] models. A key principle in targeting such a wide range of audience is the key principle that we've been pushing for a while, which is: make it more accessible," Farhadi said. "We're releasing every single thing that we've done. This includes data, cleaning, annotations, training, code, checkpoints, evaluation. We're releasing everything about it that we have developed."

He added that he expects people to start building with this dataset and code immediately — including deep-pocketed rivals, who hoover up any "publicly available" data, meaning anything not nailed down. ("Whether they mention it or not is a whole different story," he added.)

The AI world moves fast, but increasingly the giant players are finding themselves in a race to the bottom, lowering prices to the bare minimum while raising hundreds of millions to cover the cost. If similar capabilities are available from free, open source options, can the value offered by those companies really be so astronomical? At the very least, Molmo shows that, though it's an open question whether the emperor has clothes, he definitely doesn't have a moat.

BuzzFeed
32 Painfully Awkward Talk-To-Text Fails That Spiraled Way, Way, Way, Way, Way, Way, Way Wayyyyyy Out Of Control
It happens to all of us.
South China Morning Post
China's 'AI Tiger' Zhipu in race to bring enhanced features to video-generation apps
China's state-backed unicorn Zhipu AI has introduced a series of artificial intelligence (AI) features to its video tool Ying, in a heated race with other start-ups and Big Tech companies to woo users in a growing market for AI-assisted video creation tools. After the updates, Ying will be able to generate 4K high-definition video clips as long as 10 seconds at 60 frames per second, Zhipu said in a statement. The Beijing-based firm is considered one of China's four "AI Tigers", a group of unicor
Yahoo News Australia
Aussie woman's $20,000 fine for garage secret highlights 'sad' $30 billion issue
Officers have since issued a public warning for Australians guilty of this offence.
Yahoo News Australia
Unexpected Bunnings 'visitor' highlights sad trend: 'Poor guy'
The 'bizarre' scenes shocked Aussies who witnessed the incident.
HuffPost
Donald Trump Jr.’s ‘Disgusting’ Taunt Of Ukraine's Zelenskyy Is Slammed Online
President-elect Donald Trump's son was ripped as "vile."
Parade
Demi Moore, 62, Flaunts Toned Body With Teeny-Tiny Bikini in New Video
The star of 'The Substance' showed off what she's got while celebrating a special occasion.
Yahoo News Australia
Paralysis warning after 'horrible' invasive tree appears in Aussie garden
Australians are being told to avoid this species at all costs.
Yahoo Sport Australia
Pat Cummins cops backlash as legends rip cricket captain over photo with wife at concert
The Australian captain's decision has come under fire. Read more here.
Yahoo Sport Australia
Payne Haas delivers dagger to Kevin Walters with eye-opening comments about Broncos culture
The comments don't paint the former Broncos coach in the best light. More here.
Yahoo Sport Australia
Matty Johns delivers harsh reality check for Reece Walsh in massive prediction about Broncos
Reece Walsh will be pivotal to Brisbane's success in 2025. Read more here.
Yahoo Sport Australia
Alex de Minaur's five-year misery continues as Daniil Medvedev meltdown stuns at ATP Finals
Daniil Medvedev is copping plenty of backlash. Read more here.
Cosmo
Maya Jama just referenced the sexiest naked dress of the 90s
Maya Jama just strutted down the MTV EMAs 2024 red carpet while wearing the Versace safety pin dress made famous by Liz Hurley in 1994
BuzzFeed
Hugh Grant's Real Name Is Causing Quadruple Takes
No wonder he's grumpy all of the time.
The Daily Beast
Conspiracies Run Wild as It’s Reported Kate Never Had Cancer
Conspiracy theories surrounding Kate Middleton’s cancer diagnosis and recovery have erupted online again, after a report by a respected and accredited royal reporter suggested the Princess of Wales never had cancer but was instead found to have “pre-cancerous cells.” Vicious rumors perpetrated by online trolls alleging that Kate either faked or exaggerated her cancer to cover up personal difficulties have been turbocharged by the claim. When Kate announced she had cancer in March 2024, she said
InStyle
Megan Fox Announces Pregnancy By Posing Nude While Covered in Black Paint
She debuted her baby bump with an emotional nod to Machine Gun Kelly.
Yahoo Lifestyle
MKR fans slam 'ridiculous' episode after major editing fail: 'It's all staged'
Viewers have called out a strange detail in Monday night’s episode when Janey and Maddie hosted their Ultimate Instant Restaurant. Read more.
Yahoo News Australia
Bizarre detail in BYD charging photo shows issue in neighbourhoods across Australia
One EV owner appears to find a thoughtful work around to the common problem – but has been brutally mocked in the process.
The Daily Beast
Musk Turning Himself Into the ‘Guest Who Wouldn’t Leave’ at Mar-a-Lago
After all that election night excitement, it seems Elon Musk just doesn’t want to go home. Multiple sources have told CNN that amid the post-victory buzz around Mar-a-Lago, the Tesla CEO has been at Donald Trump’s Florida resort almost every single day over the past week, with Instagram posts under the location tag showing him dining with the president-elect and his wife on Sunday, as well as spending time on the grounds with his son over the weekend. “Dining with him on the patio at times, toda
Yahoo Sport Australia
James Hird in massive career turn as AFL world left saddened over news about Tim Watson
James Hird and Tim Watson are both Essendon legends. Read more here.
The Daily Beast
Even Fox News Can’t Let Lara Trump Get Away With Ridiculous Attack on Harris
Republican National Committee co-chair Lara Trump attacked Democrats for “constant mudslinging” on Sunday before immediately turning around and dismissing her father-in-law’s swearing, racism and petty personal attacks as simply “who he’s always been.” President-elect Donald Trump‘s daugher-in-law, married to his son Eric, appeared on Fox New’s Media Buzz, where she claimed Democrats tried to “insult” voters into supporting them in the run-up to last week’s election. “They got to a level of just

Latest stories