SoatDev IT Consulting
July 14, 2023

Over the past two years, AI-powered image generators have become commodified, more or less, thanks to the widespread availability of — and decreasing technical barriers around — the tech. They’ve been deployed by practically every major tech player, including Google and Microsoft, as well as countless startups angling to nab a slice of the increasingly lucrative generative AI pie.

That isn’t to suggest they’re consistent yet, performance-wise — far from it. While the quality of image generators has improved, it’s been incremental, sometimes agonizing progress.

But Meta claims to have had a breakthrough.

Today, Meta announced CM3leon (“chameleon” in clumsy leetspeak), an AI model that the company claims achieves state-of-the-art performance for text-to-image generation. CM3leon is also distinguished by being one of the first image generators capable of generating captions for images, laying the groundwork for more capable image-understanding models going forward, Meta says.

“With CM3leon’s capabilities, image generation tools can produce more coherent imagery that better follows the input prompts,” Meta wrote in a blog post shared with TechCrunch earlier this week. “We believe CM3leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding.”

Most modern image generators, including OpenAI’s DALL-E 2, Google’s Imagen and Stable Diffusion, rely on a process called diffusion to create art. In diffusion, a model learns how to gradually subtract noise from a starting image made entirely of noise — moving it closer step by step to the target prompt.
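The iterative denoising idea can be sketched in a few lines. This is a toy illustration only, not any real diffusion model: a real model learns to predict the noise at each step from data, whereas here we simply move a fixed fraction of the way toward a known target image.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.random((8, 8))          # stand-in for the image the prompt describes
image = rng.standard_normal((8, 8))  # start from pure noise

for t in range(50):
    # A trained model would predict the remaining noise here; this toy
    # version cheats and steps a fixed fraction toward the target.
    image = image + 0.2 * (target - image)

# After many small steps the noise is mostly gone.
print(np.abs(image - target).mean())
```

Each individual step does little, which is why diffusion needs many sequential passes through the model and ends up slow and costly at inference time.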

The results are impressive. But diffusion is computationally intensive, making it expensive to operate and slow enough that most real-time applications are impractical.

CM3leon is a transformer model, by contrast, leveraging a mechanism called “attention” to weigh the relevance of input data such as text or images. Attention and the other architectural quirks of transformers can boost model training speed and make models more easily parallelizable. Larger and larger transformers can be trained with significant but not unattainable increases in compute, in other words.
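The core of that attention mechanism fits in a short NumPy sketch. Shapes and variable names below are illustrative, not taken from CM3leon: each query token computes a relevance score against every key, the scores are softmax-normalized, and the output is the correspondingly weighted mix of values.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the basic building block of transformers."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of each key to each query
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V               # weighted combination of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 query tokens, dimension 8
K = rng.standard_normal((6, 8))  # 6 key/value tokens
V = rng.standard_normal((6, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Because every query attends to every key in one matrix multiply, the whole computation parallelizes well on modern hardware, which is the property the paragraph above alludes to.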

And CM3leon is even more efficient than most transformers, Meta claims, requiring five times less compute and a smaller training data set than previous transformer-based methods.

Interestingly, OpenAI explored transformers as a means of image generation several years ago with a model called Image GPT. But it ultimately abandoned the idea in favor of diffusion — and might soon move on to “consistency.”

To train CM3leon, Meta used a data set of millions of licensed images from Shutterstock. The most capable of several versions of CM3leon that Meta built has 7 billion parameters, over twice as many as DALL-E 2. (Parameters are the parts of the model learned from training data and essentially define the skill of the model on a problem, like generating text — or, in this case, images.)
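As a rough illustration of what "parameters" means: they are the learned weights of the network. Here is a toy two-layer network with invented layer sizes, nothing like CM3leon's actual architecture, just to show how the count adds up.

```python
# Each fully connected layer contributes a weight matrix plus a bias
# vector; the parameter count is the total number of those values.
layers = [(512, 2048), (2048, 512)]  # (inputs, outputs) per layer; sizes are made up

total = 0
for n_in, n_out in layers:
    total += n_in * n_out  # weight matrix
    total += n_out         # bias vector

print(total)  # about 2.1 million parameters for this toy network
```

Scale the same bookkeeping up across dozens of transformer layers and much wider matrices and you reach the billions of parameters quoted for models like CM3leon.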

One key to CM3leon’s stronger performance is a technique called supervised fine-tuning, or SFT for short. SFT has been used to train text-generating models like OpenAI’s ChatGPT to great effect, but Meta theorized that it could be useful when applied to the image domain, as well. Indeed, instruction tuning improved CM3leon’s performance not only on image generation but on image caption writing, enabling it to answer questions about images and edit images by following text instructions (e.g. “change the color of the sky to bright blue”).
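To make the idea of instruction tuning concrete, here is a hypothetical sketch of what SFT training examples for a multimodal model might look like. The field names and file names are invented for illustration; Meta has not published CM3leon's data format.

```python
# Hypothetical SFT data: each example pairs an instruction (plus an
# optional input image) with the desired output. The model is then
# trained with an ordinary supervised loss to produce `target`.
examples = [
    {"instruction": "Describe this image in one sentence.",
     "input_image": "beach.png",
     "target": "A sandy beach at sunset with two palm trees."},
    {"instruction": "Change the color of the sky to bright blue.",
     "input_image": "beach.png",
     "target": "beach_blue_sky.png"},
]

for ex in examples:
    print(ex["instruction"], "->", ex["target"])
```

The same training recipe covers generation, captioning and editing, which is why one tuned model can handle all three tasks.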

Most image generators struggle with “complex” objects and text prompts that include too many constraints. But CM3leon doesn’t — or at least, not as often. In a few cherrypicked examples, Meta had CM3leon generate images using prompts like “A small cactus wearing a straw hat and neon sunglasses in the Sahara desert,” “A close-up photo of a human hand, hand model,” “A raccoon main character in an Anime preparing for an epic battle with a samurai sword” and “A stop sign in a Fantasy style with the text ‘1991.’”

For the sake of comparison, I ran the same prompts through DALL-E 2. Some of the results were close. But the CM3leon images were generally closer to the prompt and more detailed to my eyes, with the signage being the most obvious example. (Until recently, diffusion models handled both text and human anatomy relatively poorly.)

[Image: Meta’s image generator.]

[Image: The DALL-E 2 results.]

CM3leon can also understand instructions to edit existing images. For example, given the prompt “Generate high quality image of ‘a room that has a sink and a mirror in it’ with bottle at location (199, 130),” the model can generate something visually coherent and, as Meta puts it, “contextually appropriate” — room, sink, mirror, bottle and all. DALL-E 2 utterly fails to pick up on the nuances of prompts like these, at times completely omitting the objects specified in the prompt.

And, of course, unlike DALL-E 2, CM3leon can follow a range of prompts to generate short or long captions and answer questions about a particular image. In these areas, the model performed better than even specialized image captioning models (e.g. Flamingo, OpenFlamingo) despite seeing less text in its training data, Meta claims.

But what about bias? Generative AI models like DALL-E 2 have been found to reinforce societal biases, after all, generating images of positions of authority — like “CEO” or “director” — that depict mostly white men. Meta leaves that question unaddressed, saying only that CM3leon “can reflect any biases present in the training data.”

“As the AI industry continues to evolve, generative models like CM3leon are becoming increasingly sophisticated,” the company writes. “While the industry is still in its early stages of understanding and addressing these challenges, we believe that transparency will be key to accelerating progress.”

Meta didn’t say whether — or when — it plans to release CM3leon. Given the controversies swirling around open source art generators, I wouldn’t hold my breath.

Meta claims its new art-generating model is best-in-class by Kyle Wiggers originally published on TechCrunch
