Elon Musk's Grok 2 Generates AI Images—How Does It Stack Up?

Artificial intelligence company xAI, founded by tech mogul Elon Musk, unveiled Grok 2 on Wednesday, the next evolution of its AI chatbot. This latest release takes Grok into multimodal territory, boasting capabilities that span text comprehension, real-time Twitter analysis, and image generation.

“We are excited to release an early preview of Grok-2, a significant step forward from our previous model Grok-1.5, featuring frontier capabilities in chat, coding, and reasoning,” xAI said in its official announcement. The company said an earlier version of Grok 2 “is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo.”

LMSYS, an open-source ranking system for large language models based on blind testing and user preferences, confirmed xAI’s claims. An update to the ranking puts Grok-2 ahead of Claude 3.5 Sonnet and just behind OpenAI’s newest GPT-4o and Google’s Gemini 1.5 Pro.

Image: xAI

“With over 12,000 community votes, [Grok 2] has secured the #3 spot on the overall leaderboard, even matching GPT-4o! It excels in Coding (#2), Hard Prompts (#4), and Math (#2),” LMSYS reported on Twitter.

Woah, another exciting update from Chatbot Arena❤️‍🔥

The results for @xAI’s sus-column-r (Grok 2 early version) are now public!

With over 12,000 community votes, sus-column-r has secured the #3 spot on the overall leaderboard, even matching GPT-4o! It excels in Coding (#2),… https://t.co/gqSWSwYN0z pic.twitter.com/j9UYDBYNt4

— lmsys.org (@lmsysorg) August 14, 2024

Notably, the new Grok 2 and its faster, less capable “mini” version are only available on X (aka Twitter) to X Premium+ subscribers. The subscription is priced at $16 a month or $168 a year.

First impressions

xAI said both “Grok-2 and Grok-2 mini are currently in beta on X,” but we could only get access to the mini version, so it’s probably a gradual rollout. Also, the platform briefly stopped generating images, suggesting a service cap or a possible server overload. Either would be a disadvantage for AI art power users.

We tried Grok 2’s image generator and our first impressions were not good, with outputs that seemed lackluster at best. However, we refined our prompting technique, and a few generations later, things improved a lot.

We started with this:

However, by combining SDXL-style aesthetic elements (using specific keywords separated by commas) with natural language scene descriptions (similar to Flux or Dall-E 3 approaches), we unlocked a higher level of realism in our generations, which ended up looking like this:

Not bad… Could be better, but not bad at all.
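The prompting approach described above, comma-separated style keywords combined with a natural-language scene description, could be sketched roughly as follows. This is only an illustration of how such a prompt might be assembled before pasting it into Grok. The helper name build_prompt and the example values are hypothetical and are not part of any xAI or X tooling.

```python
def build_prompt(style_keywords, scene_description):
    """Combine SDXL-style keywords with a natural-language scene description.

    Hypothetical helper for illustration only: keywords come first,
    separated by commas, followed by the descriptive sentence.
    """
    # Drop empty entries and stray whitespace so the commas stay clean.
    keywords = [k.strip() for k in style_keywords if k.strip()]
    scene = scene_description.strip()
    parts = keywords + ([scene] if scene else [])
    return ", ".join(parts)


# Example values loosely modeled on the prompts used in this review.
prompt = build_prompt(
    ["Polaroid photo with VSCO filter", "1990", "night", "flash photo"],
    "a woman inside an apartment holding a handwritten sign",
)
print(prompt)
```

Keeping the keywords in an ordered list makes it easy to reorder or swap them between generations, which matters if word position influences the output as our tests suggested.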

Grok 2 takes on AI art titans

Before Grok entered the image generation arena, MidJourney, Flux, Ideogram, and Leonardo were vying for the top spot as the best image generator, with each model excelling in different categories. So we’ve pitted Grok against the leaders in specific tasks, based on what each tool does best.

Here are our takes, but you can be the judge.

Realism

Prompt: Polaroid photo with VSCO filter, 1990, gorgeous woman, night, flash photo, blonde, cute, young face, beautiful shadows, tropical plants, urban clothing, inside an apartment, DSLR, holding a sign written in ballpoint pen on a notebook saying “This photo was generated by Decrypt using Grok 2 Mini.”

Grok 2 Mini:

Grok 2 Mini delivered a highly realistic image, effectively capturing the aesthetic of a 1990s Polaroid with a VSCO filter. Details like the shadows, tropical plants, and urban clothing were accurately portrayed. The model avoided significant mistakes, ensuring the image closely followed the prompt. It framed the image to resemble a Polaroid picture.

There might be minor areas where the 1990s aesthetic could have been more pronounced, but these do not detract significantly from the overall realism.

Also, the writing was perfect, but did not seem to be handwritten with a ballpoint pen.

Flux Dev (with Realism LoRA):

Flux Dev generated a visually appealing image that aligned well with the prompt, particularly in capturing the nighttime, indoor setting.

However, it made more noticeable errors compared to Grok 2 Mini, particularly in the fine details that contribute to overall realism. The VSCO filter is not as noticeable, the finger placement is odd, and there is no urban clothing visible. There was also a minor error in the writing, but the font seems more natural.

Winner: Grok 2 Mini wins in this category due to its superior realism, attention to detail, and minimal mistakes.

However, it is important to note that specific keywords are needed to achieve this level of realism. If they are omitted, Grok 2 Mini’s output drops to a level similar to MidJourney v5, so beware.

Text generation

Prompt: Polaroid photo with VSCO filter, 1990, gorgeous woman, night, flash photo, blonde, cute, young face, beautiful shadows, tropical plants, urban clothing, inside an apartment, DSLR, holding a sign written in ballpoint pen on a notebook saying “Emerge by Decrypt is the best source for AI, tech, biohacking, and all that stuff. Read us.”

Grok 2 Mini:

Grok 2 Mini excelled in this category by generating the text with fewer mistakes, ensuring that the message was clear and well-integrated into the image. The model maintained the realism of the scene while effectively incorporating the long text.

There may be slight room for improvement in the handwriting aesthetic, but this is a minor issue. The only mistake was a missing word: “for” as in “the best source for AI.”

Flux Pro:

Flux Pro also generated the text well, but it struggled more with clarity or integration, leading to more noticeable errors compared to Grok 2 Mini.

The mistakes in text generation were more apparent, affecting the overall effectiveness of the image. It generated artifacts and missed a few words.

Winner: Grok 2 Mini wins in text generation, handling the long text with fewer mistakes and maintaining overall realism.

Artistic styles

Prompt: A man and a woman having dinner in a futuristic restaurant, illustration in the style of Vincent Van Gogh. The restaurant has a sign saying “Welcome to Emerge, by Decrypt.”

Grok 2 Mini:

Grok 2 Mini attempted to capture the style of Van Gogh while integrating the futuristic elements of the prompt. Van Gogh’s style is noticeable only in the night sky outside; the main elements of the composition don’t resemble his style at all.

Overall, the Van Gogh style may not have been convincingly replicated, as it lacks the distinctive brushwork and color palette that characterizes his work.

Leonardo:

Leonardo performed better in replicating the Van Gogh style, with more accurate brushstrokes and vibrant colors.

There might be some minor discrepancies in how the futuristic elements were portrayed, but the artistic style was the focus and was well-executed.

Winner: Leonardo wins in this category for its superior replication of Van Gogh’s artistic style.

Spatial awareness

Prompt: A dog standing on top of a cat, rendered in a highly photorealistic style with meticulous attention to fur texture and lighting. On the left, a worn, retro-futuristic robot with a cracked, analog screen displaying the word “Emerge” in faded, orange-tinted pixels. On the right, a creepy, vintage-clad doctor in a gas mask, holding a vintage-style syringe with a hint of steam rising from it. The background blends elements of emerging technologies, but with a retro, 1970s-inspired aesthetic: distressed, grainy DNA helices, binary code printed on yellowed paper, old-school space exploration equipment, and worn, retro-futuristic electronics.

Grok 2 Mini:

Grok 2 Mini tried to handle the complex scene and keep the spatial relationships between the elements logical and visually coherent, but it failed to incorporate all the elements into the same scene. Instead of a dog on top of a cat, we got a cat on top of a monitor.

The lack of a wider image ratio may play against its capabilities. Also, the fact that there is no way to properly guide or influence the prompt enhancement or interpretation that Grok’s LLM does before generating the image is a negative point when some specific elements are required in complex scenes.

Ideogram:

Ideogram excelled in spatial awareness, ensuring that all elements were correctly positioned and integrated into the scene. The attention to detail in the arrangement and interaction between objects was superior.

There were, of course, some minor imperfections in texture or lighting, and the elements were placed more like a collage than the seamless, logical blend that Grok 2 Mini aimed for. However, this was secondary to the overall spatial accuracy.

Winner: Ideogram wins for its superior spatial awareness and composition.

Known figures and copyright-sensitive images

Grok 2 Mini demonstrates a higher degree of flexibility by successfully generating images of political figures like Donald Trump and Kamala Harris. It can produce images even when ethical or legal constraints might deter other models.

In fact, this is so unusual for a proprietary model that X is awash in questionable examples, including images of George Bush doing drugs, or Trump and Harris about to crash an airplane into the Twin Towers of the World Trade Center in New York. Many include copyrighted characters from companies like Disney and Nintendo.

We didn’t go that far, and instead generated a crypto-loving Vice President Harris with no problem:

Other models, like MidJourney and ChatGPT, adhere to stricter ethical standards. They refuse to generate images of political figures or other copyright-sensitive content. This approach ensures compliance with legal frameworks and ethical considerations, reducing the risk of misuse.

Winner: Grok 2 Mini wins in terms of capability, as it can generate a broader range of images, including known figures. However, for ethical content generation, MidJourney and ChatGPT are preferable.

Nudity and censorship

In general, all the proprietary models are mostly censored for sex, gore, and other types of derogatory or sensitive content. For that specific use case, the best solution is to use fine-tuned versions of open-source models or third-party components like LoRAs, LyCORIS, and embeddings that alter the capabilities of open-source models like Stable Diffusion or Flux.

MidJourney has more defined limits regarding nudity and violence. It can generate slight nudity or violent imagery under certain prompts, but these instances are typically controlled, do not cross ethical boundaries, and are mostly either workarounds or random.

Among closed-source models, Grok 2 Mini wins in terms of capability due to its ability to generate a wider range of content, including uncensored material. However, it does not stand a chance against Stable Diffusion and its extreme levels of customizability.

Conclusion

According to our preliminary tests, Grok 2 Mini outperformed its competitors in text generation, so it can be seen as the overall winner in this category.

It can also be the best model for realism, as long as it is prompted correctly with specific keywords, because word position seems to play a big role in the output. Those looking for more realism without being too specific on prompts may go with MidJourney or Flux.

Grok 2 Mini is really bad at dealing with complex compositions or artistic imagery that requires specific creative elements, so that may be a negative point for more specialized users.

Leonardo still holds the edge in artistic style, and Ideogram leads in spatial awareness. Stable Diffusion remains the king when it comes to uncensored generations, whereas Flux can be a better choice for those looking for the best overall local and open-source image generator with great text capabilities, realism, and natural prompt understanding.

The choice of the “best” model depends on the specific requirements of the task at hand, with Grok 2 Mini being the preferred choice for a specific type of realism, text-heavy scenarios, and sensitive generations. For anything else, there are better models.