3-year AI video generation progress comparison (Modelscope vs Grok Imagine v1)
Three Years of AI Video: From Blurry Blobs to Broadcast-Ready
I’ve been in this field long enough to get numb to progress. New model drops, benchmarks improve, everyone tweets about it, and two weeks later it’s the baseline expectation. So when Min Choi posted his side-by-side comparison this week, with Modelscope from 2023 on the left and Grok Imagine v1 from this month on the right, I actually stopped scrolling. That’s rare for me.
The gap is not incremental. It doesn’t look like a version bump or a 20% quality improvement. It looks like someone skipped a generation of hardware cycles and landed somewhere we weren’t supposed to be yet.
What Modelscope Actually Was
Let’s be fair to the 2023 moment. Modelscope was genuinely exciting when it was released. Text-to-video that ran on consumer hardware, open weights, free to experiment with. Yes, the output was blurry. Yes, subjects dissolved into color soup after a few frames. Yes, the motion looked like a watercolor painting trying to escape its own canvas. But it moved. It generated something from nothing. People who had spent their careers in VFX were paying close attention, and rightfully so.
The model was built on a latent diffusion backbone, an image-generation architecture extended with temporal layers whose whole job was to keep frames consistent, and yet temporal consistency was exactly its weakness. It could generate a reasonable frame but struggled to make the next frame agree with it. That artifact-ridden, slightly nauseating motion was the field’s ceiling at the time.
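To make that concrete, here is a minimal PyTorch sketch of the general pattern, not Modelscope’s actual implementation; the SpatioTemporalBlock name, the dimensions, and the layer choices are all mine. Spatial attention mixes information within each frame, then a separate temporal attention mixes information across frames at each location. Everything about frame-to-frame agreement hangs on that second pathway, which is why 2023-era models flickered.

```python
# Illustrative sketch only: an image-style attention block with a
# temporal attention bolted on, the rough shape of 2023 text-to-video
# UNets. Names and dimensions are hypothetical, not Modelscope's code.
import torch
import torch.nn as nn


class SpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(channels)
        self.norm_t = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, channels) latent features
        b, t, hw, c = x.shape

        # Spatial attention: each frame attends only within itself.
        xs = x.reshape(b * t, hw, c)
        n = self.norm_s(xs)
        xs = xs + self.spatial_attn(n, n, n)[0]

        # Temporal attention: each spatial location attends across frames.
        # This thin pathway is all that ties frame t to frame t+1.
        xt = xs.reshape(b, t, hw, c).permute(0, 2, 1, 3).reshape(b * hw, t, c)
        n = self.norm_t(xt)
        xt = xt + self.temporal_attn(n, n, n)[0]

        return xt.reshape(b, hw, t, c).permute(0, 2, 1, 3)


# 2 clips, 8 frames, a 16x16 latent grid flattened to 256, 64 channels
x = torch.randn(2, 8, 256, 64)
print(SpatioTemporalBlock(64)(x).shape)  # torch.Size([2, 8, 256, 64])
```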
What Grok Imagine v1 Is Doing Differently
The Grok output in Min’s comparison is a different animal. Clean motion paths. Subjects that hold their shape across frames. Lighting that responds to scene geometry instead of flickering randomly. This is not a minor improvement in the same approach. The underlying architecture decisions and the scale of training data required to produce that output represent a fundamentally different level of investment.
xAI hasn’t published a full technical report on Grok Imagine v1’s video generation stack as of this writing, so I’m reading the output rather than the paper. What the output tells me is that someone solved, or at least dramatically improved on, the temporal coherence problem. Motion blur looks intentional now, not like a compression artifact. Skin tones stay stable. Backgrounds don’t breathe with that telltale AI shimmer.
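Since I’m reading the output rather than a paper, claims like “backgrounds don’t breathe” deserve at least a crude number. Here’s the kind of probe I mean: a plain-NumPy frame-to-frame difference score. The flicker_score function and the synthetic clips are my own stand-ins, and a serious evaluation would use flow-compensated warp error or FVD instead, since this version also penalizes legitimate camera motion.

```python
# A crude way to put a number on "AI shimmer": mean absolute change
# between consecutive frames. Rough flicker indicator, not a benchmark.
import numpy as np


def flicker_score(frames: np.ndarray) -> float:
    """Mean absolute per-pixel change between consecutive frames.

    frames: (num_frames, height, width) array of floats in [0, 1].
    Higher means more frame-to-frame instability.
    """
    diffs = np.abs(np.diff(frames.astype(np.float64), axis=0))
    return float(diffs.mean())


rng = np.random.default_rng(0)
base = rng.random((64, 64))

# Stable clip: a static image with faint sensor-style noise.
stable = np.stack([np.clip(base + 0.01 * rng.standard_normal((64, 64)), 0, 1)
                   for _ in range(16)])
# Shimmering clip: the background "breathes" with strong per-frame noise.
shimmer = np.stack([np.clip(base + 0.2 * rng.standard_normal((64, 64)), 0, 1)
                    for _ in range(16)])

print(f"stable:  {flicker_score(stable):.4f}")   # small
print(f"shimmer: {flicker_score(shimmer):.4f}")  # much larger
```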
The Exponential Curve Problem
Here’s what I keep turning over in my head. We are genuinely terrible at internalizing exponential curves while we’re sitting inside one. Each individual model release feels modest. “Oh, the new version is a bit better,” people say, and then they move on. But Modelscope to Grok Imagine v1 in 36 months is not modest. Output that required a professional VFX team, a render farm, and a five-figure budget in 2022 is now being generated from a text prompt on a consumer product.
The real danger isn’t that we overestimate this progress. It’s that we underestimate it precisely because we experienced it incrementally, step by step, and each step felt small.
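The arithmetic behind that underestimation is almost embarrassingly simple. Assume, purely for illustration, a 15% quality gain per quarterly release; the rate is made up, the compounding is not.

```python
# Toy compounding: twelve "modest" steps quietly become a ~5x jump.
# The 15% per-quarter figure is an assumption for illustration, not a
# measured rate of video-model progress.
gain_per_release = 0.15   # hypothetical improvement per release
releases_per_year = 4     # hypothetical quarterly release cadence
years = 3

total = (1 + gain_per_release) ** (releases_per_year * years)
print(f"{total:.1f}x after {years} years")  # ~5.4x
```

Each individual 1.15x feels like noise. The product is the Modelscope-to-Grok gap.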
What This Actually Means for Production
I’m not going to pretend the current output is ready to replace a cinematographer or a VFX supervisor. It isn’t, not for complex narrative work. But the bar for “good enough” is moving fast, and it’s moving into territory that affects real budgets and real jobs. Short-form advertising, social content, product demos, explainer videos – these are already in the crosshairs. Any studio or agency that isn’t actively pressure-testing these tools against their current production pipeline is making a planning error they’ll regret by 2027.
The three-year arc from Modelscope to Grok Imagine v1 should function as a calibration tool. Look at where we are now. Project that curve forward another 36 months. Whatever you think that endpoint looks like, you’re probably still undershooting it.
#AIVideo #GenerativeAI #MachineLearning #VideoGeneration #AIProgress #glenrhodes
