Why I ripped Mockito out of a Flutter codebase and switched to Valenty (and what I got back)

Straight answer: I pulled Mockito out of a Flutter codebase because the bottleneck was never writing the mock. It was build_runner. Mockito builds its mocks through code generation: you annotate a test file with @GenerateMocks, run dart run build_runner build, and it writes a .mocks.dart file next to each annotated test file. Flutter's own docs are explicit: "mockito 5.0.0 supports Dart's null safety thanks to code generation. To run the required code generation, add the build_runner dependency" (Flutter docs, 2026). On a small project nobody notices. On a 40,000-line app I inherited, build_runner became a 12-to-25-second toll every time an interface changed. I swapped all of it for manual fakes plus Valenty (a testing framework I published to pub.dev). Local test time on my machine dropped from 45s to 8s, because code generation simply left the loop.

I didn't reach this from a Reddit thread. I reached it after deleting an 800-line generated mock file that existed only to cover 5 dependencies. This post is the why, with numbers, and where I'd still keep Mockito.

Mockito's problem isn't the mock — it's the codegen

Let me separate two things that constantly get conflated. The mock itself — a stand-in that records calls and returns stubbed values — is a fine idea. The mechanism Mockito uses to build that stand-in in Dart, ever since null safety, is code generation. That's where the cost lives.

build_runner is a tax that scales with your project

build_runner analyzes your project's Dart files to decide what to regenerate. The bigger the project, the more files it scans, the slower it gets — a problem the community has documented to death. Code With Andrea has an entire guide just on speeding up code generation, and the top recommendation is literally "use the generate_for option to specify exactly which files the builder should process" (Code With Andrea, 2026). Think about what that means: the tool is slow enough by default that best practice becomes manually configuring what it should NOT look at.
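For reference, that generate_for advice usually lands in a build.yaml file at the project root. The fragment below is a minimal sketch, assuming Mockito's builder key is mockito|mockBuilder and that every annotated file matches test/**_test.dart. The glob is a placeholder to adapt to your own layout.

```yaml
# build.yaml (project root)
targets:
  $default:
    builders:
      # Assumed builder key for Mockito's mock generator.
      mockito|mockBuilder:
        # Restrict the builder to test files; the glob is illustrative.
        generate_for:
          - test/**_test.dart
```

This narrows what the builder processes. It does not remove the generation step itself.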

There's a public report from a Flutter app with ~40 developers and 3,000 tests where generation took 8 minutes on an M1 Mac, 16 minutes on a Linux CI box, and outright failed (exit code -9, out of memory) on a 4GB box. The team only got it down from 12 to ~5 minutes by renaming sources with a .buildable suffix and configuring generate_for (MobileNativeFoundation, discussion #200, 2026). Five minutes is the "optimized" state. To regenerate mocks.

On the 40k-line app, the number wasn't minutes, it was 12 to 25 seconds per build_runner pass, every time I touched a method signature that had a mock. Multiply that by the dozens of times you change an interface during a day of refactoring. The cost isn't any single wait. It's the repeated micro-cut to flow that makes you stop running tests before you commit. That's the wrong way to spend a senior dev's attention.

Opaque errors when a stub doesn't match

The second problem is sneakier, and it shows up when you forget to stub a method on a generated mock. A mock from @GenerateMocks throws a MissingStubError, and the message is about the mock's internals, not your scenario. A mock from @GenerateNiceMocks (today's recommended path) returns a simple legal value for unstubbed calls (Flutter docs, 2026). Great for not crashing, terrible for debugging. Your test passes by returning 0, '', or null somewhere you didn't even know was being called, and the bug leaks to production disguised as a green test.
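To make the failure mode concrete, here is a small pure-Dart sketch. It does not use Mockito: NiceTaxApi and StrictTaxApi are hand-written, hypothetical stand-ins that only imitate the two behaviors (silent legal default versus loud failure), and TaxApi and totalWithTax are invented for illustration.

```dart
// Hypothetical dependency, declared here so the sketch is self-contained.
abstract class TaxApi {
  double rateFor(String region);
}

// Imitates the "nice" behavior: unstubbed calls return a legal default.
class NiceTaxApi implements TaxApi {
  final Map<String, double> _stubs = {};
  void stub(String region, double rate) => _stubs[region] = rate;

  @override
  double rateFor(String region) => _stubs[region] ?? 0.0;
}

// Imitates the strict behavior: unstubbed calls fail loudly.
class StrictTaxApi implements TaxApi {
  final Map<String, double> _stubs = {};
  void stub(String region, double rate) => _stubs[region] = rate;

  @override
  double rateFor(String region) =>
      _stubs[region] ?? (throw StateError('No stub for rateFor("$region")'));
}

// Hypothetical code under test.
double totalWithTax(TaxApi api, String region, double net) =>
    net * (1 + api.rateFor(region));

void main() {
  // Nobody stubbed 'SP', yet the call succeeds with a 0% rate.
  final silent = totalWithTax(NiceTaxApi(), 'SP', 100);
  print('nice stand-in total: $silent');

  // The strict stand-in names the missing stub instead.
  try {
    totalWithTax(StrictTaxApi(), 'SP', 100);
  } on StateError catch (e) {
    print('strict stand-in: ${e.message}');
  }
}
```

With the nice stand-in, a test that only checks "a total was produced" stays green while the tax is silently zero.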

I've burned a full afternoon on exactly this. The mock returned the "legal" default, the test stayed green, and the real rule was wrong. Which is precisely the scenario Valentina Jemuovic describes when she explains why the old test pyramid fails:

"Your unit tests pass. Your E2E tests pass. And yet, the tax calculation was wrong. [...] There's a massive gap between 'all my unit tests pass' and 'this feature actually works as the customer expects.' That gap is where your production bugs live." — Valentina Jemuovic, Optivem Journal

Manual fakes: same coverage, zero codegen

The alternative nobody wants to hear because it sounds like grunt work: the manual fake. One class per dependency, implementing only the methods that test uses. No annotation, no .mocks.dart, nothing to run.

The instant objection is "but that's more lines of code." In practice, it isn't. The generated mock file was 800 lines to cover 5 dependencies — code I neither wrote nor read, but that landed in the diff and in build time. The equivalent manual fakes are 15-30 lines per dependency, written by me, readable, and they stop existing as a generated artifact. Code you control is not the same as code a tool vomits.

And the error stays legible. If the fake doesn't implement a method, the Dart compiler tells you immediately, with the class and method name — not at runtime, not with a silent "legal" value. A compile error is the best kind of bug: it happens before you run anything.

// Manual fake: 0 codegen, compile error if the interface changes
class FakeOrderApi implements OrderApi {
  final List<Order> _orders;
  FakeOrderApi(this._orders);

  @override
  Future<List<Order>> fetchOrders() async => _orders;

  // If OrderApi gains a new method, THIS STOPS COMPILING.
  // The generated Mockito mock would have returned a "legal"
  // value and masked it.
}
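
Here is a usage sketch for the fake above, in plain Dart with no test package. Order, the shape of OrderApi, and OrderSummary are hypothetical, declared only so the example runs on its own. The real app's types will differ.

```dart
// Hypothetical domain types, declared here so the sketch is self-contained.
class Order {
  final String sku;
  final double total;
  const Order(this.sku, this.total);
}

abstract class OrderApi {
  Future<List<Order>> fetchOrders();
}

// Same fake as above: no annotation, no generated file.
class FakeOrderApi implements OrderApi {
  final List<Order> _orders;
  FakeOrderApi(this._orders);

  @override
  Future<List<Order>> fetchOrders() async => _orders;
}

// Hypothetical code under test: real logic, fake dependency.
class OrderSummary {
  final OrderApi api;
  OrderSummary(this.api);

  Future<double> grandTotal() async {
    final orders = await api.fetchOrders();
    return orders.fold<double>(0.0, (sum, order) => sum + order.total);
  }
}

Future<void> main() async {
  final summary = OrderSummary(
    FakeOrderApi(const [Order('APPLE1001', 12.50), Order('PEAR2002', 7.50)]),
  );
  final total = await summary.grandTotal();
  if (total != 20.0) {
    throw StateError('expected 20.0, got $total');
  }
  print('grand total: $total');
}
```

The fake is constructed with the data the scenario needs, so there is nothing to stub and nothing to forget.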

valentyTest: the real trick for component tests

Manual fakes solve "how do I kill code generation." Valenty solves "how do I stop testing implementation instead of behavior." Valenty is an open-source package I published to pub.dev (valenty_test + valenty_cli) after getting fed up with the boilerplate. It's built on Valentina Jemuovic's Modern Test Pyramid, and it covers the component test level: it runs the entire Flutter app, with real business logic, but with external dependencies (Firebase, Dio, databases) swapped for fakes.

The valentyTest pattern separates three things that a normal widget test mashes into one soup:

BackendStubDsl — the setup, where you configure what external systems return.
SystemDsl — the body, where you describe user actions in domain language.
UiDriver — the boring layer, where every find.byKey, tester.tap, and pumpAndSettle lives.

The test itself looks like this — note there's not a single find.byKey in the body:

valentyTest(
  'should show order total after placing the order',
  setup: (backend) {
    backend.stubProduct(sku: 'APPLE1001', price: 2.50);
    backend.stubOrderCreation(totalPrice: 12.50);
  },
  body: (system, backend) async {
    await system.openApp();
    await system.selectProduct('APPLE1001');
    await system.setQuantity(5);
    await system.placeOrder();
    await system.verifyConfirmation('Total: \$12.50');
  },
);

That reads like a user story. Six months later, when you come back to this test, you'll understand the scenario without decoding a widget tree. And when the "place order" button changes its key, you fix it in one place — the UiDriver — and every scenario keeps passing. A generated mock gives you none of that separation: it pushes you toward verifying "method X was called with Y," which is testing implementation, not behavior.
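
The "fix it in one place" claim can be sketched with plain flutter_test. This is not Valenty's actual API: OrderUiDriver, OrderSystemDsl, and the widget keys are hypothetical names chosen for illustration. Only the flutter_test calls (find.byKey, tester.tap, tester.enterText, pumpAndSettle, expect) are real.

```dart
import 'package:flutter/widgets.dart';
import 'package:flutter_test/flutter_test.dart';

/// Hypothetical UI driver: the only class that knows keys and finders.
class OrderUiDriver {
  OrderUiDriver(this.tester);
  final WidgetTester tester;

  Future<void> enterQuantity(int quantity) async {
    // Assumed key; if it changes, this is the single line to update.
    await tester.enterText(
      find.byKey(const Key('quantity_field')),
      '$quantity',
    );
    await tester.pumpAndSettle();
  }

  Future<void> tapPlaceOrder() async {
    await tester.tap(find.byKey(const Key('place_order_button')));
    await tester.pumpAndSettle();
  }

  void expectText(String text) => expect(find.text(text), findsOneWidget);
}

/// Hypothetical system DSL: domain-language steps that delegate to the driver.
class OrderSystemDsl {
  OrderSystemDsl(this._ui);
  final OrderUiDriver _ui;

  Future<void> setQuantity(int quantity) => _ui.enterQuantity(quantity);

  Future<void> placeOrder() => _ui.tapPlaceOrder();

  Future<void> verifyConfirmation(String message) async =>
      _ui.expectText(message);
}
```

Scenarios talk only to the DSL, so a renamed key touches the driver and nothing else.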

Why it's not just "fakes with a nice name"

The deep difference is what the test asserts. With Mockito it's easy to land on verify(mock.placeOrder(any)).called(1): you're testing that a call happened. With valentyTest you assert verifyConfirmation('Total: \$12.50'), which is what the user sees. When a refactor changes the internal call sequence but the user-facing result stays correct, the Mockito test breaks (a false positive: a red test with no real bug) and the Valenty test stays green. That's the entire point of the Modern Test Pyramid: test the edge of behavior, not the guts of the implementation.
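
A tiny pure-Dart sketch of that difference, using neither Mockito nor Valenty. RecordingOrderApi, checkoutV1, checkoutV2, and both check functions are hypothetical. The two checkout versions differ only in their internal call sequence.

```dart
// Hypothetical collaborator that records every call it receives.
class RecordingOrderApi {
  final List<String> calls = [];

  double quote(int quantity, double unitPrice) {
    calls.add('quote');
    return quantity * unitPrice;
  }

  double placeOrder(int quantity, double unitPrice) {
    calls.add('placeOrder');
    return quantity * unitPrice;
  }
}

// Version 1: places the order directly.
String checkoutV1(RecordingOrderApi api) {
  final total = api.placeOrder(5, 2.50);
  return 'Total: \$${total.toStringAsFixed(2)}';
}

// Version 2: a refactor that asks for a quote first. Same visible result.
String checkoutV2(RecordingOrderApi api) {
  api.quote(5, 2.50);
  final total = api.placeOrder(5, 2.50);
  return 'Total: \$${total.toStringAsFixed(2)}';
}

// Interaction-style check: pins the exact call sequence.
bool interactionCheck(RecordingOrderApi api) =>
    api.calls.join(',') == 'placeOrder';

// Outcome-style check: pins what the user would read.
bool outcomeCheck(String message) => message == 'Total: \$12.50';

void main() {
  for (final checkout in [checkoutV1, checkoutV2]) {
    final api = RecordingOrderApi();
    final message = checkout(api);
    print(
      'interaction: ${interactionCheck(api)}, '
      'outcome: ${outcomeCheck(message)}',
    );
  }
}
```

Both versions satisfy the outcome check. Only the interaction check flips to false for checkoutV2, even though nothing the user sees has changed.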

The actual migration: 40k lines, 3 sprints, 45s → 8s

It wasn't a big bang. On a 40,000-line Flutter app I maintain, the migration ran across 3 sprints, feature by feature:

Sprint	What moved	Result
1	New features born with `valentyTest`; Mockito frozen	Team stopped generating new mocks
2	Hot paths (checkout, auth) — the most-run tests	`build_runner` left the loop for those modules
3	Rest of the codebase + removing `build_runner` from test deps	Local test time: 45s → 8s

The 8 seconds aren't some Valenty performance magic — they're the absence of code generation. Pulling build_runner off the path is what gives the time back. Valenty is what makes the tests worth maintaining afterward.

Honest cost of the migration: each feature took half a day to a day to rewrite its tests in the valentyTest format, and there was friction early because the team was used to when(...).thenReturn(...). I won't sell that as free. I sell it as a cost you pay once and never again, unlike build_runner, which charges a toll every run, forever.

When I'd still use Mockito (or mocktail)

I won't pretend generated mocks are always wrong. If I needed to verify a fine-grained interaction, like "this billing method was called exactly once, with this argument, in this order," Mockito's verify() does exactly that and I wouldn't reinvent it by hand. For that case, though, I'd reach for mocktail, which offers the same style of API without code generation (mocktail, pub.dev, 2026), and that kills half my problem outright. If your only pain is build_runner and you love the mock API, swap Mockito for mocktail and move on.
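
For that fine-grained case, a minimal mocktail sketch might look like the following. BillingApi and Checkout are hypothetical, declared inline so the file stands alone. It assumes mocktail and test are listed under dev_dependencies (in a Flutter project you would import flutter_test instead of test).

```dart
import 'package:mocktail/mocktail.dart';
import 'package:test/test.dart';

// Hypothetical interface, declared here so the sketch is self-contained.
abstract class BillingApi {
  Future<void> charge(String customerId, double amount);
}

// Hypothetical code under test.
class Checkout {
  Checkout(this._billing);
  final BillingApi _billing;

  Future<void> pay(String customerId, double amount) =>
      _billing.charge(customerId, amount);
}

// The whole mock: one line, no annotation, no generated file.
class MockBillingApi extends Mock implements BillingApi {}

void main() {
  test('charges the customer exactly once', () async {
    final billing = MockBillingApi();
    // mocktail stubs take a closure instead of a direct call.
    when(() => billing.charge(any(), any())).thenAnswer((_) async {});

    await Checkout(billing).pay('customer-42', 12.50);

    verify(() => billing.charge('customer-42', 12.50)).called(1);
    verifyNoMoreInteractions(billing);
  });
}
```

The mock is an ordinary class declaration, so there is no build_runner pass between changing the interface and running the test.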

But for component tests, running the feature end to end with fakes, I'm not going back to generated mocks. Verifying user behavior with verify(mock.placeOrder(any)).called(1) is the wrong tool for the job. That's where valentyTest wins.

A transparency note: valenty_test is in pre-release (v0.2.3) and scores 160 of 160 pub points on pub.dev — format, docs, and examples in place, the ceiling of the quality bar (valenty_test, pub.dev, 2026). But it's new, with little community adoption yet. I run it in production because I wrote it and trust it, but I won't sell you an installed base that doesn't exist.

How to start in 30 seconds

The CLI is just an installer — your editor's AI does the heavy lifting, reading the generated skill files to learn the valentyTest architecture:

dart pub global activate valenty_cli && cd your_flutter_app && valenty init

valenty init detects the Flutter project, adds valenty_test as a dev dependency, creates .valenty.yaml, and generates skill files for Claude Code, Cursor, or Codex. Then you ask your agent: "scaffold the Order feature for valentyTest" and "write the test: user adds an item and sees the total." Run flutter test. No build_runner in the middle.

The summary fits in one sentence: the mock was never the problem, code generation was. Take build_runner off the loop and you get back time you didn't know you were losing.

If you've got a Flutter app with a test suite that's slow because of codegen — or no suite at all because "it's too much work" — that's exactly the kind of thing Hens fixes. We ship Flutter for clients in Brazil and abroad. Send a message.