Testing a system whose outputs are non-deterministic is genuinely different from testing traditional software. The difference is not primarily technical — it's philosophical.
In traditional software testing, a test either passes or fails. The system either returns the right value or it doesn't. You write the assertion, you run the test, you know. Repeatability is assumed. Any deviation is a bug.
In an AI-native system, the same input can produce different outputs on different runs. Some of those outputs are correct. Some are not. The question is not "does this system always return the right answer?" but "does this system return a good-enough answer, often enough, in the right contexts?" That requires a fundamentally different approach to test design, coverage, and what it means for a test to pass.
The test management discipline hasn't fully caught up with this shift. Most teams apply traditional QA patterns to AI systems and then wonder why their test coverage doesn't translate to production confidence.

