"Bad data" often arises when real-world information fails to conform to predefined rules or constraints, causing breakdowns in how that information is processed and interpreted. In the case of highly structured discrete data, these constraints can be especially unforgiving: even a minor deviation from expected formats can cause complete failure.
Consider how differently various data types respond to small changes:
- Image (pixels): even with carefully crafted adversarial noise affecting thousands of pixels, a panda is still recognizably a panda to humans.
- Game level (tiles): adding just one wall tile (shown in red) completely blocks the only path to the goal.
This distinction is crucial. In game levels, a "tile" is the smallest building block of both the visual and functional layout of a game's environment; each tile corresponds to specific gameplay mechanics or aesthetics (walls block movement, keys unlock doors, spikes harm the player). Unlike pixels in images, tiles carry semantic meaning and mechanical function.
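As a toy illustration (the tile characters and their meanings below are hypothetical, not the dataset's actual encoding), a level is just a grid of symbols where every symbol carries a gameplay meaning:

```python
# Hypothetical tile vocabulary for a platformer-style level; the specific
# characters and their meanings are illustrative only.
TILE_SEMANTICS = {
    "X": "solid ground (blocks movement, can be stood on)",
    "-": "empty space (passable)",
    "P": "pipe segment (must connect to adjacent pipe tiles)",
    "K": "key (unlocks a door)",
    "^": "spikes (harm the player)",
}

# A level is a small grid of these symbols...
level = [
    "----------",
    "---K------",
    "XXXX--XXXX",
]

# ...and because every symbol has a mechanical role, swapping a single
# character can change what the player is able to do.
assert all(tile in TILE_SEMANTICS for row in level for tile in row)
```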
Structured discrete data, whether game levels or other domains, must satisfy various types of constraints. Understanding these constraints is crucial for developing effective generation models:
- **Hard constraints**: Mandatory conditions that MUST be satisfied for validity; violation means complete failure. In constraint satisfaction problems (CSP), these are non-negotiable requirements that cannot be relaxed.
- **Soft constraints**: Preferred conditions that SHOULD be satisfied for quality; violation reduces desirability but not validity. These can be treated as optimization objectives with penalties proportional to the degree of violation.
- **Local constraints**: Rules governing relationships between neighboring or nearby elements. These ensure proper patterns and connections at the micro level, affecting small regions of the structure.
- **Global constraints**: Rules governing properties of the entire structure as a whole. These ensure overall correctness and completeness, and require analyzing the full structure to verify.
These constraint dimensions combine to create four distinct categories:
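Here is a minimal, self-contained sketch of how the four combinations might be checked on a toy grid. The hard-constraint checks mirror solvability and structural patterns discussed in this post; the two soft-constraint examples (wall density, decoration crowding) are made-up placeholders:

```python
# Toy maze grid: 'S' start, 'G' goal, '#' wall, '.' floor, '*' decoration.
from collections import deque

LEVEL = [
    "S..#.",
    ".#.#.",
    ".#...",
    ".##*#",
    "....G",
]

def hard_global_solvable(level):
    """Hard + global: a path from S to G must exist (requires the whole grid)."""
    h, w = len(level), len(level[0])
    start = next((r, c) for r in range(h) for c in range(w) if level[r][c] == "S")
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if level[r][c] == "G":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen and level[nr][nc] != "#":
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def hard_local_decorations_attached(level):
    """Hard + local: every decoration '*' must touch a wall (a 3x3 neighborhood check)."""
    h, w = len(level), len(level[0])
    for r in range(h):
        for c in range(w):
            if level[r][c] == "*":
                neighborhood = [level[r + dr][c + dc]
                                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                                if 0 <= r + dr < h and 0 <= c + dc < w]
                if "#" not in neighborhood:
                    return False
    return True

def soft_global_wall_density(level, target=0.3):
    """Soft + global (hypothetical): penalty for straying from a target wall density."""
    tiles = "".join(level)
    return abs(tiles.count("#") / len(tiles) - target)

def soft_local_decoration_crowding(level):
    """Soft + local (hypothetical): penalty for decorations placed side by side."""
    return sum(row.count("**") for row in level)

print(hard_global_solvable(LEVEL), hard_local_decorations_attached(LEVEL))
print(soft_global_wall_density(LEVEL), soft_local_decoration_crowding(LEVEL))
```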
Beyond global solvability, game levels must also satisfy local constraints - structural patterns that ensure visual and functional coherence. Even state-of-the-art models frequently violate these, producing "broken" content that looks wrong even if technically solvable:
- Cave: broken decoration pattern in a maze (isolated corner piece)
- Platform: incomplete pipe structure in a platformer (pipe doesn't connect properly)
- Vertical: floating decorative element (flower without stem)
Let's see this brittleness in action. Below are two platformer levels that differ by exactly one tile. In platformers, precise jump distances are crucial: watch how removing a single platform tile creates an impossible gap.
- Original level: the maximum jump distance is 3 tiles, and all gaps are ≤ 3 tiles wide.
- Modified level: a single removed platform tile creates a 4-tile gap, which is impossible to jump!
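A small sketch of the jump check behind this example, assuming a toy encoding where the ground row uses 'X' for platform tiles and '-' for gaps (the symbols and example rows are illustrative):

```python
MAX_JUMP = 3  # maximum horizontal gap (in tiles) the player can clear

def widest_gap(ground_row: str) -> int:
    """Width of the widest run of missing platform tiles ('-') in a ground row."""
    return max((len(run) for run in ground_row.split("X") if run), default=0)

original = "XXXX---XXXX--XXX"   # widest gap: 3 tiles
modified = "XXXX----XXX--XXX"   # identical except one platform tile removed

print(widest_gap(original) <= MAX_JUMP)  # True: every gap is jumpable
print(widest_gap(modified) <= MAX_JUMP)  # False: the new 4-tile gap cannot be cleared
```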
This extreme sensitivity creates unique challenges for procedural content generation via machine learning (PCGML).
The idea of robustness originally comes from studying classifiers—machine learning models that assign labels to data points. A robust classifier is one that gives the same answer even when the input is slightly perturbed. But here, we flip the question: instead of asking "is the classifier robust?" we ask "is the data itself robust?"
Think of it this way: if you take a data point and nudge it just a tiny bit, does its true label change? For most image datasets, the answer is no—a slightly noisy photo of a cat is still a cat. But for game levels, a single tile change can flip a level from solvable to impossible.
Robustness was first defined to measure how consistently a classifier \(f\) behaves when inputs are slightly perturbed. Given a data distribution \(\mathcal{D}\) and a distance threshold \(r\), robustness is the probability that nearby points receive the same predicted label:
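One standard way to write this (the exact symbols may differ from the original formulation) is
\[
R_r(f, \mathcal{D}) \;=\; \Pr_{x \sim \mathcal{D}}\!\left[\, f(x') = f(x) \;\text{ for all } x' \text{ with } d(x, x') \le r \,\right],
\]
where \(d\) is the distance used to decide which points count as nearby.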
In plain English: pick a random data point \(x\). Now consider all points \(x'\) within distance \(r\) of it. Robustness is the probability that the classifier gives all of them the same label. High robustness means the classifier is stable—small input changes don't flip predictions.
We adapt this idea to measure a property of the data, not the classifier. Instead of asking whether a model's predictions stay consistent, we ask whether the true labels stay consistent under small perturbations:
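One way to formalize this (keeping the same distance \(d\) and replacing the classifier \(f\) with the ground-truth labeling \(y\)) is
\[
R_r(\mathcal{D}) \;=\; \Pr_{x \sim \mathcal{D}}\!\left[\, y(x') = y(x) \;\text{ for all } x' \text{ with } d(x, x') \le r \,\right].
\]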
In practice, it's often more useful to measure the opposite—non-robustness—which captures how likely it is that a small change will flip the label:
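One way to write this, using the \(\text{ND}_r(\mathcal{D})\) notation:
\[
\text{ND}_r(\mathcal{D}) \;=\; \Pr_{x \sim \mathcal{D}}\!\left[\, \exists\, x' \;\text{with}\; d(x, x') \le r \;\text{and}\; y(x') \ne y(x) \,\right].
\]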
This reads as: pick a random data point \(x\). If there exists any nearby point \(x'\) (within distance \(r\)) that has a different label, we count that. Non-robustness is the probability of finding such a label-flipping neighbor.
Higher non-robustness = more sensitive data. If \(\text{ND}_r(\mathcal{D})\) is high, it means many data points are "close to the edge"—small perturbations can easily change their labels. For game levels, this manifests as levels that become unsolvable with just one tile swap.
**Robustness of data vs. robustness of classifier:** for game levels, the ground truth should change when critical tiles are modified; a robust classifier needs to capture this sensitivity!
Existing game level datasets are small (typically hundreds of levels) and often copyrighted. We created a large, diverse, and freely available dataset to study these unique features of game levels.
We use Sturgeon, a 2D tile-based level generator that transforms game level generation into binary constraint solving problems.
You can find the Sturgeon codebase in this GitHub repository.
To validate our claim that game levels are fundamentally less robust than traditional ML datasets, we conducted two complementary experiments. The first directly measures the true distribution by exhaustively testing all possible single-tile changes. The second analyzes our generated dataset by embedding levels into a continuous space and measuring how often nearby points have different labels.
For the true distribution experiment, we generated 100 random levels from each game and systematically tested every possible single-tile replacement. This gives us the exact probability that a random single-tile change will break a level—no approximation needed.
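A sketch of this exhaustive procedure; the `is_solvable` and `is_acceptable` checkers here are stand-ins (the real experiment uses Sturgeon's solvers), and the two-symbol tile alphabet is illustrative:

```python
TILE_ALPHABET = ["X", "-"]   # illustrative two-tile vocabulary

def is_solvable(level):      # placeholder: the real check runs a path/puzzle solver
    return "-" in level[0]

def is_acceptable(level):    # placeholder: the real check validates local tile patterns
    return True

def single_tile_sensitivity(level):
    """Fraction of single-tile replacements that flip solvability or acceptability."""
    base_solv, base_acc = is_solvable(level), is_acceptable(level)
    flips_solv = flips_acc = total = 0
    for r, row in enumerate(level):
        for c, tile in enumerate(row):
            for new_tile in TILE_ALPHABET:
                if new_tile == tile:
                    continue
                edited = list(level)
                edited[r] = row[:c] + new_tile + row[c + 1:]
                total += 1
                flips_solv += is_solvable(edited) != base_solv
                flips_acc += is_acceptable(edited) != base_acc
    return flips_solv / total, flips_acc / total

# Average over a sample of levels (100 per game in the experiment described above).
levels = [["X--X", "XXXX"], ["-XX-", "XXXX"]]   # toy stand-ins for generated levels
rates = [single_tile_sensitivity(lvl) for lvl in levels]
print(sum(r[0] for r in rates) / len(rates), sum(r[1] for r in rates) / len(rates))
```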
The table below shows what percentage of single-tile changes either break solvability (the level becomes impossible to complete) or break acceptability (the level violates local structural constraints, like broken pipe segments):
| Game | Changed Solvability | Changed Acceptability |
|---|---|---|
| Cave | 43.1% | 43.0% |
| Platform | 17.1% | 0.1% |
| Crates | 78.9% | 0% |
| Vertical | 17.5% | 42.5% |
Crates (our Sokoban-style puzzle game) is the most fragile: nearly 4 out of 5 random tile changes break the level. This makes sense—pushing crates into corners or blocking narrow passages easily creates deadlocks. Cave levels are also highly sensitive because maze-like structures have many potential chokepoints where a single wall tile can block the only path.
Compare this to image datasets like CIFAR-10 or MNIST, where changing a single pixel almost never changes the true label. A cat with one different pixel is still a cat. But a puzzle level with one different tile might be completely unsolvable.
The exhaustive approach above works for measuring sensitivity to single-tile changes, but how do we compare game levels to image datasets on equal footing? We can't enumerate all single-pixel changes for CIFAR-10 images—there are too many pixels and possible values.
Instead, we embed all data into a shared continuous space using CLIP (chosen because prior work showed it aligns well with human similarity judgments for game levels), then reduce to 2D with UMAP. In this space, we measure non-robustness by checking whether nearby points (within radius \(r\)) have different labels.
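A sketch of this measurement; it assumes you already have CLIP features for each sample (`features`, shape N × d) and binary solvability labels (`labels`), and it uses umap-learn plus scikit-learn as one plausible toolchain:

```python
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.neighbors import NearestNeighbors

def non_robustness(features: np.ndarray, labels: np.ndarray, radius: float) -> float:
    """Fraction of points with at least one differently-labeled neighbor
    within `radius` of their 2D UMAP embedding."""
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
    nn = NearestNeighbors().fit(embedding)
    neighbor_lists = nn.radius_neighbors(embedding, radius=radius, return_distance=False)
    flipped = 0
    for i, neighbors in enumerate(neighbor_lists):
        others = neighbors[neighbors != i]            # drop the point itself
        flipped += np.any(labels[others] != labels[i])
    return flipped / len(labels)

# Toy usage with random stand-ins for CLIP features and solvability labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))    # e.g., 512-dim CLIP image features
labels = rng.integers(0, 2, size=200)     # 1 = solvable, 0 = unsolvable
print(non_robustness(features, labels, radius=0.5))
```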
The plot below shows non-robustness across different radius values. Higher values mean the data is more sensitive—nearby points in embedding space are more likely to have opposite labels:
Cave dominates across the board, with non-robustness reaching 36% even at small perturbation sizes. This aligns with our exhaustive analysis: maze games are inherently brittle. Notice that even Platform and Crates exceed CIFAR-10 at larger radius values, despite CIFAR-10 having far more "pixels" per sample than our games have tiles.
To see why game levels are less robust, we can visualize the embedding space directly. The plots below show 100 randomly sampled solvable (blue) and unsolvable (red) levels projected into 2D. Notice how the classes overlap and interleave—there's no clean separation. The complete visualizations with all data points are available in the full paper.
The purple regions where classes overlap are exactly where non-robustness is highest. A generative model trying to produce solvable levels must navigate this treacherous landscape where a tiny step in the wrong direction crosses into unsolvable territory.