Analysis of Robustness of a Large Game Corpus

Mahsa Bazzaz & Seth Cooper · FDG 2025

1. The Fragility of Structured Discrete Data

"Bad data" often arises when real-world information fails to conform to predefined rules or constraints, causing breakdowns in how that information is processed and interpreted. In the case of highly structured discrete data, these constraints can be especially unforgiving: even a minor deviation from expected formats can cause complete failure.

Sensitivity to small changes in different data types

Consider how differently various data types respond to small changes:

Traditional Images (CIFAR-10, ImageNet)

Even with carefully crafted adversarial noise affecting thousands of pixels,
a panda is still recognizably a panda to humans


  • Pixel: Smallest visual element (65,536 in a 256×256 image)
  • Redundancy: Adjacent pixels often similar
  • Continuity: Small changes → small effects
  • Outcome: Robust to noise

Game Levels (Structured Discrete)

Adding just one wall tile (shown in red)
completely blocks the only path to the goal



  • Tile: Functional building block with gameplay mechanics
  • Constraints: Each tile must satisfy local & global rules
  • Brittleness: Small changes → big effects
  • Outcome: playable or unplayable

This distinction is crucial. In game levels, a "tile" is the smallest building block that constructs both the visual and functional layout of a game's environment, with each tile corresponding to specific gameplay mechanics or aesthetics (walls block movement, keys unlock doors, spikes harm the player). Unlike pixels in images, tiles have semantic meaning and mechanical function.

Understanding Constraints in Structured Data

Structured discrete data, whether game levels or other domains, must satisfy various types of constraints. Understanding these constraints is crucial for developing effective generation models:

Types of Constraints

Hard Constraints

Definition: Mandatory conditions that MUST be satisfied for validity. Violation means complete failure.

In constraint satisfaction problems (CSP), these are non-negotiable requirements that cannot be relaxed.

Soft Constraints

Definition: Preferred conditions that SHOULD be satisfied for quality. Violation reduces desirability but not validity.

These can be treated as optimization objectives with penalties proportional to violation degree.

Scope of Constraints

Local Constraints

Definition: Rules governing relationships between neighboring or nearby elements.

These ensure proper patterns and connections at the micro-level, affecting small regions of the structure.

Global Constraints

Definition: Rules governing properties of the entire structure as a whole.

These ensure overall correctness and completeness, requiring analysis of the full structure to verify.

The Interaction Effect

These constraint dimensions combine to create four distinct categories:

  • Hard + Local: Solid ground tiles must exist under the player spawn point
  • Hard + Global: Level must be solvable from start to goal
  • Soft + Local: Decorative elements should follow aesthetic patterns
  • Soft + Global: Overall difficulty should progress smoothly
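
To make the taxonomy concrete, here is a minimal sketch (in Python, not code from the paper) of how a level checker might treat these categories differently: hard constraints are pass/fail tests, while soft constraints yield penalty scores. The grid encoding, tile symbols, and function names are assumptions made for illustration.

```python
# Illustrative sketch only (not from the paper): checking constraint categories
# on a small tile grid. Tile symbols and function names are assumptions.
from collections import deque

WALL, START, GOAL = "#", "S", "G"

def hard_local_ok(grid, spawn):
    """Hard + local: a solid tile must sit directly under the spawn point."""
    r, c = spawn
    return r + 1 < len(grid) and grid[r + 1][c] == WALL

def hard_global_ok(grid):
    """Hard + global: the goal must be reachable from the start (4-way BFS)."""
    rows, cols = len(grid), len(grid[0])
    start = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == START)
    seen, frontier = {start}, deque([start])
    while frontier:
        r, c = frontier.popleft()
        if grid[r][c] == GOAL:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in seen and grid[nr][nc] != WALL:
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False

def soft_global_penalty(grid, target_open_ratio=0.5):
    """Soft + global: score (rather than reject) deviation from a preferred amount of open space."""
    tiles = [t for row in grid for t in row]
    open_ratio = sum(t != WALL for t in tiles) / len(tiles)
    return abs(open_ratio - target_open_ratio)  # 0 is ideal; larger means less desirable
```

A generator can reject any candidate that fails the hard checks outright and treat the soft penalties as an optimization objective; soft local rules (aesthetic patterns) would be scored analogously.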

When Local Constraints Break: Visual Examples

Beyond global solvability, game levels must also satisfy local constraints - structural patterns that ensure visual and functional coherence. Even state-of-the-art models frequently violate these, producing "broken" content that looks wrong even if technically solvable:


Cave: Broken decoration pattern in a maze
(isolated corner piece)

Platform: Incomplete pipe structure in a platformer
(pipe doesn't connect properly)

Vertical: Floating decorative element
(flower without stem)

When One Tile Changes Everything

Let's see this brittleness in action. Below are two platformer levels that differ by exactly one tile. In platformers, precise jump distances are crucial - watch how removing a single platform tile creates an impossible gap:

Solvable Level

Player can jump across all gaps!

Maximum jump distance: 3 tiles. All gaps are ≤ 3 tiles wide.

Unsolvable Level

Gap too wide - Level unsolvable!

Removing a single platform tile creates a 4-tile gap, which is impossible to jump!
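
The jump rule above is easy to capture in code. Below is a minimal sketch (using a simplified one-dimensional encoding, not the paper's level format) where 'X' marks a platform tile, '-' marks a gap, and the player can clear gaps of at most three tiles:

```python
# Minimal sketch of the jump rule described above, using an assumed 1-D encoding:
# 'X' is a platform tile, '-' is a gap the player must jump over.
MAX_JUMP = 3  # widest gap (in tiles) the player can clear

def solvable(row: str) -> bool:
    """A row is traversable if no run of gap tiles is wider than MAX_JUMP."""
    gap = 0
    for tile in row:
        gap = gap + 1 if tile == "-" else 0
        if gap > MAX_JUMP:
            return False
    return True

before = "XXXX---XXXX---XXX"   # all gaps are 3 tiles wide
after  = "XXXX---XXX----XXX"   # one platform tile removed -> a 4-tile gap

print(solvable(before))  # True: every gap is jumpable
print(solvable(after))   # False: the 4-tile gap cannot be crossed
```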

Why This Matters for Machine Learning

This extreme sensitivity creates unique challenges for procedural content generation via machine learning (PCGML):

  1. Gradient-based learning struggles: The discrete nature and binary outcomes (solvable/unsolvable) make it difficult to backpropagate meaningful gradients through neural networks.
  2. No tolerance for errors: Unlike image generation where slight imperfections are acceptable, a single misplaced tile can render a level unplayable. Even state-of-the-art models like transformers, VAEs, and GANs frequently generate levels with broken local structures (pipes that don't connect, floating platforms, inaccessible areas).
  3. Training data is sparse: Unlike millions of images available for computer vision, game level datasets typically contain only hundreds to thousands of examples, making it harder to learn these complex constraints.

2. What is Data Robustness?

The idea of robustness originally comes from studying classifiers—machine learning models that assign labels to data points. A robust classifier is one that gives the same answer even when the input is slightly perturbed. But here, we flip the question: instead of asking "is the classifier robust?" we ask "is the data itself robust?"

Think of it this way: if you take a data point and nudge it just a tiny bit, does its true label change? For most image datasets, the answer is no—a slightly noisy photo of a cat is still a cat. But for game levels, a single tile change can flip a level from solvable to impossible.

The Original Definition (for Classifiers)

Robustness was first defined to measure how consistently a classifier \(f\) behaves when inputs are slightly perturbed. Given a distribution of data \(\mathcal{D}\) and a distance threshold \(r\), the robustness measures the probability that nearby points get the same predicted label:

Classifier Robustness
\[ A_r(f, \mathcal{D}) = \mathbb{P}_{x \sim \mathcal{D}}\left[\, f(x') = f(x) \;\; \forall x' \text{ such that } d(x, x') \leq r \,\right] \]

In plain English: pick a random data point \(x\). Now consider all points \(x'\) within distance \(r\) of it. Robustness is the probability that the classifier gives all of them the same label. High robustness means the classifier is stable—small input changes don't flip predictions.

Our Twist: Robustness of the Data Itself

We adapt this idea to measure a property of the data, not the classifier. Instead of asking whether a model's predictions stay consistent, we ask whether the true labels stay consistent under small perturbations:

Data Robustness
\[ D_r(\mathcal{D}) = \mathbb{P}_{x \sim \mathcal{D}}\left[\, \text{Label}(x') = \text{Label}(x) \;\; \forall x' \text{ such that } d(x, x') \leq r \,\right] \]
💡 The Key Insight: We replace the classifier's prediction \(f(x)\) with the ground-truth label. This tells us about the nature of the data itself—how "fragile" or "stable" the label boundaries are.

Non-Robustness: Measuring Sensitivity

In practice, it's often more useful to measure the opposite—non-robustness—which captures how likely it is that a small change will flip the label:

Non-Robustness (Our Main Metric)
\[ \text{ND}_r(\mathcal{D}) = \mathbb{P}_{x \sim \mathcal{D}}\left[\, \exists\, x' \text{ such that } d(x, x') \leq r \ \text{and}\ \text{Label}(x') \neq \text{Label}(x) \,\right] \]

This reads as: pick a random data point \(x\). If there exists any nearby point \(x'\) (within distance \(r\)) that has a different label, we count that. Non-robustness is the probability of finding such a label-flipping neighbor.
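
Given a ground-truth labeling function and a way to sample neighbors within radius \(r\), the definition translates almost directly into code. This is only a sketch: `true_label` and `sample_neighbors` are hypothetical stand-ins for whatever oracle and perturbation model a particular domain provides.

```python
def non_robustness(dataset, true_label, sample_neighbors, r, n_neighbors=50):
    """Estimate ND_r: the fraction of points with at least one label-flipping neighbor within r."""
    flipped = 0
    for x in dataset:
        y = true_label(x)
        neighbors = sample_neighbors(x, r, n_neighbors)  # points x' with d(x, x') <= r
        if any(true_label(x2) != y for x2 in neighbors):
            flipped += 1
    return flipped / len(dataset)
```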

📊 Interpreting the Numbers

Higher non-robustness = more sensitive data. If \(\text{ND}_r(\mathcal{D})\) is high, it means many data points are "close to the edge"—small perturbations can easily change their labels. For game levels, this manifests as levels that become unsolvable with just one tile swap.

Important Distinction

Robustness of Data vs. Robustness of Classifier:

  • Data Robustness: Whether the true label actually changes with input perturbation (our focus)
  • Classifier Robustness: Whether a model's prediction changes (traditional adversarial ML focus)

For game levels, the ground truth really does change when critical tiles are modified, so a classifier needs to capture this sensitivity rather than smooth over it!

3. The Generated Game Level Corpus (GGLC)

Why We Created It

Existing game level datasets are small (typically hundreds of levels) and often copyrighted. We created a large, diverse, and freely available dataset to study these unique features of game levels.

How We Generate Levels: The Sturgeon Engine

We use Sturgeon, a 2D tile-based level generator that transforms game level generation into binary constraint solving problems.

You can find the Sturgeon codebase in this GitHub repository.
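
As a rough illustration of the underlying idea (this is not Sturgeon's actual encoding or API), tile placement can be phrased as a Boolean satisfiability problem: one variable per (cell, tile type) pair, with clauses forcing exactly one tile per cell plus whatever design rules the game needs. The sketch below uses the python-sat package as an assumed dependency.

```python
# Toy illustration only: encoding "which tile type goes in which cell" as SAT.
# This is not Sturgeon's real encoding; it just shows the flavor of the reduction.
from pysat.solvers import Glucose3  # pip install python-sat

TILE_TYPES = ["wall", "floor"]
WIDTH, HEIGHT = 4, 4

def var(r, c, t):
    """Boolean variable meaning 'cell (r, c) holds tile type t'; SAT variables start at 1."""
    return 1 + (r * WIDTH + c) * len(TILE_TYPES) + t

solver = Glucose3()
for r in range(HEIGHT):
    for c in range(WIDTH):
        cell = [var(r, c, t) for t in range(len(TILE_TYPES))]
        solver.add_clause(cell)  # at least one tile type per cell
        for i in range(len(cell)):
            for j in range(i + 1, len(cell)):
                solver.add_clause([-cell[i], -cell[j]])  # at most one tile type per cell

# Example of a hard local rule: every border cell must be a wall.
wall = TILE_TYPES.index("wall")
for r in range(HEIGHT):
    for c in range(WIDTH):
        if r in (0, HEIGHT - 1) or c in (0, WIDTH - 1):
            solver.add_clause([var(r, c, wall)])

if solver.solve():
    chosen = {lit for lit in solver.get_model() if lit > 0}  # positive literals = chosen (cell, tile) pairs
```

Sturgeon layers reachability and other playability constraints onto encodings roughly in this spirit and hands the resulting problem to an off-the-shelf solver.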

Our Games Collection

  • Cave: Top-down maze navigation with keys and doors. 60,000+ levels. View Dataset
  • Platform: Mario-inspired platformer with jumps. 10,000+ levels. View Dataset
  • Crates: Sokoban-style puzzle pushing boxes. 10,000+ levels. View Dataset
  • Vertical: Vertical platformer with wall climbing. 10,000+ levels. View Dataset
  • Slide: Ice physics navigation puzzle. 8,000+ levels. View Dataset

4. Experiments

To validate our claim that game levels are fundamentally less robust than traditional ML datasets, we conducted two complementary experiments. The first directly measures the true distribution by exhaustively testing all possible single-tile changes. The second analyzes our generated dataset by embedding levels into a continuous space and measuring how often nearby points have different labels.

True Distribution: Exhaustive Single-Tile Analysis

For the true distribution experiment, we generated 100 random levels from each game and systematically tested every possible single-tile replacement. This gives us the exact probability that a random single-tile change will break a level—no approximation needed.
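
The procedure is simple to sketch. Assuming a level is a list of strings over some tile alphabet and `is_solvable` is the game-specific checker (both placeholders, not the paper's actual tooling), exhaustively scoring single-tile edits looks like this:

```python
# Sketch of the exhaustive single-tile test described above. The level format,
# tile alphabet, and `is_solvable` checker are placeholders for game-specific tools.
def fraction_breaking_changes(level, tile_alphabet, is_solvable):
    """Fraction of all single-tile replacements that make a solvable level unsolvable."""
    assert is_solvable(level)
    changes, broken = 0, 0
    for r, row in enumerate(level):
        for c, original in enumerate(row):
            for tile in tile_alphabet:
                if tile == original:
                    continue
                edited = [list(line) for line in level]
                edited[r][c] = tile
                edited = ["".join(line) for line in edited]
                changes += 1
                broken += not is_solvable(edited)
    return broken / changes
```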


The table below shows what percentage of single-tile changes either break solvability (the level becomes impossible to complete) or break acceptability (the level violates local structural constraints, like broken pipe segments):

Game       Changed Solvability   Changed Acceptability
Cave       43.1%                 43.0%
Platform   17.1%                 0.1%
Crates     78.9%                 0%
Vertical   17.5%                 42.5%

Crates (our Sokoban-style puzzle game) is the most fragile: nearly 4 out of 5 random tile changes break the level. This makes sense—pushing crates into corners or blocking narrow passages easily creates deadlocks. Cave levels are also highly sensitive because maze-like structures have many potential chokepoints where a single wall tile can block the only path.

Compare this to image datasets like CIFAR-10 or MNIST, where changing a single pixel almost never changes the true label. A cat with one different pixel is still a cat. But a puzzle level with one different tile might be completely unsolvable.

Sample Distribution: Embedding-Based Analysis

The exhaustive approach above works for measuring sensitivity to single-tile changes, but how do we compare game levels to image datasets on equal footing? We can't enumerate all single-pixel changes for CIFAR-10 images—there are too many pixels and possible values.

Instead, we embed all data into a shared continuous space using CLIP (chosen because prior work showed it aligns well with human similarity judgments for game levels), then reduce to 2D with UMAP. In this space, we measure non-robustness by checking whether nearby points (within radius \(r\)) have different labels.
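
A sketch of that measurement, assuming `points` holds the 2D UMAP projections of the CLIP embeddings and `labels` the ground-truth solvability labels (both as NumPy arrays; variable names are illustrative), using scikit-learn's radius query:

```python
# Sketch of the sample-distribution measurement over embedded data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embedded_non_robustness(points, labels, r):
    """Fraction of points with at least one differently-labeled neighbor within radius r."""
    index = NearestNeighbors(radius=r).fit(points)
    _, neighbor_ids = index.radius_neighbors(points, radius=r)
    flips = 0
    for i, ids in enumerate(neighbor_ids):
        others = ids[ids != i]                       # exclude the point itself
        flips += np.any(labels[others] != labels[i])  # any label-flipping neighbor?
    return flips / len(points)
```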

The plot below shows non-robustness across different radius values. Higher values mean the data is more sensitive—nearby points in embedding space are more likely to have opposite labels:

Cave dominates across the board, with non-robustness reaching 36% even at small perturbation sizes. This aligns with our exhaustive analysis: maze games are inherently brittle. Notice that even Platform and Crates exceed CIFAR-10 at larger radius values, despite CIFAR-10 having far more "pixels" per sample than our games have tiles.

Visualizing the Label Boundaries

To see why game levels are less robust, we can visualize the embedding space directly. The plots below show 100 randomly sampled solvable (blue) and unsolvable (red) levels projected into 2D. Notice how the classes overlap and interleave—there's no clean separation. The complete visualizations with all data points are available in the full paper.

The purple regions where classes overlap are exactly where non-robustness is highest. A generative model trying to produce solvable levels must navigate this treacherous landscape where a tiny step in the wrong direction crosses into unsolvable territory.

5. Learn More