Minimizing Leaks: The Data of Datamining

Datamining, in relation to video games, happens for multiple reasons. Some people enjoy seeing how something is made; others want to use video game props, worlds, or characters to make fan art renders. However, there are some who look for spoiler information or “leaks” about upcoming content or features. These dataminers come in two forms. One is the hobbyist, who posts what they find on a site like Reddit and may be quoted on a gaming site. The other is the professional, commonly working for a gaming site.

Regardless of form, the data they look for is the same. There are different types of data and we will go over all of them here.

Audio/Visual

Models, animations, textures, shaders, sounds, music. This is data that every game has, and the formats, although they can vary, all have similar structures. In the case of textures and audio they commonly use standardized formats the dataminers have seen before. This data has some obvious benefits to being found and extracted; any leak with images of actual models and their textures makes for a good article.

Within the audio/visual category, animations and shaders usually aren’t as high on the priority list for dataminers looking for leaks, but that isn’t always the case. Animations can be deprioritized because they don’t usually leak anything, at least not easily, and other people extracting game assets are making their own animations a lot of the time. If, however, your game has a lot of animation-heavy unlocks, they will be targeted, especially if you use a common format or a common engine. Shaders aren’t universally compatible without your rendering engine, so they are often ignored. However, if your build process isn’t removing the original HLSL or reflection data from your shaders, then dataminers may extract strings from them to look for feature leaks. That data may also help them recreate the shader in Blender or another tool.

Audio/visual data, although cool to see and hear, isn’t as useful without context, which brings us to Metadata.

Metadata

Metadata is everywhere. From file names to build records for downloaders, metadata is telling us a lot of information we need to know to work with data. This is no different for dataminers.

File names are a major source of context for dataminers. They identify what an asset is without anyone even opening it, and they can drastically help dataminers’ tooling focus its analysis on interesting new assets. We rely on file names in development to navigate our assets, so it makes sense that they would be just as helpful to dataminers if still present in the build.

Who uses file names, though? This is 2023. Well, lots of games. But let’s assume, for a moment, that you do not. It is still fairly common for the game to need to know an asset’s type, and that gives away the same information via a manifest or reference. Or maybe the type is stored in the first four bytes of the file. Or maybe all the item descriptions in your game have file sizes within a fairly narrow band. Metadata will always be important to dataminers, especially Relational Data.
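As a rough illustration of the “first four bytes” point, here is a minimal Python sketch of how a tool might guess an asset type with no file name at all. The `DDS `, `RIFF`, and `OggS` signatures are real and widely known; the `ITEM` tag and the `guess_asset_type` helper are made up for this example.

```python
# Four-byte signatures mapped to a guessed asset type.
# "DDS ", "RIFF", and "OggS" are real signatures; "ITEM" is hypothetical.
MAGIC_TO_TYPE = {
    b"DDS ": "texture",
    b"RIFF": "audio",
    b"OggS": "audio",
    b"ITEM": "item definition",
}


def guess_asset_type(data: bytes) -> str:
    """Guess an asset type from the first four bytes of its contents."""
    return MAGIC_TO_TYPE.get(data[:4], "unknown")
```

Nothing here depends on a file name, which is the point: renaming every file to a hash does not hide what kind of asset it is.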

Relational Data

Relational data is sometimes stored as metadata, or, more commonly, in asset files. Occasionally, it is both. Regardless of where it lives, relational data is the map of datamining. You can replace file names with hashes or IDs, and you can encrypt files, but your assets still need to reference each other. It is how the game makes sense of relationships and it is how dataminers do it, too.

The relational data that dataminers use to understand how everything fits together is arguably more important to them than almost anything else, at least for their tooling. Sure, they could extract every file from the archives and compare them to the files from the last build. By finding differences in this comparison they could easily look at every texture and model that was added. Maybe they will see that a new item model was added to the game, but with relational data they can make educated guesses on how to get the item, too. Few things can be as informative as relational data, well, unless you literally spell it out for them in Text.
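The build comparison and the “how do I get this item” guess can be sketched in a few lines of Python. This assumes the dataminer has already reduced each build to a mapping of asset ID to content hash, and has recovered a mapping of asset ID to the IDs it references. Both shapes, and every name below, are hypothetical.

```python
def diff_builds(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {asset_id: content_hash} manifests from different builds."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return {"added": added, "removed": removed, "changed": changed}


def referrers(references: dict[str, list[str]], target: str) -> list[str]:
    """Return the assets whose reference list points at `target`."""
    return sorted(src for src, dsts in references.items() if target in dsts)
```

With an old build of `{"a1": "h1", "b2": "h2"}` and a new build of `{"a1": "h1", "b2": "h2x", "c3": "h3"}`, `diff_builds` reports `c3` as added and `b2` as changed. If the references are `{"quest_17": ["c3"], "vendor_04": ["a1"]}`, then `referrers(references, "c3")` returns `["quest_17"]`, which is the educated guess about where the new item comes from.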

Text

Text is obviously an extremely common source of leaks and it is EVERYWHERE in your game. Text can include tooltips, log lines, or dev comments never displayed but still present in the data. Regardless, text can include spoilers about future features, or dialog line subtitles giving an easily scannable version of all dialog without having to listen to it all. Text can be one of the hardest types of data to control, because it does tend to pop up everywhere even when it, commonly, is not strictly needed for a build of the game. Some studios do put in the work to replace strings with hashes, but it may surprise you that Hashes are our next type of data.
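To show how little effort stray text takes to find, here is a minimal Python sketch in the spirit of the Unix `strings` utility: it pulls runs of printable ASCII out of an arbitrary binary blob. The `printable_strings` name and the default minimum length of six are choices made for this example, and real tools also handle UTF-16 and other encodings.

```python
import re


def printable_strings(blob: bytes, min_len: int = 6) -> list[str]:
    """Return runs of printable ASCII at least `min_len` bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(blob)]
```

Run against an executable, an asset archive, or a compiled shader, a scan like this surfaces log lines and dev comments just as readily as player-facing tooltips.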

Hashes

Hashes are a common way to represent strings in games used for logic. They take up a fraction of the size in most cases and they compare super fast. I can hear your confusion from here, though. Why do dataminers care about hashes? Admittedly, hashes themselves are rarely, if ever, to blame for leaks. Dataminers, normally, only care about hashes as they try to reverse engineer data formats and game logic. They come across a hash and know it stands for something and they try and find out. Now, in an ideal world they have no real way to get a string from a hash but we don’t live in an ideal world.

Some older-school compile-time hash generation methods could accidentally leave a string in an executable. Or, you could hash your strings with CRC because “it is fast” and we are game programmers, not cryptologists. Then, you may proceed to give a GDC talk or two or three (complete with screenshots of your tools and data definition language), giving dataminers examples of your naming convention. This leads them to try reversing hashes by guessing words related to the systems they were found in. Not that that has ever happened to anyone I know before or anything . . . oh, look over there, it is Game API Data!
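A minimal Python sketch of that guessing game, assuming the strings were hashed with plain CRC-32 (here via the standard `zlib.crc32`) and that the naming convention is `prefix_word`. The convention, the word lists, and the function names are all hypothetical; a real attempt would use far larger dictionaries.

```python
import zlib
from itertools import product


def crc32_of(name: str) -> int:
    """CRC-32 of an ASCII name, as an unsigned 32-bit integer."""
    return zlib.crc32(name.encode("ascii"))


def guess_names(targets: set[int], prefixes: list[str], words: list[str]) -> dict[int, str]:
    """Try every prefix_word combination and keep those whose hash is wanted."""
    found = {}
    for prefix, word in product(prefixes, words):
        candidate = f"{prefix}_{word}"
        value = crc32_of(candidate)
        if value in targets:
            found[value] = candidate
    return found
```

The loop is cheap precisely because the hash is fast, and a 32-bit output also means an unrelated candidate will occasionally collide, so dataminers treat matches as likely names rather than proof.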

Game API Data

Game API Data is the data that some games expose via web APIs so that it can be queried. This API can be used by the game itself, but, more commonly, it exists so that websites can pull official data directly from the developers for their own use, whether that is showcasing found items or whatever else the site specializes in.

Even if an API doesn’t provide a list of things that can be queried, it can sometimes be paired with metadata datamining to query information anyway. It isn’t uncommon to publish data that isn’t ready for the public eye yet and rely on obscurity to hide it, but dataminers can and will find the keys to search for.
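Here is a small Python sketch of why obscurity alone does not hold. A dictionary stands in for the API backend, so there is no real network call, endpoint, or game involved. `list_items`, `get_item`, `probe`, and the item records are all invented for the example.

```python
# Stand-in for published API data: one listed record, one hidden by obscurity.
PUBLISHED = {
    "1001": {"name": "Iron Sword", "listed": True},
    "1002": {"name": "Unannounced Raid Helm", "listed": False},
}


def list_items() -> list[str]:
    """Index endpoint: only returns IDs that are flagged as listed."""
    return sorted(k for k, v in PUBLISHED.items() if v["listed"])


def get_item(item_id: str):
    """Lookup endpoint: answers for any published ID, listed or not."""
    return PUBLISHED.get(item_id)


def probe(mined_ids: list[str]) -> dict[str, dict]:
    """Return records the index hides but the lookup endpoint still serves."""
    listed = set(list_items())
    hits = {}
    for item_id in mined_ids:
        record = get_item(item_id)
        if record is not None and item_id not in listed:
            hits[item_id] = record
    return hits
```

Calling `probe(["1001", "1002", "9999"])` with IDs mined from a build returns only the `1002` record: the index never mentioned it, but the lookup endpoint answered as soon as someone knew the key.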

That is it for this week. We have a lot of data to try and protect, and we will start on that in “Minimizing Leaks: Removing Data,” which will arrive at an unknown future date because the series is on temporary hold.

Quasi-functional Technology