About Data Quality
The goal of data analysis is always insight: insight derived from a discovered pattern or patterns. However, patterns and findings can be deeply misleading if the underlying data is erroneous.
Let us explore a game example.
The game is a multiplayer PvP experience that begins with a simple single-player progression that onboards players to the core gameplay mechanics. Following a series of single-player bot-based matches, players progress into a multiplayer mode, where they first encounter real-player opponents.
As we joined the team and received early gameplay data, we began by simply cleaning up the data and running early funnel analyses, but something was off. A conspicuous pattern emerged: many players appeared, in the data, to quit by shutting down the game client. This behavior resulted in incomplete data for those games, which led to their exclusion from the analysis.
The issue concealed deeper concerns for both the product and the gameplay. Removing the incomplete data inadvertently made the win/loss ratio for the first multiplayer encounter look much better than it was. More importantly, this cleaning excluded data that pointed to a deep gameplay issue: players were struggling. The game felt unfair as players transitioned from single-player to multiplayer, and many chose to abandon their progression and stop playing.
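To make the bias concrete, here is a minimal sketch with made-up numbers, assuming that the abandoned matches would have ended in losses, of how dropping incomplete matches inflates the observed win rate:

# A minimal sketch, with made-up numbers, of how excluding incomplete
# matches biases the win rate of a first multiplayer encounter.
matches = (
    [{"outcome": "win", "complete": True}] * 40 +
    [{"outcome": "loss", "complete": True}] * 30 +
    # Players who quit by killing the client: losses recorded as incomplete.
    [{"outcome": "loss", "complete": False}] * 30
)

complete = [m for m in matches if m["complete"]]
naive_rate = sum(m["outcome"] == "win" for m in complete) / len(complete)
true_rate = sum(m["outcome"] == "win" for m in matches) / len(matches)

print(f"win rate after dropping incomplete matches: {naive_rate:.0%}")  # 57%
print(f"win rate counting quits as losses:          {true_rate:.0%}")   # 40%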
Safeguards in 3 easy steps
On the surface, such a scenario seems easy to overlook. However, identifying data collection challenges early is a core analyst responsibility: it allows you to put the necessary safeguards in place so that data issues won't compromise your analysis.
On some projects, up to 80% of our time is dedicated to identifying and fixing bugs that cause data inaccuracy. While catching incomplete and erroneous data downstream, during analysis, is inevitable, the sources of these data issues are nearly always avoidable, and addressing them at the source pays dividends later.
Implementing safeguards against potential telemetry issues catches bugs early, saving time and money while preserving both insight and stakeholders' confidence.
So what's the secret? Put simply: test your telemetry code! Test locally. Ensure that the implementation is correct, and when changes are made to the telemetry and/or the product, commit to a basic test plan that incorporates a qualitative review of the telemetry output. Developer time is precious, and extensive late-stage testing is far more time-consuming than catching issues early.
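As an illustration, a local telemetry test can be as small as the following sketch. The Telemetry test double and the match_end event with its fields are hypothetical stand-ins for whatever SDK and event taxonomy your game actually uses:

# A minimal sketch of a local telemetry test, using a hypothetical
# Telemetry class that buffers events in memory instead of sending them.
# Event names and fields are illustrative, not from any specific SDK.
import unittest

class Telemetry:
    """Test double: records events locally instead of posting them."""
    def __init__(self):
        self.events = []

    def track(self, name, **fields):
        self.events.append({"event": name, **fields})

class MatchTelemetryTest(unittest.TestCase):
    def test_match_end_fires_with_required_fields(self):
        telemetry = Telemetry()
        # Simulate the code path that should emit a match_end event.
        telemetry.track("match_end", match_id="m-123", outcome="win",
                        duration_s=312, participants=4)
        event = telemetry.events[-1]
        self.assertEqual(event["event"], "match_end")
        # Every analysis-critical field must be present.
        for field in ("match_id", "outcome", "duration_s", "participants"):
            self.assertIn(field, event)
        self.assertIn(event["outcome"], {"win", "loss", "draw", "quit"})

if __name__ == "__main__":
    unittest.main()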
When it comes to early issue identification, we recommend a 3-step approach—this is also where our QA friends play a crucial role in further telemetry testing.
Step 1 - Telemetry Testing
The quality assessment starts at the telemetry level.
Get your QA involved in telemetry testing early!
Back in my Ubisoft days, we had a dedicated data QA team. Although they were part of the larger QA organization and did other forms of QA, they contributed to early data validation at critical points. Their deep knowledge of the project and ability to mimic actual player behavior made them perfect for generating complicated use cases for data validation, such as edge-case player-death causes.
While having a dedicated data QA team may be challenging in smaller organizations, telemetry must be treated with the same respect as all other game features and mechanics. This is especially true for free-to-play titles and games as a service. Many games we have worked with had an extensive pre-launch phase in which the product was validated by a smaller group of real users before being rolled out to a global audience. It is tempting to think that telemetry bugs are acceptable in this phase, but every telemetry issue is a missed opportunity to honestly evaluate the product, which is the very goal of this phase in the game's life cycle.
At a very basic level, running telemetry smoke tests before every release is an absolute must. Ideally, each new build should receive a comprehensive level of testing to flag issues early. Telemetry can even be added to the regular build smoke test so it is exercised alongside the rest of the game.
The level and extent of smoke testing will vary, but the general ideas are as follows:
Validate telemetry by following the data through the pipeline—this is crucial for builds planned for release. One issue that can creep in is telemetry flowing to the wrong environment, for example, live data being sent to a development environment.
Validate that key telemetry events are firing. This should be done for a number of business-critical events: logins and transactions, at the very least. Extending this validation to a larger group of telemetry events is common, particularly for events that generate essential data for analytics cohorts. One example is testing progression events if players are expected to be cohorted by progression variables in analysis or reporting. A sketch of such a check follows this list.
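Assuming QA captured a session's events as JSON lines and that each event carries an env field (both assumptions, to be adapted to your pipeline), the check might look like this:

# A minimal sketch of a smoke check over a captured telemetry log.
# The file name, event names, and "env" field are assumptions.
import json

REQUIRED_EVENTS = {"login", "transaction", "progression_step"}
EXPECTED_ENV = "production"

def smoke_check(log_path):
    seen, wrong_env = set(), []
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            seen.add(event.get("event"))
            # Catch data flowing to the wrong environment (or vice versa).
            if event.get("env") != EXPECTED_ENV:
                wrong_env.append(event)
    missing = REQUIRED_EVENTS - seen
    if missing:
        print(f"FAIL: events never fired: {sorted(missing)}")
    if wrong_env:
        print(f"FAIL: {len(wrong_env)} events targeted the wrong environment")
    if not missing and not wrong_env:
        print("PASS: all key events fired against the expected environment")

smoke_check("qa_session_events.jsonl")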
It is typical to provide a smoke test script that QA follows to trigger the correct telemetry events. Below is an example of such a script. Each step should be accompanied by a list of the events it is expected to trigger.
Example QA script
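The steps and event names below are generic placeholders rather than any specific title's event taxonomy:

1. Launch the game and log in. Expected events: client_start, login.
2. Complete one single-player bot match. Expected events: match_start, match_end.
3. Queue for and finish one multiplayer match. Expected events: matchmaking_start, match_start, match_end (one per participant).
4. Purchase the cheapest in-app product. Expected events: store_open, transaction.
5. Quit to desktop from the main menu. Expected event: client_shutdown.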
Step 2 - Extended Data Testing
It is important to run a comprehensive sanity check on the collected telemetry data by exposing it to QA.
This exercise is particularly crucial when a new feature is added or changes are made to a gameplay system, and various use cases need to be validated at a granular level.
Returning to our earlier game example: with both single-player and multiplayer gameplay, we recommend testing different permutations of group compositions, match outcomes, and end reasons. The goal is to ensure that every participant is represented in the telemetry for each multiplayer game. It is not uncommon for telemetry in a multiplayer setting to send patchy data across participants, or to send data only for the player who started (or is hosting) the game. A sketch of generating such a test matrix follows.
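The dimension values below are illustrative; replace them with the game's actual group sizes, outcomes, and end reasons:

# A minimal sketch of generating a test matrix for multiplayer telemetry.
from itertools import product

group_sizes = [1, 2, 4]                      # solo, duo, full squad
outcomes = ["win", "loss", "draw"]
end_reasons = ["completed", "surrender", "disconnect", "client_shutdown"]

test_cases = list(product(group_sizes, outcomes, end_reasons))
for i, (size, outcome, reason) in enumerate(test_cases, 1):
    print(f"case {i:02d}: group={size} outcome={outcome} end={reason}")
print(f"{len(test_cases)} permutations; verify every participant "
      "appears in telemetry for each case")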
Step 3 - Data Validation by Analysts or Data Engineers
Your final bastion in the battle against bugs is the data team. The data team can deploy tools that detect logical abnormalities that are either rare or hard to catch through gameplay and telemetry event monitoring. Data teams can detect changes in data types, shifts in ratios between related events (such as starts and ends), abnormal durations, incorrect event ordering or timestamp formatting, and so on. A sketch of such automated checks follows.
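Assuming events have been loaded into a pandas DataFrame with illustrative columns (event, match_id, ts, duration_s), ratio, duration, and ordering checks might look like this:

# A minimal sketch of automated validation checks on ingested telemetry.
# Column names and thresholds are assumptions to adapt to your schema.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    issues = []
    # Ratio check: match_start and match_end counts should roughly agree.
    starts = (df["event"] == "match_start").sum()
    ends = (df["event"] == "match_end").sum()
    if starts and ends / starts < 0.9:
        issues.append(f"only {ends}/{starts} matches have an end event")
    # Duration check: flag impossible or extreme values.
    durations = df.loc[df["event"] == "match_end", "duration_s"]
    bad = durations[(durations <= 0) | (durations > 4 * 3600)]
    if len(bad):
        issues.append(f"{len(bad)} matches with abnormal durations")
    # Ordering check: timestamps within each match must be monotonic.
    unordered = (
        df.groupby("match_id")["ts"]
          .apply(lambda ts: not ts.is_monotonic_increasing)
    )
    if unordered.any():
        issues.append(f"{unordered.sum()} matches with out-of-order timestamps")
    return issues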
Having bugs in telemetry is almost unavoidable, but following these steps, and making sure to retest the data whenever updates launch, should get you close to perfectly reliable data.
Finally, let me provide you with some examples of bugs that are very common across most games.
Missing events due to client shutdown are prevalent in mobile games where an internet connection is required. Telemetry can go missing when the internet connection is lost or the game is sent to the background by a phone call or an incoming message. Missing client-shutdown events are also typical on other platforms, primarily during development stages, when client crashes are not uncommon. Understanding which critical events get lost can help you decide whether you need to collect events in a local cache or create server-side redundancies for client-side events. A sketch of such a local cache follows.
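This sketch assumes a JSON-lines file as durable storage; the send() stub stands in for whatever transport the game actually uses:

# A minimal sketch of a client-side event cache that survives shutdowns.
import json
import os

CACHE_PATH = "telemetry_cache.jsonl"

def send(event) -> bool:
    """Placeholder transport; returns False on network failure."""
    return False  # assume offline in this sketch

def track(event):
    # Append to durable local storage first, so an abrupt client
    # shutdown cannot lose the event.
    with open(CACHE_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

def flush():
    """Retry cached events on startup or when connectivity returns."""
    if not os.path.exists(CACHE_PATH):
        return
    with open(CACHE_PATH) as f:
        events = [json.loads(line) for line in f]
    remaining = [e for e in events if not send(e)]
    with open(CACHE_PATH, "w") as f:
        for e in remaining:
            f.write(json.dumps(e) + "\n")

track({"event": "client_shutdown", "session_id": "s-1"})
flush()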
As mentioned before, patchy data in multiplayer settings is routine, with a typical scenario in which only some players send complete data. In many cases, meta-events are implemented for multiplayer games to compensate for potential data loss. Such an event picks up data from the server whenever a match or mission is started or completed. It may not provide the needed granularity, but it can act as a partial buffer for missing data points. In some cases, further buffers may be required.
Transaction data can be lost because of app behavior when connecting to the transaction service. For example, a user makes a purchase in the app store, a password pop-up appears, and the game disconnects. In this case, the server knows that the transaction has happened, but the client enters a bad state and the telemetry never receives this information. We recommend not relying on client-side data for monetary transactions, or at the very least adding a backup solution that identifies the acquired currency upon client reconnection. A sketch of such a reconciliation follows.
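The record shapes below are illustrative; in practice, the server-side ledger would come from the platform's receipt validation:

# A minimal sketch of reconciling server-side transaction records with
# client telemetry, then backfilling transactions the client never sent.
server_ledger = [
    {"txn_id": "t-1", "user": "u-9", "sku": "gems_small", "status": "ok"},
    {"txn_id": "t-2", "user": "u-9", "sku": "gems_large", "status": "ok"},
]
client_events = [
    {"event": "transaction", "txn_id": "t-1"},
    # t-2 is missing: the client disconnected mid-purchase.
]

reported = {e["txn_id"] for e in client_events if e["event"] == "transaction"}
missing = [t for t in server_ledger
           if t["status"] == "ok" and t["txn_id"] not in reported]

for txn in missing:
    # Backfill the analytics store from the authoritative server record.
    print(f"backfilling transaction {txn['txn_id']} for {txn['user']}")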
Failed transactions register as successful. This issue is very common and extremely important to identify and resolve, as it can significantly alter the findings of your premium-economy analysis.