Data Validation


Ideally, our model training data would consist of the complete medical files for tens of millions of individuals along with associated outcomes, e.g. medical costs, LTD costs, death, etc. The medical data would include not only standardized diagnosis and procedure codes but also lab tests, doctors' notes, etc. The standardized codes are usually available due to insurance reimbursement requirements, but the ancillary data is often fragmentary.

For our Life and LTD products, this idealized data does not exist, especially with the outcomes consistently linked and coded. Bayesian modeling can overcome the linkage issue, but it cannot overcome the limited detail on the outcome side: death records and LTD claims do not include procedure codes, lab tests, or doctors' notes. These limitations are outweighed by the advantages of using a large population of outcomes with diagnosis codes to calculate risks across a large number of clinical conditions. For example, our death model can utilize over two decades of data, approximately 15 million working-age deaths, allowing us to make informed, refined estimates on everything from relatively rare infectious diseases to bed sores. For more common conditions, the sheer size of the data allows for more fine-grained estimates based on disease progression.

TruRisk assumes all data is limited and/or flawed. We can provide accurate estimates only if we understand the issues in each data set. For each production cycle, we spend as much effort validating the data as we do modeling it.

All data must pass a multitude of filters designed to highlight erroneous or missing data. The errors are sometimes obvious, but uncovering missing data requires a more subtle approach. Because groups with missing claims look healthier than they actually are, we have developed many different indicators for this issue, e.g. claim density metrics, unusual demographics, and clinical casetype distributions.
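As a simplified illustration of one claim density indicator, here is a minimal Python sketch. The `GroupExperience` structure, the field names, the benchmark rate, and the flagging threshold are all hypothetical; a production check would condition the expected rate on age, casetype, and severity.

```python
from dataclasses import dataclass

@dataclass
class GroupExperience:
    group_id: str
    member_months: int  # total months of member eligibility in the period
    claim_count: int    # claims observed over the same period

def flag_low_claim_density(groups, expected_rate, min_ratio=0.5):
    """Flag groups whose claims per member-month fall far below the
    expected rate, a common signature of missing claims data."""
    flagged = []
    for g in groups:
        if g.member_months == 0:
            continue  # no exposure, nothing to judge
        density = g.claim_count / g.member_months
        if density < min_ratio * expected_rate:
            flagged.append((g.group_id, density))
    return flagged

# Hypothetical benchmark: ~0.9 claims per member-month for this population.
suspect = flag_low_claim_density(
    [GroupExperience("A100", 1200, 1050), GroupExperience("B200", 800, 90)],
    expected_rate=0.9,
)
print(suspect)  # [('B200', 0.1125)] -> likely missing claims, not good health
```

A group that fails this check is not assumed to be healthy; it is queued for questions about claims completeness.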

Claims review: We review claims looking for a variety of errors, including:

  • Does the distribution of clinical case types match previous data submissions and national norms?
  • Does the density of claims match expectations based on age, clinical casetype, and clinical severity?
  • Does the claim diagnosis conform to other information available about the claimant, e.g. flagging a pregnancy code on a male member or a newborn diagnosis on an adult? (See the sketch below.)
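A minimal sketch of the demographic-conformity check, assuming ICD-10 coding (chapter "O" codes cover pregnancy and childbirth; chapter "P" codes cover conditions originating in the perinatal period). The record layout and field names are placeholders for whatever the client feed uses.

```python
def diagnosis_conflicts(claim, member):
    """Return demographic conflicts for one claim.

    Assumes ICD-10 coding: 'O' codes cover pregnancy and childbirth,
    'P' codes cover conditions originating in the perinatal period.
    """
    issues = []
    code = claim["icd10"].upper()
    if code.startswith("O") and member["sex"] == "M":
        issues.append("pregnancy diagnosis on a male member")
    if code.startswith("P") and member["age"] >= 18:
        issues.append("newborn/perinatal diagnosis on an adult")
    return issues

# O80 is an uncomplicated-delivery code; flagged here because the member is male.
print(diagnosis_conflicts({"icd10": "O80"}, {"sex": "M", "age": 34}))
```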

Example member checks (a simplified sketch follows the list):

  • If multiple member records are provided, do they have consistent information?
  • Is the member eligible for ancillary product coverage, e.g. not a dependent, not a student, not on COBRA coverage, etc.?
  • Retiree coding is often missing or inconsistently collected. Regardless of the retirement indicator, does it appear that the member is part of a retiree subgroup? If so, those members can be excluded as they are not usually eligible for ancillary Life/LTD coverage.
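A sketch of the record-consistency and eligibility checks above. The field names (dob, sex, relationship, coverage_type) stand in for whatever the client feed actually provides.

```python
def member_record_issues(records):
    """Cross-check the records supplied for a single member ID.

    `records` is a list of dicts in submission order; field names are
    placeholders for the client feed's actual layout.
    """
    issues = []
    if len({r["dob"] for r in records}) > 1:
        issues.append("inconsistent date of birth across records")
    if len({r["sex"] for r in records}) > 1:
        issues.append("inconsistent sex across records")
    latest = records[-1]
    if latest["relationship"] in ("dependent", "student"):
        issues.append("not eligible for ancillary coverage: " + latest["relationship"])
    if latest["coverage_type"] == "COBRA":
        issues.append("not eligible for ancillary coverage: COBRA")
    return issues
```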

Sample group filters (a sketch follows the list):

  • Does the group have enough history to produce a reliable estimate?
  • Are claim density metrics reasonable and stable?
  • Has member turnover been so frequent that it renders the Bayesian expectation unreliable? Did most of the group's members join in the last few months, overwhelming the more reliable experience of the members who have been enrolled for a full year?
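A minimal sketch of the history and turnover filters, assuming only member start dates are available. The thresholds (12 months of history, half the group joining within 3 months) are illustrative, not our production values.

```python
from datetime import date

def group_filter_issues(member_start_dates, as_of,
                        min_history_months=12, max_recent_share=0.5):
    """Two simple group-level filters: enough history to estimate from,
    and a membership not dominated by very recent joiners (recent
    joiners dilute the credibility of the Bayesian expectation)."""
    issues = []
    months = [(as_of.year - d.year) * 12 + (as_of.month - d.month)
              for d in member_start_dates]
    if max(months) < min_history_months:  # longest tenure ~ group history
        issues.append("insufficient history for a reliable estimate")
    recent_share = sum(1 for m in months if m < 3) / len(months)
    if recent_share > max_recent_share:
        issues.append(f"{recent_share:.0%} of members joined in the last 3 months")
    return issues

print(group_filter_issues(
    [date(2024, 1, 1)] + [date(2025, 5, 1)] * 9, as_of=date(2025, 7, 1)))
# ['90% of members joined in the last 3 months']
```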

Example overall reviews (a sketch of a retention check follows the list):

  • Does the data fit together, both among its constituent pieces and over time?
  • Does the distribution of group clinical factors match its usual distribution? Are the expected seasonal changes in respiratory infections observable?
  • Did the member retention rate fall to zero because the ID encryption was changed? Was there some other change to the data warehouse that is making longitudinal comparisons impossible?
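A sketch of the retention check, assuming each period's member IDs can be collected into a set. A rate near zero is treated as a pipeline signal, not a demographic fact.

```python
def retention_rate(prior_ids, current_ids):
    """Share of last period's member IDs still present this period.
    A rate near zero usually signals an ID re-encryption or another
    warehouse change, not a genuine 100% turnover event."""
    if not prior_ids:
        return None  # no prior period to compare against
    return len(prior_ids & current_ids) / len(prior_ids)

rate = retention_rate({"a1", "b2", "c3"}, {"x9", "y8", "z7"})
if rate == 0.0:
    print("retention fell to zero: check for an ID encryption change")
```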

Why do we have this extensive, time-consuming audit process? Our long experience with client data has taught us that it is necessary. We receive completely bad, somewhat flawed, or just arbitrarily changed data on a disturbingly frequent basis. We understand which data issues can be ignored, which issues require additional questions to understand the impact, and which issues require a "No - we do not recommend making pricing decisions using this data." We find that the IT personnel producing the data often do not appreciate how some data problems will directly translate into negative financial outcomes.




Copyright © 2025 TruRisk LLC
Questions?  info@trurisk.com