Data Validation

Ideally, our model training data would consist of the complete medical files for tens of millions of individuals, along with associated outcomes such as medical costs, LTD costs, and death. The medical data would include standardized diagnosis and procedure codes, as well as lab tests, doctors' notes, and similar records. The standardized codes are usually available because insurance reimbursement requires them, but ancillary data is often fragmentary.

For our Life and LTD products, this idealized data does not exist, especially with the outcomes consistently linked and coded. Bayesian modeling can overcome the linkage issue, but it is limited by how little comparable information accompanies the outcome data; for example, death data and LTD claims do not have procedure codes, lab tests, or doctors' notes. These limitations are outweighed by the advantages of using a large population of outcomes with diagnosis codes to calculate risks across a large number of clinical conditions. For example, our death model can draw on over two decades of data, approximately 15 million working-age deaths, allowing us to make informed, refined estimates on everything from relatively rare infectious diseases to bed sores. For more common conditions, the large size of the data allows more fine-grained estimates based on disease progression.

TruRisk assumes all data is limited and/or flawed. We can provide accurate estimates only if we understand the issues in each data set. For each production cycle, we spend as much effort validating the data as we do modeling it. All data must pass a multitude of filters designed to highlight erroneous or missing data. Errors are sometimes obvious, but uncovering missing data requires a more subtle approach. Because groups that are missing claims will look healthier than they actually are, we have developed many different indicators for this issue, e.g. claim density metrics, unusual demographics, and clinical casetype distributions.
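To make the idea of a claim density metric concrete, here is a minimal sketch in Python. It is an illustration only, not TruRisk's actual filter: the function names, the input layout, and the 50% threshold are all assumptions chosen for the example. It computes claims per member-month for each group and flags any group whose density falls well below the median across groups, which is one simple way a group with missing claims might stand out.

```python
from statistics import median


def claim_density(claim_count, member_months):
    """Claims per member-month; None when the group has no exposure."""
    if member_months <= 0:
        return None
    return claim_count / member_months


def flag_low_density_groups(groups, ratio_threshold=0.5):
    """Return sorted IDs of groups that may be missing claims.

    `groups` maps a group ID to (claim_count, member_months).
    A group is flagged when its density is below
    ratio_threshold * median density, or when it has no exposure.
    The 0.5 default is a hypothetical cutoff, not a recommended value.
    """
    densities = {gid: claim_density(c, mm) for gid, (c, mm) in groups.items()}
    valid = [d for d in densities.values() if d is not None]
    if not valid:
        # No usable exposure anywhere: every group needs review.
        return sorted(densities)
    benchmark = median(valid)
    flagged = [
        gid
        for gid, d in densities.items()
        if d is None or d < ratio_threshold * benchmark
    ]
    return sorted(flagged)


# Hypothetical example data: (claim_count, member_months) per group.
groups = {
    "A": (1200, 1000),
    "B": (1100, 1000),
    "C": (300, 1000),   # far fewer claims than its peers
    "D": (0, 0),        # no exposure reported at all
}
flagged = flag_low_density_groups(groups)
print(flagged)  # ['C', 'D']
```

A flag here is only a prompt for further questions. A low density can reflect a genuinely healthy or young population, which is why an indicator like this would be read alongside the demographic and casetype checks rather than on its own.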
Claims review: We review claims looking for a variety of errors, including:
Example Member checks:
Sample Group filters:
Example Overall Reviews:
Why do we have this extensive, time-consuming audit process? Our long experience with client data has taught us that it is necessary. We receive completely bad, somewhat flawed, or just arbitrarily changed data disturbingly often. We understand which data issues can be ignored, which issues require additional questions to understand the impact, and which issues require a "No - we do not recommend making pricing decisions using this data." We find that the IT personnel producing the data often do not appreciate how some data problems translate directly into negative financial outcomes.