I have been pointing out for a while that we have giant databases of patient medical records (e.g., Kaiser Permanente has about 12 million patients) that could be subjected to Big Data analyses to get a better feel for the relative risks of coronavirus infection in different workplaces. For example, if schools were reopened, how much at risk would teachers be? How does the risk of mass transit compare to the risk of flying, as judged by infection rates of bus drivers and stewardesses?
A reader explains the subtle ways privacy rules hamstring massive data-mining of medical records for clues about how best to fight the epidemic:
I work in an obscure branch of healthcare IT, and I am pretty familiar with HIPAA and HITECH provisions. HIPAA has an explicit provision for de-identification, but here’s why it doesn’t help. There are two different ways to de-identify. The easiest is the so-called “safe harbor” provision. It’s so straightforward, anyone can understand it. But part of it requires all dates (including disease onset, diagnosis, treatment, etc.) to be reduced to the year alone. So it’s kind of useless for what you want it for. Also, the government reserved the right to come after you, even if you use the safe harbor. There’s some fine print that says that if you have reason to believe that the data can be re-identified in any way, you can’t use the safe harbor (like if your dataset includes the diagnosis V95.43XS: Spacecraft collision injuring occupant). Entire books have been written on the perils of de-identification, so it ought to make potential anonymizers nervous.
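To see why year-only dates are a problem for tracking an epidemic, here is a minimal Python sketch. The patient labels and onset dates are made up for illustration, and `safe_harbor_year` is a hypothetical helper name, not anything defined by HIPAA; it only models the date rule, not the rest of the safe harbor list.

```python
from datetime import date

def safe_harbor_year(d: date) -> int:
    """Keep only the year of a date, dropping month and day (illustrative helper)."""
    return d.year

# Hypothetical onset dates for three made-up patients.
onsets = {
    "patient_a": date(2020, 3, 2),
    "patient_b": date(2020, 3, 30),
    "patient_c": date(2020, 11, 15),
}

# After generalization, every onset collapses to the same value.
generalized = {pid: safe_harbor_year(d) for pid, d in onsets.items()}
print(generalized)
```

Once only the year survives, an infection in early March and one in mid-November look identical, so week-by-week comparisons of workplace risk are no longer possible.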
The second method is the “expert determination” method. Basically it means hiring a statistician familiar with these kinds of protocols to figure out how you can get what you want while maintaining anonymity. It’s a small fraternity, and they are booked months or even years out by data miners, pharmaceutical companies, etc. It becomes a negotiation in which you give up some bit of interesting data in order to keep some other bit that’s even more interesting. The expert has to build and test models of your particular method and estimate the re-identification risk. It takes months, and there is no guarantee you will even arrive at a satisfactory solution. There is no official certification for these guys, so the government can always come back and claim your expert was not expert enough. And there’s no objective threshold for acceptable risk, either.
Every time you mention Chetty, I cringe. I have looked in the past, but I can’t find any detailed documentation of the anonymization protocols the IRS used. Here is the closest thing I could find, but it doesn’t lay out the protocols. I’m going to make a bold claim: there was a massive risk of re-identification in the datasets Chetty got his hands on. It’s naive to think that stripping out the obvious personally identifying data is sufficient. There are massive publicly available consumer datasets out there that can be correlated with your “anonymized” data to re-identify individuals.
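The kind of correlation described above can be sketched in a few lines of Python. Everything here is invented for illustration: the records, the names, the choice of ZIP code, birth year, and sex as the shared fields, and the helper names `key` and `link`. It says nothing about what the IRS data actually contained or how it was protected.

```python
from collections import defaultdict

# Made-up "anonymized" records: names stripped, other fields kept.
anonymized = [
    {"zip": "02138", "birth_year": 1961, "sex": "F", "income": 48000},
    {"zip": "02138", "birth_year": 1975, "sex": "M", "income": 910000},
    {"zip": "94305", "birth_year": 1975, "sex": "M", "income": 72000},
]

# Made-up public consumer file with names attached.
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1961, "sex": "F"},
    {"name": "Bob Example", "zip": "02138", "birth_year": 1975, "sex": "M"},
    {"name": "Carl Example", "zip": "94305", "birth_year": 1975, "sex": "M"},
    {"name": "Dan Example", "zip": "94305", "birth_year": 1975, "sex": "M"},
]

# Fields assumed to appear in both files (an assumption for this sketch).
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def key(row):
    """Build the join key from the shared fields."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)

def link(anonymized_rows, public_rows):
    """Return {name: income} for records that match exactly one public name."""
    names_by_key = defaultdict(list)
    for person in public_rows:
        names_by_key[key(person)].append(person["name"])
    hits = {}
    for row in anonymized_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the record
            hits[names[0]] = row["income"]
    return hits

reidentified = link(anonymized, public)
print(reidentified)
```

In this toy data, two of the three "anonymous" records match exactly one named person. The third is protected only because two people happen to share its ZIP code, birth year, and sex. The more fields a dataset keeps, the fewer such coincidences there are to hide behind.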
I presume the guys working for Chetty at Harvard and Stanford on his Equality of Opportunity project are extremely good at what they do, so that if they felt like identifying Trump’s or Gates’ or Bezos’ 2012 tax return, they probably could. On the other hand, it would be a big faux pas for their careers for this to leak out, so perhaps they’ve all chosen to be above reproach.
Anyway, my point is that Chetty and Co. have had access to information since 2013 that nobody before them ever had the gall to imagine they could get … and the world hasn’t come to an end because of it. Maybe somebody could do something similar to help figure out how to get out of this medical and economic hole we’re in?