Mitigating Identity Disclosure Risk: What I Learned at Summer Camp (well, the ICPSR Summer Program actually)

This past week I (Jen Darragh) attended a week-long workshop at ICPSR, on the University of Michigan campus in Ann Arbor, on assessing and mitigating the risk of disclosing identities (individual or organizational) when releasing data files for secondary use. The workshop was led by JoAnne McFarland O’Rourke, an ICPSR staff member and leading scholar in this area.

When it comes to releasing data on individuals (or even companies and other organizations), there are several things you need to consider and steps you have to take to ensure confidentiality. First, data distribution options and methods should be thought about before participants are even recruited. Informed consent should state explicitly that the data may be released for secondary use and that the necessary steps will be taken to protect participant confidentiality.
When preparing a data file for public use there is first a disclosure review and, if necessary, a full disclosure analysis (which involves a deeper look at the data). The most important thing in all of this is to strike a balance between high analytic utility and the lowest possible disclosure risk.
During initial disclosure review, JoAnne recommends five essential steps:
1. Review and remove direct identifiers: name, address, SSN, and linked ID numbers (Medicare IDs, insurance IDs, etc.). These are not necessary for analysis.
2. Review specific dates (birth date, marriage/divorce date, job start date, etc.) and remove or recode them as needed, especially the specific day. Calendar files need specific time codes created for them; restricted-use files may keep more detailed information.
3. Review specific geography and remove or recode it as needed. Keep only the level needed for analysis: lower-level geographies always lead to higher disclosure risk.
4. Review possible links to external files. Records can be linked by organization size, revenue/income, or other specific counts.
5. Re-number cases. In essence, get rid of the original ID. This can be done by sorting on a random variable first and then assigning new numbers.
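As a rough illustration, steps 1, 2, and 5 might look something like the Python sketch below. This is only a sketch under stated assumptions, not a procedure from the workshop: the field names (`name`, `ssn`, `orig_id`, `birth_date`, and so on) are hypothetical, and recoding a birth date to year alone is just one possible choice of recode.

```python
import random
from datetime import date

# Hypothetical direct-identifier fields; a real file will have its own list.
DIRECT_IDENTIFIERS = {"name", "address", "ssn", "medicare_id"}


def initial_review(records, seed=None):
    """Drop direct identifiers, coarsen birth dates, and re-number cases."""
    rng = random.Random(seed)
    cleaned = []
    for rec in records:
        # Step 1: remove direct identifiers (and the original case ID).
        row = {k: v for k, v in rec.items()
               if k not in DIRECT_IDENTIFIERS and k != "orig_id"}
        # Step 2: recode the specific date to year only (an assumed recode).
        if "birth_date" in row:
            row["birth_year"] = row.pop("birth_date").year
        cleaned.append(row)
    # Step 5: sort on a random variable, then assign new sequential IDs.
    keyed = [(rng.random(), row) for row in cleaned]
    keyed.sort(key=lambda pair: pair[0])
    return [{"case_id": i, **row}
            for i, (_, row) in enumerate(keyed, start=1)]


# Made-up example records, for illustration only.
sample = [
    {"orig_id": 101, "name": "A", "ssn": "x", "birth_date": date(1980, 5, 17), "income": 10},
    {"orig_id": 102, "name": "B", "ssn": "y", "birth_date": date(1975, 1, 2), "income": 20},
    {"orig_id": 103, "name": "C", "ssn": "z", "birth_date": date(1990, 12, 31), "income": 30},
]
public_file = initial_review(sample)
```

Steps 3 and 4 are left out on purpose: deciding which geography to keep and which external files could be linked depends on knowing the data and the field, not on a few lines of code.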
Other techniques that can be employed to reduce disclosure risk include top- and bottom-coding to eliminate extreme outliers, creating range variables from continuous numeric variables (e.g. single years of age to five-year ranges), and developing ratios (e.g. paid lunches to free lunches in schools).  More complicated methods such as data swapping (used by the Census Bureau), blanking and imputing, and microaggregation can also be used depending on the level of detail you want to release in the file.  As the techniques become more sophisticated, a full understanding of the statistics behind them and how they impact the final file is absolutely necessary.
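The simpler techniques in that list (top- and bottom-coding, range variables, and ratios) can be sketched in a few lines of Python. The cut-offs, the five-year width, and the function names here are assumptions chosen for illustration; appropriate thresholds depend on the file in question.

```python
def top_bottom_code(value, lower, upper):
    """Clamp a value into [lower, upper] so extreme outliers are masked."""
    return max(lower, min(value, upper))


def age_range(age, width=5):
    """Recode a single year of age into a range label such as '35-39'."""
    start = (age // width) * width
    return f"{start}-{start + width - 1}"


def paid_to_free_ratio(paid, free):
    """Release a ratio instead of the two raw counts; None if free is 0."""
    if free == 0:
        return None
    return paid / free


# Made-up incomes, with hypothetical cut-offs of 15,000 and 150,000.
incomes = [12000, 48000, 95000, 2500000]
coded = [top_bottom_code(x, 15000, 150000) for x in incomes]
# coded == [15000, 48000, 95000, 150000]

label = age_range(37)  # "35-39"
```

Data swapping, blanking and imputing, and microaggregation are not shown, since doing them properly requires the statistical understanding described above.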
ICPSR is fortunate in that it has a disclosure review board with expert staff. Universities that lack a resource such as ICPSR, but are in the business of helping faculty manage and share their data, should consider developing their own disclosure review board. This board should involve researchers with expertise in the field (they often know of potential external file linkages), experienced statisticians, a person knowledgeable about data management and dissemination (data librarian, archivist, etc.), and potentially a university legal representative. The board should work closely with the PIs to ensure that the highest analytical utility of the file is preserved (no one knows the data like the PI).
I would highly recommend this workshop to anyone involved in human subjects research: researchers, IRB personnel, data librarians/archivists, graduate students, and even representatives from university research administration.
Further Reading:
Checklist on Disclosure Potential of Proposed Data Releases:
Disclosure Analysis at ICPSR -
O’Rourke et al. (2006). Solving Problems of Disclosure Risk While Retaining Key Analytic Uses of Publicly Released Microdata. Journal of Empirical Research on Human Research Ethics, 1(3), pp. 63-84.