Part of a series on Sun’s OpenID@Work initiative; see the introduction for more context.
Data governance means knowing what happens to stored data, particularly when that data includes PII (personally identifiable information), as the data held by the OpenID IdP does. Using OpenID isn’t the reason we keep this information; any registration system keeps at least some information about the people who have accounts on it, even if it’s only a name, email, and password (or openid identifier). I thought it might be useful to others to see some of the basic steps we went through when discussing how to protect that PII, and some of the decisions we made about what data to keep and what to discard. If you’re setting up a registration system yourself, you may make completely different decisions, depending on what information you’re keeping and what your registration system is being used for.
Obviously, step 1 is to make someone responsible for figuring it out. In our case, that person was me, with the grand title of “Data Steward” in Sun’s process. Yes, there’s a process to be followed and checklists to be filled out, and people whose job it is to help us figure it all out (the Chief Privacy Office with Michelle Dennedy and her team). What you need to do is:
- figure out what data you need to have, whether for technical or policy reasons
- figure out who will need access to the data
- figure out how to prevent people accessing the data who don’t need access
- figure out when you can destroy the data
- write the decisions up and make the information available
- What data needs to be kept?
In this service, people can use fake names, but often choose to use their real ones. For compliance reasons, in case there needs to be an investigation into an allegation of wrongdoing by a user, we need to keep the employee ID that was used to sign up for the openid identifier. Even after the openid account is closed, the information is kept for a set period of time to allow any problems to surface. Yes, the users are warned about this during the registration process.
The web server logs are in the Common Log Format, which includes a record of the HTTP GET request from the consuming site (relying party) asking for authentication of the openid identifier. This HTTP GET request includes the openid identifier and the site’s URL, thus allowing correlation of who went where (though not what they did after logging in). This happens with every OpenID Identity Provider that keeps web server logs, which I would guess is basically all of them, so it’s certainly not a problem that is specific to Sun’s service. Every OpenID IdP could perform such correlations about its users. This is not necessarily a problem, and some people would say that being able to see that an openid identifier was used in different places allows reputations to be built, but it also has privacy implications. I might not want my employer (or anyone else, for that matter) knowing what sites I visit, how often, and when. So on principle we mask the data, so that we can see how often a site is visited, but not who’s doing the visiting.
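To make the masking idea concrete, here is a minimal sketch in Python. It is not the script Sun actually runs. It assumes that the identifying values arrive as the `openid.identity` and `openid.claimed_id` query parameters of the request line (the names used by OpenID authentication requests), and the `MASKED` placeholder and the function name are made up for illustration. It only touches the query string; the client address at the start of a Common Log Format line would need separate treatment.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit

# Query parameters assumed to identify the user in an OpenID authentication request.
IDENTITY_PARAMS = {"openid.identity", "openid.claimed_id"}

# The request line is the quoted 'METHOD target PROTOCOL' field of a log entry.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) (?P<proto>[^"]+)"')


def mask_log_line(line: str) -> str:
    """Return the log line with identifying query parameters replaced.

    Lines without a recognisable request field are returned unchanged.
    """

    def _mask(match: re.Match) -> str:
        parts = urlsplit(match.group("target"))
        # Keep every parameter, but overwrite the values that name the user.
        query = [
            (key, "MASKED" if key in IDENTITY_PARAMS else value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
        ]
        target = parts.path + ("?" + urlencode(query) if query else "")
        return f'"{match.group("method")} {target} {match.group("proto")}"'

    return REQUEST_RE.sub(_mask, line)
```

With something like this, the relying party’s URL in `openid.return_to` survives, so visits per site can still be counted, while the identifier itself no longer appears in the stored line.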
- Who needs access to the data?
If there is an allegation of wrongdoing on the part of a user, then Corporate Compliance may need access to the information about whose openid identifier it is, and access to the web server logs showing whether the user actually did log in to the web site in question. This data is only passed on after review of the allegations by Sun’s legal team.
Apart from that, support personnel need access to the openid accounts to help people with things like forgotten passwords (if they forgot to set a secret question), or deleting the account on a voluntary basis. The user has to file a support request using Sun’s internal support system, and the employee ID of the person filing the request has to match that of the owner of the account.
Engineering may need access to some of the files for debugging. There is also a script that runs over the web server logs and extracts records of which sites were visited and when, discarding all information about which user visited each site.
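A script of that kind could look roughly like the following Python sketch. This is an illustration under assumptions, not the actual script: it assumes Common Log Format lines, takes the visited site from the host of the `openid.return_to` parameter, and counts visits per site per day. The function name `site_visits` is hypothetical.

```python
import re
from collections import Counter
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# host ident authuser [timestamp] "METHOD target PROTOCOL" status bytes
CLF_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:GET|POST) (?P<target>\S+) [^"]*" \d{3} \S+'
)


def site_visits(lines):
    """Count authentication requests per (site, day), ignoring who made them."""
    counts = Counter()
    for line in lines:
        match = CLF_RE.match(line)
        if not match:
            continue  # not a Common Log Format line
        params = parse_qs(urlsplit(match.group("target")).query)
        return_to = params.get("openid.return_to")
        if not return_to:
            continue  # not an authentication request
        site = urlsplit(return_to[0]).hostname
        if site is None:
            continue
        # Keep only the date; the identifier and client address are never read.
        day = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z").date()
        counts[(site, day.isoformat())] += 1
    return counts
```

The point of the design is that the output contains only site names, dates, and counts, so it can be shared more widely than the raw logs.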
- Restrict access
Only a few people have access to the accounts: support, engineering, and me as data governance steward. That access is controlled through operating-system access control. The same applies to the logs, and everyone who has access has gone through training to ensure they know the privacy conditions applying to the use of the information (i.e., it is used only for debugging, or for support once the user’s identity is verified, as above).
As a side note, to log in to my account on the machines, I have to log in to Sun’s internal network, ssh from there to the machine I want to access, and then log in with my standard Sun credentials followed by a one-time password that uses a challenge-response mechanism with a secret passphrase. Then I need to su to the appropriate user account, using yet another password (of course).
- Destroying Data
Once an account has been deactivated, either because the employee left Sun or because they asked for it to be deleted, it remains inactive for 6 months. Once that time has passed, the account is deleted. The web server logs are deleted automatically after 6 months. This period was chosen because it seemed to satisfy both the privacy principle (delete as soon as possible) and the corporate compliance principle (keep the data around for a reasonable length of time, just in case it’s needed).
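The retention rule is simple enough to express as a date check. The Python sketch below is only an illustration: the function names are hypothetical, and treating “6 months” as calendar months (clamped to the last day of shorter months) is an assumption, since the policy as described doesn’t say how the period is counted.

```python
import calendar
from datetime import date

RETENTION_MONTHS = 6  # the retention period described above


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the end of the target month, so 31 August plus
    six months lands on the last day of February.
    """
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_due_for_deletion(deactivated_on: date, today: date) -> bool:
    """True once the retention period since deactivation has fully elapsed."""
    return today >= add_months(deactivated_on, RETENTION_MONTHS)
```

A nightly job could call `is_due_for_deletion` for each deactivated account, and apply the same check to the date of each rotated log file.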
Once it was all figured out and reviewed by the privacy specialists in Sun, documenting it was the easy part (just like writing standards, really: reaching consensus is the difficult bit). So we have information in the disclaimer that people need to agree to when they sign up for an account, in the user policy, and in the FAQ; the more formal checklists and related documents are available from the Sun-internal project site. And people can always ask me, or email one of the mailing lists we have, if they have any questions.