Posted: 2006-09-03 16:05:02
An “anonymized” dataset [439 MB compressed, 2 GB expanded] of 20 million web queries collected from 650,000 AOL users was publicly released a few weeks ago by AOL as a goodwill gesture towards the research community… Shortly after, damage control set in and the data was removed.
The great mistake here, made by AOL, was to associate each query with a corresponding ‘random user id’, thereby grouping together all queries performed by a particular user.
While the ‘random user id’ might not provide a link to the actual AOL user, the search queries themselves do… Have you ever searched for your name, address, phone number, social security number, package tracking number, friends, family, or place of work? Any one particular search query, or a given combination, can reveal your identity and, much worse, link you to your entire search history.
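To make the grouping problem concrete, here is a minimal sketch in Python. It assumes each record is just a (user id, query) pair; the sample records, the `PII_PATTERNS` table, and the function names are all hypothetical and made up for illustration, not taken from the released data. The point is only that once queries share an id, a single revealing query taints the whole history.

```python
import re
from collections import defaultdict

# Hypothetical patterns for queries that may carry identifying details.
# Real re-identification would rely on far more than three regexes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "zip": re.compile(r"\b\d{5}\b"),
}


def group_by_user(records):
    """Collect every query under the pseudonymous id that issued it."""
    history = defaultdict(list)
    for user_id, query in records:
        history[user_id].append(query)
    return dict(history)


def flag_identifying(history):
    """Map each user id to the sorted pattern names its queries matched."""
    flagged = {}
    for user_id, queries in history.items():
        hits = {
            name
            for query in queries
            for name, pattern in PII_PATTERNS.items()
            if pattern.search(query)
        }
        if hits:
            flagged[user_id] = sorted(hits)
    return flagged


# Made-up records for illustration only.
SAMPLE = [
    (1001, "jane doe springfield"),
    (1001, "555-123-4567 reverse lookup"),
    (1001, "divorce lawyers 12345"),
    (2002, "cheap flights"),
    (2002, "weather tomorrow"),
]

history = group_by_user(SAMPLE)
flagged = flag_identifying(history)
print(flagged)
```

In this toy sample, user 1001 is flagged because of a phone number and a zip code, and the other two queries under the same id (a name and a very personal search) come along for the ride. User 2002 is not flagged here, but that says nothing about whether a human reader could still identify them.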
The sad fact is that everything you search for, every site you visit, and just about everything you do on the Internet is logged, processed, cross-referenced, and stored in a number of databases. This data can either be connected to you through your ISP records (directly by name/user-id), or be cross-referenced with other data to get at your identity.
Companies want to know as much as possible about you, all to sell you products and services. Governments want to grab as much power as possible. Meanwhile, ISPs, search engines, and other players are more than happy to gather, use, and sell that personal/private information to anyone who will pay, in cash or in credit [also known as favorable government contracts; think AT&T and NSA].
Here is a link to the mirrored AOL data…
This was a screw up, and we’re angry and upset about it. It was an innocent enough attempt to reach out to the academic community with new research tools, but it was obviously not appropriately vetted, and if it had been, it would have been stopped in an instant.
Although there was no personally-identifiable data linked to these accounts, we’re absolutely not defending this. It was a mistake, and we apologize. We’ve launched an internal investigation into what happened, and we are taking steps to ensure that this type of thing never happens again.
Here was what was mistakenly released:
* Search data for roughly 658,000 anonymized users over a three month period from March to May.
* There was no personally identifiable data provided by AOL with those records, but search queries themselves can sometimes include such information.
* According to comScore Media Metrix, the AOL search network had 42.7 million unique visitors in May, so the total data set covered roughly 1.5% of May search users.
* Roughly 20 million search records over that period, so the data included roughly 1/3 of one percent of the total searches conducted through the AOL network over that period.
* The searches included as part of this data only included U.S. searches conducted within the AOL client software.
We apologize again for the release.
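The percentages quoted in that statement can be sanity-checked with a few lines of arithmetic. This is a rough sketch that takes the stated figures at face value; the variable names are mine, and since every input is given as "roughly", the derived totals are estimates, not numbers AOL published.

```python
# Figures as quoted in the statement above (all approximate).
users_in_release = 658_000
may_unique_visitors = 42_700_000
records_in_release = 20_000_000
share_of_all_searches = (1 / 3) / 100  # "1/3 of one percent"

# Share of May search users covered by the release.
user_share = users_in_release / may_unique_visitors

# Total searches over the three months implied by the quoted share.
# This is derived here, not stated anywhere in the statement.
implied_total_searches = records_in_release / share_of_all_searches

# Average number of records per released user id.
records_per_user = records_in_release / users_in_release

print(f"user share: {user_share:.2%}")
print(f"implied total searches: {implied_total_searches:,.0f}")
print(f"records per user: {records_per_user:.1f}")
```

The user share comes out at about 1.5%, matching the statement, and the implied total is on the order of 6 billion searches for the period. The average of roughly 30 queries per user id is the more telling number: that is a lot of context to attach to one ‘random’ id.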
Some other interesting data…