PII. The Great Taboo of Data Analysis

By JenWaller, licensed under CC BY-NC-SA 2.0

Many of us have had our personal information stolen from the Internet, some more than once. Even governments can’t prevent the thefts. Professionals who work with data come under a lot of scrutiny because of PII.

What Is PII?

Personally identifiable information (PII) is any data that can be used to identify a specific individual. PII is used mainly within the U.S.; Personal Data is the approximate equivalent of PII in Europe. Examples of PII include:

  • Full name. This depends on your name and the population you’re looking in. If you’re looking for Charlie Kufs in the U.S., you’ll find just one. If you’re looking for John Smith in the world, you’ll find quite a few.
  • Home address. A full address is usually unique although there may be aliases. A name plus a home address is good enough for a voter registration.
  • Date of birth. A lot of people have the same date-of-birth but tie it to a name and you’ll probably have a unique identification.
  • Email addresses and telephone numbers. Everybody seems to have many, some of which may be linked back to a real person.
  • Personal IDs. Social security number, passport number, driver’s license number, and credit card numbers. These are unique and will identify a specific individual even better than a name.
  • Computer-related details. Log-in and usage information, device IDs, IP addresses, GPS tracking, cell phone records, and social media links. Police caught the BTK killer because he sent them a document that had his name and location in the metadata.
  • Biometric data. Finger and palm prints, retinal scans, and DNA profile. They’ll find you; they always do.

To these PII elements I would add information that would not be able to identify a specific person unless combined with other information. Examples include: gender, race, age range, former address, and education and work history. There is also personal information that might be a password security question—first pet, grandmother’s maiden name, least favorite boss, first person you kissed, favorite teacher—although I don’t know why this information would be in a database.
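The point about combinations can be illustrated with a small sketch. It counts what share of records become unique as more of these fields are combined. The field names and the six records are made up for illustration; they are not from any real data set.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Fraction of records whose combination of values in `fields` is unique."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)

# Hypothetical records; none of these fields identifies anyone on its own.
people = [
    {"gender": "F", "age_range": "30-39", "degree": "BS"},
    {"gender": "F", "age_range": "30-39", "degree": "MS"},
    {"gender": "M", "age_range": "30-39", "degree": "BS"},
    {"gender": "M", "age_range": "40-49", "degree": "BS"},
    {"gender": "F", "age_range": "40-49", "degree": "MS"},
    {"gender": "M", "age_range": "30-39", "degree": "BS"},
]

for fields in (["gender"], ["gender", "age_range"], ["gender", "age_range", "degree"]):
    print(fields, round(unique_fraction(people, fields), 2))
```

In this toy example nobody is unique on gender alone, a third of the records are unique on gender plus age range, and two thirds are unique once degree is added.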

Where Does PII Come From?

PII can come from a variety of sources. You can generate PII by conducting surveys, for instance. You can obtain it from a caretaker, like the Human Resources department of an organization. Or you can collect it in a variety of ways from the Internet.

Once you have your PII, you have to scrub it, which is a whole other discussion. Ultimately, you have to decide what is worth keeping for your analysis and what should be deleted immediately to prevent its loss. If you keep it, be sure you know which analyses you intend to conduct and how you’re going to safeguard the data when you’re not actively using it.

I’ve analyzed a lot of data in my career in both the private sector and in government. Federal PII requirements are much stricter. We had to complete training on PII and computer security every year. I wasn’t supposed to keep PII on my work computer or in the cloud; it was only supposed to reside on the secure government server. The data set had to be password-protected and not shared with co-workers without a “business need to know.” And I wasn’t working for the NSA either. This was the General Services Administration. They manage federal buildings and provide office supplies and other things. They have some sensitive but unclassified (SBU) data but nothing classified as even Secret. Nonetheless, I used some types of PII data quite often.

I obtained data from a variety of sources. My co-workers in our data analytics group (I can’t remember what it was called; it went through several name changes) provided most of the internal business data. I often supplemented that with data from sources on the Internet, usually other government databases, some public and some restricted. PII data came from Human Resources, usually requiring approvals from at least one higher level of management. Sometimes requests had to go through Headquarters offices in Washington D.C. I also conducted my own surveys, which also had to be approved by Headquarters. With all those different sources, data compilation and scrubbing was always an adventure.

The data sets I analyzed were minuscule by big-data standards. I almost always had fewer than thousands of rows and hundreds of columns. Typically, I had just hundreds of rows and dozens of columns. Most of the PII data I analyzed came from individuals in the same organization I worked in, though I did do a few analyses for outside organizations. Consequently, I was usually able to develop a good rapport with my data.

I rarely received social security numbers and other personal ID numbers. They may have been useful for sorting and data merges but there were always other data elements that could be used instead. I never had a reason to use them so I deleted them immediately. I also routinely deleted log-in details, telephone numbers, and all but one of the multiple versions of name and email address that might be in a data set, mostly because they were an extraneous nuisance. Other PII that I saw often was an employee’s ID number and organizational unit, which I usually kept, and their supervisor, which I usually deleted as superfluous.
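That kind of scrubbing can be sketched in a few lines. This is only an illustration, assuming the records arrive as Python dictionaries; the field names and the sample record are hypothetical, and a real data set will have its own labels and its own list of what to drop.

```python
# Hypothetical field names for the items described above as deleted on receipt.
DROP_FIELDS = {"ssn", "passport_number", "login", "phone", "supervisor"}

def scrub(records, drop_fields=DROP_FIELDS):
    """Return copies of the records without the fields listed in drop_fields."""
    return [
        {key: value for key, value in record.items() if key not in drop_fields}
        for record in records
    ]

# Made-up record for illustration only.
staff = [
    {"employee_id": "E001", "org_unit": "Region 3", "ssn": "000-00-0000",
     "phone": "555-0100", "supervisor": "J. Doe", "birth_date": "1970-01-15"},
]

clean = scrub(staff)
print(clean)
```

The function builds new dictionaries instead of editing the originals, so the raw data stays untouched on the secure server until you decide to delete it there.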

By Librarian Avenger, licensed under CC BY 2.0

What Do Analysts Do With PII?

My analyses that involved PII covered a range of business issues involving staff, both individually and in groups. Examples included: staff recruitment, hiring, demographics, engagement, satisfaction, morale, productivity, capabilities, and workload; telework and wanderwork; and usage and preferences for data, cell phones, and computer hardware and software.

For these analyses, I used name, email address, and employee ID number for sorts and merges. I used home address for one analysis to assess employee commutes. I used race on one occasion to assess hiring practices. Getting that information was involved and required a lot of persistence. I used log-in information for online surveys to evaluate survey difficulty and patterns of responses.

I used sex and date-of-birth on almost every analysis I conducted concerning staff. In all those analyses, sex was never a significant factor. Still, it was important to verify that non-significance. I used birth date to calculate age. From that I could also determine the age they joined the agency and a few other age-related employment factors. Age was a significant factor in a great many of my analyses. I also used age to evaluate employees’ generations. My boss was a Gen-Xer who was convinced Millennials did not behave like older staff members. None of the analyses he had me do suggested that generation was any more than a minor factor. In those cases, the ratio-scaled age was a much better explanatory variable.
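Here is a rough sketch of those age calculations using Python’s standard library. The dates are invented, and the generation cutoffs are an assumption (they follow one commonly cited set of birth-year ranges; other sources draw the lines a little differently).

```python
from datetime import date

def age_on(birth_date, as_of):
    """Whole years completed between birth_date and as_of."""
    years = as_of.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def generation(birth_year):
    """Label a birth year with a generation; the cutoffs are an assumption."""
    if 1946 <= birth_year <= 1964:
        return "Baby Boomer"
    if 1965 <= birth_year <= 1980:
        return "Generation X"
    if 1981 <= birth_year <= 1996:
        return "Millennial"
    return "Other"

# Invented dates for illustration only.
birth = date(1975, 6, 30)
hired = date(2001, 3, 12)

print(age_on(birth, date(2019, 12, 31)))  # current age as of the chosen date
print(age_on(birth, hired))               # age when they joined
print(generation(birth.year))
```

The generation label collapses a range of fifteen or more birth years into a single category, which is one reason the numeric age tends to carry more information in an analysis.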

One time during a slow period over the end-of-year holidays I decided to have some fun and calculated the zodiac signs for the staff from the birth dates. Then I conducted the same analyses of staff characteristics and preferences that I had previously completed. Not surprisingly, nothing related to astrological sign was significant, but now at least I have analytical evidence.
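If you want to try the zodiac exercise yourself, a lookup like the following would do it. This is a sketch, not the code I used at the time, and the start dates are an assumption: published sign boundaries differ by a day or so from source to source.

```python
from datetime import date

# (month, day) on which each sign is assumed to start, in calendar order.
SIGN_STARTS = [
    ((1, 20), "Aquarius"), ((2, 19), "Pisces"), ((3, 21), "Aries"),
    ((4, 20), "Taurus"), ((5, 21), "Gemini"), ((6, 21), "Cancer"),
    ((7, 23), "Leo"), ((8, 23), "Virgo"), ((9, 23), "Libra"),
    ((10, 23), "Scorpio"), ((11, 22), "Sagittarius"), ((12, 22), "Capricorn"),
]

def zodiac_sign(birth_date):
    """Return the sign whose start date is the latest one on or before the birthday."""
    key = (birth_date.month, birth_date.day)
    sign = "Capricorn"  # covers January 1 through January 19
    for start, name in SIGN_STARTS:
        if key >= start:
            sign = name
    return sign

print(zodiac_sign(date(1990, 7, 23)))
```

Only the month and day are used, so the birth year could be dropped before running it.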

Data analysts are only interested in population characteristics. You are of no real interest as an individual. It’s true, “you’re just a statistic” unless you are some kind of crazy outlier. In that case, you might be interesting.

By Steve took it, licensed under CC BY-NC-SA 2.0

About statswithcats

Charlie Kufs has been crunching numbers for over thirty years. He retired in 2019 and is currently working on Stats with Kittens, the prequel to Stats with Cats.