One of the first things I realized when I started doing statistical analysis for real research projects, instead of in the classroom, was how messy and unconventional real datasets can be. I’ve talked about this before in the context of power analysis, but it gets much worse.
In the post on power analysis, the main problem wasn’t really the quality of the data (although I could say a thing or two about that) but rather the somewhat atypical analysis we were hoping to do. Since that post, I’ve had the opportunity to work with several non-research-oriented groups hoping to analyze data collected for administrative purposes. If you don’t think the reason data is collected makes a difference, then let me warn you – you’re in for a shock.
Administrative data can be really good, especially when the administration realizes in advance that this data will be useful for research or decision making in the future. Best case scenario, the data is clean and broad. What this means is that the dataset contains a number of different measures that all get at the same basic question, that there is very little missing data, and that the data all makes sense conceptually. Worst case scenario, none of that is true.
But more typically, what you do have is a fairly complete dataset (it is needed for administrative purposes after all, so, for example, if it’s payroll data, everyone will probably have their hours worked recorded). The real problems are awkward measures and what I like to call ‘crazy data’ – data points that just don’t make sense, categorical variables that are so contradictory or over-generalized that they’re useless, or coding options that are basically bizarre.
For example, you might have hospital discharge data with a variable called ‘birth weight’ which should only apply to newborn infants and be missing for all other patients, but for some reason is coded as ‘0’ for both stillborn infants and for all children and adults. Why did someone make a decision like that? Because they don’t plan on using that variable in the ways a researcher might.
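One way to deal with a coding like this before analysis is to turn the zeros back into missing values. Here is a minimal Python sketch of that idea. The field names (`age_days`, `birth_weight_g`) and the 28-day newborn cutoff are my own assumptions for illustration, not taken from any real discharge dataset.

```python
def clean_birth_weight(record, newborn_max_age_days=28):
    """Return a copy of record with non-applicable birth weights set to None.

    Assumed (hypothetical) fields: 'age_days' and 'birth_weight_g'.
    """
    cleaned = dict(record)
    is_newborn = record["age_days"] <= newborn_max_age_days
    if not is_newborn:
        # Birth weight does not apply to older patients: treat as missing.
        cleaned["birth_weight_g"] = None
    elif record["birth_weight_g"] == 0:
        # A 0 for a newborn is ambiguous (stillbirth or placeholder), so
        # mark it missing and flag it for follow-up rather than guess.
        cleaned["birth_weight_g"] = None
        cleaned["needs_review"] = True
    return cleaned


# Made-up records: a newborn, a newborn coded 0, and an adult coded 0.
records = [
    {"age_days": 1, "birth_weight_g": 3400},
    {"age_days": 0, "birth_weight_g": 0},
    {"age_days": 12000, "birth_weight_g": 0},
]
cleaned_records = [clean_birth_weight(r) for r in records]
```

Note that this can’t tell a stillbirth apart from a miscoded newborn. It only stops the zeros from quietly dragging down your averages; the flagged records still need someone who knows the data.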
Alternatively, you might have a variable called ‘duration of short-term work absence’, where short-term is defined as 6 months or less. When you go to split that variable into categories, you discover a whole sub-group of people with so-called “short-term” absences of 6-12 months. What’s going on? More than likely, a value you thought described a single continuous absence is actually cumulative, so this group of people had 2 or more absences of less than 6 months each. But there’s no way to know for sure unless you can get hold of someone in charge of the dataset, which isn’t always easy or possible. Frustrating!
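A cheap habit that catches this kind of surprise early is to check every variable against its documented definition before you start categorizing. A rough sketch in Python, using made-up durations and a hypothetical helper name:

```python
SHORT_TERM_MAX_MONTHS = 6  # the documented definition of "short-term"


def split_by_definition(durations_months, limit=SHORT_TERM_MAX_MONTHS):
    """Separate values that fit the documented definition from those that don't."""
    within = [d for d in durations_months if d <= limit]
    over = [d for d in durations_months if d > limit]
    return within, over


durations = [0.5, 2, 5.5, 7, 11, 3]  # made-up values, in months
within, over = split_by_definition(durations)
print(f"{len(over)} of {len(durations)} records exceed "
      f"{SHORT_TERM_MAX_MONTHS} months: {over}")
```

The check won’t tell you why the out-of-range values exist, but it gives you a concrete list to bring to whoever maintains the dataset.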
And then there’s the problem of sub-optimal measures. For instance, you might want to investigate the rate of injury among hospital employees. Well, for a rate, you’ll need a denominator. The most obvious denominator would be the total number of employees – usually researchers try to use full-time equivalents, which is a way of adjusting for part-time and casual workers. But what if the only data your hospital can provide is the number of beds or patients at any given time? What’s that, you ask? Why don’t they know how many employees (or employee-equivalents) they have? Sigh. I wish I knew. But it’s more common than you’d think.
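To make the denominator issue concrete, here is a small Python sketch of an injury rate per 100 full-time equivalents. It assumes you can get hours worked per employee, and it assumes 2,000 hours per year counts as full-time (40 hours for 50 weeks); both are illustrative assumptions and the numbers are invented.

```python
def full_time_equivalents(hours_worked, full_time_hours_per_year=2000):
    """Total hours worked, expressed as a number of full-time workers."""
    return sum(hours_worked) / full_time_hours_per_year


def injury_rate_per_100_fte(injuries, hours_worked, full_time_hours_per_year=2000):
    """Injuries per 100 full-time equivalents over the same period."""
    fte = full_time_equivalents(hours_worked, full_time_hours_per_year)
    if fte == 0:
        raise ValueError("No hours worked, so the rate is undefined.")
    return 100 * injuries / fte


# Made-up example: 8 full-time and 4 half-time employees, 2 injuries in a year.
hours = [2000] * 8 + [1000] * 4
fte = full_time_equivalents(hours)           # 10 FTE
rate = injury_rate_per_100_fte(2, hours)     # per 100 FTE
headcount_rate = 100 * 2 / len(hours)        # per 100 employees, for comparison
```

With these numbers the FTE-based rate is higher than the simple headcount rate, because part-time staff spend less time exposed to risk. A denominator like beds or patients can’t make that adjustment at all.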
The final problem I’ve come across with administrative data is the question of format. Since administrative data is often used for billing or payroll within a single organization, it’s often entered in whatever software the billing or payroll people in that organization are most comfortable with. You might get lucky and find that they can provide it in Excel or even a .csv file, but chances are that’s not going to be the case. I’ve had situations where I’ve had to manually enter the numbers into a spreadsheet from PDF tables, or even pull data from text-based case descriptions (fairly common with surveillance data).
Overall, my biggest lesson has been in terms of budgeting time for preparatory research activities. When you’re in class, it seems like the hardest part of research should be setting up the experiment or observational study and collecting the data. Analysis should be a breeze!
And maybe that’s true if you’re in a position to design the whole project from scratch – it certainly was true when I worked in Biology labs – but if you’re not doing the data collection, watch out! So far, my experience has been that it takes 4-6 months to really be sure that your administrative dataset is clean and sensible, and that you understand what all the variables are, what the limitations of the dataset are, and what the best method for analyzing the data you have is.
Have you had any frustrating experiences with “real world” datasets you’d like to share?