In 2016 I was approached by the Children’s Research Network to prepare three quantitative projects from the Incredible Years Ireland Study for secondary analysis. The study explored the long-term outcomes of the Incredible Years Parent and Teacher Classroom Management programmes and was undertaken by several of my current colleagues in Maynooth University’s ENRICH project. My task was to identify the appropriate data files, standardise formatting and naming conventions, and anonymise the data, editing where necessary to protect the anonymity of participants and remove harmful information.

Identifying the data

An initial challenge was identifying the relevant data files. No single folder had been designated to hold all the final, completed SPSS data files; instead, data files were spread across several folders, often as versions of the same file with very minor differences. Seeking the correct, final file was complicated further by the fact that the control group did not receive the second follow-up interview, which meant there was no single merged file containing all the data for the intervention and control groups at all three time points. At the start of the project I was unaware of this and could not understand why the numbers appeared not to match. In retrospect this was a simple mistake, and it would be important to clarify the numbers expected in each wave of the survey before identifying the data files. This highlights the desirability of a data management plan.

Searching for the data files through a complicated list of folders and sub-folders containing a great many slightly differing data files was very time-consuming. Research data managers would do well to identify completed data files and place them in a single, clearly named folder to aid archival work in the future. Perhaps data managers could also create an Excel file to serve as a map to the many folders and sub-folders involved in the project, describing and linking to the important data.

A similar challenge lay in the compilation of all the instruments (questionnaires, etc.) used in interviews. These documents were generally filed in a small number of folders, which was helpful, but there was still some scrambling to identify which documents applied to which project. The Incredible Years study featured three research projects, each with different questionnaires and scale measures; I was fortunate that a member of the Incredible Years research team was a colleague of mine in ENRICH, and her help in identifying the relevant files was invaluable. Without the support of a member of the original research team, distinguishing between the slightly different questionnaires used in the three separate projects would have been difficult. This perhaps shows that the work of preparing anonymised research data for archiving can helpfully begin during the research project itself; at the very least, good housekeeping by researchers, leaving reference files in clearly named folders, can greatly speed future archiving work.

Naming conventions

Variables in the archived data were to be standardised to a simple formula: root / item number / suffix. For example, question 3b of the Profile Questionnaire at baseline is rendered PQ3bT0; the same question in the first follow-up survey is PQ3bT1.

To change several hundred variables individually in each wave would take an enormous amount of time. Instead, I used some simple formulae in Excel to speed this up. Consider the original variable name for Profile Questionnaire question 3b at baseline: PQ_3b. This contains the “PQ” and the “3b” I want to keep, but with an unnecessary underscore.

The LEFT formula reproduces the leftmost characters of a cell: in this case, LEFT(A1,2) takes the two leftmost characters, “PQ”, from cell A1. The MID formula does the same thing, but starting a stated number of characters into the source cell: here, MID(A1,4,1000) returns up to 1,000 characters of cell A1, starting at the fourth character. I chose 1,000 as an arbitrarily large number, simply to capture all the characters after the starting point. In column D I entered “T0”, representing baseline. The final formula simply joins the other three pieces together (using the & operator or CONCATENATE): PQ, 3b and T0 become PQ3bT0.

It is a simple task to apply such formulae down the entire list of variables; this kind of approach allows fairly rapid standardisation of hundreds of variable names.
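The same renaming logic can also be scripted. The short Python sketch below mirrors the LEFT/MID/concatenate approach; the variable names and the “T0” suffix follow the convention described above, and everything else is illustrative:

def standardise(name, wave="T0"):
    # Split e.g. "PQ_3b" into its root ("PQ") and item ("3b"),
    # drop the underscore and append the wave suffix.
    root, _, item = name.partition("_")
    return f"{root}{item}{wave}"

original = ["PQ_3b", "PQ_3c", "SDQ_12"]            # hypothetical original names
renamed = [standardise(v, "T0") for v in original]
print(renamed)                                     # ['PQ3bT0', 'PQ3cT0', 'SDQ12T0']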

Missing values

Many variables may have missing values because an answer was not applicable, because the respondent refused to answer or answered “don’t know”, or for some other unknown reason. For analysis it can be important to distinguish between cells appropriately left blank because the question was not applicable to that respondent and cells left blank because the respondent refused to reply or for some other reason. I did not have access to the original paper copies of the surveys, but it was usually clear whether a missing value represented a valid “not applicable” response. Where such responses had already been identified, a simple piece of SPSS syntax could replace blank cells with the number 96, representing undefined missing values.
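In Python with pandas, the equivalent recode might look like the sketch below; the column name and values are invented, and the original work used SPSS syntax rather than Python:

import pandas as pd

# Toy data: None/NaN marks a blank cell in the source file.
df = pd.DataFrame({"PQ3bT0": [1, 2, None, 3, None]})

# Replace blank cells with 96, the code representing undefined missing values.
df["PQ3bT0"] = df["PQ3bT0"].fillna(96)
print(df["PQ3bT0"].tolist())   # [1.0, 2.0, 96.0, 3.0, 96.0]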

Anonymisation

By far the most onerous part of the data archiving was anonymising variables, where risk assessment was based on two criteria: risk of identification and risk of harm. In particular, many variables contained answers to open-ended textual questions, any of which could include personal details such as names, or private information on behavioural problems or illness, faithfully taken down from the respondent by the interviewer.

In a few cases it was appropriate simply to delete the variable, for example questions asking for the names of schools or teachers, or for phone numbers. It was decided, however, that most string variables were too valuable to omit entirely, so there was no option but to read every single response and systematically recategorise them, concealing harmful or identifying data while preserving useful information. For example, a question may ask parents about the medication being used by their children. Below are fictional examples of responses:

– Asthma
– Allergy medication
– Inhaler
– She had breathing problems and was recently prescribed an inhaler by Doctor O’Malley. Better now.
– ADHD
– Azelastine
– Methylphenidate

These diverse responses represent three basic categories: asthma medication (“asthma”, “inhaler”, and “She had breathing problems and was recently prescribed an inhaler by Doctor O’Malley. Better now.”), allergy medication (“allergy medication” and “Azelastine”) and ADHD medication (“ADHD” and “Methylphenidate”). The response describing the prescription of medication by a Doctor O’Malley shows how identifying information can be included in unexpected variables, illustrating the need to read every string variable and recategorise many into categorical variables.

In the fictional example above, some parents responded with the exact names of medications like azelastine and methylphenidate, while others knew only the general illness or condition being treated. Since I am not knowledgeable about these medications, I had to quickly look up brand or scientific names online to check their general purpose.
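As a rough illustration of this recategorisation step, the Python sketch below maps the fictional responses above to broad categories using keyword lists. The keywords are my own invention; in practice every response still had to be read, and ambiguous or unmatched responses checked by hand:

# Illustrative keyword lists for mapping free-text medication responses
# to broad categories; unmatched responses are flagged for manual review.
CATEGORIES = {
    "asthma medication": ["asthma", "inhaler", "breathing"],
    "allergy medication": ["allergy", "azelastine", "antihistamine"],
    "ADHD medication": ["adhd", "methylphenidate"],
}

def categorise(response):
    text = response.lower()
    for category, keywords in CATEGORIES.items():
        if any(word in text for word in keywords):
            return category
    return "needs manual review"

examples = [
    "Asthma",
    "Azelastine",
    "Methylphenidate",
    "She had breathing problems and was recently prescribed an inhaler by Doctor O'Malley. Better now",
]
print([categorise(r) for r in examples])
# ['asthma medication', 'allergy medication', 'ADHD medication', 'asthma medication']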

In some open questions participants gave several replies. There are a number of possible solutions to this: one could split the string variable into several categorical variables, or use SPSS’s Multiple Response Sets, which similarly require the generation of categories from string data. I generally chose the former, generating up to three categorical variables from the one string variable. Suppose, for example, that parents were asked to list any concerns about their child. Below I give a fictional example to show how such a question could be answered in the full string variable, and how I then categorised the answers into further variables.
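A minimal Python sketch of this splitting step follows, using invented responses of the kind described (all names and concerns are fictional); each split answer would then be recoded into a category in the same way as the medication example above:

# Invented free-text responses listing concerns about a child, separated by semicolons.
responses = [
    "Temper tantrums; trouble sleeping; very fussy eater",
    "Speech is behind other children his age; wets the bed",
    "Very shy with other children",
    "Fights with her sister Ellie; shouts and screams at home",
]

def split_concerns(text, n=3):
    # Split one string response into up to n answers, padding with "not applicable".
    parts = [p.strip() for p in text.split(";")]
    return parts[:n] + ["not applicable"] * (n - len(parts[:n]))

for r in responses:
    print(split_concerns(r))
# ['Temper tantrums', 'trouble sleeping', 'very fussy eater']
# ['Speech is behind other children his age', 'wets the bed', 'not applicable']
# ['Very shy with other children', 'not applicable', 'not applicable']
# ['Fights with her sister Ellie', 'shouts and screams at home', 'not applicable']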

In this example some respondents give three answers, some give two and some just one. I split these answers into three categorical variables, recording “not applicable” for the second and third variables where no answers were given. Note that the fourth example includes potentially identifying information in the sister’s name, Ellie, again illustrating the importance of categorising or editing these variables.

The Multiple Response Set command in SPSS follows a similar methodology. Categories are derived from the text variables and each category becomes a new binary (yes/no) variable.
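The resulting structure can be sketched in Python as one yes/no column per category; the categories here are the illustrative ones used above:

# One binary (yes/no) variable per category, derived from the categorised answers.
categories = ["aggression/temper", "speech development", "sleep", "sibling conflict"]
coded = [["aggression/temper", "sleep"], ["speech development"]]   # per respondent

binary = [{c: int(c in answers) for c in categories} for answers in coded]
print(binary)
# [{'aggression/temper': 1, 'speech development': 0, 'sleep': 1, 'sibling conflict': 0},
#  {'aggression/temper': 0, 'speech development': 1, 'sleep': 0, 'sibling conflict': 0}]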

This process of anonymising string variables was extremely time-consuming, involving large numbers of decisions on every variable. Individual responses could sometimes be ambiguous and could potentially fit into different categories: a response such as “shouts, screams, not good vocabulary” could feasibly fit in either the aggression/temper or the speech development category. However time-consuming and onerous the task, the data processor should at least try to be consistent, making all decisions on the same criteria across the whole dataset.

Conclusion

I know from my own experience working with data on a research project that files tend to multiply and become difficult to organise. Working on one’s own data, at least one has a shot at remembering past decisions in organising files and folders. This is much more complicated when coming fresh to other researchers’ folders: ad hoc decisions made by data managers are invisible to those archiving old research projects.

Even the decisions made in anonymising data often required knowledge of the field. External data processors attempting to prepare old research projects are disadvantaged by their lack of local knowledge of the project. In the examples I gave above, I would not automatically know whether future analysts of the data would prefer to know the brand names of medications or the general areas of illness; this is a question best answered by researchers or practitioners in the field.

All this suggests that a simple piece of organisation, undertaken by the data manager or the relevant researchers towards the end of their project, could be very helpful for future archiving and analysis. An ideal situation might even include data archiving in the timeline and budget of the project, allowing the original research team, with their specialist knowledge, to make the relevant decisions that both protect the anonymity of participants and preserve the most valuable data for future analysis. Indeed, such data management plans are now commonplace and required by major funders and funding programmes, such as the UK’s ESRC and the EU’s Horizon 2020, and it is likely that these will facilitate future archival projects and the productive secondary analysis of archived data.