Preparing for Life (PFL) is a prevention and early intervention programme operated by the Northside Partnership in Dublin. Since 2008, the programme has worked with families to help children achieve their full potential. Mothers, who were on average 23.4 weeks pregnant, were randomly assigned into the PFL programme (Doyle et al., 2010). Data were collected over time to measure developmental outcomes of children and households. The richness of the data allows many valuable analyses to be undertaken, using both the cross-sectional and panel dimensions. Because of the small sample size and region-specific nature of the data, anonymisation and statistical disclosure control were carefully performed before the data were released to the public data archives. This technical note summarises the motivation, process, and risks of archiving the quantitative data from the PFL evaluation. As there are no internationally standardised protocols for data archiving, precautionary steps were taken during the PFL data curation process, and these may serve as a reference for archiving other longitudinal evaluation data.

Motivations for data archiving

Archiving and sharing data should serve two goals. The first is to protect the confidentiality of respondents (Elliot et al., 2016; University College Dublin (UCD) Library, 2017). No information on the identity of a respondent or household should be revealed without lawful authority (Office for National Statistics (ONS), 2001; Irish Statute Book (ISB), 1988). Direct identifiers, such as names, dates of birth, and geographic information, should be omitted because they reveal respondents’ identities. Indirect identifiers, such as occupation, age, or wages, which can be obtained from local knowledge, should be processed carefully. The second is to release useful data from which statistically valid conclusions can be drawn (Growing Up in Ireland (GUI), 2013).

Three practical procedures

In processing the PFL data, the anonymity of participants was ensured through three procedures:

1) Small cell adjustments were performed on outcomes with fewer than five observations, including zero observations, as such cells can easily compromise confidentiality (ONS, 2001). In the PFL data, these variables include sensitive information on drug use during pregnancy, domestic risks, multiple pregnancies, specialist consultations, and postnatal depression. Because of the low number of reported observations, these variables were removed entirely from the archived dataset.

2) Extensive banding was applied to socioeconomic data such as occupation, level of education, and ethnic background, as respondents could be identified through cross tabulation (ONS, 2015). For example, instead of specific job titles, the occupations of mothers, fathers, and grandparents were grouped into broad categories following the Standard Occupational Classification 2010 (SOC2010). These categories are comparable to census data in the UK and Ireland (ONS, 2015; UKAN, 2013). In addition, rather than reveal the specific ethnic backgrounds of minority respondents in the sample, maternal ethnic group was broadly re-categorised as Irish and non-Irish.

3) Top and bottom coding was applied to information related to income, demographics, and the household. Where an individual or household value was an outlier, it was amalgamated into the neighbouring group, for example age at first pregnancy “below 17 years”, wages below or above a certain level, and family size “greater than seven” (GUI, 2013; ONS, 2001).

While banding can follow national or international guidelines, the thresholds for small cell adjustments and for top and bottom coding are largely data-driven.
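The three procedures above can be sketched in a few lines of Python. This is an illustration only, not the code used for the PFL archive: the function names, the sample values, and the "Group A"/"Group B" labels are hypothetical. The only values taken from the text are the threshold of five observations and the cut-offs "below 17" and "greater than seven".

```python
from collections import Counter

MIN_CELL = 5  # from the text: fewer than five observations is a small cell


def small_cells(values, categories, min_cell=MIN_CELL):
    """Return {category: count} for categories below min_cell, including empty ones."""
    counts = Counter(values)
    return {c: counts[c] for c in categories if counts[c] < min_cell}


def band(value, mapping, default):
    """Replace a detailed value with its broad band; unmapped values get the default."""
    return mapping.get(value, default)


def top_bottom_code(value, low=None, high=None):
    """Collapse values outside [low, high] into open-ended end categories."""
    if low is not None and value < low:
        return f"below {low}"
    if high is not None and value > high:
        return f"greater than {high}"
    return str(value)


# Hypothetical sample; none of these values come from the PFL data.
ethnicity = ["Irish"] * 6 + ["Group A"] * 3 + ["Group B"] * 2

# 1) Small cell check on the detailed variable.
detailed = small_cells(ethnicity, ["Irish", "Group A", "Group B"])

# 2) Banding into Irish / non-Irish, then re-checking the cells.
banded = [band(e, {"Irish": "Irish"}, default="non-Irish") for e in ethnicity]
after_banding = small_cells(banded, ["Irish", "non-Irish"])

# 3) Top and bottom coding with the cut-offs quoted in the text.
age_first_pregnancy = [top_bottom_code(a, low=17) for a in [16, 17, 24]]
family_size = [top_bottom_code(n, high=7) for n in [3, 7, 9]]
```

In this made-up sample the detailed ethnicity variable has two small cells, and both disappear once the variable is banded. Re-running the small cell check after every banding or recoding step is a cheap way to confirm that a chosen threshold actually removes the risk.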

Potential challenges

Because the thresholds are data-specific, choosing them can itself be a challenge in the anonymisation process. The cut-offs should keep the data useful for socioeconomic analysis without compromising respondents’ identities. For instance, the age range in the PFL dataset is 16 to 38 years. This range would allow analysis based on the standard categorisation of youth as those between 15 and 24 years. However, outliers need to be removed, as one could piece together several variables and identify a respondent in a small sample. Another common problem in data processing is identifying missing observations and zero values. This requires an understanding of the dataset, such as attrition due to social processes or the existence of skip patterns in the survey. In the archived dataset, missing values are handled with caution as they can affect the final psychosocial scores.
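To see why missing values need care when scores are computed, consider the sketch below. It rests on assumptions that are not part of the PFL documentation: the reserved codes -8 and -9, the function name `scale_score`, and the rule of averaging only the answered items are all hypothetical, chosen to show how a genuine zero differs from a missing response.

```python
# Hypothetical reserved codes; the actual PFL codebook may use different ones.
SKIPPED = -8    # question not asked because of a skip pattern
NO_ANSWER = -9  # question asked but not answered, or wave missed through attrition
MISSING_CODES = {SKIPPED, NO_ANSWER}


def scale_score(items, min_answered):
    """Mean of the answered items, or None when too few items were answered.

    A genuine zero is a valid answer and is kept; only reserved codes are dropped.
    """
    answered = [x for x in items if x not in MISSING_CODES]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)


complete = scale_score([0, 2, 1, 0], min_answered=3)                # zeros count
too_sparse = scale_score([0, NO_ANSWER, NO_ANSWER, 2], min_answered=3)
with_skip = scale_score([2, 2, SKIPPED, 2], min_answered=3)

# Treating reserved codes as real values silently distorts the score.
naive = sum([0, NO_ANSWER, NO_ANSWER, 2]) / 4
```

Here the naive average is negative, which is impossible for a scale whose items start at zero. Keeping reserved codes separate from real values, and returning no score when too few items are answered, avoids passing that distortion into the archived scores.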

Longitudinal data carry additional disclosure risks, such as changes in demographic variables over the data collection period, for example a change in marital status (UKAN, 2013). It is therefore important to perform tabulations and cross tabulations to ensure that the anonymity of individuals and households is maintained in both the cross-sectional and panel dimensions.
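The following sketch shows how a change over time can create a disclosure risk that no single wave reveals. The panel, the variable names, and the helper functions are hypothetical, and the threshold of five is borrowed from the small cell rule described earlier.

```python
from collections import Counter, defaultdict

MIN_CELL = 5  # same small cell threshold as in the cross-sectional checks


def small_combinations(rows, keys, min_cell=MIN_CELL):
    """Cross tabulate `keys` and return observed combinations below min_cell."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return {combo: n for combo, n in counts.items() if n < min_cell}


def rare_trajectories(rows, var, min_cell=MIN_CELL):
    """Count each person's sequence of `var` across waves; return rare sequences."""
    by_person = defaultdict(dict)
    for row in rows:
        by_person[row["id"]][row["wave"]] = row[var]
    paths = Counter(
        tuple(values[w] for w in sorted(values)) for values in by_person.values()
    )
    return {path: n for path, n in paths.items() if n < min_cell}


def person(pid, wave1, wave2):
    """Two long-format rows for one hypothetical respondent."""
    return [
        {"id": pid, "wave": 1, "marital": wave1},
        {"id": pid, "wave": 2, "marital": wave2},
    ]


# Hypothetical two-wave panel of 12 respondents.
panel = []
for pid in range(1, 7):
    panel += person(pid, "single", "single")
for pid in range(7, 12):
    panel += person(pid, "married", "married")
panel += person(12, "single", "married")  # the only respondent who changes status

cross_sectional = small_combinations(panel, ["wave", "marital"])
longitudinal = rare_trajectories(panel, "marital")
```

In this example every wave-by-status cell holds at least five respondents, so the cross-sectional check finds nothing. The trajectory check, however, flags the single respondent who moved from single to married. Note that `small_combinations` reports only combinations that occur in the data, so empty cells would need a separate check against the full list of categories.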

Conclusion

As a final note, a thorough and clear audit record documenting these procedures should be kept. For instance, Stata users can keep track of all relevant anonymisation activities, processes, and notes applied to each wave of data in their .do files (Stata Press, 2017). A clear audit record is useful for demonstrating that the correct procedures were followed and for tracking mistakes.

Archiving the PFL longitudinal dataset made the evaluative data, which are rich in demographic content and contain measurements of children’s development over time, available to the public. The procedures described above have proven useful in protecting the identities of programme participants while keeping the evaluative results useful for researchers and policy makers. It is hoped that sharing the experience of archiving the PFL quantitative dataset will be useful to the research community.