Key lessons learned in building a research database
As early as 2014, before data was widely recognised as a major challenge to research, the Energy Research Centre (ERC) realised they had a data problem. The group hired Wiebke Toussaint, engineer and data scientist, to manage their data dilemma. Toussaint – with some help from UCT eResearch – built an energy data portal. Now, after a five-year journey of learning about research data management for medium-sized research centres, she has some advice for research groups setting out on a similar path.
“Diverse data assets – from big data to small qualitative surveys – play an important role in scientific research, yet many research centres lack the capacity and technology expertise to build data ecosystems to manage their data sets,” says Toussaint. She advises that research centres, before investing resources or seeking funds to build a data solution, consider a few key factors.
1. Think strategically
The first question for research groups to ask is: What is the strategic value of data in our institution? They need to decide what role they want to play, as a research group, in the national, continental and global space in terms of data.
Toussaint says there can be a massive strategic advantage in deciding to make data management part of the group’s mandate. “For instance, the University of California, Irvine owns a number of data sets that are used globally for benchmarking. There’s no reason our research groups couldn’t do something similar.”
It is important to decide this up front, she says, because a group that chooses this route should plan at least five to 10 years into the future, rather than only a year or two.
At this point, however, a research group may well realise that they would rather put their resources into the research itself and have somebody else take care of their data – as long as it is safe and properly archived. In this case, ZivaHub, UCT’s institutional data repository, is the way to go: it assigns a digital object identifier (DOI) to each data set for persistent identification and proper citation.
Toussaint advises research groups to get expert strategic advice at this early stage, to ensure that they have considered all options and know what they are working towards.
2. Map out the landscape
“Data sharing requires multi-party involvement, and the more partners there are in the endeavour, the greater the chance of future sustainability,” says Toussaint.
She strongly advises against a research group taking on such a project alone. “If there is an existing initiative, join it; otherwise, start building partnerships. The best situation is where a few groups with the same mandate work together to build a resource for the greater good – whether they are in South Africa, Africa, the Global South or spread across the world.”
3. Be ready for a culture change
“Data tends to play a very strategic role in a research centre, with many different touchpoints. Changing the way data is managed often involves a component of organisational change,” says Toussaint.
An important factor to consider, therefore, is the mandate of the person or people employed to implement sustainable data practices. Researchers must recognise that this will affect how they work with their data and must be open to changing some practices.
4. Know what skills you need
For a database such as the one the ERC built, a group would effectively be hiring for three different skill sets, which may not come neatly rolled up in one person.
First, says Toussaint, you need a person with the technical skills to build the database – somebody with an engineering mindset who is happy to tinker and problem-solve. Once it is built, you need to employ a person long term to maintain it: somebody with a librarian mindset to focus on the fine details around curating and archiving. Finally, you need someone to do the “data science” – the visualisation and storytelling from the stored and curated data set.
“Quite often, I think, staff from an organisation see the data visualisation and think: ‘This is what we want’, and then they probably hire an engineer. It’s critical to understand what skills are involved before embarking on the project.”
Toussaint is extremely excited that researchers are starting to grapple with these problems. The work is new and difficult, but she believes it is important to send the message that this is not a space to be feared.
“For me, the worst is a default closed mode, as you’re not aware of the risks or the opportunities of publishing your data,” she says. “Data is only as valuable as the value people add to it, and if yours is inaccessible, there is other data out there that researchers will use, while your untouched data grows stale.”
How the portal was built
Toussaint was tasked with building a data portal where the ERC’s energy data – at the time gathered from a range of sources and scattered across different storage options – could be stored in one place. In addition, the data had to be made open to researchers worldwide.
Working before ZivaHub was available, Toussaint relied on the Comprehensive Knowledge Archive Network (CKAN), web-based open-source software created specifically for storing and distributing open data.
“CKAN is a great resource,” says Toussaint. “It’s not an out-the-box solution, as setting it up requires technical know-how, but it’s free.” She says a number of governments and research organisations are already using it for this purpose.
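For readers curious about what distributing open data through CKAN can look like from a researcher's side, here is a minimal Python sketch. It is not the ERC's actual setup. It assumes a portal that exposes CKAN's standard Action API, whose `package_search` action returns matching data sets as JSON. The portal address `https://data.example.org` and the helper names `build_search_url`, `dataset_titles` and `search_portal` are hypothetical and exist only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical portal address (an assumption); replace with a real CKAN site.
PORTAL_URL = "https://data.example.org"


def build_search_url(portal_url, query, rows=5):
    """Build the URL for CKAN's package_search action."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response_body):
    """Extract data set titles from a package_search JSON response."""
    payload = json.loads(response_body)
    if not payload.get("success"):
        raise RuntimeError("CKAN reported an unsuccessful request")
    # Fall back to the machine-readable name when a title is missing.
    return [
        dataset.get("title") or dataset.get("name", "")
        for dataset in payload["result"]["results"]
    ]


def search_portal(portal_url, query, rows=5):
    """Query a live portal; requires network access and a real CKAN site."""
    url = build_search_url(portal_url, query, rows)
    with urlopen(url, timeout=10) as response:
        return dataset_titles(response.read().decode("utf-8"))
```

Calling `search_portal(PORTAL_URL, "energy")` against a real CKAN address would return the titles of the first few matching data sets. URL building and response parsing are kept separate so that each can be checked without a network connection.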
UCT eResearch assisted by providing a virtual server. Toussaint says her interactions with UCT eResearch – and their support – facilitated her work.
Toussaint adds that she hopes to collaborate more with eResearch in the future: “As the project comes to an end, and I reflect on what I’ve learned, I have a greater idea of what’s actually possible. I would like to engage with eResearch at the next level, and say: You’re doing great work, and changing the way things are done; what else can you do for me?”