Data Science for Net Zero, the Value of Reproducibility – Dr Stephen Haben
Comment by Dr Stephen Haben, Senior Data Scientist at Energy Systems Catapult.
Ensuring innovation and research add value to the energy system
From policy through to business models and products, there are many important decisions to make on the road to Net Zero energy. Many organisations aim to ground these decisions in scientific research to ensure the best choices are made. What happens if that research is not robust and reproducible? It means that the decisions we make risk taking policy and businesses in unfruitful directions.
This short comment piece discusses the importance of reproducibility in data science and presents some of our own learnings concerning reproducibility. We also share some ways that Energy Systems Catapult is trying to support reproducibility for data science in the energy sector, with actionable steps for academics, funding bodies, and publishers.
The reproducibility crisis
The House of Commons report into reproducibility and research integrity highlights some specific reproducibility issues within machine learning and AI research, including:
The 2021 ‘State of AI report’, which found that only 26% of AI papers published that year had made their code available.
A contribution from Professor Vollmer (TU Kaiserslautern in Germany), stating that “there is pressure in the field to publish even more and even faster than in other fields”, meaning there is a risk of more “research garbage” because “the field is so fast moving”.
Reproducibility in Data Science
Reproducibility has long been a core requirement in both the natural and social sciences. Many of the systems they study are now being analysed and modelled using data collected from sensors and other forms of monitoring. This inextricably links the reproducibility of natural and social science hypotheses and models to the corresponding machine learning and AI models of these systems. Questions arise as to whether the current model accurately describes the processes observed and can therefore be trusted to generalise to similar environments and situations.
Issues with reproducibility are also a roadblock to knowledge and innovation transfer, as found in our academia-to-industry impact investigation. For organisations to make the best use of university outputs, findings must be accessible (open access) and methodologies clearly described to enable faithful replication. This also requires sharing the data used to demonstrate the model, ideally with the code, so that nuances and quirks of the approach that might otherwise stay hidden can be identified. To maximise its utility, data should not only be open but also of sufficient diversity and size, as up to date as possible, and well documented.
Researchers from Princeton University held a workshop on the Reproducibility Crisis in ML-based Science, where speakers highlighted reproducibility issues within their own fields. The organisers have also collected a list of papers from 17 different fields with data leakage errors leading to reproducibility failures. These papers have in turn been shown to directly affect 329 additional papers. This highlights the devastating snowball effect that reproducibility errors can cause: time wasted on research dead ends, or promising areas of research left unexplored.
There are many causes behind reproducibility issues. A major one is the aforementioned “data leakage”, in which information from the test set is “leaked” into the training set, making the evaluation unreliable because the model may simply have memorised the values it is later tested on. Another is the quality and quantity of data: if algorithms are only tested on small or over-used datasets, the results will not generalise to new, unseen instances.
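As a minimal sketch of how easily leakage can creep in, consider preprocessing. Fitting a scaler (or doing feature selection) on the full dataset before splitting lets test-set statistics influence the training pipeline; fitting it on the training portion only avoids this. The data, model and library choices below are purely illustrative, not taken from any of the studies discussed here.

```python
# Minimal sketch of one common form of data leakage, using scikit-learn
# and synthetic data. All names and numbers are illustrative only.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)

# Leaky pipeline: the scaler sees the test rows before the split,
# so test-set statistics influence the training features.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_model = Ridge().fit(X_tr, y_tr)
print("leaky MAE:", mean_absolute_error(y_te, leaky_model.predict(X_te)))

# Correct pipeline: split first, fit the scaler on training data only,
# then apply the same transform to the held-out test data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)
clean_model = Ridge().fit(scaler.transform(X_tr), y_tr)
print("clean MAE:", mean_absolute_error(y_te, clean_model.predict(X_te)))
```

For simple scaling the difference is usually small, but the same mistake applied to target-dependent preprocessing or feature selection can inflate reported accuracy substantially.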
To take an example from forecasting, consider the attempt to replicate the results of one of the most popular forecasting challenges, the M-competitions. In the third iteration, M3, it was found that even though the test data and submitted forecasts were available, the recomputed accuracy scores did not match those in the published paper. Similar discrepancies have been found throughout forecasting research. In later M-competitions, entrants had to share their code to improve the reproducibility of the results.
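Checking a published score of this kind only requires the held-out test values and the submitted forecasts. A hedged sketch of recomputing a symmetric MAPE (one of the headline metrics used in M3) might look like the following; the file names and column layout are assumptions for illustration, not the actual competition formats.

```python
# Hedged sketch: recompute the sMAPE of a submitted forecast against the
# held-out test values. File names, columns and shapes are assumptions.
import numpy as np
import pandas as pd

def smape(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Symmetric MAPE in percent, averaged over all horizons and series."""
    denom = (np.abs(actual) + np.abs(forecast)) / 2.0
    return float(np.mean(np.abs(actual - forecast) / denom) * 100)

# Each file is assumed to hold one row per series and one column per horizon.
actuals = pd.read_csv("test_values.csv", index_col="series_id")
forecasts = pd.read_csv("submitted_forecasts.csv", index_col="series_id")

# Align on series id so a reordered file cannot silently corrupt the score.
forecasts = forecasts.loc[actuals.index]

score = smape(actuals.to_numpy(), forecasts.to_numpy())
print(f"Recomputed sMAPE: {score:.2f}% (compare against the published table)")
```

When scores recomputed in this way disagree with the published table, that is exactly the kind of discrepancy the M3 replication exercise uncovered.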
Reproducibility in Energy Research
A 2017 report by Huebner et al. noted that reproducibility in energy research is largely untested. The same report notes that “replicability/reproducibility of energy research could be lower than in other disciplines because no energy journals require authors to provide the accompanying data and step-by-step information on how the data was collected and analysed”. Data science research in energy inherits the same problems.
Data science energy research is focused on several applications including:
Assessing the impact of energy efficiency interventions
Many of the challenges are therefore dependent on localised energy demand and generation. One popular technique is load forecasting, which has been an important part of balancing national supply and demand for decades. More recently, load forecasting is becoming increasingly important at the distribution level and will be essential for deploying flexibility services optimally. However, models at this level are still largely unexplored, so there are many unknowns.
To illustrate this, a recent load forecasting study of 100 low voltage feeders showed some surprising results, including the fact that, in contrast to the national level, using weather data as an input actually reduced the accuracy of the forecasts. If this applies more generally, it could save Distribution System Operators, Aggregators and Flexibility Service Operators time and money, since they would no longer need to subscribe to expensive numerical weather forecasting services or integrate them into their pipelines and models. On the other hand, if this is simply a quirk of the specific data, the testing period chosen or a random anomaly, it could result in significantly worse forecasts, unnecessarily increasing costs for organisations and consumers. This is a tangible example of why reproducibility is important.
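A claim like this is cheap to re-test on new data when code and data are shared. The sketch below is a generic, hypothetical setup (an assumed feeder-level demand file with a temperature column, and an off-the-shelf model) that compares the same forecaster with and without the weather input on a chronological hold-out; it is not the method used in the cited study.

```python
# Hedged sketch: does a weather input help a feeder-level forecast?
# File name, column names and model choice are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

df = pd.read_csv("feeder_demand.csv", parse_dates=["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek
df["lag_1d"] = df["demand_kw"].shift(48)  # assuming half-hourly data
df = df.dropna()

# Chronological split: never evaluate on data earlier than the training set.
cutoff = int(len(df) * 0.8)
train, test = df.iloc[:cutoff], df.iloc[cutoff:]

feature_sets = {
    "without weather": ["hour", "dayofweek", "lag_1d"],
    "with weather": ["hour", "dayofweek", "lag_1d", "temperature_c"],
}

for label, features in feature_sets.items():
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(train[features], train["demand_kw"])
    pred = model.predict(test[features])
    mae = mean_absolute_error(test["demand_kw"], pred)
    print(f"{label}: MAE = {mae:.2f} kW")
```

Running this kind of comparison across many feeders, time periods and model choices is what would be needed before concluding that weather inputs genuinely add no value at low voltage.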
Other energy data research has similarly shown some surprising results. Replication and reproducibility are necessary to ensure any conclusions are realistic, reliable, and scalable.
Open data is an essential component of reproducibility, yet unfortunately there is very little of it, especially for lower voltage applications. For example, a recent survey of low voltage level forecasting (including households) showed that less than 24% of research articles used at least one open data set, and of these around 42% used the same data set. This highlights another problem: open data is so sparse that most researchers continually use the same dataset, even though it may be unrepresentative or contain particular biases. This further reduces the reliability and generalisability of the results produced.
Supporting Reproducibility
Ensuring reproducible research is not trivial. Although researchers have the core responsibility for ensuring their work can be replicated, the wider research community, including journals and funding bodies, can play a key role in supporting them. Below we outline and collate some of the recommendations in the area of reproducibility as well as some of our own resources.
Academics and universities
The conclusions and recommendations from the House of Commons report align strongly with many of our findings from the Data Science: From Academia to Industry investigation. This includes supporting the move to open access publishing and ensuring that, where possible, all research data is open and published with the associated open-source code. In addition, the report lays out many recommendations concerning the revision of current academic culture, including:
Stronger tests for adequate software and statistical skills within research teams should be implemented at the outset of funding applications. This also reflects the wider need for technical skills across many industries, including data science in the energy sector.
Three-year minimum contracts should be introduced for post-doctoral researchers to incentivise and prioritise reproducibility. Short-term contracts and a competitive academic job market encourage a quick turnover of published papers, with less emphasis on making the outputs immediately useful to others.
Publishers should ensure sufficient options for the publication of negative and confirmatory science. Knowing what doesn’t work can be as valuable as knowing what does, and can save unnecessary replication.
As Huebner et al. remark, it is important that proper research practices are adopted when testing energy interventions. This includes, where possible, using randomised control trials, avoiding selection bias, and forming research hypotheses before any experiments are implemented.
Energy Systems Catapult has also been engaged with many other projects and platforms to support replicable data science in the energy sector. To support reproducible research we developed guidelines on the necessary contents of any model methodology documentation, and produced guidance for sharing accessible and reproducible code. This should ensure that shared code has the greatest possible impact and that the methods can be accurately implemented by practitioners and other researchers.
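As a small, generic illustration of the kind of practice such guidance encourages (the details are in the linked guidelines; this sketch is not the Catapult's published guidance), fixing random seeds and recording the exact software environment alongside any results makes a shared analysis far easier to rerun faithfully.

```python
# Generic sketch of two habits that make shared analysis code easier to
# reproduce: fix the random seeds, and record the environment that produced
# each set of results. Illustrative only, not the Catapult's guidance.
import json
import platform
import random
import sys

import numpy as np

SEED = 2024  # arbitrary, but fixed and reported with the results

def set_seeds(seed: int = SEED) -> None:
    """Seed every source of randomness used by the analysis."""
    random.seed(seed)
    np.random.seed(seed)

def environment_record() -> dict:
    """Capture enough of the environment to rerun the analysis later."""
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": SEED,
    }

if __name__ == "__main__":
    set_seeds()
    with open("run_metadata.json", "w") as f:
        json.dump(environment_record(), f, indent=2)
```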
We have also been running several data science competitions. Not only do these crowdsource machine learning algorithms to compare on real-world energy sector problems, but they also encourage the creation of competitive benchmarks that have been evaluated on the same open data. Further to this, we are trying to support innovators by sharing our data and, where possible, opening up our code.
Funding bodies
Funding bodies have a key role to play in supporting academics to produce reproducible research. They can stipulate open access requirements for the outputs of their funding calls, especially when the funding comes from public resources. UK Research and Innovation (UKRI) has guidance for open access publishing of the projects it finances. Although many grants and funds focus on new and novel research, funding bodies should also reserve funding explicitly for replication studies, as recommended by Huebner et al. This ensures the necessary validation and testing of hypotheses generated by the original research.
A wider view of the funding landscape is one way to help identify previous research and outputs. The Catapult is developing the Catalogue of Projects on Energy Data (CoPED), which has reproducibility as a primary objective. CoPED is an open-source platform that contains metadata from the UKRI gateway on energy innovation projects. Identifying which research needs to be replicated supports reproducibility, since it requires understanding the project landscape and what has been done before. More information on CoPED can be found in the documentation.
Publishers and Academic Journals
Research journals also have a responsibility to promote proper, replicable and reproducible research in the community. Journals are increasingly asking for data and code to be shared, which helps others utilise the research, validate the results and build upon the findings. In addition to sharing data and code, it has been suggested that there should be an opportunity to pre-register trials so that non-significant results are still reported. This is similar to the AllTrials campaign for medical trials, which aims to ensure that all evidence, good or bad, is reported, so that all risks and benefits are available when assessing a drug or intervention.
Although the tools, guidance and frameworks above can help support reproducibility, it is ultimately the responsibility of the individual data scientist to apply their algorithms properly, avoiding data leakage and incorrect conclusions. This will require a wider development of skills and experience in data science.
It is also important to mitigate some of the risks being created by the increased application of machine learning algorithms to automated decision making. This will require some form of algorithmic governance to reduce the potential harms from inappropriate and incorrect modelling.