Joining the dots on disparate data sources – Jessica Steinemann
Comment by Jessica Steinemann, Data Scientist at Energy Systems Catapult.
Previous work by Energy Systems Catapult has identified data as the single biggest enabler of a decarbonised, decentralised and digitised energy future. Fortunately, several organisations in the energy sector are offering open data platforms, such as the National Grid ESO Data Portal, UK Power Networks’ Open Data Portal or the Balancing Mechanism Reporting Service (BMRS) provided by Elexon. However, to unlock the true value of this data, disparate data sources often need to be joined together.
One open-source (OS) tool that tries to address this requirement is the Power Station Dictionary (PSD), which provides mappings of electricity generating assets in the UK between various data sources, such as Elexon and the Renewable Energy Planning Database. However, when evaluating existing OS energy data science projects, I found that most failed to build on work such as the PSD, leading to different data scientists and researchers repeating time-consuming data mapping tasks.
As such, I wanted to explore the value and challenges of building upon an existing OS data project when developing a common energy system use case. I set out to achieve the following objectives:
Develop a useful dataset of live UK electricity generation by location (e.g. for mapping of this data), and make it publicly available
Document the work and outputs to help others trying to achieve a similar objective better understand the data sources that were used and the logic to derive the live generation figures
Better understand the landscape of existing OS energy data projects in the UK, and how they could be reused in my project
Understand the barriers and opportunities related to reusing existing OS tools
Contribute to the existing OS projects and feed back to their creators
The Data: Power Station Dictionary (PSD)
One of the key inputs into this project, the PSD is available via a GitHub code repository and is also accessible via a user-friendly website. It provides mappings between 15 commonly used energy datasets and databases and contains information about more than 270 electricity generating assets.
Most useful from this package was the information about the power station fuel types and, more importantly, the power station locations in latitude and longitude. To my knowledge, there is no other single database for the UK’s power generator locations, meaning this information would otherwise need to be collated from government datasets, such as the Renewable Energy Planning Database, and through time-consuming online research (e.g., Google Maps, Wikipedia).
It’s possible to install the PSD as a Python package, allowing integration of the data into other applications and various programming languages using the ‘Frictionless Data Tabular Schema’, a standardised format for expressing metadata and linking datasets.
However, for the purposes of this project, the main PSD data required were the dictionary IDs, ‘plant locations’, ‘common names’ and ‘fuel types’, all of which are available in the repository as CSV files and can be accessed directly.
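As a minimal sketch of how such CSV files might be combined, the example below joins three small tables on a shared dictionary ID using only the Python standard library. The column names (`dictionary_id`, `bmu_id`, `latitude`, `longitude`, `fuel_type`) and all rows are hypothetical placeholders, not the PSD's actual schema or data, and the sketch assumes one row per dictionary ID, which may not hold for the real files.

```python
import csv
import io

# Hypothetical stand-ins for three PSD CSV files. The real files and their
# column names may differ, and every row below is a made-up placeholder.
IDS_CSV = """dictionary_id,bmu_id
1,EXAMPLE-1
2,EXAMPLE-2
"""
LOCATIONS_CSV = """dictionary_id,latitude,longitude
1,51.0,-1.0
2,55.0,-3.0
"""
FUEL_TYPES_CSV = """dictionary_id,fuel_type
1,gas
2,wind
"""


def read_keyed(text, key="dictionary_id"):
    """Parse CSV text into a dict of rows keyed by the dictionary ID."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def join_tables(*tables):
    """Left-join the later tables onto the first one via the shared key."""
    base, *others = tables
    joined = {}
    for key, row in base.items():
        merged = dict(row)
        for other in others:
            # Assets missing from a later table simply keep fewer columns.
            merged.update(other.get(key, {}))
        joined[key] = merged
    return joined


assets = join_tables(
    read_keyed(IDS_CSV), read_keyed(LOCATIONS_CSV), read_keyed(FUEL_TYPES_CSV)
)
print(assets["2"])
```

Keying every table on the dictionary ID means any additional mapping table can be attached the same way.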
The Data: Balancing Mechanism Reporting Service (BMRS)
The second key input is electricity generation data. The BMRS provides various data reports about electricity generation and demand in the UK, many at half-hourly level to match the 48 settlement periods (SPs) of the UK Balancing Market. This data was accessed using the “Elexon Data Portal” Python package, which greatly simplifies querying the BMRS API, for example by automatically translating SPs into human-readable timestamps.
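To illustrate what translating an SP into a timestamp involves, here is a minimal sketch. It is not the API of the Python package mentioned above; the function name `settlement_period_window` is hypothetical. It assumes a normal 48-period day in local clock time, whereas clock-change days have 46 or 50 periods and would need time-zone-aware handling.

```python
from datetime import date, datetime, time, timedelta

SP_LENGTH = timedelta(minutes=30)


def settlement_period_window(settlement_date, settlement_period):
    """Return the (start, end) local clock times of a settlement period.

    Assumes a normal 48-period day; clock-change days (46 or 50 periods)
    are deliberately not handled in this sketch.
    """
    if not 1 <= settlement_period <= 48:
        raise ValueError("settlement period must be between 1 and 48")
    midnight = datetime.combine(settlement_date, time.min)
    # SP 1 starts at midnight, so SP n starts (n - 1) half hours later.
    start = midnight + (settlement_period - 1) * SP_LENGTH
    return start, start + SP_LENGTH


start, end = settlement_period_window(date(2023, 4, 1), 35)
print(start, end)  # 2023-04-01 17:00:00 2023-04-01 17:30:00
```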
Historic generation per generator is usually published five to seven working days after the actual generation date, which makes it difficult to see what the live generation output in different locations looks like. This data is ingested into the pipeline to show historical generator outputs. Currently, the pipeline retains this data for only 45 days to avoid excessive duplication of data that can easily be queried via a public API; however, this parameter can easily be changed if a particular use case requires it.
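A retention cap of this kind could be expressed as a simple filter. The sketch below is illustrative only: the record layout (a `timestamp` key) and the names `RETENTION_DAYS` and `apply_retention` are assumptions, and the sample values are placeholders.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 45  # the cap described above; adjust for other use cases


def apply_retention(records, now, retention_days=RETENTION_DAYS):
    """Keep only records whose timestamp falls inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [record for record in records if record["timestamp"] >= cutoff]


now = datetime(2023, 6, 1)
records = [
    {"timestamp": datetime(2023, 4, 1), "mwh": 10.0},   # older than 45 days
    {"timestamp": datetime(2023, 5, 20), "mwh": 12.0},  # inside the window
]
kept = apply_retention(records, now)
print(len(kept))  # 1
```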
Live generation data is contained within the Physical Data of the BMRS which, in turn, breaks down into five subcategories, which are defined in detail on this website:
Final Physical Notifications (FPN): the best estimate of the level of generation a generator expects to export in an SP. This must be submitted one hour prior to the start of the SP, also referred to as “Gate Closure”.
Quiescent Physical Notifications (QPN) – optional: a series of MW values and associated times expressing the volume of generation or demand expected to be generated or consumed (as appropriate) by an underlying process that forms part of the operation of a particular generator.
Maximum Export Levels (MEL): the maximum power export level of a particular BM Unit at a particular time. For example, in the case of an outage affecting a certain generator, this could override their previously notified generation levels.
Minimum Import Levels (MIL): the minimum power import level of a particular BM Unit at a particular time.
Bid Offer Acceptance Levels (BOAL): a formalised representation of the purchase and/or sale of generating capacity by the System Operator (i.e. National Grid) to balance the transmission grid.
For the purposes of understanding live electricity generation at the half-hourly level, the FPN, BOAL and MEL data is most relevant. It is important to understand that, although SPs are distinct half-hourly periods, most of the time, the physical data is not provided in neat half-hourly intervals as generation levels often bridge two separate SPs. BOALs can also change multiple times and override each other within any SP if the System Operator (National Grid) requires this to be the case, for example in the case of wind energy curtailment.
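One plausible way to combine the three data types into a single expected output level is sketched below. The precedence rule is an assumption made for illustration, not something the project's code is confirmed to do: an accepted level (BOAL) replaces the FPN where one exists, and the MEL then acts as an upper cap. The function name `expected_output_mw` is hypothetical.

```python
def expected_output_mw(fpn_mw, boal_mw=None, mel_mw=None):
    """Combine FPN, BOAL and MEL into one expected output level in MW.

    Assumed precedence (illustrative only): a BOAL overrides the FPN,
    and the MEL caps whatever level results.
    """
    level = fpn_mw if boal_mw is None else boal_mw
    if mel_mw is not None:
        level = min(level, mel_mw)
    return level


print(expected_output_mw(100.0))                             # 100.0
print(expected_output_mw(100.0, boal_mw=60.0))               # 60.0
print(expected_output_mw(100.0, boal_mw=90.0, mel_mw=50.0))  # 50.0
```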
Therefore, an important task within this project was to extract the most recent information from the different notification messages and aggregate it so that the half-hourly generation volumes (in MWh) could be calculated based on the multiple generation output levels (in MW) that a generator could have within any SP. To achieve this, the physical data was resampled from records with start and end times to minute-level generation records, from which only the latest notification for each minute was retained.
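The resampling idea can be sketched with the standard library as follows. This is a simplified illustration rather than the project's implementation: the record keys (`start`, `end`, `level_mw`, `notified_at`) are hypothetical, each record is assumed to hold a constant MW level and to start on a whole minute, and the sample values are placeholders.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ONE_MINUTE = timedelta(minutes=1)


def latest_level_per_minute(records):
    """Expand records to one-minute steps, keeping the latest notification."""
    per_minute = {}
    for rec in records:
        minute = rec["start"]
        while minute < rec["end"]:
            current = per_minute.get(minute)
            # A later notification overrides an earlier one for this minute.
            if current is None or rec["notified_at"] >= current[0]:
                per_minute[minute] = (rec["notified_at"], rec["level_mw"])
            minute += ONE_MINUTE
    return {minute: level for minute, (_, level) in per_minute.items()}


def half_hourly_mwh(minute_levels):
    """Sum minute-level MW values into MWh per half-hourly period."""
    totals = defaultdict(float)
    for minute, level_mw in minute_levels.items():
        sp_start = minute.replace(
            minute=(minute.minute // 30) * 30, second=0, microsecond=0
        )
        totals[sp_start] += level_mw / 60.0  # one minute at level_mw MW
    return dict(totals)


records = [
    {"start": datetime(2023, 4, 1, 0, 0), "end": datetime(2023, 4, 1, 0, 30),
     "level_mw": 100.0, "notified_at": datetime(2023, 3, 31, 22, 0)},
    # A later notification lowers output for ten minutes within the same SP.
    {"start": datetime(2023, 4, 1, 0, 10), "end": datetime(2023, 4, 1, 0, 20),
     "level_mw": 40.0, "notified_at": datetime(2023, 3, 31, 23, 50)},
]
energy = half_hourly_mwh(latest_level_per_minute(records))
print(energy)
```

In this example the SP contains 20 minutes at 100 MW and 10 minutes at 40 MW, giving (20 × 100 + 10 × 40) / 60 = 40 MWh.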
Following development of this method, it was validated by testing several days of aggregated “live” physical data against the historic generation per generator once the historic data had been published. Generally, this data reconciled relatively well, particularly for conventional generators with high levels of control over their generation (e.g. nuclear or gas-fired power stations). However, some larger discrepancies were found for some of the wind farms included in the dataset. These discrepancies were raised with a subject matter expert in this area, who clarified that they would largely be due to poor forecasting, i.e. some wind farms consistently predicting that they will generate at 100% of their installed capacity rather than updating their FPNs with their internal generation forecasts. It is worth noting that the incentives for these generators to provide accurate forecasts to the BMRS are currently weak, and as a result anyone interested in future generation is likely to have to invest in creating their own forecasts.
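A comparison of this kind could be quantified per generator with a simple error metric. The sketch below uses mean absolute percentage error as one possible choice; the source does not state which metric was used, and the function name and all figures are made-up placeholders rather than project results.

```python
def mean_absolute_percentage_error(live_mwh, historic_mwh):
    """MAPE (in %) over periods present in both series, skipping zero actuals."""
    errors = [
        abs(live_mwh[period] - actual) / abs(actual)
        for period, actual in historic_mwh.items()
        if period in live_mwh and actual != 0
    ]
    return 100.0 * sum(errors) / len(errors) if errors else None


# Placeholder values keyed by settlement period, not real generation data.
steady_unit = mean_absolute_percentage_error(
    live_mwh={1: 200.0, 2: 210.0}, historic_mwh={1: 200.0, 2: 200.0}
)
overforecast_unit = mean_absolute_percentage_error(
    live_mwh={1: 100.0, 2: 100.0}, historic_mwh={1: 100.0, 2: 50.0}
)
print(steady_unit, overforecast_unit)
```

A unit that always notifies its full installed capacity would show a persistently high error on low-wind days, which is the pattern described above.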
In the data pipeline, once cleaned, the most recent live generation data from the past few days is appended to the already aggregated historic generation data. The generator location information from the PSD is then joined to finalise the dataset, which is output in CSV format ready for further analysis. The published dataset can be accessed by anyone interested in using it, for example to create live mapping visualisations of electricity generation in the UK. A conscious decision was made not to build a new energy map, as several great visualisation platforms for UK energy data already exist. However, an example of what this visualisation could look like can be found here.
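The final assembly step might look roughly like the sketch below. The field names, the unit identifier and all values are hypothetical, and the rule that a historic row replaces a live row for the same unit and period is an assumption made for this example rather than a confirmed detail of the pipeline.

```python
import csv
import io

FIELDNAMES = ["unit", "period", "mwh", "source", "latitude", "longitude"]


def combine_generation(historic, live):
    """Merge rows keyed by (unit, period); historic rows win on overlap."""
    combined = {(row["unit"], row["period"]): row for row in live}
    combined.update({(row["unit"], row["period"]): row for row in historic})
    return [combined[key] for key in sorted(combined)]


def attach_locations(rows, locations):
    """Add latitude and longitude to each row, leaving blanks if unknown."""
    blank = {"latitude": "", "longitude": ""}
    return [{**row, **locations.get(row["unit"], blank)} for row in rows]


def to_csv(rows):
    """Serialise the rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, not real generation or location data.
historic = [
    {"unit": "EXAMPLE-1", "period": "2023-04-01T00:00", "mwh": 50.0, "source": "historic"},
]
live = [
    {"unit": "EXAMPLE-1", "period": "2023-04-01T00:00", "mwh": 48.0, "source": "live"},
    {"unit": "EXAMPLE-1", "period": "2023-04-08T00:00", "mwh": 52.0, "source": "live"},
]
locations = {"EXAMPLE-1": {"latitude": 51.0, "longitude": -1.0}}

rows = attach_locations(combine_generation(historic, live), locations)
output = to_csv(rows)
print(output)
```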
Automating the pipeline and future development avenues
The pipeline was written to query the BMRS on a half-hourly basis and request updates from the PSD once per week. To automate both update cycles, GitHub Actions were deployed that automatically execute the relevant Python scripts at the preset intervals. With this process, the pipeline has been running successfully since the beginning of April 2023.
The project also highlighted several future areas where additional contributions to create a live generation dataset for the UK could be made:
Integration of solar PV and other embedded generation: Currently the dataset only includes data from generators which are part of the balancing mechanism, namely larger generators connected directly to the UK’s electricity transmission network. Future iterations could attempt to integrate data about embedded generation or Solar PV, e.g. that available from Sheffield Solar, to provide a more comprehensive picture.
Integrate more accurate wind forecasts for the wind farms with the worst forecasts: This would provide the benefit of increasing the accuracy of the individual windfarm’s live generation.
Add wind curtailment data to the dataset: Curtailment of wind energy can easily be derived from the BOAL data; hence, it could be added to the dataset with relative ease.
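As an illustration of the curtailment idea in the last item, the sketch below estimates curtailed energy for one SP as the shortfall between the notified level (FPN) and a lower accepted level (BOAL). This is a simplified assumption for illustration, treating both levels as constant over the half hour; the function name `curtailed_mwh` is hypothetical.

```python
def curtailed_mwh(fpn_mw, boal_mw, hours=0.5):
    """Energy curtailed when an acceptance holds output below the FPN.

    Assumes constant levels over the period; an acceptance at or above
    the FPN counts as no curtailment.
    """
    return max(fpn_mw - boal_mw, 0.0) * hours


print(curtailed_mwh(100.0, 60.0))   # 20.0
print(curtailed_mwh(100.0, 120.0))  # 0.0
```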
Outcome and Learnings
From delivering this project, I discovered several useful learning points that could inform future projects to develop OS energy data analytics or data science use cases:
Open-source mappings between different IDs are invaluable, but hard to maintain and publicise without strong examples of how they can be used in practice. By publicising this project, I’m hoping to raise awareness of their existence.
When building foundational open-source data tools it’s important to provide documentation focused on end-users as the existing documentation is often aimed at industry experts rather than novices. This, in turn, creates barriers for innovators or individuals newly entering the field.
Data documentation that focuses on individual datasets is insufficient – most value comes from combining datasets, and the nuances around this are often not captured. Whilst invaluable insight can be gained from networking with individuals already involved in the OS energy analytics space, developers should dedicate time to documenting their newly gained knowledge for others following in their footsteps.
Skills required to productionise solutions, as well as resourcing for ongoing support and maintenance of OS projects, need to be considered: they represent a risk to the long-term success of developed open-source solutions.