The Energy Data Taskforce, led by Laura Sandys and Energy Systems Catapult, was tasked with investigating how the use of data could be transformed across our energy system.
In June 2019, the Energy Data Taskforce set out five key recommendations for modernising the UK energy system via an integrated data and digital strategy. The report highlighted that the move to a modern, digitalised energy system was being hindered by often poor quality, inaccurate or missing data, while valuable data is often hard to find.
As a follow up to the Taskforce’s findings the Department for Business, Energy and Industrial Strategy, Ofgem and Innovate UK have commissioned Energy Systems Catapult to develop Data Best Practice Guidance to help organisations understand how they can manage and work with data in a way that delivers the vision outlined by the Energy Data Taskforce. To do this, the Catapult has engaged with a large number of stakeholders to gather views and opinions about what best practice means for those working with data across the energy sector and beyond. Below is the first draft of the guidance that we are seeking feedback on. The guidance is still very much in development and there are more events planned in January 2020 (dates TBC) for stakeholders to test and develop the guidance.
The draft guidance presented below is designed to help organisations and individuals understand how to best manage data in a way that supports the Energy Data Taskforce recommendations and accelerates the transition towards a modern, digitalised energy system that enables net zero.
We are seeking feedback to help develop the guidance and would appreciate your input. If you would like to make comments then please contact us by email by the 15th of January.
We have a very large number of stakeholders so we are asking for input to be specific and actionable. Please specify which section your comment relates to, the existing passage you would like to change and the proposed alternative. This will help us to review and integrate your feedback effectively. Note, the guidance is still under development and will be proofread before final publication, so please provide feedback on content rather than typos.
Latest Update: 03/01/2019
Data Best Practice Guidance
This guidance describes a number of key outcomes that, taken together, are deemed to be ‘data best practice’. Each description is accompanied by more detailed guidance that describes how the desired outcome can be achieved. In some areas the guidance is very specific, presenting a solution which can be implemented easily. In other areas the guidance is less prescriptive. This may be because there are many possible ‘best practice’ solutions (e.g. understanding user needs) or because there is a disadvantage to providing prescriptive guidance (e.g. cyber security). Where this is the case the guidance provides organisations with useful information that can be used to inform the implementation of a solution.
The Data Best Practice outcomes are:
- Datasets should be accurately described with industry standard metadata
- Data, Metadata and supporting information should use common terms
- Data custodians should ensure that datasets are discoverable by potential users
- Data Custodians should ensure datasets have the supporting information required to make the data understandable for potential users
- Data Custodians should seek to learn and understand the needs of their current and prospective data users
- Identify the types of roles played by stakeholders of the data
- Ensure data quality improvement is prioritised by user needs
- Data relating to common assets is Presumed Open
- Presumed Open data should go through Open Data Triage, conducted by the data custodian
- Data should be interoperable with other data and digital services
- Protect data and systems in accordance with Security, Privacy and Resilience best practice
- Ensure that data is stored and archived in such a way to maximise sustaining value
The guidance touches on many issues which have existing regulation or authoritative guidance, such as personal data protection and security. In these areas this guidance should be seen as complementary rather than competitive. The guidance includes references to many existing resources and includes key extracts of their content where licensing allows.
This guidance has been designed to help organisations implement the vision of a Modern, Digitalised Energy System which is described in the Energy Data Taskforce report including for those looking to implement ‘Presumed Open’. However, the guidance has been designed as far as possible to be sector agnostic so it can have wider value to organisations beyond the energy sector. We expect there to be particularly strong read across for other organisations managing infrastructure and other regulated sectors.
The Data Best Practice Guidance is a living resource and will be regularly updated to reflect the changing technology and regulatory landscape. If you have a suggestion or comment then we would like to hear from you, so please get in touch at email@example.com
1. Data should be accurately described with industry standard metadata
To realise the maximum value from data within an organisation, across an industry or across the economy, actors need to be able to understand basic information that describes each dataset. To make this information accessible, the descriptive information should be structured in an accepted format and it should be possible to make that descriptive information available independently from the underlying dataset.
Metadata is a dataset that describes and gives information about another dataset.
The Energy Data Taskforce recommended that the Dublin Core ‘Core Elements’ metadata standard (Dublin Core) ISO 15836-1:2017 should be adopted for metadata across the energy sector. Dublin Core is a well established standard for describing datasets and has many active users across a number of domains, including energy sector users such as the UKERC Energy Data Centre. It has a small number of key fields which provide a minimum level of description that can be built upon and expanded as required.
There are 15 ‘core elements’ as part of the Dublin Core standard which are described as follows:
Many of the fields are straightforward to populate but others are open to interpretation. In the table below we have listed best practice tips to help provide consistency across organisations.
Title: This should be a short but descriptive name for the resource.
- Be specific – generic titles will make useful resources harder to find e.g. ‘Humidity and temperature readings for homes in Wales’ is better than ‘sensor data’
- Be unique – avoid reusing existing titles where possible to help users find the required resources more effectively
- Be concise – ideally less than 60 characters in order to optimise search engine display but many storage solutions will have an upper character limit (e.g. UKERC Energy Data Centre limit to 100 characters)
- Avoid repetition or stuffing – use the other metadata fields for keywords and longer descriptions
- Title is different from filename – use the title to describe your resource in a way that other users will understand e.g. use spaces rather than underscores
Creator: Identify the creator(s) of the resource, individuals or organisations.
- Creator – a creator is a primary entity which generated the resource which is being shared, this can be the same as the publisher
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Subject: Identify the key themes of the resource.
- Keywords – select keywords (or terms) which are directly related to the resource
- Glossary – select subject keywords from an agreed vocabulary e.g. UKERC Energy Data Centre uses the IEA energy balance definitions
Description: Provide a description of the resource which can be read and understood by the range of potential users.
- Overview – the description should start with a high level overview which enables any potential user to quickly understand the context and content of the resource
- Accessible – use language which is understandable to the range of potential users, avoiding jargon and acronyms where possible
- Accuracy – ensure that the description objectively and precisely describes the resource, peer review can help identify potential issues
- Quality and Limitations – include detail about perceived quality of the resource and any known limitations or issues
- Core Supporting Information – ensure that any core supporting information is referenced in the description
Publisher: Identify the organisation or individual responsible for publishing the data. This is usually the same as the metadata author.
- Publisher – a publisher is an entity which is making the resource available to others, this can be the same as the creator or contributor
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Contributor: Identify the contributor(s) of the resource, individuals or organisations.
- Contributor – a contributor is an entity which provided input to the resource which is being shared, this can be the same as the publisher
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Date: This element is used in a number of different ways (start of development/collection, end of development/collection, creation of resource, publication of resource, etc.), so the usage of a date field should be explained in the resource description. Where data collection is concerned it may be useful to state the latest date of collection, as more recent data may be of greater interest. However, the nature and potential use cases of the data will dictate the most useful use of this field.
- Standardisation – the use of a standardised date format is recommended e.g. ISO 8601 timestamps
Identifier: Provide a unique identifier for the resource.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
Source: Identify the source material(s) from which the resource is derived.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
Relation: Identify other resources related to the resource.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
- Supporting Information – Where additional resources are required to understand the resource these should be referenced
Coverage: Identify the spatial or temporal remit of the resource.
- Standardisation – utilising standard, authoritative spatial identifiers is recommended e.g. UPRN, USRN, LSOA, Country Code, etc.
Rights: Specify the licence conditions under which the resource is controlled.
- Essential – this field is vitally important because without a clear articulation of user rights the resource is useless; a resource with no entry in this field cannot be treated as open
- Common terms – where possible use standard licence terms e.g. Creative Commons by Attribution etc.
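Some of the tips above can be checked automatically before metadata is published. The following is a minimal sketch in Python, not a definitive validator: the function name `check_metadata`, the lower-case field names and the choice of which tips to check are assumptions made for illustration, and the record is assumed to be a simple dictionary of strings.

```python
from datetime import date

# Assumption: taken from the "ideally less than 60 characters" tip for titles.
MAX_TITLE_LENGTH = 60


def check_metadata(record: dict) -> list[str]:
    """Return a list of warnings for a Dublin Core style metadata record."""
    warnings = []

    title = record.get("title", "")
    if not title:
        warnings.append("title is missing")
    else:
        if len(title) >= MAX_TITLE_LENGTH:
            warnings.append("title is 60 characters or longer")
        if "_" in title:
            warnings.append("title contains underscores; it may be a filename")

    # Without a licence entry, users cannot tell what they are allowed to do.
    if not record.get("rights"):
        warnings.append("rights is missing")

    # Only calendar dates are checked here; full timestamps would need
    # datetime.fromisoformat instead.
    if "date" in record:
        try:
            date.fromisoformat(record["date"])
        except ValueError:
            warnings.append("date is not an ISO 8601 calendar date")

    return warnings


# Hypothetical record, used only to exercise the checks.
print(check_metadata({"title": "sensor_data_final", "date": "01/06/2019"}))
```

Checks like these only catch mechanical problems. Whether a title is specific or a description is accessible still needs human review.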
The Dublin Core metadata should be stored in a file that is independent from the original data, in a machine readable format such as JSON, YAML or XML that can easily be presented in a human readable form using free text editors, for example Notepad++. This approach ensures that metadata can be shared independently from the dataset, that it is commonly accessible and that it is not restricted by software compatibility. The DCMI have provided schemas for representing Dublin Core in XML and RDF which may be of help.
The Department for Business, Energy and Industrial Strategy has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. Note, this data does not have a URI or DOI so a platform ID has been used as the identifier. This is not ideal as it may not be globally unique.
{
  "title": "Solar PV cost data",
  "creator": "Department for Business, Energy and Industrial Strategy",
  "subject": "solar power station [http://www.electropedia.org/], energy cost [http://www.electropedia.org/]",
  "description": "Experimental statistics. Dataset contains information on the cost per kW of solar PV installed by month by financial year. Data is extracted from the Microgeneration Certification Scheme – MCS Installation Database.",
  "publisher": "Department for Business, Energy and Industrial Strategy",
  "contributor": "Microgeneration Certification Scheme",
  "coverage": "Great Britain (GB)",
  "rights": "Open Government Licence v3.0"
}
Note, the data.gov.uk platform holds metadata for files in JSON format but the above has been simplified for the purposes of providing an example.
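To illustrate storing metadata independently from the data, the sketch below writes a metadata record to a separate JSON file alongside the data file. It is one possible approach under stated assumptions: the function name `write_sidecar`, the `.metadata.json` naming convention and the data file name are hypothetical and are not prescribed by Dublin Core or by this guidance.

```python
import json
import tempfile
from pathlib import Path


def write_sidecar(data_path: Path, metadata: dict) -> Path:
    """Write metadata next to the data file as <name>.metadata.json."""
    sidecar = data_path.with_name(data_path.stem + ".metadata.json")
    # ensure_ascii=False keeps characters such as en dashes human readable.
    sidecar.write_text(
        json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar


# Demonstration in a temporary directory; the data file name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    record = {
        "title": "Solar PV cost data",
        "rights": "Open Government Licence v3.0",
    }
    path = write_sidecar(Path(tmp) / "solar_pv_costs.csv", record)
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```

Because the metadata lives in its own file, it can be published in a catalogue even when the dataset itself cannot be shared.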
2. Data custodians should ensure that datasets are discoverable by potential users
The value of data can only be realised when it is possible for potential users to identify what datasets exist and understand how they could utilise them effectively. Data custodians should implement a strategy which makes their data inclusively discoverable by a wide range of stakeholders within and outside of their organisation. There may be instances where it is not advisable to make it known that a dataset exists but this is expected to be exceptionally rare e.g. in cases of national security.
Discoverable: The ability for data to be easily found by potential users
Please note, there is a difference between a dataset being discoverable and being accessible. In some cases a dataset may be too sensitive to be released widely, but the description of the dataset (metadata) can almost certainly be made available without any sensitivity issues. This visibility can provide significant value by increasing awareness of what data exists. For example, an advertising agency’s dataset that describes personal details about individuals cannot be made openly available without the explicit consent of each subject in the data. However, many of the advanced advertising products that provide consumers with more relevant products and services would be significantly less effective if the advertising platform could not explain to potential advertisers that they can use particular data features to target their advertisements.
Metadata can be used to describe the contents and properties of a dataset. Because metadata does not contain the underlying data, it is possible to make it open (i.e. published with no access or usage restrictions) in all but the rarest of cases without creating security, privacy, commercial or consumer impact issues.
Metadata can be published by individual organisation initiatives or by collaborative industry services. Individual organisations may choose to host their own catalogue of metadata and/or participate with industry initiatives.
Where data is made available via services such as a website (open, public or shared), organisations can choose to mark up datasets to make them visible to data-centric search engines and data harvesting tools. The Schema.org vocabulary is becoming an increasingly popular way to structure embedded data (such as recipes) but it can also be used to describe datasets. Structured markup has similarities to formal metadata but should not be seen as a replacement for standardised metadata.
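As an illustration of structured markup, the sketch below builds a Schema.org `Dataset` description as JSON-LD, wrapped in a script element that could be embedded in a web page. The mapping from Dublin Core fields to Schema.org properties is an assumption made for this example rather than an agreed crosswalk, and the function name `dataset_jsonld` is hypothetical. Schema.org expects `license` to be a URL or a CreativeWork, so a licence URL would be preferable to the plain licence name used here.

```python
import json


def dataset_jsonld(metadata: dict) -> str:
    """Map a few Dublin Core style fields onto Schema.org Dataset properties."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": metadata["title"],
        "description": metadata["description"],
        "creator": {"@type": "Organization", "name": metadata["creator"]},
        "publisher": {"@type": "Organization", "name": metadata["publisher"]},
        "license": metadata["rights"],
    }
    # Values are assumed not to contain "</script>"; escape them if they might.
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(doc, indent=2)
        + "\n</script>"
    )


example = {
    "title": "Solar PV cost data",
    "description": "Cost per kW of solar PV installed by month by financial year.",
    "creator": "Department for Business, Energy and Industrial Strategy",
    "publisher": "Department for Business, Energy and Industrial Strategy",
    "rights": "Open Government Licence v3.0",
}
print(dataset_jsonld(example))
```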
Search Engine Optimisation
Search engines are likely to be the way in which most users will discover datasets. It is therefore the responsibility of the data custodian to make sure that data is presented in a way that search engines can find and index. Most major search engines provide guides explaining how to ensure that the correct pages and content appear in ‘organic searches’ (e.g. Microsoft, Google) and a range of organisations provide Search Engine Optimisation (SEO) services. The Geospatial Commission have released a guide to help organisations optimise their websites to enable search engines to identify and surface data.
Direct stakeholder engagement is a powerful tool to drive interest into new or underutilised datasets. Additionally, this technique may be used when there is a specific use case or challenge which the data custodian is seeking to address.
The Department for Business, Energy and Industrial Strategy has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. This dataset is published under an Open Government Licence so the data can be accessed by all. The data custodians have registered the dataset with the UK Government open data portal Data.gov.uk. This makes the metadata publicly available (albeit only via the API) in JSON, a machine readable format that can easily be presented in a human readable form. The portal additionally provides search engine optimisation to surface the results in organic search and uses webpage markup to make the data visible to dataset specific search engines.
Note, the discoverability actions taken above are related to a dataset which is publicly available but could also be used for a metadata stub entry in a data catalogue.
3. Data, Metadata and supporting information should use common terms
It is critically important for data users to be able to search for and utilise similar datasets across organisations. A key enabler of this is finding a common way to describe the subject of data, including in formal metadata. This requires a common glossary of terms.
There has been a proliferation of glossaries within the energy sector, with each new document or data store providing a definitive set of definitions for the avoidance of doubt. The list below is provided as an example.
- IEC Electropedia
- IEA Energy Categories
- Electricity Distribution Standard Licence Conditions
- Electricity Generation Standard Licence Conditions
- Electricity Interconnector Standard Licence Conditions
- Gas Interconnector Standard Licence Conditions
- Gas Shipper Standard Licence Conditions
- Gas Supplier Standard Licence Conditions
- Electricity Ten Year Statement definitions
Equally, the same has been occurring across other sectors and domains. The following sources are all data related.
There are currently efforts to standardise the naming conventions used across a range of infrastructure domains by the Digital Framework Task Group as part of the National Digital Twin programme of work. The long term goal is to define an ontology which enables different sectors to use a common language which enables effective cross sector data sharing.
In the near term, it is unhelpful to create yet another glossary so we propose a two staged approach.
- Organisations label data with keywords and the authoritative source of their definition e.g. Term [Glossary Reference]
- An industry wide Data Catalogue should be implemented with an authoritative glossary based on existing sources which can be expanded or adapted with user feedback and challenge
Implementing a standard referencing protocol enables organisations to understand terms when they are used and slowly converge on to a unified subset of terms where there is common ground. Implementing an industry wide data catalogue with agreed glossary and mechanism for feedback enables the convergence to be accelerated and discoverability of data revolutionised.
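The ‘Term [Glossary Reference]’ labelling convention can be read by software as well as by people. The sketch below parses a subject string of that shape into pairs of term and glossary reference, which a catalogue could then use to see which glossaries are in use. The regular expression and the function name `parse_subject` are assumptions made for illustration; the convention itself does not define a formal grammar.

```python
import re
from collections import Counter

# A term is any run of text without brackets or commas, followed by a
# glossary reference in square brackets.
TERM_PATTERN = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]+)\]")


def parse_subject(subject: str) -> list[tuple[str, str]]:
    """Split 'Term [Glossary Reference], ...' into (term, reference) pairs."""
    return TERM_PATTERN.findall(subject)


subject = (
    "solar power station [http://www.electropedia.org/], "
    "energy cost [http://www.electropedia.org/]"
)
pairs = parse_subject(subject)
print(pairs)

# Counting references shows which glossaries a set of records relies on.
print(Counter(reference for _, reference in pairs))
```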
4. Data Custodians should ensure datasets have the supporting information required to make the data understandable for potential users
When data is published openly, made publicly available or shared with a specific group it is critical that the data has any supporting information that is required to make the data useful for potential users. There is a need to differentiate between Core Supporting Information, without which the data could not be understood by anyone, and Additional Supporting Information that makes understanding the data easier. As a rule of thumb, if the original custodian of the dataset were to stop working with it and then come back 10 years later with the same level of domain expertise, but without the advantage of having worked with the data on a regular basis, the Core Supporting Information is that which they need to make the dataset intelligible. It is reasonable to expect that Core Supporting Information should be made available with the dataset.
Data custodians should consider the following topics for areas where Core Supporting Information may be required:
- Data collection methodology
- Data structure description (e.g. data schema)
- Granularity (spatial, temporal, etc.) of the data
- Units of measurement
- Version number of any reference data that has been used (e.g. the Postcode lookup reference data)
- References to raw source data (within metadata)
- Protocols that have been used to process the data
It may be possible to minimise the required Core Supporting Information if the dataset uses standard methodologies and structures (e.g. ISO 8601 timestamps) but it should not be assumed that an externally hosted reference dataset or document will be enduring unless it has been archived by an authoritative, sustainable body (e.g. ISO, BSI, UK Data Archive, UKERC Energy Data Centre, etc.). If there is doubt that a key reference data source or document will be available in perpetuity this should be archived by the publisher and, where possible, made available as supporting information.
Depending on the goals of the organisation sharing data, it may be prudent to also include additional supporting information. Reasons to make additional supporting information available could include:
- Maximising user engagement with the dataset
- Addressing a particular user need
- To reduce the number of subsequent queries about the dataset
- To highlight a particular issue or challenge which the data publisher would like to drive innovators towards
Core Supporting Information
The UK Energy Research Centre (UKERC) hosts the Energy Data Centre, which was set up “to create a hub for information on all publically funded energy research happening in the UK”. The centre hosts and catalogues a large amount of data which is collected by or made available to aid energy researchers. However, much of the data is made available using an open licence (e.g. Creative Commons Attribution 4.0 International License), which enables data use for a wide range of purposes, including research. The centre aims to archive data for future researchers and, as such, the administrators have embedded a range of data best practice principles, including the provision of the core supporting information that is required to understand the stored dataset. The custodians of the Energy Data Centre provided the ‘10 year’ rule of thumb described above.
The Energy Data Centre hosts the Local Authority Engagement in UK Energy Systems data and associated reports. The data is accompanied by rich metadata and a suite of core supporting information that enables users to understand the data. In this case, the core information includes:
- Core Supporting Information
- Description of the source datasets
- Description of the fields and their units
- ReadMe.txt files with a high level introduction to the associated project
- Core and Additional Supporting Information
- A list of academic reports about the data and associated findings
The list of academic reports contains some core supporting information (e.g. the detailed collection methodology), but much of the content is additional supporting information.
Additional Supporting Information
When an organisation has a particular goal in mind it may be prudent to include additional supporting information that enables the maximum number of users to engage with the dataset. A good example of this is data science competitions, which are commonly used in industry to solve particular problems by drawing on a large number of experts.
A recent Kaggle competition asked participants to utilise sensor data to identify faults in power lines. To maximise engagement, the hosts provided high level overviews of the problem domain, more detailed explanations of datasets and offered additional advice as required through question and answer sessions that were widely published. This information was not strictly essential for an expert to understand the data, but the underlying goal of the project was to attract new talent to the area. If the potential participants could not easily understand the data they would likely move on to another lucrative project.
- Core Supporting Information
- Descriptions of datasets and fields (metadata)
- Core and Additional Supporting Information
- Description of the problem in non technical language
- Description of common problems and how to identify them
- Question and Answer feeds
5. Data Custodians should seek to learn and understand the needs of their current and prospective data users
Digital connectivity and data are enabling a wealth of new products and services across the economy and creating new data users outside of the traditional sector silos. In order to maximise the value of data it is vital custodians develop a deep understanding of the spectrum of their users and their differing needs such that datasets can be designed to realise the maximum value for consumers.
Data custodians should develop a deep understanding of a range of topics.
- Who is using your dataset or service?
- Who would like to use your dataset or service?
- Who should be using your dataset or service?
- For organisations with many individual users, it may be helpful to group users into categories or create personas to represent users with similar needs and objectives.
- For the purposes of user research, a data user could be an individual, an organisation or a persona.
- What outcome does each user hope to achieve by using the data or service?
- Individual users may have multiple objectives; it can be helpful to rank these by importance / impact or set constraints, especially where there are competing needs
- Users will have a range of needs driven by all sorts of factors including, but not limited to differing objectives, existing data / systems and technical capability.
- Users may exhibit conflicting needs or provide different combinations of needs depending on their range of use cases and desired outcomes.
- Data custodians should be considering the different types of needs:
- explicit needs: derived from how users describe what they are trying to do
- implicit needs: those that are not expressed and that users are sometimes not aware of, but that are evident from observation
- created needs: where a user has to do something because it is required by the service
- In addition, needs can be categorised into the following groups:
- high-level needs – for example: ‘I need to understand the data so that I don’t use it incorrectly’
- needs – for example: ‘I need to trust the data so I can defend my decision’
- detailed needs – for example: ‘I need to know how reliable the data is, so that I can provide caveats if needed’
- User needs may include requirements for:
- Data Granularity (time, space, subject)
- Data Accuracy or Precision (how closely does the data reflect reality)
- Data Timeliness and Consistency (duration between data creation and access)
- Functionality and simplicity of access (file download, API requests, etc.)
- Reliability (system availability over time)
- Stability (consistency over time)
- Agility (the ability to adapt to changing needs)
- Linkability (joining to other datasets)
- How do the objectives of the users compare to the objectives of the data custodian?
- Does meeting the need of the user provide value to the end customer?
- Where there are conflicting objectives between the data custodian and user, does one provide significantly more value to the consumer?
- It is important to recognise that some needs will require more time and effort to address than others
- What is the realistic time in which the need can be addressed?
- Is it possible to address all user needs with one delivery or are multiple iterations / versions needed?
There are a range of methods which organisations can use to elicit the needs of current and potential users.
- Direct Engagement
- Interviews – detailed research with representative users
- Workshops – broad research with groups of users (or potential users)
- Usability Testing – feedback from service users
- Monitoring – tracking usage of a live service
- Feedback forms – user initiated feedback on services
- Innovation Projects – generating new user needs through novel work
- Knowledge Sharing – collaboration between organisations with similar user types
- Direct Requests – prospective user needs
The Government Digital Service have published advice on the topic of user research within their service manual.
One approach that can be used to gain input from potential users is to convene a workshop where individuals from different backgrounds can come together to discuss the challenges they face and the needs that these create. The needs of users can then be formulated as structured user stories, which can subsequently be used to identify trends:
As a Role given that Situation I need Requirement so that Outcome
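Capturing user stories in a structured form makes it easier to spot trends across a workshop. The sketch below is one way this could be done, assuming each story is recorded with the four parts of the template above. The class name `UserStory`, the helper `requirement_counts` and the example stories are hypothetical and are not taken from any real workshop.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    role: str
    situation: str
    requirement: str
    outcome: str

    def render(self) -> str:
        """Render the story using the template from the guidance."""
        return (
            f"As a {self.role} given that {self.situation} "
            f"I need {self.requirement} so that {self.outcome}"
        )


def requirement_counts(stories: list[UserStory]) -> Counter:
    """Count how often each requirement appears across the stories."""
    return Counter(story.requirement for story in stories)


# Hypothetical stories for illustration only.
stories = [
    UserStory("network planner", "demand is changing",
              "half-hourly load data", "I can size new assets"),
    UserStory("researcher", "I am modelling heat demand",
              "half-hourly load data", "I can validate my model"),
    UserStory("supplier", "tariffs are being redesigned",
              "regional price data", "I can compare offers"),
]
print(stories[0].render())
print(requirement_counts(stories).most_common(1))
```

Grouping on the exact requirement text is deliberately simple. In practice similar requirements would need to be normalised or tagged before counting.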
6. Identify the types of roles played by stakeholders of the data
The data custodian should understand the subject of the data as well as those that interact with the data.
Data Custodian: An organisation or individual which holds data
Data Subject: The identified or identifiable living individual to whom personal data relates.
Data Controller: A person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data.
Data Processor: A person, public authority, agency or other body which processes personal data on behalf of the controller.
Definitions from ICO
“To determine whether you are a controller or processor, you will need to consider your role and responsibilities in relation to your data processing activities. If you exercise overall control of the purpose and means of the processing of personal data – ie, you decide what data to process and why – you are a controller. If you don’t have any purpose of your own for processing the data and you only act on a client’s instructions, you are likely to be a processor – even if you make some technical decisions about how you process the data.” ICO
- Examples: https://ico.org.uk/media/for-organisations/documents/1546/data-controllers-and-data-processors-dp-guidance.pdf
7. Ensure data quality improvement is prioritised by user needs
Data quality is subjective: a dataset may be perfectly acceptable for one use case but entirely inadequate for another. Data accuracy can be more objective but there remain many instances where the required precision differs across use cases.
Data is not perfect. Even the most diligent organisation that makes the greatest effort to collect and disseminate the highest quality data cannot guarantee enduring accuracy or foresee all the potential future needs of data users. Data custodians are therefore faced with an ongoing task to identify quality and accuracy limitations which can be improved over time. This is a particular challenge for the owners and operators of infrastructure who have the task of deploying assets which will be in situ for many years (often decades). The data needs of potential users are almost certain to change dramatically within the lifespan of the asset and the ability to recollect information is often challenging (especially for buried assets).
Note: Organisations should not see data quality as a barrier to opening datasets. Potential users may find the quality acceptable for their use, find ways to handle the quality issues or develop ways to solve issues which can improve the quality of the underlying data.
Data is most useful when it is accurate and trustworthy. In many cases, data will be used to store information which can objectively be categorised as right or wrong, e.g. customer addresses, asset serial numbers, etc.
Data controllers should make reasonable efforts to ensure that data is accurate and rectify issues quickly when they are identified
This is in line with the GDPR accuracy requirements for personal data. Organisations should utilise master data management techniques to validate inputs, monitor consistency across systems and rectify issues quickly when they are identified by internal or external stakeholders.
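As a minimal sketch of the input validation and cross-system consistency checks described above, the following Python assumes two hypothetical systems that each hold asset records keyed by serial number. The serial number pattern, field names and function names are illustrative assumptions and are not prescribed by this guidance.

```python
import re

# Assumed format for an asset serial number: two capital letters then six digits.
SERIAL_PATTERN = re.compile(r"^[A-Z]{2}\d{6}$")


def validate_record(record):
    """Return a list of problems found in a single asset record."""
    problems = []
    if not SERIAL_PATTERN.match(record.get("serial", "")):
        problems.append("invalid serial number")
    if not record.get("postcode", "").strip():
        problems.append("missing postcode")
    return problems


def find_inconsistencies(system_a, system_b):
    """Compare two systems keyed by serial and list the fields that differ."""
    differences = {}
    for serial in system_a.keys() & system_b.keys():
        a, b = system_a[serial], system_b[serial]
        changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
        if changed:
            differences[serial] = changed
    return differences


# Hypothetical records held in two separate systems.
asset_register = {"AB123456": {"serial": "AB123456", "postcode": "B4 6BS"}}
billing_system = {"AB123456": {"serial": "AB123456", "postcode": "B4 7XX"}}

print(validate_record({"serial": "123", "postcode": ""}))
print(find_inconsistencies(asset_register, billing_system))
```

Checks of this kind could be run on data entry and on a schedule, with any reported differences passed to whoever is responsible for rectifying them.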
Beyond accuracy, data custodians should consider how they can iteratively improve the quality of data.
Data custodians should seek to improve data quality in a way that responds to the needs of users
Not all data will be of sufficient quality for all users and in some cases significant investment may be required to rectify shortcomings. Data controllers should consider if the insufficient quality is due to lack of quality in the underlying data source (e.g. the sensor data is not precise or frequent enough), subsequent processing of the data (e.g. aggregation, rounding, etc.) or technical choices (e.g. deleting data due to storage constraints). Data custodians should propose how they could increase the quality of the data and indicate the time and cost requirements. Where possible, the prospective data user should be given the opportunity to propose alternative solutions.
Where simple, cost effective improvements can be made within existing constraints these should be actioned. Where incremental funding is required, the data custodian should consider the benefit to wider stakeholders and users before considering the funding options available (organisational investment, innovation funding, data user collaboration, data user fees, funding request, etc.).
An organisation collects and holds data about companies which operate in a sector in order to help potential innovators find suppliers or collaboration partners. The data that is held is made available on a website and can be queried by users. A potential user searches for an organisation and finds that a company has been incorrectly categorized. The data publishers have provided a contact form which enables the user to submit the correction which is subsequently verified by the data publisher before the dataset is updated.
A data user wishes to develop an application which enables public transport users to select options based on their impact on air pollution in a city. The public transport provider makes a range of data available about the various modes of transport including routes travelled, vehicle type, average emissions and passenger numbers. However, the data user has identified that the emissions data is not of a sufficient quality for their use case for the following reasons:
- The average emissions data field is not consistently populated
- There is likely to be a variation between the average emission output and the real impact on local areas
The issues highlighted above are quite different in nature and have different potential solutions.
- The public transport provider can check the dataset for missing values and populate based on the vehicle type and known specification
- Monitoring real emissions from each vehicle is likely to require new equipment and create a large amount of cost
- An alternative option could be to use existing static air quality monitoring sites and data analytics to estimate the impact in certain conditions.
Having proposed a quick solution to point 1 and an alternative solution for point 2 the data user can continue with their use case.
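The quick solution to the first issue (populating missing average emissions from the vehicle type and known specification) could look something like the sketch below. The field names and the specification values are made-up placeholders for illustration only, and flagging inferred values is an assumption about what users would find helpful.

```python
# Placeholder specification values (grams of CO2 per km), not real figures.
SPEC_EMISSIONS_G_PER_KM = {"diesel_bus": 822.0, "hybrid_bus": 580.0, "tram": 0.0}


def fill_missing_emissions(rows, spec=SPEC_EMISSIONS_G_PER_KM):
    """Populate missing average emissions from the vehicle type specification.

    Filled values are flagged so users can tell measured from inferred data.
    Rows with an unknown vehicle type are left unpopulated.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the source data is not modified
        new_row["emissions_inferred"] = False
        if new_row.get("avg_emissions") is None and new_row["vehicle_type"] in spec:
            new_row["avg_emissions"] = spec[new_row["vehicle_type"]]
            new_row["emissions_inferred"] = True
        filled.append(new_row)
    return filled


routes = [
    {"route": "1", "vehicle_type": "diesel_bus", "avg_emissions": 810.0},
    {"route": "2", "vehicle_type": "hybrid_bus", "avg_emissions": None},
    {"route": "3", "vehicle_type": "trolleybus", "avg_emissions": None},
]
cleaned = fill_missing_emissions(routes)
for row in cleaned:
    print(row)
```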
8. Data relating to common assets is Presumed Open
Data relating to common assets should be open unless there are legitimate issues which would prevent this. Legitimate issues include Privacy, Security, Negative Consumer Impact, Commercial, and Legislation and Regulation. It is the responsibility of the data controller to ensure that issues are effectively identified and mitigated where appropriate. It is recommended that organisations implement a robust Open Data Triage process.
Common Assets are defined as a resource (physical or digital) that is essential to or dependent on common shared infrastructure
In cases where there has been data processing applied to raw data (e.g. Issue mitigation, data cleaning, etc.) it is considered best practice for the processing methodology or scripts to be made available as core supporting information in order to maximise the utility of the data to users
Data relating to common assets should be open unless there are legitimate issues.
9. Presumed Open data should go through Open Data Triage, conducted by the data custodian
The triage process considers themes such as privacy, security, commercial and consumer impact issues. Where the decision is for data not to be made open the data controller will: share the rationale for this; identify sensitivity mitigation options and deliver these if user needs require; maximise sharing of the mitigation protocols and desensitised version of the data. Users of the data should have reasonable opportunity to challenge decisions and have a point of escalation where agreement between data users and data controllers cannot be reached.
Identification of Discrete Datasets
The goal of open data triage is to identify where issues exist that would prevent the open publication of data in its most granular format and address them in a way that maintains as much value as possible. In order to make this process manageable for the data controller, the first step in this process is:
Identify thematic, usable datasets that can be joined if required rather than general data dumps
In this context, we define a ‘thematic, usable dataset’ as a discrete collection of data which relates to a focused, coherent topic but provides enough information to be of practical use. e.g. 1 year of time series monitoring of a defined range of metrics for a particular category of technical asset, where each asset is identified and described either within the dataset or through unique linkage to a related dataset.
The approach described above minimises the risk that the size and complexity of datasets results in issues not being correctly identified. It also reduces the risk that an issue in one part of the dataset results in the whole dataset being made less open or granular, therefore maximising the amount of useful data that is openly available in its most granular form. For example, providing complete output from a data warehouse in one data dump could contain information about consumers, employees, financial performance, company KPIs, etc., all of which would present issues that would mean the data needs to be modified or the openness reduced. By contrast, extracting tables (or parts of tables) from the data warehouse would provide a more granular level of control which enables individual issues to be identified and addressed accordingly, which would in turn maximise the data which is made openly available.
Identification of Issues
Once a thematic, usable dataset has been identified the data controller should assess the dataset to identify if there are any issues which would prevent the open publication of the data in its most granular format.
Identify the potential issues which might limit the openness or granularity of dataset
In the table below, we outline a range of issue categories which should be carefully considered. Some of these categories will directly relate to triage processes which already exist in organisations but others may require the adaptation of existing processes or creation of new processes to provide a comprehensive solution.
Companies and organisations that own and operate infrastructure should already have a risk identification and mitigation program to support the protection of Critical National Infrastructure (CNI). The Centre for the Protection of National Infrastructure (CPNI) have advice and guidance for organisations involved in the operation and protection of CNI.
Outside of CNI, organisations should assess the incremental security risks that could be created through the publication of data. Organisations should consider personnel, physical and cyber security when identifying issues and identify if the issue primarily impacts the publishing organisation or if it has wider impacts. Issue identification should take into account the existing security protocols that exist within an organisation and flag areas where the residual risk (after mitigation) is unacceptably high.
Note, where the information contained within a dataset is already publicly available via existing means (such as publicly available satellite imagery) the security issue assessment should consider the incremental risk of data publication using the existing situation as the baseline.
Organisations should consider how the dataset could be used to drive outcomes that would negatively impact consumers by enabling manipulation of markets, embedding bias into products or services, incentivising of actions which are detrimental to decarbonisation of the system, etc.
Commercial data relating to the private administration of a business (HR, payroll, employee performance, etc.) is deemed to be private information and is a legitimate reason for data to be closed, although organisations may choose to publish for their own reasons such as reporting or corporate social responsibility (CSR).
Data which does not relate to the administration of the business but has been collected or generated through actions which are outside of the organisation’s legislative or regulatory core obligations and funded through private investment may also have legitimate reason to be closed. This description may include the data generated through innovation projects but consideration should be given to the source of funding and any data publication or sharing requirements this might create.
Where an organisation is a regulated monopoly, special consideration should also be given to the privileged position of the organisation and the duty to enable and facilitate competition within their domain.
Where datasets contain IP belonging to other organisations or where the data has been obtained under a licence which would restrict onward publishing this should also be identified. Note, the expectation is that organisations should be migrating away from restrictive licences / terms and conditions that restrict onward data publishing and sharing where possible.
Organisations should have legal and regulatory compliance processes which are able to identify and drive compliance with any obligations the company has.
Consideration should include:
- Utilities Act 2000
- Electricity Act 1989
- Gas Act 1986 / 1995
- Competition Act 1998
- Enterprise Act 2002
- Enterprise and Regulatory Reform Act 2013
- Data Protection Act 2018
- General Data Protection Regulation (GDPR)
Consider the impact of related or adjacent datasets
When assessing the sensitivity of data, thought should be given to the other datasets which are already publicly available and the issues which may arise from the combination of datasets. Organisations should consider where there are datasets outside of their control which, if published, could create issues which would need to be mitigated. Special consideration should be given to datasets which share a common key or identifier, this includes but is not limited to:
- subject reference (e.g. Passport Number),
- technical reference (e.g. Serial Number),
- time (e.g. UTC),
- space (e.g. Postcode or Property Identifier)
As new datasets are made available, markets develop and public attitudes change there may be a need to revise the original assessment. For example, a dataset which was initially deemed too sensitive to be released openly in its most granular form could be rendered less sensitive due to changes to market structure or change in regulatory obligations. Equally, a dataset which was published openly could become more sensitive due to the publication of a related dataset or technology development.
Mitigation of Issues
Where the assessment process identifies an issue, the aim should be to mitigate the issue through modification of data or reduced openness whilst maximising the value of the dataset for a range of stakeholders.
Mitigate issues through modification of data or reduced openness whilst addressing user needs
Open data with some redactions may be preferable to shared data without, but if redactions render the data useless then public or shared data may be better. In some cases, the objectives of the prospective data users might create requirements which cannot be resolved by a single solution so it may be necessary to provide different variations or level of access, for example providing open access to a desensitised version of the data for general consumption alongside shared access to the unadulterated data to a subset of known users.
Modification of Data
Modification of data can serve to reduce the sensitivity whilst enabling the data to be open. There are a wide variety of possible modifications of data which can be used to address different types of sensitivity.
An organisation has a licence condition to collect certain data about individual usage of national infrastructure. The data is collected about individual usage on a daily basis and could reveal information about individuals if it was to be released openly.
By removing identifying features such as granular location and individual reference it could be possible to successfully anonymise the data so that individuals cannot be re-identified and the data can be made openly available.
Simple anonymisation can be very effective at protecting personal data but it needs to be undertaken with care to minimise the risk of re-identification. Anonymisation techniques can be combined with other mitigation techniques to minimise this risk.
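A minimal sketch of this kind of simple anonymisation is shown below, assuming hypothetical field names ("customer_id", "name", "postcode"). It drops the individual reference and coarsens the location to the first part of the postcode. Removing fields in this way does not by itself guarantee that individuals cannot be re-identified, so the result would still need to be assessed.

```python
def anonymise(records, direct_identifiers=("customer_id", "name")):
    """Drop direct identifiers and coarsen the postcode to its outward part."""
    anonymised = []
    for record in records:
        # Copy every field except the direct identifiers.
        safe = {k: v for k, v in record.items() if k not in direct_identifiers}
        if "postcode" in safe:
            # Keep only the outward part, e.g. "B4 6BS" becomes "B4".
            parts = safe.pop("postcode").split()
            safe["postcode_area"] = parts[0] if parts else ""
        anonymised.append(safe)
    return anonymised


# Hypothetical usage record.
usage = [
    {"customer_id": "C-1001", "name": "A. Example", "postcode": "B4 6BS", "daily_units": 12.4},
]
open_version = anonymise(usage)
print(open_version)
```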
An organisation collects information about how individuals use a privately built product or service (e.g. a travel planner). This data could be of great use for the purposes of planning adjacent systems (e.g. energy system or road network) but releasing the anonymised, granular data would give competitors a commercial advantage.
By introducing seemingly random noise into the dataset in a way that ensures that the data remains statistically representative but the detail of individuals is subtly altered, the data can be made available whilst reducing the commercial risk.
Introducing noise to data in a way that successfully obfuscates sensitive information whilst retaining the statistical integrity of the dataset is a challenging task that requires specialist data and statistics skills. Consideration needs to be given to the required distribution, which features the noise will be applied to and the consistency of application.
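To illustrate the idea only, the sketch below adds zero-mean Gaussian noise to a synthetic list of journey distances and checks that the mean is approximately preserved. The choice of a Gaussian distribution, the scale and the function name are assumptions made for this example. In practice the distribution and the features to perturb would need specialist design, and noise of this kind does not in itself provide a formal privacy or commercial guarantee.

```python
import random
import statistics


def add_noise(values, scale, seed=None):
    """Return a copy of values with zero-mean Gaussian noise added to each."""
    rng = random.Random(seed)  # a seed makes the example repeatable
    return [v + rng.gauss(0.0, scale) for v in values]


# Synthetic journey distances in km (not real data).
journeys_km = [float(i % 40 + 1) for i in range(5000)]
noisy = add_noise(journeys_km, scale=2.0, seed=42)

# Individual values change, but the aggregate statistic stays close.
print(statistics.mean(journeys_km), statistics.mean(noisy))
```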
An organisation operates a network of technical assets, some of which fail on occasion. If the data related to those assets was made available, innovators could help to identify patterns which predict outages before they occur and improve the network stability. However, the data could also be used to target an attack on the network at a point which is already actively under strain and cause maximum impact.
However, by introducing a sufficient delay between the data being generated and published the organisation can mitigate the risk of the data being used to attack the network whilst benefiting from
Delaying the release of data is a simple but effective method of enabling detailed information to be released whilst mitigating many types of negative impact. However, it may be necessary to combine this with other mitigation techniques to completely mitigate more complicated r
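A delayed release can be sketched as a simple filter on the timestamp of each record. The 24 hour delay, the field names and the function name below are assumptions for illustration. The appropriate delay would depend on the risk being mitigated.

```python
from datetime import datetime, timedelta, timezone

# Assumed publication delay, for illustration only.
PUBLICATION_DELAY = timedelta(hours=24)


def publishable(readings, now, delay=PUBLICATION_DELAY):
    """Return only the readings that are at least `delay` old at time `now`."""
    cutoff = now - delay
    return [r for r in readings if r["timestamp"] <= cutoff]


now = datetime(2020, 1, 15, 12, 0, tzinfo=timezone.utc)
readings = [
    {"asset": "S1", "timestamp": now - timedelta(hours=30), "load_kw": 410},
    {"asset": "S1", "timestamp": now - timedelta(hours=2), "load_kw": 455},
]
released = publishable(readings, now)
print(released)
```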
An organisation collects rich data from consumers which is highly valuable but sensitive (e.g. email content). The sensitivity of this data is very high but the potential for learning is also very high.
Differential privacy enables large amounts of data to be collected from many individuals whilst retaining privacy. Noise is added to individuals’ data which is then ingested by a model. As large amounts of data are combined the noise averages out and patterns can emerge. It is possible to design this process such that the results cannot be linked back to an individual user and privacy is preserved.
Differential privacy is an advanced technique but can be very effective. It is used by top technology firms to provide the benefits of machine learning but without the privacy impact that usually accompanies it.
Sharing a model can be a highly effective way of enabling parties to access the benefit of highly sensitive, granular data but without providing direct access to the raw information. However, this is an emerging area so carries some complexity and risk.
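The principle that noise added to each individual's data averages out across a large population can be illustrated with randomised response, one of the simplest local differential privacy mechanisms. It is used here purely as an illustration and is not necessarily the mechanism used in the example above. Each individual reports their true yes/no value only some of the time, yet the overall proportion can still be estimated. The population is synthetic and the function names are hypothetical.

```python
import random


def randomised_response(true_value, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return true_value
    return rng.random() < 0.5


def estimate_proportion(reports, p_truth):
    """Estimate the true proportion of True values from the noisy reports.

    The observed rate is p_truth * true_rate + (1 - p_truth) * 0.5, so the
    true rate is recovered by inverting that relationship.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(7)
true_bits = [i % 10 < 3 for i in range(20000)]  # synthetic population, 30% True
reports = [randomised_response(bit, 0.5, rng) for bit in true_bits]
estimate = estimate_proportion(reports, 0.5)
print(round(estimate, 3))
```

No single report reveals an individual's true value with certainty, but the estimate converges on the population proportion as the number of reports grows.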
Security / Legislation and Regulation
An organisation maintains data about a large number of buildings across the country and their usage. Within the dataset there are a number of buildings which are identified as Critical National Infrastructure (CNI) sites which are at particular risk of targeted attack if they are known.
In this case it is possible to simply redact the data for the CNI sites and release the rest of the dataset (assuming there is no other sensitivity). Note, this approach works here because the dataset is not complete and therefore it is not possible to draw a conclusion about a site which is missing from the data, it may simply have not been included.
Redaction is commonplace when publishing data as it is a very effective method of reducing risk. However, care needs to be taken to ensure that it is not possible to deduce something by the lack of data.
An organisation collects information about the performance of their private assets which form part of a wider system (e.g. energy generation output). This data could be of great use to the other actors within the system but releasing the data in its raw format may breach commercial agreements or provide competitors with an unfair advantage.
By aggregating the data (by technology, time, location or other dimension) the sensitivity can be reduced whilst maintaining some of the value of the data.
Aggregation is effective at reducing sensitivity but can significantly reduce the value of the data. It may be worth providing multiple aggregated views of the data to address the needs of a range of stakeholders.
Where aggregation is the only effective mechanism to reduce sensitivity organisations may want to consider providing access to aggregated data openly alongside more granular data that can be shared with restricted conditions.
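As a sketch of aggregation by a chosen set of dimensions, the Python below totals hypothetical generation output by technology and region. Suppressing groups with fewer than a minimum number of sites is an added assumption, included because a group containing a single asset would reveal that asset's raw value. The field names and threshold are illustrative.

```python
from collections import defaultdict


def aggregate_output(records, keys, min_group_size=3):
    """Total output by the given dimensions, suppressing small groups."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[k] for k in keys)].append(record["output_mw"])
    return {
        group: {"sites": len(values), "total_mw": sum(values)}
        for group, values in groups.items()
        if len(values) >= min_group_size  # assumed suppression rule
    }


# Hypothetical site-level data.
sites = [
    {"technology": "wind", "region": "north", "output_mw": 10.0},
    {"technology": "wind", "region": "north", "output_mw": 12.5},
    {"technology": "wind", "region": "north", "output_mw": 7.5},
    {"technology": "solar", "region": "north", "output_mw": 4.0},
]
print(aggregate_output(sites, keys=("technology", "region")))
print(aggregate_output(sites, keys=("region",)))
```

Calling the function with different `keys` gives the multiple aggregated views mentioned above without changing the underlying data.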
An organisation collects data on how their customers use a mobile product including when and where. This movement data could be of value to other organisations in order to plan infrastructure investment but the data reveals the patterns of individuals which cannot be openly published.
The initial step is to remove any identifying features (e.g. device IDs) and break the movement data into small blocks. Each block of movement can then be shifted in time and space such that the blocks cannot be reassembled to identify the movement patterns of individuals. This means that realistic, granular data can be shared but the privacy of individuals can be protected.
Shifting or rotating data can be useful to desensitise spatial or temporal data. However, it is important to recognise context to ensure that the data makes sense and cannot be easily reconstructed. For example, car journey data will almost always take place on roads and therefore rotation can make the data nonsensical and it can be pattern matched to the underlying road network with relative ease.
An organisation may be the custodian of infrastructure data relating to a number of sensitive locations such as police stations or MoD buildings. The data itself is of use for a range of purposes but making it openly available could result in security impacts.
Randomising the data (generating arbitrary values) relating to the sensitive locations (rather than redaction) could reduce the sensitivity such that it can be open.
Randomisation can be very effective to reduce sensitivity but it is also destructive so impacts the quality of the underlying data.
Negative Consumer Impact
An organisation collects data about usage of a product in order to diagnose problems and optimise performance. The data has wider use beyond the core purpose but the associated demographic data could result in bias towards certain groups.
By normalising the data (reducing variance and the ability to discriminate between points) it is possible to reduce the ability for certain factors to be used to differentiate between subjects and hence reduce types of bias.
Normalisation is a statistical technique that requires specialist skills to apply correctly. It may not be enough on its own to address all sensitivities so a multifaceted approach may be required.
Level of Access
When the mitigation techniques have been applied as appropriate the data custodian should consider how open the resulting data can be. Where the mitigation has been successful the data can be published openly for all to use. However, if the nature of the data means that it is only valuable in its most granular form it may be necessary to reduce openness but keep granularity.
The above table is based on the ODI data spectrum.
Balancing Openness, Modification and User Needs
A key factor to consider is the needs of the potential data users. Initially, there may be value in providing aggregated summaries of data which can be made entirely open, but as new use cases and user needs emerge we may find that access to more granular data is required, which necessitates a more sophisticated mitigation technique or a more granular version of the data which is shared less openly. In some cases, it may be prudent to make multiple versions of a dataset available to serve the needs of a range of users.
Issues that are identified should be clearly documented. Where issues have been mitigated through reduced openness or data modification, the mitigation technique should also be clearly documented.
After the triage process has taken place the resulting data should be made available in line with the triage outcome accompanied by any supporting information and metadata. Where issues have been identified and mitigated this should be documented and made available ideally with the processing methodology or script such that potential users can understand the modifications to the data. This provides transparency, promotes consistency across the sector and enables challenge.
Publish descriptions of issues and mitigation actions alongside data
Challenge and Review
Once a dataset and accompanying material has been published the data custodian should ensure that they have a process to regularly review the datasets that have been made available to ensure they are correct, relevant and there are no new issues arising. Equally, there should be a process for those outside of the publishing organisation to challenge the granularity, format and level of openness of any data which is published or publicised.
Assessment of issues should be an ongoing activity which can be interrogated and challenged
Substation Connection Capacity
Identification of Discrete Datasets
The dataset that has been identified is substation connection capacity. This represents the likely connection headroom at every substation within an electricity distribution network’s area. Monitoring data (substation and smart meters) is used to measure the current demand and network data is used to provide the maximum capacity.
Note – this is a scenario for the purposes of providing an example and does not represent an operational network monitoring solution.
Identification of Issues
Substation connection capacity is not personal data. However, if the dataset includes total capacity and used or available capacity then in the rare case where a substation serves a single customer this could be an individual’s private data.
Analysis will take place before the data is released to identify any substations which serve a single premises. If these cases materialise the sensitive fields will be redacted and the data will be displayed as available capacity only to avoid personal data issues.
If the data represents the live status of the network then a bad actor could use this information to inform an attack
Data can be delayed by 24 hours such that it is not possible to determine live status of the network
If an actor used the data to opportunistically request connections that utilise all of the available capacity it could drive up costs for future users, including for new housing, a cost which could be passed on to the consumer.
Ensure there is a process in place to stop actors stockpiling capacity and distribute costs fairly
None – the network is a monopoly player with the duty to deliver an efficient, competitive system
GDPR (if personal data is involved)
See privacy mitigation.
Due to the ability to mitigate all of the issues above the data can be open.
The metadata description will include details about how the dataset is generated, the processing which has taken place to remove private data and the delay will be clearly stated. If the privacy issue is more prominent and a data pipeline is required then the methodology will be documented and made available alongside the data.
Data will be published as a csv file with core supporting information which describes the fields and provides units. The data will be hosted on an open data access platform on the company website and metadata will be registered with open data catalogues.
Challenge and Review
A feedback form will be provided on the website to enable challenge.
An innovator has indicated that for their specific use case they need to be able to access the data in near real time. In order to make this possible the decision has been made to provide a shared access data dashboard and API such that verified actors can gain access to the data in near real time. This has wider value for the technicians and engineers within the company as it provides high quality, low latency data which they can use in their roles.
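The privacy mitigation and csv publication steps from the example above could be sketched as follows. Substations that serve a single premises are published with available capacity only, and all other rows keep the total and used capacity fields. The field names, identifiers and figures are hypothetical and, as with the scenario itself, this does not represent an operational solution.

```python
import csv
import io

FIELDS = ["substation_id", "available_kva", "total_kva", "used_kva"]


def prepare_rows(substations):
    """Build publishable rows, redacting detail for single-premises substations."""
    rows = []
    for s in substations:
        row = {
            "substation_id": s["substation_id"],
            "available_kva": s["total_kva"] - s["used_kva"],
        }
        # Total and used capacity are only published where more than one
        # premises is served, so an individual's usage is not exposed.
        if s["premises_served"] > 1:
            row["total_kva"] = s["total_kva"]
            row["used_kva"] = s["used_kva"]
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialise rows to csv text, leaving redacted fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


substations = [
    {"substation_id": "SUB-001", "total_kva": 500, "used_kva": 320, "premises_served": 140},
    {"substation_id": "SUB-002", "total_kva": 200, "used_kva": 150, "premises_served": 1},
]
csv_text = to_csv(prepare_rows(substations))
print(csv_text)
```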
10. Data should be interoperable with other data and digital services
Data is most useful when it can be shared, linked and combined with ease.
Interoperability (Data): enabling data to be shared and ported with ease between different systems, organisations and individuals
Data custodians should, directed by user needs, ensure that their data is made available in a way that minimises friction between systems through the use of standard interfaces, standard data structures, reference data matching or other methods as appropriate. Wherever possible, the use of cross sector and international standards is advised.
Standard Data Structures
Data structure standardisation is a common method of aligning data across organisations and enabling seamless portability of data between systems. This method provides robust interoperability between systems and, if the standard has been correctly adhered to, enables entire data structures to be ported between systems as required. However, standardisation of data structures can be expensive, time consuming and require significant industry, regulatory or government effort to develop the standard if one does not already exist.
Data interfaces can also be standardised. This means that the formal channels of communication are structured in a standard way which enables systems to ‘talk’ to one another with ease. This approach has the advantage of being very quick to implement as interfaces can be implemented as required and, providing there is robust documentation, a single organisation can define a ‘standard’ interface for their users without the need for cross sector agreement as they can be easily developed and evolved over time. However, this approach is limited in that a new interface needs to be developed or deployed for each type of data that needs to be shared. Additionally, in sectors where there are a few powerful actors they can use interface standardisation to create siloed ecosystems which reduce portability.
Reference Data Matching
Matching datasets back to reference data spines can be a useful method to enable non-standardised data to be joined with relative ease. This approach can provide a minimum level of interoperability and linkability across datasets without the need for full standardisation. However, it does require the users to learn about each new dataset rather than being able to understand the data from the outset with standard data structures.
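A minimal sketch of reference data matching is given below, assuming a hypothetical spine keyed by a property identifier. Records that match are enriched with the reference fields and records that do not match are returned separately so that they can be investigated. The key name, field names and values are illustrative.

```python
def match_to_spine(records, spine, key="property_id"):
    """Join records to a reference spine, returning (matched, unmatched)."""
    matched, unmatched = [], []
    for record in records:
        reference = spine.get(record.get(key))
        if reference is None:
            unmatched.append(record)
        else:
            # Enrich the record with the reference fields from the spine.
            matched.append({**record, **reference})
    return matched, unmatched


# Hypothetical reference spine and dataset.
spine = {"P-001": {"region": "North"}}
meter_installs = [
    {"property_id": "P-001", "meter_type": "smart"},
    {"property_id": "P-999", "meter_type": "legacy"},
]
matched, unmatched = match_to_spine(meter_installs, spine)
print(matched)
print(unmatched)
```

Two datasets that have each been matched to the same spine can then be joined on the shared identifier even though their own structures differ.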
Standard Data Structures
Electricity network data is essential for a number of emerging energy system innovations including the successful integration of a highly distributed, renewables dominated grid. However, the network is broken into a number of areas which are operated by different organisations that have implemented different data structures to manage their network data (power flow model, GIS and asset inventory). The Common Information Model (CIM) is the common name for a series of IEC standards (IEC 61970 and IEC 61968) which standardise the data structure for electricity network data. The deployment of the CIM standards enables network operators to provide third parties with access to their data in a standard form which enables innovation to be rolled out across network areas with relative ease.
11. Protect data and systems in accordance with Security, Privacy and Resilience best practice
Ensure data and systems are protected appropriately and comply with all relevant data policies, legislation, and Security, Privacy and Resilience (SPaR) best practice principles.
Data Custodians should consider:
- How to protect data appropriately
- How the release of data could impact the security of systems
- How the systems which are used to release data are made secure
- How to ensure compliance with related policy, legislation and regulation
A number of frameworks, standards and regulation exist which provide organisations with implementable guidance on the topic of SPaR such as:
- Cyber Assessment Framework (CAF)
- ISO 27000 – Information Security Management Systems
- IEC 62443 – Industrial Network and System Security
- IEEE C37.240-2014 – Cybersecurity Requirements for Substation Automation, Protection, and Control Systems
- PAS185 – Smart Cities. Specification for establishing and implementing a security-minded approach.
- Network and Information Systems Directive (NISD)
- General Data Protection Regulation (GDPR)
In addition, there is a wealth of advice available to organisations through the following organisations:
- BEIS and Ofgem – The joint competent authorities for downstream gas and electricity
- National Cyber Security Centre (NCSC)
- Centre for the Protection of National Infrastructure (CPNI)
- International Electrotechnical Commission (IEC)
- Institute of Electrical and Electronics Engineers (IEEE)
- International Organization for Standardization (ISO)
- British Standards Institution (BSI)
12. Ensure that data is stored and archived in such a way to maximise sustaining value
Organisations should consider the way in which data is stored or archived in order to ensure that potential future value is not unnecessarily limited. This includes ensuring that the storage solution is specified in such a way to ensure that the risk of data being lost due to technical difficulties is minimal. Technical storage solutions which offer component and geographical redundancy are commonplace, with many cloud providers offering data resilience as standard. In addition to technical resilience the data custodian should ensure that data is not unduly aggregated or curtailed in a way that limits future value. Where possible the most granular version of the data should be stored for future analysis but there may be cases where the raw data is too large to store indefinitely. Where this occurs the data custodian should consider how their proposed solution (aggregation, limiting retention window, etc.) would impact future analysis opportunities.
Where data does not have value to the original data custodian the information could be archived with a trusted third party e.g. UKERC Energy Data Centre, UK Data Archive, etc. to ensure that it continues to have sustaining value.
The UK Smart Meter Implementation Programme is responsible for rolling out digitally connected meter points to all UK premises with the goal of providing an efficient, accurate means of measuring consumption alongside a range of other technical metrics. Electricity distribution networks can request to gain access to this data in order to inform their planning and optimise operation. The personal nature of smart meter data means that organisations have to protect individual consumer privacy and are therefore choosing to aggregate the data at feeder level before it is stored. This approach provides the network with actionable insight for the current configuration of the network.
However, network structure is not immutable. As demand patterns change and constraints appear, network operators may need to upgrade or reconfigure their network to mitigate problems. The decision to aggregate the data means that it is not possible to use the granular data to simulate the impact of splitting the feeder in different ways to more effectively balance demand across the new network structure. In addition, it means that the historic data cannot be used for modelling and forecasting going forwards. Finally, the data ingest and aggregation processes need to be updated to ensure that any future data is of value.
Northern Powergrid (a UK Electricity Distribution Network) has recently proposed to store the smart meter data that it collects in a non-aggregated format but strictly enforce that data can only be extracted and viewed in an aggregated format. This approach is novel in that it protects the privacy of the consumer whilst retaining the flexibility and value of the data.
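The general pattern of storing granular data but only allowing aggregated extraction can be sketched as below. This is an illustration of the idea under stated assumptions and is not a description of the proposal referred to above. The class name, the minimum group size and the meter identifiers are hypothetical.

```python
class MeterStore:
    """Holds granular readings but only exposes aggregated totals."""

    def __init__(self, min_meters=3):
        self._readings = []  # granular data, never returned directly
        self._min_meters = min_meters  # assumed minimum meters per group

    def add(self, meter_id, kwh):
        self._readings.append((meter_id, kwh))

    def total_by_group(self, grouping):
        """Total consumption per group, where grouping maps meter_id to a label.

        Groups with fewer than the minimum number of meters are suppressed.
        """
        totals, meters = {}, {}
        for meter_id, kwh in self._readings:
            group = grouping.get(meter_id)
            if group is None:
                continue
            totals[group] = totals.get(group, 0.0) + kwh
            meters.setdefault(group, set()).add(meter_id)
        return {g: t for g, t in totals.items() if len(meters[g]) >= self._min_meters}


store = MeterStore(min_meters=3)
for number in range(1, 7):
    store.add(f"m{number}", float(number))

# The same stored data can be re-aggregated for a proposed feeder split.
split = {"m1": "F1a", "m2": "F1a", "m3": "F1a", "m4": "F1b", "m5": "F1b", "m6": "F1b"}
print(store.total_by_group(split))
```

Because the granular readings are retained, a different `grouping` can be supplied later to model a reconfigured network, which is not possible when data is aggregated before storage.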