Data Best Practice Guidance

Published: 22 January 2020

Introduction

The Energy Data Taskforce, led by Laura Sandys and Energy Systems Catapult, was tasked with investigating how the use of data could be transformed across our energy system.

In June 2019, the Energy Data Taskforce set out five key recommendations for modernising the UK energy system via an integrated data and digital strategy. The report highlighted that the move to a modern, digitalised energy system was being hindered by often poor quality, inaccurate or missing data, while valuable data is often hard to find.

As a follow up to the Taskforce’s findings the Department for Business, Energy and Industrial Strategy, Ofgem and Innovate UK have commissioned Energy Systems Catapult to develop Data Best Practice Guidance to help organisations understand how they can manage and work with data in a way that delivers the vision outlined by the Energy Data Taskforce. To do this, the Catapult has engaged with a large number of stakeholders to gather views and opinions about what best practice means for those working with data across the energy sector and beyond. Below is the first draft of the guidance that we are seeking feedback on. The guidance is still very much in development and there are more events planned in January 2020 (dates TBC) for stakeholders to test and develop the guidance.

The draft guidance presented below is designed to help organisations and individuals understand how to best manage data in a way that supports the Energy Data Taskforce recommendations and accelerates the transition towards a modern, digitalised energy system that enables net zero.

We are seeking feedback to help develop the guidance and would appreciate your input. If you would like to make comments then please contact us by email by the 15th of January.

We have a very large number of stakeholders so we are asking for input to be specific and actionable. Please specify which section your comment relates to, the existing passage you would like to change and the proposed alternative. This will really help us to review and integrate your feedback effectively. Note that the guidance is still under development and will be proofread before final publication, so please provide feedback on content rather than typos.


Latest Update: 03/01/2020


Data Best Practice Guidance

The Energy Data Taskforce, led by Laura Sandys and Energy Systems Catapult, was tasked with investigating how the use of data could be transformed across our energy system. In June 2019, the Energy Data Taskforce set out five key recommendations for modernising the UK energy system via an integrated data and digital strategy. The report highlighted that the move to a modern, digitalised energy system was being hindered by often poor quality, inaccurate or missing data, while valuable data is often hard to find. As a follow up to the Taskforce’s findings the Department for Business, Energy and Industrial Strategy (BEIS), Ofgem and Innovate UK have commissioned the Energy Systems Catapult to develop Data Best Practice Guidance to help organisations understand how they can manage and work with data in a way that delivers the vision outlined by the Energy Data Taskforce.

This guidance describes a number of key outcomes that taken together are deemed to be ‘data best practice’. Each description is accompanied by more detailed guidance that describes how the desired outcome can be achieved. In some areas the guidance is very specific, presenting a solution which can be implemented easily. In other areas the guidance is less prescriptive; this may be because there are many possible ‘best practice’ solutions (e.g. understanding user needs) or because there is a disadvantage to providing prescriptive guidance (e.g. cyber security). Where this is the case the guidance provides organisations with useful information that can be used to inform the implementation of a solution.

The Data Best Practice outcomes are:

  1. Data should be accurately described with industry standard metadata
  2. Data custodians should ensure that datasets are discoverable by potential users
  3. Data, Metadata and supporting information should use common terms
  4. Data Custodians should ensure datasets have the supporting information required to make the data understandable for potential users
  5. Data Custodians should seek to learn and understand the needs of their current and prospective data users
  6. Identify the types of roles played by stakeholders of the data
  7. Ensure data quality improvement is prioritised by user needs
  8. Data relating to common assets is Presumed Open
  9. Presumed Open data should go through Open Data Triage, conducted by the data custodian
  10. Data should be interoperable with other data and digital services
  11. Protect data and systems in accordance with Security, Privacy and Resilience best practice
  12. Ensure that data is stored and archived in such a way to maximise sustaining value

The guidance touches on many issues which have existing regulation or authoritative guidance, such as personal data protection and security. In these areas this guidance should be seen as complementary rather than competitive. The guidance references many existing resources and reproduces key extracts of their content where licensing allows.

This guidance has been designed to help organisations implement the vision of a Modern, Digitalised Energy System which is described in the Energy Data Taskforce report including for those looking to implement ‘Presumed Open’. However, the guidance has been designed as far as possible to be sector agnostic so it can have wider value to organisations beyond the energy sector. We expect there to be particularly strong read across for other organisations managing infrastructure and other regulated sectors.

The Data Best Practice Guidance is a living resource and will be regularly updated to reflect the changing technology and regulatory landscape. If you have a suggestion or comment, then we would like to hear from you, so please get in touch via energydata@es.catapult.org.uk

1. Data should be accurately described with industry standard metadata

To realise the maximum value from data within an organisation, across an industry or across the economy, actors need to be able to understand the basic information that describes each dataset. To make this information accessible, it should be structured in an accepted format and it should be possible to make it available independently from the underlying dataset.

Metadata is a dataset that describes and gives information about another dataset.

The Energy Data Taskforce recommended that the Dublin Core ‘Core Elements’ metadata standard (Dublin Core) ISO 15836-1:2017 should be adopted for metadata across the Energy sector. Dublin Core is a well-established standard for describing datasets and has many active users across a number of domains, including energy sector users such as the UKERC Energy Data Centre. It has a small number of key fields which provide a minimum level of description that can be built upon and expanded as required.

There are 15 ‘core elements’ in the Dublin Core standard, described as follows:

  • Title – Name given to the resource
  • Creator – Entity primarily responsible for making the resource
  • Subject – Topic of the resource (e.g. keywords from an agreed vocabulary)
  • Description – Account of the resource
  • Publisher – Entity responsible for making the resource available
  • Contributor – Entity responsible for making contributions to the resource
  • Date – Point or period of time associated with an event in the lifecycle of the resource
  • Type – Nature or genre of the resource
  • Format – File format, physical medium, or dimensions of the resource
  • Identifier – Compact sequence of characters that establishes the identity of a resource, institution or person alone or in combination with other elements (e.g. URI or DOI)
  • Source – Related resource from which the described resource is derived (e.g. source URI or DOI)
  • Language – Language of the resource (selected language(s) from an agreed vocabulary e.g. ISO 639-2 or ISO 639-3)
  • Relation – Related resource (e.g. related item URI or DOI)
  • Coverage – Spatial or temporal topic of the resource, spatial applicability of the resource, or jurisdiction under which the resource is relevant
  • Rights – Information about rights held in and over the resource

Core descriptions from the Dublin Core Metadata Initiative (DCMI), licensed via CC BY 3.0 – edits or additions are made in italics.

Many of the fields are straightforward to populate but others are open to interpretation. In the table below we have listed best practice tips to help provide consistency across organisations.

Element

Description 

Title

This should be a short but descriptive name for the resource.

  • Be specific – generic titles will make useful resources harder to find e.g. ‘Humidity and temperature readings for homes in Wales’ is better than ‘sensor data’
  • Be unique – avoid reusing existing titles where possible to help users find the required resources more effectively
  • Be concise – ideally less than 60 characters in order to optimise search engine display but many storage solutions will have an upper character limit (e.g. UKERC Energy Data Centre limit to 100 characters)
  • Avoid repetition or stuffing – use the other metadata fields for keywords and longer descriptions
  • Title is different from filename – use the title to describe your resource in a way that other users will understand e.g. use spaces rather than underscores

Creator

Identify the creator(s) of the resource, individuals or organisations.

  • Creator – a creator is a primary entity which generated the resource being shared; this can be the same as the publisher
  • Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number

Subject

Identify the key themes of the resource

  • Keywords – select keywords (or terms) which are directly related to the resource
  • Glossary – select subject keywords from an agreed vocabulary e.g. UKERC Energy Data Centre uses the IEA energy balance definitions

Description

Provide a description of the resource which can be read and understood by the range of potential users.

  • Overview – the description should start with a high level overview which enables any potential user to quickly understand the context and content of the resource
  • Accessible – use language which is understandable to the range of potential users, avoiding jargon and acronyms where possible
  • Accuracy – ensure that the description objectively and precisely describes the resource, peer review can help identify potential issues
  • Quality and Limitations – include detail about perceived quality of the resource and any known limitations or issues
  • Core Supporting Information – ensure that any core supporting information is referenced in the description

Publisher

Identify the organisation or individual responsible for publishing the data; this is usually the same as the metadata author.

  • Publisher – a publisher is an entity which is making the resource available to others; this can be the same as the creator or contributor
  • Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number

Contributor

Identify the contributor(s) of the resource, individuals or organisations.

  • Contributor – a contributor is an entity which provided input to the resource being shared; this can be the same as the publisher
  • Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number

Date

Date is used in a number of different ways (start of development/collection, end of development/collection, creation of resource, publication of resource, etc.), so the usage of the date field should be explained in the resource description. Where data collection is concerned it may be useful to state the latest date of collection, as more recent data may be of greater interest. However, the nature and potential use cases of data will dictate the most useful use of this field.

  • Standardisation – the use of a standardised date format is recommended e.g. ISO 8601 timestamps

Type

Identify the type of the resource from the DCMI type vocabulary:

Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text

Format

Identify the format of the resource – in the case of data this is the file or encoding format. e.g. csv, JSON, API, etc.  

Identifier

Provide a unique identifier for the resource

  • Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use

Source

Identify the source(s) material of the derived resource

  • Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use

Language

Identify the language of the resource

  • Vocabulary – Standard vocabularies such as ISO 639-2 or ISO 639-3 provide a good way to avoid ambiguity

Relation

Identify other resources related to the resource

  • Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
  • Supporting Information – Where additional resources are required to understand the resource these should be referenced

Coverage

Identify the spatial or temporal remit of the resource

  • Standardisation – utilising standard, authoritative spatial identifiers is recommended e.g. UPRN, USRN, LSOA, Country Code, etc.

Rights

Specify under which licence conditions the resource is controlled.

  • Essential – this field is vitally important; without a clear articulation of usage rights the resource cannot be used with confidence, and leaving the field blank does not make the data open
  • Common terms – where possible use standard licence terms e.g. Creative Commons Attribution etc.

 

The Dublin Core metadata should be stored in a file independent from the original data, in a machine-readable format such as JSON, YAML or XML that can easily be presented in a human-readable form using free text editors, for example Notepad++. This approach ensures that metadata can be shared independently from the dataset, that it is commonly accessible and that it is not restricted by software compatibility. The DCMI have provided schemas for representing Dublin Core in XML and RDF which may be of help.

Worked Example

The Department for Business, Energy and Industrial Strategy has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. Note, this data does not have a URI or DOI so a platform ID has been used as the identifier; this is not ideal as it may not be globally unique.

{
  "title": "Solar PV cost data",
  "creator": "Department for Business, Energy and Industrial Strategy",
  "subject": "solar power station [http://www.electropedia.org/], energy cost [http://www.electropedia.org/]",
  "description": "Experimental statistics. Dataset contains information on the cost per kW of solar PV installed by month by financial year. Data is extracted from the Microgeneration Certification Scheme – MCS Installation Database.",
  "publisher": "Department for Business, Energy and Industrial Strategy",
  "contributor": "Microgeneration Certification Scheme",
  "date": "30/05/2019",
  "type": "Dataset",
  "format": "XLSX",
  "identifier": "19c3b55d-32dc-4bb3-8141-32bb2175affc",
  "source": "https://certificate.microgenerationcertification.org/",
  "language": "English",
  "relation": "",
  "coverage": "Great Britain (GB)",
  "rights": "Open Government Licence v3.0"
}

Note, the data.gov.uk platform holds metadata for files in JSON format but the above has been simplified for the purposes of providing an example.
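Because the metadata is stored in a machine readable format, simple checks can be automated. Below is a minimal illustrative sketch (not a requirement of this guidance) showing how a metadata file such as the worked example above could be checked for the presence of the 15 Dublin Core elements using Python; the filename solar_pv_cost_metadata.json is hypothetical.

import json

# The 15 Dublin Core 'core elements' listed earlier in this guidance.
DUBLIN_CORE_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def check_metadata(path):
    """Report Dublin Core elements that are missing or empty in a metadata file."""
    with open(path, encoding="utf-8") as f:
        metadata = json.load(f)
    missing = [e for e in DUBLIN_CORE_ELEMENTS if e not in metadata]
    empty = [e for e in DUBLIN_CORE_ELEMENTS
             if isinstance(metadata.get(e), str) and not metadata[e].strip()]
    return missing, empty

if __name__ == "__main__":
    # Hypothetical filename holding the worked example metadata above.
    missing, empty = check_metadata("solar_pv_cost_metadata.json")
    print("Missing elements:", missing or "none")
    print("Empty elements:", empty or "none")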

References

 

2. Data custodians should ensure that datasets are discoverable by potential users.

The value of data can only be realised when it is possible for potential users to identify what datasets exist and understand how they could utilise them effectively. Data custodians should implement a strategy which makes their data inclusively discoverable by a wide range of stakeholders within and outside of their organisation. There may be instances where it is not advisable to make it known that a dataset exists but this is expected to be exceptionally rare e.g. in cases of national security.

Discoverable: The ability for data to be easily found by potential users

Please note, there is a difference between a dataset being discoverable and being accessible. In some cases a dataset may be too sensitive to be released widely, but the description of the dataset (its metadata) can almost certainly be made available without any sensitivity issues, and this visibility can provide significant value by increasing awareness of what data exists. For example, an advertising agency’s dataset describing personal details about individuals cannot be made openly available without the explicit consent of each subject in the data, but many of the advanced advertising products that provide consumers with more relevant products and services would be significantly less effective if the advertising platform could not explain to potential advertisers that particular data features can be used to target their advertisements.

Discoverability Techniques

Metadata Publication

Metadata can be used to describe the contents and properties of a dataset. In all but the rarest of cases it is possible to make metadata open (i.e. published with no access or usage restrictions) without creating security, privacy, commercial or consumer impact issues, because the metadata does not actually contain the underlying data.

Metadata can be published by individual organisation initiatives or by collaborative industry services. Individual organisations may choose to host their own catalogue of metadata and/or participate with industry initiatives.

Webpage Markup

Where data is made available via services such as a website (open, public or shared), organisations can choose to mark up datasets to make them visible to data-centric search engines and data harvesting tools. The Schema.org vocabulary is becoming an increasingly popular way to structure embedded data (such as recipes) but it can also be used to describe datasets. Structured markup has similarities to formal metadata but should not be seen as a replacement for standardised metadata.
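As a purely illustrative sketch (not a prescribed approach), the snippet below shows how a schema.org Dataset description could be generated and embedded in a web page as JSON-LD. The values reuse the Solar PV worked example from section 1; the licence URL and the surrounding page are assumptions.

import json

# Minimal schema.org Dataset description (JSON-LD), reusing the Solar PV worked example.
dataset_markup = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Solar PV cost data",
    "description": "Cost per kW of solar PV installed, by month and financial year.",
    "creator": {
        "@type": "Organization",
        "name": "Department for Business, Energy and Industrial Strategy",
    },
    "license": "http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/",
    "spatialCoverage": "Great Britain (GB)",
}

# Embed the description as a JSON-LD script block so that data-centric
# search engines and harvesting tools can discover the dataset.
print('<script type="application/ld+json">')
print(json.dumps(dataset_markup, indent=2))
print("</script>")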

Search Engine Optimisation

Search engines are likely to be the way in which most users will discover datasets. It is therefore the responsibility of the data custodian to make sure that data is presented in a way that search engines can find and index. Most major search engines provide guides explaining how to ensure that the correct pages and content appear in ‘organic searches’ (e.g. Microsoft, Google) and a range of organisations provide Search Engine Optimisation (SEO) services. The Geospatial Commission have released a guide to help organisations optimise their websites to enable search engines to identify and surface data.

Stakeholder Engagement

Direct stakeholder engagement is a powerful tool to drive interest into new or underutilised datasets. Additionally, this technique may be used when there is a specific use case or challenge which the data custodian is seeking to address.

Worked Example

The Department for Business, Energy and Industrial Strategy has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. This dataset is published under an Open Government Licence so the data can be accessed by all. The data custodians have registered the dataset with the UK Government open data portal Data.gov.uk, which makes the metadata publicly available (albeit only via the API) in JSON – a machine readable format that can easily be presented in a human readable form. The portal additionally provides search engine optimisation to surface the results in organic search and uses webpage markup to make the data visible to dataset specific search engines.

Note, the discoverability actions taken above are related to a dataset which is publicly available but could also be used for a metadata stub entry in a data catalogue.

References

 

3. Data, Metadata and supporting information should use common terms

It is critically important for data users to be able to search for and utilise similar datasets across organisations. A key enabler of this is finding a common way to describe the subject of data, including in formal metadata; this requires a common glossary of terms.

There has been a proliferation of glossaries within the energy sector, with each new document or data store providing its own definitive set of definitions for the avoidance of doubt; the list below is provided as an example.

Equally, the same has been occurring across other sectors and domains; the following sources are all data related.

There are currently efforts to standardise the naming conventions used across a range of infrastructure domains by the Digital Framework Task Group as part of the National Digital Twin programme of work. The long term goal is to define an ontology which allows different sectors to use a common language, enabling effective cross-sector data sharing.

In the near term, it is unhelpful to create yet another glossary, so we propose a two-stage approach.

  1. Organisations label data with keywords and the authoritative source of their definition e.g. Term [Glossary Reference]
  2. An industry wide Data Catalogue should be implemented with an authoritative glossary based on existing sources which can be expanded or adapted with user feedback and challenge

Implementing a standard referencing protocol enables organisations to understand terms when they are used and slowly converge onto a unified subset of terms where there is common ground. Implementing an industry wide data catalogue with an agreed glossary and a mechanism for feedback enables the convergence to be accelerated and the discoverability of data to be revolutionised.

Worked Example

“subject”:”solar power station [http://www.electropedia.org/], energy cost [http://www.electropedia.org/]”

References

 

4. Data Custodians should ensure datasets have the supporting information required to make the data understandable for potential users

When data is published openly, made publicly available or shared with a specific group it is critical that the data has any supporting information that is required to make the data useful for potential users. There is a need to differentiate between Core Supporting Information, without which the data could not be understood by anyone, and Additional Supporting Information that makes understanding the data easier. As a rule of thumb, if the original custodian of the dataset were to stop working with it and then come back 10 years later with the same level of domain expertise, but without the advantage of having worked with the data on a regular basis, the Core Supporting Information is that which they need to make the dataset intelligible. It is reasonable to expect that Core Supporting Information should be made available with the dataset.

Data custodians should consider the following topics for areas where Core Supporting Information may be required:

  • Data collection methodology
  • Data structure description (e.g. data schema)
  • Granularity (spatial, temporal, etc.) of the data
  • Units of measurement
  • Version number of any reference data that has been used (e.g. the Postcode lookup reference data)
  • References to raw source data (within metadata)
  • Protocols that have been used to process the data 

It may be possible to minimise the required Core Supporting Information if the dataset uses standard methodologies and structures (e.g. ISO 8601 timestamps) but it should not be assumed that an externally hosted reference dataset or document will be enduring unless it has been archived by an authoritative, sustainable body (e.g. ISO, BSI, UK Data Archive, UKERC Energy Data Centre, etc.). If there is doubt that a key reference data source or document will be available in perpetuity this should be archived by the publisher and, where possible, made available as supporting information.
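Where a data schema is provided as Core Supporting Information it also helps for the schema itself to be machine readable. The sketch below is purely illustrative (the field names, types and units are hypothetical) and shows one simple way a column-level schema could be published alongside a CSV file.

import json

# A minimal, hypothetical column-level schema for a CSV of half-hourly home sensor readings.
schema = {
    "fields": [
        {"name": "timestamp", "type": "datetime", "description": "Reading time, ISO 8601, UTC"},
        {"name": "home_id", "type": "string", "description": "Unique identifier for the monitored home"},
        {"name": "temperature", "type": "number", "unit": "degC", "description": "Indoor air temperature"},
        {"name": "humidity", "type": "number", "unit": "%RH", "description": "Indoor relative humidity"},
    ],
    "primaryKey": ["timestamp", "home_id"],
}

# Publish the schema alongside the data so users can interpret each column.
with open("humidity_temperature_readings.schema.json", "w", encoding="utf-8") as f:
    json.dump(schema, f, indent=2)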

Depending on the goals of the organisation sharing data, it may be prudent to also include additional supporting information. Reasons to make additional supporting information available could include:

  • Maximising user engagement with the dataset
  • Addressing a particular user need
  • To reduce the number of subsequent queries about the dataset
  • To highlight a particular issue or challenge which the data publisher would like to drive innovators towards

Worked Example

Core Supporting Information

The UK Energy Research Centre (UKERC) are the hosts of the Energy Data Centre, which was set up “to create a hub for information on all publicly funded energy research happening in the UK”. The centre hosts and catalogues a large amount of data which is collected by or made available to aid energy researchers; much of the data is made available under an open licence (e.g. the Creative Commons Attribution 4.0 International Licence), which enables data use for a wide range of purposes, including research. The centre aims to archive data for future researchers and as such the administrators have embedded a range of data best practice principles, including the provision of the core supporting information that is required to understand each stored dataset. The custodians of the Energy Data Centre provided the ’10 year’ rule of thumb described above.

The Energy Data Centre hosts the Local Authority Engagement in UK Energy Systems data and associated reports. The data is accompanied by rich metadata and a suite of core supporting information that enables users to understand the data. In this case, the core information includes:

  • Core Supporting Information
    • Description of the source datasets
    • Description of the fields and their units
    • ReadMe.txt files with a high level introduction to the associated project
  • Core and Additional Supporting Information
    • A list of academic reports about the data and associated findings

The list of academic reports contains some core supporting information (e.g. the detailed collection methodology), but much of the content is additional supporting information.

Additional Supporting Information

When an organisation has a particular goal in mind it may be prudent to include additional supporting information that enables the maximum number of users to engage with the dataset. A good example of this is data science competitions, which are commonly used in industry to solve particular problems by drawing on a large number of experts.

A recent Kaggle competition asked participants to utilise sensor data to identify faults in power lines. To maximise engagement, the hosts provided high level overviews of the problem domain, more detailed explanations of datasets and offered additional advice as required through question and answer sessions that were widely published. This information was not strictly essential for an expert to understand the data, but the underlying goal of the project was to attract new talent to the area. If the potential participants could not easily understand the data they would likely move on to another lucrative project.

  • Core Supporting Information
    • Descriptions of datasets and fields (metadata)
  • Core and Additional Supporting Information
    • Description of the problem in non technical language
    • Description of common problems and how to identify them
    • Question and Answer feeds

References

 

5. Data Custodians should seek to learn and understand the needs of their current and prospective data users

Digital connectivity and data are enabling a wealth of new products and services across the economy and creating new data users outside of the traditional sector silos. In order to maximise the value of data it is vital that custodians develop a deep understanding of the spectrum of their users and their differing needs, such that datasets can be designed to realise the maximum value for consumers.

Data custodians should develop a deep understanding of a range of topics.

Topic

Description 

Their current and potential data users

  • Who is using your dataset or service?
  • Who would like to use your dataset or service?
  • Who should be using your dataset or service?
  • For organisations with many individual users, it may be helpful to group users into categories or create personas to represent users with similar needs and objectives.
  • For the purposes of user research, a data user could be an individual, an organisation or a persona.

The goals of each current and potential data user

  • What outcome does each user hope to achieve by using the data or service?
  • Individual users may have multiple objectives; it can be helpful to rank these by importance / impact or set constraints, especially where there are competing needs

The user needs of each current and potential data user

  • Users will have a range of needs driven by all sorts of factors including, but not limited to differing objectives, existing data / systems and technical capability.
  • Users may exhibit conflicting needs or provide different combinations of needs depending on their range of use cases and desired outcomes.
  • Data custodians should be considering the different types of needs:
    • explicit needs: derived from how users describe what they are trying to do
    • implicit needs: those that are not expressed and that users are sometimes not aware of, but that are evident from observation
    • created needs: where a user has to do something because it is required by the service
  • In addition, needs can be categorised into the following groups:
    • high-level needs – for example: ‘I need to understand the data so that I don’t use it incorrectly’
    • needs – for example: ‘I need to trust the data so I can defend my decision’
    • detailed needs – for example: ‘I need to know how reliable the data is, so that I can provide caveats if needed’
  • User needs may include requirements for:
    • Data Granularity (time, space, subject)
    • Data Accuracy or Precision (how closely does the data reflect reality)
    • Data Timeliness and Consistency (duration between data creation and access)
    • Functionality and simplicity of access (file download, API requests, etc.)
    • Reliability (system availability over time)
    • Stability (consistency over time)
    • Agility (the ability to adapt to changing needs)
    • Linkability (joining to other datasets)

How this relates to the goals of the data holders

  • How do the objectives of the users compare to the objectives of the data custodian?

How this delivers benefit for consumers

  • Does meeting the need of the user provide value to the end customer?
  • Where there are conflicting objectives between the data custodian and user, does one provide significantly more value to the consumer?

How the data holder can deliver this in an appropriate format within reasonable timescales

  • It is important to recognise that some needs will require more time and effort to address than others
  • What is the realistic time in which the need can be addressed?
  • Is it possible to address all user needs with one delivery or are multiple iterations / versions needed?

 

There are a range of methods which organisations can use to elicit the needs of current and potential users.

  • Direct Engagement
    • Interviews – detailed research with representative users
    • Workshops – broad research with groups of users (or potential users)
    • Usability Testing – feedback from service users
  • Technology
    • Monitoring – tracking usage of a live service
    • Feedback forms – user initiated feedback on services
  • Other
    • Innovation Projects – generating new user needs through novel work
    • Knowledge Sharing – collaboration between organisations with similar user types
    • Direct Requests – prospective user needs

The Government Digital Service have published advice on the topic of user research within their service manual.

Worked Example

Workshop

One approach that can be used to gain input from potential users is to convene a workshop where individuals from different backgrounds can come together to discuss the challenges they face and the needs that this creates. Formulating the needs of users as structured user stories makes it possible to identify trends subsequently:

As a Role given that Situation I need Requirement so that Outcome

References

 

6. Identify the types of roles played by stakeholders of the data

The data custodian should understand the subject of the data as well as those that interact with the data.

Data Custodian: An organisation or individual which holds data

Data Subject: The identified or identifiable living individual to whom personal data relates.

Data Controller: A person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data.

Data Processor: A person, public authority, agency or other body which processes personal data on behalf of the controller.

Definitions from ICO

“To determine whether you are a controller or processor, you will need to consider your role and responsibilities in relation to your data processing activities. If you exercise overall control of the purpose and means of the processing of personal data – ie, you decide what data to process and why – you are a controller. If you don’t have any purpose of your own for processing the data and you only act on a client’s instructions, you are likely to be a processor – even if you make some technical decisions about how you process the data.” ICO 

References

 

7. Ensure data quality improvement is prioritised by user needs

Data quality is subjective; a dataset may be perfectly acceptable for one use case but entirely inadequate for another. Data accuracy can be more objective but there remain many instances where the required precision differs across use cases.

Data is not perfect; even the most diligent organisation that makes the greatest effort to collect and disseminate the highest quality data cannot guarantee enduring accuracy or foresee all the potential future needs of data users. Data custodians are therefore faced with an ongoing task to identify quality and accuracy limitations which can be improved over time. This is a particular challenge for the owners and operators of infrastructure who have the task of deploying assets which will be in situ for many years (often decades). The data needs of potential users are almost certain to change dramatically within the lifespan of the asset and the ability to recollect information is often challenging (especially for buried assets).

Note: Organisations should not see data quality as a barrier to opening datasets. Potential users may find the quality acceptable for their use, find ways to handle the quality issues or develop ways to solve issues which can improve the quality of the underlying data. 

Data is most useful when it is accurate and trustworthy. In many cases, data will be used to store information which can objectively be categorised as right or wrong e.g. customer addresses, asset serial numbers, etc.

Data controllers should make reasonable efforts to ensure that data is accurate and rectify issues quickly when they are identified

This is in line with the GDPR accuracy requirements for personal data. Organisations should utilise master data management techniques to validate inputs, monitor consistency across systems and rectify issues quickly when they are identified by internal or external stakeholders.
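As a small illustration of input validation (one element of master data management, and not a prescribed implementation), the sketch below checks incoming records against simple accuracy rules before they enter a system; the field names and rules are hypothetical.

import re

# Hypothetical validation rules for incoming asset records.
SERIAL_PATTERN = re.compile(r"^[A-Z]{2}\d{6}$")  # e.g. "AB123456"
POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", re.IGNORECASE)

def validate_record(record):
    """Return a list of accuracy issues found in a single asset record."""
    issues = []
    if not SERIAL_PATTERN.match(record.get("serial_number", "")):
        issues.append("serial_number does not match the expected format")
    if not POSTCODE_PATTERN.match(record.get("postcode", "")):
        issues.append("postcode does not look like a valid UK postcode")
    if not record.get("address"):
        issues.append("address is missing")
    return issues

record = {"serial_number": "AB123456", "postcode": "CF10 1AA", "address": ""}
print(validate_record(record))  # -> ['address is missing']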

Beyond accuracy, data custodians should consider how they can iteratively improve the quality of data.

Data custodians should seek to improve data quality in a way that responds to the needs of users

Not all data will be of sufficient quality for all users and in some cases significant investment may be required to rectify shortcomings. Data controllers should consider if the insufficient quality is due to lack of quality in the underlying data source (e.g. the sensor data is not precise or frequent enough), subsequent processing of the data (e.g. aggregation, rounding, etc.) or technical choices (e.g. deleting data due to storage constraints). Data custodians should propose how they could increase the quality of the data and indicate the time and cost requirements; where possible the prospective data user should be given the opportunity to propose alternative solutions.

Where simple, cost effective improvements can be made within existing constraints these should be actioned. Where incremental funding is required, the data custodian should consider the benefit to wider stakeholders and users before considering the funding options available (organisational investment, innovation funding, data user collaboration, data user fees, funding request, etc.).

Worked Example

Accuracy

An organisation collects and holds data about companies which operate in a sector in order to help potential innovators find suppliers or collaboration partners. The data that is held is made available on a website and can be queried by users. A potential user searches for an organisation and finds that a company has been incorrectly categorised. The data publishers have provided a contact form which enables the user to submit the correction, which is subsequently verified by the data publisher before the dataset is updated.

Quality

A data user wishes to develop an application which enables public transport users to select options based on their impact on air pollution in a city. The public transport provider makes a range of data available about the various modes of transport including routes travelled, vehicle type, average emissions and passenger numbers. However, the data user has identified that the emissions data is not of a sufficient quality for their use case for the following reasons:

  1. The average emissions data field is not consistently populated
  2. There is likely to be a variation between the average emission output and the real impact on local areas

The issues highlighted above are quite different in nature and have different potential solutions.

  1.  The public transport provider can check the dataset for missing values and populate based on the vehicle type and known specification
  2. Monitoring real emissions from each vehicle is likely to require new equipment and create a large amount of cost
    1. An alternative option could be to use existing static air quality monitoring sites and data analytics to estimate the impact in certain conditions. 

With a quick solution proposed for point 1 and an alternative solution for point 2, the data user can continue with their use case.
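A minimal sketch of the quick fix proposed for point 1 is shown below, assuming the transport data is held in a table with hypothetical columns (vehicle_type, avg_emissions_g_per_km) and that typical emissions per vehicle type are known from the specification.

import pandas as pd

# Hypothetical public transport records with gaps in the average emissions field.
df = pd.DataFrame({
    "route": ["10", "10", "22", "22"],
    "vehicle_type": ["diesel_bus", "diesel_bus", "electric_bus", "electric_bus"],
    "avg_emissions_g_per_km": [950.0, None, 0.0, None],
})

# Known specification values per vehicle type, used to fill the missing entries.
spec_emissions = {"diesel_bus": 950.0, "electric_bus": 0.0}

# Fill gaps from the vehicle specification rather than leaving them empty.
df["avg_emissions_g_per_km"] = df["avg_emissions_g_per_km"].fillna(
    df["vehicle_type"].map(spec_emissions)
)
print(df)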

References

 

8. Data relating to common assets is Presumed Open

Data relating to common assets should be open unless there are legitimate issues which would prevent this; legitimate issues include Privacy, Security, Negative Consumer Impact, Commercial Interest, and Legislation and Regulation. It is the responsibility of the data controller to ensure that issues are effectively identified and mitigated where appropriate. It is recommended that organisations implement a robust Open Data Triage process.

A Common Asset is defined as a resource (physical or digital) that is essential to, or dependent on, common shared infrastructure

In cases where there has been data processing applied to raw data (e.g. Issue mitigation, data cleaning, etc.) it is considered best practice for the processing methodology or scripts to be made available as core supporting information in order to maximise the utility of the data to users

Data relating to common assets should be open unless there are legitimate issues.

 

9. Presumed Open data should go through Open Data Triage, conducted by the data custodian

The triage process considers themes such as privacy, security, commercial and consumer impact issues. Where the decision is for data not to be made open the data controller will: share the rationale for this; identify sensitivity mitigation options and deliver these if user needs require; and maximise sharing of the mitigation protocols and desensitised versions of the data. Users of the data should have a reasonable opportunity to challenge decisions and a point of escalation where agreement between data users and data controllers cannot be reached.

Process

An example of the Open Data Triage process is illustrated in the accompanying figure.

Identification of Discrete Datasets

The goal of open data triage is to identify where issues exist that would prevent the open publication of data in its most granular form and address them in a way that maintains as much value as possible. In order to make this process manageable for the data controller, the first step in this process is:

Identify thematic, usable datasets that can be joined if required rather than general data dumps

In this context, we define a ‘thematic, usable dataset’ as a discrete collection of data which relates to a focused, coherent topic but provides enough information to be of practical use. e.g. 1 year of time series monitoring of a defined range of metrics for a particular category of technical asset, where each asset is identified and described either within the dataset or through unique linkage to a related dataset.

The approach described above minimises the risk that the size and complexity of datasets results in issues not being correctly identified. It also reduces the risk that an issue in one part of the dataset results in the whole dataset being made less open or granular, therefore maximising the amount of useful data that is openly available in its most granular form. For example, providing the complete output from a data warehouse in one data dump could include information about consumers, employees, financial performance, company KPIs, etc., all of which would present issues that would mean the data needs to be modified or the openness reduced. Extracting tables (or parts of tables) from the data warehouse instead provides a more granular level of control which enables individual issues to be identified and addressed accordingly, in turn maximising the data which is made openly available.

Identification of Issues

Once a thematic, usable dataset has been identified the data controller should assess the dataset to identify if there are any issues which would prevent the open publication of the data in its most granular form.

Identify the potential issues which might limit the openness or granularity of dataset

In the table below, we outline a range of issue categories which should be carefully considered. Some of these categories will relate directly to triage processes which already exist in organisations, but others may require the adaptation of existing processes or the creation of new ones to provide a comprehensive solution.

Issue Category

Description

Guidance 

Privacy

Data that relates to a natural person who can be identified directly from the information in question or can be indirectly identified from the information in combination with other information.

This should be a familiar process as GDPR introduced a range of requirements for organisations to identify personal data and conduct Data Privacy Impact Assessments. The ICO has a wealth of advice and guidance on these topics, including definitions of personal data and DPIA templates.

Security

Data that creates new security issues, or exacerbates existing ones, which cannot be mitigated via sensible security protocols such as personnel vetting, physical site security or robust cyber security.

Companies and organisations that own and operate infrastructure should already have a risk identification and mitigation program to support the protection of Critical National Infrastructure (CNI). The Centre for the Protection of National Infrastructure (CPNI) have advice and guidance for organisations involved in the operation and protection of CNI.

Outside of CNI, organisations should assess the incremental security risks that could be created through the publication of data. Organisations should consider personnel, physical and cyber security when identifying issues and identify if the issue primarily impacts the publishing organisation or if it has wider impacts. Issue identification should take into account the existing security protocols that exist within an organisation and flag areas where the residual risk (after mitigation) is unacceptably high.

Note, where the information contained within a dataset is already publicly available via existing means (such as publicly available satellite imagery), the security issue assessment should consider the incremental risk of data publication using the existing situation as the baseline.

Negative Consumer Impact

Data that is likely to drive actions, intentional or otherwise, which will negatively impact the consumer

Organisations should consider how the dataset could be used to drive outcomes that would negatively impact consumers by enabling manipulation of markets, embedding bias into products or services, incentivising actions which are detrimental to decarbonisation of the system, etc.

The Open Data Institute (ODI) Data Ethics Canvas may be of use when identifying potential negative consumer impacts.

Commercial Interest

Data that relates to the private administration of a business, or data which was not collected as part of an obligation or by a regulated monopoly and would not have been originated or captured without the activity of the organisation

Commercial data relating to the private administration of a business (HR, payroll, employee performance, etc.) is deemed to be private information and is a legitimate reason for data to be closed, although organisations may choose to publish it for their own purposes, such as reporting or corporate social responsibility (CSR).

Data which does not relate to the administration of the business but has been collected or generated through actions which are outside of the organisation’s legislative or regulatory core obligations and funded through private investment may also have legitimate reason to be closed. This description may include the data generated through innovation projects but consideration should be given to the source of funding and any data publication or sharing requirements this might create.

Where an organisation is a regulated monopoly, special consideration should also be given to the privileged position of the organisation and the duty to enable and facilitate competition within their domain.

Where datasets contain IP belonging to other organisations or where the data has been obtained under a licence which would restrict onward publishing this should also be identified. Note, the expectation is that organisations should be migrating away from restrictive licences / terms and conditions that restrict onward data publishing and sharing where possible. 

Legislation and Regulation

Specific legislation or regulation exists which prohibits the publication of data.

Organisations should have legal and regulatory compliance processes which are able to identify and drive compliance with any obligations the company has.

Consideration should include:

  • Utilities Act 2000
  • Electricity Act 1989
  • Gas Act 1986 / 1995
  • Competition Act 1998
  • Enterprise Act 2002
  • Enterprise and Regulatory Reform Act 2013
  • Data Protection Act 2018
  • General Data Protection Regulation (GDPR)

 

Consider the impact of related or adjacent datasets

When assessing the sensitivity of data, thought should be given to the other datasets which are already publicly available and the issues which may arise from the combination of datasets. Organisations should consider where there are datasets outside of their control which, if published, could create issues which would need to be mitigated. Special consideration should be given to datasets which share a common key or identifier, this includes but is not limited to:

  • subject reference (e.g. Passport Number),
  • technical reference (e.g. Serial Number),
  • time (e.g. UTC),
  • space (e.g. Postcode or Property Identifier)

As new datasets are made available, markets develop and public attitudes change there may be a need to revise the original assessment. For example, a dataset which was initially deemed too sensitive to be released openly in its most granular form could be rendered less sensitive due to changes in market structure or regulatory obligations. Equally, a dataset which was published openly could become more sensitive due to the publication of a related dataset or technology development.
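To illustrate why shared keys deserve special consideration, the sketch below joins two hypothetical datasets on a common property identifier; neither dataset is especially sensitive on its own, but the combination links consumption to an address. All values and column names are illustrative.

import pandas as pd

# Hypothetical 'anonymised' consumption data keyed by a property identifier.
consumption = pd.DataFrame({
    "property_id": ["P001", "P002", "P003"],
    "annual_kwh": [3100, 5200, 2700],
})

# A separate, publicly available dataset sharing the same key.
addresses = pd.DataFrame({
    "property_id": ["P001", "P002", "P003"],
    "address": ["1 High St", "2 High St", "3 High St"],
})

# Joining on the shared key links consumption back to an address,
# showing how combining datasets can create a new sensitivity issue.
combined = consumption.merge(addresses, on="property_id")
print(combined)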

Mitigation of Issues

Where the assessment process identifies an issue, the aim should be to mitigate the issue through modification of data or reduced openness whilst maximising the value of the dataset for a range of stakeholders.

Mitigate issues through modification of data or reduced openness whilst addressing user needs

Open data with some redactions may be preferable to shared data without, but if redactions render the data useless then public or shared data may be better. In some cases, the objectives of the prospective data users might create requirements which cannot be resolved by a single solution, so it may be necessary to provide different variations or levels of access, for example providing open access to a desensitised version of the data for general consumption alongside shared access to the unadulterated data for a subset of known users.

Modification of Data

Modification of data can serve to reduce the sensitivity whilst enabling the data to be open. There are a wide variety of possible modifications of data which can be used to address different types of sensitivity.

Technique

Description

Example Application

Commentary

Anonymisation

Removing or altering identifying features

Privacy

An organisation has a licence condition to collect certain data about individual usage of national infrastructure. The data is collected about individual usage on a daily basis and could reveal information about individuals if it was to be released openly.

By removing identifying features such as granular location and individual reference it could be possible to successfully anonymise the data so that individuals cannot be re-identified, allowing the data to be made openly available.

Simple anonymisation can be very effective at protecting personal data but it needs to be undertaken with care to minimise the risk of re-identification. Anonymisation techniques can be combined with other mitigation techniques to minimise this risk.

https://theodi.org/article/how-do-organisations-perceive-the-risks-of-re-identification/

Noise 

Combining the original dataset with meaningless data

Commercial Interest

An organisation collects information about how individuals use a privately built product or service (e.g. a travel planner). This data could be of great use for the purposes of planning adjacent systems (e.g. the energy system or road network), but releasing the anonymised, granular data would give competitors a commercial advantage.

By introducing seemingly random noise into the dataset in a way that ensures the data remains statistically representative while the detail of individuals is subtly altered, the data can be made available whilst reducing the commercial risk.

Introducing noise to data in a way that successfully obfuscates sensitive information whilst retaining the statistical integrity of the dataset is a challenging task that requires specialist data and statistics skills. Consideration needs to be given to the required distribution, which features the noise will be applied to and the consistency of application.

Delay

Deferring publication of data for a defined period

Security

An organisation operates a network of technical assets, some of which fail on occasion. If the data related to those assets was made available innovators could help to identify patterns which predict outages before they occur and improve the network stability. However, the data could also be used to target an attack on the network at a point which is already actively under strain and cause maximum impact.

However, by introducing a sufficient delay between the data being generated and it being published, the organisation can mitigate the risk of the data being used to attack the network whilst still benefiting from the insights that innovators can provide.

Delaying the release of data is a simple but effective method of enabling detailed information to be released whilst mitigating many types of negative impact. However, it may be necessary to combine this with other mitigation techniques to completely mitigate more complicated risks.

Differential Privacy

 An algorithm or model which obscures the original data to limit re-identification

Privacy

An organisation collects rich data from consumers which is highly valuable but sensitive (e.g. email content). The sensitivity of this data is very high but the potential for learning is also very high.

Differential privacy enables large amounts of data to be collected from many individuals whilst retaining privacy. Noise is added to individuals’ data, which is then ingested by a model; as large amounts of data are combined the noise averages out and patterns can emerge. It is possible to design this process such that the results cannot be linked back to an individual user and privacy is preserved.

Differential privacy is an advanced technique but can be very effective. It is used by top technology firms to provide the benefits of machine learning without the privacy impact that would usually be involved.

Sharing a model can be a highly effective way of enabling parties to access the benefit of highly sensitive, granular data without providing direct access to the raw information. However, this is an emerging area so carries some complexity and risk.

Differential Privacy Overview

Redaction

Removing or overwriting selected features

Security / Legislation and Regulation

An organisation maintains data about a large number of buildings across the country and their usage. Within the dataset there are a number of buildings which are identified as Critical National Infrastructure (CNI) sites which are at particular risk of targeted attack if they are known.

In this case it is possible to simply redact the data for the CNI sites and release the rest of the dataset (assuming there is no other sensitivity). Note, this approach works here because the dataset is not complete and therefore it is not possible to draw a conclusion about a site which is missing from the data; it may simply not have been included.

Redaction is commonplace when publishing data as it is a very effective method of reducing risk. However, care needs to be taken to ensure that it is not possible to deduce something by the lack of data.

Aggregation

Combining data to reduce granularity of resolution, time, space or individuals

Commercial Interest

An organisation collects information about the performance of their private assets which form part of a wider system (e.g. energy generation output). This data could be of great use to the other actors within the system but releasing the data in its raw format may breach commercial agreements or provide competitors with an unfair advantage.

By aggregating the data (by technology, time, location or other dimension) the sensitivity can be reduced whilst maintaining some of the value of the data.

Aggregation is effective at reducing sensitivity but can significantly reduce the value of the data. It may be worth providing multiple aggregated views of the data to address the needs of a range of stakeholders. A simple sketch combining aggregation with the Noise technique appears after this table.

Where aggregation is the only effective mechanism to reduce sensitivity organisations may want to consider providing access to aggregated data openly alongside more granular data that can be shared with restricted conditions.

Shift / Rotate

Altering the position or orientation of spatial or time series data

Privacy

An organisation collects data on how their customers use a mobile product including when and where. This movement data could be of value to other organisations in order to plan infrastructure investment but the data reveals the patterns of individuals which cannot be openly published.

The initial step is to remove any identifying features (e.g. device IDs) and break the movement data into small blocks. Each block of movement can then be shifted in time and space such that the blocks cannot be reassembled to identify the movement patterns of individuals. This means that realistic, granular data can be shared whilst the privacy of individuals is protected.

Shifting or rotating data can be useful to desensitise spatial or temporal data. However, it is important to recognise context to ensure that the data still makes sense and cannot be easily reconstructed. For example, car journey data will almost always take place on roads, so rotation can make the data nonsensical, and the rotated data can be pattern matched back to the underlying road network with relative ease.
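
A minimal sketch of the block-shifting approach described above (Python; the trace, block size and shift magnitudes are illustrative assumptions, not recommended values):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical movement trace for one customer, already stripped of device IDs
trace = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01 08:00", periods=60, freq="min"),
    "lat": 52.0 + np.cumsum(rng.normal(0, 0.0005, 60)),
    "lon": -1.5 + np.cumsum(rng.normal(0, 0.0005, 60)),
})

# Break the trace into short blocks so a full journey cannot be reassembled
block_size = 10
blocks = [trace.iloc[i:i + block_size].copy() for i in range(0, len(trace), block_size)]

desensitised = []
for block in blocks:
    # Shift each block independently in time and space
    block["timestamp"] += pd.Timedelta(minutes=int(rng.integers(-120, 120)))
    block["lat"] += rng.normal(0, 0.01)
    block["lon"] += rng.normal(0, 0.01)
    desensitised.append(block)

# Shuffle the block order before publication so the original sequence cannot be inferred
published = pd.concat(desensitised).sample(frac=1, random_state=1).reset_index(drop=True)
```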

Randomisation

Making arbitrary changes to the data

Security

An organisation may be the custodian of infrastructure data relating to a number of sensitive locations such as police stations or MoD buildings. The data itself is of use for a range of purposes but making it openly available could result in security impacts.

Randomising the data (generating arbitrary values) relating to the sensitive locations (rather than redaction) could reduce the sensitivity such that it can be open.

Randomisation can be very effective at reducing sensitivity, but it is also destructive, so it impacts the quality of the underlying data.
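
A minimal sketch of randomising values for sensitive sites (Python; the dataset and value ranges are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical infrastructure dataset; 'sensitive' flags e.g. police or MoD sites
assets = pd.DataFrame({
    "site_id": ["S1", "S2", "S3", "S4"],
    "peak_demand_kw": [250.0, 1800.0, 90.0, 640.0],
    "sensitive": [False, True, False, True],
})

# Rather than redacting, overwrite the sensitive values with arbitrary figures drawn
# from a plausible range, so the published data reveals nothing about those sites
randomised = assets.copy()
mask = randomised["sensitive"]
randomised.loc[mask, "peak_demand_kw"] = rng.uniform(50, 2000, size=int(mask.sum())).round(1)

published = randomised.drop(columns="sensitive")
print(published)
```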

Normalisation

Modifying data to reduce the difference between individual subjects

Negative Consumer Impact

An organisation collects data about the usage of a product in order to diagnose problems and optimise performance. The data has wider use beyond this core purpose, but the associated demographic data could result in bias towards certain groups.

By normalising the data (reducing variance and the ability to discriminate between points) it is possible to reduce the ability of certain factors to differentiate between subjects and hence reduce some types of bias.

Normalisation is a statistical technique that requires specialist skills to apply correctly. It may not be enough on its own to address all sensitivities, so a multifaceted approach may be required.
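
A minimal sketch of normalising within demographic groups (Python; the dataset and the choice of z-score standardisation are illustrative assumptions; other normalisation approaches may be more appropriate):

```python
import pandas as pd

# Hypothetical product usage data with an associated demographic attribute
usage = pd.DataFrame({
    "user_id": ["U1", "U2", "U3", "U4", "U5", "U6"],
    "demographic_group": ["A", "A", "A", "B", "B", "B"],
    "daily_usage_hours": [1.2, 1.5, 1.1, 3.8, 4.1, 3.6],
})

# Standardise usage within each demographic group so that group-level differences
# can no longer be used to discriminate between subjects
usage["usage_normalised"] = (
    usage.groupby("demographic_group")["daily_usage_hours"]
         .transform(lambda s: (s - s.mean()) / s.std())
)
print(usage[["user_id", "usage_normalised"]])
```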

 

Level of Access

When the mitigation techniques have been applied as appropriate, the data custodian should consider how open the resulting data can be. Where the mitigation has been successful the data can be published openly for all to use. However, if the nature of the data means that it is only valuable in its most granular form, it may be necessary to reduce openness but keep granularity.

  • Open – Data is made available for all to use, modify and distribute with no restrictions
  • Public – Data is made publicly available but with some restrictions on usage
  • Shared – Data is made available to a limited group of participants, possibly with some restrictions on usage
  • Closed – Data is only available within a single organisation

These levels of access are based on the ODI Data Spectrum.

Balancing Openness, Modification and User Needs

A key factor to consider is the needs of the potential data users. Initially, there may be value in providing aggregated summaries of data which can be made entirely open, but as new use cases and user needs emerge it may become clear that access to more granular data is required, which necessitates either a more sophisticated mitigation technique or a more granular version of the data which is shared less openly. In some cases, it may be prudent to make multiple versions of a dataset available to serve the needs of a range of users.

Documentation

Issues that are identified should be clearly documented. Where issues have been mitigated through reduced openness or data modification, the mitigation technique should also be clearly documented.

  • Privacy – A description of the data and identification of the features which contain personal data, with an additional flag for sensitive personal data.
  • Security – Where possible, a description of the data and the features which cannot be published. There may be some cases where acknowledging that the data exists may itself represent a security risk, in which case the relevant regulator or government department should be consulted.
  • Negative Consumer Impact – A description of the data, the sensitive fields and an overview of the likely negative impact. Whilst it might not be possible to describe the likely negative impact in detail, there should be some indication of the cause of the sensitivity, e.g. market manipulation.
  • Commercial Interest – A description of the data along with an outline of how this impacts commercial interest, e.g. business administration data or justifiable private investment.
  • Legislation and Regulation – A description of the data and reference to the specific articles of legislation or regulation which prohibit publication.
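
As an indication of the kind of record such documentation could take (the structure and field names below are assumptions for illustration only, not a format prescribed by this guidance):

```python
# Illustrative issue log for a published dataset; the structure and field names
# are an assumption for the purposes of this example, not a prescribed format
issue_log = {
    "dataset": "example-dataset",
    "issues": [
        {
            "category": "Privacy",
            "description": "Granular records could identify individual customers",
            "mitigation": "Aggregated to a less granular view before publication",
            "level_of_access": "Open",
        },
        {
            "category": "Security",
            "description": "Live values could reveal the real-time state of assets",
            "mitigation": "Publication delayed by 24 hours",
            "level_of_access": "Open",
        },
    ],
}
```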

 

Publication

After the triage process has taken place, the resulting data should be made available in line with the triage outcome, accompanied by any supporting information and metadata. Where issues have been identified and mitigated, this should be documented and made available, ideally with the processing methodology or script, such that potential users can understand the modifications to the data. This provides transparency, promotes consistency across the sector and enables challenge.

Publish descriptions of issues and mitigation actions alongside data

Challenge and Review

Once a dataset and accompanying material have been published, the data custodian should ensure that they have a process to regularly review the datasets that have been made available, to confirm that they remain correct and relevant and that no new issues have arisen. Equally, there should be a process for those outside of the publishing organisation to challenge the granularity, format and level of openness of any data which is published or publicised.

Assessment of issues should be an ongoing activity which can be interrogated and challenged

Worked Example

Substation Connection Capacity

Identification of Discrete Datasets

The dataset that has been identified is substation connection capacity, which represents the likely connection headroom at every substation within an electricity distribution network's area. Monitoring data (substation and smart meters) is used to measure the current demand and network data is used to provide the maximum capacity.

Note – this is a scenario for the purposes of providing an example and does not represent an operational network monitoring solution.

Identification of Issues

  • Privacy
    • Identification (Amber): Substation connection capacity is not personal data. However, if the dataset includes total capacity and used or available capacity then, in the rare case where a substation serves a single customer, this could be an individual's private data.
    • Mitigation (Green): Analysis will take place before the data is released to identify any substations which serve a single premises. If these cases materialise, the sensitive fields will be redacted and the data will be displayed as available capacity only to avoid personal data issues.
  • Security
    • Identification (Amber): If the data represents the live status of the network then a bad actor could use this information to inform an attack.
    • Mitigation (Green): Data can be delayed by 24 hours such that it is not possible to determine the live status of the network.
  • Negative Consumer Impact
    • Identification (Amber): If an actor used the data to opportunistically request connections that utilise all of the available capacity, it could drive up costs for future users, including for new housing, a cost which could be passed on to the consumer.
    • Mitigation (Green): Ensure there is a process in place to stop actors stockpiling capacity and to distribute costs fairly.
  • Commercial Interest
    • Identification (Green): None – the network is a monopoly player with a duty to deliver an efficient, competitive system.
    • Mitigation (Green): None required.
  • Legislation and Regulation
    • Identification (Amber): GDPR (if personal data is involved).
    • Mitigation (Green): See privacy mitigation.

 

Because all of the issues above can be mitigated, the data can be made open.

Documentation

The metadata description will include details about how the dataset is generated and the processing which has taken place to remove private data; the publication delay will also be clearly stated. If the privacy issue is more prominent and a data pipeline is required, then the methodology will be documented and made available alongside the data.

Publication

Data will be published as a CSV file with core supporting information which describes the fields and provides units. The data will be hosted on an open data access platform on the company website and the metadata will be registered with open data catalogues.

Challenge and Review

A feedback form will be provided on the website to enable challenge.

An innovator has indicated that for their specific use case they need to be able to access the data in near real time. In order to make this possible the decision has been made to provide a shared access data dashboard and API such that verified actors can gain access to the data in near real time. This has wider value for the technicians and engineers within the company as it provides high quality, low latency data which they can use in their roles.

References

10. Data should be interoperable with other data and digital services

Data is most useful when it can be shared, linked and combined with ease.

Interoperability (Data): enabling data to be shared and ported with ease between different systems, organisations and individuals

Data custodians should, directed by user needs, ensure that their data is made available in a way that minimises friction between systems through the use of standard interfaces, standard data structures, reference data matching or other methods as appropriate. Wherever possible, the use of cross-sector and international standards is advised.

Standard Data Structures

Data structure standardisation is a common method of aligning data across organisations and enabling seamless portability of data between systems. This method provides robust interoperability between systems and, if the standard has been correctly adhered to, enables entire data structures to be ported between systems as required. However, standardisation of data structures can be expensive and time consuming, and can require significant industry, regulatory or government effort to develop the standard if one does not already exist.

Examples of Implementation

Telecoms: Many of the underlying functions in telecoms have been standardised at a data structure level by the GSMA. This enables network companies to deploy devices from different manufacturers with limited burden, which drives down costs by reducing the risk of vendor lock-in.

Energy: The Common Information Model (CIM) is a set of standardised data models which can be used to represent electricity networks, their assets and functionality. CIM is being deployed in a number of network areas to aid portability between systems, enable innovation and lower costs.
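
To illustrate the concept of a shared data structure (the schema below is a simplified, hypothetical example and is not based on the actual CIM class definitions):

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical, simplified shared schema for a network asset record. This
# illustrates the concept of a standardised data structure only; it is not
# taken from CIM (IEC 61970/61968) or any other published standard.
@dataclass
class NetworkAsset:
    asset_id: str             # identifier agreed across organisations
    asset_type: str           # e.g. "transformer", "feeder"
    rated_capacity_kva: float
    location_easting: float
    location_northing: float
    commissioned: date

# Because every organisation produces records with the same structure, data can
# be ported between systems without bespoke translation for each counterparty
asset = NetworkAsset("TX-0001", "transformer", 500.0, 435612.0, 290421.0, date(2015, 6, 1))
```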

 

Standard Interfaces

Data interfaces can also be standardised; this means that the formal channels of communication are structured in a standard way which enables systems to ‘talk’ to one another with ease. This approach has the advantage of being very quick to implement: interfaces can be put in place as required and, providing there is robust documentation, a single organisation can define a ‘standard’ interface for its users without the need for cross-sector agreement, as interfaces can be easily developed and evolved over time. However, this approach is limited in that a new interface needs to be developed or deployed for each type of data that needs to be shared. Additionally, in sectors where there are a few powerful actors, those actors can use interface standardisation to create siloed ecosystems, which reduces portability.

Examples of Implementation

Smart Home: Google, Amazon, Apple, Hive and the like have developed open APIs which their partners use to interact with their systems.
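
To illustrate the concept of a standard interface (the class, method names and units below are illustrative assumptions, not an existing industry API):

```python
from abc import ABC, abstractmethod

# Hypothetical standard interface which any data provider could implement.
# The class, method names and units are illustrative assumptions only.
class CapacityDataProvider(ABC):
    @abstractmethod
    def list_substations(self) -> list[str]:
        """Return the identifiers of all substations covered by this provider."""

    @abstractmethod
    def get_available_capacity(self, substation_id: str) -> float:
        """Return the available connection capacity (kVA) for one substation."""

# A consumer written against the interface works with any compliant provider,
# regardless of how each organisation stores its data internally
def total_headroom(provider: CapacityDataProvider) -> float:
    return sum(provider.get_available_capacity(s) for s in provider.list_substations())
```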

 

Reference Data Matching

Matching datasets back to reference data spines can be a useful method to enable non-standardised data to be joined with relative ease. This approach can provide a minimum level of interoperability and linkability across datasets without the need for full standardisation. However, it does require users to learn about each new dataset rather than being able to understand the data from the outset, as with standard data structures.

Examples of Implementation

National Statistics: Datasets are matched back to a core set of ‘data spines’ to aid cross-referencing and the production of insightful statistics.
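
A minimal sketch of matching two non-standardised datasets back to a shared reference spine (Python; the spine, identifiers and datasets are hypothetical):

```python
import pandas as pd

# Hypothetical reference 'spine' of premises with agreed identifiers
spine = pd.DataFrame({
    "premises_id": ["P100", "P101", "P102"],
    "postcode": ["LS1 1AA", "LS1 1AB", "LS2 2BB"],
})

# Two non-standardised datasets from different organisations; each keeps its own
# structure but carries the shared reference key
consumption = pd.DataFrame({"premises_id": ["P100", "P101"], "annual_kwh": [3100, 2800]})
epc_ratings = pd.DataFrame({"premises_id": ["P100", "P102"], "epc_band": ["C", "D"]})

# Matching back to the spine lets the datasets be joined without either
# organisation adopting the other's data structure
linked = (
    spine.merge(consumption, on="premises_id", how="left")
         .merge(epc_ratings, on="premises_id", how="left")
)
print(linked)
```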

 

Worked Example

Standard Data Structures

Electricity network data is essential for a number of emerging energy system innovations, including the successful integration of a highly distributed, renewables-dominated grid. However, the network is broken into a number of areas which are operated by different organisations that have implemented different data structures to manage their network data (power flow model, GIS and asset inventory). The Common Information Model (CIM) is the common name for a series of IEC standards (IEC 61970 and IEC 61968) which standardise the data structure for electricity network data. The deployment of the CIM standards enables network operators to provide third parties with access to their data in a standard form, which enables innovation to be rolled out across network areas with relative ease.

References

11. Protect data and systems in accordance with Security, Privacy and Resilience best practice

Ensure data and systems are protected appropriately and comply with all relevant data policies, legislation, and Security, Privacy and Resilience (SPaR) best practice principles.

Data Custodians should consider:

  • How to protect data appropriately
  • How the release of data could impact the security of systems
  • How the systems which are used to release data are made secure
  • How to ensure compliance with related policy, legislation and regulation

A number of frameworks, standards and regulation exist which provide organisations with implementable guidance on the topic of SPaR such as:

  • Frameworks
    • Cyber Assessment Framework (CAF)
  • Standards
    • ISO 27000 – Information Security Management Systems
    • IEC 62443 – Industrial Network and System Security
    • IEEE C37.240-2014 – Cybersecurity Requirements for Substation Automation, Protection, and Control Systems
    • PAS185 – Smart Cities. Specification for establishing and implementing a security-minded approach.
  • Regulation
    • Network and Information Systems Directive (NIS Directive)
    • General Data Protection Regulation (GDPR)

In addition, there is a wealth of advice available to organisations through the following organisations:

  • BEIS and Ofgem – The joint competent authorities for downstream gas and electricity
  • National Cyber Security Centre (NCSC)
  • Centre for the Protection of National Infrastructure (CPNI)
  • International Electrotechnical Commission (IEC)
  • Institute of Electrical and Electronics Engineers (IEEE)
  • International Organization for Standardization (ISO)
  • British Standards Institution (BSI)

References

12. Ensure that data is stored and archived in such a way to maximise sustaining value

Organisations should consider the way in which data is stored or archived in order to ensure that potential future value is not unnecessarily limited. This includes ensuring that the storage solution is specified in such a way that the risk of data being lost due to technical difficulties is minimal. Technical storage solutions which offer component and geographical redundancy are commonplace, with many cloud providers offering data resilience as standard. In addition to technical resilience, the data custodian should ensure that data is not unduly aggregated or curtailed in a way that limits future value. Where possible the most granular version of the data should be stored for future analysis, but there may be cases where the raw data is too large to store indefinitely. Where this occurs the data custodian should consider how their proposed solution (aggregation, limiting the retention window, etc.) would impact future analysis opportunities.

Where data does not have value to the original data custodian, the information could be archived with a trusted third party (e.g. the UKERC Energy Data Centre, the UK Data Archive, etc.) to ensure that it continues to have sustaining value.

Worked Example

The UK Smart Metering Implementation Programme is responsible for rolling out digitally connected meter points to all UK premises with the goal of providing an efficient, accurate means of measuring consumption alongside a range of other technical metrics. Electricity distribution networks can request access to this data in order to inform their planning and optimise operation. The personal nature of smart meter data means that organisations have to protect individual consumer privacy and are therefore choosing to aggregate the data at feeder level before it is stored; this approach provides the network with actionable insight for the current configuration of the network.

However, network structure is not immutable. As demand patterns change and constraints appear, network operators may need to upgrade or reconfigure their network to mitigate problems. The decision to aggregate the data means that it is not possible to use the granular data to simulate the impact of splitting the feeder in different ways to more effectively balance demand across the new network structure. In addition, it means that the historic data cannot be used for modelling and forecasting going forwards. Finally, the data ingest and aggregation processes need to be updated to ensure that any future data is of value.

Northern Powergrid (a UK electricity distribution network) has recently proposed to store the smart meter data it collects in a non-aggregated format but to strictly enforce that data can only be extracted and viewed in an aggregated form. This approach is novel in that it protects the privacy of the consumer whilst retaining the flexibility and value of the data.
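
A minimal sketch of this ‘store granular, extract aggregated’ pattern (Python; the class, threshold and column names are illustrative assumptions and do not represent Northern Powergrid's actual implementation):

```python
import pandas as pd

class SmartMeterStore:
    """Holds granular readings but only ever returns aggregated views."""

    MIN_GROUP_SIZE = 5  # hypothetical threshold below which results are suppressed

    def __init__(self, readings: pd.DataFrame):
        # readings: one row per meter per half hour, with 'feeder_id', 'meter_id',
        # 'timestamp' and 'consumption_kwh' columns
        self._readings = readings

    def feeder_demand(self, feeder_id: str) -> pd.DataFrame:
        group = self._readings[self._readings["feeder_id"] == feeder_id]
        # Suppress the output if too few meters would be aggregated together,
        # since a small group could reveal individual consumption
        if group["meter_id"].nunique() < self.MIN_GROUP_SIZE:
            raise ValueError("Too few meters on this feeder to publish an aggregate")
        return group.groupby("timestamp", as_index=False)["consumption_kwh"].sum()
```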

References