Data Best Practice Guidance
The Energy Data Taskforce, led by Laura Sandys and Energy Systems Catapult, was tasked with investigating how the use of data could be transformed across our energy system. In June 2019, the Energy Data Taskforce set out five key recommendations for modernising the UK energy system via an integrated data and digital strategy. The report highlighted that the move to a modern, digitalised energy system was being hindered by often poor quality, inaccurate or missing data, while valuable data is often hard to find.
As a follow up to the Taskforce’s findings the Department for Business, Energy and Industrial Strategy, Ofgem and Innovate UK commissioned Energy Systems Catapult to develop Data Best Practice Guidance to help organisations understand how they can manage and work with data in a way that delivers the vision outlined by the Energy Data Taskforce. To do this, the Catapult engaged with a large number of stakeholders to gather views and opinions about what best practice means for those working with data across the energy sector and beyond.
The guidance presented below has been developed in collaboration with a large number of stakeholders and is designed to help organisations and individuals understand how to best manage data in a way that supports the Energy Data Taskforce recommendations and accelerates the transition towards a modern, digitalised energy system that enables net zero.
The initial development phase has now finished and the live version is hosted on the Modernising Energy Data Collaboration Space.
Latest Update: 15/10/2020
Ofgem are maintaining the live version of the guidance here.
The final version of the guidance as developed by the Energy Systems Catapult is included below as an archived version.
Archived – Data Best Practice Guidance
The Energy Data Taskforce, led by Laura Sandys and Energy Systems Catapult, was tasked with investigating how the use of data could be transformed across our energy system. In June 2019, the Energy Data Taskforce set out five key recommendations for modernising the UK energy system via an integrated data and digital strategy. The report highlighted that the move to a modern, digitalised energy system was being hindered by often poor quality, inaccurate or missing data, while valuable data is often hard to find. As a follow up to the Taskforce’s findings the Department for Business, Energy and Industrial Strategy (BEIS), Ofgem and Innovate UK have commissioned the Energy Systems Catapult to develop Data Best Practice Guidance to help organisations understand how they can manage and work with data in a way that delivers the vision outlined by the Energy Data Taskforce.
This guidance describes a number of key outcomes that, taken together, are deemed to be ‘data best practice’. Each description is accompanied by more detailed guidance that describes how the desired outcome can be achieved. In some areas the guidance is very specific, presenting a solution which can be implemented easily. In other areas the guidance is less prescriptive. This may be because there are many possible ‘best practice’ solutions (e.g. understanding user needs) or because there is a disadvantage to providing prescriptive guidance (e.g. cyber security). Where this is the case the guidance provides organisations with useful information that can be used to inform the implementation of a solution.
Approach taken by this guidance
This guidance touches on many issues that have existing regulation or authoritative guidance, such as personal data protection and security. In these areas this guidance should be seen as complementary rather than competitive. The guidance includes references to many existing resources and includes key extracts of their content where licensing allows.
Each “Principle” laid out in this guidance seeks to achieve an outcome. The principle itself takes the form of a single sentence statement. This is accompanied by a contextual “Explanation”. Principles and their associated Explanation are deemed to be the guidance proper.
About “Principles” and “Explanations”
Each Principle and its Explanation is accompanied by supporting information that showcases how the desired outcome might be achieved. This supporting information describes practical “Techniques” that, if followed, embody the spirit of each Principle. To further assist users of this guidance, worked “Examples” that apply the Techniques are also provided.
While the Principles and Explanations are deemed to be the Best Practice guidance itself, the Techniques and Examples are not formal guidance. Users of this guidance might benefit from using the Techniques and Examples, but alternative approaches to meeting the Principles are equally acceptable, provided they can be shown to achieve the desired outcomes.
On certain themes the Techniques are very specific, presenting a highly defined solution for which it is thought to be less likely that alternative solutions will be suitable (such as for the core data fields to be included in metadata). In other areas the guidance is less prescriptive; in these cases, there is an expectation that there is no single ‘best practice’ solution (e.g. demonstrating understanding of users’ needs) or there is potential for disadvantage in providing prescriptive guidance (e.g. cyber security).
The examples provided in this guidance are a mix of fictitious scenarios and real-world examples. They are primarily drawn from the energy sector, but in some cases are deliberately taken from data more relevant to other markets. This is to showcase the universality of the principles and to indicate the opportunities for promoting the inter-operation of data services between marketplaces.
The 4 elements in summary
Data Best Practice Principles
- Identify the roles of stakeholders of the data
- Use common terms within Data, Metadata and supporting information
- Describe data accurately using industry standard metadata
- Enable potential users to understand the data by providing supporting information
- Make datasets discoverable for potential users
- Learn and understand the needs of their current and prospective data users
- Ensure data quality maintenance and improvement is prioritised by user needs
- Ensure that data is interoperable with other data and digital services
- Protect data and systems in accordance with Security, Privacy and Resilience best practice
- Store, archive and provide access to data in ways that maximise sustaining value
- Ensure that data relating to common assets is Presumed Open
- Conduct Open Data Triage for Presumed Open data
1. Identify the roles of stakeholders of the data
There are two fundamental roles which will be referred to throughout this guidance, the Data Custodian and the Data User.
Data Custodian: An organisation or individual that holds data which it has a legal right to process and publish
Data User: An organisation or individual which utilises data held by a data custodian for any reason
The data user (this includes potential data users) seeks to achieve an outcome by utilising data which is held by a data custodian.
In addition to the roles defined above, there are a number of well understood definitions relating to personal data provided by the Information Commissioner’s Office (ICO).
Data Subject: The identified or identifiable living individual* to whom personal data relates.
[Data] Controller: A person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data.
Data Processor: A person, public authority, agency or other body which processes personal data on behalf of the controller.
* The ICO guidance concerns personal data but for the purposes of this more general data best practice guidance it is useful to also consider commercial entities as data subjects, and therefore in their case, “personal data” to be data about that company, such as about its assets.
This Principle expects individuals/organisations to understand what data they are a Data Custodian for.
Data Custodians are expected to identify relevant Data Subjects, Data Controllers and Data Processors for the data in question. The Data Custodian is responsible for managing data and therefore for implementing all other parts of this data best practice guidance.
Note, when assessing the legal right to process and publish, organisations should consider legal or regulatory obligations such as Freedom of Information Act, Environmental Information Regulations, Licence Conditions or Code obligations.
“To determine whether you are a controller or processor, you will need to consider your role and responsibilities in relation to your data processing activities. If you exercise overall control of the purpose and means of the processing of personal data – i.e., you decide what data to process and why – you are a controller. If you don’t have any purpose of your own for processing the data and you only act on a client’s instructions, you are likely to be a processor – even if you make some technical decisions about how you process the data.” – ICO
It is strongly advised that organisational data custodians appoint a specific senior leader to be responsible for the strategy, management and implementation of data best practice within the organisation. Organisations should consider developing and maintaining a stakeholder log which can be used across the organisation when considering the impact data-related work will have.
- Examples: https://ico.org.uk/media/for-organisations/documents/1546/data-controllers-and-data-processors-dp-guidance.pdf
2. Use common terms within Data, Metadata and supporting information
It is critically important for data users to be able to search for and utilise similar datasets across organisations. A key enabler of this is utilising a common way to describe the subject of data including in formal metadata; this requires a common way of defining and understanding key terms that are used across a domain. This includes specific subject and more general technical terms.
The most obvious technique here is to implement a common glossary of terms. However, there has been a proliferation of glossaries within the energy sector, with each new document or data store providing a definitive set of definitions for the avoidance of doubt. The list below is provided as an example.
- IEC Electropedia
- IEA Energy Categories
- Electricity Distribution Standard Licence Conditions
- Electricity Generation Standard Licence Conditions
- Electricity Interconnector Standard Licence Conditions
- Gas Interconnector Standard Licence Conditions
- Gas Shipper Standard Licence Conditions
- Gas Supplier Standard Licence Conditions
- Electricity Ten Year Statement definitions
Equally, the same has been occurring across other sectors and domains; the following sources are all data related.
There are currently efforts to standardise the naming conventions used across a range of infrastructure domains by the Digital Framework Task Group as part of the National Digital Twin programme of work. The long term goal is to define an ontology which enables different sectors to use a common language which in turn enables effective cross sector data sharing.
In the near term, it is unhelpful to create yet another glossary so we propose a two staged approach.
- Organisations label data with keywords and the authoritative source of their definition e.g. Term [Glossary Reference]
- The adoption of an authoritative glossary based on existing sources which can be expanded or adapted with user feedback and challenge. This is not something that can be achieved in isolation but could be part of the wider MED data visibility project.
Implementing a standard referencing protocol enables organisations to understand terms when they are used and slowly converge on a unified subset of terms where there is common ground. Implementing an industry wide data catalogue with an agreed glossary and mechanism for feedback enables the convergence to be accelerated and discoverability of data to be revolutionised.
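The ‘Term [Glossary Reference]’ labelling convention from the first stage could be handled in code along the following lines. This is a minimal Python sketch rather than a prescribed format: the function names and the exact pattern (one term followed by one bracketed reference) are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, NamedTuple


class LabelledTerm(NamedTuple):
    term: str
    glossary: str


# Assumed pattern: free text, then exactly one bracketed glossary reference.
_LABEL = re.compile(r"^\s*(?P<term>[^\[\]]+?)\s*\[(?P<glossary>[^\[\]]+)\]\s*$")


def parse_label(label: str) -> LabelledTerm:
    """Split 'Term [Glossary Reference]' into its two parts."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"label does not follow 'Term [Glossary Reference]': {label!r}")
    return LabelledTerm(match["term"], match["glossary"].strip())


def group_by_glossary(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Collect terms under the glossary that defines them."""
    grouped: Dict[str, List[str]] = {}
    for label in labels:
        parsed = parse_label(label)
        grouped.setdefault(parsed.glossary, []).append(parsed.term)
    return grouped
```

Grouping terms by glossary in this way gives a quick view of which sources an organisation relies on, which may help when looking for the common ground needed to converge on a shared glossary.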
3. Describe data accurately using industry standard metadata
To realise the maximum value creation from data within an organisation, across an industry or across the economy actors need to be able to understand basic information that describes each dataset. To make this information accessible, the descriptive information should be structured in an accepted format and it should be possible to make that descriptive information available independently from the underlying dataset.
Metadata is a dataset that describes and gives information about another dataset.
The Energy Data Taskforce recommended that the Dublin Core ‘Core Elements’ metadata standard (Dublin Core), ISO 15836-1:2017, should be adopted for metadata across the energy sector. Dublin Core is a well established standard for describing datasets and has many active users across a number of domains, including energy sector users such as the UKERC Energy Data Centre. It has a small number of key fields which provide a minimum level of description that can be built upon and expanded as required.
There are 15 ‘core elements’ as part of the Dublin Core standard which are described as follows:
Many of the fields are straightforward to populate, either manually or through automated processes that ensure metadata is up to date and accurate. However, other fields are more open to interpretation. Below we have listed best practice tips to help provide consistency across organisations.
Title: This should be a short but descriptive name for the resource.
- Be specific – generic titles will make useful resources harder to find e.g. ‘Humidity and temperature readings for homes in Wales’ is better than ‘sensor data’
- Be unique – avoid reusing existing titles where possible to help users find the required resources more effectively
- Be concise – ideally less than 60 characters in order to optimise search engine display but many storage solutions will have an upper character limit (e.g. UKERC Energy Data Centre limit to 100 characters)
- Avoid repetition or stuffing – use the other metadata fields for keywords and longer descriptions
- Title is different from filename – use the title to describe your resource in a way that other users will understand e.g. use spaces rather than underscores
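Some of the title tips above lend themselves to a simple automated check. The sketch below is illustrative only: the function name, the default 60 character limit (taken from the ‘be concise’ tip) and the list of existing titles are assumptions, and it cannot judge whether a title is specific enough.

```python
from typing import Iterable, List


def title_issues(title: str, existing_titles: Iterable[str] = (), max_length: int = 60) -> List[str]:
    """Return a list of human readable warnings for a proposed title."""
    issues: List[str] = []
    if not title.strip():
        issues.append("title is empty")
    if len(title) >= max_length:
        issues.append(f"title has {len(title)} characters; aim for fewer than {max_length}")
    if "_" in title:
        issues.append("title contains underscores; it may be a filename rather than a title")
    # Case-insensitive comparison so 'Sensor Data' and 'sensor data' count as a clash.
    if title.casefold() in {existing.casefold() for existing in existing_titles}:
        issues.append("title is already in use")
    return issues
```

A check like this could run when metadata is created, with the warnings shown to the author rather than blocking publication.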
Creator: Identify the creator(s) of the resource, individuals or organisations.
- Creator – a creator is the primary entity which generated the resource which is being shared, this can be the same as the publisher
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Subject: Identify the key themes of the resource.
- Keywords – select keywords (or terms) which are directly related to the resource
- Glossary – select subject keywords from an agreed vocabulary e.g. UKERC Energy Data Centre uses the IEA energy balance definitions
Description: Provide a description of the resource which can be read and understood by the range of potential users.
- Overview – the description should start with a high level overview which enables any potential user to quickly understand the context and content of the resource
- Accessible – use language which is understandable to the range of potential users, avoiding jargon and acronyms where possible
- Accuracy – ensure that the description objectively and precisely describes the resource, peer review can help identify potential issues
- Quality and Limitations – include detail about perceived quality of the resource and any known limitations or issues
- Core Supporting Information – ensure that any core supporting information is referenced in the description
Publisher: Identify the organisation or individual responsible for publishing the data. This is usually the same as the metadata author.
- Publisher – a publisher is an entity which is making the resource available to others, this can be the same as the creator or contributor
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Contributor: Identify the contributor(s) of the resource, individuals or organisations.
- Contributor – a contributor is an entity which provided input to the resource which is being shared, this can be the same as the publisher
- Unique Identifier – where possible use an authoritative, unique identifier for the individual or organisation e.g. company number
Date: Date is used in a number of different ways (start of development/collection, end of development/collection, creation of resource, publication of resource, date range of data etc.). The usage of a date field should therefore be explained in the resource description. The nature and potential use cases of the data will dictate the most valuable use of this field. However, where data collection is concerned, providing the time interval during which the data was collected is likely to be more informative than the publication date.
- Standardisation – the use of a standardised date format is recommended e.g. ISO 8601 timestamps
- Timezone – time values should always be given context including timezone and any modifiers (e.g. daylight savings)
- Time Interval – it is recommended that time intervals use the ISO 8601 standard representation e.g. start/end, start/duration, etc.
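The date tips above can be illustrated with Python's standard library. This is a minimal sketch under stated assumptions: the function names are hypothetical, timestamps are normalised to UTC as one possible way of giving them timezone context, and only the start/end interval form is shown.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 timestamp in UTC."""
    if moment.tzinfo is None:
        # A naive datetime has no timezone context, so refuse to guess.
        raise ValueError("naive datetime: timezone context is missing")
    return moment.astimezone(timezone.utc).isoformat()


def interval_start_end(start: datetime, end: datetime) -> str:
    """Format a collection period in the ISO 8601 start/end interval form."""
    start_text, end_text = to_iso_utc(start), to_iso_utc(end)
    if end < start:
        raise ValueError("interval end is before interval start")
    return f"{start_text}/{end_text}"


# Example: local midnight during British Summer Time (UTC+1).
bst = timezone(timedelta(hours=1))
collection_period = interval_start_end(
    datetime(2020, 6, 1, 0, 0, tzinfo=bst),
    datetime(2020, 7, 1, 0, 0, tzinfo=bst),
)
# collection_period == "2020-05-31T23:00:00+00:00/2020-06-30T23:00:00+00:00"
```

Rejecting timestamps without a timezone, rather than assuming one, keeps the daylight saving ambiguity visible to whoever is preparing the metadata.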
Identifier: Provide a unique identifier for the resource.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
Source: Identify the source material(s) from which the resource is derived.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
Relation: Identify other resources related to the resource.
- Identifiers – DOIs and URIs provide a truly unique identifier but where these are not available then a system or organisation specific identifier can be of use
- Supporting Information – Where additional resources are required to understand the resource these should be referenced
Coverage: Identify the spatial or temporal remit of the resource.
- Standardisation – utilising standard, authoritative spatial identifiers is recommended e.g. UPRN, USRN, LSOA, Country Code, etc.
Rights: Specify under which licence conditions the resource is controlled. It should be clear if the resource is open (available to all with no restrictions), public (available to all with some conditions e.g. no commercial use), shared (available to a specific group possibly with conditions e.g. commercial data product) or closed (not available outside of the data custodian organisation).
- Essential – this field is vitally important, as without a clear articulation of user rights the resource is useless. A resource with no rights entry cannot be treated as open.
- Common Licences – where possible use standard licence terms, the following offer a range of standard licence conditions:
The Dublin Core metadata should be stored in an independent file from the original data and in a machine readable format, such as JSON, YAML or XML that can easily be presented in a human readable format using free text editors. This approach ensures that metadata can be shared independently from the dataset, that it is commonly accessible and not restricted by software compatibility. The DCMI have provided schemas for representing Dublin Core in XML and RDF which may be of help.
Where a dataset is updated or extended the custodian needs to ensure that the metadata reflects this such that potential users can easily identify the additions or changes. Where the data represents a new version of the dataset (i.e. a batch update or modification of existing data) then it is sensible to produce a new version of the dataset with a new metadata file. Where the dataset has been incrementally added to (e.g. time series data) it may be best to update the metadata (e.g. data range) and retain the same dataset source.
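Storing the metadata in its own machine readable file, as recommended above, might look like the following. This is a sketch only: the function name and the `.metadata.json` naming convention are assumptions for illustration, and JSON is used simply because it is one of the formats named above.

```python
import json
from pathlib import Path
from typing import Mapping


def write_metadata_sidecar(dataset_path: Path, metadata: Mapping[str, str]) -> Path:
    """Write Dublin Core style metadata to a separate JSON file next to the dataset.

    The dataset itself is never opened or modified, so the metadata file
    can be shared independently of the underlying data.
    """
    # Hypothetical convention: '<dataset filename>.metadata.json'.
    sidecar = dataset_path.with_name(dataset_path.name + ".metadata.json")
    text = json.dumps(dict(metadata), indent=2, ensure_ascii=False)
    sidecar.write_text(text + "\n", encoding="utf-8")
    return sidecar
```

When a dataset is extended incrementally, the same function could be called again with the updated date range; for a new version of the dataset, a new dataset file would get its own metadata file.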
The Department for Business, Energy and Industrial Strategy has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. Note, this data does not have a URI or DOI so a platform ID has been used as the identifier. This is not ideal as it may not be globally unique.
{
  "title": "Solar PV cost data",
  "creator": "Department for Business, Energy and Industrial Strategy",
  "subject": "solar power station [http://www.electropedia.org/], energy cost [http://www.electropedia.org/]",
  "description": "Experimental statistics. Dataset contains information on the cost per kW of solar PV installed by month by financial year. Data is extracted from the Microgeneration Certification Scheme – MCS Installation Database.",
  "publisher": "Department for Business, Energy and Industrial Strategy",
  "contributor": "Microgeneration Certification Scheme",
  "coverage": "Great Britain (GB)",
  "rights": "Open Government Licence v3.0"
}
Note, the data.gov.uk platform holds metadata for files in JSON format but the above has been simplified for the purposes of providing an example.
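A simplified record like the one above can be checked against the 15 Dublin Core elements to see which are still unpopulated. The sketch below is illustrative: the function name is an assumption, and treating an empty string as ‘missing’ is a design choice rather than part of the standard.

```python
from typing import List, Mapping

# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def missing_elements(record: Mapping[str, str]) -> List[str]:
    """Return the Dublin Core elements that are absent or blank in a record."""
    return [
        element
        for element in DUBLIN_CORE_ELEMENTS
        if not str(record.get(element, "")).strip()
    ]


# The fields present in the simplified example above (values shortened here).
example_record = {
    "title": "Solar PV cost data",
    "creator": "Department for Business, Energy and Industrial Strategy",
    "subject": "solar power station, energy cost",
    "description": "Experimental statistics.",
    "publisher": "Department for Business, Energy and Industrial Strategy",
    "contributor": "Microgeneration Certification Scheme",
    "coverage": "Great Britain (GB)",
    "rights": "Open Government Licence v3.0",
}
```

For `example_record`, `missing_elements` reports date, format, identifier, language, relation, source and type, which is consistent with the example having been simplified.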
4. Enable potential users to understand the data by providing supporting information
When data is published openly, made publicly available or shared with a specific group it is critical that the data has any supporting information that is required to make the data useful for potential users. There is a need to differentiate between Core Supporting Information, without which the data could not be understood by anyone, and Additional Supporting Information that makes understanding the data easier. As a rule of thumb, if the original custodian of the dataset were to stop working with it and then come back 10 years later with the same level of domain expertise, but without the advantage of having worked with the data on a regular basis, the Core Supporting Information is that which they need to make the dataset intelligible.
Data Custodians should make Core Supporting Information available with the dataset.
Data custodians should consider the following topics for areas where Core Supporting Information may be required:
- Data collection methodology
- If data is aggregated or processed the methodology should be included e.g. for half hourly time-series data, does the data point represent the mean, start, midpoint or end value?
- Data structure description (e.g. data schema)
- Granularity (spatial, temporal, etc.) of the data
- Units of measurement
- Version number of any reference data that has been used (e.g. the Postcode lookup reference data)
- References to raw source data (within metadata)
- Protocols that have been used to process the data
It may be possible to minimise the required Core Supporting Information if the dataset uses standard methodologies and data structures (e.g. ISO 8601 timestamps with UTC offset are strongly recommended) but it should not be assumed that an externally hosted reference dataset or document will be enduring unless it has been archived by an authoritative, sustainable body (e.g. ISO, BSI, UK Data Archive, UKERC Energy Data Centre, etc.). If there is doubt that a key reference data source or document will be available in perpetuity this should be archived by the publisher and, where possible, made available as supporting information.
Whilst Data Custodians will attempt to provide all of the core supporting information required it is likely that there will be cases where essential core supporting information is missed or where there is a need for supporting information to be clarified. The data custodian should provide a Data Contact Point for potential data users to raise queries or request additional core supporting information where required.
A Data Contact Point should be made available to respond to data queries and requests relating to datasets and their core supporting information.
Depending on the goals of the organisation sharing data, it may be prudent to also include additional supporting information. Reasons to make additional supporting information available could include:
- Maximising user engagement with the dataset
- Addressing a particular user need
- To reduce the number of subsequent queries about the dataset
- To highlight a particular issue or challenge which the data publisher would like to drive innovators towards
The following template documents are provided to illustrate the type of field level information which should be provided to Data Users. Data Custodians with formal data schemas may be able to efficiently export these in machine readable formats.
Data Custodians should ensure that supporting information is clearly documented when datasets are provided to Data Users. The documents below are provided as an example of the kind of structure which may be useful, Data Custodians may prefer to integrate this information structure into data access solutions rather than using a separate file.
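Field level supporting information of the kind described above can be kept in a structured form and exported in a machine readable format. The following is a sketch under stated assumptions: the class, its attributes and the two example fields are hypothetical and are not taken from the template documents.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class FieldDescription:
    """Core supporting information for a single field in a dataset."""
    name: str
    description: str
    unit: Optional[str] = None            # unit of measurement, if any
    representation: Optional[str] = None  # e.g. mean, start, midpoint or end value


def export_data_dictionary(fields: Iterable[FieldDescription]) -> str:
    """Serialise field descriptions as a JSON array."""
    return json.dumps([asdict(field) for field in fields], indent=2)


# Hypothetical half-hourly time series, for illustration only.
FIELDS = [
    FieldDescription(
        name="timestamp",
        description="Start of the half-hourly interval, ISO 8601 with UTC offset",
    ),
    FieldDescription(
        name="demand",
        description="Electricity demand",
        unit="kW",
        representation="mean over the half hour",
    ),
]
```

Recording the representation alongside the unit answers the question raised earlier for aggregated data, namely whether each data point is the mean, start, midpoint or end value.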
Core Supporting Information
The UK Energy Research Centre (UKERC) hosts the Energy Data Centre, which was set up “to create a hub for information on all publicly funded energy research happening in the UK”. The centre hosts and catalogues a large amount of data which is collected by or made available to aid energy researchers, and much of the data is made available using an open licence (e.g. Creative Commons Attribution 4.0 International License). This licence enables data use for a wide range of purposes, including research. The centre aims to archive data for future researchers and as such, the administrators have embedded a range of data best practice principles including the provision of core supporting information that is required to understand the stored dataset. The custodians of the Energy Data Centre provided the ’10 year’ rule of thumb, described above.
The Energy Data Centre hosts the Local Authority Engagement in UK Energy Systems data and associated reports. The data is accompanied by rich metadata and a suite of core supporting information that enables users to understand the data. In this case, the supporting information includes:
- Core Supporting Information
  - Description of the source datasets
  - Description of the fields and their units
  - ReadMe.txt files with a high level introduction to the associated project
- Additional Supporting Information
  - A list of academic reports about the data and associated findings
The list of academic reports contains some core supporting information (e.g. the detailed collection methodology), but much of the content is additional supporting information.
Additional Supporting Information
The Data Custodian may have an interest in maximising engagement with datasets to solve a particular problem or drive innovation. In these cases, the custodian may choose to provide additional supporting information to help their potential users. Good examples of this are data science competitions, which are commonly used in industry to solve particular problems by drawing on a large number of experts.
A Kaggle competition asked participants to utilise sensor data to identify faults in power lines. To maximise engagement, the hosts provided high level overviews of the problem domain, more detailed explanations of datasets and offered additional advice as required through question and answer sessions that were widely published. This information was not strictly essential for an expert to understand the data, but the underlying goal of the project was to attract new talent to the area. If the potential participants could not easily understand the data they would likely move on to another lucrative project.
- Core Supporting Information
  - Descriptions of datasets and fields (metadata)
- Core and Additional Supporting Information
  - Description of the problem in non-technical language
  - Description of common problems and how to identify them
  - Question and Answer feeds
5. Make datasets discoverable for potential users
The value of data can only be realised when it is possible for potential users to identify what datasets exist and understand how they could utilise them effectively. Data custodians should implement a strategy which makes their data inclusively discoverable by a wide range of stakeholders. There may be instances where it is not advisable to make it known that a dataset exists but this is expected to be exceptionally rare e.g. in cases of national security.
Discoverable: The ability for data to be easily found by potential users
There is a difference between a dataset being discoverable and being accessible. Section 12 discusses the Open Data Triage process, which identifies the appropriate level of openness. In some cases a dataset may be too sensitive to be made available, but the description of the dataset (metadata) can almost always be made available without any sensitivity issues. This visibility can provide significant value by increasing awareness of what data exists. For example, an advertising agency’s dataset that describes personal details about individuals cannot be made openly available without the explicit consent of each subject in the data. However, many of the advanced advertising products that provide customers with more relevant products and services would be significantly less effective if the advertising platform could not explain to potential advertisers that they can use particular data features to target their advertisements.
In the following subsections we discuss some of the possible discoverability techniques which can be used by data custodians to help potential users to find datasets.
Metadata can be used to describe the contents and properties of a dataset. Because metadata does not include the underlying data, it is possible to make it open (i.e. published with no access or usage restrictions) in all but the rarest of cases without creating security, privacy, commercial or consumer impact issues.
Metadata can be published by individual organisation initiatives or by collaborative industry services. Individual organisations may choose to host their own catalogue of metadata and/or participate with industry initiatives.
Where data is made available via services such as a website (open, public or shared) organisations can choose to markup datasets to make them visible to data-centric search engines and data harvesting tools. Structured markup has similarities to metadata but should not be seen as a replacement for standardised metadata.
One of the most common uses of Schema.org is to formalise the data that is held within online recipes. This is semi structured information which commonly includes a list of required ingredients, a list of equipment and step by step instructions. Structuring this data has enabled cross recipe website search and useful tools such as ‘add to shopping basket’ for popular recipe websites, and makes greater analysis of the underlying data possible.
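The same approach can be applied to datasets using the Schema.org `Dataset` type. The sketch below builds JSON-LD markup that could be embedded in a dataset's web page. It is illustrative only: the function names are assumptions, only a handful of properties are shown, and which properties a given search engine actually uses should be checked against that search engine's own guidance.

```python
import json
from typing import Dict, Iterable


def dataset_jsonld(name: str, description: str, publisher: str,
                   licence_name: str, keywords: Iterable[str]) -> Dict[str, object]:
    """Build a minimal Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "publisher": {"@type": "Organization", "name": publisher},
        "license": {"@type": "CreativeWork", "name": licence_name},
        "keywords": list(keywords),
    }


def as_script_tag(markup: Dict[str, object]) -> str:
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    # Escape '</' so that text inside the JSON cannot close the script element early.
    payload = json.dumps(markup, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


markup = dataset_jsonld(
    name="Solar PV cost data",
    description="Cost per kW of solar PV installed by month by financial year.",
    publisher="Department for Business, Energy and Industrial Strategy",
    licence_name="Open Government Licence v3.0",
    keywords=["solar power station", "energy cost"],
)
```

As noted above, markup of this kind complements standardised metadata rather than replacing it; the values can be generated from the same Dublin Core record so the two do not drift apart.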
Search Engine Optimisation
Search engines are likely to be the way in which most users will discover datasets. It is therefore the responsibility of the data custodian to make sure that data is presented in a way that search engines can find and index. Most major search engines provide guides explaining how to ensure that the correct pages and content appear in ‘organic searches’ (many search engines provide advice for webmasters e.g. Microsoft, Google) and a range of organisations provide Search Engine Optimisation (SEO) services.
The Geospatial Commission have developed a guide to help organisations optimise their websites to enable search engines to identify and surface geospatial data more effectively which also discusses SEO techniques.
Direct stakeholder engagement is a more focused technique which may range from physical events / webinars to one to one meetings. The data custodian is able to target potential users and provide specific advice and guidance to ensure that potential users understand the data and can make the best use of it.
Stakeholder engagement can be a good opportunity for data custodians to understand user needs (see Section 6) and signpost specific datasets or resources. This technique may be of particular value when there is a specific use case or challenge which the data custodian is seeking to address.
Data Contact Point
The data custodian should provide a Data Contact Point for potential data users to contact the data custodian where it has not been possible to establish if data exists or is available
The Department for Business, Energy and Industrial Strategy (BEIS) has published a dataset relating to the installed cost per kW of Solar PV for installations which have been verified by the Microgeneration Certification Scheme. This dataset is published under an Open Government Licence so the data can be accessed by all. The data custodians have registered the dataset with the UK Government open data portal Data.gov.uk, which makes the metadata publicly available (albeit only via the API) in JSON, a machine readable format that can easily be presented in a human readable form. The portal additionally provides search engine optimisation to surface the results in organic search and uses webpage markup to make the data visible to dataset specific search engines.
Note, the discoverability actions taken above are related to a dataset which is publicly available but could also be used for a metadata stub entry in a data catalogue.
6. Learn and understand the needs of current and prospective data users
Digital connectivity and data are enabling a wealth of new products and services across the economy and creating new data users outside of the traditional sector silos. In order to maximise the value of data it is vital custodians develop a deep understanding of the spectrum of their users and their differing needs such that datasets can be designed to realise the maximum value for consumers.
Data custodians should develop a deep understanding of a range of topics.
- Who is using your dataset or service?
- Who would like to use your dataset or service?
- Who should be using your dataset or service?
- What outcome does each user hope to achieve by using the data or service?
- Users will have a range of needs driven by all sorts of factors including, but not limited to differing objectives, existing data / systems and technical capability.
- User needs may include requirements for:
- Data Granularity (time, space, subject)
- Data Accuracy or Precision (how closely does the data reflect reality)
- Data Timeliness and Consistency (duration between data creation and access)
- Functionality and simplicity of access (file download, API requests, etc.)
- Reliability (system availability over time)
- Stability (consistency over time)
- Agility (the ability to adapt to changing needs)
- Linkability (joining to other datasets)
- How do the objectives of the users compare to the objectives of the data custodian?
- Does meeting the need of the user provide value to the end customer?
- Where there are conflicting objectives between the data custodian and user, does one provide significantly more value to the consumer?
- What is the realistic time in which the need can be addressed?
- Is it possible to address all user needs with one delivery or are multiple iterations / versions needed?
There are a range of methods which organisations can use to elicit the needs of current and potential users.
- Direct Engagement
- Interviews – detailed research with representative users
- Workshops – broad research with groups of users (or potential users)
- Usability Testing – feedback from service users
- Monitoring – tracking usage of a live service
- User initiated feedback or requests – feedback forms, email contact, etc.
- Innovation Projects – generating new user needs through novel work
- Knowledge Sharing – collaboration between organisations with similar user types
- Direct Requests – prospective user needs
The Government Digital Service have published advice on the topic of user research within their service manual.
The table below provides some suggestions which may help to further clarify the topic and help data custodians to focus their effort.
- For organisations with many individual users, it may be helpful to group users into categories or create personas to represent users with similar needs and objectives.
- For the purposes of user research, a data user could be an individual, an organisation or a persona.
- Individual users may have multiple objectives; it can be helpful to rank these by importance / impact or set constraints, especially where there are competing needs.
- Data custodians should be considering the different types of needs:
- explicit needs: derived from how users describe what they are trying to do
- implicit needs: those that are not expressed and that users are sometimes not aware of, but that are evident from observation
- created needs: where a user has to do something because it is required by the service
- In addition, needs can be categorised into the following groups:
- high-level needs – for example: ‘I need to understand the data so that I don’t use it incorrectly’
- needs – for example: ‘I need to trust the data so I can defend my decision’
- detailed needs – for example: ‘I need to know how reliable the data is, so that I can provide a caveat if needed’
- Users may exhibit conflicting needs or provide different combinations of needs depending on their range of use cases and desired outcomes.
- These may not be well aligned and require judgement calls to be made based on the responsibilities of the data custodian, regulation and legislation
- Where there are conflicting objectives between the data custodian and user, does one provide significantly more value to the consumer?
- It is important to recognise that some needs will require more time and effort to address than others
- Where priority calls are made users should be kept informed on timelines
Note, it is important to recognise that it will not be possible to identify all potential users and potential use cases of a dataset before it is made available to innovators. Therefore, data without a clear user or use case should not be ignored as there may be unexpected value which can be created.
One approach that can be used to gain input from potential users is to convene a workshop where individuals from different backgrounds can come together to discuss the challenges they face and the needs that this creates. The needs of users can be formulated as structured user stories, which can subsequently be used to identify trends:
As a Role given that Situation I need Requirement so that Outcome
The user need
As a homeowner given that home heating accounts for 31% of carbon emissions I need to have access to data that evidences the impact of energy efficiency measures so that I can prioritise upgrades and developments and reach net zero carbon as quickly as possible.
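The user story structure above lends itself to being captured as structured records so that trends can be counted across a workshop. The sketch below is one possible way to do this in Python; the `UserStory` class, the `theme` field and the second and third stories are hypothetical additions for illustration and are not part of the guidance.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UserStory:
    role: str
    situation: str
    requirement: str
    outcome: str
    theme: str  # hypothetical tag added by the facilitator to group similar needs

    def as_sentence(self) -> str:
        """Render the story using the template from the guidance."""
        return (
            f"As a {self.role} given that {self.situation} "
            f"I need {self.requirement} so that {self.outcome}."
        )


# The first story paraphrases the example above; the others are invented.
stories = [
    UserStory(
        "homeowner",
        "home heating accounts for 31% of carbon emissions",
        "to have access to data that evidences the impact of energy efficiency measures",
        "I can prioritise upgrades and reach net zero carbon as quickly as possible",
        theme="energy efficiency evidence",
    ),
    UserStory(
        "landlord",
        "I manage several properties",
        "comparable data on the impact of insulation upgrades",
        "I can decide where to invest first",
        theme="energy efficiency evidence",
    ),
    UserStory(
        "app developer",
        "tariffs change frequently",
        "machine readable tariff data",
        "my product can show up to date prices",
        theme="tariff data access",
    ),
]

# Count how often each theme appears to surface common needs.
theme_counts = Counter(story.theme for story in stories)
for theme, count in theme_counts.most_common():
    print(f"{count} x {theme}")
```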
7. Ensure data quality maintenance and improvement is prioritised by user needs
Data quality is subjective. A dataset may be perfectly acceptable for one use case but entirely inadequate for another. Data accuracy can be more objective but there remain many instances where the required precision differs across use cases.
In many cases, data will be used to store information which can objectively be categorised as right or wrong e.g. customer addresses, asset serial numbers, etc. In these cases the data custodian should make reasonable efforts to collect, store and share accurate data. However, it is acknowledged that there will inevitably be inaccuracies in datasets and data custodians should seek to rectify specific issues as they are identified by users.
Data custodians should make reasonable efforts to ensure that data is accurate and rectify issues quickly when they are identified
Beyond accuracy, data custodians should consider how they can maintain and iteratively improve the quality of data such that it can be effectively utilised by users.
Data custodians should seek to maintain and improve data quality in a way that responds to the needs of users
Note: It is acknowledged that data is not perfect and organisations should not see data quality as a barrier to opening datasets. Potential users may find the quality acceptable for their use, find ways to handle the quality issues or develop ways to solve issues which can improve the quality of the underlying data.
Data is not perfect; even the most diligent organisation that makes the greatest effort to collect and disseminate the highest quality data cannot guarantee enduring accuracy or foresee all the potential future needs of data users. Data custodians are therefore faced with an ongoing task to identify quality and accuracy limitations which can be improved over time. This is a particular challenge for the owners and operators of infrastructure who have the task of deploying assets which will be in situ for many years (often decades). The data needs of potential users are almost certain to change dramatically within the lifespan of the asset and collecting the information again is often challenging (especially for buried assets).
Not all data will be of sufficient quality for all users and in some cases significant action may be required to rectify shortcomings. Many issues may be resolved through incremental changes to business process or data processing but some may have a more fundamental cause. Data controllers should consider if the insufficient quality is due to lack of quality in the underlying data source (e.g. the sensor data is not precise or frequent enough), significant processing of the data (e.g. aggregation, rounding, etc.) or technical choices (e.g. deleting data due to storage constraints).
Organisations should consider implementing Master Data Management (MDM) techniques to validate inputs, monitor consistency across systems and rectify issues quickly when they are identified by internal or external stakeholders.
Quality and Accuracy issues
Where known data quality issues exist, data custodians may wish to provide guidance to potential users such that these known limitations can be managed by downstream systems. Many data science techniques exist which can handle inaccuracies in data if they are statistically understood. In addition, third parties may be able to help the data custodian to rectify underlying data issues.
Where an organisation is sharing data which may have quality issues it may be prudent to utilise a liability waiver to limit the data publisher's risk. Note, liability waivers cannot remove all legal responsibility for data quality.
An organisation collects and holds data about companies which operate in a sector in order to help potential innovators find suppliers or collaboration partners. The data that is held is made available on a website and can be queried by users. A potential user searches for an organisation and finds that a company has been incorrectly categorised. The data publishers have provided a contact form which enables the user to submit the correction which is subsequently verified by the data publisher before the dataset is updated.
A data user wishes to develop an application which enables public transport users to select options based on their impact on air pollution in a city. The public transport provider makes a range of data available about the various modes of transport including routes travelled, vehicle type, average emissions and passenger numbers. However, the data user has identified that the emissions data is not of a sufficient quality for their use case for the following reasons:
- The average emissions data field is not consistently populated
- There is likely to be a variation between the average emission output and the real impact on local areas
The issues highlighted above are quite different in nature and have different potential solutions.
- The public transport provider could address the first point by:
- manually checking the dataset for missing values and populating them based on the vehicle type and known specification
- using data science techniques such as interpolation to populate missing data
- The public transport provider could address the second point by:
- deploying a real emission monitoring solution
- utilising an existing, alternative data source and data processing to provide ‘proxy’ data. e.g. static air quality monitoring sites
Having proposed a quick solution to point 1 and an alternative solution for point 2 the data user can continue with their use case.
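The two options for the first point can be sketched in a few lines of Python. This is an illustrative sketch only: the field names, the `SPEC_EMISSIONS` lookup and its values are hypothetical, and linear interpolation is just one of many techniques that could be chosen. Imputed values are flagged so that downstream users can tell them apart from measured ones.

```python
# Hypothetical lookup of average emissions by vehicle type (illustrative values).
SPEC_EMISSIONS = {"diesel bus": 820.0, "electric tram": 0.0}


def fill_from_specification(records):
    """Populate missing 'avg_emissions' from the known vehicle specification."""
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are not modified
        if rec["avg_emissions"] is None:
            rec["avg_emissions"] = SPEC_EMISSIONS.get(rec["vehicle_type"])
            rec["imputed"] = rec["avg_emissions"] is not None
        else:
            rec["imputed"] = False
        filled.append(rec)
    return filled


def interpolate_gaps(values):
    """Linearly interpolate None values that sit between two known values.

    Leading and trailing gaps are left as None because there is no
    neighbour on one side to interpolate from.
    """
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


fleet = [
    {"route": "1", "vehicle_type": "diesel bus", "avg_emissions": None},
    {"route": "2", "vehicle_type": "electric tram", "avg_emissions": 0.0},
]
print(fill_from_specification(fleet))
print(interpolate_gaps([None, 2.0, None, None, 8.0, None]))
```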
8. Ensure that data is interoperable with other data and digital services
Data is most useful when it can be shared, linked and combined with ease.
Interoperability (Data): enabling data to be shared and ported with ease between different systems, organisations and individuals
Data custodians should, directed by user needs, ensure that their data is made available in a way that minimises friction between systems through the use of standard interfaces, standard data structures, reference data matching or other methods as appropriate. Wherever possible, the use of cross sector and international standards is advised.
Standard Data Structures
Data structure standardisation is a common method of aligning data across organisations and enabling seamless portability of data between systems. This method provides robust interoperability between systems and if the standard has been correctly adhered to enables entire data structures to be ported between systems as required. However, standardisation of data structures can be expensive, time consuming and require significant industry, regulatory or government effort to develop the standard if one does not already exist.
Data interfaces can also be standardised. This means that the formal channels of communication are structured in a standard way which enables systems to ‘talk’ to one another with ease. This approach has the advantage of being very quick to implement: interfaces can be implemented as required and, providing there is robust documentation, a single organisation can define a ‘standard’ interface for their users without the need for cross sector agreement, as interfaces can be easily developed and evolved over time. However, this approach is limited in that a new interface needs to be developed or deployed for each type of data that needs to be shared. Additionally, in sectors where there are a few powerful actors, they can use interface standardisation to create siloed ecosystems which reduce portability.
Reference Data Matching
Matching datasets back to reference data spines can be a useful method to enable non standardised data to be joined with relative ease. This approach can provide a minimum level of interoperability and linkability across datasets without the need for full standardisation. However, it does require the users to learn about each new dataset rather than being able to understand the data from the outset with standard data structures.
Standard Data Structures
Electricity network data is essential for a number of emerging energy system innovations including the successful integration of a highly distributed, renewables dominated grid. However, the network is broken into a number of areas which are operated by different organisations that have implemented different data structures to manage their network data (power flow model, GIS and asset inventory). The Common Information Model (CIM) is the common name for a series of IEC standards (IEC 61970 and IEC 61968) which standardise the data structure for electricity network data. The deployment of the CIM standards enables network operators to provide third parties with access to their data in a standard form which enables innovation to be rolled out across network areas with relative ease.
Reference Data Matching
When processing data from disparate sources it may not be possible to directly match data sources using common features. Some data intensive organisations have implemented data spines which provide a number of key identifiers which span the economy such as company reference number, individual identifiers and property identifiers. As part of the data ingest pipeline they match data sources to one or more of the spines. This enables data which have no common fields to be cross referenced and used to create meaningful insight.
In the case above, the data users have performed the matching but it is equally possible for individual organisations to match their own data to key reference datasets such as street identifiers or network location. This provides third parties that wish to utilise the data with a useful anchor which can enable the easy linking back to external datasets.
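A minimal sketch of reference data matching is shown below, assuming a hypothetical property spine keyed on a normalised address string. The spine contents, identifiers, field names and the simple `normalise` rule are all invented for illustration; real spines typically rely on far more robust matching.

```python
# Hypothetical reference spine: normalised address -> property identifier.
PROPERTY_SPINE = {
    "1 example street, ab1 2cd": "PROP-0001",
    "2 example street, ab1 2cd": "PROP-0002",
}


def normalise(address: str) -> str:
    """Lower-case the address and collapse repeated whitespace."""
    return " ".join(address.lower().split())


def match_to_spine(records, address_field):
    """Return records keyed by spine identifier, plus those that did not match."""
    matched, unmatched = {}, []
    for rec in records:
        identifier = PROPERTY_SPINE.get(normalise(rec[address_field]))
        if identifier is None:
            unmatched.append(rec)
        else:
            matched[identifier] = rec
    return matched, unmatched


# Two datasets with differently named and formatted address fields.
energy_ratings = [{"addr": "1 Example Street,  AB1 2CD", "rating": "C"}]
heating_systems = [
    {"premises": "1 EXAMPLE STREET, AB1 2CD", "system": "heat pump"},
    {"premises": "99 Unknown Road", "system": "gas boiler"},
]

ratings_by_id, _ = match_to_spine(energy_ratings, "addr")
heating_by_id, unmatched = match_to_spine(heating_systems, "premises")

# Once both datasets carry the spine identifier they can be joined directly.
linked = {
    pid: {**ratings_by_id[pid], **heating_by_id[pid]}
    for pid in ratings_by_id.keys() & heating_by_id.keys()
}
print(linked)
print("Unmatched:", unmatched)
```

Keeping the unmatched records visible, rather than silently dropping them, helps the custodian judge how complete the linkage is.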
9. Protect data and systems in accordance with Security, Privacy and Resilience best practice
Ensure data and systems are protected appropriately and comply with all relevant data policies, legislation, and Security, Privacy and Resilience (SPaR) best practice principles.
Data Custodians should consider:
- How to appropriately protect:
- Stored data
- Data in transit
- How the release of data could impact the security of systems
- What is the value of data to potential attackers or hostile actors?
- What impact could a data breach cause?
- How the systems which are used to release data are made secure
- How to ensure compliance with related policy, legislation and regulation
Data custodians should utilise modern, agile approaches which seek to balance risk and reward rather than those which take a strict closed approach. A number of frameworks, standards and regulations exist which provide organisations with implementable guidance on the topic of SPaR such as:
- Cyber Assessment Framework (CAF)
- ISO 27000 – Information Security Management Systems
- IEC 62443 – Industrial Network and System Security
- IEEE C37.240-2014 – Cybersecurity Requirements for Substation Automation, Protection, and Control Systems
- PAS185 – Smart Cities. Specification for establishing and implementing a security-minded approach.
In addition, there is a wealth of advice available to organisations through the following organisations:
- BEIS and Ofgem – The joint competent authorities for downstream gas and electricity
- National Cyber Security Centre (NCSC)
- Centre for the Protection of National Infrastructure (CPNI)
- International Electrotechnical Commission (IEC)
- Institute of Electrical and Electronics Engineers (IEEE)
- International Organization for Standardization (ISO)
- British Standards Institution (BSI)
10. Store, archive and provide access to data in ways that maximise sustaining value
Organisations should consider the way in which data is stored or archived in order to ensure that potential future value is not unnecessarily limited.
Data Custodians should ensure that storage solutions are specified to ensure that the risk of data being lost due to technical difficulties is minimal. In addition to technical resilience the data custodian should ensure that data is not unduly aggregated or curtailed in a way that limits future value.
Data custodians should seek to make data available in formats which are appropriate for the type of data being presented and respond to the needs of the user. Note, visualisations and data presentation interfaces are useful to some but restrict the way in which potential users can interact with the data and should be considered as Supporting Information rather than data itself.
Technical storage solutions which offer component and geographical redundancy are commonplace, with many cloud providers offering data resilience as standard. However, data custodians should ensure that the system configuration is assessed to ensure that the desired benefits are realised and system security is as required.
Where possible, the most granular version of the data should be stored for future analysis. However, there may be cases where the raw data is too large to store indefinitely. In these cases, the data custodian should consider how their proposed solution (aggregation, limiting retention window, etc.) would impact future analysis opportunities and ensure user needs have been considered.
Where data does not have value to the original data custodian the information could be archived with a trusted third party e.g. UKERC Energy Data Centre, UK Data Archive, etc. to ensure that it continues to have sustaining value.
Access to data should also be provided in a way that is appropriate for the data type. The table below shows how live and historic data can be treated.
The UK Smart Metering Implementation Programme is responsible for rolling out digitally connected meter points to all UK premises with the goal of providing an efficient, accurate means of measuring consumption alongside a range of other technical metrics. Electricity distribution networks can request to gain access to this data in order to inform their planning and optimise operation. The personal nature of smart meter data means that organisations have to protect individual consumer privacy and are therefore choosing to aggregate the data at feeder level before it is stored. This approach provides the network with actionable insight for the current configuration of the network.
However, network structure is not immutable. As demand patterns change and constraints appear, network operators may need to upgrade or reconfigure their network to mitigate problems. The decision to aggregate the data means that it is not possible to use the granular data to simulate the impact of splitting the feeder in different ways to more effectively balance demand across the new network structure. In addition, it means that the historic data cannot be used for modelling and forecasting going forwards. Finally, the data ingest and aggregation processes need to be updated to ensure that any future data is of value.
Northern Powergrid (a GB Electricity Distribution Network) have recently proposed to store the smart meter data that they collect in a non-aggregated format but strictly enforce that data can only be extracted and viewed in an aggregated format. This approach is novel in that it protects the privacy of the consumer whilst retaining the flexibility and value of the data.
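The general idea of storing granular data while only releasing aggregates can be sketched as follows. This is not a description of any network operator's actual system: the `AggregatedAccessStore` class, its method names and the `min_meters` threshold are assumptions made for illustration. Because the granular readings are retained, totals can be recomputed for a proposed feeder split, which is not possible if the data is aggregated before storage.

```python
from collections import defaultdict


class AggregatedAccessStore:
    """Holds per-meter readings but only releases feeder-level totals."""

    def __init__(self, min_meters: int = 2):
        # period -> {meter_id: kWh}; kept private to this class
        self._readings = defaultdict(dict)
        # Assumed safeguard: withhold groups containing too few meters.
        self._min_meters = min_meters

    def add_reading(self, period: str, meter_id: str, kwh: float) -> None:
        self._readings[period][meter_id] = kwh

    def feeder_totals(self, meter_to_feeder: dict) -> dict:
        """Aggregate readings using the supplied meter-to-feeder mapping."""
        totals = defaultdict(float)
        members = defaultdict(set)
        for period, by_meter in self._readings.items():
            for meter_id, kwh in by_meter.items():
                key = (period, meter_to_feeder[meter_id])
                totals[key] += kwh
                members[key].add(meter_id)
        return {
            key: total
            for key, total in totals.items()
            if len(members[key]) >= self._min_meters
        }


store = AggregatedAccessStore(min_meters=2)
for meter, kwh in [("m1", 1.0), ("m2", 2.0), ("m3", 3.0), ("m4", 4.0)]:
    store.add_reading("period-1", meter, kwh)

# Current network configuration: all four meters on one feeder.
current = {"m1": "F1", "m2": "F1", "m3": "F1", "m4": "F1"}
# Proposed reconfiguration: the feeder is split in two.
proposed = {"m1": "F1a", "m2": "F1a", "m3": "F1b", "m4": "F1b"}

print(store.feeder_totals(current))
print(store.feeder_totals(proposed))
```

In a real deployment the access restriction would need to be enforced by the storage and access control systems rather than by a naming convention in application code.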
11. Ensure that data relating to common assets is Presumed Open
Presumed Open is the principle that data should be as open as possible. Where the raw data cannot be entirely open, the data custodian should provide objective justification for this.
Open Data is made available for all to use, modify and distribute with no restrictions
Data relating to common assets should be open unless there are legitimate issues which would prevent this. Legitimate issues include Privacy, Security, Negative Consumer Impact, Commercial and Legislation and Regulation. It is the responsibility of the data controller to ensure that issues are effectively identified and mitigated where appropriate. It is recommended that organisations implement a robust Open Data Triage process.
Common Assets are defined as a resource (physical or digital) that is essential to or forms part of common shared infrastructure
In cases where there has been data processing applied to raw data (e.g. issue mitigation, data cleaning, etc.) it is considered best practice for the processing methodology or scripts to be made available as core supporting information in order to maximise the utility of the data to users.
Data relating to common assets should be open unless there are legitimate issues
12. Conduct Open Data Triage for Presumed Open data
The triage process considers themes such as privacy, security, commercial and consumer impact issues. Where the decision is for the raw data to not be made open, the data controller will share the rationale for this and consider sensitivity mitigation options (data modification or reduced openness) that maximise usefulness of the data. Where a mitigation option is implemented, the protocol should be made publicly available with reference to the desensitised version of the data. In cases where no data can be made available, the rationale should be documented and made available for review and challenge.
Users of the data should have reasonable opportunity to challenge decisions and have a point of escalation where agreement between data users and data controllers cannot be reached.
The diagram below is a high level representation of the proposed process. More detail is provided about the steps in the following subsections.
Identification of Discrete Datasets
The goal of open data triage is to identify where issues exist that would prevent the open publication of data in its most granular format, and address them in a way that maintains as much value as possible. To make this process manageable for the data custodian, the first step in this process is:
Identify thematic, usable datasets that can be joined if required rather than general data dumps
In this context, we define a ‘thematic, usable dataset’ as a discrete collection of data which relates to a focused, coherent topic but provides enough information to be of practical use. Data custodians should consider:
- Data source (device, person, system)
- Subject of data (technical, operational, personal, commercial)
- Time and granularity (collection period, frequency of data collection, inherent aggregation)
- Location (country, region, public/private area)
- Other logical categorisations (project, industry, etc.)
For example, an infrastructure construction company may have data about the construction and operation of various building projects across a number of countries. It is sensible to split the data into operational and construction data and then group by type of construction (public space, office building, residential building, etc.), geographic region and year of construction.
The approach described above minimises the risk that the size and complexity of datasets results in issues that are not correctly identified. It also reduces the risk that an issue in one part of the dataset results in the whole dataset being made less open or granular therefore maximising the amount of useful data that is openly available in its most granular form. For example, providing complete output from a data warehouse in one data dump could contain information about customers, employees, financial performance, company Key Performance Indicators (KPIs), etc. all of which would present issues that would mean the data needs to be modified or the openness reduced. Whereas extracting tables (or parts of tables) from the data warehouse would provide a more granular level of control which enables individual issues to be identified and addressed accordingly which would in turn maximise the data which is made openly available.
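As a rough sketch of the splitting step, the Python below partitions a flat list of records into thematic datasets using the categorisations from the construction example. The field names and records are hypothetical; in practice the split would more likely be expressed as queries against the source system.

```python
from collections import defaultdict


def split_into_thematic_datasets(records, keys):
    """Group records into discrete datasets keyed by the chosen categorisations."""
    datasets = defaultdict(list)
    for rec in records:
        datasets[tuple(rec[k] for k in keys)].append(rec)
    return dict(datasets)


# Hypothetical records from a single, general data dump.
records = [
    {"phase": "construction", "building_type": "office", "region": "North", "year": 2018, "cost": 10},
    {"phase": "construction", "building_type": "office", "region": "North", "year": 2018, "cost": 12},
    {"phase": "operation", "building_type": "residential", "region": "South", "year": 2015, "cost": 3},
]

datasets = split_into_thematic_datasets(
    records, keys=("phase", "building_type", "region", "year")
)
for theme, rows in datasets.items():
    print(theme, len(rows))
```

Each resulting dataset can then be triaged on its own, so an issue found in one does not restrict the openness of the others.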
Identification of Issues
Once a thematic, usable dataset has been identified the data controller should assess the dataset to identify if there are any issues which would prevent the open publication of the data in its most granular format.
Identify the potential issues which might limit the openness or granularity of the dataset
In the table below, we outline a range of issue categories which should be carefully considered. Some of these categories will directly relate to triage processes which already exist in organisations but others may require the adaptation of existing processes or creation of new processes to provide a comprehensive solution.
This should be a familiar process as GDPR introduced a range of requirements for organisations to identify personal data and conduct Data Protection Impact Assessments (DPIA). The ICO has a wealth of advice and guidance on these topics, including definitions of personal data and DPIA templates.
It is important that Open Data Triage is used to effectively identify privacy issues and ensure that any data which is released has been appropriately processed to remove private information and retain customer confidence in the product, service or system.
Companies and organisations that own and operate infrastructure should already have a risk identification and mitigation program to support the protection of Critical National Infrastructure (CNI). The Centre for the Protection of National Infrastructure (CPNI) have advice and guidance for organisations involved in the operation and protection of CNI.
Outside of CNI, organisations should assess the incremental security risks that could be created through the publication of data. Organisations should consider personnel, physical and cyber security when identifying issues and identify if the issue primarily impacts the publishing organisation or if it has wider impacts. Issue identification should take into account the security protocols that already exist within an organisation and flag areas where the residual risk (after mitigation) is unacceptably high.
Note, where the information contained within a dataset is already publicly available via existing means (such as publicly available satellite imagery) the security issue assessment should consider the incremental risk of data publication using the existing situation as the baseline.
Organisations should consider how the dataset could be used to drive outcomes that would negatively impact customers by enabling manipulation of markets, embedding bias into products or services, incentivising of actions which are detrimental to decarbonisation of the system, etc.
Commercial data relating to the private administration of a business (HR, payroll, employee performance, etc.) is deemed to be private information and is a legitimate reason for data to be closed, although organisations may choose to publish for their own reasons such as reporting or corporate social responsibility (CSR).
Data which does not relate to the administration of the business but has been collected or generated through actions which are outside of the organisation’s legislative or regulatory core obligations and funded through private investment may also have legitimate reason to be closed. This description may include the data generated through innovation projects but consideration should be given to the source of funding and any data publication or sharing requirements this might create.
Where an organisation is a regulated monopoly, special consideration should also be given to the privileged position of the organisation and the duty to enable and facilitate competition within their domain.
Where datasets contain Intellectual Property (IP) belonging to other organisations or where the data has been obtained with terms and conditions or a data licence which would restrict onward publishing this should also be identified. Note, the expectation is that organisations should be migrating away from restrictive licences / terms and conditions that restrict onward data publishing and sharing where possible.
Organisations should have legal and regulatory compliance processes which are able to identify and drive compliance with any obligations the company has.
Consideration should include:
- Utilities Act 2000
- Electricity Act 1989
- Gas Act 1986 / 1995
- Competition Act 1998
- Enterprise Act 2002
- Enterprise and Regulatory Reform Act 2013
- Data Protection Act 2018
- General Data Protection Regulation (GDPR)
Consider the impact of related or adjacent datasets
When assessing the sensitivity of data, thought should be given to the other datasets which are already publicly available and the issues which may arise from the combination of datasets. Organisations should consider where there are datasets outside of their control which, if published, could create issues which would need to be mitigated. Special consideration should be given to datasets which share a common key or identifier. This includes but is not limited to:
- subject reference (e.g. Passport Number),
- technical reference (e.g. Serial Number),
- time (e.g. UTC),
- space (e.g. Postcode or Property Identifier)
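A simple first pass at spotting shared keys is to compare the field names of a candidate dataset with those of datasets that are already published. The sketch below does this by name only, which is a crude heuristic: the dataset names and fields are hypothetical, and fields with different names may still hold the same identifier, so the result should prompt a manual review rather than replace one.

```python
def shared_keys(candidate_fields, published_datasets):
    """Report fields the candidate dataset has in common with published datasets."""
    candidate = {field.lower() for field in candidate_fields}
    overlaps = {}
    for name, fields in published_datasets.items():
        common = candidate & {field.lower() for field in fields}
        if common:
            overlaps[name] = sorted(common)
    return overlaps


# Hypothetical field lists for illustration.
candidate_fields = ["meter_serial_number", "postcode", "reading_kwh", "timestamp_utc"]
published_datasets = {
    "Hypothetical property register": ["Postcode", "property_type"],
    "Hypothetical weather observations": ["station_id", "timestamp_utc", "temperature"],
    "Hypothetical asset list": ["asset_id"],
}

overlaps = shared_keys(candidate_fields, published_datasets)
for dataset, fields in overlaps.items():
    print(f"{dataset}: joinable on {', '.join(fields)}")
```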
As new datasets are made available, markets develop and public attitudes change there may be a need to revise the original assessment. For example, a dataset which was initially deemed too sensitive to be released openly in its most granular form could be rendered less sensitive due to changes to market structure or change in regulatory obligations. Equally, a dataset which was published openly could become more sensitive due to the publication of a related dataset or technology development. At a minimum, data custodians should aim to review and verify their open data triage assessments on an annual basis.
Mitigation of Issues
Where the assessment process identifies an issue, the aim should be to mitigate the issue through modification of data or reduced openness whilst maximising the value of the dataset for a range of stakeholders.
Mitigate issues through modification of data or reduced openness whilst addressing user needs
Open data with some redactions may be preferable to shared data without, but if redactions render the data useless then public or shared data may be better. In some cases, the objectives of the prospective data users might create requirements which cannot be resolved by a single solution, so it may be necessary to provide different variations or levels of access, for example providing open access to a desensitised version of the data for general consumption alongside shared access to the unadulterated data for a subset of known users.
Modification of Data
Modification of data can serve to reduce the sensitivity whilst enabling the data to be open. There are a wide variety of possible modifications of data which can be used to address different types of sensitivity.
An organisation has a licence condition to collect certain data about individual usage of national infrastructure. The data is collected about individual usage on a daily basis and could reveal information about individuals if it was to be released openly.
By removing identifying features such as granular location and individual reference it could be possible to successfully anonymise the data such that individuals cannot be re-identified so the data could be made openly available.
Simple anonymisation can be very effective at protecting personal data but it needs to be undertaken with care to minimise the risk of re-identification. Anonymisation techniques can be combined with other mitigation techniques to minimise this risk.
The UK ICO have provided an anonymisation code of practice which should be adhered to.
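A minimal sketch of the example above, assuming hypothetical field names (`customer_ref`, `name`, `postcode`, `daily_use`): direct identifiers are dropped, the postcode is coarsened to its outward code, and a simple group-size check flags records that may still stand out. It illustrates the idea only and does not by itself demonstrate compliance with the ICO code of practice.

```python
from collections import Counter

# Hypothetical field names; adjust to the dataset being assessed.
DIRECT_IDENTIFIERS = {"customer_ref", "name"}

def anonymise(record):
    """Drop direct identifiers and coarsen the postcode to its outward code."""
    released = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    released["postcode"] = record["postcode"].split()[0]
    return released

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

raw = [
    {"customer_ref": "C001", "name": "Resident A", "postcode": "AB1 2CD", "daily_use": 12},
    {"customer_ref": "C002", "name": "Resident B", "postcode": "AB1 9ZZ", "daily_use": 7},
    {"customer_ref": "C003", "name": "Resident C", "postcode": "EF3 4GH", "daily_use": 15},
]

released = [anonymise(r) for r in raw]
# A group of one means a record is unique on postcode area and may be re-identifiable.
risk_flag = smallest_group(released, ["postcode"]) < 2
```

In this sample the `EF3` area contains a single record, so the flag is raised and further coarsening or another mitigation would be worth considering.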
An organisation (with permission) collects data about how customers use a web service. This data is used to diagnose problems where there are issues with the website operation.
Replacing the customer name and address with a random unique identifier allows the behaviour that led to an issue to be analysed whilst protecting the identity of an individual user.
Pseudonymisation is distinct from anonymisation as it is possible to consistently identify individuals but not link this to a specific, named person.
Pseudonymisation should be used carefully as it is often possible to utilise external datasets and data analysis to match identifiers and trends such that the individual can be re-identified. For example, it may be possible to analyse the website usage patterns (times, locations, device type, etc.) and cross reference with other personally identifiable datasets (social media posts, mobile positioning data, work schedules, etc.) to identify an individual with a sufficient level of confidence. Again, the ODI and ICO provide useful guidance in this area.
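The replacement of name and address with a random unique identifier could be sketched as follows. The `Pseudonymiser` class and the event fields are hypothetical, and the lookup table it holds would itself need to be kept secure (or discarded) because it allows the pseudonyms to be reversed.

```python
import secrets

class Pseudonymiser:
    """Map each identity to a stable random identifier (illustrative only)."""

    def __init__(self):
        self._lookup = {}  # identity -> pseudonym; must be protected or destroyed

    def pseudonym(self, identity):
        if identity not in self._lookup:
            self._lookup[identity] = secrets.token_hex(8)
        return self._lookup[identity]

# Hypothetical web service events.
events = [
    {"name": "Resident A", "address": "1 Example Street", "page": "/login", "error": True},
    {"name": "Resident A", "address": "1 Example Street", "page": "/account", "error": False},
    {"name": "Resident B", "address": "2 Example Street", "page": "/login", "error": False},
]

p = Pseudonymiser()
released = [
    {"user": p.pseudonym((e["name"], e["address"])), "page": e["page"], "error": e["error"]}
    for e in events
]
```

The same user always receives the same identifier, so a sequence of events can still be followed, but the name and address no longer appear in the released records.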
An organisation collects information about how individuals use a privately built product or service (e.g. a travel planner). This data could be of great use for planning adjacent systems (e.g. the energy system or road network), but releasing the anonymised, granular data would give competitors a commercial advantage.
By introducing seemingly random noise into the dataset in a way that ensures the data remains statistically representative while the detail of individuals is subtly altered, the data can be made available whilst reducing the commercial risk.
Introducing noise to data in a way that successfully obfuscates sensitive information whilst retaining the statistical integrity of the dataset is a challenging task that requires specialist data and statistics skills. Consideration needs to be given to the required distribution, which features the noise will be applied to and the consistency of application.
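As a simple illustration, the sketch below adds zero-mean Gaussian noise to a single numeric feature and checks that the mean is approximately preserved. The choice of a Gaussian distribution, the value of `sigma` and the sample journey lengths are assumptions made for the example; choosing them for a real dataset is the specialist task described above.

```python
import random
from statistics import mean

def add_noise(values, sigma, seed=None):
    """Add independent zero-mean Gaussian noise to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in values]

# Hypothetical journey lengths in km.
journeys_km = [float(10 + (i % 50)) for i in range(10_000)]
noisy_km = add_noise(journeys_km, sigma=5.0, seed=42)

# Individual values change, but the overall mean stays close to the original.
drift = abs(mean(noisy_km) - mean(journeys_km))
```

Because the noise has zero mean, aggregate statistics such as the mean remain close to the original, while any individual value may differ by several kilometres.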
An organisation operates a network of technical assets, some of which fail on occasion. If the data related to those assets was made available, innovators could help to identify patterns which predict outages before they occur and improve the network stability. However, the data could also be used to target an attack on the network at a point which is already actively under strain and cause maximum impact.
By introducing a sufficient delay between the data being generated and published, the organisation can mitigate the risk of the data being used to attack the network whilst benefiting from innovation.
Delaying the release of data is a simple but effective method of enabling detailed information to be released whilst mitigating many types of negative impact. However, it may be necessary to combine this with other mitigation techniques to completely mitigate more complicated risks.
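A delayed release can be as simple as filtering on the age of each record at the time of publication. In the sketch below the 30-day delay, the field names and the `publishable` helper are illustrative assumptions; the appropriate delay depends on how long the data remains useful to an attacker.

```python
from datetime import datetime, timedelta, timezone

def publishable(records, delay, now):
    """Return only the records that are at least `delay` old at time `now`."""
    cutoff = now - delay
    return [r for r in records if r["recorded_at"] <= cutoff]

now = datetime(2020, 10, 15, 12, 0, tzinfo=timezone.utc)

# Hypothetical asset loading readings.
readings = [
    {"asset": "A1", "load_pct": 97, "recorded_at": now - timedelta(hours=1)},
    {"asset": "A1", "load_pct": 64, "recorded_at": now - timedelta(days=45)},
]

released = publishable(readings, delay=timedelta(days=30), now=now)
```

Only the 45-day-old reading is released; the reading from an hour ago, which shows an asset currently under strain, is withheld until the delay has passed.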
An organisation collects rich data from customers which is highly valuable but sensitive (e.g. email content). The sensitivity of this data is very high but the potential for learning is also very high.
Differential privacy enables large amounts of data to be collected from many individuals whilst retaining privacy. Noise is added to individuals’ data which is then ingested by a model. As large amounts of data are combined, the noise averages out and patterns can emerge. It is possible to design this process such that the results cannot be linked back to an individual user and privacy is preserved.
Differential privacy is an advanced technique but can be very effective. It is used by top technology firms to provide the benefits of machine learning without the privacy impact that usually accompanies it.
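One of the simplest mechanisms that satisfies (local) differential privacy is randomised response, used here as a stand-in for the more sophisticated schemes deployed in practice. Each individual reports a yes/no answer truthfully half of the time and otherwise reports a coin flip, which corresponds to a privacy parameter of ε = ln 3. The population, the 30% rate and the function names are hypothetical.

```python
import random

def randomised_response(truth, rng):
    """Report the true answer half the time, otherwise a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_rate(reports):
    """Invert the known noise: P(report True) = 0.25 + 0.5 * true_rate."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)

rng = random.Random(7)

# Hypothetical population in which 30% of individuals hold the sensitive attribute.
true_answers = [i % 10 < 3 for i in range(20_000)]
reports = [randomised_response(t, rng) for t in true_answers]

estimate = estimate_rate(reports)
```

No single report reveals an individual's true answer with certainty, yet the noise averages out across the population so the estimated rate lands close to the true 30%.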
Sharing a model can be a highly effective way of enabling parties to access the benefit of highly sensitive, granular data without providing direct access to the raw information. However, this is an emerging area so carries some complexity and risk.
Security / Legislation and Regulation
An organisation maintains data about a large number of buildings across the country and their usage. Within the dataset there are a number of buildings which are identified as Critical National Infrastructure (CNI) sites which are at particular risk of targeted attack if they are known.
In this case it is possible to simply redact the data for the CNI sites and release the rest of the dataset (assuming there is no other sensitivity). Note, this approach works here because the dataset is not complete and therefore it is not possible to draw a conclusion about a site which is missing from the data as it may simply have not been included.
Redaction is commonplace when publishing data as it is a very effective method of reducing risk. However, care needs to be taken to ensure that it is not possible to deduce something by the lack of data. In general, if the scope and completeness of data is sufficient that the lack of data is noteworthy then redaction may not be appropriate. For example, an authoritative map which has a conspicuous blank area indicates that the site is likely to be of some interest or importance.
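Record-level redaction of this kind amounts to filtering out the sensitive sites before release. The site identifiers, fields and `redact_sites` helper below are hypothetical.

```python
def redact_sites(records, sensitive_ids):
    """Remove every record that belongs to a sensitive site."""
    return [r for r in records if r["site_id"] not in sensitive_ids]

# Hypothetical building records and a hypothetical list of CNI site identifiers.
buildings = [
    {"site_id": "B001", "use": "office", "floor_area_m2": 1200},
    {"site_id": "B002", "use": "control centre", "floor_area_m2": 800},
    {"site_id": "B003", "use": "retail", "floor_area_m2": 450},
]
cni_site_ids = {"B002"}

released = redact_sites(buildings, cni_site_ids)
```

The released list gives no indication that a record was removed, which is only safe where the dataset is not expected to be complete.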
An organisation collects information about the performance of their private assets which form part of a wider system (e.g. energy generation output). This data could be of great use to the other actors within the system but releasing the data in its raw format may breach commercial agreements or provide competitors with an unfair advantage.
By aggregating the data (by technology, time, location or other dimension) the sensitivity can be reduced whilst maintaining some of the value of the data.
Aggregation is effective at reducing sensitivity but can significantly reduce the value of the data. It may be worth providing multiple aggregated views of the data to address the needs of a range of stakeholders.
Where aggregation is the only effective mechanism to reduce sensitivity organisations may want to consider providing access to aggregated data openly alongside more granular data that can be shared with restricted conditions.
Note, aggregating data which is of a low level of accuracy or quality can provide a misleading picture to potential users. Custodians should consider this when thinking about the use of data aggregation and make potential users aware of potential quality issues.
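Providing multiple aggregated views can be sketched as grouping the same records by different dimensions. The generation records, field names and `aggregate` helper below are hypothetical.

```python
from collections import defaultdict

def aggregate(records, by, value):
    """Sum `value` over all records sharing the same `by` field."""
    totals = defaultdict(float)
    for r in records:
        totals[r[by]] += r[value]
    return dict(totals)

# Hypothetical half-hourly generation output per site.
output = [
    {"site": "S1", "technology": "wind", "half_hour": "00:00", "mwh": 1.5},
    {"site": "S2", "technology": "wind", "half_hour": "00:00", "mwh": 2.5},
    {"site": "S3", "technology": "solar", "half_hour": "00:00", "mwh": 0.5},
]

by_technology = aggregate(output, by="technology", value="mwh")
by_half_hour = aggregate(output, by="half_hour", value="mwh")
```

Neither view exposes the output of an individual site, although a group containing only one site (solar in this sample) would still reveal that site's figure and would need further treatment.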
An organisation collects data on how their customers use a mobile product including when and where. This movement data could be of value to other organisations in order to plan infrastructure investment but the data reveals the patterns of individuals which cannot be openly published.
The initial step is to remove any identifying features (e.g. device IDs) and break the movement data into small blocks. Each block of movement can then be shifted in time and space such that they cannot be reassembled to identify the movement patterns of individuals. This means that realistic, granular data can be shared but the privacy of individuals can be protected.
Shifting or rotating data can be useful to desensitise spatial or temporal data. However, it is important to recognise context to ensure that the data makes sense and cannot be easily reconstructed. For example, car journey data will almost always take place on roads and therefore rotation can make the data nonsensical and it can be pattern matched to the underlying road network with relative ease.
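A minimal sketch of shifting one block of movement data, assuming each point is a `(timestamp, x, y)` tuple with coordinates in metres on a projected grid. The maximum offsets and the `shift_block` helper are illustrative assumptions, and the sketch does not address the road network matching risk noted above.

```python
import random
from datetime import datetime, timedelta

def shift_block(block, rng, max_hours=12.0, max_offset_m=5000.0):
    """Apply one random time and position offset to every point in a block."""
    dt = timedelta(hours=rng.uniform(-max_hours, max_hours))
    dx = rng.uniform(-max_offset_m, max_offset_m)
    dy = rng.uniform(-max_offset_m, max_offset_m)
    return [(t + dt, x + dx, y + dy) for t, x, y in block]

rng = random.Random(3)

# One hypothetical block of movement with device identifiers already removed.
block = [
    (datetime(2020, 10, 15, 8, 0), 1000.0, 2000.0),
    (datetime(2020, 10, 15, 8, 5), 1400.0, 2300.0),
]

shifted = shift_block(block, rng)
```

Within a block the durations and distances are preserved, so the data stays realistic, but because each block receives its own offset the blocks no longer line up into a single person's journey.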
An organisation may be the custodian of infrastructure data relating to a number of sensitive locations such as police stations or Ministry of Defence (MoD) buildings. The data itself is of use for a range of purposes but making it openly available could result in security impacts.
Randomising the data (generating arbitrary values) relating to the sensitive locations (rather than redaction) could reduce the sensitivity such that it can be open.
Randomisation can be very effective to reduce sensitivity but it is also destructive so impacts the quality of the underlying data.
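One way to randomise rather than redact is to replace the values for sensitive locations with arbitrary values drawn from the range seen elsewhere in the dataset, so that the sensitive records do not stand out. Drawing from the observed range is an assumption made for this sketch, as are the field names and the `randomise_sensitive` helper.

```python
import random

def randomise_sensitive(records, field, sensitive_ids, rng):
    """Replace `field` for sensitive sites with an arbitrary value drawn from
    the range seen at non-sensitive sites; other records are left unchanged."""
    normal = [r[field] for r in records if r["site_id"] not in sensitive_ids]
    low, high = min(normal), max(normal)
    out = []
    for r in records:
        if r["site_id"] in sensitive_ids:
            r = {**r, field: rng.uniform(low, high)}
        out.append(r)
    return out

rng = random.Random(11)

# Hypothetical infrastructure records; P2 stands for a sensitive location.
sites = [
    {"site_id": "P1", "peak_demand_kw": 40.0},
    {"site_id": "P2", "peak_demand_kw": 950.0},
    {"site_id": "P3", "peak_demand_kw": 65.0},
]

released = randomise_sensitive(sites, "peak_demand_kw", {"P2"}, rng)
```

The original value for the sensitive site is destroyed, which is why users should be told that some values in the dataset are not real measurements.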
Negative Consumer Impact
An organisation provides and collects data about usage of a product in order to diagnose problems and optimise performance. The data has wider use beyond the core purpose, but the associated demographic data could result in bias towards certain groups.
By normalising the data (reducing variance and the ability to discriminate between points) it is possible to reduce the ability for certain factors to be used to differentiate between subjects and hence reduce types of bias.
Normalisation is a statistical technique that requires specialist skills to apply correctly. It may not be enough on its own to address all sensitivities so a multifaceted approach may be required.
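The guidance does not prescribe a specific method, so the sketch below takes one possible reading of normalisation: standardising a value within each demographic group (subtracting the group mean and dividing by the group standard deviation) and then dropping the group label. The field names, sample values and `standardise_within_groups` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardise_within_groups(records, group_field, value_field):
    """Replace each value with its z-score within its group and drop the group label."""
    by_group = defaultdict(list)
    for r in records:
        by_group[r[group_field]].append(r[value_field])
    stats = {g: (mean(v), pstdev(v)) for g, v in by_group.items()}

    out = []
    for r in records:
        mu, sigma = stats[r[group_field]]
        z = 0.0 if sigma == 0 else (r[value_field] - mu) / sigma
        released = {k: v for k, v in r.items() if k != group_field}
        released[value_field] = z
        out.append(released)
    return out

# Hypothetical product usage with an associated demographic group.
usage = [
    {"group": "A", "hours": 10},
    {"group": "A", "hours": 20},
    {"group": "B", "hours": 100},
    {"group": "B", "hours": 200},
]

released = standardise_within_groups(usage, "group", "hours")
```

After standardisation both groups share the same scale, so the level of usage no longer distinguishes one group from the other, at the cost of losing the absolute values.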
Data Custodians should consider the impact of data modification on the usefulness of the data and seek to use techniques which retain the greatest fidelity of data whilst mitigating the identified issue. Given the diversity of data and variety of use cases it is not possible to define a definitive hierarchy of preferred data modification techniques, but techniques which have limited impact on the overall scope and accuracy of the data should be prioritised over those which make substantial, global changes. For example, redaction of sensitive fields for <1% of records is likely to be better than dataset-wide aggregation.
Level of Access
When the mitigation techniques have been applied as appropriate the data custodian should consider how open the resulting data can be. Where the mitigation has been successful the data can be published openly for all to use. However, if the nature of the data means that it is only valuable in its most granular form it may be necessary to reduce openness but keep granularity.
The above table is based on the ODI data spectrum.
Balancing Openness, Modification and User Needs
A key factor to consider is the needs of the potential data users. Initially, there may be value in providing aggregated summaries of data which can be made entirely open, but as new use cases and user needs emerge we may find that access to more granular data is required, which necessitates a more sophisticated mitigation technique or a more granular version of the data which is shared less openly. In some cases, it may be prudent to make multiple versions of a dataset available to serve the needs of a range of users.
The goal of presumed open is to make data as accessible as possible. This enables innovators to opportunistically explore data and identify opportunities that are not obvious. Wherever possible, the data custodian should seek to identify a mitigation approach which addresses all issues whilst maximising access to the most granular data.
In some cases, the number and diversity of issues may be so great that the mitigation or reduction of openness required to address all of the issues simultaneously is deemed too detrimental to the overall value of the data. In these cases, the data custodian could consider the user needs and individual use cases to help guide the mitigation strategies. For example, it may be possible to provide aggregated data openly for the purposes of statistical reporting but more granular data to a set of known participants via a secure data environment for another use case.
Issues that are identified should be clearly documented. Where issues have been mitigated through reduced openness or data modification, the mitigation technique should also be clearly documented.