Finding, Re-Using and Citing Datasets

Before creating a new dataset, assess whether existing datasets can be re-used or combined. Doing so is an important part of any research process and contributes to efficiency and reproducibility.

Finding

First, during your literature review, look for datasets that support relevant publications. Most research articles should include a Data Access Statement that explains how to access the underlying data.

You can then search appropriate data repositories directly. The following are some recommended general resources for finding data repositories and datasets.

Re-Using

There is a big difference between referencing data that is similar or relevant to your research, much as you would in a literature review, and fully re-purposing and re-using a dataset. This section refers to the latter.

Arranging access to a dataset you would like to re-use can be the first hurdle, as not all data can be made open. Access restrictions may permit only certain uses, and arranging access might involve negotiation with the dataset creators.

You will also need to check the dataset's license and, if applicable, the licenses of any datasets it derives from. Make sure that you are legally permitted to use the data in the way you intend, in addition to being able to access it.
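
As a rough illustration of checking a dataset together with the datasets it derives from, the following Python sketch walks a chain of datasets and reports any license that is not on an agreed list. The `Dataset` class, the `license_problems` function and the `ALLOWED_LICENSES` set are hypothetical names introduced here for illustration, and the license identifiers are only examples. A real check still means reading the license terms themselves.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical set of licenses a project has decided fit its intended use.
ALLOWED_LICENSES = {"CC0-1.0", "CC-BY-4.0"}


@dataclass
class Dataset:
    """Minimal, hypothetical record of a dataset and the datasets it derives from."""
    name: str
    license: str
    derived_from: list[Dataset] = field(default_factory=list)


def license_problems(dataset, allowed=ALLOWED_LICENSES):
    """Return 'name: license' for the dataset and every upstream dataset
    whose license is not in `allowed`."""
    problems = []
    stack = [dataset]
    seen = set()
    while stack:
        current = stack.pop()
        if id(current) in seen:  # skip datasets already visited
            continue
        seen.add(id(current))
        if current.license not in allowed:
            problems.append(f"{current.name}: {current.license}")
        stack.extend(current.derived_from)  # also check the sources
    return problems


# Example: the cleaned dataset is openly licensed, but its source is not.
raw = Dataset("survey-raw", "CC-BY-NC-4.0")
clean = Dataset("survey-clean", "CC-BY-4.0", derived_from=[raw])
print(license_problems(clean))
```

An empty result only means that no license outside the agreed list was found. It does not confirm that the intended use is permitted.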

Currently, most datasets are not peer-reviewed unless they are published as a data paper. You might therefore want to carry out a similar assessment yourself before deciding that a dataset is suitable for re-use.

Citing

You will also need to cite the data in an appropriate format, in the same way that you would cite any other literature. The most important things to consider are:

  • citation only implies that the dataset is relevant to your research; you will need to make clear elsewhere how it has been re-used.
  • use a DOI so that citations can be tracked, which increases the incentive to publish data.
  • datasets are created rather than authored, so it is better if all creators are named in the citation.
  • to help prevent gaming by splitting datasets to increase citation counts, aggregate citations where possible.
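
The points about DOIs and naming all creators can be sketched in a few lines of Python. This is a minimal example that assumes a "Creators (Year). Title. Publisher. DOI" pattern, which is one commonly used layout for dataset citations. The function name `format_dataset_citation` and all of the example values, including the DOI, are hypothetical. Follow the style required by your publisher or repository where it differs.

```python
def format_dataset_citation(creators, year, title, publisher, doi):
    """Build a dataset citation that names every creator and ends with a
    resolvable DOI link. The layout is an assumed, commonly used pattern."""
    names = "; ".join(creators)  # list all creators, not just the first
    return f"{names} ({year}). {title}. {publisher}. https://doi.org/{doi}"


# Hypothetical example values, for illustration only.
citation = format_dataset_citation(
    creators=["Smith, J.", "Jones, A."],
    year=2020,
    title="Example Survey Data",
    publisher="Example Repository",
    doi="10.1234/example",
)
print(citation)
```

Writing the DOI as a full `https://doi.org/` link keeps the citation both trackable and directly resolvable.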

More information: