Choose file formats
The format in which research data are created usually depends on how researchers choose to collect and analyse data. This is often determined by discipline-specific standards and customs. Ensuring long-term usability of data requires consideration of the most appropriate file formats.
Available file formats
The safest way to keep data accessible and usable in the long term is to convert them to standard formats that most software can interpret and that are suitable for data interchange and transformation.
Where possible, choose platform- and vendor-independent file formats to give the best chance of future compatibility.
Danger of obsolescence
In principle, all software is bound to become obsolete. However, several factors should be considered when assessing a file format's long-term stability:
- Is it widely adopted?
- Does it have a history of backward compatibility?
- Does it have good metadata support (in an open format such as XML)?
- Does it have a good range of functionality without being overly complex?
- Does it have an available interchange format with a usable target?
- Does it use built-in error checking?
- Does it have a reasonable upgrade cycle?
Choose non-proprietary formats over proprietary ones
Popular formats such as those produced by Microsoft Office products (e.g. Word documents or Excel spreadsheets) are very likely to have reasonable longevity, but be aware that they are proprietary (owned and controlled by a vendor) and so will not necessarily exist forever or remain easily readable. We encourage researchers to store important information in open, non-proprietary formats, for example:
- PDF/A rather than Microsoft Word (.docx);
- CSV rather than Excel (.xlsx);
- TIFF rather than Photoshop files (.psd); and
- XML rather than a database.
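As an illustration of the last two substitutions, the following is a minimal sketch of exporting a database table to CSV and XML using only the Python standard library. The `samples` table, its columns, the helper names `table_to_csv` and `table_to_xml`, and the XML layout (`<table>`, `<row>`, `<field>`) are all hypothetical choices made for this example. A real export should follow whatever schema your discipline or repository expects.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET


def table_to_csv(conn, table):
    """Return the contents of `table` as CSV text with a header row.

    The table name is interpolated into the query, so it must come
    from a trusted source (it is hard-coded in this sketch).
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(cur)
    return buf.getvalue()


def table_to_xml(conn, table):
    """Return the contents of `table` as an XML string, one <row> per record.

    The element names are an assumption of this sketch, not a standard.
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    root = ET.Element("table", name=table)
    for record in cur:
        row = ET.SubElement(root, "row")
        for col, value in zip(columns, record):
            field = ET.SubElement(row, "field", name=col)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical example data, built in memory so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER, site TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, "north", 12.5), (2, "south", 14.0)],
)
conn.commit()

csv_text = table_to_csv(conn, "samples")
xml_text = table_to_xml(conn, "samples")
print(csv_text)
print(xml_text)
```

Both outputs are plain text that can be opened without the original database software. Note that a flat export like this loses column types, keys and relationships between tables, so it is worth documenting those separately alongside the exported files.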
File format table
Here is a simple overview of some popular data formats and which to choose for long-term preservation. If you need more detailed advice, please look at the UK Data Archive file format table.
| TEXT | .txt; .odt; .xml; .html | .pdf; .rtf; .docx | .doc |
| AUDIO | .flac; .wav | .ogg; .mp3; .aif | .wma; .ra; .ram; compression |
| VIDEO | .mp2; .mp4; .mkv | .wmv; .mov; .avi; compression |
| IMAGE | .tif; .png; .svg; .jpg2000 | .gif; .jpg | .psd; compression |
| DATA | .sql; .csv; .xml | .xlsx | .xls; proprietary DB formats |
| QUANTITATIVE TABULAR DATA | .por | .sav; .dta; .mdb; .accdb |
| GEOSPATIAL DATA | ESRI Shapefile (essential: .shp, .shx, .dbf; optional: .prj, .sbx, .sbn); geo-referenced TIFF (.tif, .tfw); CAD data (.dwg) | .mdb; .mif; .kml; .ai; .dxf; .svg |
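A table like this can be turned into a quick check over a list of file names. The sketch below is illustrative only: it transcribes the TEXT, IMAGE and DATA rows, and it assumes that the columns run from most suitable (1) to least suitable (3) for preservation, which the table itself does not state. The names `TIERS`, `tier_for` and `audit` are hypothetical.

```python
from pathlib import PurePath

# Extensions transcribed from the TEXT, IMAGE and DATA rows of the table.
# Assumption: column 1 is the most suitable for long-term preservation
# and column 3 the least. Adjust if your reading of the table differs.
TIERS = {
    1: {".txt", ".odt", ".xml", ".html", ".tif", ".png", ".svg", ".sql", ".csv"},
    2: {".pdf", ".rtf", ".docx", ".gif", ".jpg", ".xlsx"},
    3: {".doc", ".psd", ".xls"},
}


def tier_for(filename):
    """Return the tier (1, 2 or 3) for a file name, or None if unlisted."""
    suffix = PurePath(filename).suffix.lower()
    for tier, extensions in TIERS.items():
        if suffix in extensions:
            return tier
    return None


def audit(filenames):
    """Group file names by tier; unlisted extensions go under None."""
    report = {}
    for name in filenames:
        report.setdefault(tier_for(name), []).append(name)
    return report


# Hypothetical file names for demonstration.
report = audit(["notes.DOC", "results.csv", "figure.psd", "paper.docx", "raw.xyz"])
for tier, names in report.items():
    print(tier, names)
```

Files that land in tier 3 or under `None` are the ones to review first. The check looks only at the extension, so it cannot tell, for example, a PDF/A file from an ordinary PDF.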