Is Big Data always best?



The concept of big data cuts across the fields of computer science and statistics. The big data era has brought a massive increase in the production of a wide variety of datasets. The volume of data has grown substantially, mainly because advances in computational tools allow us to collect many kinds of data. I first stumbled on an article called “When small data beats big data” by Faraway and Augustin (2018). The question that came to my mind was: what type of data qualifies as big data in statistics?

Big data is often characterised by huge sample sizes and a large number of variables with very complex forms of dependence. Yet there is no universal definition of big data. Some have defined it as data whose size itself becomes part of the problem (Loukides, 2010; Reimsbach-Kounatze, 2015). Big data is often summarised by the “three V’s” definition, which emphasises its three main qualities: the volume of the data (petabytes or exabytes), the variety of the data (structured or unstructured) and the velocity (speed) at which the data is generated or collected.

Examples of data that come in high volume, at high velocity and in high variety include mobile phone data, banking transactions, and ecological and environmental data. Such data create opportunities for new statistics or for the redesign of existing statistics. For instance, their high volume could potentially lead to better accuracy and more detail, their high velocity may lead to more frequent and timely statistical estimates, and their wide variety may give rise to statistics in new areas (Braaksma and Zeelenberg, 2015). Big data is not just massive but could also be big in complexity and dimension (Secchi, 2018).

What about small data? Small data consists of usable, portable chunks. It is easily accessible, readily available for analysis, and can be easily comprehended by humans looking for actionable insights. Indeed, according to Rufus Pollock of the Open Knowledge Foundation, the hype around big data is misplaced, and small, linked data is where the real value lies. Lindstrom (2016), in the book Small Data: The Tiny Clues That Uncover Huge Trends, shows how the tiniest clues from a consumer study can lead to big insights for business analytics.

Researchers have embraced big data, but along with the opportunities it provides, it also brings complexities and challenges around cost, quality, selection bias and redundancy:

Cost. Storing large data sets poses both infrastructural and economic problems. Proponents of small data outline the challenges and risks associated with a large volume of data. Obtaining more data costs more and complicates the analysis. For instance, under a limited budget there is a trade-off between quality and quantity. Sometimes small data will beat big data and reach the right conclusions faster, more reliably and at lower cost (Faraway and Augustin, 2018).

Quality. The information that can be extracted from data depends on the quality of the data, and the problem starts with how that quality is assessed and assured. Poor-quality data will almost always lead to poor results. Thus, data cleaning or data wrangling is often highlighted as an essential step before the data can be analysed. The preparation of big data sets for statistical analysis can be extremely time-consuming and can account for 50% to 80% of a researcher’s time (Reimsbach-Kounatze, 2015).

Selection bias. Even if the data is of good quality, statistical analyses can still lead to wrong results if the data used is not representative. Researchers recognise that it is tempting to think that with big data one has sufficient data to answer almost every question, and to neglect data biases that could lead to false conclusions. Big data may be highly volatile and selective; that is, the coverage of the population to which the data refers may change from day to day, leading to inexplicable jumps in time series.
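To make the selection bias point concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than anything taken from the cited papers: the population is hypothetical (normally distributed values), the "big" data set records units with above-average values twice as often as the rest, and the "small" data set is a simple random sample of 1,000 units. The function name compare_samples and its parameters are made up for this example.

```python
import random
import statistics


def compare_samples(pop_size=200_000, small_n=1_000, seed=1):
    """Compare a large selective sample with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical population: one numeric measurement per unit.
    population = [rng.gauss(50, 10) for _ in range(pop_size)]
    true_mean = statistics.fmean(population)

    # "Big" data (assumed mechanism): units with values above 50 are
    # recorded with probability 0.6, all other units with probability 0.3.
    big = [x for x in population if rng.random() < (0.6 if x > 50 else 0.3)]

    # "Small" data: a simple random sample drawn without replacement.
    small = rng.sample(population, small_n)

    return {
        "big_n": len(big),
        "big_error": abs(statistics.fmean(big) - true_mean),
        "small_n": len(small),
        "small_error": abs(statistics.fmean(small) - true_mean),
    }


result = compare_samples()
print(result)
```

Under these assumptions the selective sample is many times larger than the random one, yet its mean should sit further from the population mean, because collecting more data through the same selective mechanism does not remove the bias.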

Redundancy. Big data sets are often filled with inconsistent duplicates of the same entry, causing data anomalies and corruption. Sometimes data redundancy happens by accident and can be attributed to complex processes or inefficient computer code. A potential problem with data redundancy is that it increases the size of the data, making it more challenging to handle. Redundant data can also lead to longer load times and increased storage costs (Michael Wu, 2017).
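As a rough illustration of what "inconsistent duplicates" can look like in practice, the sketch below removes records that differ only in letter case or spacing. The record fields (name, city), the sample records and the helper names normalise and deduplicate are hypothetical, and real data sets usually need more careful matching rules than this.

```python
def normalise(value):
    """Collapse whitespace and ignore case so near-identical entries match."""
    return " ".join(str(value).split()).casefold()


def deduplicate(records, key_fields):
    """Keep the first record seen for each normalised key."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(normalise(record[field]) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up records: the second and fourth duplicate the first.
records = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "ada  lovelace ", "city": "LONDON"},
    {"name": "Alan Turing", "city": "Manchester"},
    {"name": "Ada Lovelace", "city": "London"},
]

cleaned = deduplicate(records, ["name", "city"])
print(len(records), "records before,", len(cleaned), "after")
```

Keeping the first occurrence is only one possible policy; depending on the data, it may make more sense to keep the most recent or the most complete record.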

In summary, while the big data phenomenon continues unabated, its complexity is placing increasing demands on processing speed and creating a need for faster analysis methods to enable robust decision making. This has dramatically challenged existing data analysis techniques. Clearly, big data plays a crucial role in the advancement of science and has the potential to be a massive asset. But that potential is purely theoretical until it reveals valuable information.

Bibliography

J. J. Faraway, N. H. Augustin, When small data beats big data, Statistics & Probability Letters 136 (2018) 142–145.

M. Loukides, What is data science? The future belongs to the companies and people that turn data into products. An O’Reilly radar report, 2010.

C. Reimsbach-Kounatze, The Proliferation of “Big Data” and Implications for Official Statistics and Statistical Agencies, 2015.

B. Braaksma, K. Zeelenberg, “Re-make/Re-model”: Should big data change the modelling paradigm in official statistics?, Statistical Journal of the IAOS 31 (2) (2015) 193–202.

P. Secchi, On the role of statistics in the era of big data: A call for a debate, Statistics & Probability Letters 136 (2018) 10–14.

M. Lindstrom, Small Data: The Tiny Clues That Uncover Huge Trends, St. Martin’s Press, 2016.

Disclaimer

The opinions expressed by our bloggers and those providing comments are personal, and may not necessarily reflect the opinions of Lancaster University. Responsibility for the accuracy of any of the information contained within blog posts belongs to the blogger.
