By Jessica Rast, PhD
There are many definitions out there for big data. Enter “big data” into an internet search bar and you’ve found yourself some big data in all those results.
What is Big Data?
In computer science and technology, big data means: 1) a lot of data, 2) from multiple sources, 3) in different formats. Big data is often described by a set of “V’s”: commonly, volume, variety, and velocity. A good example of big data is the data generated by social media use. Each use contributes a vast amount of data that may come from posts, likes, comments, and clicks (the volume, which refers to the sheer amount of data that exists). The data comes in several formats, from free text like a comment to counts of engagement (the variety, which refers to data in multiple formats). And millions of people are interacting and producing a lot of data quickly (the velocity, which refers to data that is constantly updating).
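For readers who like to see ideas in code, the three V's can be sketched on a toy dataset. This is only a minimal illustration, not how any real platform stores or measures its data: the records, the field names (`user`, `kind`, `time`, `payload`), and the `describe` helper are all made up for this example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy records; the field names are invented for illustration.
events = [
    {"user": "a", "kind": "post", "time": "2024-01-01T12:00:00", "payload": "Hello world"},
    {"user": "b", "kind": "like", "time": "2024-01-01T12:00:20", "payload": 1},
    {"user": "c", "kind": "comment", "time": "2024-01-01T12:00:40", "payload": "Nice!"},
    {"user": "a", "kind": "click", "time": "2024-01-01T12:01:00", "payload": {"x": 10, "y": 20}},
]


def describe(events):
    """Summarize a list of events by volume, variety, and velocity."""
    times = [datetime.fromisoformat(e["time"]) for e in events]
    span_seconds = (max(times) - min(times)).total_seconds()
    return {
        # Volume: how many records there are.
        "volume": len(events),
        # Variety: how many records carry each kind of payload (text, number, nested data).
        "variety": dict(Counter(type(e["payload"]).__name__ for e in events)),
        # Velocity: records per minute over the observed time span.
        "velocity_per_minute": len(events) / (span_seconds / 60) if span_seconds else None,
    }


summary = describe(events)
print(summary)
```

Real big data differs from this sketch mainly in scale: the same three questions are being asked of millions of records that arrive continuously.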
In our work through the National Autism Data Center, an initiative of the Policy and Analytics Center at the A.J. Drexel Autism Institute, we often refer to our data as “big.” The mission of the National Autism Data Center is to use secondary data to fuel population-level information about all areas of life. These data are often national, have many observations, and cover a variety of topics. Until recently, we haven’t attached a formal definition to the term “big data” as much as we’ve relied on colloquialisms:
“This dataset is too big for my hard drive!”
“Look at all the people represented here!”
“Can you believe how long this analysis is taking!?”
Establishing Values Around Big Data
Something does not need to be defined to exist, but a definition can help. It is useful for communication and for setting expectations, and it can help build a set of values surrounding the use of such data.
As part of our work with the Autism Intervention Research Network on Physical Health (AIR-P), we worked to outline core values in using big data in neurodiversity research. As a group of researchers of various backgrounds and disciplines, we use various big data sources to examine the health, health services, and service systems of autistic people. In our group, we didn’t start with a single understanding of big data in neurodiversity research. We all brought our own ideas to the table.
The act of building a definition led us to the realization that our use of big data is driven by a set of values. Some of these values were written into our mission and vision statements long before we began our journey to define big data. For us, the purpose of big data is to highlight where and how we can make social and system changes to ultimately improve population health and wellbeing. Big data doesn’t exist for us without this purpose. We do not have big data just for the sake of it. Current definitions of big data don’t consider the ultimate utility of the data and the responsibility of the data user to produce useful and high-quality results. It was important for us to have this up front in our definition and use of big data.
We also felt something else was missing from existing definitions of big data: people. If we aim to use big data to improve the health of populations, we must have information about diverse populations. We decided the inclusion of diversity was integral to our definition, because it is integral to our mission. Just as humanity has endless facets of diversity, our definition acknowledges diversity in many senses of the term.
Realizing all this, we proposed the following definition of big data in neurodiversity research:
High variety data covering a diverse population that allows the user to advance our understanding of population-level health.
To unpack this further: variety means data of multiple types or structures, and diversity covers many dimensions, including geography, demographics, policy relevance, and measures that fill gaps in the data that are usually available.
Beyond our definition, we proposed a set of values that drive our use of big data. Meet CREST: in our use of big data, we strive to be credible, responsive, ethical, significant, and transparent.
Applying values or standards to data and technology is not unique to our efforts. In January 2023, the U.S. National Institute of Standards and Technology published an Artificial Intelligence Risk Management Framework (AI RMF) that applies values to the creation of AI. Similar to the way we see our use of big data as impacting society at large, the AI RMF states that AI creators have a social responsibility to ensure trustworthy creation of AI. Its values include AI that is safe, secure and resilient, explainable and interpretable, privacy-enhanced, fair with harmful bias managed, accountable and transparent, and valid and reliable.
We’ve been using big data for a long time, but our journey to officially define values and standards of its use is just starting. This effort, along with how we use big data, will continue to evolve as societal values and priorities shift. We welcome feedback and questions on these starting values and how to work to define them.