There are at least two ways of interpreting a table of data.
| Date    | Temperature | Humidity (%) |
|---------|-------------|--------------|
| June 18 | 92          | 57           |
| June 19 | 95          | NULL         |
| June 20 | 84          | 51           |
The first interpretation treats the table as a collection of facts about the world. For example, on June 18 the temperature was 92 degrees and the humidity was 57%. On June 19, the temperature was 95 degrees and humidity was unknown.
The second interpretation treats the table as a literal list of data points. For example, on June 18, someone recorded the temperature at 92 degrees and the humidity at 57%. On June 19, the humidity sensor was broken. The data is stored in a table with three columns. Before June 18, the data was being recorded in a different table.
In other words, we can focus on what the data says about the world, or we can focus on the data itself.
We can think of the data abstractly, as information, or we can think of it as a physical thing that exists in and of itself.
This is analogous to written language: a sentence or paragraph generally means something, but it also exists as physical letters and punctuation on the page.
The second interpretation is concerned with what is often called metadata: data about the data. How was it collected, by whom, and for what purpose? Where and how is it stored? How accurate is it likely to be?
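One way to make the two readings concrete is to keep a note about how each row was recorded next to the values themselves. The following is only a sketch in Python: the `Reading` class, its `note` field, and the `describe` helper are hypothetical names chosen for illustration, and the values come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    """One row of the table, plus a free-text note about how it was recorded."""
    date: str
    temperature: Optional[float]
    humidity: Optional[float]  # None stands in for NULL
    note: str = ""             # hypothetical metadata field, not part of the table


readings = [
    Reading("June 18", 92, 57),
    Reading("June 19", 95, None, note="humidity sensor was broken"),
    Reading("June 20", 84, 51),
]


def describe(r: Reading) -> str:
    """Render a row as a fact about the world, followed by any recording note."""
    humidity = "unknown" if r.humidity is None else f"{r.humidity}%"
    fact = f"{r.date}: temperature {r.temperature}, humidity {humidity}"
    return f"{fact} ({r.note})" if r.note else fact


for r in readings:
    print(describe(r))
```

The first part of each printed line is the "fact about the world" reading; the parenthetical note, when present, belongs to the second reading.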
If we are very confident about the accuracy and relevance of the data, we can summarize and visualize it cleanly. We could show a line chart of temperature over time and start to draw conclusions about what the temperature trend means.
But if the accuracy and relevance are unknown, we need to take steps to better understand the metadata. How much data is there? Which parts are missing, or appear to be duplicated? Where did it come from? Which metrics are most relevant?
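Some of these questions can be answered mechanically before any chart is drawn. As a minimal sketch, assuming the table has been loaded as a list of Python tuples with `None` standing in for NULL, a small function (here called `profile`, a hypothetical name) can count rows, missing values per column, and exact duplicate rows:

```python
# Rows from the example table; None stands in for NULL.
columns = ("Date", "Temperature", "Humidity")
rows = [
    ("June 18", 92, 57),
    ("June 19", 95, None),
    ("June 20", 84, 51),
]


def profile(rows, columns):
    """Summarize the data itself: size, missing values, exact duplicates."""
    missing = {
        name: sum(1 for row in rows if row[i] is None)
        for i, name in enumerate(columns)
    }
    # Rows are tuples, so identical rows collapse to one entry in a set.
    duplicate_rows = len(rows) - len(set(rows))
    return {
        "row_count": len(rows),
        "missing": missing,
        "duplicate_rows": duplicate_rows,
    }


summary = profile(rows, columns)
print(summary)
```

A check like this says nothing about where the data came from or which metrics matter. Those questions still need a person who knows how the data was collected.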
Suppose the default behavior of a data analysis tool is to ingest your data and take you directly to a clean line chart. Is that convenient or misleading? Does that clean line chart imply that you are looking at truth, when in fact you may just be looking at data?
Can we assume that the line chart is about temperature, or should we emphasize that it shows data about temperature? What is the best way to communicate that distinction?