The Art of Data Budgeting
Maximizing Insights While Minimizing Data
In an era of rapidly advancing technology, where new instruments can collect massive amounts of data with little effort, effective data budgeting has never been more important. Data budgeting applies to both recorded and real-time data, a dual challenge that demands careful consideration.
Challenges with Data Budgeting
The advent of cutting-edge instruments has revolutionized data collection, providing unprecedented insights and decision-making capabilities. Arrays of sensors amplify these data quantities, offering greater resolution and improved event judgement in various domains, such as ocean monitoring.
However, the allure of vast data quantities comes at a price. The effort required to manage, process, and extract meaningful information from this abundance of data rises significantly. In some instances, this surge in effort becomes a daunting barrier, leading to project failures due to insufficient capacity to handle extensive datasets.
The Paradigm Shift: Budgeting Our Data
Just as we budget our finances and time because they are limited, we need a similar approach for data. Budgeting sets boundaries and uses a process that delivers the desired outcomes within those limits. The key is to design the data collection process by first defining the required results, for example as outlined in an Environmental Assessment, project plan, or proposal.
Collecting only what is necessary, along with enough raw data for result validation, establishes the minimum quantity of data to collect. The next step then involves transforming massive raw data into the minimum required to meet project specifications.
Imagine you have a project where you need to monitor a harbour area using 4 hydrophones to detect the presence and number of harbour porpoises. In addition, you’d like to capture vessel activity in the area for a detailed sound analysis at a later time. The catch? The duration of this project is 30 days. To help imagine how we’d solve this, I’ve shared three solution examples.
Solution 1: Collect all the Data at Once
Harbour porpoises are high-frequency communicators (roughly 130 kHz), so a sampling rate (the number of samples taken per second) of 512 kS/s (kilosamples per second) is chosen to record their clicks and calls, comfortably above the minimum of twice the signal frequency. Vessels, on the other hand, are generally much lower in frequency, below 10 kHz. A single recording at this rate therefore captures both harbour porpoise and vessel sounds at the same time.
Overall, this means that about 16 TB of raw data is collected across the four hydrophones over the full 30 days. However, processing 16 TB of raw data could take months.
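The 16 TB figure can be reproduced with a little arithmetic. The sketch below is only an illustration: it assumes 24-bit (3-byte) samples, which the scenario does not state but which is the sample size that matches the quoted total, and the function name raw_bytes is ours. It also checks that 512 kS/s exceeds twice the roughly 130 kHz porpoise frequency.

```python
SECONDS_PER_DAY = 86_400


def raw_bytes(channels, sample_rate_hz, bytes_per_sample, days):
    """Uncompressed size, in bytes, of a continuous multi-channel recording."""
    return channels * sample_rate_hz * bytes_per_sample * days * SECONDS_PER_DAY


# Hydrophone count, sampling rate, and duration come from the scenario.
# bytes_per_sample=3 (24-bit samples) is an assumption, not stated in the text.
total = raw_bytes(channels=4, sample_rate_hz=512_000, bytes_per_sample=3, days=30)
print(f"Continuous raw recording: {total / 1e12:.1f} TB")

# Sampling must exceed twice the highest frequency of interest.
porpoise_hz = 130_000
print("512 kS/s is above twice the porpoise frequency:", 512_000 > 2 * porpoise_hz)
```

With these inputs the result comes to roughly 15.9 TB, in line with the 16 TB quoted above. With 16-bit samples the same recording would be closer to 10.6 TB.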
Solution 2: Collect Raw Data for Vessels & Processed Spectral Data for Porpoises
The question is, do you need all that raw data from the porpoises? In fact, we just need to know whether or not these harbour porpoises are present. So, based on the required outcomes, we focus on collecting raw data for vessels and processed spectral data (a much more compact type of data) for porpoises, and the dataset is reduced by a factor of 16. The spectral data is now much more manageable, at 21 GB for the entire dataset.
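One way to read the factor of 16 is that the raw vessel recording runs at a sixteenth of the porpoise sampling rate. The sketch below works through that reading. The 32 kS/s vessel rate is inferred from the factor of 16 rather than stated in the text, the 24-bit sample size is again an assumption, and all names are illustrative.

```python
SECONDS_PER_DAY = 86_400
CHANNELS = 4                  # hydrophones, from the scenario
DAYS = 30                     # project duration, from the scenario
BYTES_PER_SAMPLE = 3          # assumption: 24-bit samples
FULL_RATE_HZ = 512_000        # Solution 1 sampling rate, from the text
VESSEL_RATE_HZ = FULL_RATE_HZ // 16   # assumption: 32 kS/s, inferred from the factor of 16
SPECTRAL_BYTES = 21e9         # porpoise spectral data, from the text


def continuous_bytes(rate_hz):
    """Bytes recorded continuously on all channels for the whole project."""
    return CHANNELS * rate_hz * BYTES_PER_SAMPLE * DAYS * SECONDS_PER_DAY


vessel_raw = continuous_bytes(VESSEL_RATE_HZ)
solution2_total = vessel_raw + SPECTRAL_BYTES

print(f"Vessel raw data:   {vessel_raw / 1e12:.2f} TB")
print(f"Porpoise spectral: {SPECTRAL_BYTES / 1e9:.0f} GB")
print(f"Solution 2 total:  {solution2_total / 1e12:.2f} TB")
print("Vessel rate above twice 10 kHz:", VESSEL_RATE_HZ > 2 * 10_000)
```

Under these assumptions the raw vessel recording is about 1 TB, so the 21 GB of spectral data adds only a few percent to the total.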
Solution 3: Process Porpoise Data and Record Only When Events Detected
In this example, the raw and spectral data are collected as in Solution 2. In addition, detectors for vessels and porpoises are enabled, so that data is recorded only while an event is detected.
Assume on average 30 vessels pass by per day, and each vessel event is 5 minutes.
Additionally, assume that porpoise episodes last 20 minutes on average and occur 5 times per day.
This reduces the total data required for the project to about 100 GB for 30 days. In addition, the date and time of each event are captured, further simplifying the data processing.
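The event-triggered budget can be sketched the same way. The snippet below is a rough estimate, not the instrument's actual accounting. It assumes raw vessel data at 32 kS/s with 24-bit samples (neither is stated in the text), spectral data that scales in proportion to recording time, events that never overlap, and no pre- or post-event buffer. All names are illustrative.

```python
SECONDS_PER_DAY = 86_400
CHANNELS = 4                   # hydrophones, from the scenario
DAYS = 30                      # project duration, from the scenario
BYTES_PER_SAMPLE = 3           # assumption: 24-bit samples
VESSEL_RATE_HZ = 32_000        # assumption: raw vessel sampling rate
SPECTRAL_FULL_BYTES = 21e9     # continuous spectral data for 30 days, from the text
FULL_RAW_BYTES = CHANNELS * 512_000 * BYTES_PER_SAMPLE * DAYS * SECONDS_PER_DAY

# Event statistics from the scenario.
vessel_seconds_per_day = 30 * 5 * 60      # 30 vessels, 5 minutes each
porpoise_seconds_per_day = 5 * 20 * 60    # 5 episodes, 20 minutes each

# Raw vessel data is stored only while a vessel event is active.
vessel_event_bytes = (
    CHANNELS * VESSEL_RATE_HZ * BYTES_PER_SAMPLE * vessel_seconds_per_day * DAYS
)

# Assumption: spectral data shrinks in proportion to the time spent recording.
porpoise_event_bytes = SPECTRAL_FULL_BYTES * porpoise_seconds_per_day / SECONDS_PER_DAY

solution3_total = vessel_event_bytes + porpoise_event_bytes
reduction = FULL_RAW_BYTES / solution3_total

print(f"Vessel events:   {vessel_event_bytes / 1e9:.1f} GB")
print(f"Porpoise events: {porpoise_event_bytes / 1e9:.1f} GB")
print(f"Solution 3 total: {solution3_total / 1e9:.0f} GB")
print(f"Reduction versus continuous raw recording: about {reduction:.0f}x")
```

Under these assumptions the total lands near 105 GB, consistent with the roughly 100 GB quoted above and about 150 times smaller than the continuous recording of Solution 1.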
To summarize, the quantity of data can be reduced by more than a factor of 100. In addition, the resulting data is nearly in its final form, ready to be reported or further analyzed.
Conclusion: The Strategic Approach to Data Budgeting
Data budgeting begins with a clear understanding of the project’s required outcomes. Working back from this point helps in deciding the best data collection and storage strategy. Using the instrument's available tools to collect only the necessary data not only reduces the quantity of data but also delivers it in a form closer to the final project result.
While there may be cases where collecting all raw waveform data is essential, these instances are becoming increasingly rare. As data collection and processing tools advance, the need to accumulate vast quantities of raw data is gradually becoming an artifact of the past. In embracing data budgeting, we pave the way for more manageable, insightful, and efficient projects in the data-driven landscape.