By Dr Tristan Jenkinson
Happy New Year and welcome to 2023 on the eDiscovery Channel!
Towards the end of last year, I saw a short post from Steve Nouri discussing survivorship bias. The principle is something that you may have seen discussed before (it regularly gets shared on sites such as LinkedIn). Nouri’s post shares a diagram of damage found on returning planes during World War II. During the war, it was initially thought that these diagrams showed the areas that needed to be reinforced. A mathematician, Abraham Wald, suggested the opposite. The fact that data was only collected from returning planes was key. He argued that the data demonstrated areas where damage could be taken and the planes could still return. The focus should therefore be on reinforcing other areas, as damage to those areas could be preventing the planes from returning.
Nouri’s post explains the above in terms of survivorship bias – “People are biased toward what they see… and easily discard what they do not see”.
This principle translates really well to the investigation space – sometimes the data that is not present is just as important as the data that is. This comes up regularly in investigations, both in eDiscovery matters and in digital forensic investigations.
In this series of short articles, I’ll discuss the importance of data that is not present, the importance of understanding what might be missing, and how this can be used in eDiscovery and digital forensic investigations. I’ll discuss topics from simple timeline analysis to missing data sources, approaches to OSINT and beyond.
One of the simplest topic areas linked to the idea of investigating missing data is the use of timeline analysis to identify potentially missing data. The concept is not a new one but can still be very powerful. The approach involves creating a visualisation of how frequently emails (or other communications) were sent or received.
These visualisations can be used to identify peaks and troughs in communications. The peaks may indicate some incident of specific interest, but for the purposes of this series of articles, we are more interested in the troughs. These indicate areas in the dataset where fewer (or even no) communications were sent or received. It can be helpful to confirm the employment history of custodians, and the details of any retention policies, so that the period over which data is expected to be available is known.
There could be many different reasons for such troughs. For example, a trough could represent a period when the employee took a holiday; it could be the result of an automated process, such as a retention policy automatically removing old data; or it could represent something more nefarious, such as data having been deleted or removed from the system being analysed – something that may itself require additional investigation.
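The trough-detection approach described above can be sketched in a few lines of code. The following is a minimal illustration, assuming message timestamps have already been extracted from the dataset; the dates and the `weekly_gaps` helper are hypothetical, and a real matter would use the custodian's actual message metadata and plot the weekly counts rather than just listing the gaps.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical sent dates for one custodian -- illustrative data only.
sent_dates = [
    date(2019, 3, 4), date(2019, 3, 5), date(2019, 3, 6),
    date(2019, 3, 11), date(2019, 3, 12),
    # (no activity in the week beginning 18 March)
    date(2019, 3, 25), date(2019, 3, 26), date(2019, 3, 28),
]

def weekly_gaps(dates):
    """Count messages per ISO week and return any weeks with no
    activity between the first and last observed dates."""
    week_of = lambda d: d.isocalendar()[:2]  # (ISO year, ISO week)
    counts = Counter(week_of(d) for d in dates)
    gaps = []
    d = min(dates)
    while d <= max(dates):
        if counts[week_of(d)] == 0:  # a trough: no messages that week
            gaps.append(week_of(d))
        d += timedelta(days=7)      # step one week at a time
    return counts, gaps

counts, gaps = weekly_gaps(sent_dates)
print(gaps)  # the quiet weeks that may warrant further investigation
```

Any weeks reported as gaps would then be checked against holidays, retention policies and employment dates before drawing conclusions.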
I was working on a regulatory response several years ago which involved a large volume of audio recordings for many custodians. I put together a timeline analysis and identified that, for a specific one-week period several years prior, there were no recordings for any of the custodians. I was able to quickly report this directly to the end client. They investigated and identified that there had been an error with the audio recording systems, and that they had retained communications with their support team relating to the original outage. The client was therefore able to report this directly to the regulators ahead of the relevant productions.
While not ideal, being able to report the issue to the regulators (and disclose contemporaneous emails about it) was far better than having the regulators discover the missing audio data themselves.
In part two, I discuss the importance of understanding the data sources in an investigation, and in particular the importance of considering data sources which appear to be missing. I’ll talk about some of the potential reasons why this can happen, and some methodologies for resolving such issues, where possible.