
By Dr Tristan Jenkinson
Introduction
In this series I am looking at the importance of data that is not present. In the previous two articles, I have looked at the use of timelines in identifying missing data, reasons that data from some sources may appear to be missing and considered potential steps to take or alternative locations that may contain that missing data. You can find part one here, and part two here.
In this article, I wanted to look at something a little different – a case study specifically focussed on missing file metadata. I cover the steps that were taken, what was found, and the potential dangers of misinterpretation.
This case study is based on a real-life litigation case that I worked on a number of years ago. Due to the forensic analysis performed, there is some technical content in this article, though I have tried to keep this at a high level so that the article remains accessible.
The Matter
Our client was involved in litigation, and had received a Microsoft Word document from opposing counsel via email. The file supported the opposition’s case, and they sought to rely upon it. Our client had concerns over the authenticity of the file, especially as it was disclosed late. Our client’s perspective was reinforced when they opened a copy of the file, and found it missing a lot of metadata that they would have expected to see, including the creation date.
The client asked me to preserve and investigate the file to find out if there was evidence that could identify if the file was authentic or not.
Initial Checks
Once a copy of the original email received (including the attachment) had been forensically preserved, I wanted to take a look and check what the client had seen.
As a “quick and dirty” test, I took a copy of the email, opened a copy of the Word document, and opened up the document properties. They showed what the client had described, with very little metadata present. For example, there were no values for:
- Title
- Tags
- Comments
- Last Modified Date
- Last Printed Date
- Author
- Last Modified By
There was a value present for the Created Date, reporting the date and time at which I opened the file. This would be expected if the value was not present when the file was opened, suggesting that this too was not actually present in the version of the file provided.
Looking Deeper
The particular file that I was analysing was a .docx file. This means that it was in Microsoft’s Office Open XML (OOXML) format. This is the format that Microsoft have been using since Office 2007, and it applies to Excel, Word and PowerPoint files (.xlsx, .docx and .pptx respectively). The format is well documented, and the standard can be found at (for example) https://www.ecma-international.org/publications-and-standards/standards/ecma-376/.
At the basic level, Microsoft’s OOXML files are really zip files containing a folder structure of content stored in .xml files. You can even change the file extension of a .docx Word document to .zip, and open that zip file to see the folder structure and embedded xml files.
There are typically two main files in the OOXML structure which contain the majority of the metadata: “app.xml” and “core.xml”. These sit in a folder called “docProps” (Document Properties) within the zip structure of an OOXML file. It is content from these two files that is displayed in the internal Microsoft Word Properties section.
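As a rough illustration of the structure described above, the following Python sketch lists the entries inside an OOXML package without renaming the file, using only the standard library `zipfile` module. The function names are my own, chosen for illustration, and any such analysis should of course be run on a working copy rather than on the preserved original.

```python
import zipfile


def list_package_parts(path_or_file):
    """Return the names of all entries inside an OOXML (zip) package."""
    with zipfile.ZipFile(path_or_file) as package:
        return package.namelist()


def doc_props_parts(path_or_file):
    """Return only the entries stored under the docProps/ folder."""
    return [
        name
        for name in list_package_parts(path_or_file)
        if name.startswith("docProps/")
    ]
```

For a typical Word document, `doc_props_parts` would be expected to return just `docProps/app.xml` and `docProps/core.xml`; anything else in that folder is worth a closer look.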
App.xml
The app.xml file typically contains statistical information about the file itself, for example the number of words, pages, characters etc. As the name suggests, it also contains information about the application (and the version number) that was used to generate the file.
For this file, the application value was set to “Microsoft Office Outlook”, which would not typically be expected. This will be discussed further later.
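A minimal sketch of how the application value could be read programmatically is shown below. It assumes the standard extended-properties namespace used by app.xml and only collects simple text values (such as `Application`, `AppVersion` or `Words`), skipping nested structures. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by docProps/app.xml (extended properties).
EXT_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def read_app_properties(path_or_file):
    """Return the simple text properties from docProps/app.xml as a dict."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read("docProps/app.xml"))

    props = {}
    for child in root:
        # Keep leaf elements only; entries such as TitlesOfParts contain
        # nested vectors and are skipped in this sketch.
        if child.tag.startswith(EXT_NS) and len(child) == 0:
            props[child.tag[len(EXT_NS):]] = child.text
    return props
```

On the file in this case, a lookup such as `read_app_properties(path)["Application"]` would have reported the unexpected “Microsoft Office Outlook” value.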
Core.xml
This is typically where metadata such as the Author, Last Modified By, and Creation/Last Modified dates and times are stored. The core.xml file in this instance was virtually blank, with no metadata content. This is consistent with what was seen within the Properties window when the file was opened.
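The check for an empty core.xml can be sketched along the following lines. The element names and namespaces are those defined for core properties in the OOXML standard; the mapping from the labels shown in Word’s Properties window to those elements, and the function names, are my own for the purposes of illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Label as shown to the user -> element in docProps/core.xml.
CORE_FIELDS = {
    "Title": "dc:title",
    "Tags": "cp:keywords",
    "Comments": "dc:description",
    "Author": "dc:creator",
    "Last Modified By": "cp:lastModifiedBy",
    "Last Printed Date": "cp:lastPrinted",
    "Created Date": "dcterms:created",
    "Last Modified Date": "dcterms:modified",
}


def read_core_properties(path_or_file):
    """Return each core property, or None where it is absent or empty."""
    with zipfile.ZipFile(path_or_file) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))

    result = {}
    for label, tag in CORE_FIELDS.items():
        element = root.find(tag, NS)
        text = element.text if element is not None else None
        result[label] = text.strip() if text and text.strip() else None
    return result


def missing_core_fields(path_or_file):
    """List the labels of the core properties that have no value."""
    values = read_core_properties(path_or_file)
    return [label for label, value in values.items() if value is None]
```

Reading the XML directly like this avoids the problem seen in the initial checks, where opening the document in Word caused a Created Date to be displayed that was not actually stored in the file.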
Custom Information
Usually when looking at a Microsoft Word document, the app.xml and core.xml files are the only content found within the docProps folder. This was not the case here though. Another file was present, titled “custom.xml”, suggesting that some custom content had been added to the file.
The custom.xml file contained three values, titled “MAIL_MSG_ID”, “RESPONSE_SENDER_NAME” and “EMAIL_OWNER_ADDRESS”. The content itself was binary information (i.e. not in a format meant to be comprehensible for users), and appeared to contain references to data stored in a separate system.
The information from the custom.xml file doesn’t really provide information directly (at least not information which can be read or understood easily by users), but it is again information that we will come back to.
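For completeness, a sketch of how the custom property names could be extracted is given below. It assumes the standard custom-properties layout, in which each `property` element carries a `name` attribute and a single typed child holding the value. The function name is hypothetical, and the sketch makes no attempt to decode the values themselves.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by docProps/custom.xml (custom properties).
CUSTOM_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/custom-properties}"


def read_custom_properties(path_or_file):
    """Return {name: (value_type, raw_text)} for each custom property.

    Returns an empty dict when the package has no docProps/custom.xml,
    which is the usual situation for a Word document.
    """
    with zipfile.ZipFile(path_or_file) as package:
        if "docProps/custom.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/custom.xml"))

    props = {}
    for prop in root.findall(CUSTOM_NS + "property"):
        if len(prop) == 0:
            props[prop.get("name")] = None
            continue
        value = prop[0]
        # Strip the namespace from the tag to leave the type, e.g. "lpwstr".
        props[prop.get("name")] = (value.tag.split("}")[-1], value.text)
    return props
```

Even when the values are opaque, the property names alone (as in this case) can point towards the system that added them.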
Internal Zip Structure Dates
As noted above, Microsoft Office Open XML files such as Microsoft Word documents are actually zip files containing folders and .xml files. Because of this structure, as with the content of typical zip files, files (and folders) may have associated created or modified dates.
[Two images: the internal zip file listing of a Word document, showing the dates recorded against the files and folders]
When Microsoft Word creates documents (and other file types) these dates within the zip file structure are typically set to a default value, representing 1 January 1980 at midnight (as seen above). It is worth noting that if content was changed manually, or by alternative tools, then the creation and/or modified dates may change to reflect the system time when the modification was made. I have seen some crude attempts to manipulate Word document properties using this method.
As a side note, it is worth pointing out that, in addition to the dates above, Microsoft Office writes its files in a specific way using the zip format. This means that information within the zip file structure can be analysed to potentially identify whether a file may have been manipulated.
The tool Zinky on GitHub (https://github.com/4144414D/zinky) is a great example of how some of these artefacts can be looked at purely within the zip format structure when considering potentially falsified files. There is much more that can be considered, but that is not something that I’ll cover today.
The file that I was analysing had valid dates and times for the files within the zip file structure (not the 1 January 1980 date that we see above). These dates and times were all the same, and were set a few days before the email was dated.
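The zip entry dates can also be pulled out with a few lines of Python. The sketch below is illustrative only: the function names are my own, and I have assumed that an all-zero date field (which Python reports with a month and day of 0) should be treated in the same way as the 1 January 1980 default. Note that dates stored in this part of the zip structure carry no time zone and have a resolution of two seconds.

```python
import zipfile

# Values treated as "default" in this sketch: 1 January 1980 at midnight,
# plus the all-zero DOS date, which Python reports with month and day 0.
DEFAULT_DATES = {(1980, 1, 1, 0, 0, 0), (1980, 0, 0, 0, 0, 0)}


def internal_timestamps(path_or_file):
    """Return {entry name: (year, month, day, hour, minute, second)}."""
    with zipfile.ZipFile(path_or_file) as package:
        return {info.filename: info.date_time for info in package.infolist()}


def summarise_timestamps(path_or_file):
    """Summarise whether the entries carry default or real-looking dates."""
    stamps = internal_timestamps(path_or_file)
    non_default = {
        name: stamp for name, stamp in stamps.items() if stamp not in DEFAULT_DATES
    }
    return {
        "all_default": not non_default,
        "distinct_non_default": sorted(set(non_default.values())),
        "non_default_entries": sorted(non_default),
    }
```

A single entry in `distinct_non_default`, as was the case here, suggests that every part of the package was rewritten in one operation, whereas several different values may be more consistent with piecemeal manual edits.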
Removing Metadata
The client in this case already suspected foul play and was concerned that the metadata had been deliberately removed because the document was not created when the other side claimed that it was.
Before I discuss what I found in my investigation, I wanted to talk about how someone might go about removing metadata from a Word document (maliciously, or otherwise). There are many ways that this could be accomplished; I cover some examples below.
Using Word – Inspect Document
This is likely the method that most are aware of. Microsoft Word itself has an inbuilt function to remove information: using the File menu, you can select the Inspect Document option. Part of the inspection is to look at Document Properties and Personal Information and flag that these are present, giving an option for the data to be removed. The feature is designed for removing personal data from documents that may be passed on. It is worth noting that this will typically remove the Author and Last Modified By information, but will typically leave Created and Last Modified dates present.
Using Windows Explorer
Windows Explorer offers a similar option. By right clicking on a file and viewing the properties, you can then use the Details tab to find an option to remove properties and personal information. This option can then be used to remove specific selected metadata, or can be used to create a new copy of the file with as much metadata removed as possible.
Similar to the above, this method will typically leave Created and Last Modified dates embedded within the document.
Using PowerShell
PowerShell is Microsoft’s command line shell and scripting language, which can be used to execute various system commands. This includes the ability to analyse (and edit) item properties.
I won’t go into the details here, but it is possible to use PowerShell to update internal metadata values of Office files.
Using Custom Applications
There are also various applications (both malicious and otherwise) which can be used to update properties of files such as Microsoft Word documents.
Manual Manipulation
As noted above when discussing the zip format of Microsoft Office files, it is possible to change the file’s extension to .zip, then expand it, change the details in the .xml files containing the document metadata, then re-zip the files, and finally change the file extension back. This does typically leave a number of tell-tale signs, which can be seen when the file is analysed, for example by looking at the dates and times of the files within the zip file structure, or at other artefacts which can be examined using tools like Zinky.
Putting it all together
There are a number of points here that we need to pull together to work out what is most likely to have happened.
The valid dates appearing in the zip structure of the file suggest that the file was manipulated either manually, or using a process outside of the Microsoft Office suite itself. The complete absence of any creation or modified date supports this: changes made within Word itself (or using Windows Explorer) would leave these dates populated, whereas they are empty in this case.
We appear to therefore be looking at some external method being used to remove metadata from the file itself.
Next consider the information from the app.xml file, and the custom.xml file. The app.xml listed Microsoft Office Outlook as the application associated with the creation of the file. This is unusual, as for a Microsoft Word document, you would typically expect this to be set to Microsoft Word. The custom.xml also contained information which appeared to relate to email data (“MAIL_MSG_ID”, “RESPONSE_SENDER_NAME” and “EMAIL_OWNER_ADDRESS”). Together, these suggest that it was likely to have been an email operation which modified the file.
There are a number of reasons why email systems may actually be manipulating attachments in this way. In particular, law firms often use automated tools to remove metadata, and have done for many years. In the US, this is often seen as a basic requirement, for example in Arizona the State Bar of Arizona issued Ethics Opinion 07-03 stating (my emphasis added):
“A lawyer who authors and sends an electronic document to someone other than the client on whose behalf the document was drafted, or other privileged persons, is responsible, under ER 1.6, for first scrubbing the document of confidential metadata that may be contained within the electronic file”
So, could that have been the case here?
Looking back at the zip file structure, recall that there were legitimate dates and times present, and that these dates and times were a few days prior to the email itself.
Scrolling back through the email, it was possible to see the communications earlier in the thread. Specifically, there was an email sent at the same date and time as that seen as the modification date in the zip file structure (within a second). It therefore seems likely, given the email references in the file, that it was modified as part of the sending of this email.
This email was from a third party law firm, attaching a copy of the file in question to the opposing law firm.
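The comparison between the zip modification date and the time the email was sent can be sketched as follows. This is a simplified illustration rather than the process used in the case: the function name and the default tolerance are my own, and it assumes that the tool which rewrote the attachment recorded local time in the same time zone as the offset in the email’s Date header. Zip dates carry no time zone and are only stored to the nearest two seconds, so a small tolerance is needed.

```python
from datetime import datetime, timedelta
from email.utils import parsedate_to_datetime


def zip_stamp_matches_email(zip_date_time, email_date_header, tolerance_seconds=2):
    """Compare a zip entry date with the sent time from an email Date header.

    zip_date_time is the six-value tuple stored for a zip entry, for example
    (2015, 3, 4, 10, 20, 30). The email time is reduced to the sender's local
    wall-clock time by dropping the UTC offset (an assumption, see above).
    """
    zip_local = datetime(*zip_date_time)
    sent = parsedate_to_datetime(email_date_header)
    sent_local = sent.replace(tzinfo=None)
    return abs(zip_local - sent_local) <= timedelta(seconds=tolerance_seconds)
```

If the time zones are not known to match, the safer approach is to report the raw values side by side and consider each plausible offset, rather than rely on a single true or false result.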
Going back through the thread, it appears that this version of the file was simply forwarded several times until the version of the email that was sent to our client. Further down the thread, you can see there was an email prior in which the third party law firm received their copy of the file.
The evidence strongly suggests that the metadata was scrubbed automatically when the email attaching a copy of the file was sent by the third party law firm, and the attachment was modified at this point. This scrubbed version of the file was then forwarded to our client. It also seemed likely that the original metadata may therefore be present in the copy of the file that the third party law firm originally received.
I reported these findings to the client, and suggested that they may want to consider approaching the third party law firm (directly or through opposing counsel) to request that they supply a copy of the file as they received it (prior to the potential metadata scrubbing). Preferably, this copy would not be supplied via direct email (as this would likely scrub the metadata again), or would be collected by a forensic professional.
The client agreed, and a copy of the file prior to the data scrubbing was provided. This file showed no signs of interference or manipulation, and the client accepted its authenticity.
Conclusion
As above, the evidence found suggested that the metadata missing from the file was scrubbed automatically by an email system when the file was sent by the third party law firm.
It is important to remember that where metadata has been altered, removed, or scrubbed from a file, this does not necessarily mean that such changes have been performed with malicious intent.
In this case, if only some of the information in the analysis had been considered, the client may have received information that would support their suspicions – that the file had been manipulated by the opposing party to hide details of its creation. This could have led to incorrect accusations regarding the changes made to the file. These inaccurate accusations could have been used against our client in this case.
Instead, it was possible to identify the most likely cause of the missing file metadata. Subsequently it was possible to request (and receive) a copy predating the alterations. This meant that our client and opposing counsel were able to get a better view of the relevant facts in the case.
While there was evidence that could have been used to support the client’s suspicions that the file was modified maliciously, this was not the whole story. This really does highlight the importance of thoroughly considering the evidence in a case, and ensuring that digital forensic analysis is performed in an independent fashion. This is particularly important when providing expert analysis, where it cannot be emphasized enough that the expert’s duty is to the court, not to their client.
Coming Up
I’ll be continuing this series on the investigative importance of data that is not present by looking at deleted data, and also talking about anti-forensics.