By Trent Brunson, Ph.D.
Associate Division Chief - Threat Intelligence and Analytics, Georgia Tech Research Institute
Spam and phishing deliver more malicious software into business networks than any other source. To combat them, businesses have long used “file typing” to scan incoming email, a relatively easy and inexpensive security technique. Because every file has a signature, or “magic number,” it can be scanned to identify which file type (PDF or JPG, for example) should be associated with that number. When network administrators set email attachment policies, magic numbers are often working behind the scenes to enforce those rules.
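As a minimal sketch of what magic-number typing looks like in practice, the Python snippet below matches the first bytes of a file against a tiny signature table and applies a blocklist. The table, the function names, and the policy are illustrative assumptions, not any particular product's implementation; real file typers carry far larger signature databases.

```python
# Illustrative signature table (an assumption for this sketch).
MAGIC_NUMBERS = {
    b"%PDF": "pdf",
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}

# Hypothetical attachment policy: block zip archives.
BLOCKED_TYPES = {"zip"}


def type_by_magic(data: bytes) -> str:
    """Return the type whose signature matches the start of data."""
    for signature, file_type in MAGIC_NUMBERS.items():
        if data.startswith(signature):
            return file_type
    return "unknown"


def attachment_allowed(data: bytes) -> bool:
    """Apply the blocklist to the type reported by the magic number."""
    return type_by_magic(data) not in BLOCKED_TYPES
```

Note that this check only ever reads the first few bytes of the file and returns a single answer.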
But think about how someone might bypass a network’s email attachment policy. Suppose a file with a blacklisted file type is inserted into a PowerPoint presentation and then attached to an email. Then what? A scan will overlook the embedded content and not raise any security concerns. This is a trivial way to sneak files onto a network, but data hiding becomes far more sophisticated with “polyglot files.” These files appear to have only one purpose, like a PDF, but they are actually valid as multiple file types simultaneously and exhibit different behaviors depending on the application opening them.
Today’s information security systems must go a step further in providing the right level of visibility into incoming files because there can be more than one correct answer to a file’s true type. File typing is an old, often overlooked problem in security and the defense of organizational networks. Yet, file typing with magic numbers is a simple and computationally inexpensive security defense.
Here are three ways information technology teams can elevate their file typing methods using static analysis, which is often preferred over dynamic analysis because it is safer, faster, and requires less overhead.
At the very least, modern file typing applications should try to unzip all files to see if more content exists. People can play clever tricks with magic numbers to create PDFs that, when unzipped, reveal hidden files or show different content based on the application opening the file. In fact, the structure of Microsoft Office files supports this notion. Office files, like the one in which I typed this post, are effectively zip files. In the figure below, I’ve unzipped my Word file and shown the file layout with the tree command.
So even though a Java .jar file cannot be inserted as a media attachment with the Word application, the file can be extracted and added to the media/ folder in the document’s directory tree and rezipped. Unless a file typer unzips all the files it encounters, it will fail to notice data hidden inside a .docx file.
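A file typer that unzips what it encounters could work along the following lines. This is a hedged sketch using Python's standard zipfile module; the function name list_embedded and the depth limit are assumptions made for illustration, and a production tool would also cap the decompressed size to guard against zip bombs.

```python
import io
import zipfile


def list_embedded(data: bytes, prefix: str = "", depth: int = 0,
                  max_depth: int = 3):
    """Yield (path, first_bytes) for each member of a zip-based container,
    descending into nested archives up to max_depth levels."""
    if depth > max_depth or not zipfile.is_zipfile(io.BytesIO(data)):
        return
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            member = archive.read(info)
            path = prefix + info.filename
            # The leading bytes are what a magic-number check would inspect.
            yield path, member[:8]
            # A .jar is itself a zip, so recurse into members as well.
            yield from list_embedded(member, path + "!/", depth + 1, max_depth)
```

Calling list_embedded on the bytes of a .docx file would report every member, including anything placed in the media/ folder, so each one can be typed on its own rather than inheriting the container's type.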
Many file formats do not have a specific location where magic numbers must reside. This allows anyone to hide data from an application by putting it before the magic number. Here’s an example from the article “This PDF is a Git Repository Containing its Own Latex Source and Copy of Itself” by Evan Sultanik from the 15th issue of the journal POC||GTFO. This file opens as a normal PDF document.
The magic number offset for PDFs is supposed to be zero, but when we search this file for the PDF magic number (25-50-44-46, the ASCII bytes “%PDF”), we find it at byte offset 944 (0x3b0).
This example easily demonstrates the need for modern file typers to look beyond the expected offsets for magic numbers.
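Looking beyond the expected offset can be as simple as searching a window of the file for the signature instead of only comparing the first bytes. The sketch below is an assumption-laden illustration: the function name, the 4096-byte search limit, and the synthetic sample (a stand-in for the real file, which is not reproduced here) are all hypothetical.

```python
PDF_MAGIC = bytes.fromhex("25504446")  # the bytes 25-50-44-46, "%PDF"


def find_magic_offsets(data: bytes, magic: bytes = PDF_MAGIC,
                       limit: int = 4096) -> list:
    """Return every offset below limit at which magic occurs in data."""
    offsets = []
    start = data.find(magic)
    while start != -1 and start < limit:
        offsets.append(start)
        start = data.find(magic, start + 1)
    return offsets


# Synthetic stand-in: 944 bytes of padding followed by a PDF header.
sample = b"\x00" * 944 + b"%PDF-1.5\n"
print(find_magic_offsets(sample))  # [944]
```

Any nonzero offset means there is data in front of the header that the PDF reader ignores, and that leading region deserves its own typing pass.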
Machine learning techniques can be used to evaluate every measurable dimension of a file and group files according to their format. An open-sourced implementation of this idea is the “Sceadan” file typer created at The University of Texas at San Antonio. This application classifies file types based on bi-gram entropy, Hamming weight, mean byte value, and several other statistics taken from files.
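To make those statistics concrete, here is one plausible way to compute three of them in Python. This is not Sceadan's code; the function name and the exact definitions (Hamming weight as the fraction of set bits, bi-gram entropy as Shannon entropy over overlapping byte pairs) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def byte_features(data: bytes) -> dict:
    """Compute a few byte-level statistics usable as classifier features."""
    if len(data) < 2:
        raise ValueError("need at least two bytes to form a bi-gram")
    n = len(data)
    # Mean byte value, in the range 0..255.
    mean_byte = sum(data) / n
    # Hamming weight: fraction of all bits in the file that are set to 1.
    hamming_weight = sum(bin(b).count("1") for b in data) / (8 * n)
    # Shannon entropy (in bits) of the overlapping byte-pair distribution.
    bigrams = Counter(zip(data, data[1:]))
    total = n - 1
    bigram_entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in bigrams.values()
    )
    return {
        "mean_byte": mean_byte,
        "hamming_weight": hamming_weight,
        "bigram_entropy": bigram_entropy,
    }
```

A feature vector like this, computed per file or per block, is the kind of input a classifier would be trained on.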
Machine learning models give answers to file-typing problems as probabilities, which for polyglot files is a more insightful answer. As these techniques become more refined, additional measurements and relevant correlations will emerge as indicators for file types.
Although file typing is an old problem, it remains the first line of defense in protecting organizations against both external and insider threats. Here, I have discussed some of the uncertainties that can arise when identifying the format of a file as well as some of the research approaches to the problem. Magic number file typing is not going away any time soon, but advanced methods can be applied to give more transparency to users and network administrators. When it comes to downloading and executing unknown or untrusted files, nobody likes surprises.
Trent Brunson is the Associate Division Chief of the Threat Intelligence and Analytics Division in the Cybersecurity, Information Protection, and Hardware Evaluation Research (CIPHER) Laboratory at the Georgia Tech Research Institute (GTRI). His research interests are statistical modeling, programming, data analysis and visualization, security research, technical writing, and code management. Before joining GTRI, he performed security research for the Air Force Research Laboratory (AFRL), the Air Force Office of Scientific Research (AFoSR), and the Defense Advanced Research Projects Agency (DARPA). Brunson received his Ph.D. in computational physics from Emory University in Atlanta.