Sometimes, a government agency will post a PDF that doesn’t contain searchable text. Most often, it’s a scan of a printout. Why? Don’t the NSA, the Department of Justice, etc., know how to convert Word (or whatever) directly to PDF? It turns out that they know more than some of their critics do. The reason? With a piece of paper, you know much more about what you’re actually disclosing.
It’s tempting to think of a PDF file as a simple image of a page, or maybe a simple page image with (somehow!) embedded text that you can search for. In fact, PDFs are far more complex than that. A PDF file (or more or less any modern document file) is a container that can hold many different types of things: text, images, font definitions, JavaScript programs (yes, you can embed JavaScript in PDF), and much more. If you release a PDF produced by a text formatter, do you really know what you’re releasing?
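To make the "container" point concrete, here is a minimal Python sketch that counts a few PDF name tokens in a file's raw bytes. The function name `count_pdf_names` and the list of tokens are my own illustrative choices, not part of any standard tool, and the list is far from exhaustive. It only sees what is stored uncompressed: anything inside a compressed object stream, or a name written with hex escapes, will not show up, so a clean result proves nothing.

```python
import re

# PDF name tokens that often signal content beyond the visible page.
# Illustrative list only (an assumption of this sketch), not exhaustive.
SUSPECT_NAMES = [
    b"/JavaScript", b"/JS", b"/EmbeddedFile",
    b"/Metadata", b"/Info", b"/OpenAction", b"/AcroForm",
]

def count_pdf_names(data: bytes) -> dict:
    """Count occurrences of each suspect name token in raw PDF bytes."""
    counts = {}
    for name in SUSPECT_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # name, so that /JS does not match a longer name such as /JSFoo.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode("ascii")] = len(re.findall(pattern, data))
    return counts

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        for name, n in count_pdf_names(f.read()).items():
            print(f"{name}: {n}")
```

Run it as `python scan.py some.pdf` (the script name is arbitrary). Nonzero counts are a prompt to look closer, not a verdict.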
It may be possible to strip all of the metadata safely. The NSA, in fact, has a guide on how to do it. (N.B. You’ll get a certificate error: many US government agencies have certificates from a US government-specific certificate authority, and outside browsers do not trust it by default. If you do not want to click through the warning messages (if you even can), I’ve created a mirror of it. And that’s legal: by law, US government-created documents are in the public domain.) But the complexity is worrisome—and the list of things that “Sanitize Document” can delete (page 10) is quite amazing. (Sanitizing Word is harder.)
So why is this an issue? Well, people still get it wrong. And it’s not a new problem; Bruce Schneier wrote about it years ago and said it was barely newsworthy then. Even, yes, Federal prosecutors can get it wrong.
Printing things onto paper and scanning them is ugly and not as functional, but it does prevent this sort of error.
And there are two more subtle points. First, sensitive networks are often air-gapped from the Internet. Air-gapping—having no physical connection whatsoever to the outside world—is a strong defense, though far from perfect. Getting a PDF file from an air-gapped network to the Internet can be done, but it’s painstaking and—if done incorrectly—can expose the sensitive network to attack from the outside. Again, we know how to do this—follow NSA procedures on the sensitive network, burn a CD-R (not a CD-RW) with just the PDF, and carry that to an outside machine—but there’s still the chance for human error. And there’s one more threat…
What is really in a PDF, and how do you know? Is it just what you see on the screen? Even apart from malice or stupidity, e.g., setting the font color to white, there’s a hidden danger: what did the PDF creation or redaction program actually write out? Remember that PDFs are containers; there can be nominally empty sections of the file. What fills those bytes? How do you know, and what is your assurance?
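One related and easily checked case: the PDF format allows incremental updates, where a program appends a new revision to the end of the file and leaves the older bytes in place. Each revision normally ends with its own `%%EOF` marker. The sketch below (the function name `eof_report` is hypothetical) counts those markers and measures any non-whitespace bytes after the last one. It is a crude heuristic under those assumptions, not a sanitization check, and it says nothing about what sits inside the objects themselves.

```python
from typing import Tuple

def eof_report(data: bytes) -> Tuple[int, int]:
    """Return (number of %%EOF markers, non-whitespace bytes after the last).

    More than one marker suggests the file may carry earlier revisions;
    trailing bytes after the final marker are data no viewer will display.
    Returns (0, 0) if no marker is found at all.
    """
    marker = b"%%EOF"
    count = data.count(marker)
    last = data.rfind(marker)
    if last == -1:
        return 0, 0
    trailing = data[last + len(marker):]
    # Ignore the newline(s) that normally follow the marker.
    return count, len(trailing.strip())

if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        markers, extra = eof_report(f.read())
    print(f"%%EOF markers: {markers}, trailing bytes: {extra}")
```

A result of one marker and zero trailing bytes is what a freshly written file usually looks like; anything else is worth a second look before release.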
Many years ago, while I was at AT&T, I was working on an important internal project. Someone sent out a Word document with some very sensitive details. Unlike everyone else on the project, I was running an open source OS instead of Windows, so I couldn’t just fire up Word. Instead, I used an open source tool to view the file—and I saw something different. The person who created the file had two documents open in Word, and what was nominally empty space was filled with whatever garbage was lying around RAM at the time: in this case the body of an unrelated letter he was sending to someone outside the company. The tool I used to view the file wasn’t perfect, so it printed the wrong part of the Word document. The odds are high, of course, that the recipient of that letter received some of our project plans, but if that person did the usual—run Windows and Word—it would never appear, and our corporate secrets would be safe.
The NSA and the Department of Justice, of course, have serious adversaries, ones who won’t take a file at face value. Unless you have a lot of confidence in the PDF redaction program, you’re much better off scanning a printed version. Sure, there are still some risks, e.g., steganography based on kerning or the like, but they’re much less than with a PDF.
So: DoJ has its reasons for sending out these difficult-to-use PDFs. You may not like it—I don’t like it—but they’re doing it out of caution, not ignorance or stupidity.
Interesting article, thanks.
I tried to get to your mirror of the article, but it points to somewhere on CircleID and goes 404.
Link fixed. Thank you for pointing out the issue.
I just export to formats that don’t contain those problems. Plain text, for instance, or Rich Text Format (the real DEC format, not Microsoft’s). This makes a lot of problems with metadata vanish (along with the problematic metadata).
Good piece, Steve. It does make a certain amount of sense from a security perspective, even if it is not as good from a user experience perspective.