Why Government Agencies Use Ugly, Difficult to Use Scanned PDFs - There’s More Than Meets the Eye

By Steven Bellovin, Professor of Computer Science at Columbia University
July 20, 2018

Sometimes, a government agency will post a PDF that doesn’t contain searchable text. Most often, it’s a scan of a printout. Why? Don’t the NSA, the Department of Justice, etc., know how to convert Word (or whatever) directly to PDF? It turns out that they know more than some of their critics do. The reason? With a piece of paper, you know much more about what you’re actually disclosing.

It’s tempting to think of a PDF file as a simple image of a page, or maybe a simple page image with (somehow!) embedded text that you can search for. In fact, PDFs are far more complex than that. A PDF file (or more or less any modern document file) is a container that can hold many different types of things: text, images, font definitions, JavaScript programs (yes, you can embed JavaScript in PDF), and much more. If you release a PDF produced by a text formatter, do you really know what you’re releasing?
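
To make the "container" point concrete, here is a rough Python sketch that counts a few PDF name tokens in a file's raw bytes. The marker list and the function name `count_markers` are my own illustrative choices, not part of any official tool or of the NSA guide. It is only a first look: names inside compressed object streams, or names written with hex escapes, will not be found, so a count of zero proves nothing.

```python
import re

# PDF name tokens that often signal more than visible page content.
# Illustrative and incomplete; this list is an assumption, not a standard.
MARKERS = [
    b"/JavaScript",
    b"/JS",
    b"/EmbeddedFile",
    b"/Metadata",
    b"/Info",
    b"/Font",
    b"/AcroForm",
    b"/OpenAction",
]


def count_markers(data: bytes) -> dict:
    """Count occurrences of each marker name in the raw bytes of a PDF."""
    counts = {}
    for marker in MARKERS:
        # The lookahead requires the name to end here, so that /Font
        # does not also match /FontDescriptor.
        pattern = re.escape(marker) + rb"(?![A-Za-z0-9])"
        counts[marker.decode("ascii")] = len(re.findall(pattern, data))
    return counts
```

To try it on a real file, read the file in binary mode and pass the bytes in, for example `count_markers(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder name. Any nonzero count for JavaScript, embedded files, or metadata is a prompt to look closer, not a verdict.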

It may be possible to strip all of the metadata safely. The NSA, in fact, has a guide on how to do it. (N.B. You’ll get a certificate error: many US government agencies have certificates from a US government-specific certificate authority, and outside browsers do not trust it by default. If you do not want to click through the warning messages (if you even can), I’ve created a mirror of it. And that’s legal: by law, US government-created documents are in the public domain.) But the complexity is worrisome—and the list of things that “Sanitize Document” can delete (page 10) is quite amazing. (Sanitizing Word is harder.)

So why is this an issue? Well, people still get it wrong. And it’s not a new problem; Bruce Schneier wrote about it years ago and said it was barely newsworthy then. Even, yes, Federal prosecutors can get it wrong.

Printing a document onto paper and scanning it is ugly and less functional, but it does prevent this sort of error.

And there are two more subtle points. First, sensitive networks are often air-gapped from the Internet. Air-gapping—having no physical connection whatsoever to the outside world—is a strong defense, though far from perfect. Getting a PDF file from an air-gapped network to the Internet can be done, but it’s painstaking and—if done incorrectly—can expose the sensitive network to attack from the outside. Again, we know how to do this—follow NSA procedures on the sensitive network, burn a CD-R (not a CD-RW) with just the PDF, and carry that to an outside machine—but there’s still the chance for human error. And there’s one more threat…

What is really in a PDF, and how do you know? Is it just what you see on the screen? Even apart from malice or stupidity, e.g., setting the font color to white, there’s a hidden danger: what did the PDF creation or redaction program actually write out? Remember that PDFs are containers; there can be nominally empty sections of the file. What fills those bytes? How do you know, and what is your assurance?
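
One small, checkable instance of "what did the program actually write out?": the PDF format allows changes to be appended to the end of a file rather than rewriting it, so an earlier version of the content can remain in the bytes. The Python sketch below counts `%%EOF` markers and measures any non-whitespace bytes after the last one. The function name `revision_report` is hypothetical, and the check is a heuristic only: some optimized files legitimately contain more than one marker, and a single marker does not mean the file is clean.

```python
def revision_report(data: bytes) -> dict:
    """Report end-of-file markers and leftover bytes in raw PDF data."""
    eof = b"%%EOF"
    last = data.rfind(eof)
    # Everything after the final marker; empty if there is no marker at all.
    trailing = data[last + len(eof):] if last != -1 else b""
    return {
        # More than one marker may mean appended updates that leave
        # older content in the file.
        "eof_markers": data.count(eof),
        # Non-whitespace bytes after the final marker are unexplained data.
        "trailing_bytes": len(trailing.strip()),
    }
```

A result such as two markers on a file that was supposedly redacted is a reason to inspect the earlier portion of the file by hand before release.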

Many years ago, while I was at AT&T, I was working on an important internal project. Someone sent out a Word document with some very sensitive details. Unlike everyone else on the project, I was running an open source OS instead of Windows, so I couldn’t just fire up Word. Instead, I used an open source tool to view the file—and I saw something different. The person who created the file had two documents open in Word, and what was nominally empty space was filled with whatever garbage was lying around RAM at the time: in this case the body of an unrelated letter he was sending to someone outside the company. The tool I used to view the file wasn’t perfect, so it printed the wrong part of the Word document. The odds are high, of course, that the recipient of that letter received some of our project plans, but if that person did the usual—run Windows and Word—it would never appear, and our corporate secrets would be safe.

The NSA and the Department of Justice, of course, have serious adversaries, ones who won’t take a file at face value. Unless you have a lot of confidence in the PDF redaction program, you’re much better off scanning a printed version. Sure, there are still some risks, e.g., steganography based on kerning or the like, but they’re much smaller than with a PDF generated directly from the source document.

So: DoJ has its reasons for sending out these difficult-to-use PDFs. You may not like it—I don’t like it—but they’re doing it out of caution, not ignorance or stupidity.

By Steven Bellovin, Professor of Computer Science at Columbia University

Bellovin is the co-author of Firewalls and Internet Security: Repelling the Wily Hacker, and holds several patents on cryptographic and network protocols. He has served on many National Research Council study committees, including those on information systems trustworthiness, the privacy implications of authentication technologies, and cybersecurity research needs.

Comments

Thanks Robert Martin-Legene – Jul 20, 2018 8:38 PM

Interesting article, thanks.

I tried to get to your mirror of the article, but it points to somewhere on CircleID and goes 404.

Re: Thanks Ali Farshchian – Jul 20, 2018 9:18 PM

Link fixed. Thank you for pointing out the issue.

Todd Knarr – Jul 23, 2018 1:33 AM

I just export to formats that don’t contain those problems. Plain text, for instance, or Rich Text Format (the real DEC format, not Microsoft’s). This makes a lot of problems with metadata vanish (along with the problematic metadata).

Dan York – Jul 26, 2018 10:18 AM

Good piece, Steve. It does make a certain amount of sense from a security perspective, even if it is not as good from a user experience perspective.
