This is an initial post about mods to PDFBox to allow XFA form filling on modern AES encrypted PDF forms, so that they still load into Acrobat Reader, and do not get the dreaded message informing you that the document has been modified and the Reader (form filling) extensions no longer work.
I imagine that PDF toolkits have a very limited audience. So this first post isn’t about the changes to PDFBox. Well, maybe just a little. It’s more about why this matters. And we’ll also cover XFA, form filling in Acrobat, and IText.
Part deux – quite soon I hope – will discuss the implementation. But if you read these posts with the expectation of a quick cut and paste solution, you will be disappointed. No PDFBox solution is quick. I would be doing you a disservice by pretending that the Apache project is the way to go for a quickie. You need an out-of-the-box solution, and I recommend IText. But if you want a solid starting point for your own PDFBox project, I hope you will find my comments helpful. Also, you won’t learn much PDF by studying PDFBox, but the more you understand about PDF, the more you will understand the software.
This is a longish blog post. I’m writing it after finally “succeeding” with the Apache PDFBox project. It’s a pretty raw success, and I have no idea when, how, or if any of it will get back into the trunk, so I’ll be posting detailed notes and breadcrumbs for others who don’t live a double life and want to become coding superheroes.
And the spoiler is that IText 5 (the one with the commercial license and/or the GNU Affero “community license”) was the only out-of-the-box solution that worked. Even the IText 4.2 snapshot, with the earlier licensing scheme, didn’t work. I was very impressed by the quality of IText and would have signed up with them and not gone down this road if they had a “starter/incubator” licensing scheme. But I’m getting ahead of the story.
PDFBox is now an Apache project. It is an 80% project. It does about 80% of the things you need to do with a PDF file, and for the other things, it gets you about 80% of the way there. It has bugs, some silly, others subtle. It has a learning curve – because the software models the PDF architecture (rather than procedurally doing tasks with PDF) you have to understand a lot about PDF to make a small change. And because it does model the PDF spec, it is “self documenting” – if you can read the spec. But there is very little other documentation, and not many examples. And there are holes. And when it comes to XFA form filling, these holes are very deep. I am not associated with the Apache group, and I don’t know anything about its current status or future plans.
PDFBox does what it does.
I’ve used PDFBox for a number of years, mostly for text extraction. It made life easy recently when I had to take the 1000+ page PDFs at various stages of the progress of the Senate’s S-744 Comprehensive Immigration Reform legislation and get them into a state where I could track the changes and maintain “authorship” of sections that started as amendments from individual senators and were eventually merged, “seamlessly” and without attribution, into the body of the text. All that is comfortably within the 80%.
Although I’m never quite sure which of my jobs is my day job, my non-programming life is as an immigration attorney. I work with undocumented people and other low to middle income individuals and families who are trying to survive our immigration system. And over the holidays, I had some ideas about making the whole form-filling and evidence gathering process dramatically easier.
All went well with this project until the “final step” – taking all the information and plugging it into the government forms.
Form Filling with Acrobat Pro
The “simplest” solution was to use Acrobat Pro. I’ll deal with “format” issues in a moment, but if you know what format it expects for its form “import data” feature, you can fill forms. Unfortunately, the forms I deal with – the I-130, I-485, N-400 and all the others that will be very familiar to every first-generation American and their immediate family – are of mixed quality. No, that’s not fair. They are intended for people to fill in by hand, and are not designed to make life easier for me.
These forms don’t make automated form filling easy. They are AES encrypted, and they allow form filling and saving in the latest Acrobat Reader – but assume that they will be filled in by hand. So (as they come from the USCIS site) some can be filled in completely using XFA, but most can only be partially completed. Some of the fields are configured so they cannot be filled in. I don’t understand why they do this, and it may simply be an oversight, since you can fix them up, and the modifications don’t cause any issues with the *big* upcoming feature in these forms – 2D barcode scanning of the data.
Using Acrobat Pro is labor intensive. First you have to make a copy of the file that is no longer “Reader Extended” – then you have to fill your form. If you only work on these PDFs in Acrobat Pro, that’s all. But if you want them to work in Reader as well (maybe you want the client to check them and make corrections, or you just want to be able to look at the forms on computers that don’t have a license to use Pro), you then have to save the completed form with “Save As Other”, re-enabling the Reader extensions. Maybe this could be automated using AppleScript. But it is hard work.
The good news (for an impoverished attorney) is that you no longer have to find $500 to buy Acrobat Pro. They give you a 30 day trial and then offer you a $20/month annual subscription.
The USCIS forms are developed using Adobe LiveCycle. And Adobe sells Java toolkit API licenses for companies that want to do the sort of automation I need. No prices advertised. Just contact sales.
XFA – Adobe’s XML Forms Architecture
At this point, you need to know a little about PDF Forms. In the old days, forms were very simplistic. The old format was eventually displaced by XFA – the XML Forms Architecture. An XML form description file gets embedded in your PDF. It contains a “template” that says what sort of fields there are and how they are laid out, and a “data” section, where filled-in values are stored. The short answer for someone reading this article to find out how to import form data into Acrobat is that you import an XML file containing the “xfa:datasets” section of the XFA for the PDF.
Once you get the XML data format from the PDF, you can work out which elements correspond to which fields in the form – and you are on your way to automated form filling.
But out-of-the-pdfbox doesn’t help you with that. It will decrypt AES for PDF manipulation, but has no AES encryption for creating and updating modern PDFs. That is part of the missing 20%.
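To make the “xfa:datasets” idea concrete, here is a minimal sketch that pulls that section out of an XFA document using only the JDK’s XML classes. It assumes you already have the XFA as XML text (getting it out of an encrypted PDF is a separate step), and the class name `XfaDatasets` and method name `extractDatasets` are hypothetical, chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of PDFBox or IText.
final class XfaDatasets {
    // Namespace of the xfa:datasets section.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    /** Returns the first xfa:datasets element of the given XFA XML, serialized as text. */
    static String extractDatasets(String xfaXml) {
        NodeList hits;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match on the xfa namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xfaXml.getBytes(StandardCharsets.UTF_8)));
            hits = doc.getElementsByTagNameNS(XFA_DATA_NS, "datasets");
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XFA XML", e);
        }
        if (hits.getLength() == 0) {
            throw new IllegalArgumentException("no xfa:datasets element found");
        }
        try {
            // An identity transform serializes the DOM node back to text.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(hits.item(0)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize xfa:datasets", e);
        }
    }
}
```

The returned text is the part you would edit and then import back; the element names inside `xfa:data` depend entirely on the form you are working with.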
So Many Toolkits
I tried various Java PDF toolkits (and their name is Legion) with no success. I had test forms (the I-130 and the new N-400). I had some test data – the XML data file I could use successfully to fill the form in Acrobat Pro. And all I wanted was to be able to use the toolkit to update the PDF so that I could load it into Acrobat Reader. Every “solution” I tried failed at some point. I’m not going to list the also-rans – and I may have missed a toolkit that does all of this (which could have saved me a couple of hard weeks of programming). I will say that some got close. But none got past the dreaded message that the document has been changed since it was created and that the use of extended features is no longer available.
IText 5 Just Works
None, that is, except IText 5. Bruno’s software “just worked.” More than that, it also has an “unethicalreading” feature that allows you to fix up some of the problems in the underlying form – subverting the author’s permissions. “Unethical” and “attorney” don’t really go together (some of you may disagree). However, for my application, these are government forms. Government for the people by the people. And government (at least U.S. government) documents cannot be copyrighted. Making government forms more usable is quite ethical. And infringes nothing. But beware if you are working with PDFs from other sources.
When you start using XFA, it’s quite wonderful to watch all the fields fill automagically. Until the day that you find a form that “doesn’t quite.” For me that happened on the second form I worked on – the G-325-A. You can’t fill fields like “first name,” “family name,” etc. I could use Bruno’s toolkit to fix these forms. And I’ll post the code to do that – it is the one thing I intend to use IText 5 for – and if I post the code, that complies with the “community license” requirements. It may also be useful. Again the short version is that XFA fields can have a “bind” attribute – and if “bind” is set to “none” you are screwed. That’s an old engineering term. With a few lines of code and IText 5 I could produce a PDF that I could run through Acrobat Pro to create a “fixed” version of the standard PDF with all fields working.
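Before fixing such a form it helps to know which fields are affected. The sketch below is one way to list them with plain JDK XML parsing, not the IText code mentioned above. It assumes the setting shows up in the XFA template as a `bind` child element with `match="none"` inside a `field` element, which is my reading of the XFA template schema; check it against your own form. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists template fields that are bound to nothing.
final class UnboundFields {
    /** Returns the names of all fields that have a direct child <bind match="none"/>. */
    static List<String> findUnbound(String templateXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(templateXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse template XML", e);
        }
        List<String> unbound = new ArrayList<>();
        // "*" matches any namespace, so the template schema version does not matter.
        NodeList fields = doc.getElementsByTagNameNS("*", "field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            for (Node c = field.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c instanceof Element && "bind".equals(c.getLocalName())
                        && "none".equals(((Element) c).getAttribute("match"))) {
                    unbound.add(field.getAttribute("name"));
                }
            }
        }
        return unbound;
    }
}
```

Running this over the template section of a form gives you a checklist of the fields that will stay empty no matter what data you import.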
At that point, I wrote to IText. And had a very pleasant email exchange with a guy called Stan. IText have commercial and community licenses. The community version is the GNU Affero license that requires full publication of all your source code if you use their software. A real community license. The commercial license is reasonably pitched around $2000 – very reasonable for the quality of the product – maybe even too cheap to maintain a small software company. I used to work in one – it takes a lot of licenses to make payroll month after month. And the only reason (IMO) that IText 5 “just works” is that the funding allows them to do a good job. And they really have.
However, if you don’t have $2000, that’s not much help. If you are developing something that may not even become a product, just putting it into beta is going to take food off the table. There are a large number of highly talented people – some in academia, some with day jobs – who could easily envision a new PDF service. And it would be in IText’s self-interest to encourage these people to become “future paying customers.” That’s what I thought, anyway, and that’s what I told “Stan.”
I pitched the idea of a nominal short-duration “starter/incubator” license for individuals (not companies) – he replied that depending on what I meant by nominal, I’d probably be better advised to go with the Affero community license. After I pointed him to a forum thread from a few years ago by Bruno, quoting an under $200 single desktop license – which they no longer offer – as the ballpark, I got no further response from them.
Between a Rock and a Hard Place
I feel for this company. They are in a very difficult position. Any “starter” license would cut into their revenues – and unless they are wealthy beyond avarice through deals with large companies, that is bad news. I believe Google uses the earlier “free” version of IText. That must really hurt. However, if they don’t encourage “future customers” they force resourceful and motivated people to roll up their sleeves and solve the problems themselves. And IText’s only selling point is that it is a high quality implementation of a published specification. They have a business because they are the only non-Adobe game in town. Other vendors may disagree – that’s just my personal experience (see above.) Their challenge will be to keep that position.
My debt to IText is that they showed me that the problem can be solved. As I tried every other toolkit I could find, and came up short with each, it started to look like a shell game. The requirement is that a filled in form has to work with Acrobat Reader, but the only way to do that would be to buy from Adobe and use Adobe technology. Bruno’s toolkit showed me otherwise. There are no shortcuts. You solve the problem when you produce something that works exactly according to the PDF specification. No more, no less.
In the End, It’s Just Code
So I returned to PDFBox. And I read the PDF spec more carefully. And used PDFBox to look carefully at the output from Adobe Reader when you fill out and save a form.
The most important thing to know is that they make the job easier by appending the new data after the end of the PDF file. So you don’t have to take a PDF to pieces, make the changes you need, and then put it all back together again and hope it all still works. “All” you have to do is copy the original file, and add changes to the form at the end.
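The append-only idea can be sketched in a few lines. A PDF ends with a `startxref` keyword followed by the byte offset of its cross-reference data and then `%%EOF`; an incremental update adds a new section after that, ending with its own `startxref`. The sketch below only shows the mechanics of reading the last offset and appending bytes without touching the original. The class name `PdfTail` is hypothetical, and building a valid update section (objects, cross-reference entries, a trailer pointing back to the previous offset) is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating "copy the original, add changes at the end".
final class PdfTail {
    /** Returns the offset that follows the last "startxref" keyword in the file. */
    static long lastStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so string indexes equal byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int kw = text.lastIndexOf("startxref");
        if (kw < 0) {
            throw new IllegalArgumentException("no startxref keyword found");
        }
        int i = kw + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && Character.isDigit(text.charAt(i))) {
            i++;
        }
        if (start == i) {
            throw new IllegalArgumentException("startxref is not followed by an offset");
        }
        return Long.parseLong(text.substring(start, i));
    }

    /** Returns original + updateSection; the original bytes are copied unchanged. */
    static byte[] appendUpdate(byte[] original, byte[] updateSection) {
        byte[] out = Arrays.copyOf(original, original.length + updateSection.length);
        System.arraycopy(updateSection, 0, out, original.length, updateSection.length);
        return out;
    }
}
```

Because the first `original.length` bytes of the result are identical to the input, anything that depends on the original bytes being unchanged still sees them exactly as they were.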
The Kindness of Strangers
Three things made the work possible. First, “zegrevart” posted a fork of PDFBox 1.6 that filled the encryption hole: 128- and 256-bit AES encryption. It needed some work to fit it with the newer 1.8.3 codebase, and a few small bug-fixes, but otherwise it was a gift.
Second, I discovered (as I dug deeper into the PDFBox source) that there were already hooks for incremental saving of PDF files – intended for digital signing, but extendable. It was the PDFBox way of doing what IText does with its PDFStamper.
And finally, someone with the endearing name “Jammy Dodger” posted example code that showed how to use DOM transformations to put modified XFA back into the PDF. Thank you, Jammy. Your example put back a single COSStream, and I only got to the finish line by implementing it as Acrobat does as a COSArray of separate streams for each of the sections in the XFA, and only putting the data section in the incremental update. But hey…
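The array form is easy to picture as an ordered list of name and content pairs, where joining the contents in order gives the complete XFA XML. The sketch below models that in plain Java rather than with PDFBox’s COS classes, so it shows the shape of the data and not the real API. The packet names in the usage note follow the pattern I have seen in such files and are an assumption; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of an XFA entry stored as an array of (name, stream) pairs.
final class XfaPackets {
    /** One pair from the array: the packet name and that packet's XML text. */
    static final class Packet {
        final String name;
        final String xml;

        Packet(String name, String xml) {
            this.name = name;
            this.xml = xml;
        }
    }

    /** Returns a copy of the list in which only the "datasets" packet is replaced. */
    static List<Packet> replaceDatasets(List<Packet> packets, String newDatasetsXml) {
        List<Packet> out = new ArrayList<>();
        boolean replaced = false;
        for (Packet p : packets) {
            if ("datasets".equals(p.name)) {
                out.add(new Packet(p.name, newDatasetsXml));
                replaced = true;
            } else {
                out.add(p); // untouched packets are reused as they are
            }
        }
        if (!replaced) {
            throw new IllegalArgumentException("no datasets packet in the list");
        }
        return out;
    }

    /** Joins the packet contents in order to form the complete XFA XML. */
    static String concatenate(List<Packet> packets) {
        StringBuilder sb = new StringBuilder();
        for (Packet p : packets) {
            sb.append(p.xml);
        }
        return sb.toString();
    }
}
```

With a list such as `xdp:xdp`, `template`, `datasets`, `</xdp:xdp>`, only the one replaced packet is new, which is why only the data section has to go into the incremental update.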
There are no Shortcuts (other than buying a license)
Once you have a framework, you can begin to work with it. All I can tell the aspiring PDF coder is that when something doesn’t work, you just have to follow the data. Can you decrypt what you encrypted, and does it look the same? When it is not quite right, where does it go wrong? Debugging is usually a procedural task. With PDF it is all about data. What does it look like? Are you preserving it when you deflate and encrypt it, and then decrypt and inflate it?
So that’s the background. I can now take a PDF, extract and analyze the XFA so I can easily make changes. Create new XFA ready to fill the form, and then use this very experimental PDFBox extension to fill the PDF and preserve the integrity of the file, so that Acrobat Reader accepts it and doesn’t say “this document has been changed since it was created, and the use of extended features is no longer available.” And I can deal with those newfangled 2D barcodes. Gotta wear shades.
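A round-trip check of this kind can be written with nothing but the JDK. The sketch below deflates some bytes, encrypts them with AES in CBC mode with a random 16-byte IV stored in front of the ciphertext, then reverses both steps so the result can be compared with the input. It is a simplified stand-in, not the PDFBox code path: the fixed key is a placeholder (a real PDF derives its keys from the document’s encryption settings), and the class name `RoundTrip` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical round-trip helper: deflate -> encrypt -> decrypt -> inflate.
final class RoundTrip {
    static byte[] deflate(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static byte[] inflate(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InflaterInputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Returns a fresh random 16-byte IV followed by the AES-CBC ciphertext. */
    static byte[] encrypt(byte[] key, byte[] plain) {
        try {
            byte[] iv = new byte[16];
            new SecureRandom().nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            byte[] body = c.doFinal(plain);
            byte[] out = new byte[iv.length + body.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(body, 0, out, iv.length, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the IV from the first 16 bytes and decrypts the remainder. */
    static byte[] decrypt(byte[] key, byte[] data) {
        try {
            Cipher c = Cipher.getInstance("AES/CBC/PKCS5Padding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(data, 0, 16));
            return c.doFinal(data, 16, data.length - 16);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

If `inflate(decrypt(key, encrypt(key, deflate(data))))` is not byte-for-byte equal to `data`, dump the intermediate arrays and find the first step where they stop matching.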
I wrote this outline for people like me coming to this and similar PDF problems. Most of what you find on the internet is a wasteland of desperate unanswered questions. And when these are the questions you are facing, that is deeply depressing. There are also some who say “it can’t be done.” I feel for the people who have come a long way on this path, and have accepted this wisdom, and given up.
I also disagree with the people who attacked IText’s license change. The company has a product. The people who work there need to eat if their product is going to continue and remain good enough. When you buy their license, you buy the endless hours of bug fixing, and data checking, and work that has gone into their product. I just had a small requirement, and that needed a couple of intense weeks of study that made law school seem like a walk in the park. These people do it for a living. And they have a product that works out of the box. Kudos to them. It’s a hard life working in and running a small software company.
But in the end, PDFBox worked for me. The changes won’t “just go back” – some are just hacks “that get me home” – so the second post in this series (sometime soon I hope) will detail everything that was needed, and what bugs (or features) got in the way, so that those who understand PDF can move the Apache project forward. I can now go back to being a lawyer and spend more time on people and less on torturing them with forms.