An XRef table usually starts with an object number of zero. While it
could technically start at any other number, this is a tell-tale sign
of a broken table.
For the "broken" documents I encountered, this always meant that some
objects must have been removed from the start of the table, without
updating the following indices. When this is the case, the document is
not able to be read normally.
However, most other PDF parsers seem to know of this quirk and fix the
XRef table automatically.
Likewise, we now check for this exact case, and if it matches up with
what we expect, we update the XRef table such that all object numbers
match the actual objects found in the file again.
Now, whenever the xref table points to a compressed object,
parse_object_with_index will look it up in the corresponding object
stream as if it were a regular object.
With this, our parser gains the bare minimum support for xref streams.
Since PDF version 1.5, a document may omit the xref table in favor of
a new kind of xref stream object. This is used to reference so-called
"compressed" objects that are part of an object stream.
With this patch we are able to parse this new kind of xref object, but
we'll have to implement object streams to use them correctly.
The Parser class is now a generic PDF object parser, of which the new
DocumentParser class derives. DocumentParser now takes over all
functions relating to linearization, pages, xref and trailer handling.
This allows the use of multiple parsers in the same document's
context, which will be needed in order to handle PDF object streams.