Document Formats and Image Formats
James C. King
PDF Architect/Senior Principal Scientist
Advanced Technology Laboratory
Adobe Systems Incorporated
Outline
Some Fundamentals
PDF Documents
PDF Pages
Synthesized Pages versus Scanned Pages
PDF and JPEG2000
PDF and ISO Standards
Some Fundamentals
Image Formats versus Document Formats
“Sampled” Image (e.g., JPEG2000): a picture
Multi-page Compound Document (e.g., PDF): pages that can contain pictures
Image Resolution and Size
[Figure: a sampled image is subsampled for a lower resolution display and supersampled for a higher resolution display (x2); a higher resolution sampled image (x2) is shown for comparison]
Image Sampling (JPEG2000)
Sub and Super Sampling Tools needed
Size and resolution are different things
[Figure: a JPEG2000 image (multiple resolutions) is subsampled or supersampled to fit a display or page (arbitrary image size)]
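As a rough illustration of subsampling and supersampling, here is a minimal Python sketch that operates on a grayscale image stored as a list of rows. The function names, the block-averaging filter, and the pixel-replication filter are assumptions chosen for brevity; this is not how a JPEG2000 decoder derives its resolution levels. The last helper illustrates that size and resolution are different things: the same pixel grid prints at different physical sizes depending on the chosen dpi.

```python
def subsample(image, factor):
    """Reduce resolution by averaging each factor x factor block of samples."""
    rows, cols = len(image), len(image[0])
    out = []
    # Ignore any partial block at the right and bottom edges.
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) // len(block))
        out.append(out_row)
    return out


def supersample(image, factor):
    """Increase resolution by repeating each sample factor x factor times."""
    out = []
    for row in image:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def printed_size_inches(width_px, height_px, dpi):
    """Physical size of a pixel grid when rendered at a given resolution."""
    return width_px / dpi, height_px / dpi


if __name__ == "__main__":
    image = [[0, 2, 10, 10],
             [2, 4, 10, 10],
             [8, 8, 0, 0],
             [8, 8, 0, 4]]
    print(subsample(image, 2))
    print(supersample([[1, 2]], 2))
    print(printed_size_inches(1200, 600, 300))
```

Supersampling by replication adds no new detail; it only lets a lower resolution image fill a higher resolution display.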
PDF Documents
PDF: Multi-page Compound Documents
A Comprehensive Format for Representing Documents and Forms
Page contents
Images
Graphics
Fonts
Colorspaces
Metadata
Annotations
Links
Digital signatures
<and more>
Not an image format like TIFF or JPEG
High fidelity, high precision text layout and graphics features
Platform and device independent definition
Selective compression to reduce file size (e.g., image formats)
Color Management (ICC support)
PDF 1.0 in 1993 … PDF 1.7 in 2006. Many enhancements!
Composite Documents
PDF Pages
Page Content Objects
Text, Graphics and Image
Typographic Text
Vector Graphics
Sampled Images
Coordinate Transforms
x-scale, rotate/skew, rotate/skew, y-scale, x-pos, y-pos
2 0.8 0.7 2 10 210 cm
3 0.9 0.8 1 180 200 Tm
2.5 0 0 -1 235 170 cm
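The six operands of the cm and Tm operators form a transformation matrix [a b c d e f] that maps a point (x, y) to (a·x + c·y + e, b·x + d·y + f). The following Python sketch applies such a matrix to a point and concatenates two matrices; the function names are illustrative and not part of any PDF library.

```python
def apply_matrix(m, x, y):
    """Map (x, y) through a PDF matrix given as the six numbers a b c d e f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def concat(m1, m2):
    """Return the single matrix equivalent to applying m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


if __name__ == "__main__":
    # "2.5 0 0 -1 235 170 cm": stretch x by 2.5, flip y, then translate.
    flip = (2.5, 0, 0, -1, 235, 170)
    print(apply_matrix(flip, 1, 1))

    # Scaling by 2 followed by a translation, folded into one matrix.
    scale = (2, 0, 0, 2, 0, 0)
    move = (1, 0, 0, 1, 10, 20)
    print(apply_matrix(concat(scale, move), 1, 1))
```

A negative y-scale, as in the last matrix on the slide, mirrors the content vertically, which is why the example text appears flipped.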
Clipping and Masking
[Figure: a picture, a mask, and typographic text combined in two ways]
Clip to path (star)
Mask off sky
Text as Text
The JPEG2000 image compression
technique has been cited by experts
as a new archiving format for digital
images. It is both a preservation and
delivery format, and has been seen
as a possible alternative to the TIFF
format which most institutions use
as a long-term archiving standard.
Produced by both imaging experts
and the Joint Photographic Experts
Group, it is now a recognised ISO
standard. The standard JPEG file
format which is so widely in use is
not yet an ISO standard.
Text as text (using outline fonts)
Text as image
Various Resolutions for Image Text
The JPEG2000 image compression
technique has been cited by experts
as a new archiving format for digital
images.
(outlines)
9.5 in x 5.3 in
Resolution Independence
Synthesized Pages versus Scanned Pages
Document Sources
Born digital
• More compact
• Editable
• Device independent/resolution independent
• Zoom-able
Scanned from paper
• Bulky
• Need to pick a sampling resolution
• Text and image need different treatment
• Can do OCR or DR (document recognition)
Born digital is a luxury
OCR’d Text as Underlayer
OCR’d Text
• underlaid
• made invisible
• may have mistakes
• used for search
Scanned Text as Image
A PDF Page
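One common way to make the OCR’d text invisible yet searchable is PDF text rendering mode 3 (neither fill nor stroke), set with the Tr operator. The Python sketch below builds a page content stream that places the recognized words first and then paints the scanned image over the whole page. The resource names Im0 and F1, the function names, and the word-tuple layout are assumptions made for illustration, and only ASCII text is handled.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def scanned_page_content(width_pt, height_pt, words,
                         image_name="Im0", font_name="F1"):
    """Build a content stream: invisible OCR text, then the page image.

    words is a list of (text, x_pt, y_pt, font_size_pt) tuples taken from
    a hypothetical OCR engine's output.
    """
    lines = ["BT", "3 Tr"]  # 3 Tr: text is neither filled nor stroked
    for text, x, y, size in words:
        lines.append(f"/{font_name} {size} Tf")
        lines.append(f"1 0 0 1 {x} {y} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    # Scale the 1 x 1 unit image square up to the full page and paint it.
    lines += ["q", f"{width_pt} 0 0 {height_pt} 0 0 cm",
              f"/{image_name} Do", "Q"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(scanned_page_content(612, 792, [("JPEG2000 (ISO)", 72, 700, 10)]))
```

Because the text layer comes from OCR, it may contain mistakes; the reader still sees the scanned image, while search and copy operate on the hidden text.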
Image Text and Image Picture Require Different Treatment
The JPEG2000 image compression technique has been cited by experts as a new archiving format for digital images.
The standard JPEG file format which is so widely in use is not yet an ISO standard.
Image text: needs 1-bit per pixel black and white at 600 dpi
Image picture: needs 24-bit per pixel color at 150 dpi
MRC (Mixed Raster Content)
Both JPEG2000 and PDF support this
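To see why the two kinds of content are sampled differently, the uncompressed sizes can be worked out directly. The Python sketch below assumes a US Letter page (8.5 in x 11 in), which is not stated on the slide, and ignores any row padding or compression.

```python
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a page sampled at the given settings."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    page = (8.5, 11)  # assumed page size in inches
    text_layer = raw_bytes(*page, dpi=600, bits_per_pixel=1)
    picture_layer = raw_bytes(*page, dpi=150, bits_per_pixel=24)
    single_layer = raw_bytes(*page, dpi=600, bits_per_pixel=24)
    print(text_layer)     # 4207500
    print(picture_layer)  # 6311250
    print(single_layer)   # 100980000
```

Under these assumptions, splitting the page into a 600 dpi 1-bit layer and a 150 dpi 24-bit layer needs roughly a tenth of the raw data of a single 600 dpi 24-bit scan, before any compression is applied.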
PDF and JPEG2000
PDF Support for JPEG2000
JPEG2000 images can be included on PDF pages
JPX Baseline is supported
Enumerated color space 19 (CIEJab) is not supported
Enumerated color space 12 (CMYK) is supported
All four progressions supported: resolution, color depth, band, location
Inappropriate progression will just cost time
One global soft mask within the JPEG2000 supported
JPEG2000 document features are not supported
PDF’s document features are more general and more flexible
An image to display in a rectangle is obtained from the JPEG2000 stream
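In a PDF file, a JPEG2000 image is carried as an image XObject whose stream uses the JPXDecode filter, and a page then paints it into a rectangle with a coordinate transform. The Python sketch below shows the general shape of both pieces; the function names and the resource name Im0 are assumptions, and a real writer would also need object numbers, a cross-reference table, and the page's resource dictionary.

```python
def jpx_image_xobject(jp2_bytes, width, height):
    """Wrap JPEG2000 data as an image XObject stream using /JPXDecode.

    The color space and bit depth are left out so that they are taken
    from the JPEG2000 data itself.
    """
    header = (
        "<< /Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        f"/Filter /JPXDecode /Length {len(jp2_bytes)} >>\n"
    ).encode("ascii")
    return header + b"stream\n" + jp2_bytes + b"\nendstream"


def place_image(name, x, y, width_pt, height_pt):
    """Content stream fragment that paints the image into a rectangle."""
    return f"q {width_pt} 0 0 {height_pt} {x} {y} cm /{name} Do Q"


if __name__ == "__main__":
    fake_jp2 = b"\x00" * 16  # placeholder bytes, not a valid JPEG2000 file
    print(jpx_image_xobject(fake_jp2, 640, 480)[:80])
    print(place_image("Im0", 72, 400, 320, 240))
```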
Software Support
The key to using any image format or document format is the tools available
Tools for creation
support advanced features
Tools for presentation
Tools for incorporating with other formats
Ubiquity of viewing tools
OCR and DR capabilities
Tools for Scan to PDF
Tools that separate image text and image pictures (MRC)
Adobe Acrobat Professional Create from Scanner
Adobe PDF Scan Library 3.0 (OEM product)
CVision Technologies (www.cvisiontech.com) option for JPEG2000
Canon desktop printers and multifunction devices (some)
Iris (www.irislink.com) MRC called IHQ
LuraTech (www.luratech.com) MRC using JPEG2000
Nuance (www.nuance.com)
JRAPublish (Jim Rile)
Spigraph (www.spigraph.fr)
VeryPDF (www.veryPDF.com)
Of course the ubiquitous Adobe Reader presents them all
PDF and ISO Standards
Establishing the ISO PDF Umbrella
PDF 1.7 (ISO 32000 in 2008)
PDF/A: archive; ISO 19005-1 (PDF 1.4)
PDF/X: graphic arts; ISO 15930-1 (PDF 1.4 & 1.6)
PDF/E: engineering; AIIM Committee --> ISO
PDF/UA: accessibility; AIIM Committee --> ISO
PDF/A
A PDF subset for archiving
ISO 19005-1
Long-term Preservation Needs for Electronic Documents
Characteristics identified as objectives for PDF/A were:
Device Independent - Can be reliably and consistently rendered without regard to the hardware or software platform
Self-contained - Contains all resources necessary for rendering
Self-documenting - Contains its own description
Unfettered - Absence of technical file protection mechanisms
Available - Authoritative specification publicly available
Adoption - Widespread use may be the best deterrent against preservation risk
PDF/A -- A PDF Subset of PDF 1.4
(Standard: ISO 19005-1)
Some useful PDF features work against, and are incompatible with, preserving information over the long term
PDF Subset: restricted from using some PDF features, for example
Anything that would alter the visual appearance over time (forms)
External references or embedded files
Encryption
PDF Subset: required to use some PDF features, for example
Accessibility features for recoverable text (tagged PDF)
Embed all fonts
Specific metadata requirements
Device independent color
Uses for PDF/A
Archival storage of electronic documents
Documents of record
Government records
Corporate records
Distributing read only material
Documents with assured accessibility (read to the blind)