pdf-read: Read and render PDF files
1 Examples
#lang racket
(require slideshow/pict pdf-read)
(show-pict (page->pict "oopsla04-gff.pdf"))
By default, page->pict shows the first page of the given PDF filename. To show a different page, pass a page obtained with pdf-page. For example, (show-pict (page->pict (pdf-page "oopsla04-gff.pdf" 5))) shows the 6th page, because pages are zero-indexed.
;; The first page of a PDF file. (Pages are zero-indexed)
(define page (pdf-page "oopsla04-gff.pdf" 0))
;; Overlay each box over the PDF.
(for/fold ([pageview (page->pict page)])
          ([bounding-box (in-list (page-find-text page "the"))])
  ;; Each match's bounding box
  (match-define (list x1 y1 x2 y2) bounding-box)
  (pin-over pageview x1 y1
            (cellophane
             (colorize (filled-rectangle (- x2 x1) (- y2 y1)) "yellow")
             0.5)))
(define page (pdf-page "oopsla04-gff.pdf" 0))
(for/fold ([pageview (apply blank (page-size page))])
          ([box (in-list (page-text-layout page))])
  (match-define (list x1 y1 x2 y2) box)
  (pin-over pageview x1 y1
            (colorize (rectangle (- x2 x1) (- y2 y1)) "gray")))
(rotate (frame (scale (inset/clip (page->pict page) -400 -300 -100 -400) 5))
        (* 0.125 pi))
(rotate (frame (scale (inset/clip (bitmap (page->bitmap page)) -400 -300 -100 -400) 5))
        (* 0.125 pi))
2 PDF files
All functions that accept pages or documents also accept filenames. This is more convenient for you, but it is also less efficient because the document must be re-opened every time. You can make this faster by keeping the result of pdf-page or open-pdf-uri to ensure that this library only opens the document once.
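The advice above can be sketched as follows: open the document once, then reuse it for every page. This is a minimal sketch, not part of the library; `pages->picts` is a hypothetical helper name, and the code assumes pdf-read is installed and uses only the functions documented in this manual.

```racket
#lang racket/base
;; Sketch: open a PDF once and render several pages from the same document.
(require pdf-read)

;; `pages->picts` is a hypothetical helper, not part of pdf-read.
;; uri      : a URI string such as "file:/tmp/filename.pdf"
;; password : a string, or #f for an unprotected file
;; indices  : a list of zero-based page indices
(define (pages->picts uri password indices)
  (define doc (open-pdf-uri uri password))
  ;; The contract of open-pdf-uri allows #f as a result, so check for it.
  (unless doc
    (error 'pages->picts "could not open ~a" uri))
  ;; Every page is taken from the already-open document.
  (for/list ([i (in-list indices)])
    (page->pict (pdf-page doc i))))
```

A call such as (pages->picts "file:/tmp/filename.pdf" #f '(0 1 2)) would then render the first three pages while opening the file only once.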
procedure
(pdf-document? maybe-doc) → boolean?
maybe-doc : any/c
procedure
(open-pdf-uri uri password) → (or/c pdf-document? false?)
uri : string?
password : (or/c string? false?)
open-pdf-uri raises an error if the PDF file does not exist.
(define document (open-pdf-uri "file:/tmp/secret.pdf" "some_password"))
(define page (pdf-page document 9))
(show-pict (page->pict page))
(page->pict (pdf-page (open-pdf-uri "file:/tmp/filename.pdf" #f) 0))
procedure
(pdf-page maybe-doc page-index) → pdf-page?
maybe-doc : pdf-document?
page-index : exact-nonnegative-integer?
(pdf-page "/tmp/oopsla04-gff.pdf" 2)
3 Rendering
procedure
(page->pict page) → pict?
page : pdf-page?
procedure
(page->bitmap page) → (is-a?/c bitmap%)
page : pdf-page?
procedure
(page-render-to-dc! page dc) → any/c
page : pdf-page?
dc : (is-a?/c dc<%>)
procedure
(page-render-to-cairo! page _cairo_t) → any/c
page : pdf-page?
_cairo_t : any/c
4 Layout
procedure
(page-size page) →
(list/c (and/c real? (not/c negative?)) (and/c real? (not/c negative?)))
page : pdf-page?
procedure
(page-crop-box page) →
(list/c (and/c real? (not/c negative?))
        (and/c real? (not/c negative?))
        (and/c real? (not/c negative?))
        (and/c real? (not/c negative?)))
page : pdf-page?
The crop box is a rectangle (list x1 y1 x2 y2), where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner. Coordinates are in points (1/72 of an inch).
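Because rectangles are plain lists in points, ordinary list operations are enough to work with them. The sketch below is illustrative only: `rect-width`, `rect-height`, and `points->inches` are hypothetical helpers, not part of pdf-read, and the sample rectangle is a hand-written US Letter box rather than output from a real file.

```racket
#lang racket/base

;; Hypothetical helpers, not part of pdf-read.
;; A rectangle is (list x1 y1 x2 y2): top-left corner, then bottom-right.
(define (rect-width rect) (- (caddr rect) (car rect)))
(define (rect-height rect) (- (cadddr rect) (cadr rect)))

;; One point is 1/72 of an inch.
(define (points->inches pts) (/ pts 72))

;; A hand-written US Letter rectangle, in points.
(define letter '(0 0 612 792))

(points->inches (rect-width letter))  ; 17/2, i.e. 8.5 inches
(points->inches (rect-height letter)) ; 11
```

The same helpers should apply to the result of (page-crop-box page); with inexact coordinates the results would be inexact as well.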
procedure
(page-text-in-rect page mode x1 y1 x2 y2) → string?
page : pdf-page?
mode : (one-of/c 'glyph 'word 'line)
x1 : (and/c inexact? (not/c negative?))
y1 : (and/c inexact? (not/c negative?))
x2 : (and/c inexact? (not/c negative?))
y2 : (and/c inexact? (not/c negative?))
When specifying the rectangle, x1,y1 should be the point of the beginning of the selection and x2,y2 should be the end. Coordinates are in points (1/72 of an inch).
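The contract above requires inexact coordinates, so exact values such as 0 or 72 would be rejected. One way to prepare a rectangle is sketched below; `rect->inexact` is a hypothetical helper, not part of pdf-read, and the commented call assumes `page` is bound to a page of an open document.

```racket
#lang racket/base

;; Hypothetical helper, not part of pdf-read.
;; Converts every coordinate of (list x1 y1 x2 y2) to an inexact number,
;; matching the inexact? part of the page-text-in-rect contract.
(define (rect->inexact rect)
  (map exact->inexact rect))

(rect->inexact '(0 72 100 144)) ; '(0.0 72.0 100.0 144.0)

;; With pdf-read loaded and `page` bound to a page, a call might look like:
;; (apply page-text-in-rect page 'line (rect->inexact '(0 72 100 144)))
```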
procedure
(page-text-layout page)
→
(listof (list/c (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))))
page : pdf-page?
Each bounding box is a (list x1 y1 x2 y2), where x1,y1 is the top left corner and x2,y2 is the bottom right. Coordinates are in points (1/72 of an inch).
procedure
(page-text-with-layout page)
→
(listof (list/c string?
                (list/c (and/c real? (not/c negative?))
                        (and/c real? (not/c negative?))
                        (and/c real? (not/c negative?))
                        (and/c real? (not/c negative?)))))
page : pdf-page?
Each bounding box is a (list x1 y1 x2 y2), where x1,y1 is the top left corner and x2,y2 is the bottom right. Coordinates are in points (1/72 of an inch).
(take (page-text-with-layout "oopsla04-gff.pdf") 5)
'(("S\n" (150.738 71.302 162.699 87.890))
  ("u\n" (162.699 71.302 173.656 87.890))
  ("p\n" (173.656 71.302 184.613 87.890))
  ("e\n" (184.613 71.302 194.584 87.890))
  ("r\n" (194.584 71.302 201.560 87.890)))
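The per-glyph entries shown above can be joined back into running text. The sketch below is illustrative: `layout->string` is a hypothetical helper, not part of pdf-read, and it assumes each string carries at most one trailing newline, which is inferred only from the sample output and may not hold for every PDF.

```racket
#lang racket/base
(require racket/string)

;; Hypothetical helper, not part of pdf-read.
;; Each entry is (list text (list x1 y1 x2 y2)), the shape returned by
;; page-text-with-layout.
(define (layout->string entries)
  (apply string-append
         (for/list ([entry (in-list entries)])
           ;; Drop one trailing newline, as seen in the sample output.
           (string-trim (car entry) "\n" #:left? #f))))

;; The five entries from the sample output above.
(define sample
  '(("S\n" (150.738 71.302 162.699 87.890))
    ("u\n" (162.699 71.302 173.656 87.890))
    ("p\n" (173.656 71.302 184.613 87.890))
    ("e\n" (184.613 71.302 194.584 87.890))
    ("r\n" (194.584 71.302 201.560 87.890))))

(layout->string sample) ; "Super"
```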
5 Searching
procedure
(page-find-text page text)
→
(listof (list/c (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))
                (and/c real? (not/c negative?))))
page : pdf-page?
text : string?
6 Metadata
procedure
doc : pdf-document?
procedure
doc : pdf-document?
(pdf-author doc) → (or/c false? string?)
doc : pdf-document?
(pdf-subject doc) → (or/c false? string?)
doc : pdf-document?
(pdf-keywords doc) → (or/c false? string?)
doc : pdf-document?
(pdf-creator doc) → (or/c false? string?)
doc : pdf-document?
(pdf-producer doc) → (or/c false? string?)
doc : pdf-document?
procedure
(page-label page) → (or/c false? string?)
page : pdf-page?
7 Bugs and Issues
Note that page->pict draws directly to the underlying surface’s cairo context. This may cause problems if the dc<%> is not backed by cairo or if you apply further transformations to the result (such as cropping or blurring).