I might have asked this before, but does anybody know of an open source PDF (or JP2) to HTML tool? To be clearer: a tool that turns a digitized book into a single HTML page. A colleague and I have talked about this, and we have both done work in this direction. I am just struck by how strange it is that such a thing doesn’t already exist. It seems like it should be a command-line tool by now, since the task is so common and so clearly needed. I’m too old to want to reinvent wheels.

I guess I’m stuck with Tesseract + postprocessing, but the postprocessing is endless. Pretty sure I’ll just install a lightweight LLM on my desktop and use that.
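For anyone curious what "Tesseract + postprocessing" looks like at its barest, here is a minimal sketch in Python. It is not a finished tool, just the shape of one. It assumes the pages are already exported as image files and that the `tesseract` command is on the PATH; the function names (`ocr_page`, `clean_page`, `pages_to_html`) are my own, and the only cleanup shown is rejoining words hyphenated across line breaks and splitting paragraphs on blank lines.

```python
import html
import re
import subprocess
from pathlib import Path


def ocr_page(image_path: Path) -> str:
    """Run the Tesseract CLI on one page image and return its plain text."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def clean_page(text: str) -> list[str]:
    """Rejoin words split across line breaks, then split into paragraphs."""
    # Naive: "digi-\ntized" -> "digitized". This also eats genuine hyphens
    # that happen to fall at a line end (e.g. "command-\nline").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    paragraphs = re.split(r"\n\s*\n", text)  # blank line = paragraph break
    return [" ".join(p.split()) for p in paragraphs if p.strip()]


def pages_to_html(pages: list[str], title: str) -> str:
    """Combine the OCR text of every page into one HTML document."""
    body = []
    for number, page in enumerate(pages, start=1):
        body.append(f'<section id="page-{number}">')
        body.extend(f"<p>{html.escape(p)}</p>" for p in clean_page(page))
        body.append("</section>")
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n<body>\n"
        + "\n".join(body)
        + "\n</body></html>\n"
    )
```

Feeding it would be something like `pages_to_html([ocr_page(p) for p in sorted(Path("scans").glob("*.png"))], "My Book")`, where the `scans` folder is hypothetical. Everything that makes the postprocessing endless (running headers, page numbers, footnotes, columns, OCR misreads) is exactly what this sketch leaves out.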
