Larry Sanger @lsanger@sangerfeed.org · 1y

I might have asked this before, but does anybody know of an open source PDF (or JP2) to HTML tool, or, to be clearer, a tool that turns a digitized book into a single HTML page? A colleague and I have talked about this, and we have both done work in this direction. I am just struck by how strange it is that such a thing doesn’t already exist. It seems rather obvious that it should be a command-line tool, because it’s such a basic and needed task. I’m too old to want to reinvent wheels.

I guess I’m stuck with Tesseract + postprocessing, but the postprocessing is endless. Pretty sure I’ll just install a lightweight LLM on my desktop and use that.
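
For anyone curious what the simplest slice of that postprocessing looks like, here is a rough sketch, not a finished tool. It assumes each page has already been OCR'd to plain text (for example, the text files Tesseract writes), and the function name `ocr_pages_to_html` is made up for illustration. It only rejoins words hyphenated across line breaks, merges lines into paragraphs, and wraps everything in one HTML page.

```python
import html
import re


def ocr_pages_to_html(pages, title="Untitled"):
    """Join plain-text OCR pages into a single HTML document string.

    Hypothetical helper: `pages` is assumed to be a list of strings,
    one per OCR'd page, in reading order.
    """
    text = "\n\n".join(pages)

    # Rejoin words hyphenated across a line break: "exam-\nple" -> "example".
    # Caveat: this also fuses genuine compounds split at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    paragraphs = []
    # Treat blank lines as paragraph boundaries.
    for block in re.split(r"\n\s*\n", text):
        # Collapse remaining line breaks and runs of whitespace.
        para = " ".join(block.split())
        if para:
            paragraphs.append(f"<p>{html.escape(para)}</p>")

    body = "\n".join(paragraphs)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>\n"
    )
```

Everything that makes the postprocessing feel endless (running headers, page numbers, footnotes, columns, OCR misreadings) is deliberately left out, and a paragraph that continues across a page break ends up split in two.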