I saw someone else asking how you’d download protected, view-only Drive documents. That reminded me of another annoying characteristic of PDFs: they’re ‘protected’.

How do you deal with PDFs that are inherently uncustomisable and have fixed formatting? I appreciate that KOReader and other readers can reflow text, but I’d prefer not to rely on that; EPUB, TXT or any other customisable format would be better.

Any good methods of PDF to text/epub out there?

  • ChaoticNeutralCzech@feddit.de · ↑14 · 1 year ago (edited)

    Yeah, PDFs frequently make each line, word or sometimes letter its own textbox to ensure consistency when rendering. Did you try the open-source ebook manager Calibre for desktop, or Librera Reader for Android?

    Also, utilities like qpdf can remove a PDF password (provided you know it of course – though you can guess with quick, unlimited retries).
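For the case where you do know the password, here is a minimal sketch of driving qpdf from Python. It assumes the `qpdf` binary is installed and on your PATH; the function names and file names are made up for illustration.

```python
import subprocess


def qpdf_decrypt_command(src, dst, password):
    """Build the qpdf argument list that writes an unencrypted copy of src to dst."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_pdf(src, dst, password):
    """Run qpdf; raises CalledProcessError if qpdf exits with a non-zero status."""
    subprocess.run(qpdf_decrypt_command(src, dst, password), check=True)


if __name__ == "__main__":
    # Hypothetical file names and password; nothing is executed here,
    # the command is only printed.
    print(" ".join(qpdf_decrypt_command("locked.pdf", "unlocked.pdf", "secret")))
```

The command is built separately from the call that runs it, so you can inspect it before touching any files.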

  • med@sh.itjust.works · ↑10 ↓1 · 1 year ago

    Calibre is the way to go. It’ll convert quite happily to EPUB, HTML, whatever. I just converted the Linux From Scratch book PDF into EPUB and MOBI for my Kindle.
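Calibre also ships a command-line converter, `ebook-convert`, so the same conversion can be scripted. A rough sketch, assuming Calibre is installed and `ebook-convert` is on your PATH; the helper names and the file name are hypothetical.

```python
import subprocess
from pathlib import Path


def ebook_convert_command(src, fmt="epub"):
    """Build an ebook-convert call that writes next to src with the new extension."""
    src = Path(src)
    dst = src.with_suffix("." + fmt)
    return ["ebook-convert", str(src), str(dst)]


def convert(src, fmt="epub"):
    """Run the conversion; raises CalledProcessError if ebook-convert fails."""
    subprocess.run(ebook_convert_command(src, fmt), check=True)


if __name__ == "__main__":
    # Only print the commands; "book.pdf" is a placeholder name.
    for fmt in ("epub", "mobi"):
        print(" ".join(ebook_convert_command("book.pdf", fmt)))
```

ebook-convert picks the output format from the output file's extension, which is why the sketch only swaps the suffix.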

    If you just need to edit a pdf and change some formatting on a line, try LibreOffice Draw!

  • kniescherz@feddit.de · ↑4 ↓1 · 1 year ago

    No easy ones I know of. Have done it but it was a pain.

    1. Crop the pages so that headers, footnotes and page numbers aren’t part of the book anymore. Use Briss for that.

    2. Convert to HTML; not sure which is the best software, though.

    3. Clean up manually. That is the most work, mostly newlines and stuff. Don’t even try without regex.
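The newline cleanup in step 3 could look roughly like this for plain text. It is only a sketch of the idea, not the exact patterns anyone in this thread used, and the function name is made up. The de-hyphenation rule will also join genuine compounds that happen to break at a line end (e.g. "well-\nknown"), so check the result by hand.

```python
import re


def unwrap_paragraphs(text):
    """Turn hard-wrapped PDF text back into one line per paragraph."""
    # Normalise Windows line endings first
    text = text.replace("\r\n", "\n")
    # Rejoin words hyphenated across a line break: "for-\nmatting" -> "formatting"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # A single newline (no blank line on either side) is just a wrap: make it a space
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces/tabs left over from the layout
    text = re.sub(r"[ \t]+", " ", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "PDFs have fixed for-\nmatting and hard\nline breaks.\n\nSecond para-\ngraph here.\n"
    print(unwrap_paragraphs(sample))
```

Blank lines are treated as paragraph breaks, so this only works if the conversion step kept them.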

    I don’t bother anymore. For smaller pages I do the cropping and read on my Kindle; larger ones I read on a tablet.

    Have a look at this forum for more in-depth knowledge: https://www.mobileread.com/forums/forumdisplay.php?f=184

          • kniescherz@feddit.de · ↑4 · 1 year ago

            Wow, who downvotes you?

            That quote is pretty spot on. I feel that there might be well-formatted, simple PDFs which could be reformatted, but as soon as it gets busier or the PDF isn’t well structured behind the scenes, it gets messy.

            • Historical_General@lemmy.world (OP) · ↑2 · 1 year ago (edited)

              I think some idiots/bots from another instance did it over a political topic, lol. They’ve done it to a few comments. We haven’t left reddit apparently.

              Do you think lemmy admins could see the downvoters and check if they’re using bots?

                • Historical_General@lemmy.world (OP) · ↑2 ↓1 · 1 year ago

                  Yeah I noticed that.

                  I just checked: there are 40 downvotes (78 upvotes). That’s weird. I checked other posts and they don’t seem to have anything like that; I saw none in the double digits.

                  I might just have to keep a separate account for posting politics, which is sad but necessary so that my posts/comments are ranked according to utility and not just downvoted by angry nerds. I’ll probably make a post about this somewhere too.

  • liliumstar@lemmy.dbzer0.com · ↑2 · 1 year ago

    If you have one of those really annoying PDFs where the structure is all crazy or some letters are pictures, etc., it is possible to OCR them with a mask over the page numbers so those stay out of the text.

    There are also tools which can just extract the text elements and smush them together, but as others have said, this doesn’t always work as intended.
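To give an idea of what that "smushing" involves, here is a toy sketch. It assumes the text boxes have already been pulled out of the PDF as `(x0, x1, y, text)` tuples, with y growing down the page. That tuple layout, the function name and the tolerance values are all invented for illustration; real extraction tools use their own structures and heuristics.

```python
def merge_fragments(fragments, line_tol=2.0, space_gap=1.5):
    """Group (x0, x1, y, text) fragments into lines and join them left to right."""
    lines = []  # each entry: [y of the line's first fragment, [fragments]]
    for frag in sorted(fragments, key=lambda f: (f[2], f[0])):
        # Same line if the y position is within line_tol of the line's first fragment
        if lines and abs(frag[2] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[2], [frag]])

    out = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left-to-right reading order
        parts = []
        prev_x1 = None
        for x0, x1, _y, text in frags:
            # A horizontal gap wider than space_gap is treated as a word space
            if prev_x1 is not None and x0 - prev_x1 > space_gap:
                parts.append(" ")
            parts.append(text)
            prev_x1 = x1
        out.append("".join(parts))
    return "\n".join(out)


if __name__ == "__main__":
    boxes = [(14, 30, 10, "there"), (0, 20, 30, "Next"), (5, 10, 10.5, "i"), (0, 5, 10, "H")]
    print(merge_fragments(boxes))
```

Multi-column pages, rotated text and sloppy positioning all break a simple sort like this, which is one reason the results don't always come out as intended.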