Very strange copy-paste bug in PDF document

  • #1
DaveC426913
TL;DR Summary
I have encountered a very strange bug in a PDF document that corrupts text when it is copied and pasted.
I am trying to copy the contents of this PDF into another document (it doesn't seem to matter which document type).

No matter how much or how little I copy into the clipboard, and no matter which text editing field I paste it into (an email, Notepad), I get the same thing:

The text I'm copying reads "... affordable!"
But what gets pasted is "... aJordable!"

At first I wondered if it was some sort of real-time OCR interpretation thing happening. I have tried this six ways from Sunday. In fact, all I need to copy is the "ff" and I still get "J".

I'd like a sanity check and a second pair of eyes. If someone is willing, I'll PM the PDF to you directly (it's only about 29 KB).
(It's a coworker's document, and it's just some marketing text copy so it's harmless).

Anyone game?

(To be clear: of course I can just type it out myself; that's not the point. Our department lives and dies on marketing text. If PDFs can't be trusted, that's a heck of an apple cart to be upset.)
 
  • #2
DaveC426913 said:
(it doesn't seem to matter which document type)
Does that include vanilla *.txt files?

Is this on Windows or on a Mac/Linux?
 
  • #3
DaveC426913 said:
But what gets pasted is "... aJordable!"
It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?

[Attachment: 1728338597350.png]
 
  • #4
You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.

This can be fixed at PDF creation.
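
To make the mechanism above concrete, here is a toy model in Python. Everything in it is an assumption made up for illustration: the character codes, the `DRAWN_GLYPH` table standing in for an embedded font, and the `GOOD_TO_TEXT` table standing in for the code-to-text mapping a PDF writer can include. It is not how any particular viewer is implemented; it only shows why the page can display "ff" while the clipboard receives "J".

```python
# Toy model only. The codes and tables are hypothetical, chosen so that the
# ligature glyph happens to sit at the code that normally means "J" (0x4A).

# Hypothetical embedded font: character code -> shape drawn on the page.
DRAWN_GLYPH = {
    0x61: "a", 0x4A: "ff", 0x6F: "o", 0x72: "r",
    0x64: "d", 0x62: "b", 0x6C: "l", 0x65: "e",
}

# Hypothetical code-to-text table written at PDF creation time.
GOOD_TO_TEXT = {0x4A: "ff"}


def rendered(codes):
    """What the reader sees: each code is drawn with the embedded font."""
    return "".join(DRAWN_GLYPH[c] for c in codes)


def copied(codes, to_text=None):
    """What lands on the clipboard: use the table if present,
    otherwise fall back to treating the code as a plain character."""
    to_text = to_text or {}
    return "".join(to_text.get(c, chr(c)) for c in codes)


codes = [0x61, 0x4A, 0x6F, 0x72, 0x64, 0x61, 0x62, 0x6C, 0x65]

print(rendered(codes))              # affordable
print(copied(codes))                # aJordable
print(copied(codes, GOOD_TO_TEXT))  # affordable
```

In this sketch the fix "at PDF creation" corresponds to shipping the code-to-text table with the document, so the copy step no longer has to guess.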
 
  • #5
berkeman said:
Does that include vanilla *.txt files?
I'll try that too.

berkeman said:
Is this on Windows or on a Mac/Linux?
Mac. But I'll test it on my PC.
berkeman said:
It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?
I don't see how that's relevant. I am literally copying and pasting text. It sure better not be auto-correcting when I'm doing that.
 
  • #6
It looks like V50 has figured out the problem. Can you look at the hex source for this using a hex editor?
 
  • #7
DaveC426913 said:
Mac. But I'll test it on my PC.
Same thing on my PC.
 
  • #8
Vanadium 50 said:
You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.
That is ... alarming.

It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.

Vanadium 50 said:
This can be fixed at PDF creation.
Hopefully, there is a setting or simple method that can defeat this feature across the board. Otherwise, moving forward, the creators will never even know they've made this mistake.
 
  • #9
DaveC426913 said:
It means we cannot trust that what we copy from a client's document will be published faithfully.
That is correct. And you never could. What goes on the clipboard is what the application says to put on the clipboard. Full stop. Usually that is what you just selected, but not always. It's been this way since before PDF.

The same thing can happen with Word. I don't know if and when it does, but the OS does not prevent it. No OS does.
 
  • #10
ChatGPT-4o identified the Unicode special character for the "ff" ligature as U+FB00:

https://www.compart.com/en/unicode/U+FB00

ChatGPT-4o also suggests the following reasons:

The replacement of “ff” with “j” when copying text could be caused by several factors, most likely related to font encoding, text recognition software, or copying from a PDF or scanned document. Here are a few potential reasons:

1. Font Issues: Some fonts use ligatures for “ff,” which are special characters that combine two or more letters into one glyph. When copied, the system might misinterpret the ligature as another character, such as “j,” especially if the text is being transferred to a different system or program that doesn’t support the original font.

2. Optical Character Recognition (OCR) Errors: If the document was scanned and converted to text using OCR software, it might have misrecognized the “ff” ligature as a “j,” due to the similarity in appearance in certain fonts.

3. Encoding Mismatch: If you’re copying from a PDF or another document type that uses a specific encoding (such as a custom encoding or one that substitutes characters), it may result in incorrect character substitutions when the text is copied to a program with a different encoding system.

To avoid this, try the following:

Change Fonts: Before copying the text, change the font in the source document to something more standard like Arial or Times New Roman, which avoids using complex ligatures.

Convert to Plain Text: Paste the copied text into a plain text editor (like Notepad) first, then copy it again from there.

Use OCR Carefully: If OCR is involved, check the OCR output for errors, or use different OCR software that might better recognize the characters.
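
For the case where pasted text really does contain the U+FB00 ligature character, here is a minimal Python sketch using only the standard library. The function names `expand_ligatures` and `suspicious_characters` are my own, and the choice of which code point ranges to flag is an assumption about what is worth a second look. Note that this would not catch the "aJordable" case from the first post, because a plain "J" is a perfectly ordinary character.

```python
import unicodedata


def expand_ligatures(text):
    """Apply NFKC compatibility normalization, which expands
    ligature characters such as U+FB00 into their plain letters."""
    return unicodedata.normalize("NFKC", text)


def suspicious_characters(text):
    """Return (index, code point) pairs for characters in the
    Alphabetic Presentation Forms block (U+FB00 to U+FB4F) or the
    Private Use Area (U+E000 to U+F8FF)."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        in_presentation_forms = 0xFB00 <= cp <= 0xFB4F
        in_private_use = 0xE000 <= cp <= 0xF8FF
        if in_presentation_forms or in_private_use:
            hits.append((index, f"U+{cp:04X}"))
    return hits


pasted = "a\ufb00ordable"
print(suspicious_characters(pasted))  # [(1, 'U+FB00')]
print(expand_ligatures(pasted))       # affordable
```

NFKC also rewrites other compatibility characters (superscript digits and full-width letters, for example), so it is probably safer to flag and review than to normalize marketing copy blindly.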
 
  • #11
If you were to look at that Unicode character with a dump command, it would be represented in UTF-8 as a three-byte sequence 0xEF 0xAC 0x80.
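
If you don't have a dump command handy, the same bytes can be checked from Python with nothing beyond the standard library (the variable names are just for illustration):

```python
import unicodedata

ligature = "\ufb00"                 # the single "ff" ligature character
encoded = ligature.encode("utf-8")  # its UTF-8 byte sequence

print(unicodedata.name(ligature))   # LATIN SMALL LIGATURE FF
print(encoded.hex(" "))             # ef ac 80
```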
 
  • #12
I have seen some other weirdness when copying from certain PDF files made with my OCR reader, namely reversed text. As an example, "affordable" would be pasted as "e l b a d r o f f a".

My feeling was that the Chinese OCR software had some I18N/NLS setting for right-to-left instead of left-to-right, but I was never able to find it, and their support was non-existent.
 
  • #13
DaveC426913 said:
It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.
Sounds like it's time to contact Adobe customer support, right?
 
  • #14
To be clear, this was not a scanned doc.

jedishrfu said:
Change Fonts: Before copying the text, change the font in the source document to something more standard like Arial or Times New Roman, which avoids using complex ligatures.
I think this is going to become the company policy: use standard fonts only.
 
  • #15
DaveC426913 said:
To be clear, this was not a scanned doc.
Then scan it into text, not PDF.
 
  • #16
To sweep away the cruft: Dave is doing OCR. While OCR is getting pretty good, expecting perfection is unrealistic.

Going through PDF just adds an unnecessary complication and point of failure.
 
  • #17
Vanadium 50 said:
Then scan it into text, not PDF,
Not sure what you mean; it's not scanned at all.

The designer has typed this in (to what, I'm not sure) and saved it out as PDF (presumably as part of their preferred routine, for a number of useful reasons). The point being, I can select the text as text (i.e. with text select tools) in the PDF.

I don't really have a say over what the designers use to work in. But I think I can make a case for insisting that they don't mess with the font when delivering text assets.
 
