Very strange copy-paste bug in PDF document

In summary, the article discusses a peculiar bug encountered when copying and pasting text from a PDF document, where the pasted text appears distorted or altered compared to the original. This issue may arise due to the way PDFs encode text, leading to unexpected characters or formatting when transferred. The article explores potential causes, such as font embedding and encoding discrepancies, and suggests troubleshooting steps to mitigate the problem, including using different PDF readers or converting the document to a more edit-friendly format.
  • #1
DaveC426913
TL;DR Summary
I have encountered a very strange bug in a PDF document that corrupts the output of the copy-paste function.
I am trying to copy the contents of this PDF into another document (it doesn't seem to matter which document type).

No matter how much or how little I copy into the clipboard, and no matter where I paste it (an email, Notepad, any text-editing field), I get the same thing:

The text I'm copying reads "... affordable!"
But what gets pasted is "... aJordable!"

At first I wondered if it was some sort of real-time OCR interpretation thing happening. I have tried this six ways from Sunday. In fact, all I need to copy is the "ff" and I still get "J".

I'd like a sanity check and a second pair of eyes. If someone is willing, I'll PM the PDF to you directly (it's like 29 KB).
(It's a coworker's document, and it's just some marketing text copy so it's harmless).

Anyone game?

(To be clear: of course I can just write it out myself; that's not the point. Our department lives and dies on marketing text. If PDFs can't be trusted, that's a heck of an apple cart to be upset.)
 
  • #2
DaveC426913 said:
(it doesn't seem to matter which document type)
Does that include vanilla *.txt files?

Is this on Windows or on a Mac/Linux?
 
  • #3
DaveC426913 said:
But what gets pasted is "... aJordable!"
It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?

 
  • #4
You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.

This can be fixed at PDF creation.
 
  • #5
berkeman said:
Does that include vanilla *.txt files?
I'll try that too.

berkeman said:
Is this on Windows or on a Mac/Linux?
Mac. But I'll test it on my PC.
berkeman said:
It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?
I don't see how that's relevant. I am literally copying and pasting text. It sure better not be auto-corrupting when I'm doing that.
 
  • #6
It looks like V50 has figured out the problem. Can you look at the hex source for this using a hex editor?
 
  • #7
DaveC426913 said:
Mac. But I'll test it on my PC.
Same thing on my PC.
 
  • #8
Vanadium 50 said:
You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.
That is ... alarming.

It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.

Vanadium 50 said:
This can be fixed at PDF creation.
Hopefully there is a setting or simple method that can defeat this feature across the board. Otherwise, moving forward, the creators will never even know they've made this mistake.
 
  • #9
DaveC426913 said:
It means we cannot trust that what we copy from a client's document will be published faithfully.
That is correct. And you never could. What goes on the clipboard is what the application says to put on the clipboard. Full stop. Usually that is what you just selected, but not always. It's been this way since before PDF.

The same thing can happen with Word. I don't know if and when it does, but the OS does not prevent it. No OS does.
 
  • #10
If you were to look at that Unicode character with a dump command, it would be represented in UTF-8 as a three-byte sequence 0xEF 0xAC 0x80.
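
For anyone who wants to check that byte sequence without a hex editor, here is a minimal sketch in Python. It assumes the character in question is U+FB00 (LATIN SMALL LIGATURE FF), as suggested elsewhere in the thread; the variable names are just for illustration.

```python
# U+FB00 is a single code point that takes three bytes in UTF-8.
ligature = "\ufb00"
ligature_bytes = ligature.encode("utf-8")
ligature_dump = " ".join(f"0x{b:02X}" for b in ligature_bytes)
print(ligature_dump)  # 0xEF 0xAC 0x80

# The plain two-letter sequence "ff" is two one-byte ASCII characters.
plain_bytes = "ff".encode("utf-8")
plain_dump = " ".join(f"0x{b:02X}" for b in plain_bytes)
print(plain_dump)  # 0x66 0x66
```

So a dump of the pasted text would show three bytes where two were expected, if the ligature code point is what actually landed on the clipboard.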
 
  • #11
I have seen some other weirdness when copying from certain PDF files made with my OCR reader, namely reversed text. As an example, "affordable" would be pasted as "e l b a d r o f f a".

My feeling was that the Chinese OCR software had some I18N/NLS setting for right-to-left instead of left-to-right, but I was never able to find it, and their support was non-existent.
 
  • #12
DaveC426913 said:
It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.
Sounds like it's time to contact Adobe customer support, right?
 
  • #13
To be clear, this was not a scanned doc.

jedishrfu said:
Change Fonts: Before copying the text, change the font in the source document to something more standard like Arial or Times New Roman, which avoids using complex ligatures.
I think this is going to become the company policy: use standard fonts only.
 
  • #14
DaveC426913 said:
To be clear, this was not a scanned doc.
Then scan it into text, not PDF.
 
  • #15
To sweep away the cruft: Dave is doing OCR. While OCR is getting pretty good, expecting perfection is unrealistic.

Going through PDF just adds an unnecessary complication and point of failure.
 
  • #16
Vanadium 50 said:
Then scan it into text, not PDF.
Not sure what you mean; it's not scanned at all.

The designer has typed this in (to what, I'm not sure) and saved it out as PDF. (Presumably, as part of their preferred routine, for a number of useful reasons.) The point being I can select the text as text (i.e. with text select tools) in the PDF.

I don't really have a say over what the designers use to work in. But I think I can make a case for insisting that they don't mess with the font when delivering text assets.
 
  • #17
OK, so we got stuck down an OCR rabbit hole. Without it, the story is still simple - the document was printed with (possibly tacit) instructions to make it look pretty, and to adjust whatever it needed to in order to make that happen. This is a side effect.

If you print in Lucida Console, this is less likely to happen, and your documents will be butt-ugly. Take your pick.
 
  • #18
Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions.

This is a doc that a designer colleague has passed to me to be implemented on a page. It's entirely internal use, and it is straight up text.

The choice of PDF as a transfer medium is likely merely a professional habit on his part. (There are good reasons for it, but here is one of the downsides).

It does help, in that he is able to show me how he wants it laid out - bolding, paragraphs, etc.

But his choice of font is irrelevant, since I am stripping everything out and using the site's CSS styling anyway.

I have asked him to resend it in Word, which does not deign to know better than the user about what the user wants. If I'm lucky, this doesn't have to be a department-wide mandate; I think other designers use Word.
 
  • #19
DaveC426913 said:
This is a doc that a designer colleague has passed to me
DaveC426913 said:
I have asked him to resend it in Word
Does your work use some kind of formal document control system? In my previous work, we used various systems over the years (like Arena and Agile) for document control and revision control. One of the requirements for releasing a document was to attach the source file (in whatever format like Word or the format of the designing tool), and to attach a PDF copy that anybody could open even if they did not have the design tool software.
 
  • #20
berkeman said:
Does your work use some kind of formal document control system?
Our department is small enough that we don't have such policies yet. But this incident has occurred because we are growing.

It makes me aware I've come from much bigger companies. I suspect implementation of standard policies is going to be something I will be harping on.
 
  • #21
I think there is some confusion here. Some elements of this have been touched on by others, but I think you are still missing the full picture.

The problem does not lie with Windows, Word, Adobe or the PDF format - all of these are doing exactly what they are supposed to.

The problem lies with the software that your designer is using to create the content: I would guess that this is some professional layout software such as InDesign or QuarkXPress. This sort of software is not designed for creating content; it is designed to prepare a document for presentation, perhaps electronically or more often in print. In order to do this it follows some common printing conventions which have been used for hundreds of years, including the use of ligatures, where some letter combinations (in this case 'ff') are replaced with a special glyph (in this case 'ﬀ'). When the document is rendered as a PDF, it is the code for this glyph (Unicode U+FB00 'Latin Small Ligature Ff') that is inserted into the text, and so this code is what is copied across to your Word document.
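
To make the "one glyph code instead of two letters" point concrete, here is a small sketch using Python's standard `unicodedata` module. The string `pasted` is a hypothetical stand-in for what lands on the clipboard, assuming the ligature arrives as U+FB00; it is not taken from the actual PDF.

```python
import unicodedata

# Hypothetical clipboard contents: "affordable" with the ligature code point.
pasted = "a\ufb00ordable"
expected = "affordable"

# The pasted string is one character shorter than the word it displays.
print(len(pasted), len(expected))  # 9 10

# Ask the Unicode database what the second character actually is.
ch = pasted[1]
char_name = unicodedata.name(ch)
char_decomposition = unicodedata.decomposition(ch)
print(f"U+{ord(ch):04X}", char_name)  # U+FB00 LATIN SMALL LIGATURE FF
print(char_decomposition)             # <compat> 0066 0066
```

The `<compat> 0066 0066` entry is the Unicode database saying this character is a compatibility form of two plain 'f' characters (U+0066).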

What happens next depends on what version of Word you are using and what fonts you have installed*, but I won't go into this now as it is not going to help you.

What is going to help is asking the 'designer' (although here they are simply a content creator) to turn off ligatures (and kerning and any other layout features) in the copy they send you. If you still need them to provide a fully laid out page to see what the intended layout is then they can do this as well, in an additional copy.



* in particular it seems that the font you are using does not implement this glyph correctly: I am pretty confident however that the 'ﬀ' will render properly in your web browser here because this is part of modern web standards.
 
  • #22
pbuk said:
What is going to help is asking the 'designer' (although here they are simply a content creator) to turn off ligatures (and kerning and any other layout features) in the copy they send you. If you still need them to provide a fully laid out page to see what the intended layout is then they can do this as well, in an additional copy.
Yes. This.

You are certainly right that I need their deliverables that show layout. In this case, it was straight copy, and copy should really be sent under separate cover - as you point out.

I have spoken with the designer (who, it turns out, got this from a third party stakeholder) and we are in agreement that Word is the proper tool. Jury is still out on whether we can start telling third party stakeholders what tools they're allowed to communicate with.
 
  • #23
DaveC426913 said:
Jury is still out on whether we can start telling third party stakeholders what tools they're allowed to communicate with.
A first-pass approach; it would need polishing before decreeing anything, especially the phrases in <...>:

"Since we do not have every text layout program in existence, <would you> kindly supply a copy in PDF <Word/plain text> format in addition to your layout requirements? This will expedite <the/our> <processing of your project>."

Hope this helps!

Cheers,
Tom
 
  • #24
Let's not lose focus. Fixing this takes 5 or 10 seconds. To avoid wasting 5 or 10 seconds in the future, we will have meetings, rules, standards, meetings about rules and standards and thus ensure we will never waste 5 or 10 seconds again!
 
  • #25
Vanadium 50 said:
Let's not lose focus. Fixing this takes 5 or 10 seconds. To avoid wasting 5 or 10 seconds in the future, we will have meetings, rules, standards, meetings about rules and standards and thus ensure we will never waste 5 or 10 seconds again!
I know (hope) you're kidding around but it isn't a 5 or 10 second fix.

Unless everybody in the chain is schooled on not using PDFs to pass copy around, there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time, for every revision that comes down the pipe. And even if we did, we can't be sure what the copy was supposed to be (what if it occurred in a date or dollar value?).

By the time it gets to me, I've got a compressed timeline and cannot afford to proof every word. Chasing down the original copy writer for a change causes ripples that travel up and down the line of approvers and stakeholders, which results in delays and missed deadlines.

It's not the time, it's the consequences: missed publish deadlines or errors in the product. Both are unprofessional and new-one tearing-worthy.
 
  • #26
Yes it is. Two letters, "ff", were replaced by the "ﬀ" ligature. To fix, replace the ligature with ff.
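
As a rough sketch of what "replace the ligature with ff" could look like when done programmatically rather than by hand: the table and function names below are made up for illustration, and it assumes the pasted text contains the standard Unicode Latin ligature code points.

```python
# Hypothetical helper: map the common Latin ligature code points back to
# their plain letter sequences. Only these five are handled here.
LIGATURE_TABLE = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}

def unfold_ligatures(text: str) -> str:
    """Return text with the ligatures in LIGATURE_TABLE spelled out."""
    return text.translate(LIGATURE_TABLE)

print(unfold_ligatures("a\ufb00ordable!"))  # affordable!
```

An explicit table is used instead of `unicodedata.normalize("NFKC", ...)` because NFKC also rewrites other characters (for example, the trademark sign becomes the letters "TM"), which may not be wanted in marketing copy.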

You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"
 
  • #27
Vanadium 50 said:
Yes it is. Two letters, "ff", were replaced by the "ﬀ" ligature. To fix, replace the ligature with ff.
You speak out-of-turn. Please re-read my post as to why it is not nearly this simple, particularly this:

"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).
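
One way to hunt for lurking ligatures without proofing every word is to scan the pasted copy mechanically. This is only a sketch: `find_ligatures` is a hypothetical helper, and it only catches ligatures that arrive as code points in the U+FB00 to U+FB06 range. It would not catch a case where the PDF maps the glyph to an ordinary letter such as "J".

```python
import unicodedata

def find_ligatures(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, Unicode name) for each Latin ligature found.

    Lines and columns are 1-based. Only U+FB00..U+FB06 are checked.
    """
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if 0xFB00 <= ord(ch) <= 0xFB06:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

sample = "Now even more\na\ufb00ordable!"
for line_no, col, name in find_ligatures(sample):
    print(f"line {line_no}, column {col}: {name}")
```

An empty result means none of those seven code points are present; it does not prove the copy is otherwise faithful.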

Vanadium 50 said:
You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"
Thanks but I'm pretty sure you don't know the processes in a given department in an unknown business or how many people are involved or how approvals travel up and down the chain.

One doesn't necessarily know who the originator is, or how to reach them right before the deadline - what one does is establish policies i.e. anticipate problems rather than react to them in crunch time - which is what I'm doing here by figuring out exactly what the problem is, so I can justify why the entire department has to take on a new policy.
 
  • #28
It's your time. Spend it however you want.
 
  • #29
Vanadium 50 said:
Yes it is. Two letters, "ff", were replaced by the "ﬀ" ligature. To fix, replace the ligature with ff.

You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"
As someone who is just coming in and reading this thread, I agree with the OP: he has already explained why it isn't that simple, and why he is approaching it the way he is.
 
  • #30
The fundamental problem is that copy-and-paste provides no guarantee of accuracy. What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read. I could, if I were inclined to be perverse, write a program that, no matter what was copied to the clipboard, put "Tim Horton's has stale donuts" there. (Likewise with the paste.)

If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.
 
  • #31
Vanadium 50 said:
The fundamental problem is that copy-and-paste provides no guarantee of accuracy.
What would be an alternative?

I can't write the copy myself merely by reading what someone wrote. That's prone to even more errors.

Copy and paste does guarantee accuracy if you establish standards. A straight up .txt document will faithfully copy almost everything. Enough that the list of exceptions is reduced to a manageable level.

Vanadium 50 said:
What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read.
Which is why you set standards that are known to guarantee compatibility.

Vanadium 50 said:
I could, if I were inclined to be perverse, write a program that,
You could, yes. That would be dishonest.

You could also use invisible ink or pour sugar in our gas tanks, if you were of a mind.

What is the point in arguing absurdities?

It troubles me that you seem to consider sabotage a valid concern. Or are you joking around?

Vanadium 50 said:
If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.
No you don't.

First thing you do is hire people who are not saboteurs.
Second thing you do is establish standards.

I am not sure why you appear to be treating this as a joke. This is a real, consequential issue. I was just lucky to catch this error because it happened to be the last word in the copy. Frankly I still don't know if I published any other misspellings because I had to publish what I got and couldn't afford to stop everything to ask the chain of stakeholders to rewrite.

What if it had been a dollar figure? How would I ever know? I can't report something I don't know about. And then, before it's caught, some customer tries to sue us for fraudulent pricing. Still a joke?
 
  • #32
The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.
 
  • #33
Vanadium 50 said:
The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.
Nobody said 'perfect', except you (three times now, including post 15). I think this is carrying the argument to an impractical (ad absurdum) degree, and I'm not sure why.

What I actually said (post 31) was "...the list of exceptions is reduced to a manageable level."

One should be able to count on standard text copying as standard text. The ASCII string 'affordable' doesn't get mangled by a simple copy paste.

So we will establish a policy that uses the lowest common denominator of text representation.
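
A "lowest common denominator" policy like this can be checked mechanically. The sketch below assumes the policy means plain ASCII only, which is my reading rather than a stated rule; `non_ascii_chars` is a hypothetical helper name. Legitimate copy with accented letters or curly quotes would also be flagged, so the result is a list to review, not a verdict.

```python
def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if not ch.isascii()})

clean = "Now even more affordable!"
suspect = "Now even more a\ufb00ordable!"

print(non_ascii_chars(clean))  # []
print([f"U+{ord(ch):04X}" for ch in non_ascii_chars(suspect)])  # ['U+FB00']
```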
 
  • #34
"Imperfect fidelity acceptable to Dave" is not a clear spec.

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.
 
  • #35
Vanadium 50 said:
"Imperfect fidelity acceptable to Dave" is not a clear spec.
Clear for whom? You?
What matters is that it is sufficiently clear to our department.

Vanadium 50 said:
I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.
You told me I was scanning it and using OCR (you said "sweeping away the cruft").
Then you told me it was a simple ten second fix, no matter how many ways I explained that it was not.
Then you mocked me because you apparently didn't understand how processes in departments work.

You've been mansplaining. That's what you've been getting push back on. (Is it "grief" if I am telling you where your assumptions are incorrect and asking why you are making a joke of this?).

I have been trying to be diplomatic: "Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions." This is me trying to not point fingers, despite my frustration that you are giving me grief at every turn.

That being said, thank you sincerely for your help.
 