Very strange copy-paste bug in PDF document

DaveC426913 · Oct 7, 2024

I am trying to copy the contents of this PDF into another document (it doesn't seem to matter which document type).

No matter how much or how little I copy into the clipboard, no matter whether I paste it into a text editing field - an email, or notepad - I get the same thing:

The text I'm copying reads "... affordable!"
But what gets pasted is "... aJordable!"

At first I wondered if it was some sort of real-time OCR interpretation thing happening. I have tried this six ways from Sunday. In fact, all I need copy is the "ff" and I still get "J".

I'd like a sanity check and a second pair of eyes. If someone is willing, I'll PM the PDF to you directly (it's like 29kb).
(It's a coworker's document, and it's just some marketing text copy so it's harmless).

Anyone game?

(To be clear: of course I can just write it out myself; that's not the point. Our department lives and dies on marketing text. If PDFs can't be trusted, that's a heck of an apple cart to be upset.)

berkeman · Oct 7, 2024

DaveC426913 said:

(it doesn't seem to matter which document type)

Does that include vanilla *.txt files?

Is this on Windows or on a Mac/Linux?

berkeman · Oct 7, 2024

DaveC426913 said:

But what gets pasted is "... aJordable!"

It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?

Vanadium 50 · Oct 7, 2024

You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.

This can be fixed at PDF creation.

DaveC426913 · Oct 7, 2024

berkeman said:

Does that include vanilla *.txt files?

I'll try that too.

berkeman said:

Is this on Windows or on a Mac/Linux?

Mac. But I'll test it on my PC.

berkeman said:

It looks like that really is a word after all. Can you delete it from your dictionary to see if that helps? Is affordable in your dictionary at the moment?

I don't see how that's relevant. I am literally copying and pasting text. It sure better not be auto-corrupting when I'm doing that.

berkeman · Oct 7, 2024

It looks like V50 has figured out the problem. Can you look at the hex source for this using a hex editor?

DaveC426913 · Oct 7, 2024

DaveC426913 said:

Mac. But I'll test it on my PC.

Same thing on my PC.

DaveC426913 · Oct 7, 2024

Vanadium 50 said:

You are pasting what is there. PDF has replaced a system font with one of its own that looks better for some combinations. ff is a common one. However, its system representation is not ff, but Alt-K or something, so that's what goes on the clipboard.

That is ... alarming.

It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.

Vanadium 50 said:

This can be fixed at PDF creation.

Hopefully, there is a setting or simple method that can defeat this feature across the board; otherwise, moving forward, the creators will never even know they've made this mistake.

Vanadium 50 · Oct 7, 2024

DaveC426913 said:

It means we cannot trust that what we copy from a client's document will be published faithfully.

That is correct. And you never could. What goes on the clipboard is what the application says to put on the clipboard. Full stop. Usually that is what you just selected, but not always. It's been this way since before PDF.

The same thing can happen with Word. I don't know if and when it does, but the OS does not prevent it. No OS does.

jedishrfu · Oct 7, 2024

If you were to look at that Unicode character with a dump command, it would be represented in UTF-8 as a three-byte sequence 0xEF 0xAC 0x80.
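The byte sequence can be checked without a dump command. A minimal Python sketch, assuming the pasted character really is U+FB00 (the variable names are only illustrative):

```python
import unicodedata

# U+FB00 is a single code point, not two separate "f" characters.
ligature = "\ufb00"

# Encode it as UTF-8 and show each byte in hex.
utf8_bytes = ligature.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))  # 0xEF 0xAC 0x80

# Ask the Unicode database what the character is called.
print(unicodedata.name(ligature))  # LATIN SMALL LIGATURE FF
```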

jedishrfu · Oct 7, 2024

I have seen some other weirdness when copying from certain PDF files made with my OCR reader, namely reversed text. As an example, "affordable" would be pasted as "e l b a d r o f f a".

My feeling was that the Chinese OCR software had some I18N NLS setting for right-to-left instead of left-to-right, but I was never able to find it and their support was non-existent.

berkeman · Oct 7, 2024

DaveC426913 said:

It means we cannot trust that what we copy from a client's document will be published faithfully.

If we don't find a solution, we'll have to proof every single word in all our content. Or we'll have to stop using PDF format.

Sounds like it's time to contact Adobe customer support, right?

DaveC426913 · Oct 7, 2024

To be clear, this was not a scanned doc.

jedishrfu said:

Change Fonts: Before copying the text, change the font in the source document to something more standard like Arial or Times New Roman, which avoids using complex ligatures.

I think this is going to become the company policy: use standard fonts only.

Vanadium 50 · Oct 7, 2024

DaveC426913 said:

To be clear, this was not a scanned doc.

Then scan it into text, not PDF.

Vanadium 50 · Oct 7, 2024

To sweep away the cruft, Dave is doing OCR. While OCR is getting pretty good, expecting perfection is unrealistic.

Going through PDF just adds an unnecessary complication and point of failure.

DaveC426913 · Oct 7, 2024

Vanadium 50 said:

Then scan it into text, not PDF.

Not sure what you mean; it's not scanned at all.

The designer has typed this in (to what, I'm not sure) and saved it out as PDF. (Presumably, as part of their preferred routine, for a number of useful reasons.) The point being I can select the text as text (i.e. with text select tools) in the PDF.

I don't really have a say over what the designers use to work in. But I think I can make a case for insisting that they don't mess with the font when delivering text assets.

Vanadium 50 · Oct 8, 2024

OK, so we got stuck down an OCR rabbit hole. Without it, the story is still simple - the document was printed with (possibly tacit) instructions to make it look pretty, and to adjust what it needed to make that happen. This is a side effect.

If you print in Lucida Console, this is less likely to happen, and your documents will be butt-ugly. Take your pick.

DaveC426913 · Oct 8, 2024

Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions.

This is a doc that a designer colleague has passed to me to be implemented on a page. It's entirely internal use, and it is straight up text.

The choice of PDF as a transfer medium is likely merely a professional habit on his part. (There are good reasons for it, but here is one of the downsides).

It does help, in that he is able to show me how he wants it laid out - bolding, paragraphs, etc.

But his choice of font is irrelevant, since I am stripping everything out and using the site's CSS styling anyway.

I have asked him to resend it in Word, which does not deign to know better than the user about what the user wants. If I'm lucky, this doesn't have to be a department-wide mandate; I think other designers use Word.

berkeman · Oct 8, 2024

DaveC426913 said:

This is a doc that a designer colleague has passed to me

DaveC426913 said:

I have asked him to resend it in Word

Does your work use some kind of formal document control system? In my previous work, we used various systems over the years (like Arena and Agile) for document control and revision control. One of the requirements for releasing a document was to attach the source file (in whatever format like Word or the format of the designing tool), and to attach a PDF copy that anybody could open even if they did not have the design tool software.

DaveC426913 · Oct 8, 2024

berkeman said:

Does your work use some kind of formal document control system?

Our department is small enough that we don't have such policies yet. But this incident has occurred because we are growing.

It makes me aware I've come from much bigger companies. I suspect implementation of standard policies is going to be something I will be harping on.

pbuk · Oct 9, 2024

I think there is some confusion here. Some elements of this have been touched on by others, but I think you are still missing the full picture.

The problem does not lie with Windows, Word, Adobe or the PDF format - all of these are doing exactly what they are supposed to.

The problem lies with the software that your designer is using to create the content: I would guess that this is some professional layout software such as InDesign or QuarkXPress. This sort of software is not designed for creating content, it is designed to prepare a document for presentation, perhaps electronically or more often in print. In order to do this it follows some common printing conventions which have been used for hundreds of years including the use of ligatures, where some letter combinations (in this case 'ff') are replaced with a special glyph (in this case 'ﬀ'). When the document is rendered as a PDF, it is the code for this glyph (Unicode U+FB00 'Latin Small Ligature Ff') that is inserted into the text and so this code is what is copied across to your Word document.

What happens next depends on what version of Word you are using and what fonts you have installed*, but I won't go into this now as it is not going to help you.

What is going to help is asking the 'designer' (although here they are simply a content creator) to turn off ligatures (and kerning and any other layout features) in the copy they send you. If you still need them to provide a fully laid out page to see what the intended layout is then they can do this as well, in an additional copy.

* in particular it seems that the font you are using does not implement this glyph correctly: I am pretty confident however that the 'ﬀ' will render properly in your web browser here because this is part of modern web standards.
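If the ligature code point does survive the paste, Unicode compatibility normalization can expand it back into plain letters. A minimal Python sketch, assuming the pasted text contains U+FB00 as described above; `expand_ligatures` is a hypothetical helper name, not part of any tool mentioned in this thread:

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Expand compatibility characters, ligatures included, using NFKC."""
    return unicodedata.normalize("NFKC", text)

# "a" + U+FB00 + "ordable!" is what a ligature-preserving paste would contain.
pasted = "a\ufb00ordable!"
print(expand_ligatures(pasted))  # affordable!
```

One caution on this design choice: NFKC is broader than ligatures. It also rewrites other compatibility characters (for example, a non-breaking space becomes an ordinary space and a superscript two becomes a plain 2), so it may change copy that was typed that way on purpose.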

DaveC426913 · Oct 9, 2024

pbuk said:

What is going to help is asking the 'designer' (although here they are simply a content creator) to turn off ligatures (and kerning and any other layout features) in the copy they send you. If you still need them to provide a fully laid out page to see what the intended layout is then they can do this as well, in an additional copy.

Yes. This.

You are certainly right that I need their deliverables that show layout. In this case, it was straight copy, and copy should really be sent under separate cover - as you point out.

I have spoken with the designer (who, it turns out, got this from a third party stakeholder) and we are in agreement that Word is the proper tool. Jury is still out on whether we can start telling third party stakeholders what tools they're allowed to communicate with.

Tom.G · Oct 11, 2024

DaveC426913 said:

Jury is still out on whether we can start telling third party stakeholders what tools they're allowed to communicate with.

A first-pass approach, it would need polishing before decreeing anything, especially the phrases in <...>:

"Since we do not have every text layout program in existence, <would you> kindly supply a copy in ~~PDF~~ <Word/plain text> format in addition to your layout requirements? This will expedite <the/our> <processing of your project>."

Hope this helps!

Cheers,
Tom

Vanadium 50 · Oct 11, 2024

Let's not lose focus. Fixing this takes 5 or 10 seconds. To avoid wasting 5 or 10 seconds in the future, we will have meetings, rules, standards, meetings about rules and standards and thus ensure we will never waste 5 or 10 seconds again!

DaveC426913 · Oct 11, 2024

Vanadium 50 said:

Let's not lose focus. Fixing this takes 5 or 10 seconds. To avoid wasting 5 or 10 seconds in the future, we will have meetings, rules, standards, meetings about rules and standards and thus ensure we will never waste 5 or 10 seconds again!

I know (hope) you're kidding around but it isn't a 5 or 10 second fix.

Unless everybody in the chain is schooled on not using PDFs to pass copy around, there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time, for every revision that comes down the pipe. And even if we did, we can't be sure what the copy was supposed to be (what if it occurred in a date or dollar value?).

By the time it gets to me, I've got a compressed timeline and cannot afford to proof every word. Chasing down the original copy writer for a change causes ripples that travel up and down the line of approvers and stakeholders, which results in delays and missed deadlines.

It's not the time, it's the consequences: missed publish deadlines or errors in the product. Both are unprofessional and new-one tearing-worthy.
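Part of the "how would anyone even know" problem can be checked mechanically, at least for ligature code points. A minimal Python sketch, assuming the pasted copy is available as a string; `find_ligatures` is a hypothetical helper, and the range U+FB00 to U+FB06 covers only the Latin ligatures in Unicode's Alphabetic Presentation Forms block:

```python
import unicodedata

def find_ligatures(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for each Latin ligature
    code point (U+FB00..U+FB06) found in the text."""
    hits = []
    for index, char in enumerate(text):
        if "\ufb00" <= char <= "\ufb06":
            hits.append((index, char, unicodedata.name(char, "UNKNOWN")))
    return hits

# Sample copy with two ligatures hidden in it (illustrative text only).
sample = "An a\ufb00ordable o\ufb03ce"
for index, char, name in find_ligatures(sample):
    print(index, repr(char), name)
```

This would not catch the original "aJordable" case, where an ordinary letter J arrived in place of the ligature. That looks like the PDF mapping the glyph to the wrong character code (an assumption; the file itself was not examined here), and a wrong but ordinary letter can only be caught by proofing or a spell check.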

Vanadium 50 · Oct 11, 2024

Yes it is. Two letters, "ff" was replaced by the "ff" ligature. To fix, replace the ligature with ff.

You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"

DaveC426913 · Oct 11, 2024

Vanadium 50 said:

Yes it is. Two letters, "ff" was replaced by the "ff" ligature. To fix, replace the ligature with ff.

You speak out-of-turn. Please re-read my post as to why it is not nearly this simple, particularly this:

"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).

Vanadium 50 said:

You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"

Thanks but I'm pretty sure you don't know the processes in a given department in an unknown business or how many people are involved or how approvals travel up and down the chain.

One doesn't necessarily know who the originator is, or how to reach them right before the deadline - what one does is establish policies i.e. anticipate problems rather than react to them in crunch time - which is what I'm doing here by figuring out exactly what the problem is, so I can justify why the entire department has to take on a new policy.

Vanadium 50 · Oct 11, 2024

It's your time. Spend it however you want.

PeterDonis · Oct 11, 2024

Vanadium 50 said:

Yes it is. Two letters, "ff" was replaced by the "ff" ligature. To fix, replace the ligature with ff.

You can also fix this with a phone call - "The PDF got mungled. Can you send me the Word or other parent document?"

As someone who is just coming in and reading this thread, I agree with the OP: he has already explained why it isn't that simple, and why he is approaching it the way he is.

Vanadium 50 · Oct 13, 2024

The fundamental problem is that copy-and-paste provides no guarantee of accuracy. What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read. I could, if I were inclined to be perverse, write a program that, no matter what was copied to the clipboard, put "Tim Horton's has stale donuts" there. (Likewise with the paste.)

If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.

DaveC426913 · Oct 13, 2024

Vanadium 50 said:

The fundamental problem is that copy-and-paste provides no guarantee of accuracy.

What would be an alternative?

I can't write the copy myself merely by reading what someone wrote. That's prone to even more errors.

Copy and paste does guarantee accuracy if you establish standards. A straight up .txt document will faithfully copy almost everything. Enough that the list of exceptions is reduced to a manageable level.

Vanadium 50 said:

What goes on the clipboard is what one program chooses to write, and what comes back is what another chooses to read.

Which is why you set standards that are known to guarantee compatibility.

Vanadium 50 said:

I could, if I were inclined to be perverse, write a program that,

You could, yes. That would be dishonest.

You could also use invisible ink or pour sugar in our gas tanks, if you were of a mind.

What is the point in arguing absurdities?

It troubles me that you seem to consider sabotage a valid concern. Or are you joking around?

Vanadium 50 said:

If this is viewed as too risky or inaccurate, you need to drop cut and paste entirely.

No you don't.

First thing you do is hire people who are not saboteurs.
Second thing you do is establish standards.

I am not sure why you appear to be treating this as a joke. This is a real, consequential issue. I was just lucky to catch this error because it happened to be the last word in the copy. Frankly I still don't know if I published any other misspellings because I had to publish what I got and couldn't afford to stop everything to ask the chain of stakeholders to rewrite.

What if it had been a dollar figure? How would I ever know? I can't report something I don't know about. And then, before it's caught, some customer tries to sue us for fraudulent pricing. Still a joke?

Vanadium 50 · Oct 14, 2024

The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.

DaveC426913 · Oct 14, 2024

Vanadium 50 said:

The clipboard does not guarantee perfect fidelity. If perfect fidelity is a requirement, you need to use something else. That something else might not even be a PC.

Nobody said 'perfect', except you (three times now, including post 15). I think this is carrying the argument to an impractical (ad absurdum) degree, and I'm not sure why.

What I actually said (post 31) was "...the list of exceptions is reduced to a manageable level."

One should be able to count on standard text copying as standard text. The ASCII string 'affordable' doesn't get mangled by a simple copy paste.

So we will establish a policy that uses the lowest common denominator of text representation.

Vanadium 50 · Oct 14, 2024

"Imperfect fidelity acceptable to Dave" is not a clear spec.

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.

DaveC426913 · Oct 14, 2024

Vanadium 50 said:

"Imperfect fidelity acceptable to Dave" is not a clear spec.

Clear for whom? You?
What matters is that it is sufficiently clear to our department.

Vanadium 50 said:

I will point out that I was the one who pointed out what was happening and have received nothing but grief for it. Lesson learned.

You told me I was scanning it and using OCR (you said "sweeping away the cruft").
Then you told me it was a simple ten second fix, no matter how many ways I explained that it was not.
Then you mocked me because you apparently didn't understand how processes in departments work.

You've been mansplaining. That's what you've been getting push back on. (Is it "grief" if I am telling you where your assumptions are incorrect and asking why you are making a joke of this?).

I have been trying to be diplomatic: "Sorry, I am becoming aware that lack of context is spinning this off in unexpected directions." This is me trying to not point fingers, despite my frustration that you are giving me grief at every turn.

That being said, thank you sincerely for your help.
