Very strange copy-paste bug in PDF document

In summary, the article discusses a peculiar bug encountered when copying and pasting text from a PDF document, where the pasted text appears distorted or altered compared to the original. This issue may arise due to the way PDFs encode text, leading to unexpected characters or formatting when transferred. The article explores potential causes, such as font embedding and encoding discrepancies, and suggests troubleshooting steps to mitigate the problem, including using different PDF readers or converting the document to a more edit-friendly format.
  • #36
DaveC426913 said:
there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time, for every revision that comes down the pipe.
It is fairly easy to search a Word document for 'unusual' characters - I have posted instructions here:
https://www.physicsforums.com/threads/searching-for-unusual-characters-in-a-word-document.1066335/

DaveC426913 said:
(what if it occurred in a date or dollar value?).
This is less likely to happen with numbers, but not impossible: for instance, some fonts use alternative digits in which some 'normal' digits descend below the baseline.
 
  • #37
We have exhausted this topic and I think Dave understands the issue that he has to convey to the upstream designer.

So it's a good time to close the thread and without further ado...

Thank you all for contributing here.

PS: The ff ligature and other characters outside the 7-bit ASCII character set will likely be stored in the document as multibyte UTF-8 sequences, so it would be possible to scan the document looking for those multibyte characters.
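A minimal Python sketch of that scan, assuming the document text has already been exported to a plain string (for example, saved as a UTF-8 text file and read in). The function name `find_non_ascii` and the sample text are made up for illustration:

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything beyond 7-bit ASCII
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical sample: "staff" typed with the U+FB00 ff ligature.
sample = "The sta\ufb00 met on 3 May.\nTotal: $1,200"
for hit in find_non_ascii(sample):
    print(hit)
```

This only reports where the characters are. Whether each one is a problem (a ligature) or legitimate (an accented name, a typographic quote) is still a human decision.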

The Linux commands awk, sed, iconv, and tr come to mind for removing these special characters. Windows has equivalent third-party versions of these commands but you'll have to search for them.

In general, awk is the most powerful of the set since it's a programming language geared toward text search and replacement. Awk is still my favorite go-to language when I need to work with text files.

You can access some or all of these commands in a Linux environment on Windows if you install WSL, i.e., Windows Subsystem for Linux.

Alternatively, you can install the Cygwin tools and get access to them that way.

Neither is an easy install, and either may require an admin in a work environment.

Or you could install GNU awk (gawk) on Windows and either find a script online or write one that locates the non-ASCII characters and removes them or replaces them with what you want.

https://www.gnu.org/software/gawk/

https://www.gnu.org/software/software.html

Some brief info on the Linux commands:

https://confluence.cornell.edu/display/CNF/Linux+CheatSheet#LinuxCheatSheet

One more alternative comes to mind: write a Python script to do it. Again, you'll need Python on your machine, but that is becoming more common as machine learning proliferates.
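As a rough sketch of what such a Python script might look like, the snippet below replaces the common Latin f-ligatures with their plain-letter spellings. The table `LIGATURE_MAP` and the function `expand_ligatures` are illustrative names, and the mapping covers only five code points (U+FB00 to U+FB04); any other characters you want replaced would need to be added:

```python
# Map each ligature code point to its plain ASCII spelling.
LIGATURE_MAP = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}

def expand_ligatures(text):
    """Return text with the ligatures in LIGATURE_MAP spelled out as letters."""
    # str.translate accepts a dict of code point -> replacement string.
    return text.translate(LIGATURE_MAP)

print(expand_ligatures("The sta\ufb00 \ufb01led the o\ufb03ce report."))
```

An explicit table is used here rather than blanket Unicode normalization so that nothing outside the listed ligatures is touched. That matters if the document also contains characters that should stay as they are.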
 