- TL;DR Summary
- Unusual characters in a Word document (e.g. ligatures like œ and ff) can be troublesome but there is an easy way to find them when working with English text.
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.
DaveC426913 said: "...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).
jedishrfu said: The ff ligature and other characters beyond the ASCII set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.
Awk, sed, iconv and tr come to mind for removing these special characters.
Rather than using a command line tool that may not be available in Windows (and may corrupt any non-text file content e.g. embedded images), you can easily search for non-ASCII characters within Word.
To do this, select "Advanced Find" from the menu bar and, in the "Find what" box, type
[! -~^13]
(note the space between ! and -) and select the "Use wildcards" option. You can then select "Reading Highlight" to see all the culprits, or "Find Next" to go through them one by one.
To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...', and ! changes that to mean 'look for characters that don't match...'. The range
 -~
(a space, a hyphen and a tilde) means 'match any character with a code point between that of a space (32) and that of a tilde (126)'; these are the first and last printable ASCII characters. ^13 means 'match ASCII character 13', which is the end-of-paragraph marker in Word.
If you want, you can add other characters within the square brackets that you want to allow, such as £ or €, the 'smart' quotation marks “” and ‘’, or the en dash (–). Just copy them from here.
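For anyone who wants to run the same check on text that has already been taken out of Word, here is a minimal Python sketch of the same character class. The function name find_unusual and its allowed_extra parameter are my own illustrative choices, not anything Word provides, and the sketch assumes you already have the document text as a plain string.

```python
import re


def find_unusual(text, allowed_extra=""):
    """Return (offset, character, code point) for each unexpected character.

    Mirrors the Word wildcard [! -~^13]: flag anything outside the range
    space..tilde (code points 32-126) that is not a line break.
    allowed_extra is a hypothetical allow-list, e.g. "£€“”‘’–".
    """
    # [^...] is the regex spelling of Word's [!...]; " -~" is the same range.
    pattern = re.compile("[^ -~\r\n" + re.escape(allowed_extra) + "]")
    return [
        (m.start(), m.group(), f"U+{ord(m.group()):04X}")
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    sample = "stuff with an \ufb00 ligature\n"
    for offset, char, code in find_unusual(sample):
        print(offset, repr(char), code)
```

Passing the characters you want to permit as allowed_extra plays the same role as adding them inside the square brackets in Word.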