Searching for unusual characters in a Word document

In summary, searching for unusual characters in a Word document involves using the Find feature to locate non-standard symbols or formatting that may not be easily visible. Users can access this function by pressing Ctrl + F, entering specific character codes or wildcards, and employing the advanced search options to refine their queries. This process helps in identifying and correcting potential issues with document formatting or content that might affect readability or compatibility.
  • #1
pbuk
TL;DR Summary
Unusual characters in a Word document (e.g. ligatures like œ and ff) can be troublesome but there is an easy way to find them when working with English text.
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.

DaveC426913 said:
"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).

jedishrfu said:
The ff ligature and other characters beyond the ASCII set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.

Awk, sed, iconv and tr come to mind for removing these special characters.

Rather than using a command line tool that may not be available in Windows (and may corrupt any non-text file content e.g. embedded images), you can easily search for non-ASCII characters within Word.

Select "Advanced Find" from the menu bar, type [! -~^13] in the "Find what" box (note the space between ! and -) and select the "Use wildcards" option. You can then select "Reading Highlight" to see all the culprits, or "Find Next" to go through them one by one.

To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; the range  -~ (space, hyphen, tilde) means 'match any character with a code point between that of a space (32) and that of a tilde (126)', these being the first and last printable ASCII characters; and ^13 means 'match ASCII character 13', which is the end-of-paragraph marker in Word.

If you want, you can add other characters that you want to allow within the square brackets, such as £ or €, the 'smart' quotation marks “” and ‘’, or the emdash – just copy them from here.
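
For anyone who has the text outside Word (e.g. saved as plain text), the same idea can be sketched in Python. This is only a rough analogue of the Word pattern, not something Word itself runs: the names `find_unusual` and `extra` are made up for illustration, and unlike Word a plain-text file will usually contain "\n" line endings, which this sketch flags unless you add them to `extra`.

```python
import re

def find_unusual(text, extra=""):
    """Return (offset, character, code point) for every character that is
    not printable ASCII (space..tilde), not a carriage return (13), and
    not listed in `extra` (hypothetical parameter for allowed characters)."""
    # [^ -~\r...] mirrors Word's [! -~^13...]: ^ negates the class here,
    # ' -~' is the range from space (32) to tilde (126).
    pattern = re.compile("[^ -~\r" + re.escape(extra) + "]")
    return [
        (m.start(), m.group(), f"U+{ord(m.group()):04X}")
        for m in pattern.finditer(text)
    ]

# Sample text containing the 'ffi' ligature (U+FB03) and 'œ' (U+0153).
sample = "The o\ufb03ce man\u0153uvre\r"
for offset, ch, code in find_unusual(sample):
    print(offset, repr(ch), code)
```

Passing something like `extra="£€“”‘’–"` plays the same role as adding those characters inside the square brackets in Word.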
 
  • #2
Interesting.
pbuk said:
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.
The OCR that I use tends to tie 'fi' into one ligature character within words. A spell check will flag one of those "normal looking" words, so I select the ligature and replace it with the expansion throughout the document.
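
Outside Word, that replace-with-the-expansion step could look something like the sketch below. It is a minimal example that assumes the text is already in a Python string and only covers the five Latin f-ligatures in the Unicode block U+FB00 to U+FB04; the names `LIGATURES` and `expand_ligatures` are hypothetical.

```python
# Map each single-character ligature to its plain-letter expansion.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
TABLE = str.maketrans(LIGATURES)

def expand_ligatures(text):
    """Replace every ligature listed in LIGATURES with ordinary letters."""
    return text.translate(TABLE)

print(expand_ligatures("The \ufb01rst o\ufb03ce"))  # The first office
```

An explicit table is used here so that nothing else in the text is touched; other ligatures would need their own entries.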
 
  • #3
pbuk said:
To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...'; ! changes that to mean 'look for characters that don't match...'; the range  -~ (space, hyphen, tilde) means 'match any character with a code point between that of a space (32) and that of a tilde (126)', these being the first and last printable ASCII characters; and ^13 means 'match ASCII character 13', which is the end-of-paragraph marker in Word.

Very useful, I didn't know that 'wildcards' would include regular expressions (almost...).
 