Searching for unusual characters in a Word document

  • Thread starter: pbuk
AI Thread Summary
Unusual characters can be introduced into documents through copy and paste from sources like PDFs, leading to potential corruption that may go unnoticed unless the entire document is proofed. These characters, often encoded in UTF-8, can be identified using tools like Awk, sed, and tr, but a simpler method is available in Microsoft Word. By using the "Advanced Find" feature with the wildcard search option, users can locate non-ASCII characters by entering a specific search string. This method allows for the identification of characters outside the standard ASCII range, including ligatures created by OCR processes, such as the 'fi' ligature. The wildcard search functionality in Word enhances the search capabilities, enabling users to specify which characters to include or exclude in their search, thus facilitating the cleanup of documents from unwanted special characters.
pbuk
TL;DR Summary
Unusual characters in a Word document (e.g. ligatures like œ and ff) can be troublesome but there is an easy way to find them when working with English text.
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.

DaveC426913 said:
"...there is no way someone down the line can even know if there's a corruption in the copy - not without proofing the entire document every time.." (and even then one can't know if there are any other ligatures lurking in there that aren't even visible - I only caught this one by sheer luck because it happened to be the very last word in the entire copy).

jedishrfu said:
The ff ligature and other characters beyond the ASCII set will likely be encoded in the document as UTF-8, so it would be possible to scan the document looking for those multibyte characters.

Awk, sed, iconv and tr come to mind for removing these special characters.
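
As a rough illustration of the scan described in the quote, here is a minimal Python sketch. It assumes the text has already been saved as plain UTF-8 (a .docx file is a zip archive, so this will not work on the raw Word file), and the function name find_multibyte is made up for this example.

```python
def find_multibyte(data: bytes):
    """Yield (byte_offset, character) for every character that needs
    more than one byte in UTF-8, i.e. every non-ASCII character."""
    text = data.decode("utf-8")
    offset = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:          # ASCII characters are always 1 byte
            yield offset, ch
        offset += len(encoded)


# Sample text containing the 'ff' ligature (U+FB00) twice
sample = "The o\ufb00ice sta\ufb00".encode("utf-8")
hits = list(find_multibyte(sample))
for offset, ch in hits:
    print(f"byte {offset}: U+{ord(ch):04X} {ch!r}")
```

This only reports the characters; what to replace them with is a separate decision.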

Rather than using a command line tool that may not be available in Windows (and may corrupt any non-text file content e.g. embedded images), you can easily search for non-ASCII characters within Word.

Select "Advanced Find" from the menu bar, type [! -~^13] in the "Find what" box (note the space between ! and -) and select the "Use Wildcards" option. You can then select "Reading Highlight" to see all the culprits, or "Find Next" to go through them one by one.


To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...', ! changes that to mean 'look for characters that don't match...', ' -~' (space, hyphen, tilde) means 'match any character with a code point between that of a space (32) and that of a tilde (126)'; these are the first and last printable ASCII characters, and ^13 means 'match ASCII character 13', which is the end of paragraph marker in Word.

If you like, you can add other characters you want to allow inside the square brackets, such as £ or €, the 'smart' quotation marks “” and ‘’ or the emdash – just copy them from here.
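
For anyone who wants to try the same idea outside Word, the search string maps fairly directly onto a regular expression character class. The sketch below is an approximation in Python, not Word's own engine: ^ plays the role of Word's !, \r stands in for ^13, and the names build_pattern, find_unusual and extra_allowed are invented for this example.

```python
import re


def build_pattern(extra_allowed: str = "") -> re.Pattern:
    """Rough regex equivalent of Word's [! -~^13]: match any character that
    is NOT printable ASCII (space to tilde), NOT a carriage return, and NOT
    one of the extra characters the caller chooses to allow."""
    # re.escape protects characters such as ']' or '-' inside the class
    return re.compile(r"[^ -~\r" + re.escape(extra_allowed) + "]")


def find_unusual(text: str, extra_allowed: str = ""):
    """Return (index, character) pairs for every character the pattern flags."""
    pattern = build_pattern(extra_allowed)
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


sample = "Price: £5 \ufb01ne\r“ok”"
everything = find_unusual(sample)                      # flags £, the fi ligature and both quotes
ligatures_only = find_unusual(sample, extra_allowed="£“”")
print(everything)
print(ligatures_only)
```

Note that plain text files usually end lines with \n rather than \r, so outside Word you would probably want to pass "\n" and "\t" in extra_allowed as well.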
 
Interesting.
pbuk said:
One way unusual characters can be introduced is by copy and paste from another document e.g. a PDF.
The OCR that I use tends to tie 'fi' into a single ligature character within words. Spell check will catch one of those "normal looking" words, so I select the ligature and replace it with its expansion throughout the document.
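
The same replace-with-expansion step can be sketched in Python for text that has been exported from the document. The table below covers only the five common Latin f-ligatures from the Unicode Alphabetic Presentation Forms block, and the names LIGATURES and expand_ligatures are made up for this example.

```python
# Common f-ligatures (U+FB00 to U+FB04) and their plain-letter expansions
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts a dict of single characters mapped to strings
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its ordinary letters."""
    return text.translate(_TABLE)


fixed = expand_ligatures("\ufb01nal o\ufb03ce")
print(fixed)
```

A broader alternative is unicodedata.normalize("NFKC", text), which also expands these ligatures but rewrites many other compatibility characters too, so an explicit table is the more conservative choice.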
 
pbuk said:
To understand how this works, note that the "Use wildcards" option gives some characters in the search box special powers. Here [...] means 'look for characters that match...', ! changes that to mean 'look for characters that don't match...', ' -~' (space, hyphen, tilde) means 'match any character with a code point between that of a space (32) and that of a tilde (126)'; these are the first and last printable ASCII characters, and ^13 means 'match ASCII character 13', which is the end of paragraph marker in Word.

Very useful, I didn't know that 'wildcards' would include regular expressions (almost...).
 