Spreadsheet help: compare two lists to find exceptions

  • Thread starter DaveC426913
  • Start date
  • Tags
    Compare
In summary: List 1:CarolAliceBobDaveList 2:BillBobDaveIn summary, the MATCH function does not work properly when trying to compare two lists of text that have numbers in them.
  • #1
DaveC426913
Gold Member
22,986
6,659
TL;DR Summary
Compare a column of each of two spreadsheets to pull out exceptions
(I am using OpenOffice Calc, though I have access to Excel).
I have coerced two lists of patients into two sheets on a spreadsheet.

I need to determine who is missing from each list.

So, schematically, something like this output:

List 1:
Alice
Bob
Carol
Dave

List 2:
Alice
Bill
Bob
Dave

or something like this is fine too:

List 1:
Carol
List 2:
Bill

I'm not really picky. I am compiling my report manually, so "cheap and cheerful" is the word-of-the-day. The only reason I need to automate parts of it is because of the sheer volume: 70 doctors, totaling 16,000 patients.I'm not sure how to Google search for such operations - whether it's "conditional formatting" or "compare sheets", etc. Which is why I'm asking for direction/inspiration.As for comparing sheets, note that the two lists are not currently normalized:
One list might be:
Last Name, First name, Address, Gender, etc.
The other list list might be:
Doctor, Phone#, Full name, Gender, etc.

(The point I'm making is that a straight up "compare two sheets" is not really what this is. A "compare two sheets" operation is likely expecting a majority of overlap with only data/content differences.) Normalizing them could be done, but that's even more work. I'm sure I will have to at least normalize the name field(s).Also: No hurry. Anytime in the next half hour will be fine. :smile:
 
Computer science news on Phys.org
  • #3
Excel has something called an intersection operator. You could use that. Syntx is tricky, as white space is significant.
 
  • #4
jack action said:
VLOOKUP seems like a nice candidate.
In my experience, MATCH is almost always a better solution than any of the (V/H/)LOOKUP family (often, as here, with the 'match-type' option set to 0). The answer here is probably to introduce a column doing the MATCHing and using that for conditional formatting (which is always a bit of a faff) or with a bit more work, ingenuity and intermediate calculations, extracting the matching rows.
 
  • Like
Likes DaveC426913
  • #5
OK, making progress with the MATCH function.

I've abandoned using names because one list has Bobbie Smith and the other has Bobbie Jo Smith.
But insurance numbers are unique, so that's a lot easier.

But one list has ON 1234 567 890 YZ while the other has 1234 567 890, so now I have to see if there's a way of trimming/truncating a field...

Ah .. MID()
 
  • #6
Nope. Not working.

Here's a sample:
Sheet 'H-RCP':

7072​
Alice​
4154​
Andy​
etc.
Sheet 'H-PSS' (sorted):

7033​
Bill​
7058​
Dave​
7072​
Alice​
7103​
Charlie​
7113​
Annie​
etc.
Sheet 'Output', Cell A1:

=MATCH('H-RCP'.A1; 'H-PSS'.A1:A5; 1)
If I am using this correctly, MATCH is
looking for H-RCP cell 1: 7072
in the array H-PSS cells 1-5

It should return row 3, but it actually returns #N/A, meaning it didn't find it.
 
  • #7
Use 0 as the third argument; you don't need to sort them. Also make sure you are not comparing numbers with text that looks like a number (you might need to convert with VALUE or TEXT).
 
  • Like
Likes scottdave
  • #8
pbuk said:
Use 0 as the third argument; you don't need to sort them. Also make sure you are not comparing numbers with text that looks like a number (you might need to convert with VALUE or TEXT).
Yup, and yup.
Tried them as numbers and as text.

The only other difference I can think of is that one of them is, itself, the output of a function: the MID function, to strip off the prefix and suffix, thus:

B1 =MID(A1;3;10)
ON6323609633ET​
6323609633​
 
  • #9
How many names are there? If it's mall, and you don't need to repeat it often, it might be aster and cheaper to have two sorted lists side by side.
 
  • #10
Vanadium 50 said:
How many names are there? If it's mall, and you don't need to repeat it often, it might be aster and cheaper to have two sorted lists side by side.
16,000 names for 70 doctors.
 
  • #11
DaveC426913 said:
The only other difference I can think of is that one of them is, itself, the output of a function: the MID function, to strip off the prefix and suffix, thus:

B1 =MID(A1;3;10)
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10))[icode].
 
  • #12
pbuk said:
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10))[icode].
I'm comparing text.
 
  • #13
Give the problem, along with your worked example to ChatGPT. It can be very good at Excel. You can suggest using MATCH.
 
  • #14
Really? 🤔
 
  • #15
DaveC426913 said:
Really? 🤔
ChatGPT Q & A results.
Q: I have a spreadsheet file. Sheet A has a list of ID numbers in column A, and names in column B. Sheet B has a list of ID numbers in column A, and names in column B. Write an Excel expression to join the two and return only the matches in Sheet C.
A: In Excel, you can use the VLOOKUP function ... [cut off]
Q: No, use the MATCH function.
A: You can use the MATCH function in Excel to find the row number of a specific value in a range, and then use that row number to retrieve the corresponding data. Here's how to use the MATCH function in Sheet C to find the matching ID numbers in Sheet A and Sheet B:

In Cell A2 of Sheet C, enter the following formula:

=MATCH(A2,SheetA!A:A,0)

In Cell B2 of Sheet C, enter the following formula:

=MATCH(A2,SheetB!A:A,0)

These formulas will return the row number of the matching ID number in Sheet A and Sheet B, respectively. To retrieve the corresponding name for each ID, use the INDEX function, like this:

In Cell C2 of Sheet C, enter the following formula:

=INDEX(SheetA!B:B,A2)

In Cell D2 of Sheet C, enter the following formula:

=INDEX(SheetB!B:B,B2)

These formulas will return the name corresponding to the matching ID number in Sheet A and Sheet B, respectively.

You can then drag these formulas down to the last row of your data to apply them to all the IDs in Sheet C.
 
  • #16
DaveC426913 said:
TL;DR Summary: Compare a column of each of two spreadsheets to pull out exceptions

(The point I'm making is that a straight up "compare two sheets" is not really what this is. A "compare two sheets" operation is likely expecting a majority of overlap with only data/content differences.)
Just as an aside question, have you considered just using a typical scripting programming language to do this type of task instead of Excel? Have you used Python/Perl/Tcl at all before? I will often use Excel for simple stuff, but as soon as things get a little complicated, I'm inclined to just write a program to do it instead...
 
  • Like
Likes Vanadium 50 and FactChecker
  • #17
I was also going to suggest an honest to goodness database, but I suspect that the additional ease of that operation would be cancelled out by the trouble getting the data in and out.
 
  • Like
Likes DaveC426913
  • #18
DaveC426913 said:
I'm comparing text.
You sure? 6323609633 looks like a number to me (and to Excel).

Also, if you have Excel I'd be inclined to use Excel, not OpenOffice which is mostly compatible but you can never be sure how edge cases are handled.
 
  • #19
anorlunda said:
ChatGPT Q & A results.
Which are broken: a human expert can spot this instantly (SheetC!A2 will be a circular reference).

You really need to stop believing that a language tool can ever provide a reliable answer to technical problems.
 
  • Like
Likes Wrichik Basu and jack action
  • #20
pbuk said:
You sure? 6323609633 looks like a number to me (and to Excel).
By setting the format to text explicitly.

pbuk said:
Also, if you have Excel I'd be inclined to use Excel, not OpenOffice which is mostly compatible but you can never be sure how edge cases are handled.
You're right, apparently this utility may have to survive me, which means I'll have to use their tools, not mine.
 
  • #21
berkeman said:
Just as an aside question, have you considered just using a typical scripting programming language to do this type of task instead of Excel? Have you used Python/Perl/Tcl at all before? I will often use Excel for simple stuff, but as soon as things get a little complicated, I'm inclined to just write a program to do it instead...
*sigh* Yeah. I made* a Python script for a very similar utility. (There's lots of data massaging that needs to happen in this office because of old. obsolete and unfriendly Record Management software).

The problem with such utilities is how to make them survive me.

It's hard enough writing several pages of instructions for my successor to export/massage/import/process the data, but if I use a python program, that adds another bunch of pages in the "Office Procedures" documentation for how to install and use Python and environment. And that drops any likelihood this program will ever be used beyond my exit to less than 1%.*read as: had a friend make for me
 
  • Like
Likes pasmith
  • #22
DaveC426913 said:
By setting the format to text explicitly.
No, a number formatted as text is not text, it is still a number (you can verify this by adding 1 to it). This is a common source of problems when extracting numbers from strings in spreadsheets. The most reliable way around it is to make sure that anything that looks like a number is a number with VALUE, not to try to make numbers look like text.
 
  • Like
Likes hutchphd, DaveE, DaveC426913 and 1 other person
  • #23
Since you have access to Excel, I would suggest taking advantage of Power Query. Load every table into your data model, apply any necessary transformations to get each dataset into a consistent format, and then analyse them.

(You will need the actual desktop version of Excel, not the Office365 app.)
 
  • #24
Well this is disappointing.

It appears that a derived field does not want to be formatted as a number, so I can't MATCH it.

My Search data:
1675782708778.png


My Array to be searched:
1675782742349.png

Column B is derived from column A.
Column C is flat numbers (as a control).

The results:
1675782964324.png

Column B cannot find a MATCH to the derived number, even though they're both formatted as 'number'.
Column C (the control) MATCHes fine.

I am stumped.

Test file attached.
 
  • #25
pbuk said:
Yes, that will give a string (text). Convert it to a number with VALUE(MID(A1;3;10)).
Ah. @pbuk solved that already! Enclose it in VALUE(). (It also works to add +0)

1675784959491.png
 
  • #26
Dave, what was the formula you used in cell D1 of post #25? I'm using =MATCH(D1, B1:B5), but it just gives the (correct) index of the value I have in D1, of the column range B1 - B5.

I can't figure out how to use the index of a range to get the value there.
 
  • #27
Mark44 said:
I'm using =MATCH(D1, B1:B5), but it just gives the (correct) index of the value I have in D1, of the column range B1 - B5.
Tips for MATCH:
  • Always specify the final argument, which you most often want to be 0.
  • It is often best to MATCH on a whole column e.g. MATCH(D1,B:B), this avoids mistakes when you add more items to the list (and you should never put that formula in D1, or anywhere in column B as it would be a circular reference).
Mark44 said:
I can't figure out how to use the index of a range to get the value there.
You could try INDEX :-p
 
  • #28
pbuk said:
  • It is often best to MATCH on a whole column e.g. MATCH(D1,B:B), this avoids mistakes
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
 
  • #29
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
Yes, it does:

Screenshot_2023-02-07_17-48-39.png
 
  • #30
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
In that case I strongly recommend you use guard rows (the narrow rows in the sheet below - I downloaded the latest OpenOffice just for you!)
1675810630626.png


Coming up: the formulas to achieve this.
 
  • #31
pbuk said:
Coming up: the formulas to achieve this.
Doesn't seem easy in OpenOffice, I'll give you the entries in row 3, the rest just copy down.
Code:
C3 =MATCH(A3;B$2:B$25;0)
D3 =IF(ISNA(C3);0;MAX(D2:D$2)+1)
E3 =E2+1
F3 =MATCH(E3;D$2:D$25;0)
G3 =IF(ISNA(F3);"";INDEX(A$2:A$25;$F3))
 
  • #32
jack action said:
Yes, it does:
It didn't for me (OpenOffice on Windows) - that looks like LibreOffice on Linux to me.
 
  • Like
Likes jack action and DaveC426913
  • #33
Yeah, works in LibreOffice on Windows too, and we can display the formulas - see below (different formulas as we don't need the guard rows now).

1675811998151.png


Can you persuade your client to move to LibreOffice, apart from these there are a number of technical and commercial reasons to drop OpenOffice since the fork?
 
  • #34
pbuk said:
Can you persuade your client...
Well, client (hospital) has Excel on their image for virtual servers, but it is such a pain to tunnel in via VPN to do this kind of finicky work.

The biggest headache is that the virtual environment actively discourages copy/pasting of any sort. I mean, literally copying a number from a Notepad doc and pasting it in to a spreadsheet cell will not work most of the time. There is some sort of chron job or utility that clears the clipboard once per second. If I hit Ctrl-C ten times in a row, and then paste within about two seconds I have maybe a 50% chance of pasting what was in the clipboard. Crazy stuff.
 
  • #35
DaveC426913 said:
Indeed. Too bad it doesn't work in OpenOffice Calc. :frown:
In my case (LibreOffice), after I begin writing a function, when I click on a column heading, it writes B1:B1048576, and then it transforms into B:B after I get out of the cell.

Maybe you can just write the full column range by default?
 
Back
Top