How do data scientists resolve name discrepancies like this?

Eclair_de_XII · Oct 24, 2020

Suppose you have a .csv file that resembles something like this:

Code:

Name,Profession, ...
Mike Jones, Driver, ...

And now suppose you have many .csv files pertaining to information about people who drive for a living.

Code:

Profession,Qualified to Operate Motor Vehicle,...
[variation of "driver"],TRUE,...

where these variations can range from "bus driver", "taxi driver", "self-employed on-call driver", etc.

Now suppose you wished to learn a bit more about Mr. Jones and his life as a driver of sorts. To get a general picture of this life, you would want to associate the information found in these files with him using the key "Driver". Would correcting these discrepancies in the files be a job for a data correction team, or would the data scientist have a sure-fire way of associating "Mike Jones" with all these .csv files?

Personally, I would map each of these related professions to "Driver" and use that key as a conduit of sorts to link the information to Mr. Jones. But that seems tedious to me, and I wonder if there is a better way. Would regular expressions be appropriate here? To be honest, I haven't actually learned about them.
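A minimal sketch of the lookup-table idea described above, assuming the variants are known in advance. The names CANONICAL and canonical_profession, and the sample rows, are hypothetical and made up for illustration; they are not taken from any real data set.

```python
import csv
import io

# Hypothetical lookup table: each known variant maps to one canonical key.
CANONICAL = {
    "driver": "Driver",
    "bus driver": "Driver",
    "taxi driver": "Driver",
    "self-employed on-call driver": "Driver",
}

def canonical_profession(raw):
    """Return the canonical key for a raw profession string, or None if unknown."""
    return CANONICAL.get(raw.strip().lower())

# Made-up sample data standing in for the two kinds of .csv files.
people_csv = "Name,Profession\nMike Jones, Driver\n"
facts_csv = (
    "Profession,Qualified to Operate Motor Vehicle\n"
    "bus driver,TRUE\n"
    "taxi driver,TRUE\n"
)

people = list(csv.DictReader(io.StringIO(people_csv)))
facts = list(csv.DictReader(io.StringIO(facts_csv)))

# Link each person to every fact row that shares the same canonical key.
linked = {}
for person in people:
    key = canonical_profession(person["Profession"])
    linked[person["Name"]] = [
        row for row in facts
        if key is not None and canonical_profession(row["Profession"]) == key
    ]
```

The tedious part is maintaining the table by hand: any variant that is not listed is silently left unlinked.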

EDIT: Changed "driver" example to "psychologist".
EDIT 2: Changed example back to "driver".

andrewkirk · Oct 24, 2020

You could program it to assign to the class 'driver' every record, in each csv of interest, whose profession string contains the pattern 'driver' once converted to lower case (by a function named something like 'tolower'). You would get some false positives, which you would need to weed out by hand. As you manually identify each false positive (e.g. screwdriver salesperson), you could add it to a list of exclusions for the search algorithm to apply.
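A rough sketch of that approach in Python, where the lower-casing function is str.lower(). The names EXCLUSIONS and is_driver and the sample strings are hypothetical.

```python
# Hypothetical exclusion list, grown by hand as false positives turn up.
EXCLUSIONS = {"screwdriver salesperson"}

def is_driver(profession):
    """True if the lower-cased string contains 'driver' and is not excluded."""
    text = profession.strip().lower()
    return "driver" in text and text not in EXCLUSIONS

samples = ["Bus Driver", "taxi driver", "Screwdriver Salesperson", "Psychologist"]
drivers = [s for s in samples if is_driver(s)]
```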

We generally call such pattern-matching 'grep'. For a python approach to grepping, see here.
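As one possible illustration of grep-style matching, here is a sketch using Python's standard re module. The pattern and the name mentions_driver are assumptions chosen for this example. The word boundary \b requires 'driver' to start a word, which removes the 'screwdriver' kind of false positive without an exclusion list.

```python
import re

# \b marks a word boundary, so "driver" must begin a word;
# "screwdriver" therefore does not match. The optional "s" allows plurals.
DRIVER_PATTERN = re.compile(r"\bdrivers?\b", re.IGNORECASE)

def mentions_driver(profession):
    """True if the string contains 'driver' or 'drivers' as a whole word."""
    return DRIVER_PATTERN.search(profession) is not None
```

Other false positives are still possible, so some manual review would remain.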

anorlunda · Oct 24, 2020

You may want to think about how a search engine like Google can perform so many successful searches when users type their queries in so many languages and with so much flexibility in how they phrase the question.

You could spend a lifetime dealing with natural language processing. Or you could do no such processing and manually identify the fields you want to treat as "driver" in each of those csv files (as @andrewkirk suggested). Or you can find a compromise solution. It is a classical problem of manual labor versus automation.

jedishrfu · Oct 24, 2020

A lot of this work pertains to the data cleaning methods used when creating a data warehouse from various input sources, such as CSV files, form data, or other database tables. The process is also known as ETL.

https://en.wikipedia.org/wiki/Extract,_transform,_load?wprov=sfti1

.Scott · Oct 24, 2020

You're asking broadly how a data scientist works with this.

One of the first things the data scientist needs to do is build a "data dictionary". For an existing system, this is done by looking at all existing system documentation to identify what fields exist, where they come from, and how they are used. In most cases, this also involves interviewing a comprehensive sample of the data users.

With this information in hand, a full Boyce-Codd normalization can be done to identify the data relations. In the process, fields from multiple sources can be recognized as denoting the same information, and the data scientist can pick a name for each field that will be used in the dictionary.

All this, and the data scientist hasn't even done anything with the data yet. But in the case of very large data sets, the data dictionary is often useful as a user reference document even if no other software development follows.
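The field-naming step can be pictured with a small sketch. It covers only the renaming, not the normalization itself. The file names people.csv and licences.csv, the field name "Occupation", and the canonical names are all hypothetical.

```python
# Hypothetical data dictionary:
# (source file, field name as found) -> canonical field name.
DATA_DICTIONARY = {
    ("people.csv", "Profession"): "profession",
    ("licences.csv", "Occupation"): "profession",
    ("licences.csv", "Qualified to Operate Motor Vehicle"): "can_drive",
}

def rename_fields(source, row):
    """Return a copy of row with known field names replaced by canonical ones.

    Field names missing from the dictionary are kept unchanged.
    """
    return {
        DATA_DICTIONARY.get((source, name), name): value
        for name, value in row.items()
    }
```

Once two sources agree on the name "profession", the values in that field can then be reconciled by whichever matching method is chosen.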

Eclair_de_XII · Oct 27, 2020

Right-o. Thanks for the information. I'll keep these tips in mind moving forward.


