How do data scientists resolve name discrepancies like this?

  • Python
  • Thread starter Eclair_de_XII
  • Tags
    Data
In summary, a data scientist needs to build a data dictionary in order to identify fields and their associated data relations.
  • #1
Eclair_de_XII
TL;DR Summary
You have one data set containing an attribute about some person and many data sets pertaining to that same attribute. In each of these data sets, this attribute is spelt differently. How would a data scientist go about resolving these discrepancies?
Suppose you have a .csv file that resembles something like this:

Code:
Name,Profession, ...
Mike Jones, Driver, ...

And now suppose you have many .csv files pertaining to information about people who drive for a living.

Code:
Profession,Qualified to Operate Motor Vehicle,...
[variation of "driver"],TRUE,...

where these variations can range from "bus driver", "taxi driver", "self-employed on-call driver", etc.

Now suppose you wished to learn a bit more about Mr. Jones and his life as a driver of sorts. To get a general picture of this life, you would want to associate the information found in these files with him using the key "Driver". Would correcting these discrepancies in the files be a job for a data correction team, or would a data scientist have a sure-fire way of associating "Mike Jones" with all these .csv files?

Personally, the way I would do it is to associate each of these related professions with "Driver" and, using that as a conduit of sorts, link the information to Mr. Jones that way. But it seems tedious to me and I wonder if there is a better way. I'm wondering whether regular expressions would be appropriate here. I haven't actually learned about them, to be honest.
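The "conduit" idea can be sketched roughly as a hand-maintained lookup table. This is only an illustration: the sample rows, the CANONICAL table and the rows_for_person helper are all made up for the example, and in-memory strings stand in for the real .csv files.

```python
import csv
import io

# Hypothetical sample data standing in for the .csv files described above.
PEOPLE_CSV = "Name,Profession\nMike Jones,Driver\n"
DETAILS_CSV = (
    "Profession,Qualified to Operate Motor Vehicle\n"
    "bus driver,TRUE\n"
    "taxi driver,TRUE\n"
)

# Hand-maintained lookup table: each variant maps to one canonical key.
CANONICAL = {
    "bus driver": "Driver",
    "taxi driver": "Driver",
    "self-employed on-call driver": "Driver",
}


def rows_for_person(name):
    """Return the detail rows whose profession maps to this person's profession."""
    people = {
        row["Name"]: row["Profession"]
        for row in csv.DictReader(io.StringIO(PEOPLE_CSV))
    }
    key = people[name]
    return [
        row
        for row in csv.DictReader(io.StringIO(DETAILS_CSV))
        if CANONICAL.get(row["Profession"].strip().lower()) == key
    ]


matches = rows_for_person("Mike Jones")
```

The tedious part is exactly the CANONICAL table: every new spelling has to be added by hand.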

EDIT: Changed "driver" example to "psychologist".
EDIT 2: Changed example back to "driver".
 
  • #2
You could program it to assign to the class 'driver' every record, in each csv of interest, whose profession string contains the pattern 'driver' once converted to lower case (by a function named something like 'tolower'). You would get some false positives, which you would need to weed out by hand. As you manually identify each false positive (e.g. screwdriver salesperson), you could add it to a list of exclusions for the search algorithm to apply.

We generally call such pattern-matching 'grep'. In Python, the standard-library re module is one way to do this kind of grepping.
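A minimal sketch of that approach in Python, assuming the profession strings have already been read out of the csv files. The names EXCLUSIONS and is_driver and the sample list are invented for the example; in Python the lower-casing function is str.lower().

```python
# Hypothetical exclusion list, grown by hand as false positives are found.
EXCLUSIONS = {"screwdriver salesperson"}


def is_driver(profession):
    """True if the lower-cased profession contains 'driver' and is not excluded."""
    text = profession.strip().lower()
    return "driver" in text and text not in EXCLUSIONS


samples = ["Bus Driver", "taxi driver", "Screwdriver Salesperson", "Nurse"]
drivers = [p for p in samples if is_driver(p)]
```

For a fixed word like this, the in operator is enough; a regular expression only becomes necessary when the pattern itself varies.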
 
  • #3
You may want to think about how a search engine like Google can perform so many successful searches when the users type in the queries with so many languages and so much flexibility in how to phrase the question.

You could spend a lifetime dealing with natural language processing. Or you could do no such processing and manually identify the fields you want to treat as "driver" in each of those csv files (as @andrewkirk suggested). Or you can find a compromise solution. It is a classical problem of manual labor versus automation.
 
  • #5
You're asking broadly how a data scientist works with this.

One of the first things a data scientist needs to do is build a "data dictionary". For an existing system, this is done by looking at all existing system documentation to identify what fields exist, where they come from, and how they are used. In most cases, this also involves interviewing a representative sample of the data users.

With this information in hand, a full Boyce-Codd normalization can be done to identify the data relations. In the process, fields from multiple sources can be recognized as denoting the same information, and the data scientist can pick a name for each field to be used in the dictionary.

All this, and nothing has been done with the data yet. But in the case of very large data sets, the data dictionary is often useful as a user reference document even if no other software development follows.
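As a rough illustration of how such a dictionary might be applied once it exists, the sketch below maps field names from different sources onto one chosen name per field. The aliases ("Occupation", "Job Title", "Full Name") and the helper normalize_record are assumptions made up for the example, not anything taken from a real system.

```python
# Hypothetical data dictionary: chosen field name -> names seen in the sources.
DATA_DICTIONARY = {
    "profession": {"Profession", "Occupation", "Job Title"},
    "name": {"Name", "Full Name"},
}

# Invert the dictionary once so each source name can be looked up directly.
ALIASES = {
    alias.lower(): canonical
    for canonical, aliases in DATA_DICTIONARY.items()
    for alias in aliases
}


def normalize_record(record):
    """Rename known fields to their chosen names; leave unknown fields as-is."""
    return {ALIASES.get(key.strip().lower(), key): value for key, value in record.items()}


normalized = normalize_record(
    {"Full Name": "Mike Jones", "Occupation": "Driver", "Age": "40"}
)
```

Fields the dictionary does not know about (here "Age") pass through unchanged, which makes gaps in the dictionary easy to spot.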
 
  • #6
Right-o. Thanks for the information. I'll keep these tips in mind moving forward.
 

FAQ: How do data scientists resolve name discrepancies like this?

1. How do data scientists identify name discrepancies in data?

Data scientists use techniques such as data cleaning and data profiling to identify name discrepancies in data. They also use data visualization tools to spot inconsistencies.

2. What are the common causes of name discrepancies in data?

Name discrepancies in data can be caused by human error, inconsistent data entry, data merging from multiple sources, and cultural differences in naming conventions.

3. How do data scientists handle name discrepancies in data?

Data scientists can handle name discrepancies by standardizing names to a common format, merging similar names, and using fuzzy matching algorithms to pair up similar names. They can also manually review and correct any discrepancies.
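One possible sketch of standardizing plus fuzzy matching, using difflib from the Python standard library. The list CANONICAL_NAMES, the helper closest_canonical and the 0.8 cutoff are assumptions chosen for illustration, not recommended values.

```python
import difflib

# Hypothetical list of accepted spellings.
CANONICAL_NAMES = ["driver", "psychologist"]


def closest_canonical(raw, cutoff=0.8):
    """Return the accepted spelling most similar to raw, or None if none is close."""
    # Standardize first: trim, lower-case, collapse repeated whitespace.
    cleaned = " ".join(raw.strip().lower().split())
    matches = difflib.get_close_matches(cleaned, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


fixed = [closest_canonical(s) for s in ["Drivr", "  Psycologist ", "nurse"]]
```

This kind of matching suits misspellings. A variant with extra words, such as "bus driver", differs too much from "driver" to clear a strict cutoff and is probably better handled by substring matching or a manual mapping.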

4. Can name discrepancies affect the accuracy of data analysis?

Yes, name discrepancies can affect the accuracy of data analysis. Inconsistent names can lead to duplicate records, incorrect grouping, and inaccurate insights. It is important for data scientists to resolve name discrepancies to ensure reliable and accurate analysis.

5. Is there a specific tool or software used to resolve name discrepancies in data?

There is no one specific tool or software used to resolve name discrepancies in data. Data scientists use a combination of data cleaning, data profiling, data matching, and data visualization tools to identify and resolve name discrepancies. They may also develop custom scripts or algorithms to handle specific name discrepancies in their data.
