Eclair_de_XII
- TL;DR Summary
- You have one data set containing an attribute about some person and many data sets pertaining to that same attribute. In each of these data sets, this attribute is spelt differently. How would a data scientist go about resolving these discrepancies?
Suppose you have a .csv file that resembles something like this:
Code:
Name,Profession,...
Mike Jones,Driver,...
And now suppose you have many .csv files pertaining to information about people who drive for a living.
Code:
Profession,Qualified to Operate Motor Vehicle,...
[variation of "driver"],TRUE,...
where these variations can include "bus driver", "taxi driver", "self-employed on-call driver", etc.
Now suppose you wished to learn a bit more about Mr. Jones and his life as a driver of sorts. To get a general picture of this life, you would want to associate the information found in these files with him using the key "Driver". Would correcting these discrepancies in the files be a job for a data correction team, or would a data scientist have a sure-fire way of associating "Mike Jones" with all these .csv files?
Personally, I would map each of these related professions to "Driver" and, using that as a conduit of sorts, link the information to Mr. Jones. But this seems tedious to me, and I wonder if there is a better way. Would regular expressions be appropriate here? To be honest, I haven't actually learned about them yet.
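To make the idea concrete, here is a minimal Python sketch of that mapping approach, not a definitive solution. It assumes that any profession containing the whole word "driver" should collapse to the key "Driver". The sample data, the pattern list, and the names `canonical_profession` and `link_records` are all hypothetical and only there for illustration.

```python
import csv
import io
import re

# Hypothetical stand-ins for the .csv files described above.
PEOPLE_CSV = """Name,Profession
Mike Jones,Driver
"""

JOBS_CSV = """Profession,Qualified to Operate Motor Vehicle
bus driver,TRUE
taxi driver,TRUE
self-employed on-call driver,TRUE
"""

# Assumed rule: any profession containing the whole word "driver"
# (case-insensitive) is treated as the canonical key "Driver".
CANONICAL_PATTERNS = [
    (re.compile(r"\bdriver\b", re.IGNORECASE), "Driver"),
]


def canonical_profession(raw):
    """Return the canonical key for a raw profession string.

    Falls back to the whitespace-stripped input when no rule matches.
    """
    cleaned = raw.strip()
    for pattern, canonical in CANONICAL_PATTERNS:
        if pattern.search(cleaned):
            return canonical
    return cleaned


def link_records(people_text, jobs_text):
    """Group job rows under each person with the same canonical profession."""
    jobs_by_key = {}
    for row in csv.DictReader(io.StringIO(jobs_text)):
        key = canonical_profession(row["Profession"])
        jobs_by_key.setdefault(key, []).append(row)

    linked = {}
    for person in csv.DictReader(io.StringIO(people_text)):
        key = canonical_profession(person["Profession"])
        linked[person["Name"]] = jobs_by_key.get(key, [])
    return linked


linked = link_records(PEOPLE_CSV, JOBS_CSV)
for name, rows in linked.items():
    print(name, [row["Profession"] for row in rows])
```

A pattern like this can over-match (for example, "device driver developer" would also collapse to "Driver"), so the rule list would probably still need a human to review it, which is part of why the mapping feels tedious.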
EDIT: Changed "driver" example to "psychologist".
EDIT 2: Changed example back to "driver".