Two separate csv files with some common entries

  • Python
  • Thread starter EngWiPy
  • Start date
  • Tags
    Csv files
In summary: Process country abbreviations from d2foreach $record (@d2){... $country = $record.country;... $code = $record.code;... $d2{$country} = $code;}# ================================================================# Process country abbreviations from d3foreach $record (@d3){... $country = $record.country;... $code = $record.code;... $d3{$country} = $code;}# ================================================================# Check for inconsistent country namesforeach $key (keys(@d1)){... if($d1{$key} != $
  • #1
EngWiPy
1,368
61
Hello,

I have two dataframes, df1 and df2. Both have Country column, but one, say df2, has the codes of the country names as well. I wish to extract these codes, and assign them to the corresponding countries in df1. I just don't want to look up and fill the codes manually. I am hoping to find something like this:

Python:
df1['Code'] = df2[df2['Country'] == df1_countries]['Code']

but I have two problems:
1. The number of countries in both dataframes are different, and
2. They are not of the same order (countries of df1 can be found somewhere in df2).

Thanks in advance
 
Technology news on Phys.org
  • #2
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.
 
  • Like
Likes QuantumQuest, FactChecker, EngWiPy and 1 other person
  • #3
jedishrfu said:
Can you read one dataframe into a hash table keyed by country code? Then you’d be able to find matches as you read each line of the other file and then check the hash table for a matching country code.

I am not familiar with hash tables, but I did it with two for loops, the first (outer) one iterates over the countries in df1, and the second iterates over the countries in df2. I then found a match for each country in df1 in df2, and once I found it, I retrieved the Code of that country from df2 and added it to a list. At the end I created a new column in df1 assigned the values of the resulted list.
 
  • #4
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the _hash_ function, in python). So you could create a dict whose indices were the country names and whose values were the codes. Then you loop over the countries df1 and look each one up in the dict.
 
  • Like
Likes EngWiPy
  • #5
Ibix said:
Python calls hash tables dicts. Well worth looking up if you aren't familiar with them. Basically they are unordered arrays that are indexed by more or less anything (anything that implements the _hash_ function, in python). So you could create a dict whose indices were the country names and whose values were the codes. Then you loop over the countries df1 and look each one up in the dict.

But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.
 
  • #6
Don't populate the dict manually - populate it from the code/country pairing you've got in df2. Since hash lookup is a lot faster than the sequential search you are using, the result should be a lot quicker.
 
  • Like
Likes QuantumQuest and jedishrfu
  • #7
Regardless of the approach that is used, a common problem when working with two data bases would be that the country names are not completely standardized. One might be abbreviated, in a different order, or different capitalization (sometimes even within its own entries). On a large project, that can be a nightmare. Typically, you want to automate the matching and print any problems so that you can fix things up. Then a final clean automated run tells you that everything can be matched.
 
  • Like
Likes QuantumQuest, jedishrfu and Ibix
  • #8
FactChecker said:
a common problem when working with two data bases would be that the country names are not completely standardized
Indeed. One of the three rules for working with data is "do not trust your data"[1]. There's always someone who enters "Isle of Angelsey".

[1] The other two rules are "do not trust your data" and "do not trust your data".
 
  • Like
Likes QuantumQuest, jedishrfu and FactChecker
  • #9
S_David said:
But the main idea is to avoid typing the codes manually. Otherwise, I would have added them to my file without the need of the other file. I am familiar with dictionaries in python, and I think I get the idea, but it is what I want to avoid, since I have many countries.

It should work quite well with up to 10000 entries in your hash table.
 
  • #10
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.
 
  • #11
FactChecker said:
Since there are so few countries in the world, the brute force method of looping through a list and comparing should work fine.

Yes, of course. It’s just that you’re writing extra code that’s unnecessary if you use a hash table. It’s why folks use higher level languages for the conveniences offered to the programmer and that the provided implementation is often faster than what you code yourself.
 
  • Like
Likes FactChecker
  • #12
jedishrfu said:
Yes, of course. It’s just that you’re writing extra code that’s unnecessary if you use a hash table. It’s why folks use higher level languages for the conveniences offered to the programmer and that the provided implementation is often faster than what you code yourself.
I agree. Hash functions are the best way to do it (as long as its limits are not exceeded).
 
  • #13
I did it as following using a dictionary:

Python:
d = {}
for index, row in df2.iterrows():
    d[row['Country']] = row['Code']
Code = []
for index, row in df1.iterrows():
    country = row['Country']
    if country in d.keys():
        Code.append(d[country])
    else:#Not all countries are found, or may have a different name (to be entered manually)
        Code.append('')
df1['Code'] = Code

Can I improve it more than this?
 
  • #14
Just to give you an example of the type of data checking that you might want to do for an official program, I made a rough cut at Perl pseudo code. (Sorry, I don't know Python.) I have ignored any possible errors regarding Perl syntax or accessing the data in d1 and d2. So don't expect this to be working Perl code. It's just to give you a rough idea.
Perl:
# ================================================================
# Process country abbreviations from d1
foreach $record (@d1){
    $abbreviation = $record{'abbreviation'};
    $country = $record{'country'};
    # check if $abbreviation has already been used by a different country
    if( ( defined $abbr_used{$abbreviation})
               &&  ( $abbr_used{$abbreviation} ne $country )
    {
        print "Abbreviation $abbreviation use conflict: $abbr_used{$abbreviation} and $country\n";
    }
   
    # check if country has already been assigned a different abbreviation
    if( (defined $abbr{$country} )
               &&  ( $abbr{$country} ne $abbreviation)
    {
        print "Country $country abbreviation conflict: $abbr{$country} and $abbreviation\n";
    }
   
    # ================================================================
    # Store abbreviation for country
    $abbr{$country} = $abbreviation;
    $abbr_used{$abbreviation} = $country;
}

# ================================================================
# Process d2 countries
foreach $record (@d2){
    $country = $record{'country'};
    # check if there is a d1 abbreviation for this country
    if( ! defined $abbr{$country} ){
        print "No abbreviation defined in d1 for country $country\n";
        $record{'abbreviation'} = 'undefined';
    }else{
        $record{'abbreviation'} = $abbr{$country};
    }
}
 
  • Like
Likes EngWiPy

Related to Two separate csv files with some common entries

1. What is a CSV file?

A CSV (Comma Separated Values) file is a type of plain text file that stores tabular data in a structured format, with each row representing a record and each column representing a field of data. It is commonly used for storing and exchanging data between different software programs.

2. How can I open a CSV file?

A CSV file can be opened using any text editor, such as Notepad or Microsoft Word. It can also be opened in spreadsheet software, such as Microsoft Excel or Google Sheets.

3. What are common entries in a CSV file?

Common entries in a CSV file refer to data that appears in multiple rows or columns, such as names, dates, or numerical values. These entries can be used to link or match data between different CSV files.

4. How can I merge two CSV files with common entries?

To merge two CSV files with common entries, you can use software or programming languages such as Microsoft Excel, Python, or R. These tools have functions or libraries that allow you to combine and manipulate data from multiple CSV files.

5. What are the benefits of using CSV files?

CSV files offer several advantages, including their compatibility with a wide range of software and programming languages, their simple and lightweight structure, and their ability to store and exchange large amounts of data in a standardized format.

Similar threads

  • Programming and Computer Science
Replies
3
Views
902
  • Programming and Computer Science
Replies
3
Views
2K
  • Programming and Computer Science
Replies
12
Views
2K
Replies
6
Views
1K
  • Programming and Computer Science
Replies
2
Views
21K
  • Programming and Computer Science
Replies
16
Views
3K
  • Programming and Computer Science
Replies
8
Views
1K
  • Programming and Computer Science
Replies
34
Views
3K
  • Programming and Computer Science
Replies
5
Views
4K
  • Programming and Computer Science
Replies
2
Views
2K
Back
Top