Why Use Python for Web Scraping Instead of Excel?

In summary: the page's content is generated by JavaScript, so BeautifulSoup cannot see the rendered data and the script's selectors match nothing; a tool that renders JavaScript is needed, or, as one reply notes, MS Excel supports this sort of task natively.
  • #1
Leo_Chau_430
TL;DR Summary
I am trying to write a program that automatically scrapes the website https://www.goodschool.hk/ss to produce an Excel file containing the phone number, address, email address and fax number of all the secondary schools, primary schools and kindergartens in Hong Kong. However, I have faced some problems... My code runs successfully, but the Excel file it generates is blank.
My code is as follows:

Python:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://www.goodschool.hk/ss'

# Fetch the page's raw HTML (this does not execute any JavaScript)
response = requests.get(url)
html = response.content

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Find the container element for each school
school_items = soup.find_all('div', {'class': 'school-item'})

school_names_en = []
school_names_zh = []
school_addresses_en = []
school_addresses_zh = []
school_phones = []
school_emails = []
school_faxes = []

for school_item in school_items:

    # English name expected first, Chinese name second
    name_elements = school_item.select('a.school-name')
    school_names_en.append(name_elements[0].text.strip())
    school_names_zh.append(name_elements[1].text.strip())

    # English address expected first, Chinese address second
    address_elements = school_item.select('div.school-address')
    school_addresses_en.append(address_elements[0].text.strip())
    school_addresses_zh.append(address_elements[1].text.strip())

    # Phone, email and fax expected in that order
    contact_elements = school_item.select('div.contact-info')
    school_phones.append(contact_elements[0].text.strip())
    school_emails.append(contact_elements[1].text.strip())
    school_faxes.append(contact_elements[2].text.strip())

# Assemble the scraped fields into a DataFrame
df = pd.DataFrame({
    'School Name (English)': school_names_en,
    'School Name (Chinese)': school_names_zh,
    'Address (English)': school_addresses_en,
    'Address (Chinese)': school_addresses_zh,
    'Phone Number': school_phones,
    'Email Address': school_emails,
    'Fax Number': school_faxes
})

# Save the DataFrame as an Excel file on the user's Desktop
desktop_path = os.path.join(os.path.expanduser("~"), "Desktop")
excel_file_path = os.path.join(desktop_path, "school_data.xlsx")
df.to_excel(excel_file_path, index=False)

if os.path.exists(excel_file_path):
    print("Excel file generated successfully!")
else:
    print("Failed to generate Excel file.")
 
  • #2
Why don't you just e-mail them, asking if a non-HTML list is available? Other resources may include local school boards, etc.
 
  • #3
I have found contact lists for the schools on the website https://data.gov.hk/tc/. However, the lists there do not include the schools' email addresses...
 
  • Like
Likes hmmm27
  • #4
Leo_Chau_430 said:
My code is as follows
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
 
  • Haha
Likes jedishrfu
  • #5
PeterDonis said:
Code should be in a BBCode code block. For one thing, doing that preserves the exact formatting and whitespace of your code, which with Python is very important.

I have used magic moderator powers to edit your OP to put your code in a code block. Please review and make sure that the code formatting and indentation is correct (it looks ok to me but you're the one that wrote the code).
Sorry, I am new to this forum. You said your code is as follows, but I cannot see it. Where can I find it?
 
  • #6
As a debug measure, print out the dataframe and make sure it contains data.
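For example, a couple of lines like these at the end of the script would show immediately whether anything was scraped (hypothetical additions, reusing the variable names from the OP's code):

Python:
# Hypothetical debug prints, added after the DataFrame is built
print(df.shape)   # (0, 7) would mean no rows were scraped
print(df.head())  # preview the first few rows, if any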
 
  • Like
Likes pbuk
  • #7
Greg Bernhardt said:
As a debug measure, print out the dataframe and make sure it contains data.
Yes, I have tried printing out the data, but it seems the data is not being extracted properly.
 
  • #8
Leo_Chau_430 said:
Yes, I have tried printing out the data, but it seems the data is not being extracted properly.
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
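For instance (hypothetical debug lines, again reusing the OP's variable names):

Python:
# Hypothetical debug lines to narrow down where extraction fails
print(len(school_items))             # 0 means the 'school-item' selector matched nothing
for school_item in school_items[:3]:
    print(school_item.prettify())    # inspect the raw HTML of the first few matches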
 
  • #9
Greg Bernhardt said:
Next debug is to print out each school_item. I suspect you're not parsing the classes right.
I have just checked: school_items is an empty result set. However, when I print soup, it shows valid output.
 
  • #10
Leo_Chau_430 said:
You said your code is as follows, but I cannot see it. Where can I find it?
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
 
  • #11
PeterDonis said:
I meant your code, the code you posted in the OP of this thread. I have put it inside a BBCode code block. If you can't see that, try reloading the page.
Thank you, I can see the code now. I think the indentation is correct.
 
  • #12
Leo_Chau_430 said:
My code runs successfully, but the Excel file it generates is blank.
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
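One common workaround for JavaScript-rendered pages is to drive a real browser so the scripts actually run before you parse. A minimal sketch using Selenium might look like this; note that Chrome and a matching chromedriver are assumed to be installed, and the 'div.school-item' selector is only a placeholder carried over from the original code, which the rendered page may not actually use:

Python:
# Minimal sketch: render the page in a headless browser, then parse the result.
# Assumes Chrome + chromedriver are installed; 'div.school-item' is a
# placeholder selector carried over from the original code.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run the browser without a window
driver = webdriver.Chrome(options=options)

driver.get('https://www.goodschool.hk/ss')
time.sleep(5)  # crude wait for the JavaScript to finish rendering

# driver.page_source now contains the JavaScript-generated DOM
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

school_items = soup.select('div.school-item')  # placeholder selector
print(len(school_items))  # should be > 0 once the right selector is found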
 
  • Like
  • Love
Likes harborsparrow and Leo_Chau_430
  • #13
pbuk said:
Did you get that code from ChatGPT? Wherever it came from, you need to approach writing and debugging code in a different way.

Why don't you try inserting some print() statements to see what data is being scraped?
The contents of the page are generated by JavaScript, so BeautifulSoup doesn't see what you see.
Even if BeautifulSoup could run JavaScript, the selectors you are trying to use, e.g. {'class': 'school-item'}, don't exist in the page.
Thank you very much!
 
  • #14
Just curious why you chose Python when MS Excel supports this sort of task natively. If the website you are scraping is cooperative, you can sometimes even do it directly in an Excel worksheet with no code required.
 
  • Like
Likes Vanadium 50 and Greg Bernhardt

FAQ: Why Use Python for Web Scraping Instead of Excel?

Why use Python for web scraping instead of Excel?

Python is a powerful programming language that offers a wide range of libraries and tools specifically designed for web scraping. It allows for more advanced data extraction and manipulation compared to Excel, making it more efficient for scraping large amounts of data from websites.

Is Python better for handling dynamic web pages than Excel?

Yes, Python is better suited for handling dynamic web pages as it can interact with JavaScript elements and render dynamic content. Excel, on the other hand, may struggle with dynamic web pages that require user interaction or real-time updates.

Can Python automate the web scraping process more effectively than Excel?

Python excels at automating web scraping tasks through the use of libraries like BeautifulSoup and Scrapy. These tools allow for the creation of custom web scraping scripts that can be scheduled to run at specific times, making the process more efficient and hands-off compared to using Excel.
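For instance, a minimal sketch using the third-party schedule library (pip install schedule), where scrape_schools is a placeholder for your own scraping function:

Python:
# Minimal scheduling sketch using the third-party 'schedule' library
# (pip install schedule); scrape_schools is a placeholder function.
import time
import schedule

def scrape_schools():
    print("Running the scraper...")  # replace with real scraping logic

schedule.every().day.at("06:00").do(scrape_schools)  # run daily at 06:00

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute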

Does Python offer better data processing capabilities for web scraping than Excel?

Python provides more advanced data processing capabilities for web scraping compared to Excel. With libraries like Pandas and NumPy, Python can handle and manipulate scraped data more effectively, allowing for more complex analysis and visualization.
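As a small illustration of the kind of post-processing Pandas makes easy (the column names mirror the script earlier in this thread; the sample rows are made up):

Python:
# Small illustration of post-processing scraped data with Pandas;
# sample data is made up, column names mirror the OP's script.
import pandas as pd

df = pd.DataFrame({
    'School Name (English)': ['School A', 'School A', 'School B'],
    'Phone Number': ['2345 6789', '2345 6789', ''],
})

df = df.drop_duplicates()                      # remove duplicate rows
df = df[df['Phone Number'].str.strip() != '']  # drop rows with no phone number
print(df)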

Are there any limitations to using Python for web scraping instead of Excel?

While Python offers many advantages for web scraping, it does require some programming knowledge to effectively utilize its capabilities. Excel, on the other hand, may be more user-friendly for beginners or those with limited programming experience. Additionally, Python may require more setup and configuration compared to Excel for simple scraping tasks.
