Extract URLs with regexp - Homework Solution

  • Thread starter Trentonx
In summary, the thread works out how to extract URLs from lines of partly processed HTML using sed substitutions. The suggested approach uses "^.*http" to match everything in front of the URL and "\">.*$" to match everything after it, replacing the first match with "http" and the second with an empty string so that only the URL remains. Along the way, the thread explains that ^ anchors a match to the start of a line and that a period (.) matches any character.
  • #1
Trentonx

Homework Statement


I have a file that contains lines like the following:
Code:
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>
It has already been partly processed from an HTML file, but I want the following to be the final output:
Code:
http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations


Nothing much here

The Attempt at a Solution


My idea was to say '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to try to remove the portion before http and again replace it with an empty string. Any help or hints would be appreciated, thanks.
 
  • #2
Trentonx: Try "^.*http", instead of "*http". And then try "\">.*$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.
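To see what these two expressions do when used literally, here is a minimal sketch, assuming sed with an empty replacement string for both. The sample line is shortened from the file in post #1 and the variable names are only for illustration.

```shell
# Sample line, shortened from the file in post #1 (illustrative only).
line='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>'

# Apply both suggested expressions with an empty replacement.
# ^.*http  matches from the start of the line through "http"
# ">.*$    matches from the closing quote to the end of the line
result=$(printf '%s\n' "$line" | sed -e 's/^.*http//' -e 's/">.*$//')

printf '%s\n' "$result"
```

With an empty replacement, the first expression strips "http" along with everything before it, so this prints ://www.yahoo.com/ rather than the full URL.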
 
  • #3
That worked with a little modification. I realized I wanted to match right before the http, so as not to remove it. I used '^.*="' which somehow didn't match the other identical expressions in the file. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.
 
  • #4
Trentonx: What you used should not be working, it seems, because it would match the other "=\"", and therefore, should not be reliable. Therefore, instead try "s/^.*http/http/", and see if that works (untested). Let us know. Period (.) means any character.
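On why '^.*="' from post #3 ends up right before the URL: in sed the * is greedy, so ".*" takes as many characters as it can, and "^.*=\"" therefore extends to the last =" on the line rather than the first. The following is a minimal sketch of both variants, assuming a shortened sample line in which the link is the last attribute (variable names are illustrative).

```shell
# Shortened sample line with several ="..." attributes (illustrative only).
line='<td><divalign="center"><fontcolor="#0000ff"><ahref="http://www.google.com/">Google</a></td>'

# Greedy .* runs to the LAST =" on the line, which here is the one before http.
greedy=$(printf '%s\n' "$line" | sed -e 's/^.*="//' -e 's/">.*$//')

# The form from post #4 keeps http explicitly by putting it back.
explicit=$(printf '%s\n' "$line" | sed -e 's/^.*http/http/' -e 's/">.*$//')

printf '%s\n%s\n' "$greedy" "$explicit"
```

Both variants print http://www.google.com/ for this line. The first relies on the link being the last =" on the line, while the second keys on "http" itself, so it does not depend on where the other attributes sit.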
 
  • #5
Thank you for sharing your problem with me. It seems like you are trying to use regular expressions (regex) to extract URLs from a file. Regex is a powerful tool for pattern matching and can be used to extract specific information from a larger text. However, it can be tricky to use and requires some practice to become proficient.

In this case, I would suggest using a regex pattern that specifically looks for URLs. This pattern should include the "http" part and also the ending ".com/" portion. Here is an example of a regex pattern that could work for your situation:

"http:\/\/.*\.com\/"

Let's break down this pattern:

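- "http:": This is the starting part of the URL that we want to match.
- "\/\/": These are two escaped forward slashes that match the "//" part of the URL. The backslashes are needed because / is also the delimiter of the sed s command.
- ".*": The period matches any character and the * repeats it zero or more times, so together they match any run of characters.
- "\.com\/": The escaped period matches a literal ".", so this matches the ending ".com/" portion of the URL.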

Using this pattern, you should be able to extract the URLs from your file by using the "sed" command. For example, you could try something like this:

sed -n -e 's/^.*\(http:\/\/.*\.com\/\).*$/\1/p' input.txt > output.txt

This command searches each line of the "input.txt" file for the pattern, replaces the whole line with the first captured group (the URL enclosed in \( and \)), and saves the results in the "output.txt" file. The -n option together with the p flag prints only the lines where a substitution was made. You can then open the "output.txt" file and see the extracted URLs.
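A self-contained sketch of this kind of extraction, assuming a temporary file in place of "input.txt" and a capture group \( ... \) around the URL so that \1 has something to refer to. The first two sample lines are shortened from post #1, and the third is a made-up line without a URL, included to show that only lines with a successful substitution are printed.

```shell
# Temporary stand-in for input.txt (illustrative data, shortened from post #1).
input=$(mktemp)
printf '%s\n' \
  '<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>' \
  '<td><strong><ahref="http://www.google.com/">Google</a></strong></td>' \
  '<td><strong>no link here</strong></td>' > "$input"

# \( ... \) captures the URL; \1 replaces the whole line with it.
# -n suppresses default output; the p flag prints only substituted lines.
urls=$(sed -n -e 's/^.*\(http:\/\/.*\.com\/\).*$/\1/p' "$input")

printf '%s\n' "$urls"
rm -f "$input"
```

The line without a link produces no output, so the result holds just the two URLs, one per line.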

I hope this helps you with your homework problem. Keep practicing with regex, and you will become more comfortable using it. Good luck!
 

FAQ: Extract URLs with regexp - Homework Solution

What is a regular expression (regexp)?

A regular expression (regexp) is a sequence of characters that defines a search pattern. It is used to find or extract specific information from a larger string of text.

How can I use regexp to extract URLs?

To extract URLs using regexp, you can specify a pattern that matches the structure of a URL. This pattern can include protocols (e.g. http, https), domain names, subdomains, and file paths.
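As a sketch of such a pattern in sed, assuming each URL sits inside double quotes so that it ends at the next ": "https\{0,1\}" accepts both http and https, and "[^"]*" takes everything up to the closing quote, so no particular domain ending is hard-coded. The sample lines are made up for illustration.

```shell
# Made-up sample lines: one http URL with a path, one https URL on a .org domain.
lines='<ahref="http://www.example.com/docs/index.html">Docs</a>
<ahref="https://example.org/">Home</a>'

# | is used as the s delimiter so the slashes in :// need no escaping.
found=$(printf '%s\n' "$lines" | sed -n -e 's|^.*\(https\{0,1\}://[^"]*\).*$|\1|p')

printf '%s\n' "$found"
```

Stopping at the first double quote is a design choice that fits quoted HTML attributes. URLs in plain running text would need a different end condition.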

Are there different types of regexps for extracting URLs?

Yes, there are different types of regexps that can be used to extract URLs. Some are more general and can match a wider range of URLs, while others are more specific and may only match certain types of URLs.

Are there any limitations to using regexp for extracting URLs?

Yes, there are some limitations to using regexp for extracting URLs. For example, it may not be able to handle certain special characters or variations in URL formatting. Additionally, it may not be the most efficient method for extracting URLs from large amounts of text.
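One concrete case of such a limitation, sketched with sed under the assumption that a line can hold more than one link (the sample line is made up): because ".*" is greedy, a pattern like "http:\/\/.*\.com\/" runs from the first "http://" to the last ".com/" on the line and swallows everything between them.

```shell
# Made-up line with two links.
line='<ahref="http://www.yahoo.com/">Yahoo!</a><ahref="http://www.google.com/">Google</a>'

# Mark what the greedy pattern matches by wrapping the match (&) in [ ].
marked=$(printf '%s\n' "$line" | sed -e 's/http:\/\/.*\.com\//[&]/')

printf '%s\n' "$marked"
```

The brackets in the output enclose both URLs and the markup between them as a single match. Restricting the middle part to "[^"]*" instead of ".*" would keep each match inside its own pair of quotes.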

Can I test my regexp before using it to extract URLs?

Yes, there are online tools and resources available that allow you to test your regexp before using it to extract URLs. This can help you identify any errors or areas for improvement in your pattern.
