Extract URLs with regexp - Homework Solution

Trentonx · Jan 22, 2011

Homework Statement

I have a file that contains lines like the following:

Code:

<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>
<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.google.com/">Google</a></strong></font></div></td>

It has already been partly processed from an HTML file, but I want the final output to be the following:

Code:

http://www.yahoo.com/
http://www.google.com/

I am using sed to edit the file line by line and substitute.

Homework Equations

Nothing much here

The Attempt at a Solution

My idea was to use '*http' to match anything in front of http and then replace it with an empty string. This didn't actually match anything, which also ruled out a similar idea to match and delete everything after the .com/ portion. I also tried '<td>*="' to remove the portion before http, again replacing it with an empty string. Any help or hints would be appreciated, thanks.
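For readers wondering why neither attempt matched: in sed's basic regular expressions, * is not a wildcard by itself. It repeats the preceding item zero or more times, and a * at the very start of a pattern is treated as a literal asterisk. The sketch below illustrates this on a shortened, made-up stand-in for one input line (the variable names are only for illustration):

```shell
# Shortened stand-in for one input line (hypothetical sample data).
line='<td><ahref="http://www.yahoo.com/">Yahoo!</a></td>'

# A leading * is a literal asterisk in a basic regex, so '*http' looks for
# the text "*http", which never occurs: the line comes back unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/*http//')

# '<td>*="' means "<td", then zero or more ">", then '="'. Another tag sits
# between <td> and the first '="', so this does not match either.
unchanged_too=$(printf '%s\n' "$line" | sed 's/<td>*="//')

# '.*' is "any character, repeated", which is the wildcard that was intended.
stripped=$(printf '%s\n' "$line" | sed 's/^.*http/http/')

printf '%s\n' "$unchanged" "$unchanged_too" "$stripped"
```

The first two results are identical to the input line; only the third has the leading markup removed.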

nvn · Jan 22, 2011

Trentonx: Try "^.*http", instead of "*http". And then try "\">.*$", to try to match everything after the URL. Try it, and let us know whether or not it works, since I have not tested it.

Trentonx · Jan 22, 2011

That worked with a little modification. I realized I wanted to match right up to the http, so as not to remove it. I used '^.*="', which somehow didn't stop at the other identical =" sequences earlier in the line. So now, how does it do it? The ^ is an anchor to the start of a line, and the * is a wildcard, but what does the . do? You used it in both expressions, so it is likely useful to know.

nvn · Jan 22, 2011

Trentonx: It seems what you used should not be working, because it would also match the other "=\"" sequences, and therefore should not be reliable. Instead, try "s/^.*http/http/", and see if that works (untested). Let us know. Period (.) means any single character, and * repeats the preceding item zero or more times, so ".*" matches any run of characters.
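One possible explanation for why '^.*="' did work on these lines: .* is greedy, so it runs to the last =" on the line rather than the first, and in the sample input the last =" happens to sit directly in front of the URL. The sketch below checks that on one of the sample lines, and then chains the two substitutions suggested in this thread (the variable names are only for illustration):

```shell
# One sample line from the thread.
line='<td><divalign="center"><fontcolor="#0000ff"face="Arial,Helvetica,sans-serif"size="2"><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></font></div></td>'

# Greedy: ^.*=" runs to the LAST =" on the line, not the first.
after_last_attr=$(printf '%s\n' "$line" | sed 's/^.*="//')

# Two substitutions in order: keep "http" while dropping what precedes it,
# then drop everything from the closing "> onward. The order matters,
# because "> also occurs earlier in the untouched line.
url=$(printf '%s\n' "$line" | sed -e 's/^.*http/http/' -e 's/">.*$//')

printf '%s\n' "$after_last_attr" "$url"
```

Relying on the last =" is fragile if a line ever carries another attribute after the link, which is presumably the concern raised above; anchoring on http avoids that.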

DeeNos · Jan 29, 2011

Thank you for sharing your problem with me. It seems like you are trying to use regular expressions (regex) to extract URLs from a file. Regex is a powerful tool for pattern matching and can be used to extract specific information from a larger text. However, it can be tricky to use and requires some practice to become proficient.

In this case, I would suggest using a regex pattern that specifically looks for URLs. This pattern should include the "http" part and also the ending ".com/" portion. Here is an example of a regex pattern that could work for your situation:

"http:\/\/.*\.com\/"

Let's break down this pattern:

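- "http:": This is the starting part of the URL that we want to match.
- "\/\/": These are two escaped slashes that match the literal "//" part of the URL. The backslashes are needed because "/" is also the delimiter of the sed substitute command.
- ".*": The period matches any single character and the asterisk repeats it zero or more times, so together they match any run of characters.
- "\.com\/": This matches the literal ending ".com/" portion of the URL. The backslash before the period stops it from meaning "any character".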

Using this pattern, you should be able to extract the URLs from your file by using the "sed" command. For example, you could try something like this:

sed -n -e 's/^.*"\(http:\/\/.*\.com\/\)".*$/\1/p' input.txt > output.txt

This command will search each line of the "input.txt" file for the pattern, replace the whole line with the first captured group (the URL, enclosed between \( and \)), and save the results in the "output.txt" file. The -n option together with the p flag prints only the lines where a URL was found. You can then open the "output.txt" file and see the extracted URLs.
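The pattern above only recognises URLs ending in ".com/". A slightly more general variant, offered here as a sketch rather than a definitive solution, captures everything up to the closing double quote with [^"]* instead. The function name extract_urls and the shortened sample lines are made up for illustration:

```shell
# Hypothetical helper: print one quoted http URL per line of stdin
# (the last one on the line, if there are several).
extract_urls() {
  # \( ... \) is the capture group; [^"]* stops at the closing quote.
  # -n with the p flag prints only lines where the substitution succeeded.
  sed -n 's/^.*"\(http:\/\/[^"]*\)".*$/\1/p'
}

# Shortened sample input; the middle line has no link and is skipped.
sample='<td><strong><ahref="http://www.yahoo.com/">Yahoo!</a></strong></td>
<td>no link here</td>
<td><strong><ahref="http://www.google.com/">Google</a></strong></td>'

urls=$(printf '%s\n' "$sample" | extract_urls)
printf '%s\n' "$urls"
```

With a real file, the same function could be used as: extract_urls < input.txt > output.txt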

I hope this helps you with your homework problem. Keep practicing with regex, and you will become more comfortable using it. Good luck!
