What strings does this regex match? (a|b)*

In summary: The "check for a match" operation, a | b, is looking for a match at the position after the |, which in this case is the position of the character b.
  • #1
shivajikobardan
674
54
regepxal is showing infinite matches and shows "baaaaaaa" as a match. Doesn't (a|b) means a or b?
* means match 0 or more times. So, it should be matching aa,aaa,bb,bbb, , etc?
I'm not getting how baaaa is a match. Can anyone clarify please?
 
Technology news on Phys.org
  • #2
Go through "baaaaa" character by character. Which character is not an 'a' or a 'b'?
Sometimes you have to see what it does before you realize that it is doing exactly what it was told to do.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
  • #3
FactChecker said:
Go through "baaaaa" character by character. Which character is not an 'a' or a 'b'?
Sometimes you have to see what it does before you realize that it is doing exactly what it was told to do.
I feel I'm not getting it. I feel I'm missing something that you have good grasp of. You choose a or b and repeat, so aaaa, bbbb should be only matches. Could you help me out? How'd this look like in FSM?
 
  • #4
shivajikobardan said:
You choose a or b and repeat
You are misinterpreting the "repeat" part. As the regex is written, it means "for each character, match if it is a or b".

What you are thinking of, namely, "choose a or b; then match a string containing any number of that character and no others", would be expressed as the regex (a*|b*) (with stars inside the parentheses, instead of a star outside).
 
  • Like
Likes FactChecker and DaveC426913
  • #5
Go through the string "baaaa" one character at a time and for each one, apply the test (a|b). As @PeterDonis says, the parenthesis forces that part to be evaluated first. Each character passes the test and is a match. Then apply the * to match as many as possible; that is all of them.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
  • #6
PeterDonis said:
You are misinterpreting the "repeat" part. As the regex is written, it means "for each character, match if it is a or b".

What you are thinking of, namely, "choose a or b; then match a string containing any number of that character and no others", would be expressed as the regex (a*|b*) (with stars inside the parentheses, instead of a star outside).
I can't get it till now. Can you suggest some readings instead? I feel a good chance there's something you guys know that I don't know.
 
  • #7
FactChecker said:
Go through the string "baaaa" one character at a time and for each one, apply the test (a|b). As @PeterDonis says, the parenthesis forces that part to be evaluated first. Each character passes the test and is a match. Then apply the * to match as many as possible; that is all of them.
I am not getting it :(
 
  • #8
shivajikobardan said:
I am not getting it :(
Consider what (a|b) matches:
baaaa Is 'b' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
baaaa Is 'a' an 'a' or a 'b'? Yes.
The regular expression (a|b) matches all 5 characters.
Since the * tells it to match as many as possible, (a|b)* matches all of baaaa.
CORRECTION: This is wrong. It is for the Perl regular expression [ab]*
 
Last edited:
  • #9
shivajikobardan said:
I can't get it till now.
You have two operations that the regex is doing: (1) looking at a character and deciding whether it's a match; (2) deciding how many characters to match.

The "looking at a character and deciding if it's a match" operation, by itself, is straightforward: a single character matches if it's an a or a b.

The problem you are having comes with the "deciding how many characters to match" operation. You are thinking of that operation as follows: once you find a matching character, continue matching more characters as long as they are the same character as the first one that matched. In other words, once a match is found with an a, continue matching as long as there are a's, or once a match is found with a b, continue matching as long as there are b's; but stop matching if you started with a's but then encounter a b, or vice versa.

But that is not what the regex (a | b)* actually does. The regex that actually does what I just described in the last paragraph is the one I gave in my last post, (a* | b*). That regex will indeed match aa, aaa, bb, bbb, etc., but will not match baaaa.

What the regex (a | b)* actually does is to keep matching characters as long as each character is either an a or a b; it does not care if a's and b's are mixed. That is why baaaa is a match. The reason it doesn't care if a's and b's are mixed is that the "check for a match" operation, a | b, is inside the parentheses, but the "keep matching" operation, *, is outside the parentheses. That means it applies the a | b match criterion to each character independently during the "keep matching" operation, so it will keep matching as long as each character is either an a or a b, without caring about mixing the two.

Whereas in the regex (a* | b*), the "keep matching" operation is inside the parentheses, and there are two of them, one for a and one for b; that means that once it succeeds at the "check for a match" operation just once, it has to try the "keep matching" operation with just that character that it matched, for as long as it can.
 
  • #10
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
 
Last edited:
  • Like
Likes Nugatory
  • #11
Thank you. The perfect way to think about this for me is this:

(X)*= X X X X X X

Here X=a or b
So,

a or b then a or b then a or b then....
 
  • #12
shivajikobardan said:
Thank you. The perfect way to think about this for me is this:

(X)*= X X X X X X

Here X=a or b
So,

a or b then a or b then a or b then....
No, if you look at what I posted in #10, (a|b)* will be a match for ANY string, even an empty string
it matches 0 or more occurences of either a or b: any string has 0 or more occurances of any character

If you want to only match a, b, aa, aaa, bb, bbb etc. but not e.g. baaaaa or the empty string then you need ^(a+|b+)$ (or if there is a possibility that your test string may contain newline characters \A(a+|b+)\z)
 
  • #13
pbuk said:
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
I stand corrected. Actually, I was confusing it with the Perl regular expression, [ab]*
As a Perl regular expression, (a|b)* only matches one 'a' in 'baaaaxxxyyy'.
 
  • #14
pbuk said:
No, if you look at what I posted in #10, (a|b)* will be a match for ANY string, even an empty string
it matches 0 or more occurences of either a or b: any string has 0 or more occurances of any character

If you want to only match a, b, aa, aaa, bb, bbb etc. but not e.g. baaaaa or the empty string then you need ^(a+|b+)$ (or if there is a possibility that your test string may contain newline characters \A(a+|b+)\z)
yes it matches empty string as well. e, X, XX, XXX...where each X can be either a or b.
 
  • #15
pbuk said:
@FactChecker, @PeterDonis I am afraid you are both analysing a different regex, namely ^(a|b)+$. The regex in the OP will also match "gt 5*au!6".
Hm, yes, there should be a + instead of a * to ensure at least one matching character, and we should make explicit that we want to match the entire string.
 
  • #16
shivajikobardan said:
yes it matches empty string as well. e, X, XX, XXX...where each X can be either a or b.
You are not listening: (a|b)* is completely useless (as a complete regex) because it matches absolutely anything whether there are a or b characters in it or not.
 
Last edited:
  • #17
pbuk said:
You are not listening: (a|b)* is completely useless because it matches absolutely anything whether there are a or b characters in it or not.
Maybe worth stressing that it's only completely useless on its own (because, as you say, it matches any string containing a sequence of zero or more a or b, and anything contains zero or more as and bs). But it has its uses as a component of a larger regex (e.g. your ^(a|b)*$).
 
  • Like
Likes pbuk
  • #18
Ibix said:
Maybe worth stressing that it's only completely useless on its own
Yes, that is closer to what I intended to convey: edited.
 
Last edited:

FAQ: What strings does this regex match? (a|b)*

1. What is a regex?

A regex, short for regular expression, is a sequence of characters that serves as a pattern for matching strings in a larger set of text. It is commonly used in programming and text processing to find and manipulate specific patterns of text.

2. What does (a|b)* mean in a regex?

The (a|b)* in a regex means that the pattern will match any string that contains either the letter "a" or the letter "b" zero or more times. The pipe symbol "|" acts as an "or" operator, allowing for either option to be matched.

3. Does the order of the letters matter in (a|b)*?

No, the order of the letters does not matter in this particular regex. The (a|b)* pattern will match strings that contain any combination of "a" and "b" characters, regardless of their order.

4. Will (a|b)* match strings with other characters besides "a" and "b"?

No, the (a|b)* pattern will only match strings that contain either "a" or "b" characters. Other characters, such as numbers or symbols, will not be included in the match.

5. How many times can (a|b)* be repeated in a string?

Since the asterisk symbol "*" indicates that the preceding pattern can be repeated zero or more times, there is no limit to the number of times (a|b)* can be repeated in a string. It could potentially match an infinite number of "a" and "b" characters.

Back
Top