Algorithm for Converting Strings: Find the Best Option

  • #1
Arman777
Let us suppose I have a string in this form,

string = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'

Now I want to take the text between each pair of 'C's and convert it to upper case, removing the 'C' markers. For example, the string above should turn into

new_string = 'I Know Who You Are but What Am I'

What kind of algorithm is best for this job? I have come up with something, but it seems really long and inefficient. Any ideas?
 
  • #4
Arman777 said:
I have come up with something but it seems really long and inefficient.
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
 
  • #5
The 'sub' function in Python's re module accepts a function as the replacement: it calls that function for each substring that matches the pattern and substitutes whatever string the function returns. That is what you want, so that you can replace 'CxC' with 'X'. See this for a description.
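As a minimal sketch of that idea (the function name and the sample text are made up for illustration), the replacement function receives a match object and returns the string to substitute:

```python
import re

def upper_inner(match):
    # group(0) would be the whole 'CxC' text; group(1) is the captured character.
    return match.group(1).upper()

# Hypothetical sample input, not taken from the thread.
text = 'CkCnow CwCho'

# 'C', exactly one character that is not 'C', then 'C' again.
result = re.sub(r'C([^C])C', upper_inner, text)
print(result)  # Know Who
```

This version only handles a single character between the two 'C's.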
 
  • #6
PeterDonis said:
It's going to be hard for us to tell whether or not we agree with you if we don't see the algorithm.
My algorithm was to find the index of each `C`, pair the indices two at a time, and uppercase the characters between each pair. However, it is taking too long...
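The actual code was never posted, so the following is only a guess at what such an index-pairing approach might look like; the function name and the `marker` parameter are assumptions made for this sketch:

```python
def upper_between_markers(text, marker='C'):
    # Positions of every marker character, in order of appearance.
    positions = [i for i, ch in enumerate(text) if ch == marker]
    pieces = []
    cursor = 0
    # Pair the positions two at a time: (open, close), (open, close), ...
    for start, end in zip(positions[0::2], positions[1::2]):
        pieces.append(text[cursor:start])           # text before the pair, unchanged
        pieces.append(text[start + 1:end].upper())  # text inside the pair, uppercased
        cursor = end + 1
    pieces.append(text[cursor:])                    # whatever follows the last pair
    return ''.join(pieces)

print(upper_between_markers('CaCarmanCpopC'))  # AarmanPOP
```

If the number of markers is odd, the unpaired last marker is simply left in the output.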
 
  • #7
FactChecker said:
The 'sub' function in Python allows you to have it call a function for each string that matches the pattern and return the string that you want to be substituted. That is what you want so that you can replace 'CxC' with 'X'. See this for a description.
It looks good. I'll try this.
 
  • #8
regex 2021.8.3 is a python package that you can install and use.
 
  • #9
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
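A small illustration of that risk, assuming a single-character 'CxC' pattern and a made-up sentence that contains a real acronym:

```python
import re

def convert(text):
    # 'C', one word character, 'C' -> that character uppercased.
    return re.sub(r'C(\w)C', lambda m: m.group(1).upper(), text)

# Hypothetical input: 'CDC' is a real acronym, 'CyC' is an intended marker.
print(convert('the CDC said CyCes'))  # the D said Yes
```

The pattern cannot tell the acronym from the marker, so it is only safe when the text is known to contain no other capital 'C'.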
 
  • #10
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
 
  • #11
FactChecker said:
The hardest part, in general, would be to distinguish between a 'CxC' pattern that you want to replace versus an acronym with two 'C's that should stay as is. IMO, it is a mistake to use a normal ASCII character like 'C' as a special non-ASCII indicator with a special meaning.
(Also document section headers are sometimes all capitalized and might have character patterns that you don't want to replace.)
That's no problem in my case. All the strings I will work with are lowercase.
 
  • #12
FactChecker said:
Python is not known for speed. I assume that your algorithm is fairly simple and that Python is just slow. I think there are ways to pre-compile Python so that it will be faster. If this program is to be used for large quantities of text processing, you might want to do that part with a separate program in a faster language. If this is to be used for large text documents, you might be surprised at how many things occur in text documents that require more logic than you anticipated.
Well, I don't know any language other than Python, and I kind of need it to work in Python. But the text will not be very long, so it won't be a problem.
 
  • #13
I tried to use
Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work. Any ideas?
 
  • #14
Arman777 said:
I tried to use
Code:
def my_replace(m):
    if <some condition>:
        return <replacement variant 1>
    return <replacement variant 2>

result = re.sub("\w+", my_replace, input)

but I couldn't make it work..any ideas ?
This is just pseudocode. You need to replace it with real Python code appropriate for your problem. Exactly what code did you try?
 
  • #15
FactChecker said:
This is just pseudocode. You need to replace it with real Python code appropriate for your problem.
Yes, indeed. I just don't know how to use the re module. It says sub can take a function, but I am not sure how to use that function.
FactChecker said:
Exactly what code did you try?
It's not worth sharing, since it's not useful.
 
  • #16
I'm starting to suspect that this is a Python homework problem because it seems very artificial. In that case, I will only give hints on how to modify your Python code.

In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.

Perl:
$string = `type temp2.txt`;
$string =~ s/(C(\w)C)/uc($2)/ge;
print "$string\n";
 
  • #17
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I were more fluent in Python, I could make a better regular expression than that, but re seems to use a limited version. The best tool for learning about regular expressions is regex101.com.
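For comparison, here is a shorter sketch of the same capitalization idea that matches whole words with `\w+` and decides inside the function. It is an alternative, not the code from the post; `SMALL_WORDS` and `capitalize_word` are names invented here:

```python
import re

SMALL_WORDS = {'but', 'and', 'of'}  # same exceptions as in the example above

def capitalize_word(match):
    word = match.group(0)
    if word in SMALL_WORDS:
        return word
    # Uppercase only the first character; leave the rest untouched.
    return word[0].upper() + word[1:]

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r'\w+', capitalize_word, old_str)
print(new_str)  # I Know Who You Are but What Am I
```

One trade-off: `\w+` stops at apostrophes and hyphens, so a word like "don't" would be treated as two words, which the longer pattern above avoids.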
 
  • #18
FactChecker said:
I'm starting to suspect that this is a Python homework problem because it seems very artificial.
Well, I am going to use it somewhere, but it's not homework.
FactChecker said:
In case it is not a Python homework problem, below is some Perl code that will work. Put the original text in the file temp2.txt and it will print the modified result to STDOUT.
I did not ask for Perl code. I don't know Perl or how to run it.
 
  • #19
jack action said:
This is what you want:

Python:
import re

def to_camel_case(match):
    if match.group(1) is not None:
        return match.group(1).upper()

old_str = 'CıC CkCnow CwCho CyCou CaCre but CwChat CaCm CıC'
new_str = re.sub(r"C([^C])C", to_camel_case, old_str)

print(new_str)

I will leave it as an exercise for you to understand how it works.

But this is a much more useful (and fun!) use of regular expressions and python (note that there are no 'C' in the original string):

Python:
import re

def to_camel_case(match):
    if match.group(2) is not None:
        if match.group(2) not in ['but', 'and', 'of']:
            return  match.group(1) + match.group(3).upper() + match.group(4)
        else:
            return match.group(1) + match.group(2)

old_str = 'ı know who you are but what am ı'
new_str = re.sub(r"(^|[\s.,;:!?()])(([^\s.,;:!?()])([^\s.,;:!?()]*))(?=$|[\s.,;:!?()])", to_camel_case, old_str)

print(new_str)

If I was more fluent in python, I could make a better regular expression than that, but re seems to use a limited version. The best tool to learn about regular expression is regex101.com.
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should produce

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores the other C's.
 
  • #20
Arman777 said:
My algorithm was to take the index of each `C` letter. Pair them as 2. Get index values between them and then turn strings into uppercase letters based on these index.
That's basically what the regex version is doing.

Arman777 said:
However, it is taking too long...
That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
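One way to check this on a given machine is to time both approaches with `timeit`. The sketch below is an assumption-laden comparison, not a benchmark from the thread: the loop toggles a flag at each 'C' rather than pairing indices, and the sample input is invented. Results will vary with input size and Python version.

```python
import re
import timeit

PATTERN = re.compile(r'C([^C]+)C')

def with_regex(text):
    return PATTERN.sub(lambda m: m.group(1).upper(), text)

def with_loop(text):
    out = []
    inside = False  # True while we are between an opening and a closing 'C'
    for ch in text:
        if ch == 'C':
            inside = not inside
        else:
            out.append(ch.upper() if inside else ch)
    return ''.join(out)

sample = 'CaCarmanCpopC ' * 1000  # hypothetical test input
assert with_regex(sample) == with_loop(sample)

print('regex:', timeit.timeit(lambda: with_regex(sample), number=100))
print('loop: ', timeit.timeit(lambda: with_loop(sample), number=100))
```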
 
  • #21
FactChecker said:
I think there are ways to pre-compile Python so that it will be faster.
If you're running Python bytecode, you're running Python bytecode. "Pre-compiling", for Python, just means compiling Python source code to bytecode in advance. That won't make much difference compared to the overhead of bytecode while actually running the algorithm.

There is the option of trying other interpreters, such as PyPy, that use various tricks to optimize how Python bytecode is run. For this problem, with a short string, that probably won't do much; but for a very large body of text, it might since the PyPy optimizer will have more opportunities to optimize.
 
  • #22
PeterDonis said:
That's basically what the regex version is doing.
But my code takes more than 10-20 lines; the regex version might take 5 lines, maybe less.

PeterDonis said:
That's because Python is doing it in bytecode instructions, whereas the regex version is using the underlying C implementation for regular expressions, which will be a lot faster. But the algorithm itself is basically the same either way. There's no magical shortcut to finding the "C"s and uppercasing the letters between them.
I did not mean the running time, but the time it takes me to write the code :)

Code:
re.sub(r'C(\w)C', lambda s: s.group(1).upper(), xstr)
This seems to work, but it fails when there is more than one letter between the C's. For example,

For

xstr = 'CaCCvaC'

the above code produces

a = 'ACvaC'

but it should produce AVA.
 
  • #23
Arman777 said:
It's nice, but it does not work for this case:

xstr = 'CaCarmanCpopC'

It should produce

xstr = 'AarmanPOP'

but instead it produces

AarmanCpopC

so it ignores the other C's.
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
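The two suggested patterns do not behave the same on this input, which a quick check makes visible (the helper name `upper` is just for this sketch):

```python
import re

def upper(match):
    return match.group(1).upper()

xstr = 'CaCarmanCpopC'

# [^C]+ cannot run past a 'C', so each pair is matched separately.
print(re.sub(r'C([^C]+)C', upper, xstr))  # AarmanPOP

# \w also matches 'C', and + is greedy, so the group stretches to the last 'C'.
print(re.sub(r'C(\w+)C', upper, xstr))    # ACARMANCPOP
```

So of the two, only the `[^C]+` form gives the requested result here without further changes.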
 
  • #24
FactChecker said:
Your example didn't have anything with multiple letters between the 'C's.
In line 8 try new_str = re.sub(r"C([^C]+)C", to_camel_case, old_str)
or new_str = re.sub(r"C(\w+)C", to_camel_case, old_str)
Unfortunately, this will be fooled by any pair of 'C's that are part of real words. So it is most useful if there are no capital letters in the real text.
You may need to get familiar with Python regular expressions and try some things to get it to work the way you want it to.
Guys, please. As I said earlier, there are no capital letters in the text I am working on, so there will be no uppercase C.
Arman777 said:
All strings that I will work are lowercase.

Code:
re.sub(r'C(\w+?)C', lambda s: s.group(1).upper(), xstr)

This code works.

FactChecker said:
have anything with multiple letters between the 'C's.
CpopC was that case.

You guys are really helpful, but sometimes I just need some specific things. I know what I am doing. I know the difference between capital C and lowercase C and how the code can mix them up. Maybe I am 'new' to coding, but I know that much.

Arman777 said:
But my code takes more then 10-20 lines regex might take 5 lines maybe less
We have seen that it takes only 1 line :)
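The one-line, non-greedy version can be checked against the test strings from this thread; wrapping it in a function named `convert` is only a convenience for this sketch:

```python
import re

def convert(xstr):
    # +? is non-greedy: the group ends at the first 'C' that can close the pair.
    return re.sub(r'C(\w+?)C', lambda m: m.group(1).upper(), xstr)

print(convert('CaCCvaC'))           # AVA
print(convert('CaCarmanCpopC'))     # AarmanPOP
print(convert('CıC CkCnow CwCho'))  # I Know Who
```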
 
  • #25
Arman777 said:
my code takes more then 10-20 lines regex might take 5 lines maybe less
Yes, because the regex version already has built-in functions that perform the operations you need, so you don't have to code them by hand.

Arman777 said:
I did not mean in terms of speed of the running time but in terms of me writing the code :)
Yes, I agree that's important. I've found that a great source of innovation in coding is programmer laziness. :wink:
 
  • #26
anorlunda said:
regex 2021.8.3 is a python package that you can install and use.
Python already has the built-in re module in the standard library.
 
  • #27
Arman777 said:
This code works
As long as you're sure the characters in between the C's will all be lower case letters, yes. You could also make the regex more specific for that:

Python:
re.sub('C([a-z]+)C', lambda s: s.group(1).upper(), xstr)

Also, as shown in the example above, if you're sure there will be at least one lower case letter in between each pair of C's, you don't need the question mark in the regex, just the plus sign.
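A quick check of the `[a-z]+` version, with one caveat that is an observation added here rather than something raised in the thread: the dotless 'ı' from the original example is outside the ASCII range a-z, so that pair is left unconverted.

```python
import re

def convert(xstr):
    # [a-z] cannot match 'C', so the greedy + stops at the closing 'C' on its own.
    return re.sub(r'C([a-z]+)C', lambda m: m.group(1).upper(), xstr)

print(convert('CaCCvaC'))     # AVA
print(convert('CıC CkCnow'))  # CıC Know
```

If non-ASCII lowercase letters can occur, `[^C]+` or the non-greedy `\w+?` form may be the safer choice.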
 
  • #28
Arman777 said:
Guys please. As I have said earlier. In the text I am working on there are no capital letters. So there will be no uppercase C.
Code:
re.sub('C(\w+?)C', lambda s: s.group(1).upper(), xstr)

This code works

CpopC was the case

You guys are really helpful, but sometimes I just need some specific things. I know what I am doing. I know the difference between capital C and lowercase C and how the code can mix them up. Maybe I am 'new' to coding, but I know that much.

We have seen that it takes only 1 line :)
Sorry. You will get the best help if you are careful about the initial statement of the problem. The information about no capital letters and the example with more than one letter between the 'C's were not in the first post. It is hard for me to keep up with all the posts to get a clear picture of what is needed.
 
  • #29
Could just get your keyboard fixed.
 