Text file that is read into an array

Sue Parks · Dec 2, 2015

I have a program "probe.f90" that reads in a text file. It runs. I am having trouble reading in the second text file (when prompted by user).

How could I turn this bit of code into a function? I was thinking about a function that strips the header from the fasta file. Then read into an array. Then compute amino acid frequency (last)!

I looked at some examples of character functions in fortran 90, but I see mostly integer functions.
Mod note: Edited the code so that the indentation matches the program structure.

Fortran:

    PROGRAM PROBE 
    implicit none
    character(len=200) :: string
    character, allocatable :: baseq(:), tmpstr(:), newstr(:)
    integer :: nbase, errstat, tmplen, i

    nbase = 0
    errstat = 0

    allocate( baseq(0) )

    open(unit=101, file="TitinFastaFormat.txt", form="formatted", status="old")
    do
       read (101,*,iostat=errstat) string
       if (errstat .ne. 0) exit
       if ( string(1:1) .ne. ">" ) then
            tmplen = len_trim(string)
            nbase = nbase + tmplen
            allocate ( tmpstr( tmplen ) )
            do i=1,tmplen
               tmpstr(i) = string(i:i)
            end do
            allocate( newstr( nbase ))
            newstr(1:size(baseq)) = baseq
            call move_alloc(newstr,baseq)
            baseq(nbase-tmplen+1:nbase) = tmpstr
            deallocate ( tmpstr )
       end if
    end do
    close(101)
    write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
    !write (*,"(A6, tr1, i0)") "NBASE:", nbase
    deallocate (baseq)

end program probe

Mark44 · Dec 3, 2015

Sue Parks said:

I have a program "probe.f90" that reads in a text file. It runs. I am having trouble reading in the second text file (when prompted by user).

How could I turn this bit of code into a function? I was thinking about a function that strips the header from the fasta file. Then read into an array. Then compute amino acid frequency (last)!

I don't recommend doing this. A function (or subroutine) should essentially do one thing, and the name should be chosen so that the one thing it does is obvious from the name. A better strategy, I believe, would be to write one subroutine that strips the header from the file, and a second subroutine that reads data from the file into an array, and finally, a third subroutine that does the amino acid frequency calculations.

I see all of these as subroutines, rather than functions, because each one performs some operation, but doesn't return a value, as a function should do. Each sub performs an action, so the name of each should be a verb, such as StripHeader for the first, and ReadData for the second, and maybe CalcAminoAcidFreq for the third.

If you decide to go this route, you'll then need to figure out what parameters each sub needs so that it can perform its task efficiently and effectively.

Sue Parks · Dec 3, 2015

Three subroutines makes more sense. Question: could I have two more subroutines that add the characters to an array. What would be the best way to sort through the parameters?

Mark44 · Dec 3, 2015

Sue Parks said:

Three subroutines makes more sense. Question: could I have two more subroutines that add the characters to an array. What would be the best way to sort through the parameters?

My suggestion of three subs came from your description of what seemed to me to be three separate tasks. After taking another look at your file, though, I don't think it would hurt to combine the header stripping and reading into an array into one sub, especially since Fortran doesn't really treat files as first-class objects. By that I mean I don't think there's an elegant way to pass the file into and out of a subroutine. Many other programming languages have that ability.

By "sorting through the parameters" I don't think you really mean actually sorting through them -- instead, just figuring out what they are. In each case, just identify what information the sub needs to do its job.

So far, the main tasks seem to be 1) pulling the data out of the file and storing it in an array, and 2) calculating the amino acid frequencies.
For the first task, one parameter (in) would be a string that contains the name of the file, and another would be the array (out) into which the data would be stored (omitting the header, which seems to be the first line of the text file). For the second task, the subroutine would need the array (in), and possibly another array (out) with the frequencies.
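As a rough sketch of the first task's interface: the file name goes in and the sequence comes back through an intent(out) argument. The names fasta_io, read_fasta, seq and demo.fasta are made up for illustration, and the sketch assumes a compiler with Fortran 2003/2008 features (deferred-length allocatable strings, newunit=) and input lines no longer than 200 characters.

```fortran
module fasta_io
   implicit none
contains
   ! Read a FASTA file, skip header lines (those starting with '>'),
   ! and return the concatenated sequence in seq.
   subroutine read_fasta(filename, seq)
      character(len=*), intent(in) :: filename
      character(len=:), allocatable, intent(out) :: seq
      character(len=200) :: line      ! assumption: no line exceeds 200 characters
      integer :: unit, ios

      seq = ""
      open(newunit=unit, file=filename, form="formatted", status="old", action="read")
      do
         read(unit, "(A)", iostat=ios) line
         if (ios /= 0) exit              ! end of file or read error
         if (line(1:1) == ">") cycle     ! header line, not part of the sequence
         seq = seq // trim(line)         ! seq is reallocated on assignment
      end do
      close(unit)
   end subroutine read_fasta
end module fasta_io

program demo_read
   use fasta_io
   implicit none
   character(len=:), allocatable :: seq
   integer :: unit

   ! Write a tiny FASTA file so the example is self-contained.
   open(newunit=unit, file="demo.fasta", status="replace", action="write")
   write(unit, "(A)") ">demo header"
   write(unit, "(A)") "MTTQAP"
   write(unit, "(A)") "TFTQ"
   close(unit)

   call read_fasta("demo.fasta", seq)
   print "(A, I0)", "Sequence: " // seq // "  length: ", len(seq)
end program demo_read
```

Putting the subroutine in a module gives the caller an explicit interface, which an allocatable dummy argument requires. A second subroutine for the frequency calculation would take seq as intent(in) in the same way.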

Sue Parks · Dec 3, 2015

Thank you!

Mark44 · Dec 3, 2015

Is there some reason you're using Fortran for this problem? For working with strings, many other languages are better choices. That's not to say you can't work with strings in Fortran, but it's easier to do this kind of work in many other languages.

Out of curiosity, what are you planning to do with the text file? I've copied the first four lines of the file you provided. What sort of analysis is your program supposed to do with the alphabet soup of the 2nd and following lines? (I'm assuming that the first line is the header.)

Code:

>sp|Q8WZ42|TITIN_HUMAN Titin OS=Homo sapiens GN=TTN PE=1 SV=4
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD
GRAKLTIPAVTKANSGRYSLKATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQ
VRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGDLYSLLIAEAYPEDSGTYSVNATN

Sue Parks · Dec 3, 2015

I am using fortran for the first time. Typically, I would do this type of problem in Python or BioPython. As you can see, the first line is a header. Those are the first 4 lines of the FASTA file of the human gene Titin. I have another FASTA file (with a probe). I would like to read these "sequences" into an array, find the frequency, and also find the number of matches of the probe in the Titin sequence. Technically, the "alphabet soup" is the amino acid sequence that the Titin gene codes for.

Sue Parks · Dec 3, 2015

I am having some difficulty taking my first program and converting that to a subroutine. I was only working the text file. I was trying to get the sequence to print to the screen.

Fortran:

        program aa
   
        character:: filename*20 
        character, allocatable :: baseq(:)
        call readfile("TitinFastaFormat.txt")
       
    end program aa

      subroutine readfile(filename) 
          integer ::eof, tempLength, nBase, i 
          character, allocatable :: baseq(:), tempString(:), newString(:) 
          character:: filename*20 
        character :: stringFile*200
     
      eof=0 
          open (102, file=filename, Form="formatted", status="old" ) 
          Read(102,*) 
          do
              !Read(102,*  iostate = eof) stringFile  ! <---------------------
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(stringFile)
              nBase = nBase + tempLength
          allocate ( tempString(tempLength))
     
          do i=1, tempLength
              tempString(i) = stringFile(i:i)
          end do
     
          allocate (newString (nBase))
          newString (1:size(baseq)) = baseq
          call move_alloc(newString, baseq)
          baseq(nbase - tempLength+1:nBase) = tempString
          deallocate(tempString)
     
          end if 
        end do 
        close(101) 
       
        write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq 
        write (*,"(A6, tr1, i0)") "NBASE:", nbase 

        deallocate (baseq)
   
    end

Sue Parks · Dec 3, 2015

I know the subroutine is long. How do I manipulate the intent to get my array to print?
I fixed this command

Fortran:

end program aa
.
.
.
Read(102,  iostat = eof) stringFile

Mark44 · Dec 3, 2015

Sue Parks said:

I am using fortran for the first time. Typically, I would do this type of problem in Python or BioPython.

I think Python would be a good choice. I don't know anything about BioPython, so don't have any knowledge of its capabilities.

I'll take a look at the latest code you included, and make some comments.

Mark44 · Dec 3, 2015

Sue Parks said:

I know the subroutine is long.

It's not that long...

Sue Parks said:
How do I manipulate the intent to get my array to print?
I fixed this command
Fortran:
end program aa
.
.
.
Read(102,  iostat = eof) stringFile

Let me address your readfile subroutine first, and then I'll come back to this.

Fortran:

subroutine readfile(filename)
    integer ::eof, tempLength, nBase, i
   character, allocatable :: baseq(:), tempString(:), newString(:)
   character:: filename*20
   character :: stringFile*200   ! *** stringFile is declared to hold 200 characters. That's definitely not enough to hold the entire contents of the file I saw.
     
   eof=0
   open (102, file=filename, Form="formatted", status="old" )
   Read(102,*)    ! ***  ? This doesn't do anything -- there's no variable to read into
   do
          !Read(102,*  iostate = eof) stringFile  ! <---------------------
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(stringFile)
              nBase = nBase + tempLength
              allocate ( tempString(tempLength))
    
              do i=1, tempLength
                  tempString(i) = stringFile(i:i)
              end do
    
              allocate (newString (nBase))
              newString (1:size(baseq)) = baseq
              call move_alloc(newString, baseq)
              baseq(nbase - tempLength+1:nBase) = tempString
              deallocate(tempString)
          end if
    end do
   close(101)
       
   write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
   write (*,"(A6, tr1, i0)") "NBASE:", nbase

   deallocate (baseq)

end  ! *** Needs to be end subroutine

I added a few comments, the ones that start with ! ***

The subroutine should have two parameters: a string to hold the filename, and an array that will hold the contents of the file. That array would be declared in your main program. When the readfile sub finishes, the array will be initialized appropriately.

I don't know how big the array should be, but the file you attached is about 36 KB in size. I don't know whether your program needs to be able to hold the entire contents of this text file or not.

Mark44 · Dec 4, 2015

Python:

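def ReadFile(file_name):
    # Read the file line by line and return the lines as a list.
    # Any line that starts with '>' (the FASTA header) is skipped.
    lines = []
    infile = open(file_name, "r")
    while True:
        line = infile.readline()
        if line == "":  # readline() returns an empty string at end of file
            infile.close()
            return lines
        if line[0:1] == '>':
            continue  # Skip the header
        lines.append(line)

def PrintFile(lines):
    for line in lines:
        print(line)

inFile = "TitinFasta.txt"
lines = []
lines = ReadFile(inFile)
PrintFile(lines)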

The "main" part of this Python program is the four lines at the bottom. The lines variable is a list (like an array, but not exactly the same) that will hold the lines of text from the input file.

The ReadFile() function takes as its parameter the name of the file to open, and returns the filled-in list. This function opens the file in read mode, as a text file, and then reads lines of text from the input file. It skips the entire first line. Every other line that is read is appended to the lines list. When the end of the file is reached (readline() returns the empty string, so line == ""), the function returns the lines list.

The PrintFile() function has a list parameter (lines). For each line of text in the lines list, it prints that line.

I'm hopeful that you'll be able to follow this code to see what your Fortran program needs to do (or even abandon your attempt to use Fortran to do this...)

The code above is short and sweet, and seems to work well.

Sue Parks · Dec 4, 2015

I was under the assumption that Fortran read files line by line. The empty Read(102,*) was to skip that line of the file. I may be wrong.

Mark44 · Dec 4, 2015

Sue Parks said:

I was under the assumption that Fortran read files line by line. The empty Read(102,*) was to skip that line of the file. I may be wrong.

I could be wrong (it's been many years since I had a Fortran compiler), but I don't believe that the statement above does anything if there's no variable at the end of the read statement. You could have this
read(102, *) str
where str is declared as a CHARACTER string variable that is large enough to hold the longest line in your file. You don't need to do anything with the str variable, so you're effectively discarding the first line.

jbriggs444 · Dec 4, 2015

My years-old recollection is that the iolist can be empty. That matches the Fortran77 documentation that I can quickly google up.

"Input List
iolist can be empty or can contain input items or implied DO lists"

https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vn79/index.html

A Fortran READ will generally read as many lines as it takes to fill in the variables. But it will always read at least one line. [I found myself often reading a line and then using internal reads to parse the result]
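A small sketch of both points, assuming a compiler that supports newunit= (Fortran 2008). The program name, the scratch file and its contents are invented for illustration: an empty input list discards one record, and an internal read then parses a line that has already been read as text.

```fortran
program skip_and_parse
   implicit none
   character(len=80) :: line
   integer :: unit, ios, a, b

   ! Create a scratch file: one header line, then a line holding two integers.
   open(newunit=unit, status="scratch", form="formatted", action="readwrite")
   write(unit, "(A)") "header line to be discarded"
   write(unit, "(A)") "12 34"
   rewind(unit)

   read(unit, *)                         ! empty input list: skips one record
   read(unit, "(A)", iostat=ios) line    ! read the next record as raw text
   if (ios /= 0) stop "unexpected end of file"
   read(line, *) a, b                    ! internal read: parse the text in memory
   print "(A, I0, A, I0)", "a = ", a, ", b = ", b
   close(unit)
end program skip_and_parse
```

This should print a = 12, b = 34, since the header record is consumed by the READ with no variables.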

Mark44 · Dec 5, 2015

jbriggs444 said:

My years-old recollection is that the iolist can be empty. That matches the Fortran77 documentation that I can quickly google up.

"Input List
iolist can be empty or can contain input items or implied DO lists"

https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vn79/index.html

A Fortran READ will generally read as many lines as it takes to fill in the variables. But it will always read at least one line. [I found myself often reading a line and then using internal reads to parse the result]

Good to know. Thanks for setting me straight @jbriggs444...

Sue Parks · Dec 5, 2015

I'm not sure how to use the subroutine to take in the file and put the characters in an array. I may have some problems declaring the array in the subroutine and in the main program

Mark44 · Dec 5, 2015

Sue Parks said:

I'm not sure how to use the subroutine to take in the file and put the characters in an array. I may have some problems declaring the array in the subroutine and in the main program

Something like this...
Note that the names of the actual arguments (the variables in main and used in the call to the subroutine) can be the same as the names of the formal parameters in the subroutine, but don't have to be.

Fortran:

! Main program
! Declare variables
CHARACTER * 20 :: fileName
CHARACTER * 10000 :: data  ! Can contain 10,000 characters, maybe not enough for your application
.
.
.
! Get a value for fileName
.
.
.
CALL ReadFile(fileName, data)
.
.
.
END PROGRAM

SUBROUTINE ReadFile(fileName, data)
IMPLICIT NONE
CHARACTER * 20, INTENT(IN) :: fileName
CHARACTER * 10000, INTENT(OUT) :: data

   ! Open file
   ! Strip off header line
   ! Read data from file into array

END SUBROUTINE ReadFile

Sue Parks · Dec 5, 2015

ok. Let me see. I have been going in circles.

Sue Parks · Dec 5, 2015

I'm stuck

Fortran:

    program aa
   
        character*20:: filename  = "TitinFastaFormat.txt"
        character*10000 :: baseq
        call readfile(filename, baseq)
        !deallocate
   
    end program aa
   
      subroutine readfile(filename, baseq) 
      implicit none
          integer ::eof,  nBase, i, tempLength 
          character, allocatable ::  tempString(:), newString(:)
          character, intent(in):: filename*20
          character, intent(out):: baseq*10000
        character  :: stringFile*10000
     
          eof=0 
          allocate( baseq)
          open (102, file=filename, Form="formatted", status="old" ) 
          Read(102,*) stringFile
          !Read(102,  iostate = eof) stringFile 
          do
          !Read(102,  iostate = eof) stringFile  
          if (eof .ne. 0) exit
          if(stringFile(1:1) .ne. ">") then
              tempLength = len_trim(string(1:1))
              nBase = nBase + tempLength
              allocate ( tempString(tempLength))
   
              do i=1, tempLength
                  tempString(i) = stringFile(:)
              end do
   
              allocate (newString (nBase))
              newString (1:size(baseq)) = baseq
              call move_alloc(newString, baseq)
              baseq(nbase - tempLength+1:nBase) = tempString
              deallocate(tempString)
          end if
    end do
   close(102)
       
   !write (*,"(A6, tr1, 999999(A1))") "BASEQ:", baseq
   !write (*,"(A6, tr1, i0)") "NBASE:", nbase

   deallocate (baseq)

end  subroutine read file

Mark44 · Dec 5, 2015

I can't tell what you're trying to do in your readfile subroutine. The only things that it should do are 1) open the file, and 2) read the data from the file into an array. After that it's done.

One mistake I see is that you have a line -- nBase = nBase + tempLength
nBase is an uninitialized variable. So the line of code above adds tempLength to a garbage value (nBase) and stores that in nBase, which results in a garbage value.

You have a lot of stuff in your code whose purpose escapes me. What I had in mind is more like the following. Note that I changed the size of the data array to 40000. You need to make the same change for the corresponding array.

I think the following will work (i.e., the single read statement will suck up all of the data), but I don't have a Fortran compiler to test it, so caveat emptor.

Fortran:

subroutine ReadFile(filename, baseq)
   implicit none
   integer ::eof
   character, intent(in):: filename*20
   character, intent(out):: baseq*40000
   eof=0
   open (102, file=filename, Form="formatted", status="old" )
   Read(102,*) baseq
end subroutine ReadFile
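One caveat on the single READ above: a list-directed READ into one character variable takes a single item from a single record and stops at the first blank, so it seems unlikely to pull in the whole file. The sketch below (program and variable names are made up for illustration) uses an internal file holding the header line quoted earlier in the thread to contrast list-directed input with an explicit A format.

```fortran
program token_vs_line
   implicit none
   character(len=80) :: record, token, whole

   ! An internal file stands in for the first record of the FASTA file.
   record = ">sp|Q8WZ42|TITIN_HUMAN Titin OS=Homo sapiens GN=TTN PE=1 SV=4"

   read(record, *) token          ! list-directed: stops at the first blank
   read(record, "(A)") whole      ! A edit descriptor: takes the record as-is

   print "(A)", "list-directed: " // trim(token)
   print "(A)", "A format:      " // trim(whole)
end program token_vs_line
```

Reading the full sequence would then take a loop that reads one record per iteration with the "(A)" format, skips records that start with ">", and appends the rest to the result.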

Sue Parks · Dec 5, 2015

I was trying to remove the header (first line of the file). How would I call the subroutine from the main. I just want to make sure the characters are in an array. I'm close!

output:
Sues-MacBook-Air:fortran sueparks$ gfortran a.f90
Sues-MacBook-Air:fortran sueparks$ ./a.out
Backtrace for this error:
#0 0x10f112092
#1 0x10f1113b0
#2 0x7fff8767ff19
Segmentation fault: 11
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Fortran:

program a
   
        character*20:: filename  
        character, allocatable :: baseq*40000
        call ReadFile("TitinFastaFormat.txt", baseq)
        Write (102,*) baseq ! <-------
   
   
    end program a

    subroutine ReadFile(filename, baseq)
          implicit none
          !integer ::eof
          character, intent(in):: filename*20
          character, intent(out):: baseq*40000

          open (102, file=filename, Form="formatted", status="old" )
          Read(102,*) baseq
          Write (102, *) baseq
          close(102)
end subroutine ReadFile

Mark44 · Dec 6, 2015

I believe your problem comes from not actually allocating storage for your baseq array.

Fortran:

program a
  character*20:: filename 
  character, allocatable :: baseq*40000  ! <--- this line
  call ReadFile("TitinFastaFormat.txt", baseq)
  Write (102,*) baseq !
end program a

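Merely stating that it is "allocatable" doesn't allocate any storage. To do that, you need to have a statement something like
allocate(baseq) following your declaration of the baseq variable. (As declared, baseq is a single 40000-character string rather than an array, so the allocate statement takes no bounds.)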

Ideally your code would determine the size of the file, and then allocate the correct amount of storage (using allocate(baseq(size)) ).

I don't know if that approach is feasible, given your current working knowledge of Fortran, so IMO using "allocatable" and "allocate" seems like extra complication for no gain. Instead of using dynamic array allocation (i.e., using allocatable and allocate), just declare an array that is big enough, like so:

Fortran:

   character*40000 :: baseq
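For reference, sizing the buffer from the file could look roughly like the sketch below. It is only one possible approach, assuming Fortran 2003/2008 features (inquire with size=, stream access, deferred-length strings, newunit=); the file name demo.txt and the program name are invented so the example runs on its own.

```fortran
program slurp_file
   implicit none
   character(len=:), allocatable :: buffer
   integer :: unit, nbytes

   ! Create a small file so the example is self-contained.
   open(newunit=unit, file="demo.txt", status="replace", action="write")
   write(unit, "(A)") ">demo header"
   write(unit, "(A)") "MTTQAPTFTQ"
   close(unit)

   ! Ask the runtime for the file size. The standard reports it in file
   ! storage units; treating those as bytes is an assumption here.
   inquire(file="demo.txt", size=nbytes)
   if (nbytes < 0) stop "file size could not be determined"

   allocate(character(len=nbytes) :: buffer)

   ! Stream access reads the raw bytes, newlines included, in one statement.
   open(newunit=unit, file="demo.txt", access="stream", form="unformatted", &
        status="old", action="read")
   read(unit) buffer
   close(unit)

   print "(A, I0, A)", "Read ", len(buffer), " characters"
end program slurp_file
```

The buffer still contains the header and the newline characters, so they would have to be skipped when the sequence is processed. The fixed-size declaration avoids all of this at the cost of a hard upper limit.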

Sue Parks · Dec 7, 2015

That was extremely helpful! I have another question. I have a short algorithm to help determine the Amino acid frequency. Could you take a look? I am just giving you an example of what I am thinking, mostly because I am not 100% sure how to extract the information from the array (baseq in the case).

Fortran:

String = "ABDJFAAUGBJDK"   ! Just found an example to determine functionality
Write (*,*) String ! checking 

! seems plausible to have some sort of subroutine to count the "elements"
    subroutine countAminoAcids(sumAminoAcids, baseq)
Integer :: sumAminoAcids
! intent (in) :: baseq
! intent (out) sumAminoAcids
    sum all elements
    do i=1, len(array)
        sumAminoAcids = sumAminoAcids + 1
    end do
    end subroutine countAminoAcids
       
! Here is my Data (list of ALL possible amino acids
DATA /'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y'/

! seems plausible to loop through the array to sum (countA) and avg(averageA)
! Is seems easier to do this is one subroutine
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')
        countA = countA + 1
        averageA = countA/sumAminoAcids
       
    else if (String == 'D')
        countD = countD + 1
        averageD = countD/sumAminoAcids
    
end do
end subroutine AAFrequencyTitin

jbriggs444 · Dec 7, 2015

Sue Parks said:
That was extremely helpful! I have another question. I have a short algorithm to help determine the Amino acid frequency. Could you take a look? I am just giving you an example of what I am thinking, mostly because I am not 100% sure how to extract the information from the array (baseq in the case).
Fortran:
! seems plausible to have some sort of subroutine to count the "elements"
    subroutine countAminoAcids(sumAminoAcids, baseq)
Integer :: sumAminoAcids
! intent (in) :: baseq
! intent (out) sumAminoAcids
    sum all elements
    do i=1, len(array)
        sumAminoAcids = sumAminoAcids + 1
    end do
   end subroutine countAminoAcids

You never use baseq. As written, this is equivalent to "sumAminoAcids = len(array)"

Documentation is important. What is this subroutine supposed to do? What are its inputs? What are its outputs?

Fortran:

! seems plausible to loop through the array to sum (countA) and avg(averageA)
! Is seems easier to do this is one subroutine
subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')

If you want to access character number i in String, that would be String(i:i)

averageD = countD/sumAminoAcids

You've declared both countD and sumAminoAcids as integers. What happens in Fortran when an integer division would give a result that is between zero and one?
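A short sketch of the issue, with made-up values: dividing one integer by another truncates toward zero, so any proportion below one comes out as 0 unless an operand is converted to real first.

```fortran
program int_division
   implicit none
   integer :: countD, sumAminoAcids
   real :: fraction

   countD = 3             ! illustrative values, not taken from the data file
   sumAminoAcids = 12

   ! Integer divided by integer truncates toward zero.
   print "(A, I0)", "integer division: ", countD / sumAminoAcids

   ! Converting to real before dividing keeps the fractional part.
   fraction = real(countD) / real(sumAminoAcids)
   print "(A, F6.3)", "real division:    ", fraction
end program int_division
```

The first line prints 0 and the second prints 0.250, which is why averageD would need to be a real variable computed from converted operands.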

Sue Parks · Dec 7, 2015

subroutine countAminoAcids:
input: array (baseq)
output: count of total amino acids

subroutine AAFrequencyTitin
input: array(baseq)
output: frequency count of all amino acids

jbriggs444 · Dec 7, 2015

Sue Parks said:

subroutine countAminoAcids:
input: array (baseq)
output: count of total amino acids

subroutine AAFrequencyTitin
input: array(baseq)
output: frequency count of all amino acids

For documentation, I was looking for something more like...

baseq is a character string containing a sequence of amino acids, coded with one character for each residue in the sequence: "A" for alanine, "C" for cysteine, etc. It may be padded with blanks. It is case sensitive. The A's, C's, etc. must all be in upper case.

SumAminoAcids is the number of amino acids in the chain encoded by baseq, not counting any trailing blanks.

CountA, CountC, etc are the number of the "A"s, "C"s, etc.

Mark44 · Dec 7, 2015

Granted, the code below is Python, but my intent is to show how to write the code in a modular fashion, with each function carrying out a specific task. Each function is passed the information it needs through the parameter list, and, for some functions, returns the processed information for other parts of the code to use.

Python:

def readfile(file_name):
    # Read the given file line by line, and store each line of the file in a list of lines.
    # The file header is skipped.
    # Returns a list containing the lines of the file.
    lines = []
    infile = open(file_name, "r")
    while True:
        line = infile.readline()
        if line[0:1] == '>':
            continue  # Skip the header
        if line == "":  # We've hit the end of the file
            infile.close()
            return lines
        lines.append(line)

def printfile(lines):
    # Print the list of lines of the file.
    for line in lines:
        print(line)

def processData(lines):
    # Process the data by storing amino acid frequencies in a dictionary, a Python data type that consists of (key, value) pairs.
    # Each amino acid is represented by a one-letter key; e.g. 'A'.
    # Each occurrence of a particular amino acid causes the value portion to be incremented.
    # Characters that don't match any amino acid are tracked by a catchall key, Misc.
    # Returns the dictionary and the total count. A peculiarity of Python is its ability to return
    #   more than one thing.
    dataDict = dict(A=0, C=0, D=0, E=0, F=0, G=0, H=0, I=0, K=0, L=0, M=0, N=0, P=0, Q=0, T=0, V=0, W=0, Y=0, Misc=0)
    count = 0

    for line in lines:
        for i in range(len(line)):
            key = line[i:i+1]
            if key in dataDict:
                dataDict[key] += 1
            else:
                dataDict['Misc'] += 1
            count += 1
    return dataDict, count

def printSummary(dataDict, count):
    # Print a summary, with each amino acid, how often it occurred, and its relative proportion overall.
    for key in dataDict:
        print("Amino acid: ", key, "\tCount: ", dataDict[key], "\tProportion: ", dataDict[key]/count * 100, "%")

# main program
fn = "TitinFasta.txt"
lines = []
lines = readfile(fn)
print(lines[0])
dataDict, count = processData(lines)
printSummary(dataDict, count)
print(count)

Output from the above:

Code:

Amino acid:  M  Count:  398  Proportion:  1.1396827214936143 %
Amino acid:  I  Count:  2062  Proportion:  5.904587366130233 %
Amino acid:  F  Count:  908  Proportion:  2.600080178683924 %
Amino acid:  Q  Count:  942  Proportion:  2.697440009163278 %
Amino acid:  K  Count:  2943  Proportion:  8.427352385315846 %
Amino acid:  G  Count:  2066  Proportion:  5.916041463833686 %
Amino acid:  H  Count:  478  Proportion:  1.3687646755626826 %
Amino acid:  V  Count:  3184  Proportion:  9.117461771948914 %
Amino acid:  P  Count:  2517  Proportion:  7.207490979898059 %
Amino acid:  E  Count:  3193  Proportion:  9.143233491781686 %
Amino acid:  Y  Count:  999  Proportion:  2.8606609014374893 %
Amino acid:  D  Count:  1720  Proportion:  4.925262012484966 %
Amino acid:  L  Count:  2117  Proportion:  6.0620812095527175 %
Amino acid:  N  Count:  1111  Proportion:  3.1813756371341846 %
Amino acid:  A  Count:  2084  Proportion:  5.967584903499227 %
Amino acid:  C  Count:  513  Proportion:  1.4689880304679 %
Amino acid:  Misc  Count:  4675  Proportion:  13.386976690911172 %
Amino acid:  W  Count:  466  Proportion:  1.3344023824523223 %
Amino acid:  T  Count:  2546  Proportion:  7.290533188248095 %
34922

Sue Parks · Dec 7, 2015

How could I count the number of characters in my variable baseq?

Mark44 · Dec 8, 2015

Sue Parks said:

How could I count the number of characters in my variable baseq?

baseq(i:i) is the one-character substring that starts at index i.
baseq(i:i+1) would be the two-character substring starting at index i.
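For the count itself, the intrinsic functions len and len_trim are the usual tools: len gives the declared length, and len_trim gives the length without trailing blanks. A minimal sketch, with an invented program name and a shortened baseq for illustration:

```fortran
program string_length
   implicit none
   character(len=40) :: baseq
   integer :: i

   baseq = "MTTQAPTFTQ"      ! the remaining 30 positions are padded with blanks

   print "(A, I0)", "len(baseq)      = ", len(baseq)        ! declared length
   print "(A, I0)", "len_trim(baseq) = ", len_trim(baseq)   ! trailing blanks ignored

   ! Visit each stored character through a one-character substring.
   do i = 1, len_trim(baseq)
      write(*, "(A)", advance="no") baseq(i:i) // " "
   end do
   print *
end program string_length
```

Here len returns 40 and len_trim returns 10, so len_trim is the one to use as the loop bound and as the total when the string is a fixed-size buffer.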

Mark44 · Dec 8, 2015

In post #24 you said this:

Sue Parks said:

! Here is my Data (list of ALL possible amino acids
DATA /'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,E,A,T,V,W,Y'/

Might have been a copy/paste mistake, but the list above has duplicates for E and A, and is missing R and S. The data file that you attached at the beginning of this thread contains numerous R's and S's. The output from my Python code in post #28 doesn't have entries for R and S, which I presume are valid amino acids. I didn't include these two because you didn't list them above.

I have to ask: Is there some reason you're doing this with Fortran? To me, using Fortran in the context of this problem is something like trying to make fine furniture using only a hammer.

Mark44 · Dec 8, 2015

Adding R and S as possible amino acids, and tweaking the output format slightly, this is what I'm now getting. The 'Misc' category now consists only of the newline characters that were in your input textfile.

Code:

Amino acid  Count Proportion
  E    3193  9.143 %
  N    1111  3.181 %
  H    478  1.369 %
  M    398  1.140 %
  W    466  1.334 %
  I    2062  5.905 %
  Q    942  2.697 %
  A    2084  5.968 %
  K    2943  8.427 %
  Y    999  2.861 %
  D    1720  4.925 %
  L    2117  6.062 %
  C    513  1.469 %
  S    2463  7.053 %
  Misc   572  1.638 %
  G    2066  5.916 %
  T    2546  7.291 %
  F    908  2.600 %
  P    2517  7.207 %
  V    3184  9.117 %
  R    1640  4.696 %
Cumulative total percentages: 100.00%
Characters processed:  34922

Sue Parks · Dec 8, 2015

This is a practice simulation in fortran. I have a good foundation in Python. We (YOU & I) know Fortran is not the best way to go about solving this problem, but it can be done.

Mark44 · Dec 8, 2015

Sue Parks said:

This is a practice simulation in fortran. I have a good foundation in Python. We (YOU & I) know Fortran is not the best way to go about solving this problem, but it can be done.

Sure.

Here's what you posted earlier:

Fortran:

subroutine AAFrequencyTitin

Integer:: countA  , averageA  , countC , countD, average D
do i =1, len(String)
    if ( String == 'A')
        countA = countA + 1
        averageA = countA/sumAminoAcids
       
    else if (String == 'D')
        countD = countD + 1
        averageD = countD/sumAminoAcids
    
end do
end subroutine AAFrequencyTitin

Something like this can work, but it needs some changes.
1. The subroutine should have at least one parameter, the string (CHARACTER* xxxxx) that was read earlier in another subroutine.
2. You could use countA, countC, countD, etc. to store the counts of the various amino acids, but you DON'T need separate variables for the relative frequencies. Just keep track of the total number of amino acids, and display countA/totalCount (converted to REAL first, so that the integer division doesn't truncate to zero) for the relative proportion of A, and so on.
3. The string (passed as a parameter) can be read one character at a time, using baseq(i:i). Get the character in the i-th position, and run it through a chain of IF ... ELSE IF ... ELSE IF ... statements, incrementing the appropriate countX when the IF clause is matched.
Once you have cycled through the string, and all of the countX variables are set, you could store these values in an array of suitable size (one-dimensional, with one cell for each amino acid count). That array could be an OUT parameter in your subroutine, which could then be used by some other subroutine, similar to what I did in my Python code.
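A compact sketch of that counting step follows. Instead of the IF ... ELSE IF chain described above, it looks each character up in a string of the 20 standard one-letter codes with the index intrinsic, which is a swapped-in technique rather than what the steps literally describe. The names residue_count, count_residues, alphabet and other are hypothetical, and the test sequence is the first data line of the file quoted earlier in the thread.

```fortran
module residue_count
   implicit none
   ! The 20 standard amino acid one-letter codes.
   character(len=20), parameter :: alphabet = "ACDEFGHIKLMNPQRSTVWY"
contains
   ! Tally each residue in seq; counts(k) belongs to alphabet(k:k).
   ! Characters that are not in the alphabet are added to other.
   subroutine count_residues(seq, counts, other)
      character(len=*), intent(in) :: seq
      integer, intent(out) :: counts(len(alphabet))
      integer, intent(out) :: other
      integer :: i, k

      counts = 0
      other = 0
      do i = 1, len_trim(seq)
         k = index(alphabet, seq(i:i))   ! position in alphabet, or 0 if absent
         if (k > 0) then
            counts(k) = counts(k) + 1
         else
            other = other + 1
         end if
      end do
   end subroutine count_residues
end module residue_count

program demo_count
   use residue_count
   implicit none
   integer :: counts(len(alphabet)), other, k, total
   character(len=*), parameter :: seq = "MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSD"

   call count_residues(seq, counts, other)
   total = sum(counts) + other
   do k = 1, len(alphabet)
      if (counts(k) > 0) then
         ! Convert to real before dividing to avoid integer truncation.
         print "(A1, 2X, I4, 2X, F6.2, A)", alphabet(k:k), counts(k), &
            100.0 * real(counts(k)) / real(total), " %"
      end if
   end do
end program demo_count
```

Keeping the counts in one array indexed by position in alphabet means that adding or removing a code only changes the parameter string, not the loop.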
