Extract all the text between the repeated unique strings in a large file

In summary, the conversation involves the use of Python to extract text between recurring unique strings from a large text file and write it into a CSV file. The desired output is a long column of the extracted text without any other characters. The person is also trying to learn how to do this using BeautifulSoup, but is currently using a regex approach. The ultimate goal is to write a program that can extract desired text from HTML tags and put it into the CSV file along with other desired text.
  • #1
deltapapazulu
How do I use Python to extract all text that comes between a recurring set of unique strings? Example:

ABC;the red dragon;123,
ABC;the blue dragons;123
ABC;the black dwarf;123,
ABC;the gray elves;123
ABC;fantasy characters;123,
ABC;winged balrogs;123

I want to parse a large text file in which non-repeating text sits between repeating marker strings, and write each such piece of text into a CSV file as one long column. The program would open the text file, extract the text, create the CSV file, and write the text into it.

The end result would look like this in the CSV file.

the red dragon
the blue dragons
the black dwarf
the gray elves
fantasy characters
winged balrogs
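
One way to approach this, as a minimal sketch: it assumes each record sits on its own line and that the wanted text is whatever lies between `ABC;` and `;123`. Reading line by line means a large file never has to fit in memory. The function name `iter_between` and the in-memory `io.StringIO` objects (standing in for real files) are illustrative choices, not part of the original question.

```python
import csv
import io


def iter_between(lines, start="ABC;", end=";123"):
    """Yield the text between start and end for each line that has both."""
    for line in lines:
        _, found_start, rest = line.partition(start)
        if not found_start:
            continue  # no start marker on this line
        wanted, found_end, _ = rest.partition(end)
        if found_end:
            yield wanted


# Stands in for open("input.txt", encoding="utf-8"); a file object is
# also an iterable of lines, so the same loop works on a real file.
sample = io.StringIO(
    "ABC;the red dragon;123,\n"
    "ABC;the blue dragons;123\n"
    "ABC;the black dwarf;123,\n"
)

out = io.StringIO()  # stands in for open("out.csv", "w", newline="")
writer = csv.writer(out)
for text in iter_between(sample):
    writer.writerow([text])  # one-element list -> one column per row

result = out.getvalue()
```

Because `iter_between` is a generator, rows are written as they are found rather than collected first, which matters mostly when the input is very large.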
 
  • #2
Can you show us your first attempt at doing this? Please be sure to use code tags when posting your code. Thanks. :smile:

Also, is this for a schoolwork assignment?
 
  • #3
berkeman said:
Can you show us your first attempt at doing this? Please be sure to use code tags when posting your code. Thanks. :smile:

Also, is this for a schoolwork assignment?

Here is the sample text from testReg2.txt:

abc blue candy is poisonous 123
abc green candy is SoCal 123
abc black candy is gothlicious 123

Here is the code attempt:

Python:
import csv
f = open('testReg2.txt','r')
message = f.read()
start = 'abc'
end = '123'

with open('some5.csv', 'w') as q:
    thewriter = csv.writer(q)
    thewriter.writerow((message.split(start))[1].split(end)[0])

The output of this code in the created CSV file is:

,b,l,u,e, ,c,a,n,d,y, ,i,s, ,p,o,i,s,o,n,o,u,s,

I would like the output in the CSV file to be:

blue candy is poisonous
green candy is SoCal
black candy is gothlicious
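
A note on why the attempt above produces one character per column: `csv.writer.writerow` expects a sequence of fields, and a plain string is a sequence of characters, so every character becomes its own field. In addition, indexing with `[1]` and `[0]` keeps only the first match. The sketch below is one possible fix using `re.findall`; the sample text is inlined and `io.StringIO` stands in for the CSV file, both as assumptions for illustration.

```python
import csv
import io
import re

message = (
    "abc blue candy is poisonous 123\n"
    "abc green candy is SoCal 123\n"
    "abc black candy is gothlicious 123\n"
)
start, end = "abc", "123"

# Non-greedy (.*?) stops at the first end marker after each start marker.
# re.escape guards against markers that contain regex metacharacters.
pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end), re.DOTALL)
matches = [m.strip() for m in pattern.findall(message)]

out = io.StringIO()  # stands in for open('some5.csv', 'w', newline='')
writer = csv.writer(out)
for text in matches:
    writer.writerow([text])  # wrap in a list: one field per row
```

When writing to a real file, opening it with `newline=''` (as the csv module documentation recommends) avoids blank lines between rows on Windows.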

-------

This is not for an assignment. I am learning this for personal purposes. It is actually a sub-task that I want to learn as part of some web scraping I am studying at the moment. What I would ultimately like to do is get Python's BeautifulSoup to parse a large HTML file for recurring text that is inside a long 'div' block but outside all other tags in that block. An example of what I ultimately want to do follows; a piece of the large 'div' block appears later in this post. I have written a program that successfully puts every instance of text inside 'strong' tags into a CSV file, one word per row in a single neat column. Here is the code for that, followed later in the post by the piece of HTML that I am using it on:

Python:
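import csv
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it (the "lxml" parser must be installed separately).
url = "https://lrc.la.utexas.edu/eieol_master_gloss/norol/18"
r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

# find_all is the current spelling of the older findAll alias.
nameList = soup.find_all("strong")
with open('some.csv', 'w', newline='', encoding='utf-8') as f:
    thewriter = csv.writer(f)
    for name in nameList:
        # A one-element list puts each headword in its own row.
        thewriter.writerow([name.get_text()])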

A few more things before I post the HTML piece. What I want to do is get all text inside these regular unique character strings:

&gt;</span> endurance <ul>
&gt;</span> otherwise <ul>
&gt;</span> otherwise <ul>
&gt;</span> love <ul>
&gt;</span> Embla <ul>
&gt;</span> Emma <ul>
&gt;</span> but, and; than <ul>

The output in the created CSV file would be:

endurance
otherwise
otherwise
love
Embla
Emma
but, and; than

One problem (perhaps) for a specifically BeautifulSoup method is that there is other text (that I don't want) in the large 'div' tag that also happens to not be inside any other tag. This is why I felt a need to learn a regex routine as a side task: my desired text happens to appear in between these regular unique characters, "&gt;</span>" and "<ul>", but (again) OUTSIDE any other tags inside the large 'div'.

I would like to learn how to do this using something specific to the BS4 package, but I wanted to learn the regex stuff anyway.

HTML:
<strong><span lang='non' class='Unicode'>eljun</span></strong> -
noun, feminine; accusative singular of <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;eljun&gt;</span> endurance
<ul>
<li>
<a href='/eieol/norol/80#glossed_text_gloss_28480'>The Waking of <span class="Unicode" lang="non">Angantýr</span></a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>elligar</span></strong> -
adverb; <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;elligar&gt;</span> otherwise
<ul>
<li>
<a href='/eieol/norol/60#glossed_text_gloss_26756'>from the Tale of <span class="Unicode" lang="non">Bǫðvarr Bjarki</span></a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>elskaði</span></strong> -
verb; 3rd singular past of <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;elska (að)&gt;</span> love
<ul>
<li>
<a href='/eieol/norol/20#glossed_text_gloss_24475'>from the Prologue of <span class="Unicode" lang="non">Snorra Edda</span></a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>Emblo</span></strong> -
proper noun, feminine; accusative singular of <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;Embla&gt;</span> Embla
<ul>
<li>
<a href='/eieol/norol/90#glossed_text_gloss_28894'>from the <span class="Unicode" lang="non">Vǫluspá</span></a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>Emma</span></strong> -
proper noun, feminine; nominative singular of <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;Emma&gt;</span> Emma
<ul>
<li>
<a href='/eieol/norol/70#glossed_text_gloss_27669'>from the Battle of Stamford Bridge</a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>en</span></strong> -
conjunction; <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;en&gt;</span> but, and; than
<ul>

Longer term goal is to write a program that puts the word in 'strong' tags into the CSV along with the other desired text right next to the 'strong' tag words. Here is ultimately what I want to write a program to do (reference the HTML to see what I am after):

(tab delimited)

eljun (tab) endurance
elligar (tab) otherwise
elskaði (tab) love
Emblo (tab) Embla
Emma (tab) Emma
en (tab) but, and; than
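
A rough sketch of how the headword/gloss pairs could be pulled out with a single regular expression. It assumes every entry follows exactly the layout of the HTML fragment in this post: a `<strong><span ...>` headword, then later `&gt;</span>`, the gloss, and `<ul>`. An entry that breaks this layout would be mismatched, so this is a pattern-matching shortcut rather than real HTML parsing. The output file name in the comment is made up, and `io.StringIO` stands in for the file.

```python
import csv
import html
import io
import re

page = """<strong><span lang='non' class='Unicode'>eljun</span></strong> -
noun, feminine; accusative singular of <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;eljun&gt;</span> endurance
<ul>
<li>
<a href='/eieol/norol/80#glossed_text_gloss_28480'>The Waking of <span class="Unicode" lang="non">Angantýr</span></a>
</li>
</ul>
<strong><span lang='non' class='Unicode'>en</span></strong> -
conjunction; <span style='white-space: nowrap' lang='non' class='Unicode'> &lt;en&gt;</span> but, and; than
<ul>
"""

# Group 1: headword inside <strong><span ...>...</span></strong>.
# Group 2: gloss between "&gt;</span>" and the next "<ul>".
entry = re.compile(
    r"<strong><span[^>]*>(.*?)</span></strong>"
    r".*?&gt;</span>\s*(.*?)\s*<ul>",
    re.DOTALL,
)

# html.unescape converts any character entities left in the captured text.
pairs = [(html.unescape(w), html.unescape(g)) for w, g in entry.findall(page)]

# Stands in for open('glossary.tsv', 'w', newline='', encoding='utf-8').
out = io.StringIO()
writer = csv.writer(out, delimiter="\t")  # tab-delimited output
writer.writerows(pairs)
```

Using `csv.writer` with `delimiter="\t"` instead of joining strings by hand means a gloss that happens to contain a tab or a quote character would still be written safely.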
 
  • #4
You should be using an xml parser not trying to parse it line by line. I’ve done that when I need a small portion of easily extractable data. However, I think in this case, you should learn how to use an xml parser to get what you want.

Web pages are structured documents not unlike a textual outline. However, sometimes there will be comment tags that block out certain html tagging that your program might try to process. In contrast, an xml parser will bypass those comment or JavaScript blocks. There are many other reasons to use the parser too.

You might find one here:

https://wiki.python.org/moin/PythonXml
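
To make the parser suggestion concrete, here is a small sketch using `html.parser.HTMLParser` from the Python standard library (a tolerant HTML parser, not a strict XML one). The class name `GlossParser` and the shortened HTML fragment are assumptions made for illustration. The idea is to start capturing loose text after each `</span>` and keep it only if the next opening tag is `<ul>`.

```python
from html.parser import HTMLParser


class GlossParser(HTMLParser):
    """Collect the loose text that sits between a </span> and the next <ul>."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True turns &gt; into > for us
        self.glosses = []
        self._buffer = None  # None means "not capturing"

    def handle_endtag(self, tag):
        if tag == "span":
            self._buffer = []  # start capturing after every </span>

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.glosses.append(text)
        self._buffer = None  # any opening tag ends the capture

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    # Comments arrive in handle_comment(), which does nothing by default,
    # so commented-out markup never reaches handle_data().


fragment = """<!-- <span>x</span> ignored <ul> -->
<strong><span lang='non'>elligar</span></strong> -
adverb; <span lang='non'> &lt;elligar&gt;</span> otherwise
<ul>
<li><a href='#'>from the Tale of <span lang="non">Bǫðvarr Bjarki</span></a></li>
</ul>
<strong><span lang='non'>Emma</span></strong> -
proper noun; nominative singular of <span lang='non'> &lt;Emma&gt;</span> Emma
<ul>
"""

parser = GlossParser()
parser.feed(fragment)
parser.close()
```

After the run, `parser.glosses` holds the collected glosses. The commented-out markup on the first line is skipped without any extra code, which is the kind of case a raw string search would trip over.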
 
  • #5
jedishrfu said:
You should be using an xml parser not trying to parse it line by line. I’ve done that when I need a small portion of easily extractable data. However, I think in this case, you should learn how to use an xml parser to get what you want.

Web pages are structured documents not unlike a textual outline. However, sometimes there will be comment tags that block out certain html tagging that your program might try to process. In contrast, an xml parser will bypass those comment or JavaScript blocks. There are many other reasons to use the parser too.

You might find one here:

https://wiki.python.org/moin/PythonXml

yeah but I still want to know how to parse a file of text to grab and write (into new file) some text between a recurring set of unique character strings.

There's got to be some simple C-like algorithm that is able to grab the desired text from a file and put it into a new file in some desired format, e.g.:

desired text
other desired text
more desired text


example source:
blah sentence BBB desired text CCC carp fishing madness BBB other desired text CCC clap and sing everybody crazy cooter BBB more desired text CCC more stuff
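
For reference, a minimal sketch of such a scan in Python, written in a deliberately C-like style with `str.find` and explicit positions instead of regular expressions. The function name `extract_between` is made up for this example.

```python
def extract_between(text, start, end):
    """Return every stretch of text found between start and end markers."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no more start markers
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break  # start marker without a closing end marker
        results.append(text[begin:finish].strip())
        pos = finish + len(end)  # resume scanning after the end marker
    return results


source = (
    "blah sentence BBB desired text CCC carp fishing madness "
    "BBB other desired text CCC clap and sing everybody crazy cooter "
    "BBB more desired text CCC more stuff"
)
found = extract_between(source, "BBB", "CCC")
```

Each item in `found` can then be written to the output file on its own line. The same loop translates almost directly to C or C++ using `strstr` or `std::string::find`.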

A little bio on me. I am in my mid-40s, I take care of an elderly person, and I have a LOT of extra time on my hands at the moment. At the beginning of the year I made a New Year's resolution to use a lot of this free time to learn two programming languages, specifically C++ and Python. I have been gradually working through some C++ and Python books, making occasional excursions into C, JavaScript, and C#. I very much do NOT want to get side-tracked by 'full-stack' development type stuff, although it has been very tempting. My goal is to spend at least a year focusing mostly on "programming" specifically, not web-dev or server management.
 
  • #6
Well with time on your hands then you need to write an xml parser or something similar.

This means you need to read character by character to find where comments are defined and then ignore them. You need to identify HTML tags by their angle brackets and, based on the tag, expect to find an end tag or not (HTML doesn’t follow XML rules exactly). Tags are defined within tags, so some recursive programming is needed; then you will be able to separate text from tagging.

Lastly, you need some conversions when you find HTML character entities such as &gt; for the greater-than sign, among others. You may also need to worry about UTF-8 strings, although I think reading the input as a string of characters will handle that eventuality unless you’re reading raw bytes.
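
As a small illustration of both points, Python's standard library already covers them: `html.unescape` converts character entities, and decoding bytes as UTF-8 once (for example with `open(path, encoding="utf-8")`) yields ordinary text. The sample strings below are adapted from the HTML earlier in the thread and are only illustrative.

```python
import html

# Entities such as &lt;, &gt; and &eth; become the characters they name.
raw = "&lt;elska (a&eth;)&gt;</span> love"
decoded = html.unescape(raw)

# Bytes on disk versus text in memory: decode once, at the file boundary.
as_bytes = "elskaði".encode("utf-8")  # 'ð' takes two bytes in UTF-8
as_text = as_bytes.decode("utf-8")
```

Note that `html.unescape` only touches entities; the `</span>` tag in the sample is left as it is.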
 
  • #7
jedishrfu said:
Well with time on your hands then you need to write an xml parser or something similar.

This means you need to read character by character to find where comments are defined and then ignore them. You need to identify HTML tags by their angle brackets and, based on the tag, expect to find an end tag or not (HTML doesn’t follow XML rules exactly). Tags are defined within tags, so some recursive programming is needed; then you will be able to separate text from tagging.

Lastly, you need some conversions when you find HTML character entities such as &gt; for the greater-than sign, among others. You may also need to worry about UTF-8 strings, although I think reading the input as a string of characters will handle that eventuality unless you’re reading raw bytes.
I will eventually need to get into XML and XML parsing related stuff. But I am kind of moving off the BeautifulSoup stuff at the moment, today I started learning some C++11 regex material. I am going to stick with that for a few days.

Just curious. What if a page like UT Austin's Old Norse page doesn't have an XML file? (Kind of a noob question, I know.) You can't parse an XML file that either doesn't exist or that you don't have access to.

So I am officially asking you. On this exact page. Where is the XML file?

https://lrc.la.utexas.edu/eieol_master_gloss/norol/18
 
  • #8
It’s an HTML page. You can look at its page source in your browser. You will see comments, HTML tagging, textual content, and possibly JavaScript code, with references to CSS and JavaScript libraries. HTML is a close relative of XML rather than a strict subset of it (XHTML is the XML-conforming variant), with some anomalies depending on the version level the site uses and on the browsers supported.

Read up on xml and html on Wikipedia to get a better idea. All the things you’ve mentioned like regular expressions are good to know but don’t get fixated on one approach to solve your problem. I’ve written code like you’re proposing using the same kinds of things only I know that there are cases that will trip you up as you parse more html pages and eventually you will come to understand why xml parsing is probably the best in the long term.

My favorite language has been AWK. I often fall back on it when I need to do something quickly and for the short term. Eventually, though, I have to reorganize my code and switch to a more formal approach to overcome some problem, because I know that while AWK might do it, the code will look absolutely dense and incomprehensible.

One such example was using AWK to process a binary file which is way outside the normal use case for AWK. I researched it, found an example of how one programmer did it and decided that’s cool but fraught with too many gotchas.
 

FAQ: Extract all the text between the repeated unique strings in a large file

What does "Extract all the text between the repeated unique strings" mean?

This means that in a large file, there are certain strings that appear more than once, and you need to extract the text that falls between these repeated strings.

Why is it important to extract all the text between repeated unique strings?

Extracting this text can help in identifying patterns or important information that may be hidden within the file. It can also help in organizing and analyzing the data more efficiently.

Can this task be done manually?

Yes, it is possible to manually extract the text between repeated unique strings, but it can be time-consuming and prone to errors. Using a computer program or script can make the task more efficient and accurate.

What are the common challenges when extracting text between repeated unique strings in a large file?

One of the main challenges is identifying and differentiating between the unique strings. Sometimes, there may be variations in the strings due to formatting or spelling errors, which can make it difficult to extract the correct text. Another challenge may be dealing with large amounts of data, which can slow down the process.

Are there any specific software or tools that can help with this task?

Yes. In Python, the standard library's re module handles pattern matching between marker strings and the csv module writes the results; for HTML sources, a parser such as Beautiful Soup is the usual choice. There are also online tools and text-processing programs that can be used for this purpose.
