Most efficient way to randomly choose a word from a file with a list of words

  • Python
  • Thread starter: Wrichik Basu
  • Tags: File, Random
In summary, the most efficient way to randomly choose a word from a file containing a list of words involves reading the file to determine the total number of words, generating a random index within that range, and then retrieving the word at that index. This method minimizes memory usage while ensuring a uniform distribution of random selection. For large files, using a streaming approach can further optimize performance by avoiding the need to load the entire file into memory.
  • #1
Wrichik Basu
Science Advisor
Insights Author
Gold Member
I have a file consisting of a bunch of words, one word per line. For example, consider this webpage, or this one, except that instead of being online, the .txt file is stored locally on my PC. My aim is to pick a word at random from this list that starts with a specific given letter. Say I want a random word that starts with "n" or "p". Both upper and lower case are acceptable.

What is the most efficient and fast approach to pick out a word with a given starting letter from this massive list in Python (3.12, if the version matters)? Reading a file takes time, so either I have to read the whole file into a list/set and keep it in memory (which will be expensive memory-wise), or I can write the entire file to a MySQL/SQLite3 database. I want to use this in a bot that will stay online all the time, so I want response times to be as short as possible. I also want a small memory overhead, since the host I use charges more if I need more memory. So in that sense, keeping the words in an indexed database will probably be faster. But I am not sure how to select a word that starts with a particular letter from a database.

As always, any ideas are appreciated.
 
  • #2
One could preprocess the list, sorting the words into distinct files like a-words, b-words, c-words, ...

The given letter then determines which word list to load into an array, and you use Python's random module to pick an index into the array.
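A minimal sketch of that idea, assuming the master list is a local words.txt with one word per line (the file names here are hypothetical):

Python:
import random
from collections import defaultdict

def split_by_letter(master_path="words.txt"):
    """One-time preprocessing: write every word into a per-letter file
    such as a-words.txt, b-words.txt, ..."""
    buckets = defaultdict(list)
    with open(master_path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                buckets[word[0].lower()].append(word)
    for letter, words in buckets.items():
        with open(f"{letter}-words.txt", "w", encoding="utf-8") as out:
            out.write("\n".join(words) + "\n")

def random_word(letter):
    """Load the (small) per-letter file and pick a random index into it."""
    with open(f"{letter.lower()}-words.txt", encoding="utf-8") as f:
        words = f.read().splitlines()
    return random.choice(words)

split_by_letter()   # run once
print(random_word("n"))

The split is done once; each request then only has to read one small file.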
 
  • Like
Likes Vanadium 50 and Wrichik Basu
  • #3
jedishrfu said:
One could preprocess the list, sorting the words into distinct files like a-words, b-words, c-words, ...

The given letter then determines which word list to load into an array, and you use Python's random module to pick an index into the array.
Interesting; I can probably do that in a database too! One column with the first letter of each word, another column with the word itself. Then proceed as shown in this answer on SO:
SQL:
SELECT words FROM table_name WHERE initial_letter = 'a' ORDER BY RANDOM() LIMIT 1;
 
  • #4
Wrichik Basu said:
What is the most efficient and fast approach to pick out a word with a given starting letter from this massive list in Python (3.12, if the version matters)?
If speed is the primary consideration, load the entire file into memory and construct a Python dictionary from it that maps each letter to a function that takes no arguments and returns a randomly chosen word from the list of words in your file that start with that letter. That will give you the fastest response since once everything is initialized all you are doing is a dictionary lookup and a single function call, with no file or database access required.

However, you say:

Wrichik Basu said:
a small memory overhead
What exactly counts as "small"? And how does that compare to the memory that would be needed to do it the way I described above? A lot will depend on how large your file of words is, since that will determine how much memory it takes to store all the necessary Python objects.
 
  • #5
PeterDonis said:
What exactly counts as "small"?
Less than 256 MB to be specific. Your idea is good, I will check how much memory is needed if I need to do it that way.
 
  • #6
Wrichik Basu said:
I want to use this in a bot that will stay online all the time, so I want response times to be as short as possible.
One thing to note here is that if your bot is online and you are talking to it over an Internet connection, network latency will probably dominate your response time--i.e., you most likely won't even be able to see the difference in response time between different implementations (in-memory vs. file access vs. database), unless you use a database implementation where the database file is not local to the bot but has to be accessed over a separate network connection. (For SQLite this wouldn't be an issue since you would just co-locate the SQLite file and the bot's code, but for something like MySQL you might not be able to guarantee that the bot and the MySQL server will be running on the same machine.)

Also, one way to reduce network latency, at least assuming both ends have reliably good Internet connections, is to use a websocket to talk to the bot. Setting up the initial websocket connection will have the same latency as a normal HTTP connection, but after that subsequent queries, as long as the websocket remains active, should be quicker because no handshake is required.
 
Last edited:
  • #7
I have built a list of words, sorted like the examples in the OP. It is only about 3.4 MB. Thanks for the links, I will compare them against mine to see what I am missing.

Preprocess the file to find the first position of each initial letter.
Then generate a random number and interpolate proportionally into the block of words with that first letter.
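A sketch of that approach, assuming a case-insensitively sorted words-sorted.txt with no blank lines (the file name is hypothetical). Because lines vary in length, landing on a random byte slightly favours longer words; as noted further down the thread, padding every record to the same length removes that bias.

Python:
import random

def build_letter_index(path="words-sorted.txt"):
    """One pass over the sorted file, recording the byte range
    [start, end) occupied by each initial letter."""
    index = {}
    with open(path, "rb") as f:
        pos = 0
        for line in f:
            letter = chr(line[0]).lower()
            start, _ = index.get(letter, (pos, pos))
            index[letter] = (start, pos + len(line))
            pos += len(line)
    return index

def random_word(letter, index, path="words-sorted.txt"):
    """Seek to a random byte inside the letter's block, discard the line we
    landed in, and return the next full line (wrapping to the block start)."""
    start, end = index[letter.lower()]
    with open(path, "rb") as f:
        f.seek(random.randrange(start, end))
        f.readline()              # skip the (probably partial) current line
        if f.tell() >= end:       # ran off the end of the block: wrap around
            f.seek(start)
        return f.readline().decode("utf-8").strip()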
 
  • Like
Likes Wrichik Basu
  • #8
Some other tricks to consider:
- have an HTML list selector where the user selects the first letter
- download those words to the web page
- do the work in the browser
 
  • Like
Likes Wrichik Basu
  • #9
PeterDonis said:
For SQLite this wouldn't be an issue since you would just co-locate the SQLite file and the bot's code, but for something like MySQL you might not be able to guarantee that the bot and the MySQL server will be running on the same machine.
I am using SQLite so that the database file is local to the bot. My host has an option of a MySQL database (free of charge), but I don't use it because of issues related to backing up the data. With an SQLite file, I can just download the file to create a backup.
PeterDonis said:
Also, one way to reduce network latency, at least assuming both ends have reliably good Internet connections, is to use a websocket to talk to the bot. Setting up the initial websocket connection will have the same latency as a normal HTTP connection, but after that subsequent queries, as long as the websocket remains active, should be quicker because no handshake is required.
It's a Discord bot, and I am not really sure if Discord allows one to configure how the connection will be made. Will have to search that a bit.
 
  • #10
This seems trivial, as a list of words is something very small, database-wise, with only 25 000 items (FreeBSD list) or even 10 000 items (MIT list). I downloaded the largest list to my computer into a file named words-freebsd.txt, to which I added the word word as the first line so that the CSV import uses it as the column name (the file size is 210 kB). Then I ran sqlite3 words.txt to create a new SQLite database (I chose the .txt extension to be able to upload it on PF). In SQLite, I executed this simple code:

SQL:
.mode csv
.import words-freebsd.txt word

Voilà! I ended up with a 385 kB database with the table word with one column named word. I went further by creating an index (I'm not even sure how much faster it can make a query, but still):

SQL:
CREATE UNIQUE INDEX word_index ON word (word);

Now my database is at 778 kB and I can run a simple query like this one to get what you want (also possible without the index):

SQL:
SELECT word FROM word WHERE word LIKE 'a%' ORDER BY RANDOM() LIMIT 1;

It is lightning fast. So small, I can even attach the database to this post! I'm pretty sure your 256 MB RAM will be sufficient. :wink:
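For completeness, a minimal sketch of running that query from Python with the standard-library sqlite3 module, using the database, table and column names created above (SQLite's LIKE is case-insensitive for ASCII by default, which suits the OP's "either case is acceptable" requirement):

Python:
import sqlite3

con = sqlite3.connect("words.txt")   # the SQLite database file created above

def random_word(letter):
    """Let SQLite pick one random row whose word starts with the given letter."""
    row = con.execute(
        "SELECT word FROM word WHERE word LIKE ? || '%' ORDER BY RANDOM() LIMIT 1;",
        (letter,),
    ).fetchone()
    return row[0] if row else None

print(random_word("a"))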

EDIT: Apparently, PF loads the .txt file when I attach it to the post but doesn't make it available.
 
  • Like
Likes Wrichik Basu
  • #11
SQL of any form is unnecessary here; @jedishrfu's / @Baluncore's solution is a good one.

If your word lists are truly enormous, so that even a numpy array index does not fit in memory, preprocess the list so that each word is padded to equal length; then you only need to store the number of entries for each initial letter, or alternatively use a separate index file.
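A rough sketch of the fixed-width idea, assuming plain-ASCII words that are all shorter than the chosen record width and a sorted input file with no blank lines (file names and record width are illustrative only):

Python:
import random

RECORD = 32  # bytes per record, newline included

def pad_file(src="words-sorted.txt", dst="words-fixed.txt", width=RECORD):
    """Rewrite the sorted list with every line padded to the same width and
    return {letter: (index of first record, number of records)}."""
    counts = {}
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for i, line in enumerate(fin):
            word = line.strip()
            first, n = counts.get(word[0].lower(), (i, 0))
            counts[word[0].lower()] = (first, n + 1)
            fout.write(word.ljust(width - 1) + "\n")
    return counts

def random_word(letter, counts, path="words-fixed.txt", width=RECORD):
    """With equal-length records, a uniform record number maps to an exact offset."""
    first, n = counts[letter.lower()]
    with open(path, "rb") as f:
        f.seek((first + random.randrange(n)) * width)
        return f.read(width).decode("ascii").strip()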
 
  • Informative
Likes Wrichik Basu
  • #12
To add some anecdotal info to @pbuk's post:

While working on a data-mining project some years ago, we would score a SQL star-schema fact table in IBM DB2, but found that we could double the speed of scoring by dumping the table to a flat file, scoring it there, and then updating the scoring field in each fact-table record.

At the time, we were scoring records using 250+ attributes of each record, stored in the database as a star schema, i.e. one central fact table and several ancillary dimension tables. These all had to be collected together to compute a score inside the database engine, which can become quite costly.

We would dump the star schema to a flat file resembling a spreadsheet (a la comma-separated values) and then score the flat file. Scores were then reloaded into the database.

Star schemas are great for reporting type queries but not so much for datamining work.
 
  • Like
  • Informative
Likes pbuk and Wrichik Basu
  • #13
I may have missed it being mentioned already, but for a file or set of files too large to keep in memory, remember that you can seek and read from a file without storing the full file content in memory. E.g. if you want an approximate random line from a large file with one word per line then you could seek to a random position in the file and then take the next (or previous if no next) full line. This works even if the file is unordered. If you need the first letter to be fixed, it is likely most efficient, as already mentioned, to split the single file into one file per initial letter; this, too, can be done without sorting and without storing the full content in memory, as sketched below.
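For the splitting part, a small sketch of a single streaming pass that never holds more than the current line in memory (file names are hypothetical):

Python:
# One pass over the big file, one open output handle per initial letter.
outputs = {}
with open("words.txt", encoding="utf-8") as f:
    for line in f:
        word = line.strip()
        if not word:
            continue
        letter = word[0].lower()
        if letter not in outputs:
            outputs[letter] = open(f"{letter}-words.txt", "w", encoding="utf-8")
        outputs[letter].write(word + "\n")
for out in outputs.values():
    out.close()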
 
  • #14
You would do well by reading:

THE ART OF COMPUTER PROGRAMMING
Donald E. Knuth
Volume 3 / Sorting and Searching
Chapter 6, SEARCHING

ISBN 0-201-03803-X
Addison-Wesley Company, Reading Massachusetts

The copyright is 1973, but the series was considered the "Bible" for many years, and is still a useful reference.

The other volumes are:
Vol. 1 FUNDAMENTAL ALGORITHMS
Vol. 2 SEMINUMERICAL ALGORITHMS
And Vol. 4, published much later; I don't recall the title
.
.
.
Just did a Google search:
https://www.google.com/search?hl=en&q=the+art+of+computer+programming
which turned up PDF versions, and Vols. 4a, 4b, 5

Have Fun!

Cheers,
Tom
 
  • #15
jedishrfu said:
Star schemas are great for reporting type queries but not so much for datamining work.
Yes, the key word in RDBMS is relational - if you don't have relational data (e.g. customers/orders/products/stock levels) then you don't need SQL.
 
  • #16
True. To be clear we were building tools for a data warehouse system where query is king. The scoring was done periodically as data was updated during batch runs.

To use or not use a relational database is a design issue. It makes a lot of sense for large amounts of data or data that is highly interconnected (where star schemas shine) that you have to query a lot, but not so much for small amounts of data where you can spend more time doing full scans of the data to get what you want.

10,000 items was the magic number for me where I considered doing things in a database. Also, if I felt the application might grow into something more, then I'd consider a relational database. So much depends on speed, storage, future growth, and optimization that there is no hard and fast rule for which approach is best, especially for systems where the customer says they're going to use it for X but then you find they are more likely to use it for Y.

In one system I designed I learned a hard lesson in optimization. We were to keep the data for ten days. The data record was 200 or so data points updated every hour coming in from multiple sources. The most common query was for only 10 data points of interest from each record from a given source.

I designed for that setup, breaking up the data points into small individual chunks so one record was expanded into 200 rows... This approach minimized network traffic, since only the 10 data-point rows of interest over a given time period were sent to the display app. I added indexing tables to speed up the search, and things worked great until the ten-day mark.

At ten days, the delete logic kicked in and thrashing began: the database was being updated every hour with 200 new rows while the delete logic searched for and deleted 200 old rows, along with the associated index-table entries. The system slowed to a crawl.

I finally realized I had to give up the notion of storing each data point in its own row and let the display app pick out the data points it needed from the list of 200 points it got back. Network traffic wasn't appreciably affected, but the database updates and deletes were dramatically sped up.

A most humbling experience with my project manager (non-programmer) fretting about it over my shoulder and my manager who trusted in my ability to fix it quickly.
 
Last edited:
  • #17
Filip Larsen said:
E.g. if you want an approximate random line from a large file with one word per line then you could seek to a random position in the file and then take the next (or previous if no next) full line.
But that would bias selections towards longer words.

Unless you make them all the same length by padding them.

In which case you can seek to exactly the right position which is the method suggested in #2 and #7.
 
  • Like
Likes hutchphd
  • #18
jedishrfu said:
True. To be clear we were building tools for a data warehouse system where query is king. The scoring was done periodically as data was updated during batch runs.
Oh sorry, my comment
pbuk said:
the key word in RDBMS is relational - if you don't have relational data (e.g. customers/orders/products/stock levels) then you don't need SQL.
was intended for the context of this thread, not your interesting anecdote.

jedishrfu said:
to use or not use a relational database is a design issue. It makes a lot of sense for large amounts of data or data that is highly interconnected (where star schemas shine) that you have to query a lot
IMHO the real strength of an RDBMS is not when you have to query a lot (if all you want to do is read, you are usually better off with indexed static files of redundant data); it is when you have data that is added to or updated a lot, so that flat files won't work.
 
  • Like
Likes jedishrfu
  • #19
Wrichik Basu said:
What is the most efficient and fast approach….
You are asking the wrong question here. You don’t want the fastest approach, you want one that is fast enough. Decide how many lookups a second you’ll be doing and how much time you can spend on each lookup, and then choose the simplest implementation that meets those requirements.
 
  • Like
  • Informative
Likes hutchphd, Wrichik Basu, Filip Larsen and 2 others
  • #20
pbuk said:
But that would bias selections towards longer words.
Indeed, that is why I called it "approximate random". But my point was mainly to highlight the option of seeking within a file without having to load all of its content, an option that is often overlooked nowadays.
 
  • Like
Likes jedishrfu
  • #21
Or one could build an ISAM file, which is one step above a plain random-access file: you build an index of where each first letter starts so you can jump there and begin reading.

https://en.wikipedia.org/wiki/ISAM

ISAM files use a paging scheme where each page holds several records and there are indexing pages that point to these pages.

Each common programming language should have a library of ISAM functions to implement this style of data storage.
 
  • Like
  • Informative
Likes Wrichik Basu and Tom.G
  • #22
jedishrfu said:
Each common programming language should have a library of ISAM functions to implement this style of data storage.
I'm not sure this is true. ISAM is typically part of the "under the hood" implementation of databases like MySQL or Berkeley DB, and the programming language libraries or bindings that exist are for those higher level implementations, not for ISAM itself. There are actual ISAM libraries for C and C++, I believe, but not necessarily for other programming languages.
 
  • #23
(Thanks, I'm aware the thread's been dormant a bit; I was actually following along, but got distracted, then "vacationed" for a while. Anyways...)

Given that the "Python" requirement appears to be a hosting-service security mandate, rather than a best-fit for the application (and maybe the OP), there's no inherent advantage in following that language's default workflow.

The simplest functional coding core is to hit up random bytes in the file until a <newline> is met, then check the first letter of the word for a match. Less complicated than a sequential run-through, even.

Wrapped inside a "randomized binary search" algorithm, it can easily meet "fast enough" requirements for a user-input-->find-a-random-word-->display-word sequence.

No prep or initialization required; no complicated, massive data structures, or overhead beyond a couple of nested loops; memory footprint is pretty much just <filesize>. If there's only going to be a handful of queries, it's also outright fastest in terms of machine time (except Larsen's random seek --> sequential seek, which is a bit more complex).

For a smoother run, adding a persistent steering table stops processor cache thrashing cold after the first few queries, but I'm not sure Python can do that elegantly. (Interestingly, the COBOL code is uncharacteristically compact.)

With that in place, query/returns can be into the double-digits while any method without a preexisting external database is still tying its shoelaces.

PeterDonis said:
If speed is the primary consideration, load the entire file into memory and construct a Python dictionary from it that maps each letter to a function that takes no arguments and returns a randomly chosen word from the list of words in your file that start with that letter. That will give you the fastest response since once everything is initialized all you are doing is a dictionary lookup and a single function call, with no file or database access required.

Could you expound on that a bit? Or, a lot. I can parse "store the words from the file into an internal table", but the rest seems to be a jumble.

Doesn't matter of course: the OP's specs aren't refined enough to determine much of anything. Those two example .txt's are structurally different from each other; it's a good bet that a possible 256 MB-challenging file will be majorly different from both of them in some ways.
 
Last edited:
  • #24
hmmm27 said:
hit up random bytes in the file until a <newline> is met, then check the first letter of the word for a match.
This would favor longer words since a random byte is more likely to end up on a line with a longer word than a shorter one. As I understand it, the OP's intention was for the probability of selection to be uniform over words (although it's true, as you point out, that the OP's specs are rather vague).

hmmm27 said:
I can parse "store the words from the file into an internal table", but the rest seems to be a jumble.
Dictionary; the Python dictionary is what other languages call a hash map. The keys in the map are letters. When the user has picked a starting letter, the code looks up that letter in the dictionary; what is returned is a function that takes no arguments and, when called, returns a randomly selected word from among all the words in the file that start with the given letter.

Constructing the functions that appear in the dictionary is straightforward: read in the file line by line and construct a dictionary that maps each starting letter to a list of words in the file that start with that letter. Then, once the file is entirely read, just replace each list with a function that randomly selects a word from it (Python makes such a function easy to construct as a closure over the list).
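A minimal sketch of that construction, assuming the word list is a local words.txt (the file and function names are hypothetical):

Python:
import random
from collections import defaultdict

def build_pickers(path="words.txt"):
    """Map each initial letter to a zero-argument function that returns a
    uniformly random word starting with that letter."""
    by_letter = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            word = line.strip()
            if word:
                by_letter[word[0].lower()].append(word)
    # Replace each list with a closure over it; the default argument binds
    # the current list to each lambda.
    return {letter: (lambda words=words: random.choice(words))
            for letter, words in by_letter.items()}

pickers = build_pickers()        # done once at startup
print(pickers["n"]())            # per request: one lookup and one call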

As long as there is enough memory to store the entire data structure, this would be faster, even in Python, than any solution that requires accessing the file on disk every time a word selection is made, since disk lookups are orders of magnitude slower than in-memory operations.
 
  • #25
I think memory usage is a red herring here. Let's say we want to choose from 1.5 million words (see https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words) with average length 20 bytes (should be enough for UTF-8 encoding of most languages), plus we are going to need an index with 4 bytes per word. That's a total of 36 MB.
  1. How much memory do you think Python uses?
  2. How much memory do you think the Apache/LiteSpeed/NGINX etc. reverse proxy that sits in front of Python uses?
  3. How much memory does your web host allocate in its standard package?
The answer to all of these is "a lot more than 36MB"

So just read the words into one huge string and build an int32 numpy array of word offsets for each initial letter.
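A rough sketch of that layout, assuming the whole list fits comfortably in memory and lives in a local words.txt (names are hypothetical):

Python:
import random
import numpy as np

def build_index(path="words.txt"):
    """Read the whole file into one string and build, for each initial letter,
    an int32 array of the offsets at which that letter's words start."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    offsets = {}
    pos = 0
    for line in text.splitlines(keepends=True):
        word = line.strip()
        if word:
            offsets.setdefault(word[0].lower(), []).append(pos)
        pos += len(line)
    return text, {k: np.asarray(v, dtype=np.int32) for k, v in offsets.items()}

def random_word(letter, text, index):
    """Pick a random stored offset and slice the word out of the big string."""
    start = int(random.choice(index[letter.lower()]))
    end = text.find("\n", start)
    return text[start:end if end != -1 else None].strip()

text, index = build_index()
print(random_word("p", text, index))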
 
Last edited:
  • Like
Likes Vanadium 50
  • #26
pbuk said:
I think memory usage is a red herring here.
Yes.

If you have 256 MB of memory space, you have enough for the whole dictionary. The 1.5M word one seems extreme to me, but even using that, you have space for the whole dictionary, many times over. On the slowest HDD you are likely to find, loading this will take about 2 seconds.

Using a more realistic dictionary size (no jargon - do we really want to use, e.g., drug names as words?), the size would be 5-10% of that, which means it would load 10-20 times faster.

Assuming the list is sorted, you keep track of the first and last entries starting with a given letter, and throw a random number between these bounds.
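A small sketch of that bookkeeping with the standard-library bisect module, lower-casing everything for simplicity (an assumption, since the OP accepts either case) and assuming at least one word exists for the requested letter:

Python:
import bisect
import random

# Load once at startup and sort so that each letter forms one contiguous block.
with open("words.txt", encoding="utf-8") as f:
    words = sorted(w.strip().lower() for w in f if w.strip())

def random_word(letter):
    """Find the [first, last) bounds of the letter's block and pick inside them."""
    letter = letter.lower()
    lo = bisect.bisect_left(words, letter)
    hi = bisect.bisect_left(words, chr(ord(letter) + 1))
    return words[random.randrange(lo, hi)]

print(random_word("n"))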
 
  • #27
pbuk said:
I think memory usage is a red herring here.
The reason for not reading the whole file would probably be speed and not so much memory. But while postponing optimization until it makes sense to measure actual performance is a fine principle, it does not imply people should design their algorithms without a single care for memory or performance (as some seem to think because, hey, memory is cheap).
 
  • #28
Filip Larsen said:
The reason for not reading the whole file would probably be speed and not so much memory.
This is a web bot: the whole file is read into memory once and then stays there for hours/days/weeks, easily serving multiple requests per second if necessary, even with a basic shared hosting service (providing that does not breach the terms of the web host of course: some explicitly forbid this sort of thing).

Reading only part of the file on each request would not be a good idea.
 
  • Like
Likes PeterDonis
  • #29
pbuk said:
some explicitly forbid this sort of thing
Some even make it impossible by forcing you to use a framework in which the program that responds to requests is separately executed for each request.
 
  • #30
pbuk said:
This is a web bot
OK, hadn't noticed the discussion had moved to be about a persistent web service.

And just to make sure: my point in this thread has so far only been that if whatever partial information is needed from a large file can be retrieved without reading and parsing the whole file and storing it in memory first, then it will likely be more performant not to load the whole file into memory, even if the API in question makes this dead simple to do and all the examples you find for loading a file use that simple pattern. For cases where it is not possible to do what you want without reading everything into memory first, my comment is obviously not relevant. I guess I feel a need to say this because I am tired of maintaining code written by people who are apparently oblivious to how their code performs and yet still surprised when it doesn't scale well.
 
  • #31
Filip Larsen said:
if whatever partial information is needed from a large file can be retrieved without reading and parsing the whole file and storing it in memory first, then it will likely be more performant not to load the whole file into memory
Even leaving aside the web bot issue, the OP's requirements, as far as I can see, cannot be met without loading and parsing the whole file in some way, since you have to randomly select a word. If the file is sorted by initial letter (as at least one of the ones linked to in the OP is), you might be able to get away with just loading the portion of it containing words beginning with the chosen letter to randomly select a word, but even for that you would have to know what portion that is in advance, i.e., you would have had to load the entire file and parse it in order to generate the information about what portion of the file contains words starting with each letter. You could do the latter in a pre-processing step and store the results in a second file, I suppose.
 
  • #32
PeterDonis said:
cannot be met without loading and parsing the whole file
Sure it can. In the Bad Old Days this was done all the time. You get your letter, say Q, and the file header tells you where in the file the Q's start, and you start reading from there.

However, just because you can do it this way does not mean you should. The data file is not large - it is small. An unreasonable version is ~10 MB and a reasonable one 5-10% of that. One floppy disk (remember those?).

This is not a lot of data, and one should not attack the problem as if it were.
 
  • #33
Vanadium 50 said:
the file header
If there is one that contains the necessary data. In the examples the OP linked to, there wasn't.
 
  • #34
Vanadium 50 said:
One floppy disk (remember those?).
I do, yes. My first PC only had floppy drives (two of them), and I had things configured to use a RAM disk for frequently used files because loading them from floppy was so slow.
 
  • Like
Likes jedishrfu
  • #35
Fortunately, HDDs are faster than floppies. They can read that much data in a fraction of a second. I have a small dictionary as part of a program (12K words), and the loading time is a small fraction of a second.

If instead of "header" I wrote "index", would that be clearer? The OP's dictionary need not be a flat file. It can have a more complex structure, like an index at the front. This is at least a 50-year-old solution.

Fundamentally, though, this is not a lot of data. Treating it as if it were is unlikely to lead to the optimal solution.
 
