Please give me an example of how any indexing works in big data search

  • #1
shivajikobardan
Homework Statement
Indexing and searching big data; the full Lucene indexing process; distributed searching with Elasticsearch
Relevant Equations
none
[Mentor Note -- PF thread and MHB threads merged together below due to MHB forum merger with PF]

I have to learn this in the context of Lucene, but first I want to learn an example of indexing in general.
Something like this:

And I am not finding any Google Books results or PDFs to learn about these topics. I basically need to learn the basics of indexing and searching, the full indexing process with Lucene in detail, Elasticsearch, etc. I haven't googled Elasticsearch yet, but I am not finding much information on the first two. I believe I am not hitting the right Google query. Any feedback would be really helpful, as we don't have an official textbook here and I haven't seen this topic in many general big data books either.
 
  • #3
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

https://d1wqtxts1xzle7.cloudfront.net/61504256/A_survey_on_Indexing_Techniques_for_Big_20191213-17789-1fk1xfk-with-cover-page-v2.pdf?Expires=1657429301&Signature=A~hBFUftKoRMyTR~5Ss6meg4wuT-V63Y3CMyAxN3xaA73eSM6LRK8SiwP2vptGYCocqG2gyP7NGzTixaJofLGf8eKLo01nruK7-9TAsT27iKjY~APa0bJeZM68IRFTi8URlYZ7FZpFqywW9FMZwxQnGpct37CWyAEpiKTGXlkRlLBg8tT2sPy1BnAtb1ZKt~sXeEBgidRlkZzKNQB6DhXuKr9vcnXa0nuaOCIDZQoG0zSo204n4nMRt33WTYQjYWfWnEnyMLZHUBry1on~dRNl-XkFU2M2skFRQ6fapEvVp23m2DMrdeeFThfZbAycs9Ep1HH1s~vUAV4A1FJIpudA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA

The key might break in this link, so try googling for the same result:

"big data index taxonomy"
 
  • #4
jim mcnamara said:
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

[...]
Bad link. Here's one that works:
https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation
 
  • #5
phinds said:
Bad link. Here's one that works:
https://www.researchgate.net/publication/273082158_A_survey_on_Indexing_Techniques_for_Big_Data_Taxonomy_and_Performance_Evaluation
Thanks, but this didn't help. I think what I need is related to information retrieval systems. I'm googling.
 
  • #6
jim mcnamara said:
There is a taxonomy of big data indexes, so it is not as simple as you seem to have assumed:

[...]
Hmm, it should be, but that search didn't turn up many Google results.
 
  • #7
I am very near to getting this. I have understood indexing (as much as I need); now I am close to understanding how searching works (not Elasticsearch, just the basics of how inverted-index searching works). I am confused about the application of step 3.

Here's a relevant video.



This slide deck also covers the concept, but no example is given; I want one example of the manipulation step (though I have an intuitive feeling of what might be going on).

https://slidetodoc.com/modern-information-retrieval-chapter-8-indexing-and-searching-3/
 
  • #8
shivajikobardan said:
I want one example of the manipulation step
Let's say we are searching for 'special relativity'. The retrieval stage may return 500 documents with the words 'special' and 'relativity'; we manipulate the results to place those where the words are close together near the top of the list.
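
Here is a minimal Python sketch of that kind of post-retrieval manipulation, with made-up document ids and word positions (a real engine would pull these positions from a positional inverted index): documents whose query terms sit closest together are sorted to the top.

Code:
# Toy post-retrieval ranking: documents whose matched terms appear
# closest together are moved to the top of the result list.

def proximity_score(term_positions):
    """Smaller score = the two query terms occur closer together."""
    first, second = term_positions
    return min(abs(a - b) for a in first for b in second)

# Hypothetical retrieval output for the query 'special relativity':
# document id -> ([positions of 'special'], [positions of 'relativity'])
results = {
    10: ([4], [5]),       # "... special relativity ..."  -> gap 1
    27: ([12], [340]),    # terms far apart                -> gap 328
    53: ([7, 90], [91]),  # "... special relativity ..."   -> gap 1
}

ranked = sorted(results, key=lambda doc: proximity_score(results[doc]))
print(ranked)  # [10, 53, 27] -- close-together matches come first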
 
  • #9
pbuk said:
Let's say we are searching for 'special relativity'. The retrieval stage may return 500 documents with the words 'special' and 'relativity'; we manipulate the results to place those where the words are close together near the top of the list.
Yeah, it's like that. There is another document that makes it even clearer.
 
  • #10
Indexing is an optimization method, as distinct from actually working with (searching) the indexed data set.

We can use your "Google" search example, but since we don't know everything Google does, let's make our own search engine - say "ShivaFind".
So, as an example, let's imagine we are using our new "ShivaFind" search engine with these search terms: crazy river salad

In the simplest case, ShivaFind could simply read every page on the web, looking for any page that contains all three of those words. That would work, but we wouldn't want to have to wait for the results - we're looking for results in seconds, not years.

So ShivaFind will start reading the web long before we try to search for anything - and it will build up some indices:
* The first index will be a list of all the web pages that have been scanned. Whenever we discover another web page, we will add it to this list. Then, from that point on, we will refer to that page by its position in that list. So if the tenth (10th) web page we discovered is https://www.physicsforums.com/threads/please-give-me-an-example-of-how-any-indexing-works-in-big-data-searches.1044569/, then that string of characters will be added to the list and from that point on we will refer to this page by the number 10. In this case, the index is simply acting like a glossary so that we can convert its abbreviation ("10") into the full URL.
* The second index is the one described in your video. It will be a list of lists. Whenever we find a new word, we will add that word to the list, along with a list of the places where we have found that word (a toy version of both indices is sketched in code below).
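
To make that concrete, here is a toy Python sketch of both indices, assuming everything lives in memory and the "scanned" pages are just hard-coded strings (a real engine would crawl the web and persist these structures to disk):

Code:
# Toy version of ShivaFind's two indices (everything in memory here).
url_list = []     # index 1: a page's position in this list is its page number
word_index = {}   # index 2: word -> list of page numbers (the "list of lists")

def add_page(url, text):
    """Register a newly scanned page and index its words."""
    url_list.append(url)
    page_id = len(url_list) - 1   # from now on, refer to the page by this number
    for word in set(text.lower().split()):
        word_index.setdefault(word, []).append(page_id)

# Pretend we scanned two pages:
add_page("https://example.org/salads", "crazy river salad recipes")
add_page("https://example.org/rivers", "wild river rafting")

print(word_index["river"])                # [0, 1]: both pages contain "river"
print(url_list[word_index["salad"][0]])   # turn a page number back into its URL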

That second index (the list of lists) is our word index. We are going to optimize that list of lists so that when you enter your search terms, we can find the pages very quickly. There are two optimizations we will be applying:

Optimization number 1: We will use a hash index - here's how it will work: Before we start scanning the web and collecting words, we will set up a server with a large RAM storage array divided into 65536 records. Then we will choose a hash function that can take any word and turn it into a number from 0 to 65535. We will then store a small "word index" that includes that word in the record with that number. In our example, let's say the search term "crazy" is encoded in ASCII - so it's 5 bytes. The hash will be some function of those 40 bits - followed by a "modulo 65536" to make sure we stay in the range of 0 to 65535. Our hash of "crazy" might give us the number 2080 - so we add the word "crazy" to record number 2080 along with a disk address where we will store the "crazy" word-page list. Now, if we need to get information about "crazy", we do a quick hash, read its record, and we now have a short list of words that hash to 2080 - and those are the only words we need to search through. Each word in that list includes the disk address that tells us where to get the word-page list for that word. Let's say that the disk address for "crazy" is 123, 4567. If we now go to SSD drive number 123, sector 4567, we will find a list of pages where that word occurs - and those pages are specified using the first index. So, the record at 123,4567 will include the number "10", which is the index to this forum page.
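
Here is a rough Python sketch of that hash-bucket idea. The 65536 RAM records are modeled as a plain list, the hash function and disk addresses are made up for illustration, and the actual bucket number for "crazy" will not literally come out as 2080:

Code:
NUM_BUCKETS = 65536

# The "RAM storage array": each record holds the short list of words that
# hash to that bucket, paired with the disk address of each word's word-page list.
buckets = [[] for _ in range(NUM_BUCKETS)]

def word_hash(word):
    """Turn a word's ASCII bytes into a bucket number in 0..65535."""
    value = 0
    for byte in word.encode("ascii"):
        value = value * 31 + byte   # any reasonable mixing function works
    return value % NUM_BUCKETS

def store_word(word, disk_address):
    buckets[word_hash(word)].append((word, disk_address))

def lookup_word(word):
    """Scan only one short bucket instead of every known word."""
    for stored_word, disk_address in buckets[word_hash(word)]:
        if stored_word == word:
            return disk_address
    return None

store_word("crazy", (123, 4567))   # drive 123, sector 4567 (illustrative)
print(lookup_word("crazy"))        # (123, 4567)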

Optimization number 2: We are optimizing for the search operation - not the data collection/indexing operation. So when we find a new page and begin adding its words to the word-page lists, we will order them in a way that makes combining lists faster. We don't need to discuss exactly what that means - but that hashing trick we just used with words could give us a boost when applied to the entries in the word-page lists.

So for example, "ShivaFind crazy river salad":
Very fast operations:
- Hash "crazy" to 2080, scan its record and find 123,4567.
- Hash "river" to 12345, scan its record and find 44,5555.
- Hash "salad" to 23456, scan its record and find 99,1111.

Fast operations, but they require going out to your big data server:
- Read those three disk records, scanning for 3-way matches.
- For each match, look up the URL from the first index and report it (this whole flow is sketched in code below).
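
Putting the query side together, here is a toy Python sketch. The "disk" is faked with an in-memory dict, the page lists and addresses are invented, and the 3-way match is done with a set intersection (the ordered word-page lists hinted at in optimization number 2 would let a real engine do the same thing as a cheap merge of sorted lists):

Code:
# Fake "big data server": disk address -> sorted list of page numbers.
disk = {
    (123, 4567): [3, 10, 42],    # pages containing "crazy"
    (44, 5555):  [7, 10, 42],    # pages containing "river"
    (99, 1111):  [10, 42, 99],   # pages containing "salad"
}

# word -> disk address, as handed to us by the hash-bucket index above.
addresses = {"crazy": (123, 4567), "river": (44, 5555), "salad": (99, 1111)}

# A slice of the first index: page number -> full URL (only the entries needed here).
urls = {10: "https://example.org/forum-thread", 42: "https://example.org/recipes"}

def search(terms):
    # Very fast in-RAM step: find where each term's word-page list lives.
    page_lists = [disk[addresses[term]] for term in terms]
    # Slower step: read the lists and keep only pages present in all of them.
    matches = set(page_lists[0]).intersection(*page_lists[1:])
    # Convert page numbers back to URLs using the first index.
    return [urls[page] for page in sorted(matches) if page in urls]

print(search(["crazy", "river", "salad"]))   # pages 10 and 42 contain all three words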
 
  • #11
shivajikobardan said:
I have to learn this in the context of Lucene, but first I want to learn an example of indexing in general.
[...]

Here is a write-up on indexed databases.
Interpret "NoSQL" as meaning "Not Only SQL" rather than "No SQL", depending on the database.
https://ils.unc.edu/courses/2018_fall/inls523_004/nosql.pdf
 
  • #12
.Scott said:
Indexing is an optimism method.
I'm not sure what you mean by that, but the rest of the post was a really good summary!

A minor correction: you have repeated "crazy" three times:
.Scott said:
So for example, "ShivaFind crazy river salad":
Very fast operations:
- Hash "crazy" to 2080, scan its record and find 123,4567.
- Hash "crazy" to 12345, scan its record and find 44,5555.
- Hash "crazy" to 23456, scan its record and find 99,1111.
 
  • #13
pbuk said:
I'm not sure what you mean by that, but the rest of the post was a really good summary!

A minor correction: you have repeated "crazy" three times:
I have corrected my spelling of "optimization", fleshed out my point on that, and corrected those "crazy" cut-and-paste errors.

Thanks!
 
  • #14
.Scott said:
Indexing is an optimization method, as distinct from actually working with (searching) the indexed data set.

[...]
Thank you for the information.
 
  • #15
Thread closed for Moderation...
 
  • #16
Thread is re-opened. As a reminder, this is a schoolwork question. Please wait for the OP to actually show their work on this. It was originally misplaced in the technical forums, so that may have confused some of the folks who replied. Thank you.
 
  • #17
Update -- From the dates on this thread, it appears that @shivajikobardan originally started this thread at MHB in their technical forums (they did not have the same schoolwork rules as PF does), and when MHB was merged with PF, this thread ended up in our technical forums. It is now in our schoolwork forums.
 
