Comp Sci ElasticSearch document types removed, why?

shivajikobardan · Jul 19, 2022

The answer is this:
Because we can't declare a field with different data types in different document types within the same index.

Say there's an index called "college".
Then there are document types called "student", "teacher", "administration" and "staff".
What problem would occur if we allowed this?

Books and the documentation say that if a field called "date_of_join" is given the "text" data type in "student", then we can't give "date_of_join" the "date" data type in "staff".

They say that it's due to the way Lucene works:

This is because of the way Lucene maintains the field types in an index. As Lucene manages fields on an index level, there is no flexibility to declare two fields of different data types in the same index.

But this is not clear without an example (of how Lucene stores an index). Can you guys clarify this?
I know that Lucene stores inverted indexes, though. But still I'm not clear.
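
To make the quoted claim concrete, here is a minimal sketch in Python. It is a toy model, not Lucene's actual data structures: the class `ToyIndex` and its method `put_mapping` are made up for illustration. The only thing it borrows from the quote is the rule that field types are kept once per index, so every document type in that index shares them.

```python
# Toy model only: ToyIndex and put_mapping are hypothetical names,
# not part of Lucene or Elasticsearch.
class ToyIndex:
    def __init__(self, name):
        self.name = name
        # One entry per field name for the whole index,
        # shared by every document type stored in it.
        self.field_types = {}

    def put_mapping(self, doc_type, properties):
        for field, field_type in properties.items():
            existing = self.field_types.get(field)
            if existing is not None and existing != field_type:
                raise ValueError(
                    f"field '{field}' is already mapped as '{existing}' in "
                    f"index '{self.name}'; cannot map it as '{field_type}' "
                    f"for document type '{doc_type}'"
                )
            self.field_types[field] = field_type


college = ToyIndex("college")
college.put_mapping("student", {"date_of_join": "text"})

try:
    college.put_mapping("staff", {"date_of_join": "date"})
    conflict = None
except ValueError as err:
    conflict = str(err)

print(conflict)
```

In this sketch the second `put_mapping` call fails, because "date_of_join" already has a type at the index level. Giving each document type its own index would mean each one gets its own `field_types` table.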

pbuk · Jul 19, 2022

shivajikobardan said:

Homework Statement: Why did we remove multiple document types within an index in ElasticSearch?

The answer is this:
Because we can't declare a field with different data types in different document types within the same index.

One index can contain multiple document types, no problem.
Different document types can have fields with different names with different types, no problem.
But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).

Rather than remove one or both of the student and staff documents you could of course simply change one of the field names, e.g. to date_of_join_text, or there is an even better solution using pre-processing: can you think what this is?

shivajikobardan · Jul 19, 2022

pbuk said:

One index can contain multiple document types, no problem.

Different document types can have fields with different names with different types, no problem.

But if different documents have fields with the same name they must be of the same type: this is obvious if you think about it (an index is essentially an ordering, you can't put things in both alphabetical order and date order at the same time).

I genuinely don't see a problem.
Say
date_of_join of student is "2022/2/2" (this is text format)
date_of_join of staff is "2011-11-11" (this is date format, just assume)
What's the problem?
Here's the inverted index:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.

I don't see any problem. What's the problem? (Maybe some problem could occur when trying to parse it, though, as the code might only accept dates in yyyy-mm-dd format.) Other than that, I see no problem like the one the text is claiming.

pbuk said:

Rather than remove one or both of the student and staff documents you could of course simply change one of the field names, e.g. to date_of_join_text, or there is an even better solution using pre-processing: can you think what this is?

I can't think of other preprocessing ideas.

pbuk · Jul 19, 2022

shivajikobardan said:

Here's the inverted index-:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11.

That's a forward index, not an inverted index.
Here's the forward index after a few more documents:

date_of_join=>doc1,2022/2/2; doc2,2011-11-11; doc3,02/02/22; doc4,1/2/22; doc5,31/1/2022; doc6,1/31/2022; doc7,2022-02-31; doc8,Last Tuesday; doc9,Not provided...

How do you think the following query is going to handle that?

Code:

GET /_search
{
  "query": {
    "range": {
      "date_of_join": {
        "gte": "now-1y/M"
      }
    }
  }
}
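
A rough way to see what a date range filter is up against is to try reading those nine values as dates. The sketch below is plain Python, not Elasticsearch. It assumes a strict yyyy-mm-dd parser and a fixed cutoff of 2021-07-01, which is roughly what "now-1y/M" would give on the date of this thread. The helper name `parse_strict` is made up.

```python
from datetime import date, datetime

# The values from the forward index above, one per document.
values = {
    "doc1": "2022/2/2",
    "doc2": "2011-11-11",
    "doc3": "02/02/22",
    "doc4": "1/2/22",
    "doc5": "31/1/2022",
    "doc6": "1/31/2022",
    "doc7": "2022-02-31",
    "doc8": "Last Tuesday",
    "doc9": "Not provided",
}


def parse_strict(value):
    """Return a date for yyyy-mm-dd input, or None if it cannot be read."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


# Assumed stand-in for "now-1y/M"; a real cluster would compute this itself.
cutoff = date(2021, 7, 1)

parsed = {doc: parse_strict(value) for doc, value in values.items()}
matches = [doc for doc, d in parsed.items() if d is not None and d >= cutoff]
rejected = [doc for doc, d in parsed.items() if d is None]

print("matches:", matches)
print("not comparable as dates:", rejected)
```

Under these assumptions only doc2 is readable as a date, and it is older than the cutoff, so nothing matches. Several of the other documents were probably meant to be recent dates, but as free text they cannot be compared with the cutoff, and values like 1/2/22 are ambiguous between day-first and month-first anyway.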

pbuk · Jul 19, 2022

shivajikobardan said:

I can't think of other preprocessing ideas.

One idea could be to map date_of_join to two fields: date_of_join_date (type date) and date_of_join_text (type text).
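
As a sketch of what that pre-processing step might look like before a document is sent for indexing: the function name `split_date_of_join` is made up, the two field names are the ones suggested above, and the strict yyyy-mm-dd format is an assumption.

```python
from datetime import datetime


def split_date_of_join(doc):
    """Copy doc, replacing date_of_join with a text field and, when the
    value parses as yyyy-mm-dd, an additional normalised date field."""
    raw = doc["date_of_join"]
    out = {key: value for key, value in doc.items() if key != "date_of_join"}
    out["date_of_join_text"] = raw  # always keep what was entered
    try:
        parsed = datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError:
        return out  # no date field for values that are not clean dates
    out["date_of_join_date"] = parsed.isoformat()
    return out


print(split_date_of_join({"name": "staff member", "date_of_join": "2011-11-11"}))
print(split_date_of_join({"name": "student", "date_of_join": "Last Tuesday"}))
```

With this split, a range query can target date_of_join_date and only ever sees clean dates, while the original input stays searchable as text in date_of_join_text.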

shivajikobardan · Jul 19, 2022

pbuk said:

That's a forward index, not an inverted index.

A term-document arrangement is called an inverted index, though. What's the term here? I guess it is the value, i.e. 2022-02-02? (They use BKD trees for numeric data types, though.)

pbuk · Jul 19, 2022

shivajikobardan said:

A term-document arrangement is called an inverted index, though. What's the term here? I guess it is the value, i.e. 2022-02-02?

You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.

I am not sure what is going wrong here, these concepts should not be difficult to grasp. Perhaps you should take a break.

shivajikobardan said:

(They use BKD trees for numeric data types, though.)

There is no numeric type involved here, just date and string. And the implementation details are entirely irrelevant to using Elasticsearch - they could change to a KDB tree and it wouldn't change anything.

shivajikobardan · Jul 19, 2022

pbuk said:

You wrote this: date_of_join=>doc1,2022/2/2.

Even if you are confused as to what a 'term' is, I think it's pretty clear which is a document, and if we are looking for a term-document index then the document must come second.

"Document" is defined differently in different contexts:
1) a document in Elasticsearch
2) a document as most people understand the word
Term-document means a figure like the one I posted above.
Like this:

So anything with the document ID on the right side would be inverted, because the classic technique in data mining and similar fields was to use a document-term arrangement, which would be sparse here.

date_of_join=>doc1,2022/2/2

Since the document ID is on the right side, it should be an inverted index, shouldn't it?
But according to the above analogy, I'd think

2022/2/2->doc1

should more accurately be the inverted index.

pbuk · Jul 19, 2022

shivajikobardan said:

Since the document ID is on the right side, it should be an inverted index, shouldn't it?

We are talking about doc1 -> 2022/2/2 where the document id is clearly on the left side.

shivajikobardan said:

I'd think 2022/2/2->doc1 should more accurately be the inverted index.

Yes, that was my point, which is the exact opposite of what you have been saying.
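
For anyone following along, the difference between the two directions can be sketched in a few lines of Python. This treats each whole value as a single term and adds a made-up doc3 that shares doc1's value, so that one term points at more than one document.

```python
from collections import defaultdict

# Forward index: document ID -> value of date_of_join.
# doc3 is invented here to show two documents sharing a term.
forward = {
    "doc1": "2022/2/2",
    "doc2": "2011-11-11",
    "doc3": "2022/2/2",
}

# Inverted index: term -> list of document IDs containing it.
inverted = defaultdict(list)
for doc_id, value in forward.items():
    inverted[value].append(doc_id)

for term, doc_ids in inverted.items():
    print(term, "->", doc_ids)
```

Looking up a value in `inverted` gives the matching documents directly, whereas answering the same question from `forward` means scanning every document.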
