Microsoft Technical Interview question

btb4198 · Sep 12, 2022

I had a Microsoft technical interview this past Friday. The question I was asked was this: How do you find the middle value for a dataset that is too big to fit in RAM?
I was not able to figure this out during the interview, but I have been looking into it all weekend. I read something online that said it can be done in O(N) using something called the counting sort histogram algorithm (I did not learn that in my advanced data structures and algorithms class). I have watched some YouTube videos on it, but I still do not get how it could be used to find the middle number in a dataset too big for RAM. I read something that said once you have all the values counted up in your dictionary, you can scan it to find the index of the first non-zero element, and that is where your middle value will be, but I really do not get that.

For example, these 3 sets:
set 1: {6, 2, 7, 3, 4, 3, 5, 4, 1}
set 2: {6, 4, 2, 5, 8, 1, 7, 3, 9}
set 3: {4, 2, 9, 8, 7, 6, 3, 5, 1}

your dictionary would be :

index	1	2	3	4	5	6	7	8	9
count	3	3	4	4	3	3	3	2	2

I read somewhere that the index of the first non-zero element is your center. In this case, it would be 1.
So you do
(9 - 1) / 2 = 4. (I think this is the equation for the median, but I could be wrong.)

The 9 comes from the fact that there are 9 elements in each of the 3 sets. So, if we subtract the index of the first non-zero element, 1, from the number of elements, 9, we get 8. Then we divide that by 2 to get our 4.
And yes, 4 is the middle value, but I do not get how what I just did somehow found the middle value. I am very lost on the math behind this.
Can someone please explain? Also, did I even do this right? Again, I never learned this in school, so this whole problem is completely new to me.

jedishrfu · Sep 12, 2022

Not knowing the problem is the whole point of the interview. The interviewer wants to see how you perform under stress, how you approach the problem and how you explain your reasoning.

Of course, it doesn’t hurt to actually solve it, even if you do so after the interview and send it back to them. It also doesn’t hurt to ask them in your thank-you letter how it was supposed to be solved, or for some reference to follow up on the problem.

Baluncore · Sep 12, 2022

The median of medians algorithm can operate on a big dataset in a small memory.

btb4198 · Sep 12, 2022

Baluncore said:

The median of medians algorithm can operate on a big dataset in a small memory.

I am sorry, but is that what I am doing? I feel like there are a lot of algorithms I am just finding out about because of this one Microsoft interview. Is there like a list of the algorithms one needs to know for a FAANG interview? Also, did I just do that right?

Vanadium 50 · Sep 12, 2022

btb4198 said:

Is there like a list of the algorithms one needs to know for a FAANG interview?

So you don't have to contaminate your mind with any algorithms that "won't be on the test"? If I were hiring, that would be pretty much the worst possible answer.

My answer - and I have never heard of median of medians, although I can guess at how it works - would be "I would start by looking it up in Knuth. (Shows I know what to do if I don't know something) My first thought is probably non-optimal: build an index by looking at subsets of the data that fit in memory, and merge the indices. Then pick out the median." (Shows I can come up with an answer, and I recognize the answer has its flaws)

.Scott · Sep 12, 2022

From the problem statement, it is unclear whether the "middle value" refers to the entire median record or just the collating key of the median record. It is likely that they are presumed to be the same.

The only problem with attempting to do this purely with a histogram is that the range and size of the collating keys may exceed what you can fully store in the histogram. For example, if you are collating Social Security number (SSN) by city by color, 1 billion histogram bins will handle the SSN, but won't touch the city field. You could make it 4 billion bins and separate every city into four bins by the first letter, but you will still fail if every entry in your database is SSN=123456789 and every city starts with T through Z.

But risking that issue, the histogram method would simply involve:
1) Setting up the histogram bins. Every bin will be a record count and perhaps a record index. If the record index is used, the index to the first (or last) record landing in that bin will be kept. This array of bins must be initially zeroed.
2) Read through every record and use the collating key to determine in which histogram bin it belongs.
3) Keep a count of the number of records.
4) Increment that record count in that bin and, if you need a record index, store the index in that bin as well.
5) Scan through the counts in your histogram until you reach the halfway point of your record count.
6) If you land on a bin with only 1 count, or if your bins can fully resolve the collating range, you are done.
7) Otherwise, you are close. If you want to be more precise, start over, this time limiting the range of the full histogram to the range of that single bin you landed on the first time.

If you limit the number of iterations to a fixed number (say 4), then this method would be O(n). Otherwise, it would be O(n · Ln(k)), where Ln(k) is the logarithm of the length of the collation key.
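The steps above can be sketched in C# roughly as follows. This is only a minimal sketch under stated assumptions: the keys are plain integers known to lie in a range `[lo, hi]`, the dataset can be re-read from the start as many times as needed (the `readAll` callback is a hypothetical stand-in for that), and only the lower median is returned. The class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class HistogramRefinement
{
    // Returns the lower median: the key at 0-based rank (n - 1) / 2 in sorted order.
    // readAll is a hypothetical callback that re-reads the whole dataset on each call.
    // Assumes every key lies in [lo, hi] and binCount >= 2.
    public static long LowerMedian(Func<IEnumerable<long>> readAll, long lo, long hi, int binCount)
    {
        // Step 3: count the records (done here as a separate pass for clarity).
        long n = 0;
        foreach (long key in readAll()) n++;
        if (n == 0) throw new InvalidOperationException("The dataset is empty.");

        // Rank of the wanted key among the keys that fall inside [lo, hi].
        long rank = (n - 1) / 2;

        while (true)
        {
            long span = hi - lo + 1;
            long width = (span + binCount - 1) / binCount; // keys per bin, rounded up

            // Steps 1, 2 and 4: zeroed bins, then one pass that counts keys per bin.
            var counts = new long[binCount];
            foreach (long key in readAll())
            {
                if (key >= lo && key <= hi) counts[(key - lo) / width]++;
            }

            // Step 5: scan the bins until the running total passes the wanted rank.
            long seen = 0;
            int bin = 0;
            while (seen + counts[bin] <= rank)
            {
                seen += counts[bin];
                bin++;
            }

            // Step 7: narrow the range to the bin we landed on and adjust the rank.
            rank -= seen;
            lo += bin * width;
            hi = Math.Min(hi, lo + width - 1);

            // Step 6: a bin that holds exactly one key value fully resolves the median.
            if (width == 1) return lo;
        }
    }
}
```

With 3 bins over the range 1 to 9, for example, the first histogram pass narrows the range to 4..6 and the second resolves the exact value. Each pass reads the whole dataset once, so the cost is the number of passes times n.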

btb4198 · Sep 12, 2022

.Scott said:

From the problem statement, it is unclear whether the "middle value" refers to the entire median record or just the collating key of the median record. It is likely that they are presumed to be the same.

The only problem with attempting to do this purely with a histogram is that the range and size of the collating keys may exceed what you can fully store in the histogram. For example, if you are collating Social Security number (SSN) by city by color, 1 billion histogram bins will handle the SSN, but won't touch the city field. You could make it 4 billion bins and separate every city into four bins by the first letter, but you will still fail if every entry in your database is SSN=123456789 and every city starts with T through Z.

But risking that issue, the histogram method would simply involve:
1) Setting up the histogram bins. Every bin will be a record count and perhaps a record index. If the record index is used, the index to the first (or last) record landing in that bin will be kept. This array of bins must be initially zeroed.
2) Read through every record and use the collating key to determine in which histogram bin it belongs.
3) Keep a count of the number of records.
4) Increment that record count in that bin and, if you need a record index, store the index in that bin as well.
5) Scan through the counts in your histogram until you reach the halfway point of your record count.
6) If you land on a bin with only 1 count, or if your bins can fully resolve the collating range, you are done.
7) Otherwise, you are close. If you want to be more precise, start over, this time limiting the range of the full histogram to the range of that single bin you landed on the first time.

If you limit the number of iterations to a fixed number (say 4), then this method would be O(n). Otherwise, it would be O(n · Ln(k)), where Ln(k) is the logarithm of the length of the collation key.

1) He wanted to know the value at the middle of the list if you were to sort them in Ascending order.
If the dataset size was even, then you would take the two middle values, add them together, and divide by 2.

2) The values were integers from range "1 to 1000 " being pulled from a massive database of unknown size;

btb4198 · Sep 12, 2022

Vanadium 50 said:

So you don't have to contaminate your mind with any algorithms that "won't be on the test"? If I were hiring, that would be pretty much the worst possible answer.

My answer - and I have never heard of median of medians, although I can guess at how it works - would be "I would start by looking it up in Knuth. (Shows I know what to do if I don't know something) My first thought is probably non-optimal: build an index by looking at subsets of the data that fit in memory, and merge the indices. Then pick out the median." (Shows I can come up with an answer, and I recognize the answer has its flaws)

So I did that; it is called "the brute force solution". The interviewer did not care at all.
He asked me what the time complexity was. I said O(N^2), and he immediately replied with something like "that is inefficient and unfeasible, what is a better solution?" He did not even let me code the Brute Force solution. I just told him what I was going to try. I got the strongest impression that he expected me to know these algorithms and how to Apply them. lol Because he literally said something like that to me.
I found a post on another site, where someone else was asked this same question here
and, reading all the possible solutions, it occurred to me that all the hints the interviewer was giving me came from the different algorithms in that post. But I had never heard of any of them until literally Saturday night, when I found that post.
"Like a histogram" was one of his hints, and I told him that I coded one many years ago for a job interview,
but I never knew that a histogram could be used to sort a list. He said something like they are looking for people who can apply different algorithms to different problems.
IDK

btb4198 · Sep 12, 2022

.Scott said:

From the problem statement, it is unclear whether the "middle value" refers to the entire median record or just the collating key of the median record. It is likely that they are presumed to be the same.

The only problem with attempting to do this purely with a histogram is that the range and size of the collating keys may exceed what you can fully store in the histogram. For example, if you are collating Social Security number (SSN) by city by color, 1 billion histogram bins will handle the SSN, but won't touch the city field. You could make it 4 billion bins and separate every city into four bins by the first letter, but you will still fail if every entry in your database is SSN=123456789 and every city starts with T through Z.

But risking that issue, the histogram method would simply involve:
1) Setting up the histogram bins. Every bin will be a record count and perhaps a record index. If the record index is used, the index to the first (or last) record landing in that bin will be kept. This array of bins must be initially zeroed.
2) Read through every record and use the collating key to determine in which histogram bin it belongs.
3) Keep a count of the number of records.
4) Increment that record count in that bin and, if you need a record index, store the index in that bin as well.
5) Scan through the counts in your histogram until you reach the halfway point of your record count.
6) If you land on a bin with only 1 count, or if your bins can fully resolve the collating range, you are done.
7) Otherwise, you are close. If you want to be more precise, start over, this time limiting the range of the full histogram to the range of that single bin you landed on the first time.

If you limit the number of iterations to a fixed number (say 4), then this method would be O(n). Otherwise, it would be O(n · Ln(k)), where Ln(k) is the logarithm of the length of the collation key.

Step 4 has : Increment that record count in that bin and, if you need a record index, store the index in that bin as well.
I do not understand. I know that in counting sort you :

Modify the count array such that each element at each index stores the sum of previous counts.
- Index: 0 1 2 3 4 5 6 7 8 9
- Count: 0 2 4 4 5 6 6 7 7 7
The modified count array indicates the position of each object in the output sequence

is that what you mean ? or do you mean something else? can you show an example ?

.Scott · Sep 12, 2022

btb4198 said:

1) He wanted to know the value at the middle of the list if you were to sort them in Ascending order.
If the dataset size was even, then you would take the two middle values, add them together, and divide by 2.

2) The values were integers from range "1 to 1000 " being pulled from a massive database of unknown size;

I skipped over the "even record count" issue - only because it made the description difficult.
So you are simply computing the median. So your histogram will simply be an array with 1000 counts.

When you're done, count through half the records. If it's odd, just pick the histogram bin number.
If it's even but both values are in the same bin, pick the bin number.
If it's even and the two values straddle bins, average the two bin numbers.
You will always be at O(n).
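The odd/even rules above can be sketched in C# as follows. This is a minimal sketch, assuming the values arrive as a stream of integers from 1 to 1000 (the range btb4198 described) that is read exactly once; the names `StreamingMedian` and `Median` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class StreamingMedian
{
    // Computes the median of a stream of integers in [1, maxValue] using one counter
    // per possible value. Memory use depends on maxValue, not on the stream length.
    public static double Median(IEnumerable<int> values, int maxValue = 1000)
    {
        var counts = new long[maxValue + 1]; // index 0 is unused
        long n = 0;
        foreach (int v in values)
        {
            if (v < 1 || v > maxValue)
                throw new ArgumentOutOfRangeException(nameof(values), "Value outside the assumed range.");
            counts[v]++;
            n++;
        }
        if (n == 0) throw new InvalidOperationException("The dataset is empty.");

        // 0-based ranks of the two middle elements; they coincide when n is odd.
        long lowRank = (n - 1) / 2;
        long highRank = n / 2;

        int low = -1, high = -1;
        long seen = 0;
        for (int v = 1; v <= maxValue; v++)
        {
            seen += counts[v];
            if (low < 0 && seen > lowRank) low = v;       // bin holding the lower middle element
            if (seen > highRank) { high = v; break; }     // bin holding the upper middle element
        }

        // Same bin: the bin number itself. Straddling two bins: their average.
        return (low + high) / 2.0;
    }
}
```

For the list 1, 2, 3, 4 the two middle elements fall in bins 2 and 3, so the result is 2.5. The single pass over the data is O(n), and the final scan touches at most `maxValue` counters regardless of n.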

.Scott · Sep 12, 2022

btb4198 said:

Step 4 has : Increment that record count in that bin and, if you need a record index, store the index in that bin as well.
I do not understand. I know that in counting sort you :

Modify the count array such that each element at each index stores the sum of previous counts.

Index: 0 1 2 3 4 5 6 7 8 9

Count: 0 2 4 4 5 6 6 7 7 7

The modified count array indicates the position of each object in the output sequence

is that what you mean ? or do you mean something else? can you show an example ?

As originally described, we were looking through a dataset - not just a list of integers.
So it was unclear whether he was looking for a dataset entry - or just the collation (sort key) value.
My description intended to handle both cases.

So the table I was compiling would have looked like this:

bin # (index)	1	2	3	4	5	6	7	8	...
Count	12	99	555	9872	1234	5555	343	2	0
data set index	98675	13455	32232	77655	44322	11234	56789	56789

In the case above, the median sort key value would be 4. A record that holds such a value can be found in your dataset at position 77655.

Of course, if it is just a list of integers that you're working on, then all of the dataset "records" that match "4" will be nothing other than the number "4" - and you probably wouldn't care about the 77655.
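A table like the one above could be built along these lines. This is a rough sketch, not a definitive implementation: it assumes the sort keys are integers from 1 to `maxKey`, uses a record's 0-based position in the stream as the "data set index", keeps the first record seen for each bin, and reports only the lower median. All names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class MedianRecordLocator
{
    // Returns the median key together with the position of the first record
    // that carried that key. Keys outside [1, maxKey] are rejected.
    public static (int Key, long RecordIndex) Find(IEnumerable<int> keys, int maxKey)
    {
        var counts = new long[maxKey + 1];     // "Count" row of the table
        var firstIndex = new long[maxKey + 1]; // "data set index" row of the table
        long n = 0;
        foreach (int key in keys)
        {
            if (key < 1 || key > maxKey)
                throw new ArgumentOutOfRangeException(nameof(keys), "Key outside the assumed range.");
            if (counts[key] == 0) firstIndex[key] = n; // remember the first record in this bin
            counts[key]++;
            n++;
        }
        if (n == 0) throw new InvalidOperationException("The dataset is empty.");

        long rank = (n - 1) / 2; // 0-based rank of the lower median
        long seen = 0;
        for (int key = 1; key <= maxKey; key++)
        {
            seen += counts[key];
            if (seen > rank) return (key, firstIndex[key]);
        }
        throw new InvalidOperationException("Unreachable when n > 0.");
    }
}
```

For a plain list of integers the returned index is of little interest, but for full records it tells you where to fetch one record whose key equals the median.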

btb4198 · Sep 12, 2022

.Scott said:

I skipped over the "even record count" issue - only because it made the description difficult.
So you are simply computing the median. So your histogram will simply be an array with 1000 counts.

When you're done, count through half the records. If it's odd, just pick the histogram bin number.
If it's even but both values are in the same bin, pick the bin number.
If it's even and the two values straddle bins, average the two bin numbers.
You will always be at O(n).

I was trying to do what you said in your other post but it did not work:
here is my code :

findMedian:

        public static int FindMedian(List<int> arr)
        {
            Dictionary<int, int> MedianDictionary = new Dictionary<int, int>();
            int numberCounter = arr.Count;
            for (int i = 1; i <= 100; i++)
            {
                MedianDictionary.Add(i, 0);
            }

            foreach (var item in arr) // Loop through List with foreach
            {
                // Increment the value associated with the key "item" by 1
                MedianDictionary[item]++;
                // This will keep a count of how many times each key occurs
            }

            // Now we are on step 4 and we need to  Increment that record count in that bin and, if you need a record index, store the index in that bin as well.

            int currentCount = 0;

            for (int i =0; i < MedianDictionary.Count; i++) // Loop through Dictionary with foreach
            {

                // Add 1 to value and assign it back to itself
                if (MedianDictionary.ContainsKey(i))
                {
                    MedianDictionary[i]++;

                    // Get the value associated with the key "keyValuePair.Key" again.
                    int numberOfTimesKeyOccurs = MedianDictionary[i];
                    // If the value is greater than 1, that means it's an ongoing occurrence

                    if (numberOfTimesKeyOccurs > 1)
                    {
                        // Get the value of "currentCount" and add the number of times this key occurs to it
                        currentCount += numberOfTimesKeyOccurs;
                    }
                    // If the value is equal to 1, that means this is the first occurrence of this key
                    else if (numberOfTimesKeyOccurs == 1)
                    {
                        // Add 1 to "currentCount" because we just found our first occurrence of a key
                        currentCount++;
                    }
                }
            }

            // Now we need to scan through the counts in your histogram until you reach the halfway point of your record count.
            int median = 0;
            foreach (var keyValuePair in MedianDictionary) // Loop through Dictionary with foreach
            {
                // Get the value associated with the key "keyValuePair.Key"

                int numberOfTimesKeyOccurs = keyValuePair.Value;

                if (currentCount % 2 == 0) // even number of items in List
                {

                    // The median will be the average of the two middlemost items

                    if (currentCount / 2 == numberOfTimesKeyOccurs || (currentCount / 2) + 1 == numberOfTimesKeyOccurs)
                    {
                        median = (keyValuePair.Key + MedianDictionary.ElementAt(MedianDictionary[keyValuePair.Key] - 1).Key) / 2;
                    }
                }
                else // odd number of items in List
                {
                    // The median will be the middlemost item
                    if (currentCount / 2 == numberOfTimesKeyOccurs)
                    {
                        median = keyValuePair.Key;
                    }

                }
                // If we've found the median, break out of the foreach loop

                if (median != 0)
                {
                    break;
                }
            }
            return median;  // return the median
        }

Here is how I am testing the function:

TestButton:

        private void TestButton_Click(object sender, EventArgs e)
        {

            List<int> numberList = GetRandomSet();

            int temp = FindMedian(numberList);

            richTextBox1.Text = "The random  list is " + ListToString(numberList) + Environment.NewLine;
            richTextBox1.Text = richTextBox1.Text + "the Middle value is:" + temp + Environment.NewLine;
            numberList.Sort();
            richTextBox1.Text = richTextBox1.Text + " The Sorted list is: " + ListToString(numberList);
        }

I did not think it would be a good idea to make a list too big it could not fit in RAM, so this is how I make my list :

GetRandomSet:

        public List<int> GetRandomSet()
        {
            List<int> myList = new List<int>();
            Random rnd = new Random();
            Random rand = new Random();
            int size = rand.Next(5, 55);
            
            for (int i = 0; i < size; i++)
            {
                myList.Add(rnd.Next(1,100));
            }

            return myList;
        }

I just got zero for my middle value. Also, I think I misunderstood some of what was said,
like steps 4 and 5.

pbuk · Sep 12, 2022

btb4198 said:

1) He wanted to know the value at the middle of the list if you were to sort them in Ascending order.
If the dataset size was even, then you would take the two middle values, add them together, and divide by 2.

That is the definition of the median.

btb4198 said:

2) The values were integers from range "1 to 1000 " being pulled from a massive database of unknown size;

That makes a considerable difference, for such a restricted set the ## \mathcal O(n) ## implementation is pretty simple and even if you didn't come up with it straight away it seems that the interviewer was trying to guide you towards it.

All you need to do is create an array of 1000 counters, one for each of the 1000 values, and traverse the data, incrementing the relevant counter. Working out which element is the median (or, for even ## n ##, which two elements span the median) is ## \mathcal O(1) ##, because it only requires scanning the 1000 counters.

Vanadium 50 · Sep 12, 2022

btb4198 said:

The values were integers from range "1 to 1000 "

This is critical. The biggest mistake you made is not coding. It's not recognizing the importance of this (you should have mentioned this).

The database doesn't fit in memory, but the number of entries for each bin certainly does. Then you can calculate the median - and mean and mode if you like.

If the range is much larger - say unique for each entry - this will not work.
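As a small illustration of the point about the mean and mode, here is a hedged sketch that derives both from an already-filled array of bin counts. It assumes `counts[v]` holds how many times the value `v` occurred; the names are hypothetical, and ties for the mode are resolved in favour of the smallest value.

```csharp
using System;

public static class HistogramStats
{
    // Derives the mean and the mode from per-value counts, where counts[v] is the
    // number of occurrences of the value v. The raw data is never needed again.
    public static (double Mean, int Mode) MeanAndMode(long[] counts)
    {
        long n = 0;
        double sum = 0;
        int mode = -1;
        long best = 0;
        for (int v = 0; v < counts.Length; v++)
        {
            n += counts[v];
            sum += (double)v * counts[v];
            if (counts[v] > best) // strict comparison keeps the smallest value on ties
            {
                best = counts[v];
                mode = v;
            }
        }
        if (n == 0) throw new InvalidOperationException("The histogram is empty.");
        return (sum / n, mode);
    }
}
```

Using the counts from the first post (3, 3, 4, 4, 3, 3, 3, 2, 2 for the values 1 to 9), the total is 27 entries with a sum of 125, so the mean is 125/27, and the values 3 and 4 tie for the mode.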

DaveC426913 · Sep 12, 2022

btb4198 said:

He did not even let me code the Brute Force solution. I just told him what I was going to try. I got the strongest impression that he expected me to know these algorithms and how to Apply them.

Often you will be given problems that you will not be able to (or at least very unlikely to) solve. That's deliberate.

The interviewer is not looking for a solution; he is looking for insight into your analysis processes.

You would do well to mention "Brute Force as a last ditch but that it is so inefficient that you would never consider it except as last ditch". And then waste not one more second of his time on that solution. That shows you know how to prioritize.

The key is to think out loud how you break down the problem.

If you are preparing for a future interview, do not bother trying to solve this problem. Better to research problem-solving techniques.
I was once given an abstract math test that would have taken 20 minutes to finish, but they interrupted me after only ten minutes. They didn't care about the answers one whit; what they were looking for was whether I had prioritized my efforts to get the most done.

pbuk · Sep 12, 2022

DaveC426913 said:

Often you will be given problems that you will not be able to (or at least very unlikely to) solve. That's deliberate.

But that is not the case here - there is a simple solution as @Vanadium 50 and I have pointed out.

DaveC426913 · Sep 12, 2022

pbuk said:

But that is not the case here - there is a simple solution as @Vanadium 50 and I have pointed out.

And how much can the interviewer glean about your analysis process if you just pull an answer right out of your b*tt? Your answer is the math equivalent of yes/no answers in an interview.

(I didn't say it doesn't have a simple solution, I said he might not be able to solve it. Microsoft probably doesn't want to hire walking encyclopaediae.)

pbuk · Sep 12, 2022

DaveC426913 said:

And how much can the interviewer glean about your analysis process if you just pull an answer right out of your b*tt?

He will see that I have analysed the problem and realized that with only 1,000 distinct values the solution is to bin them.

DaveC426913 said:

Your answer is the math equivalent of yes/no answers in an interview.

No it isn't: the "yes/no" equivalent would be:
Q: Can you find the median of an arbitrary-length list of positive integers less than 1001 in linear time?
A: Yes.

Note that I am not arguing with this statement in general:

DaveC426913 said:

The interviewer is not looking for a solution; he is looking for insight into your analysis processes.
...
The key is to think out loud how you break down the problem.

and this could well be relevant if the question were more complex (e.g. finding the median of a list of strings). Indeed I suspect if the OP had said "I'd bin the values into a 1000 element array" the interviewer would have moved the conversation on with "good, now what if we are working with arbitrary length strings" - and then it's time to show how you think.

Vanadium 50 · Sep 12, 2022

pbuk said:

good, now what if we are working with arbitrary length strings

Been thinking about that. Well, I was actually thinking if 1000 were a much larger number - large enough to be sparsely populated. Let's say it's ten billion. The square root of 10 billion is 100,000. So a million bins (including overflow and underflow) would cover +/- 5 standard deviations. Now we just need a guess for the median.

Let's look at a subset of the data and calculate the median there. Let's find the median of a million entries - a million will fit in memory (by the above supposition). That will get us the median to about 0.1%. Do that 100x and you should know the median well enough to center your histogram.

If you get it wrong, no problem, just recenter and try again. You'll still beat N log N if you repeat it fewer than 30 times. If you know something about the dataset, e.g. that it is partially sorted, you might be able to do better.
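The histogram-with-overflow idea can be sketched like this. It is only a sketch under assumptions: the guess for the centre is supplied by the caller (for example from the sampled medians described above, which are not implemented here), the values are integers, and only the lower median is reported. The names `WindowedHistogram` and `TryLowerMedian` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class WindowedHistogram
{
    // Counts values inside the window [center - halfWidth, center + halfWidth] exactly,
    // and everything else in an underflow or overflow counter. Returns the lower median
    // if it lies inside the window, or null if the window must be recentered.
    public static long? TryLowerMedian(IEnumerable<long> values, long center, int halfWidth)
    {
        long lo = center - halfWidth;
        var counts = new long[2 * halfWidth + 1];
        long underflow = 0, overflow = 0, n = 0;

        foreach (long v in values)
        {
            n++;
            long offset = v - lo;
            if (offset < 0) underflow++;
            else if (offset >= counts.Length) overflow++;
            else counts[offset]++;
        }
        if (n == 0) return null;

        long rank = (n - 1) / 2;                // 0-based rank of the lower median
        if (rank < underflow) return null;      // median is below the window
        if (rank >= n - overflow) return null;  // median is above the window

        long seen = underflow;
        for (int i = 0; i < counts.Length; i++)
        {
            seen += counts[i];
            if (seen > rank) return lo + i;
        }
        return null; // not reached: the checks above guarantee a hit inside the window
    }
}
```

A null result also says which way to move: comparing the rank with the underflow and overflow counts shows whether the next guess should be lower or higher.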

pbuk · Sep 13, 2022

Vanadium 50 said:

Been thinking about that. Well, I was actually thinking if 1000 were a much larger number - large enough to be sparsely populated.

That's still easier than the general problem where we can't bin elements at all, only compare them. However, even for this case, quickselect with a random pivot is on average (and almost certainly) ## \mathcal O(n) ##, although the worst case is ## \mathcal O(n^2) ##. We can reduce the worst case to ## \mathcal O(n) ## with median of medians, but at the cost of a much higher constant, which means median of medians is almost always slower. Alternatively, we can use a more sophisticated pivot selection strategy such as Floyd-Rivest to speed up the average.
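For reference, quickselect with a random pivot looks roughly like this. It is a minimal in-memory sketch (so it illustrates the comparison-only algorithm, not an out-of-core variant), it uses a simple Lomuto-style partition, and it reorders the list it is given. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class QuickSelect
{
    // Returns the element that would sit at 0-based position k if items were sorted.
    // Reorders items in place. Average O(n) comparisons, worst case O(n^2).
    public static T Select<T>(IList<T> items, int k, Random rng = null) where T : IComparable<T>
    {
        if (k < 0 || k >= items.Count) throw new ArgumentOutOfRangeException(nameof(k));
        rng = rng ?? new Random();
        int lo = 0, hi = items.Count - 1;
        while (true)
        {
            if (lo == hi) return items[lo];
            int p = Partition(items, lo, hi, rng.Next(lo, hi + 1)); // random pivot in [lo, hi]
            if (k == p) return items[p];
            if (k < p) hi = p - 1; // wanted element is left of the pivot
            else lo = p + 1;       // wanted element is right of the pivot
        }
    }

    // Moves elements smaller than the pivot to the front and returns the pivot's final index.
    private static int Partition<T>(IList<T> a, int lo, int hi, int pivotIndex) where T : IComparable<T>
    {
        T pivot = a[pivotIndex];
        Swap(a, pivotIndex, hi);
        int store = lo;
        for (int i = lo; i < hi; i++)
        {
            if (a[i].CompareTo(pivot) < 0)
            {
                Swap(a, i, store);
                store++;
            }
        }
        Swap(a, store, hi);
        return store;
    }

    private static void Swap<T>(IList<T> a, int i, int j)
    {
        T tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

Calling `Select(list, (list.Count - 1) / 2)` gives the lower median. Because only comparisons are used, the same code works for strings or any other `IComparable<T>` type.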

The interviewer for a general coding job wouldn't expect every candidate to know all of this off the top of their head, of course. Interviews are not usually just pass/fail; they are used to determine suitability for different jobs. A candidate who didn't do well on this question but was strong on questions on, say, asynchronous programming may find their final interview is for a front-end job rather than back-end.

pbuk · Sep 13, 2022

Vanadium 50 said:

Been thinking about that. Well, I was actually thinking if 1000 were a much larger number - large enough to be sparsely populated. Let's say it's ten billion. The square root of 10 billion is 100,000.

In this case, bin into 100,000 bins according to floor(value / 100,000) and select the median bin*, then bin that into 100,000 bins according to value mod 100,000 to guarantee a result in just two passes over the data i.e. ## \mathcal O(n) ## (in practice you might choose ## 2^{17} = 131,072 ## rather than 100,000).

* if there are two bins spanning the median element then the next pass is different but even quicker.
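The two passes might look like this in C#. This is a sketch under assumptions: values are non-negative integers below `binSize * binSize`, the data can be read twice (the `readAll` callback is a hypothetical stand-in), and only the lower median is returned, so the footnote's two-bin case for even counts is not handled. All names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class TwoPassMedian
{
    // Pass 1 bins by value / binSize to find the bin holding the median;
    // pass 2 bins that one bin's values by value % binSize to find the exact value.
    public static long LowerMedian(Func<IEnumerable<long>> readAll, int binSize = 100_000)
    {
        // Pass 1: coarse histogram over the high-order part of each value.
        var coarse = new long[binSize];
        long n = 0;
        foreach (long v in readAll())
        {
            coarse[v / binSize]++;
            n++;
        }
        if (n == 0) throw new InvalidOperationException("The dataset is empty.");

        long rank = (n - 1) / 2; // 0-based rank of the lower median
        long seen = 0;
        int bin = 0;
        while (seen + coarse[bin] <= rank)
        {
            seen += coarse[bin];
            bin++;
        }
        rank -= seen; // rank of the median within the selected coarse bin

        // Pass 2: fine histogram over the low-order part, for the selected bin only.
        var fine = new long[binSize];
        foreach (long v in readAll())
        {
            if (v / binSize == bin) fine[v % binSize]++;
        }

        seen = 0;
        for (int r = 0; r < binSize; r++)
        {
            seen += fine[r];
            if (seen > rank) return (long)bin * binSize + r;
        }
        throw new InvalidOperationException("Unreachable when the input is consistent between passes.");
    }
}
```

Memory use is two arrays of `binSize` counters, and the data is read exactly twice, whatever its length.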

Vanadium 50 · Sep 13, 2022

Well, if I got that far I'd be pretty happy...

I would get two other ideas into my interview:

1. I could spend some more time optimizing this, but I don't want to go down this path until I had more context: is this the limiting factor in the program? Let's not optimize the wrong thing.

2. Do I know anything about the data I am operating on that would help? (e.g. it's already partially sorted) Worst case, best case, average case and this case are probably four different things.

That said, I probably wouldn't get the job. When asked "do you have any questions?" I would be hard-pressed not to respond "if your hiring process is this tough, why are your products such bloated crap?"

btb4198 · Sep 13, 2022

All,
So I got it to work today. HackerRank has a "Find the Median" test, and I used that to check; it passed all 3 test cases.

Anyhow, I think if someone learns the histogram sorting algorithm (I am not sure of its name) beforehand, then yeah, they could easily pass that interview. But I do not see anyone who has never even heard of it figuring it out in an hour. I think you are supposed to know all the algorithms and data structures pretty well beforehand.

I wish I had learned about the histogram sorting algorithm in school, but I had never heard of it until Saturday night, when I found that other post.

hutchphd · Sep 13, 2022

Really you need to listen to what folks here are telling you. As an interviewer I am much more impressed with folks who can think on their feet and know how to approach a problem than someone who knows the answer because of dumb luck. You need to show off those skills and not some arcane memorized algorithm. It is good that you wanted to work it through, but you cannot know everything, nor will you be expected to at any place you want to work.

btb4198 · Sep 13, 2022

hutchphd said:

Really you need to listen to what folks here are telling you. As an interviewer I am much more impressed with folks who can think on their feet and know how to approach a problem than someone who knows the answer because of dumb luck. You need to show off those skills and not some arcane memorized algorithm. It is good that you wanted to work it through, but you cannot know everything, nor will you be expected to at any place you want to work.

Wait you are an interviewer ? How should a person prepare beforehand? and What should they know beforehand?

Vanadium 50 · Sep 13, 2022

btb4198 said:

What should they know beforehand?

As people have told you repeatedly, this is not a test of your knowledge. It's a test of your problem solving skills.

hutchphd · Sep 13, 2022

Truth in advertising: I have interviewed a number of people (maybe 25?) for various positions, both software and science, direct hires and contract. It was never my primary function, but it was an activity I enjoyed and got pretty good at. I am no longer formally working, because I am older than dirt.
Relax and trust your abilities.

harborsparrow · Nov 22, 2022

There were two big parts to the problem, and there can be wildly varying divergence in how one approaches it. And in my opinion, there is probably more than one decent solution.

One, which obviously a lot of people here expect, is to know about an algorithm that would already deal with the limited memory. Another approach would be to develop a paging system first, and then run a reasonable sort with the paging being managed transparently to the sort algorithm. It might be that one or the other of these approaches would be faster, but I didn't notice that fast performance was given as a requirement. So I would have had to ask about that.

In fact, I would have been explaining the reasoning I gave all along as I developed my answer. I've been told that what they are generally looking for is to assess your thought process and how you go about tackling a problem, and perhaps that is just as important as demonstrating knowledge of a particular algorithm or the ability to code it from memory.

I assume you have read Gayle Laakmann's book "Cracking the Coding Interview". That would seem to be a prerequisite, these days, for anyone about to interview for a programming position at Microsoft.

And, don't beat yourself up. They deliberately wait until you're mentally tired to spring these things on you. It's not a nice interview process, and I'm not even sure it's all that effective a process. It selects for a certain personality type, for one thing. But it is what it is.
