IEEE-754 Precision Format

The IEEE-754 Precision Format is a standard for representing floating-point numbers in computer systems, ensuring consistent behavior across different platforms. It defines formats for single precision (32 bits) and double precision (64 bits), including components such as the sign bit, exponent, and significand (mantissa). This standard allows for a wide range of values and helps mitigate issues like rounding errors, overflow, and underflow, making it essential for numerical computations in various applications.
  • #1
arhzz
Homework Statement
Consider the IEEE-754 format for Single Precision numbers. Provide your results in both IEEE-754 format (binary) and in decimal system.
a) Determine the smallest number ε such that 2 + ε > 2.
b) Determine the largest representable (positive or negative) number maxreal.
c) Determine the smallest representable (positive or negative) number minreal.
Relevant Equations
-
Hello! (Note: I put the prefix as comp sci, since at my uni this is a computer science class, but I am in EE, so it could go under either prefix. If anyone feels it's more appropriate as Engineering, feel free to change it.)

So here is my attempt at the solution

The formula that we are given is ## (-1)^s*M*2^E ##, where M is the mantissa, E is the exponent, and s is the sign.

Since it's single precision, the mantissa should be 23 bits. The smallest number the mantissa can have is 1 (followed by 23 0's). So the value that follows this is the smallest value that can be added to the mantissa, hence we can find our epsilon as follows

## \epsilon = \frac{1}{2^{23}} = 2^{-23} = 1,19 * 10^{-7} ## (roughly)

I think this part should be correct;

Now for the second part I tried it like this.

b) The biggest value the mantissa can have is M = 1,(followed by 23 1's), and the biggest value the exponent can have is 127, hence ## 2^{127}##

So now I can plug into the formula ## (-1)^s * M * 2^E = 1,89*10^{38} ##, where s is either 0 or 1 for positive or negative

Now the answer should be ##3,40*10^{38} ##, and I really don't see how they get to that solution. The power of 38 is correct, which confuses me, because that would imply that the formula I am using is correct, no?

Thanks for the help!
 
  • #2
And please use a decimal point, not a decimal comma ... :rolleyes:

What do you think of this:
Wikipedia said:
an IEEE 754 32-bit base-2 floating-point variable has a maximum value of ##(2 - 2^{-23}) \times 2^{127} \approx 3.4028235 \times 10^{38}##

[edit]
And I did
##\qquad##Y = 2 + ε
##\qquad##write (6,'(Z8.8)') Y-2.0
and it printed 00000000 !
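For readers without a Fortran compiler, the same experiment can be sketched in Python (an illustration, not BvU's actual program; single-precision arithmetic is emulated by round-tripping through the standard struct module):

```python
import struct

def f32(x):
    """Round a Python float (64-bit) to the nearest IEEE-754 single-precision value."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

def hex32(x):
    """Hex dump of the 32-bit pattern, like Fortran's Z8.8 edit descriptor."""
    return struct.pack('>f', x).hex()

eps = 2.0 ** -23
y = f32(2.0 + eps)        # the sum rounds back down to 2.0
print(hex32(y - 2.0))     # prints 00000000, matching the Fortran output
```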

##\ ##
 
  • #3
BvU said:
And please use a decimal point, not a decimal comma ... :rolleyes:

What do you think of this:


[edit]
And I did
##\qquad##Y = 2 + ε
##\qquad##write (6,'(Z8.8)') Y-2.0
and it printed 00000000 !

##\ ##
Well, I think that my solution is wrong, but how do they calculate 3,40? What am I doing wrong in my calculations?

Also, for what you did I don't really understand: what is that supposed to represent?

Also noted regarding the decimal comma.
 
  • #4
BvU said:
And please use a decimal point, not a decimal comma ... :rolleyes:
Why? If the OP is studying in Europe then answers with full stops as the fractional separator would be wrong. You do need to insert braces around the comma though to avoid inserting unwanted white space, and also use \times instead of * for multiplication, so instead of 1,19 * 10^{-7} rendered as ## 1,19 * 10^{-7} ## you should write 1{,}19 \times 10^{-7} rendered as ## 1{,}19 \times 10^{-7} ##.

arhzz said:
So now I can plug in the formula ## (-1)^s * M * 2^E = 1,89*10^{38} ## where S is either 0 or 1 for positive or negative
## (-1)^s \cdot M \cdot 2^E ## is the right equation, but you need to plug the right value of M in!

arhzz said:
Now the answer should be ##3,40*10^{38} ## and I really dont see how they get to that solution.
By using approximately the right value of ## M = 1{,}111\dots 111_2 ## where there are 23 1's in the fractional part as you have correctly stated (there is of course a simpler way to express this value, what is it? Can you see that it leads very quickly to a good approximation?).
 
  • #5
arhzz said:
Also, for what you did I don't really understand: what is that supposed to represent?
It means you have to reconsider your answer for a).
real*4 2.0 is hexadecimal 40000000
##2^{-23}## = 1.1920929E-07 is hexadecimal 34000000
and 2 + ##2^{-23}## is hexadecimal 40000000 also.

arhzz said:
how do they calculate 3,40
##\mathtt {(2 − 2^{−23}) × 2^{127}≈ 3.4028235 × 10^{38}}##

[attached image: IEEE-754 single-precision bit layout]

There are 23 bits for the fraction. The first (implicit leading) bit is always a 1 and is not stored, so effectively there are 24 bits,
so the biggest possible fraction is FFFFFF ##\mathtt {(1 - 2^{-24})}##, and ##\mathtt {2 \cdot (1 - 2^{-24}) = (2 - 2^{-23})}##.

wikipedia said:
The exponent field is an 8-bit unsigned integer from 0 to 255, in biased form: a value of 127 represents the actual exponent zero. Exponents range from −126 to +127 (thus 1 to 254 in the exponent field), because the biased exponent values 0 (all 0s) and 255 (all 1s) are reserved for special numbers (subnormal numbers, signed zeros, infinities, and NaNs).
so there is the ##\mathtt {2^ {127}}##

Fortran has a function HUGE that prints 7F7F FFFF hex
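The HUGE value can be cross-checked by decoding that bit pattern directly (a Python sketch using the standard struct module):

```python
import struct

# Interpret the bit pattern 7F7FFFFF as an IEEE-754 single-precision number
maxreal = struct.unpack('>f', bytes.fromhex('7f7fffff'))[0]

print(maxreal)                              # about 3.4028235e+38
print(maxreal == (2 - 2**-23) * 2.0**127)   # True: matches the Wikipedia formula
```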

##\ ##
 
  • #6
BvU said:
It means you have to reconsider your answer for a).
real*4 2.0 is hexadecimal 40000000
##2^{-23}## = 1.1920929E-07 is hexadecimal 34000000
and 2 + ##2^{-23}## is hexadecimal 40000000 also.


##\mathtt {(2 − 2^{−23}) × 2^{127}≈ 3.4028235 × 10^{38}}##

[attached image: IEEE-754 single-precision bit layout]
There are 23 bits for the fraction. The first (implicit leading) bit is always a 1 and is not stored, so effectively there are 24 bits,
so the biggest possible fraction is FFFFFF ##\mathtt {(1 - 2^{-24})}##, and ##\mathtt {2 \cdot (1 - 2^{-24}) = (2 - 2^{-23})}##.

so there is the ##\mathtt {2^ {127}}##

Fortran has a function HUGE that prints 7F7F FFFF hex

##\ ##
Okay, now I see how they get the correct result; I retraced it step by step.

I checked our slides again, and the part about the implicit leading bit not being stored is what caused the confusion on my part. I was able to reproduce your answer and get the same result (I also have it in binary form as well).

Thank you for the help on part b)

But for part a) I realized that I made a mistake in my calculations. The reasoning is that we did a very similar example in class with ##1 + \epsilon > 1 ##, and after looking at the solution it seemed to me that the "constant factors" (the 1, and in my example the 2) did not impact the solution. Obviously this is wrong, since it would mean that for every number the solution would be ##2^{-23} ##, and that just does not make sense.

So I am kind of stumped on how to do a). Any insights?
 
  • #7
The key is in "real*4 2.0 is hexadecimal 40000000"
 
  • #8
I thought. But it's more complicated. Trial and error, at least for me now.
I figured the ##\varepsilon## for your part a) would be 2**(-22), but I find that anything > 2**(-23) already sets the last bit in that 40000000.

with eps = 1.1920929E-07 ( 2**(-23), in hex: 34000000)
I do Y = 2.0 + eps and look at Y and at Y-2

eps: 0.119209289551E-06 in hex: 34000000
: Y: 0.200000000000E+01 in hex: 40000000
Y-2: 0.000000000000E+00 in hex: 00000000


same with eps = 0.2384186E-06 (2**(-22), in hex: 34800000)

eps: 0.119209303762E-06 in hex: 34000001
eps: 0.238418607523E-06 in hex: 34800001

: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000


But this last result I also get when I input eps = 1.192093E-07

eps: 0.119209303762E-06 in hex: 34000001
: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000


[edit] messed up struggling with font and bold.
Point is: anything > 2**(-23) qualifies as an answer to part a), just NOT 2**(-23) itself.
Disclaimer: I am using the Intel Fortran compiler, which appears to have around a hundred switches for all kinds of compatibilities in floating-point arithmetic.

##\ ##
 
  • #9
BvU said:
I thought. But it's more complicated. Trial and error, at least for me now.
I figured the ##\varepsilon## for your part a) would be 2**(-22), but I find that anything > 2**(-23) already sets the last bit in that 40000000.

with eps = 1.1920929E-07 ( 2**(-23), in hex: 34000000)
I do Y = 2.0 + eps and look at Y and at Y-2

eps: 0.119209289551E-06 in hex: 34000000
: Y: 0.200000000000E+01 in hex: 40000000
Y-2: 0.000000000000E+00 in hex: 00000000


same with eps = 0.2384186E-06 (2**(-22), in hex: 34800000)

eps: 0.119209303762E-06 in hex: 34000001
: Y: 0.200000023842E+01 in hex: 40000001
Y-2: 0.238418579102E-06 in hex: 34800000


But this last result I also get when I input eps = 1.192093E-07

##\ ##
Huh, interesting. So if I understood correctly: you pick an epsilon, then you add it to the 2. So for the first case our epsilon is ##2^{-23}##, and after adding it we get the hexadecimal 40000000. Now this value minus 2 gets us 00000000, which suggests that no change has occurred and that the inequality is not fulfilled? Did I understand this correctly?

The same analogy for when epsilon is ## 2^{-22} ##, which yields that epsilon should be ## 2^{-22} ##.

But you get the same result when inputting a different epsilon, if I got that right? How does that work? That should not be happening, right?
 
  • #10
Oops, messed up. Back later.
 
  • #11
arhzz said:
Huh, interesting. So if I understood correctly: you pick an epsilon, then you add it to the 2. So for the first case our epsilon is ##2^{-23}##, and after adding it we get the hexadecimal 40000000. Now this value minus 2 gets us 00000000, which suggests that no change has occurred and that the inequality is not fulfilled? Did I understand this correctly?

The same analogy for when epsilon is ## 2^{-22} ##, which yields that epsilon should be ## 2^{-22} ##.

But you get the same result when inputting a different epsilon, if I got that right? How does that work? That should not be happening, right?
So far I have done some trials starting with real*4 Y = 2.0 which is stored as 40000000 (in hexadecimal).

Adding 2**(-23) (decimal 0.119209289551E-06) to this gets the sum stored as 40000000 again, so Y + 2**(-23) is NOT seen as greater than Y

The minimum change required to see Y+ ##\varepsilon## as greater than Y is when it goes from 40000000 to 40000001 in hex. The decimal value of the latter is 0.200000023842E+01 or 2 + 2**(-22) , which makes sense.

But it appears that for numbers slightly greater than 2**(-23), the floating-point arithmetic rounds the sum up by a full 2**(-22):
e.g. 0.1192092967E-06 qualifies but NOT 0.1192092966E-06

I think this is going too deep, so probably the intended answer for a) is 2**(-23)

https://en.wikipedia.org/wiki/Machine_epsilon#Values_for_standard_hardware_arithmetics

##\ ##
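BvU's observation above, that anything strictly greater than 2**(-23) works but 2**(-23) itself does not, comes from round-to-nearest-even and can be reproduced in Python (a sketch; single precision is emulated with a struct round-trip, so compiler switches don't enter into it):

```python
import struct

def f32(x):
    """Round a double to the nearest IEEE-754 single-precision value (ties to even)."""
    return struct.unpack('>f', struct.pack('>f', x))[0]

e23 = 2.0 ** -23
# 2 + 2**-23 is exactly halfway between 2 and 2 + 2**-22; the tie goes to the
# even significand, i.e. back down to 2, so this epsilon does NOT qualify.
print(f32(2.0 + e23) > 2.0)              # False
# Any epsilon even slightly larger pushes the sum past the halfway point,
# so it rounds up to 2 + 2**-22 and DOES qualify.
print(f32(2.0 + e23 * 1.000001) > 2.0)   # True
```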
 
  • #12
BvU said:
Oops, messed up. Back later.

BvU said:
So far I have done some trials starting with real*4 Y = 2.0 which is stored as 40000000 (in hexadecimal).

Adding 2**(-23) (decimal 0.119209289551E-06) to this gets the sum stored as 40000000 again, so Y + 2**(-23) is NOT seen as greater than Y

The minimum change required to see Y+ ##\varepsilon## as greater than Y is when it goes from 40000000 to 40000001 in hex. The decimal value of the latter is 0.200000023842E+01 or 2 + 2**(-22) , which makes sense.

But it appears that for numbers slightly greater than 2**(-23), the floating-point arithmetic rounds the sum up by a full 2**(-22):
e.g. 0.1192092967E-06 qualifies but NOT 0.1192092966E-06

I think this is going too deep, so probably the intended answer for a) is 2**(-23)

https://en.wikipedia.org/wiki/Machine_epsilon#Values_for_standard_hardware_arithmetics

##\ ##
Hm, okay, interesting. So you would bet on 2^(-23) being the answer? Considering that ## 1+\epsilon > 1 ## yields the same answer, can we state that the minimum value is always ## 2^{-23} ##, regardless of what the constant factors are?
 
  • #13
I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)

Up to you :smile: !
 
  • #14
BvU said:
I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)

Up to you :smile: !

I think I will try it like this:

Start from the equation that relates the components of the normalized IEEE-754 single-precision representation to the value it represents:

$$ v \; = \; \left( -1 \right)^S \left( 1 \; + \; \frac{M}{2^{23}} \right) 2^{\left( E-127\right)} $$

Now I define ## v_1 = 2 ## and ## v_2 = 2+ \epsilon ##

Then we solve for ## \epsilon ##, and what comes out is my solution.

Does this make sense to you?
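Carrying that plan through (a sketch of the algebra, on the assumption that the right ## v_2 ## is the next representable value above 2): for ## v_1 = 2 ## the stored fields are ## S = 0 ##, ## E = 128 ##, ## M = 0 ##, and ## v_2 ## has ## M = 1 ##, so

$$ \epsilon \; = \; v_2 - v_1 \; = \; \left( 1 + \frac{1}{2^{23}} \right) 2^{1} \; - \; 1 \cdot 2^{1} \; = \; \frac{2}{2^{23}} \; = \; 2^{-22} $$

which matches the 40000000 to 40000001 step observed in #8.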
 
  • #15
@BvU you are confusing rounding epsilon with interval epsilon.

@arhzz given that 2 is twice the size of 1, it should be obvious that they could not both have the same value of epsilon.
 
  • #16
BvU said:
I have shown that it is not the right answer (for my compiler). But probably the intended answer for a) is 2**(-23)
How are you getting that? If you're going to fit the numbers into 24 bits, you have

2 = 10.00 0000 0000 0000 0000 0000
2+e = 10.00 0000 0000 0000 0000 0001

Since there are 22 digits to the right of the binary point, the difference is 2^-22.
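vela's bit patterns can be confirmed numerically (a Python sketch: increment the 32-bit pattern of 2.0 by one and look at the gap):

```python
import struct

# Bit pattern of single-precision 2.0, as an integer
bits = int.from_bytes(struct.pack('>f', 2.0), 'big')            # 0x40000000
# The next representable single-precision value above 2.0
succ = struct.unpack('>f', (bits + 1).to_bytes(4, 'big'))[0]    # 0x40000001

print(hex(bits))              # 0x40000000
print(succ - 2.0 == 2**-22)   # True: the gap just above 2 is 2^-22
```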
 
  • #17
pbuk said:
@BvU you are confusing rounding epsilon with interval epsilon.

The question is then: does the problem statement in #1 include this rounding (which is demonstrated in practice in the last block in #8, but may well be compiler-dependent)?
 
  • #18
vela said:
How are you getting that? If you're going to fit the numbers into 24 bits, you have

2 = 10.00 0000 0000 0000 0000 0000
2+e = 10.00 0000 0000 0000 0000 0001

Since there are 22 digits to the right of the binary point, the difference is 2^-22.
Yes, see #8: the hex representation of the smallest number > 2 is 40000001

I have muddied the waters by a practical interpretation -- makes for a good learning opportunity :smile:

##\ ##
 

FAQ: IEEE-754 Precision Format

What is the IEEE-754 Precision Format?

The IEEE-754 Precision Format is a standard for representing floating-point numbers in computers. It defines how numbers are stored in binary, specifying formats for single precision (32 bits) and double precision (64 bits), as well as rules for rounding, overflow, underflow, and special values like NaN (Not a Number) and infinity.

What are the components of an IEEE-754 floating-point number?

An IEEE-754 floating-point number consists of three main components: the sign bit, the exponent, and the fraction (or significand). The sign bit indicates whether the number is positive or negative, the exponent determines the scale of the number, and the fraction represents the significant digits of the number.

What are the differences between single precision and double precision?

Single precision uses 32 bits to represent a floating-point number, allocating 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fraction. Double precision, on the other hand, uses 64 bits, with 1 bit for the sign, 11 bits for the exponent, and 52 bits for the fraction. This means double precision can represent a wider range of values and greater precision than single precision.
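The storage sizes are easy to confirm (a sketch using the standard struct module's 'f' and 'd' format characters for single and double precision):

```python
import struct

print(len(struct.pack('f', 1.0)))   # 4 bytes = 32 bits (single precision)
print(len(struct.pack('d', 1.0)))   # 8 bytes = 64 bits (double precision)

# A value below single-precision machine epsilon is lost when added to 1.0
# and round-tripped through the 32-bit format (the tie rounds down to 1.0):
one_plus = struct.unpack('f', struct.pack('f', 1.0 + 2**-24))[0]
print(one_plus == 1.0)              # True
```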

How does rounding work in IEEE-754?

IEEE-754 defines several rounding modes for floating-point arithmetic, including round to nearest (the default), round toward zero, round toward positive infinity, and round toward negative infinity. The rounding mode determines how numbers are adjusted when they cannot be precisely represented, ensuring consistency and predictability in floating-point calculations.

What are the special values defined by IEEE-754?

IEEE-754 defines several special values, including positive and negative infinity, NaN (Not a Number), and denormalized numbers. Positive and negative infinity represent overflow conditions, NaN is used to signify undefined or unrepresentable values (like 0/0), and denormalized numbers allow for representation of very small numbers closer to zero than the smallest normalized value.
