This is due to the conversion of real numbers written in decimal into binary floating-point numbers (per the IEEE 754 standard). Generally, this happens when the number is not a sum of powers of 2.
It should be understood that floats are represented in binary, typically with 32 or 64 bits:
- Sign bit (0 = positive, 1 = negative)
- Exponent (8 or 11 bits)
- Mantissa or fractional part (23 or 52 bits)
Where the formula, in base 2, is:
(-1)^sign × 2^(exponent - bias) × 1.mantissa
with bias = 127 (1111111₂) for 32-bit floats or 1023 (1111111111₂) for 64-bit floats.
Exceptions:
- +0 and -0 have exponent = 0 and mantissa = 0
- +Inf and -Inf have exponent = all 1s and mantissa = 0
- NaNs have exponent = all 1s and mantissa ≠ 0
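To make the layout concrete, here is a minimal C sketch (assuming IEEE 754 single precision and that float is 32 bits on the target platform) that extracts the three fields and flags the special cases above:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Split a 32-bit float into its sign, exponent and mantissa fields. */
static void decompose(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* reinterpret the bit pattern */

    uint32_t sign     = bits >> 31;           /* 1 bit  */
    uint32_t exponent = (bits >> 23) & 0xFF;  /* 8 bits */
    uint32_t mantissa = bits & 0x7FFFFF;      /* 23 bits */

    printf("%-8g sign=%u exponent=%3u mantissa=0x%06X", f, sign, exponent, mantissa);

    /* The exceptions listed above. */
    if (exponent == 0xFF)
        printf(mantissa ? "  (NaN)" : "  (Inf)");
    else if (exponent == 0 && mantissa == 0)
        printf("  (zero)");
    printf("\n");
}

int main(void)
{
    decompose(1.0f);      /* exponent field 127 = 0 + bias  */
    decompose(0.5f);      /* exponent field 126 = -1 + bias */
    decompose(0.0f);
    decompose(INFINITY);
    decompose(NAN);
    return 0;
}
```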
These utilities ([1], [2]) help to understand the representation of 0.10 and 0.05 as single floats (32 bits). Experiment with 4.0 and 2.0 (positive powers of 2), 1.0 (exponent 0, encoded as 127), 0.5, 0.25, 0.125 (negative powers of 2), 0.375 (0.125 added three times), and then with 0.1, 0.3, etc.
For example, with the second utility, click "Add An Analyzer." Select "decimal" on the left and enter 0.1. Select "binary" on the right and enter 0.000110011001100110011001101.
In this case, we see a rounding:
0.1₁₀
= 0.000 1100 1100 1100 1100 1100 1100 …₂
≈ 0.000 1100 1100 1100 1100 1100 1101₂ (rounded)
Normalized, this is 1.100 1100 1100 1100 1100 1101₂ × 2^-4, so the 23 mantissa bits actually stored are 100 1100 1100 1100 1100 1101 (the leading 1 is implicit, and the exponent field holds -4 + 127 = 123).
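To confirm this rounding without the utilities, a quick sketch: printing 0.1f with enough digits exposes the value that is actually stored (assuming a libc, such as glibc, that prints the exact decimal expansion):

```c
#include <stdio.h>

int main(void)
{
    /* 0.1 is not representable; the nearest single float is stored instead. */
    printf("%.27f\n", 0.1f);  /* 0.100000001490116119384765625 */
    return 0;
}
```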
The way to convert this number back to decimal is similar to what is done with integers, except that every bit after the binary point is multiplied by a negative power of 2:
∑ bit_i × 2^i, i ∈ ℤ: min ≤ i ≤ max
In this case, min = -27, max = 0
0 × 2^0 + 0 × 2^-1 + 0 × 2^-2 + 0 × 2^-3
+ 1 × 2^-4 + 1 × 2^-5 + 0 × 2^-6 + 0 × 2^-7
+ 1 × 2^-8 + 1 × 2^-9 + 0 × 2^-10 + 0 × 2^-11
+ 1 × 2^-12 + 1 × 2^-13 + 0 × 2^-14 + 0 × 2^-15
+ 1 × 2^-16 + 1 × 2^-17 + 0 × 2^-18 + 0 × 2^-19
+ 1 × 2^-20 + 1 × 2^-21 + 0 × 2^-22 + 0 × 2^-23
+ 1 × 2^-24 + 1 × 2^-25 + 0 × 2^-26 + 1 × 2^-27
= 0.100000001490116119384765625
≈ 0.1 (rounded for display)
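The same conversion can be sketched in C, summing each bit times the corresponding power of 2; the bit string below is the rounded binary expansion from the previous step:

```c
#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    /* Fractional bits of 0.1 after rounding (positions 2^-1 .. 2^-27). */
    const char *bits = "000110011001100110011001101";
    double value = 0.0;

    for (size_t i = 0; i < strlen(bits); i++)
        if (bits[i] == '1')
            value += ldexp(1.0, -(int)(i + 1));  /* add 2^-(i+1), exactly */

    printf("%.27f\n", value);               /* 0.100000001490116119384765625 */
    printf("%d\n", value == (double)0.1f);  /* 1: identical to the stored 0.1f */
    return 0;
}
```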
Even when the input data suffers no rounding, that does not mean the results cannot differ from what is expected. For example, sums and subtractions suffer much more from differences between the operands' exponents than multiplications and divisions do.
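For instance, a sketch of how a large exponent difference makes an addition vanish while a multiplication is unaffected (the values here are chosen just for illustration):

```c
#include <stdio.h>

int main(void)
{
    float big   = 1.0e8f;   /* magnitude around 2^26 */
    float small = 1.0f;     /* magnitude 2^0: 26 binary places away */

    /* The sum loses `small` entirely: 24 significand bits cannot hold both. */
    printf("%d\n", big + small == big);       /* 1 (true) */

    /* Multiplication just adds exponents, so nothing is lost here. */
    printf("%.1f\n", (double)(big * small));  /* 100000000.0 */
    return 0;
}
```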
Often, what we see as an equality comparison (x = 2.5) in mathematical terms has to be an interval comparison in floating-point terms (2.5 - ε ≤ x ≤ 2.5 + ε).
This ε (epsilon) must be a value on the same scale as the value being checked.
It makes no sense for ε to be the smallest possible float, because then you will find that 2.5 ± (smallest float) ≈ 2.5. In effect, the addition is as if it never happened, because the two values are too far apart in scale. Precision is finite, as you would expect.
On the other hand, it is also not great to use the float immediately below or above (unless we really want that much precision); perhaps the second, third, or even tenth float below or above is more appropriate.
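A sketch of such a scaled comparison using a relative epsilon (the name nearly_equal and the tolerance 4 × DBL_EPSILON are illustrative choices, not a standard API):

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

/* Compare by an epsilon scaled to the magnitude of the operands. */
static int nearly_equal(double a, double b, double rel_tol)
{
    double diff  = fabs(a - b);
    double scale = fmax(fabs(a), fabs(b));
    return diff <= rel_tol * scale;
}

int main(void)
{
    double x = 0.1 + 0.2;   /* 0.30000000000000004, not 0.3 */
    printf("%d\n", x == 0.3);                               /* 0 */
    printf("%d\n", nearly_equal(x, 0.3, 4 * DBL_EPSILON));  /* 1 */
    return 0;
}
```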
In practical terms, there should be a function to do this work, or perhaps a macro for when the comparison term is constant. This is because it is not common for programming languages to have syntax for binary floats, such as 10.1₂ (2.5₁₀), let alone for 10.1000000000000000001000₂ (the 8th single-float above 2.5) or 10.0111111111111111111000₂ (the 8th single-float below 2.5). And even if a language does have such syntax, it is neither readable nor practical.
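There is indeed no readable literal for "the 8th single-float above 2.5", but in C the standard nextafterf function from <math.h> can walk to neighboring floats, so such a comparison function might be sketched like this (n_floats_away and within_n_floats are hypothetical names of my own):

```c
#include <stdio.h>
#include <math.h>

/* Step n single-floats up (n > 0) or down (n < 0) from x. */
static float n_floats_away(float x, int n)
{
    float dir   = n > 0 ? INFINITY : -INFINITY;
    int   steps = n > 0 ? n : -n;
    while (steps--)
        x = nextafterf(x, dir);
    return x;
}

/* True if x falls within n floats of ref, on either side. */
static int within_n_floats(float x, float ref, int n)
{
    return n_floats_away(ref, -n) <= x && x <= n_floats_away(ref, n);
}

int main(void)
{
    printf("%.10f\n", n_floats_away(2.5f, 8));    /* 2.5000019073 */
    printf("%.10f\n", n_floats_away(2.5f, -8));   /* 2.4999980927 */
    printf("%d\n", within_n_floats(2.5f + 1e-6f, 2.5f, 8));  /* 1 */
    return 0;
}
```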