Floating-Point Numbers, Representation and Manipulation
Floating-Point Number Representation
- Binary Fixed-Point Limitation:
- Binary numbers can be stored using a fixed-point representation.
- The magnitude of stored numbers depends on bit count.
- Example:
- 8-bit representation (using two’s complement) allows values from -128 to +127.
- 16-bit representation extends the range to -32,768 to +32,767.
- When used for whole numbers (binary point fixed at the right-hand end), this representation cannot store fractions, and its range is limited by the bit count.
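The ranges quoted above follow directly from the bit count. A minimal sketch, assuming two’s complement storage as in the example (the function name `twos_complement_range` is made up for illustration):

```python
def twos_complement_range(bits: int) -> tuple[int, int]:
    """Return (minimum, maximum) for a two's complement integer of the given width."""
    # One bit pattern in two is negative, so the range is asymmetric by one.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for width in (8, 16):
    low, high = twos_complement_range(width)
    print(f"{width}-bit: {low} to {high}")
```

Each extra bit doubles the range, but no number of bits gives this format a fractional part.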
- Scientific Notation in Binary:
- In denary, large numbers are written in scientific notation (e.g., 312,110,000,000,000,000,000,000 can be written as 3.1211 × 10²³).
- In binary, the equivalent notation is: M × 2ᴱ
- M (Mantissa) represents the fractional part.
- E (Exponent) determines the power of 2.
- Example Binary Floating-Point Representation:
- Assume 8 bits for mantissa and 8 bits for exponent.
- Example:
- The denary number 0.31211 × 10²⁴ splits into two parts; a binary floating-point number stores the same two parts in binary, with 2 as the base:
- Mantissa: 0.31211
- Exponent: 24
- Key Terms:
- Mantissa: Fractional part of a floating-point number.
- Exponent: Power of 2 that the mantissa is multiplied by.
- Binary Floating-Point Number: A number written as M × 2ᴱ.
- Normalization: Shifting the mantissa and adjusting the exponent so that a number has one standard form with maximum precision.
- Overflow: When a calculated value exceeds storage capacity.
- Underflow: When a calculated value is too close to zero to be stored.
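To make the M × 2ᴱ idea concrete, here is a minimal sketch that decodes a mantissa and an exponent given as bit strings. It assumes both fields are two’s complement and that the binary point sits after the first mantissa bit; the helper names are made up for illustration.

```python
def twos_complement_int(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":               # leading 1 means the value is negative
        value -= 1 << len(bits)
    return value


def decode_float(mantissa_bits: str, exponent_bits: str) -> float:
    """Return M x 2^E for the given mantissa and exponent bit strings."""
    # Assumption: the binary point follows the first (sign) bit of the mantissa.
    mantissa = twos_complement_int(mantissa_bits) / (1 << (len(mantissa_bits) - 1))
    exponent = twos_complement_int(exponent_bits)
    return mantissa * 2.0 ** exponent


print(decode_float("01011010", "00000100"))  # 0.1011010 x 2^4 -> 11.25
print(decode_float("10000000", "00000000"))  # 1.0000000 x 2^0 -> -1.0
```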
Normalization of Floating-Point Numbers
- Why Normalize?
- Prevents multiple representations of the same number.
- Ensures maximum precision.
- Uses a standard format for all floating-point numbers.
- Rules for Normalization:
- Positive numbers: Mantissa should start with 0.1.
- Negative numbers: Mantissa should start with 1.0.
- Shifting: Adjust mantissa left or right, modifying the exponent accordingly.
- Example of Normalization:
- Given binary number: 0.0011100 × 2⁵
- Shift the mantissa left two places to get 0.1110000.
- Reduce the exponent by 2: 0.1110000 × 2³
- The normalized form preserves the original value (both forms equal 7 in denary).
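The shifting rule for positive numbers can be sketched in a few lines. This is an illustration only: it assumes a positive, non-zero mantissa written as a string of the form `0.xxxxxxx`, and the function name `normalize_positive` is hypothetical.

```python
def normalize_positive(mantissa: str, exponent: int) -> tuple[str, int]:
    """Shift a positive mantissa left until it starts with 0.1, adjusting the exponent."""
    whole, frac = mantissa.split(".")
    if whole != "0" or "1" not in frac:
        raise ValueError("expects a non-zero positive mantissa of the form 0.xxxxxxx")
    shift = frac.index("1")              # leading zeros after the binary point
    frac = frac[shift:] + "0" * shift    # shift left, padding with zeros on the right
    return "0." + frac, exponent - shift # each left shift lowers the exponent by 1


print(normalize_positive("0.0011100", 5))  # ('0.1110000', 3)
```

Negative (two’s complement) mantissas would need the mirror-image rule, shifting until the pattern starts with 1.0, which this sketch does not cover.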
Precision vs. Range
- Precision:
- Defined by the number of bits in the mantissa.
- More bits in mantissa → higher precision.
- Range:
- Defined by the number of bits in the exponent.
- More bits in exponent → larger range.
- Trade-Off Between Precision and Range:
- Example cases:
- 12-bit mantissa, 4-bit exponent → High precision, small range.
- 8-bit mantissa, 8-bit exponent → Moderate precision and range.
- 4-bit mantissa, 12-bit exponent → Poor precision, very high range.
- Largest and Smallest Values Storable (8-bit mantissa, 8-bit exponent, both two’s complement):
- Maximum Positive Value: 0.1111111 × 2¹²⁷
- Smallest Positive Magnitude: 0.1000000 × 2⁻¹²⁸
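The trade-off can be explored numerically. The sketch below assumes two’s complement mantissa and exponent fields and normalized positive values (mantissa starting with 0.1); `format_limits` is a made-up helper, and exact fractions are used because the largest values do not fit in an ordinary float.

```python
from fractions import Fraction


def format_limits(mantissa_bits: int, exponent_bits: int) -> tuple[Fraction, Fraction]:
    """Return (largest positive, smallest positive normalized) values for a bit split."""
    max_exponent = 2 ** (exponent_bits - 1) - 1
    min_exponent = -(2 ** (exponent_bits - 1))
    step = Fraction(1, 2 ** (mantissa_bits - 1))           # smallest mantissa increment
    largest = (1 - step) * Fraction(2) ** max_exponent     # mantissa 0.111...1
    smallest = Fraction(1, 2) * Fraction(2) ** min_exponent  # mantissa 0.100...0
    return largest, smallest


for m_bits, e_bits in ((12, 4), (8, 8), (4, 12)):
    largest, smallest = format_limits(m_bits, e_bits)
    digits = len(str(int(largest)))  # denary digits before the point in the largest value
    print(f"{m_bits}-bit mantissa, {e_bits}-bit exponent: "
          f"mantissa step 2^-{m_bits - 1}, largest value has {digits} digit(s)")
```

Moving bits from the mantissa to the exponent makes the largest value far bigger while the mantissa step, and so the precision, gets coarser.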
Floating-Point Arithmetic and Limitations
Potential Errors
- Rounding Errors:
- Some decimal values cannot be represented exactly in binary.
- Example:
- 5.88 in decimal becomes 5.875 when stored in an 8-bit mantissa system (0.1011110 × 2³).
- Solution: Increase mantissa size for better approximation.
- Overflow Error:
- Occurs when a computed result exceeds the largest possible number.
- Example:
- Attempting to compute 1.21 × 10¹⁰⁰ when the largest storable value is 10⁹⁹.
- Underflow Error:
- Occurs when a computed result is closer to zero than the smallest storable magnitude.
- Example:
- Dividing by an extremely large number may lead to underflow.
- Zero Representation Problem:
- Normalized binary floating-point cannot represent zero.
- Since a normalized mantissa must start with 0.1 or 1.0, zero cannot be stored directly.
- Solution: Use a special reserved bit pattern for zero.
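The rounding, overflow and underflow cases can be reproduced in a short experiment. This is a sketch under stated assumptions: `store` is a hypothetical helper that truncates a positive value to a normalized mantissa of a chosen width, and the last two lines rely on Python floats being IEEE 754 doubles, which is the case on mainstream platforms.

```python
import math


def store(value: float, mantissa_bits: int = 8) -> float:
    """Approximate a positive value using a truncated, normalized mantissa."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent, 0.5 <= mantissa < 1
    scale = 2 ** (mantissa_bits - 1)         # one mantissa bit is the sign bit
    truncated = math.floor(mantissa * scale) / scale
    return truncated * 2.0 ** exponent


print(store(5.88))        # 5.875 with an 8-bit mantissa
print(store(5.88, 16))    # closer to 5.88 with a 16-bit mantissa

print(1e308 * 10)         # overflow: result is inf
print(1e-320 / 1e10)      # underflow: result is 0.0
```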
Binary Floating-Point Conversions
Converting Binary Floating-Point Numbers to Decimal
- Example Conversion:
- Given binary floating-point number: 0.1011010 × 2⁴
- Compute mantissa: 1/2 + 1/8 + 1/16 + 1/64 = 45/64
- Apply exponent: 45/64 × 2⁴ = 11.25
- Result: 11.25 in decimal.
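The same place-value sum can be checked with exact fractions. A minimal sketch, assuming a positive mantissa written as a string of the form `0.xxxxxxx` (`mantissa_value` is a made-up name):

```python
from fractions import Fraction


def mantissa_value(mantissa: str) -> Fraction:
    """Sum the place values (1/2, 1/4, 1/8, ...) of the 1 bits after the binary point."""
    _, frac_bits = mantissa.split(".")
    total = Fraction(0)
    for position, bit in enumerate(frac_bits, start=1):
        if bit == "1":
            total += Fraction(1, 2 ** position)
    return total


m = mantissa_value("0.1011010")
print(m)                  # 45/64
print(float(m * 2 ** 4))  # 11.25
```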
Converting Decimal to Binary Floating-Point
- Example Conversion:
- Convert 4.75 to binary floating-point.
- Convert to binary:
- 4 in binary: 100
- 0.75 in binary: .11
- Full binary: 100.11
- Normalize (positive mantissa must start with 0.1): 0.10011 × 2³
- Final Binary Floating-Point Representation:
- Mantissa: 0.10011
- Exponent: 3
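The conversion can also be done programmatically. The sketch below assumes the normalized form described under the normalization rules, where a positive mantissa starts with 0.1, and an 8-bit mantissa that truncates any extra bits. `to_binary_float` is a hypothetical helper; it uses `math.frexp`, which returns a mantissa in the range 0.5 to 1 together with the matching power of 2.

```python
import math


def to_binary_float(value: float, mantissa_bits: int = 8) -> tuple[str, int]:
    """Return (mantissa bit string, exponent) for a positive value, truncating extra bits."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    mantissa, exponent = math.frexp(value)   # value = mantissa * 2**exponent
    bits = []
    for _ in range(mantissa_bits - 1):       # one mantissa bit is the sign bit
        mantissa *= 2                        # doubling moves the next bit before the point
        bit = int(mantissa)
        bits.append(str(bit))
        mantissa -= bit
    return "0." + "".join(bits), exponent


print(to_binary_float(4.75))  # ('0.1001100', 3)
```

Here 0.1001100 × 2³ is 100.11 in binary, which is 4.75 in denary.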
Common Floating-Point Problems in Programming
- Inaccuracy in Iterative Addition
- Example:
number ← 0.0
FOR loop ← 1 TO 50
    number ← number + 0.1
    OUTPUT number
ENDFOR
- Expected Output: 0.1, 0.2, 0.3, …, 5.0.
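- Actual Output: values such as 0.399999 instead of 0.4, because 0.1 has no exact binary representation and the rounding errors accumulate.
- Solution: Use higher precision formats (e.g., double or quadruple precision), which shrink the error but do not remove it.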
- Division by Zero
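- Example: x ÷ y when y = 0
- The result is mathematically undefined; depending on the language, it raises a run-time error or produces a special value such as infinity or NaN.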
- Solution: Implement error handling in programs.
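Both problems are easy to reproduce. The sketch below is illustrative only: the function names are made up, and it assumes Python floats are IEEE 754 doubles. It repeats the 0.1 accumulation, shows two common ways of coping with the rounding error (a tolerance comparison and exact decimal arithmetic), and wraps a division in basic error handling.

```python
import math
from decimal import Decimal
from typing import Optional


def accumulate_float(steps: int) -> float:
    """Add 0.1 repeatedly using binary floating-point."""
    number = 0.0
    for _ in range(steps):
        number += 0.1
    return number


def accumulate_decimal(steps: int) -> Decimal:
    """Add 0.1 repeatedly using exact decimal arithmetic."""
    number = Decimal("0.0")
    for _ in range(steps):
        number += Decimal("0.1")
    return number


def safe_divide(x: float, y: float) -> Optional[float]:
    """Return x / y, or None when y is zero."""
    try:
        return x / y
    except ZeroDivisionError:
        return None


print(0.1 + 0.1 + 0.1 == 0.3)                    # False: rounding error after three additions
print(math.isclose(accumulate_float(50), 5.0))   # True: compare with a tolerance instead
print(accumulate_decimal(50) == Decimal("5.0"))  # True: decimal arithmetic is exact here
print(safe_divide(1.0, 0.0))                     # None
```

Comparing with a tolerance is usually the lighter option; decimal arithmetic trades speed for exactness with denary fractions.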
Conclusion
- Floating-point representation allows for a wide range of values but comes with precision limitations.
- Normalization ensures consistent representation of numbers.
- Precision vs. range is a trade-off influenced by mantissa and exponent sizes.
- Errors such as rounding, overflow, and underflow are common and must be handled in software and hardware.
