Floating-Point Numbers, Representation And Manipulation

Floating-Point Number Representation

Binary Fixed-Point Limitation:
- Binary numbers can be stored using a fixed-point representation.
- The magnitude of stored numbers depends on bit count.
- Example:
  - 8-bit representation (using two’s complement) allows values from -128 to +127.
  - 16-bit representation extends the range to -32,768 to +32,767.
- With the binary point fixed after the last bit, this representation cannot store fractions, and its range is limited by the bit count.
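
The dependence of range on bit count can be checked with a short sketch. The function name `twos_complement_range` is an illustrative choice, and the sketch assumes plain two's complement integers with no fractional bits.

```python
def twos_complement_range(bits: int) -> tuple[int, int]:
    """Return (minimum, maximum) for a signed two's complement integer."""
    # One bit pattern in every two is negative, so the range is asymmetric.
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for bits in (8, 16):
    low, high = twos_complement_range(bits)
    print(f"{bits}-bit: {low} to {high}")
```

Each extra bit doubles the number of values that can be stored, but every value is still a whole number.
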
Scientific Notation in Binary:
- In denary, large numbers are represented using scientific notation (e.g., 312,110,000,000,000,000,000,000 can be written as 3.1211 × 10²³).
- In binary, the equivalent notation is: $M \times 2^E$
  - M (Mantissa) represents the fractional part.
  - E (Exponent) determines the power of 2.
Example Binary Floating-Point Representation:
- Assume 8 bits for mantissa and 8 bits for exponent.
- Example:
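  - By analogy, the denary number 0.31211 × 10²⁴ splits into:
    - Mantissa: 0.31211
    - Exponent: 24
  - A binary floating-point number stores M and E in the same way, each in its own 8-bit field, with E a power of 2 rather than a power of 10.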
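
As a quick illustration of the mantissa and exponent split, Python's standard `math.frexp` breaks a float into a mantissa and a power-of-2 exponent. It returns a signed mantissa with magnitude between 0.5 and 1 rather than a two's complement bit pattern, so this only sketches the idea and does not reproduce the 8-bit layout assumed in these notes. The value 11.25 is an arbitrary choice.

```python
import math

value = 11.25
mantissa, exponent = math.frexp(value)     # value == mantissa * 2**exponent
print(mantissa, exponent)                  # 0.703125 4

rebuilt = math.ldexp(mantissa, exponent)   # computes mantissa * 2**exponent
print(rebuilt)                             # 11.25
```
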
Key Terms:
- Mantissa: Fractional part of a floating-point number.
- Exponent: Power of 2 that the mantissa is multiplied by.
- Binary Floating-Point Number: A number written as M × 2ᴱ.
- Normalization: Shifting the mantissa and adjusting the exponent so that each value has a single representation with maximum precision.
- Overflow: When a calculated value is too large in magnitude to be stored.
- Underflow: When a calculated value is too close to zero to be stored.

Why Normalize?
- Prevents multiple representations of the same number.
- Ensures maximum precision.
- Uses a standard format for all floating-point numbers.
Rules for Normalization:
- Positive numbers: Mantissa should start with 0.1.
- Negative numbers: Mantissa should start with 1.0.
- Shifting: Adjust mantissa left or right, modifying the exponent accordingly.
Example of Normalization:
- Given binary number: $0.0011100 \times 2^5$
- Shift mantissa left by 2 places to get 0.1110000.
- Reduce exponent by 2: $0.1110000 \times 2^3$
- The value is unchanged: both forms equal 7 in denary.
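
The shifting rule for positive numbers can be sketched as follows. The function name `normalize_positive` and the string format `'0.0011100'` are assumptions made for this example, and negative (two's complement) mantissas are deliberately not handled.

```python
def normalize_positive(mantissa: str, exponent: int) -> tuple[str, int]:
    """Normalize a positive mantissa written like '0.0011100'."""
    sign, fraction = mantissa.split(".")
    if sign != "0" or "1" not in fraction:
        raise ValueError("expects a positive, non-zero mantissa such as '0.0011100'")
    shift = fraction.index("1")                  # leading zeros after the point
    fraction = fraction[shift:] + "0" * shift    # shift left, pad with zeros on the right
    return f"0.{fraction}", exponent - shift     # each left shift lowers the exponent by 1


print(normalize_positive("0.0011100", 5))   # ('0.1110000', 3)
```
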

Precision:
- Defined by the number of bits in the mantissa.
- More bits in mantissa → higher precision.
Range:
- Defined by the number of bits in the exponent.
- More bits in exponent → larger range.
Trade-Off Between Precision and Range:
- Example cases:
  1. 12-bit mantissa, 4-bit exponent → High precision, small range.
  2. 8-bit mantissa, 8-bit exponent → Moderate precision and range.
  3. 4-bit mantissa, 12-bit exponent → Poor precision, very high range.
Largest and Smallest Values Storable:
- Maximum Positive Value (8-bit mantissa, 8-bit exponent): $0.1111111 \times 2^{127}$
- Smallest Positive Magnitude: $0.1000000 \times 2^{-128}$
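
The trade-off can be explored with a small sketch that prints the largest and smallest positive normalized values for a given split. It assumes a two's complement exponent and the rule that a positive mantissa starts with 0.1. The helper name `format_limits` is hypothetical.

```python
def format_limits(mantissa_bits: int, exponent_bits: int) -> tuple[str, str]:
    """Return (largest, smallest) positive normalized values as text."""
    fraction_bits = mantissa_bits - 1                 # one mantissa bit is the sign
    max_exponent = 2 ** (exponent_bits - 1) - 1
    min_exponent = -(2 ** (exponent_bits - 1))
    largest = f"0.{'1' * fraction_bits} x 2^{max_exponent}"
    smallest = f"0.1{'0' * (fraction_bits - 1)} x 2^{min_exponent}"
    return largest, smallest


for split in ((12, 4), (8, 8), (4, 12)):
    print(split, format_limits(*split))
```

For the 8/8 split this gives `0.1111111 x 2^127` and `0.1000000 x 2^-128`. Moving bits from the mantissa to the exponent widens the exponent range while leaving fewer significant bits.
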

Rounding Errors:
- Some decimal values cannot be represented exactly in binary.
- Example:
  - 5.88 in decimal becomes 5.875 when stored in an 8-bit mantissa system (0.1011110 × 2³).
  - Solution: Increase mantissa size for better approximation.
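
A rough way to see the effect of mantissa size is to truncate a value to a given number of mantissa bits. This sketch assumes a positive value, one sign bit in the mantissa, and truncation rather than rounding. The function name `store` is illustrative.

```python
import math


def store(value: float, mantissa_bits: int) -> float:
    """Truncate a positive value to a mantissa with the given bit count."""
    fraction_bits = mantissa_bits - 1               # one bit is the sign
    m, e = math.frexp(value)                        # value = m * 2**e, 0.5 <= m < 1
    scaled = math.floor(m * 2 ** fraction_bits)     # keep only the bits that fit
    return math.ldexp(scaled / 2 ** fraction_bits, e)


print(store(5.88, 8))    # 5.875
print(store(5.88, 16))   # 5.8798828125
```

Under these assumptions the 16-bit mantissa lands much closer to 5.88, although neither result is exact.
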
Overflow Error:
- Occurs when a computed result exceeds the largest possible number.
- Example:
  - Attempting to compute 1.21 × 10¹⁰⁰ when the largest storable value is 10⁹⁹.
Underflow Error:
- Occurs when a computed result is closer to zero than the smallest storable magnitude.
- Example:
  - Dividing by an extremely large number may lead to underflow.
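
Both errors can be observed with Python floats, which are typically IEEE 754 doubles (largest value about 1.8 × 10³⁰⁸). That format differs from the 8-bit layout used in these notes, so the numbers below are only an illustration of the behaviour.

```python
import math

# Python floats are typically IEEE 754 doubles.
overflowed = 1e200 * 1e200       # true result 1e400 is too large to store
underflowed = 1e-200 * 1e-200    # true result 1e-400 is too close to zero

print(overflowed)                # inf
print(underflowed)               # 0.0
print(math.isinf(overflowed))    # True
```

Here overflow is signalled with a special infinity value, and the underflowed result is silently replaced by zero.
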
Zero Representation Problem:
- Normalized binary floating-point cannot represent zero.
- Since a normalized mantissa must start with 0.1 or 1.0, an all-zero mantissa is not allowed, so zero cannot be stored directly.
- Solution: Use a special reserved bit pattern for zero.

Example Conversion:
- Given binary floating-point number: $0.1011010 \times 2^4$
- Compute mantissa: $1/2 + 1/8 + 1/16 + 1/64 = 45/64$
- Apply exponent: $45/64 \times 2^4 = 11.25$
- Result: 11.25 in decimal.
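
The same conversion can be sketched in code. It assumes that both fields are two's complement bit strings and that the binary point sits after the first mantissa bit. The names `twos_complement` and `decode` are illustrative.

```python
from fractions import Fraction


def twos_complement(bits: str) -> int:
    """Interpret a bit string as a signed two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":                 # a leading 1 marks a negative value
        value -= 1 << len(bits)
    return value


def decode(mantissa_bits: str, exponent_bits: str) -> Fraction:
    """Return the exact value of M x 2^E."""
    # Dividing by 2**(n-1) places the binary point after the first mantissa bit.
    m = Fraction(twos_complement(mantissa_bits), 1 << (len(mantissa_bits) - 1))
    e = twos_complement(exponent_bits)
    return m * Fraction(2) ** e


result = decode("01011010", "00000100")
print(result, float(result))   # 45/4 11.25
```
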

Example Conversion:
- Convert 4.75 to binary floating-point.
- Convert to binary:
  - 4 in binary: 100
  - 0.75 in binary: .11
  - Full binary: 100.11
- Normalize (a positive mantissa must start with 0.1): $0.10011 \times 2^3$
- Final Binary Floating-Point Representation:
  - Mantissa: 0.10011
  - Exponent: 3
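
A minimal sketch of the encoding direction follows. It assumes an 8-bit mantissa and 8-bit exponent, the rule that a positive mantissa starts with 0.1, and truncation of any bits that do not fit. The function name `encode_positive` is hypothetical, and negative values are not handled.

```python
import math


def encode_positive(value: float, mantissa_bits: int = 8,
                    exponent_bits: int = 8) -> tuple[str, str]:
    """Return (mantissa, exponent) bit strings for a positive value."""
    if value <= 0:
        raise ValueError("this sketch only handles positive values")
    m, e = math.frexp(value)                    # 0.5 <= m < 1, so m starts 0.1 in binary
    if not -(2 ** (exponent_bits - 1)) <= e < 2 ** (exponent_bits - 1):
        raise OverflowError("exponent does not fit in the exponent field")
    scaled = int(m * 2 ** (mantissa_bits - 1))  # truncate to the available fraction bits
    mantissa = format(scaled, f"0{mantissa_bits}b")                  # leading 0 is the sign bit
    exponent = format(e % 2 ** exponent_bits, f"0{exponent_bits}b")  # two's complement
    return mantissa, exponent


print(encode_positive(4.75))   # ('01001100', '00000011')
```

Reading the mantissa with the binary point after the first bit gives 0.1001100, and the exponent field holds 3.
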

Inaccuracy in Iterative Addition
- Example:
```
number ← 0.0
FOR loop ← 1 TO 50
   number ← number + 0.1
   OUTPUT number
ENDFOR
```
  - Expected Output: 0.1, 0.2, 0.3, …, 5.0.
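  - Actual Output: Values such as 0.399999 may appear instead of 0.4, because 0.1 has no exact binary representation and the rounding error accumulates with each addition.
  - Solution: Use higher precision formats (e.g., double or quadruple precision) to reduce the error, or round the result for output. Extra precision shrinks the error but does not eliminate it.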
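
The accumulation can be reproduced in Python, whose floats are typically double precision. The sketch also uses the standard `decimal` module as one possible alternative that stores 0.1 exactly. This is an option added here for comparison, not something the notes prescribe.

```python
from decimal import Decimal

float_total = 0.0
decimal_total = Decimal("0.0")
float_values = []

for step in range(1, 51):              # 50 additions of 0.1
    float_total += 0.1
    decimal_total += Decimal("0.1")    # decimal arithmetic stores 0.1 exactly
    float_values.append(float_total)

print(float_values[:3])   # [0.1, 0.2, 0.30000000000000004]
print(decimal_total)      # 5.0
```

Even in double precision the third value is already slightly off, which shows that a wider mantissa reduces the error without removing it.
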
Division by Zero
- Example: $x / y$ when $y = 0$
- The result is mathematically undefined, so the operation typically triggers a run-time error or yields a special value.
- Solution: Implement error handling in programs.
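
One possible form of such error handling is sketched below. The function name `safe_divide` and the choice to return `None` are illustrative; a real program might instead report the error or ask for new input.

```python
from typing import Optional


def safe_divide(x: float, y: float) -> Optional[float]:
    """Divide x by y, returning None when y is zero."""
    try:
        return x / y
    except ZeroDivisionError:    # Python raises this for division by zero
        return None


print(safe_divide(6.0, 3.0))   # 2.0
print(safe_divide(1.0, 0.0))   # None
```
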

Floating-point representation allows for a wide range of values but comes with precision limitations.
Normalization ensures consistent representation of numbers.
Precision vs. range is a trade-off influenced by mantissa and exponent sizes.
Errors such as rounding, overflow, and underflow are common and must be handled in software and hardware.
