Floating-Point Representation

Floating-point representation stores real numbers in a fixed number of bits and is especially useful when values can be very large or very small. Most real numbers can only be approximated this way, so the format trades precision against range.
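
To make the approximation concrete, here is a minimal Python sketch. It assumes the standard `struct` module is used to round a value to 32 bits; the helper name `to_float32` is made up for this illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# 0.1 has no finite binary expansion, so the stored value is only close to it.
approx = to_float32(0.1)
print(f"{approx:.20f}")

# 10.25 needs only a few significant bits, so it is stored exactly.
print(to_float32(10.25) == 10.25)

# 2**24 + 1 needs 25 significant bits; single precision keeps 24,
# so the value is rounded to 2**24.
print(to_float32(16777217.0) == 16777216.0)
```

The last two comparisons both print True: small values with short binary expansions survive unchanged, while integers above 2^24 start to lose their lowest bits.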

The IEEE 754 Standard

The IEEE 754 standard is the most common way to represent floating-point numbers. It splits a floating-point number into three components:

Sign (S): Determines if the number is positive or negative (1 bit).
Exponent (E): Sets the scale of the number and therefore its range; it is stored with a bias (8 bits for single-precision).
Mantissa (M) (also called the fraction; together with the implicit leading 1 it forms the significand): Determines the precision (23 bits for single-precision).

The formula:

$$ \text{value} = (-1)^S \times 1.M \times 2^{(E - \text{Bias})} $$

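where Bias is 127 for single-precision (32-bit), and 1.M is the binary number formed by an implicit leading 1 followed by the mantissa bits. The formula applies to normalized numbers; the reserved exponent values 00000000 and 11111111 are covered under the special values.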
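
The formula can be checked with a short Python sketch. The function names `decode_float32` and `value_from_fields` are hypothetical helpers for this page, and the sketch assumes a normalized input (no zero, infinity, or NaN).

```python
import struct

BIAS = 127  # single-precision bias, as stated above

def decode_float32(x: float):
    """Split x (rounded to 32 bits) into sign, biased exponent and fraction bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

def value_from_fields(sign: int, exponent: int, fraction: int) -> float:
    """Apply (-1)^S * 1.M * 2^(E - Bias); valid for normalized numbers only."""
    significand = 1 + fraction / 2**23   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - BIAS)

s, e, m = decode_float32(-6.5)
print(s, e, m)                     # sign, biased exponent, fraction as integers
print(value_from_fields(s, e, m))  # rebuilds -6.5 from the three fields
```

For -6.5 the fields are S = 1, E = 129 (that is, 2 + 127), and a fraction of 0.625, because 6.5 is 1.101 in binary times 2^2.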

Single-Precision (32-bit) Example

Let’s break down the number 10.25 to see how it is represented in binary:

Step 1: Convert 10.25 to binary:
- Integer part: 10 in binary is 1010.
- Fraction part: 0.25 is 0.01 in binary.
- So, 10.25 in binary is 1010.01.

Step 2: Normalize it into scientific notation in binary:

$$ 1010.01 = 1.01001 \times 2^3 $$
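
The normalization step can be reproduced in Python with `math.frexp`, which returns a significand in the range [0.5, 1). The sketch below rescales it to the 1.xxx form; the helper name `normalize` is an assumption for this example, and it only handles positive finite inputs.

```python
import math

def normalize(x: float):
    """Return (significand, exponent) with 1 <= significand < 2 for x > 0."""
    m, e = math.frexp(x)   # x == m * 2**e with 0.5 <= m < 1
    return m * 2, e - 1    # rescale to the 1.xxx form used by IEEE 754

significand, exponent = normalize(10.25)
print(significand, exponent)   # 1.28125 3
```

Here 1.28125 is the decimal value of binary 1.01001 (1 + 1/4 + 1/32), and the exponent 3 matches the shift of the binary point.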

Step 3: Identify the components:
- Sign (S): 0 (positive)
- Exponent (E): We add the bias (127) to the actual exponent (3), so E = 3 + 127 = 130. In binary: 10000010
- Mantissa (M): We drop the leading 1 from 1.01001 and keep the bits after the binary point, 01001, padded with trailing zeros to 23 bits.

Step 4: Combine the sign, exponent, and mantissa bits:

$$ 0 | 10000010 | 01001000000000000000000 $$
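
As a cross-check of the four steps, the following Python sketch packs 10.25 as a 32-bit float and prints the three fields. It assumes big-endian packing via the standard `struct` module; `float32_fields` is a hypothetical helper name.

```python
import struct

def float32_fields(x: float) -> str:
    """Return the 32-bit pattern of x as 'sign | exponent | mantissa'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    b = f"{bits:032b}"                        # 32 binary digits, zero-padded
    return f"{b[0]} | {b[1:9]} | {b[9:]}"     # 1 + 8 + 23 bits

print(float32_fields(10.25))

# The same pattern as a hexadecimal word.
(word,) = struct.unpack(">I", struct.pack(">f", 10.25))
print(hex(word))   # 0x41240000
```

The first line of output should match the bit pattern assembled in Step 4.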

Special Values in Floating-Point Representation

In the IEEE 754 floating-point standard, certain special values are reserved for edge cases such as zero, infinity, and undefined operations. These values help systems represent situations that can’t be expressed as regular floating-point numbers.

Special Values Overview

Zero: Represents positive or negative zero.
Infinity: Represents positive or negative infinity, resulting from overflow or from dividing a non-zero number by zero.
NaN (Not a Number): Represents undefined results, such as 0/0 or sqrt(-1).

Value	Sign Bit (S)	Exponent (E)	Mantissa (M)	Description
+Zero	0	00000000	00000000000000000000000	Represents positive zero
-Zero	1	00000000	00000000000000000000000	Represents negative zero
+Infinity	0	11111111	00000000000000000000000	Represents positive infinity
-Infinity	1	11111111	00000000000000000000000	Represents negative infinity
NaN	0 or 1	11111111	Non-zero value	Represents “Not a Number”, used for undefined operations
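
The bit patterns in the table can also be inspected locally. The sketch below is one way to do it in Python, using `math.inf` and `math.nan` as inputs because plain Python raises an exception for `1.0 / 0.0` and `math.sqrt(-1)` instead of returning these values. The helper name `float32_bits` is made up for this example, and the exact NaN mantissa may differ between platforms.

```python
import math
import struct

def float32_bits(x: float) -> str:
    """Return the 32-bit pattern of x as 'sign exponent mantissa'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    b = f"{bits:032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

specials = [
    ("+0", 0.0),
    ("-0", -0.0),
    ("+inf", math.inf),
    ("-inf", -math.inf),
    ("nan", math.nan),
]
for name, value in specials:
    print(f"{name:>5}: {float32_bits(value)}")

# The two zeros compare equal even though their sign bits differ.
print(0.0 == -0.0)
# NaN is not equal to anything, including itself.
print(math.nan == math.nan)
```

The two comparisons print True and False respectively, which is why code that needs to detect NaN uses `math.isnan` rather than `==`.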

Try them out here: https://www.h-schmidt.net/FloatConverter/IEEE754.html
