Friday, December 2, 2011

Floating-Point Representation


------------------------------------------------------------------------------
Decimal to Floating-Point Conversions
------------------------------------------------------------------------------

The Conversion Procedure

The rules for converting a decimal number into floating point are as follows:
  1. Convert the absolute value of the number to binary, perhaps with a fractional part after the binary point. This can be done by converting the integral and fractional parts separately. The integral part is converted with the techniques examined previously. The fractional part can be converted by repeated multiplication; this is essentially the inverse of the division method. Multiply the fraction by 2 and harvest the bit that appears to the left of the binary point, then repeat with the remaining fraction until nothing remains (or you run out of mantissa bits).
  2. Append × 2⁰ to the end of the binary number (which does not change its value).
  3. Normalize the number. Move the binary point so that it is one bit from the left. Adjust the exponent of two so that the value does not change.
  4. Place the mantissa into the mantissa field of the number. Omit the leading one, and fill with zeros on the right.
  5. Add the bias to the exponent of two, and place it in the exponent field. The bias is 2ᵏ⁻¹ − 1, where k is the number of bits in the exponent field. For the eight-bit format, k = 3, so the bias is 2³⁻¹ − 1 = 3. For IEEE 32-bit, k = 8, so the bias is 2⁸⁻¹ − 1 = 127.
  6. Set the sign bit, 1 for negative, 0 for positive, according to the sign of the original number.

Using The Conversion Procedure

  • Convert 2.625 to our 8-bit floating point format.
    1. The integral part is easy, 2₁₀ = 10₂. For the fractional part:
      0.625 × 2 = 1.25   Generate 1 and continue with the rest.
      0.25 × 2 = 0.5     Generate 0 and continue.
      0.5 × 2 = 1.0      Generate 1 and nothing remains.
      So 0.625₁₀ = 0.101₂, and 2.625₁₀ = 10.101₂.
    2. Add an exponent part: 10.101₂ = 10.101₂ × 2⁰.
    3. Normalize: 10.101₂ × 2⁰ = 1.0101₂ × 2¹.
    4. Mantissa: 0101
    5. Exponent: 1 + 3 = 4 = 100₂.
    6. Sign bit is 0.
    The result is 01000101. Represented as hex, that is 45₁₆.
  • Convert -4.75 to our 8-bit floating point format.
    1. The integral part is 4₁₀ = 100₂. The fractional:
      0.75 × 2 = 1.5   Generate 1 and continue with the rest.
      0.5 × 2 = 1.0    Generate 1 and nothing remains.
      So 4.75₁₀ = 100.11₂.
    2. Normalize: 100.11₂ = 1.0011₂ × 2².
    3. Mantissa is 0011, exponent is 2 + 3 = 5 = 101₂, sign bit is 1.
    So -4.75 is 11010011 = d3₁₆.
  • Convert 0.40625 to our 8-bit floating point format.
    1. Converting:
      0.40625 × 2 = 0.8125   Generate 0 and continue.
      0.8125 × 2 = 1.625     Generate 1 and continue with the rest.
      0.625 × 2 = 1.25       Generate 1 and continue with the rest.
      0.25 × 2 = 0.5         Generate 0 and continue.
      0.5 × 2 = 1.0          Generate 1 and nothing remains.
      So 0.40625₁₀ = 0.01101₂.
    2. Normalize: 0.01101₂ = 1.101₂ × 2⁻².
    3. Mantissa is 1010, exponent is -2 + 3 = 1 = 001₂, sign bit is 0.
    So 0.40625 is 00011010 = 1a₁₆.
  • Convert -12.0 to our 8-bit floating point format.
    1. 12₁₀ = 1100₂.
    2. Normalize: 1100.0₂ = 1.1₂ × 2³.
    3. Mantissa is 1000, exponent is 3 + 3 = 6 = 110₂, sign bit is 1.
    So -12.0 is 11101000 = e8₁₆.
  • Convert decimal 1.7 to our 8-bit floating point format.
    1. The integral part is easy, 1₁₀ = 1₂. For the fractional part:
      0.7 × 2 = 1.4   Generate 1 and continue with the rest.
      0.4 × 2 = 0.8   Generate 0 and continue.
      0.8 × 2 = 1.6   Generate 1 and continue with the rest.
      0.6 × 2 = 1.2   Generate 1 and continue with the rest.
      0.2 × 2 = 0.4   Generate 0 and continue.
      0.4 × 2 = 0.8   Generate 0 and continue.
      0.8 × 2 = 1.6   Generate 1 and continue with the rest.
      0.6 × 2 = 1.2   Generate 1 and continue with the rest.
      The process seems to continue endlessly because it does. The number 7/10, which makes a perfectly reasonable decimal fraction, is a repeating fraction in binary, just as the fraction 1/3 is a repeating fraction in decimal (it repeats in binary as well). We cannot represent it exactly as a floating-point number. The closest we can come in four bits is .1011. Since we already have a leading 1, the best value we can fit in the eight-bit format is 1.1011₂.
    2. Already normalized: 1.1011₂ = 1.1011₂ × 2⁰.
    3. Mantissa is 1011, exponent is 0 + 3 = 3 = 011₂, sign bit is 0.
    The result is 00111011 = 3b₁₆. This is not exact, of course. If you convert it back to decimal, you get 1.6875.
  • Convert -1313.3125 to IEEE 32-bit floating point format.
    1. The integral part is 1313₁₀ = 10100100001₂. The fractional:
      0.3125 × 2 = 0.625   Generate 0 and continue.
      0.625 × 2 = 1.25     Generate 1 and continue with the rest.
      0.25 × 2 = 0.5       Generate 0 and continue.
      0.5 × 2 = 1.0        Generate 1 and nothing remains.
      So 1313.3125₁₀ = 10100100001.0101₂.
    2. Normalize: 10100100001.0101₂ = 1.01001000010101₂ × 2¹⁰.
    3. Mantissa is 01001000010101000000000, exponent is 10 + 127 = 137 = 10001001₂, sign bit is 1.
    So -1313.3125 is 11000100101001000010101000000000 = c4a42a00₁₆.
  • Convert 0.1015625 to IEEE 32-bit floating point format.
    1. Converting:
      0.1015625 × 2 = 0.203125   Generate 0 and continue.
      0.203125 × 2 = 0.40625     Generate 0 and continue.
      0.40625 × 2 = 0.8125       Generate 0 and continue.
      0.8125 × 2 = 1.625         Generate 1 and continue with the rest.
      0.625 × 2 = 1.25           Generate 1 and continue with the rest.
      0.25 × 2 = 0.5             Generate 0 and continue.
      0.5 × 2 = 1.0              Generate 1 and nothing remains.
      So 0.1015625₁₀ = 0.0001101₂.
    2. Normalize: 0.0001101₂ = 1.101₂ × 2⁻⁴.
    3. Mantissa is 10100000000000000000000, exponent is -4 + 127 = 123 = 01111011₂, sign bit is 0.
    So 0.1015625 is 00111101110100000000000000000000 = 3dd00000₁₆.
  • Convert 39887.5625 to IEEE 32-bit floating point format.
    1. The integral part is 39887₁₀ = 1001101111001111₂. The fractional:
      0.5625 × 2 = 1.125   Generate 1 and continue with the rest.
      0.125 × 2 = 0.25     Generate 0 and continue.
      0.25 × 2 = 0.5       Generate 0 and continue.
      0.5 × 2 = 1.0        Generate 1 and nothing remains.
      So 39887.5625₁₀ = 1001101111001111.1001₂.
    2. Normalize: 1001101111001111.1001₂ = 1.0011011110011111001₂ × 2¹⁵.
    3. Mantissa is 00110111100111110010000, exponent is 15 + 127 = 142 = 10001110₂, sign bit is 0.
    So 39887.5625 is 01000111000110111100111110010000 = 471bcf90₁₆.
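
The procedure is mechanical enough to code up. Here is a minimal Python sketch of the six steps for the 8-bit format (the function name and the truncate-the-mantissa behavior are choices made for this sketch; it ignores zero, overflow, and denormals). It reproduces the hand-worked results above:

  def to_float8(x):
      # 8-bit format: 1 sign bit, 3 exponent bits (bias 3), 4 mantissa bits.
      sign = 0 if x >= 0 else 1          # step 6: sign bit
      m, e = abs(x), 0                   # steps 1-2: value = m * 2**e
      while m >= 2.0:                    # step 3: normalize to 1.dddd * 2**e
          m, e = m / 2.0, e + 1
      while m < 1.0:
          m, e = m * 2.0, e - 1
      frac = int((m - 1.0) * 16)         # step 4: drop the leading 1, keep 4 bits
      exp = e + 3                        # step 5: add the bias 2**(3-1) - 1 = 3
      return (sign << 7) | (exp << 4) | frac

  print(hex(to_float8(2.625)))    # 0x45
  print(hex(to_float8(-4.75)))    # 0xd3
  print(hex(to_float8(0.40625)))  # 0x1a
  print(hex(to_float8(-12.0)))    # 0xe8
  print(hex(to_float8(1.7)))      # 0x3b (inexact: the stored value is 1.6875)

The IEEE 32-bit results can be checked against Python's struct module, which packs floats in IEEE 754 single-precision format:

  import struct

  for x in (-1313.3125, 0.1015625, 39887.5625):
      bits, = struct.unpack('>I', struct.pack('>f', x))
      print(f'{x} -> {bits:08x}')
  # -1313.3125 -> c4a42a00
  # 0.1015625 -> 3dd00000
  # 39887.5625 -> 471bcf90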






The IEEE standard for floating point arithmetic

The IEEE (Institute of Electrical and Electronics Engineers) has produced a standard for floating point arithmetic. This standard specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them.
Because many of our users may have occasion to transfer unformatted or "binary" data between an IEEE machine and the Cray or the VAX/VMS, it is worth noting the details of this format for comparison with the Cray and VAX representations. The differences in the formats also affect the accuracy of floating point computations.

Summary:

Single Precision

The IEEE single precision floating point standard representation requires a 32 bit word, which may be represented as numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, 'E', and the final 23 bits are the fraction 'F':
  S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
  0 1      8 9                    31
The value V represented by the word may be determined as follows:
  • If E=255 and F is nonzero, then V=NaN ("Not a number")
  • If E=255 and F is zero and S is 1, then V=-Infinity
  • If E=255 and F is zero and S is 0, then V=Infinity
  • If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F) These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0
In particular,
  0 00000000 00000000000000000000000 = 0
  1 00000000 00000000000000000000000 = -0

  0 11111111 00000000000000000000000 = Infinity
  1 11111111 00000000000000000000000 = -Infinity

  0 11111111 00000100000000000000000 = NaN
  1 11111111 00100010001001010101010 = NaN

  0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
  0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
  1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5

  0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
  0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
  0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                       0.00000000000000000000001 = 
                                       2**(-149)  (Smallest positive value)
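
These rules translate directly into code. Below is a small Python sketch (the helper name decode32 is invented here) that interprets a 32-bit pattern by the case analysis above:

  def decode32(bits):
      s = (bits >> 31) & 1          # sign bit
      e = (bits >> 23) & 0xFF       # exponent field
      f = bits & 0x7FFFFF           # fraction field
      if e == 255:
          return float('nan') if f else (-1) ** s * float('inf')
      if e == 0:                    # zero or "unnormalized"
          return (-1) ** s * 2.0 ** -126 * (f / 2 ** 23)
      return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)

  print(decode32(0x40000000))   # 2.0
  print(decode32(0xC0D00000))   # -6.5
  print(decode32(0x00000001))   # 1.401298464324817e-45, i.e. 2**(-149)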

Double Precision

The IEEE double precision floating point standard representation requires a 64 bit word, which may be represented as numbered from 0 to 63, left to right. The first bit is the sign bit, S, the next eleven bits are the exponent bits, 'E', and the final 52 bits are the fraction 'F':
  S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
  0 1        11 12                                                63
The value V represented by the word may be determined as follows:
  • If E=2047 and F is nonzero, then V=NaN ("Not a number")
  • If E=2047 and F is zero and S is 1, then V=-Infinity
  • If E=2047 and F is zero and S is 0, then V=Infinity
  • If 0<E<2047 then V=(-1)**S * 2 ** (E-1023) * (1.F) where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
  • If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-1022) * (0.F) These are "unnormalized" values.
  • If E=0 and F is zero and S is 1, then V=-0
  • If E=0 and F is zero and S is 0, then V=0

Reference:

ANSI/IEEE Standard 754-1985,
Standard for Binary Floating Point Arithmetic

IEEE Standard 754 Floating Point Numbers



IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based PC's, Macintoshes, and most Unix platforms. This article gives a brief overview of IEEE floating point and its representation. Discussion of arithmetic implementation may be found in the book mentioned at the bottom of this article.

What Are Floating Point Numbers?

There are several ways to represent real numbers on computers. Fixed point places a radix point somewhere in the middle of the digits, and is equivalent to using integers that represent portions of some unit. For example, one might represent 1/100ths of a unit; if you have four decimal digits, you could represent 10.82, or 00.01. Another approach is to use rationals, and represent every number as the ratio of two integers.
Floating-point representation - the most common solution - basically represents reals in scientific notation. Scientific notation represents numbers as a base number and an exponent. For example, 123.456 could be represented as 1.23456 × 10^2. In hexadecimal, the number 123.abc might be represented as 1.23abc × 16^2.
Floating-point solves a number of representation problems. Fixed-point has a fixed window of representation, which limits it from representing very large or very small numbers. Also, fixed-point is prone to a loss of precision when two large numbers are divided.
Floating-point, on the other hand, employs a sort of "sliding window" of precision appropriate to the scale of the number. This allows it to represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease.

Storage Layout

IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.
The following figure shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits for each field are shown (bit ranges are in square brackets):
                    Sign    Exponent    Fraction    Bias
  Single Precision  1 [31]  8 [30-23]   23 [22-00]  127
  Double Precision  1 [63]  11 [62-52]  52 [51-00]  1023
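
These fields can be pulled out of a live value with Python's struct module (the helper fields below is invented for illustration):

  import struct

  def fields(x, double=False):
      # Return (sign, stored exponent, fraction) per the layout above.
      if double:
          bits, = struct.unpack('>Q', struct.pack('>d', x))
          return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)
      bits, = struct.unpack('>I', struct.pack('>f', x))
      return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

  print(fields(-6.5))               # (1, 129, 5242880)
  print(fields(-6.5, double=True))  # (1, 1025, 2814749767106560)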

The Sign Bit

The sign bit is as simple as it gets. 0 denotes a positive number; 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.

The Exponent

The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. For reasons discussed later, exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.
For double precision, the exponent field is 11 bits, and has a bias of 1023.

The Mantissa

The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit and the fraction bits.
To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number five can be represented as any of these:
        5.00 × 10^0
        0.05 × 10^2
        5000 × 10^-3
In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 × 10^0.
A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.
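
Python's math.frexp exposes this decomposition, though it normalizes the mantissa into [0.5, 1) rather than IEEE's [1, 2); shifting by one bit recovers the 1.f form:

  import math

  m, e = math.frexp(6.5)   # 6.5 = 0.8125 × 2**3
  print(m, e)              # 0.8125 3
  print(m * 2, e - 1)      # 1.625 2, the 1.f × 2**e form with the leading 1 explicit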

Putting it All Together

So, to sum up:
  1. The sign bit is 0 for positive, 1 for negative.
  2. The exponent's base is two.
  3. The exponent field contains 127 plus the true exponent for single-precision, or 1023 plus the true exponent for double precision.
  4. The first bit of the mantissa is assumed to be 1, so the value encoded is 1.f, where f is the field of fraction bits.

Ranges of Floating-Point Numbers

Let's consider single-precision floats for a second. Note that we're taking essentially a 32-bit number and re-jiggering the fields to cover a much broader range. Something has to give, and it's precision. For example, regular 32-bit integers, with all precision centered around zero, can precisely store integers with 32 bits of resolution. Single-precision floating-point, on the other hand, is unable to match this resolution with its 24 bits. It does, however, approximate this value by effectively truncating from the lower end. For example:
      11110000 11001100 10101010 00001111  // 32-bit integer
  = +1.1110000 11001100 10101010 × 2^31    // Single-Precision Float
  =   11110000 11001100 10101010 00000000  // Corresponding Value
This approximates the 32-bit value, but doesn't yield an exact representation. On the other hand, besides the ability to represent fractional components (which integers lack completely), the floating-point value can represent numbers around 2^127, compared to the 32-bit integer's maximum of around 2^32.
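
This truncation is easy to observe in Python by round-tripping the integer through single precision with struct:

  import struct

  n = 0b11110000_11001100_10101010_00001111   # the 32-bit integer above
  f, = struct.unpack('>f', struct.pack('>f', float(n)))
  print(n, int(f))   # 4039944719 4039944704: the low bits are gone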
The range of positive floating point numbers can be split into normalized numbers (which preserve the full precision of the mantissa), and denormalized numbers (discussed later) which use only a portion of the fraction's precision.

                    Denormalized                     Normalized                      Approximate Decimal
  Single Precision  ± 2^-149 to (1-2^-23)×2^-126    ± 2^-126 to (2-2^-23)×2^127    ± ~10^-44.85 to ~10^38.53
  Double Precision  ± 2^-1074 to (1-2^-52)×2^-1022  ± 2^-1022 to (2-2^-52)×2^1023  ± ~10^-323.3 to ~10^308.3


Since the sign of floating point numbers is given by a special leading bit, the range for negative numbers is given by the negation of the above values.
There are five distinct numerical ranges that single-precision floating-point numbers are not able to represent:
  1. Negative numbers less than -(2-2^-23) × 2^127 (negative overflow)
  2. Negative numbers greater than -2^-149 (negative underflow)
  3. Zero
  4. Positive numbers less than 2^-149 (positive underflow)
  5. Positive numbers greater than (2-2^-23) × 2^127 (positive overflow)
Overflow means that values have grown too large for the representation, much in the same way that you can overflow integers. Underflow is a less serious problem because it just denotes a loss of precision, which is guaranteed to be closely approximated by zero.
Here's a table of the effective range (excluding infinite values) of IEEE floating-point numbers:

          Binary                 Decimal
  Single  ± (2-2^-23) × 2^127    ~ ± 10^38.53
  Double  ± (2-2^-52) × 2^1023   ~ ± 10^308.25

Note that the extreme values occur (regardless of sign) when the exponent is at the maximum value for finite numbers (2^127 for single-precision, 2^1023 for double), and the mantissa is filled with 1s (including the normalizing 1 bit).
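
Python's sys.float_info reflects the double-precision limits directly, and a denormal literal shows the bottom of the range:

  import sys

  print(sys.float_info.max)   # 1.7976931348623157e+308 = (2 - 2**-52) × 2**1023
  print(sys.float_info.min)   # 2.2250738585072014e-308 = 2**-1022, smallest normalized
  print(5e-324)               # 2**-1074, smallest positive denormalized double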

Special Values

IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.

Zero

As mentioned above, zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are distinct values, though they both compare as equal.
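
Both behaviors are visible from Python: the two zeros compare equal, yet the sign bit survives:

  import math

  print(0.0 == -0.0)               # True: +0 and -0 compare equal
  print(math.copysign(1.0, -0.0))  # -1.0: the sign bit is still set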

Denormalized

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (-1)^s × 0.f × 2^-1022. From this you can interpret zero as a special type of denormalized number.
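
For instance, the smallest positive single-precision denormal (exponent field 0, fraction 1) can be produced from its raw bit pattern:

  import struct

  tiny, = struct.unpack('>f', b'\x00\x00\x00\x01')
  print(tiny)          # 1.401298464324817e-45
  print(2.0 ** -149)   # the same value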

Infinity

The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.

Not A Number

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).
A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.
An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.
Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.

Special Operations

Operations on special numbers are well-defined by IEEE. In the simplest case, any operation with a NaN yields a NaN result. Other operations are as follows:

  Operation                Result
  n ÷ ±Infinity            0
  ±Infinity × ±Infinity    ±Infinity
  ±nonzero ÷ 0             ±Infinity
  Infinity + Infinity      Infinity
  ±0 ÷ ±0                  NaN
  Infinity - Infinity      NaN
  ±Infinity ÷ ±Infinity    NaN
  ±Infinity × 0            NaN
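
Most of these rules can be observed directly with Python doubles. One caveat: Python raises ZeroDivisionError for 1.0 / 0.0 instead of returning infinity, so the ±nonzero ÷ 0 row cannot be reproduced with plain division there:

  inf = float('inf')
  print(1.0 / inf)    # 0.0   (n ÷ Infinity)
  print(inf * inf)    # inf   (Infinity × Infinity)
  print(inf + inf)    # inf   (Infinity + Infinity)
  print(inf - inf)    # nan   (Infinity - Infinity)
  print(inf / inf)    # nan   (Infinity ÷ Infinity)
  print(inf * 0.0)    # nan   (Infinity × 0)
  print(float('nan') == float('nan'))   # False: NaN compares unequal even to itself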

Summary

To sum up, the following are the corresponding values for a given representation:
Float Values (b = bias)

  Sign  Exponent (e)      Fraction (f)      Value
  0     00..00            00..00            +0
  0     00..00            00..01 to 11..11  Positive denormalized real: 0.f × 2^(-b+1)
  0     00..01 to 11..10  XX..XX            Positive normalized real: 1.f × 2^(e-b)
  0     11..11            00..00            +Infinity
  0     11..11            00..01 to 01..11  SNaN
  0     11..11            10..00 to 11..11  QNaN
  1     00..00            00..00            -0
  1     00..00            00..01 to 11..11  Negative denormalized real: -0.f × 2^(-b+1)
  1     00..01 to 11..10  XX..XX            Negative normalized real: -1.f × 2^(e-b)
  1     11..11            00..00            -Infinity
  1     11..11            00..01 to 01..11  SNaN
  1     11..11            10..00 to 11..11  QNaN

References

A lot of this stuff was observed from small programs I wrote to go back and forth between hex and floating point (printf-style), and to examine the results of various operations. The bulk of this material, however, was lifted from Stallings' book.
  1. William Stallings, Computer Organization and Architecture, pp. 222-234, Macmillan Publishing Company. ISBN 0-02-415480-6.
  2. IEEE Computer Society (1985), IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985.
  3. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture (a PDF document downloaded from intel.com).


© 2001-2005 Steve Hollasch













