Issue 2381: Inconsistency in parsing floating point numbers

This page is a snapshot from the LWG issues list; see the Library Active Issues List for more information and the meaning of C++23 status.

2381. Inconsistency in parsing floating point numbers

Section: 28.3.4.3.2.3 [facet.num.get.virtuals] Status: C++23 Submitter: Marshall Clow Opened: 2014-04-30 Last modified: 2023-11-22

Priority: 2

Discussion:

In 28.3.4.3.2.3 [facet.num.get.virtuals] we have:

Stage 3: The sequence of chars accumulated in stage 2 (the field) is converted to a numeric value by the rules of one of the functions declared in the header <cstdlib>:

For a signed integer value, the function strtoll.

For an unsigned integer value, the function strtoull.

For a floating-point value, the function strtold.

This implies that for many cases, this routine should return true:

#include <cmath>
#include <cstdlib>
#include <sstream>
#include <string>

bool is_same(const char* p)
{
  std::string str{p};
  double val1 = std::strtod(str.c_str(), nullptr);
  std::stringstream ss(str);
  double val2;
  ss >> val2;
  return std::isinf(val1) == std::isinf(val2) &&                 // either they're both infinity
         std::isnan(val1) == std::isnan(val2) &&                 // or they're both NaN
         (std::isinf(val1) || std::isnan(val1) || val1 == val2); // or they're equal
}

and this is indeed true, for many strings:

assert(is_same("0"));
assert(is_same("1.0"));
assert(is_same("-1.0"));
assert(is_same("100.123"));
assert(is_same("1234.456e89"));

but not for others:

assert(is_same("0xABp-4")); // hex float
assert(is_same("inf"));
assert(is_same("+inf"));
assert(is_same("-inf"));
assert(is_same("nan"));
assert(is_same("+nan"));
assert(is_same("-nan"));

assert(is_same("infinity"));
assert(is_same("+infinity"));
assert(is_same("-infinity"));

These are all strings that are correctly parsed by std::strtod, but not by the stream extraction operators. They contain characters that are deemed invalid in stage 2 of parsing.

If we're going to say that we're converting by the rules of strtold, then we should accept all the things that strtold accepts.
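The mismatch can also be checked per string rather than by comparing values. The following is a minimal sketch, not part of the issue text; the helper names strtod_accepts and stream_accepts are hypothetical, and "accepts" is taken here to mean that the whole string is consumed as a single number.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical helper: true if std::strtod consumes the whole string.
bool strtod_accepts(const std::string& s)
{
  if (s.empty()) return false;
  char* end = nullptr;
  std::strtod(s.c_str(), &end);
  return end == s.c_str() + s.size();
}

// Hypothetical helper: true if operator>> extracts a double and
// nothing is left in the stream afterwards.
bool stream_accepts(const std::string& s)
{
  std::istringstream ss(s);
  double value = 0.0;
  if (!(ss >> value)) return false;  // failbit: no number was extracted
  return ss.peek() == std::char_traits<char>::eof();
}
```

On an implementation that follows the Stage 2 wording as it stood when the issue was opened, strtod_accepts("inf") is expected to hold while stream_accepts("inf") is not, because 'i' is not among the characters Stage 2 accumulates.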

[2016-04, Issues Telecon]

People are much more interested in round-tripping hex floats than handling inf and nan. Priority changed to P2.

Marshall says he'll try to write some wording, noting that this is a very closely specified part of the standard, and has remained unchanged for a long time. Also, there will need to be a sample implementation.

[2016-08, Chicago]

Zhihao provides wording

The src array in Stage 2 does narrowing only. The actual input validation is delegated to strtold (independently of the parsing in Stage 3, which is again delegated to strtold) by saying:

[...] If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1.

So a conforming C++11 num_get is supposed to magically accept a hexfloat without an exponent

0x3.AB

because we refer to C99, and the fix for this issue should be just to expand the src array.

Support for Infs and NaNs is not proposed because of the complexity of nan(n-chars).
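For reference, the strtold side of these remarks can be observed directly. This is a small sketch under the assumption of a C99-conforming strtold; the struct Converted and the function convert_field are illustrative names, not anything from the standard.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdlib>

// Illustrative result type: the converted value and how many
// characters of the field strtold consumed.
struct Converted {
  long double value;
  std::size_t consumed;
};

// Converts a field by the rules of strtold, as Stage 3 specifies
// for floating-point values.
Converted convert_field(const char* field)
{
  char* end = nullptr;
  long double v = std::strtold(field, &end);
  return Converted{v, static_cast<std::size_t>(end - field)};
}
```

With such a strtold, convert_field("0x3.AB") yields 3.66796875 (3 + 0xAB/256) even though the field has no exponent. A field such as "nan(123)" converts to a NaN, and how much of the parenthesized part is consumed is exactly the n-char-sequence complexity mentioned above.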

[2016-08, Chicago]

Tues PM: Move to Open

[2016-09-08, Zhihao Yuan comments and updates proposed wording]

Examples added.

[2018-08-23 Batavia Issues processing]

Needs an Annex C entry. Tim to write Annex C.

Previous resolution [SUPERSEDED]:

This wording is relative to N4606.

Change 28.3.4.3.2.3 [facet.num.get.virtuals]/3 Stage 2 as indicated:

static const char src[] = "0123456789abcdefpxABCDEFPX+-";

Append the following examples to 28.3.4.3.2.3 [facet.num.get.virtuals]/3 Stage 2 as indicated:

[Example:

Given an input sequence of "0x1a.bp+07p",

if Stage 1 returns %d, "0" is accumulated;

if Stage 1 returns %i, "0x1a" are accumulated;

if Stage 1 returns %g, "0x1a.bp+07" are accumulated.

In all cases, leaving the rest in the input.

— end example]

[2021-05-18 Tim updates wording]

Based on the git history, libc++ appears to have always included p and P in src.

[2021-09-20; Reflector poll]

Set status to Tentatively Ready after eight votes in favour during reflector poll.

[2021-10-14 Approved at October 2021 virtual plenary. Status changed: Voting → WP.]

Proposed resolution:

This wording is relative to N4885.

Change 28.3.4.3.2.3 [facet.num.get.virtuals]/3 Stage 2 as indicated:
— Stage 2:
If in == end then stage 2 terminates. Otherwise a charT is taken from in and local variables are initialized as if by
```
char_type ct = *in;
char c = src[find(atoms, atoms + sizeof(src) - 1, ct) - atoms];
if (ct == use_facet<numpunct<charT>>(loc).decimal_point())
  c = '.';
bool discard =
  ct == use_facet<numpunct<charT>>(loc).thousands_sep()
  && use_facet<numpunct<charT>>(loc).grouping().length() != 0;
```
where the values src and atoms are defined as if by:
```
static const char src[] = "0123456789abcdefpxABCDEFPX+-";
char_type atoms[sizeof(src)];
use_facet<ctype<charT>>(loc).widen(src, src + sizeof(src), atoms);
```
for this value of loc.
If discard is true, then if '.' has not yet been accumulated, then the position of the character is remembered, but the character is otherwise ignored. Otherwise, if '.' has already been accumulated, the character is discarded and Stage 2 terminates. If it is not discarded, then a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.
If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.
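As an illustration only (not part of the proposed wording), the narrowing step can be sketched for charT = wchar_t. The function name narrow_atom is hypothetical, and the default of the classic locale is an assumption made to keep the example deterministic.

```cpp
#include <algorithm>
#include <locale>

// Hypothetical helper: maps a wide character to the matching narrow
// character of src, or to '\0' if it is not one of the atoms.
char narrow_atom(wchar_t ct, const std::locale& loc = std::locale::classic())
{
  static const char src[] = "0123456789abcdefpxABCDEFPX+-";
  wchar_t atoms[sizeof(src)];
  // Widen src (including its terminating null) into atoms for this locale.
  std::use_facet<std::ctype<wchar_t>>(loc).widen(src, src + sizeof(src), atoms);
  // A miss leaves find at atoms + sizeof(src) - 1, which indexes the
  // terminating '\0' of src.
  return src[std::find(atoms, atoms + sizeof(src) - 1, ct) - atoms];
}
```

With the expanded src, narrow_atom(L'p') gives 'p', whereas a character outside the set, such as L'z', gives '\0'. The decimal point is not in src and is handled by the separate decimal_point() comparison.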
[Example:
Given an input sequence of "0x1a.bp+07p",
- if the conversion specifier returned by Stage 1 is %d, "0" is accumulated;
- if the conversion specifier returned by Stage 1 is %i, "0x1a" are accumulated;
- if the conversion specifier returned by Stage 1 is %g, "0x1a.bp+07" are accumulated.
In all cases, the remainder is left in the input.
— end example]
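As an illustration only (not part of the proposed wording), the example can be reproduced with a rough simulation of Stage 2. The sketch assumes charT = char in the classic locale (so atoms equals src and there is no grouping), and it interprets "allowed as the next character" as "the accumulated field plus c is still a prefix of some valid field". The names prefix_pattern and accumulate and the regular expressions are assumptions of this sketch.

```cpp
#include <algorithm>
#include <regex>
#include <string>

// Hypothetical helper: a pattern matching every viable prefix of an
// input field for the conversion specifier 'd', 'i' or 'g'.
const std::regex& prefix_pattern(char spec)
{
  static const std::regex d(R"re([+-]?[0-9]*)re");
  static const std::regex i(
      R"re([+-]?(0[xX][0-9a-fA-F]*|0[0-7]*|[1-9][0-9]*)?)re");
  static const std::regex g(
      R"re([+-]?(0[xX][0-9a-fA-F]*(\.[0-9a-fA-F]*)?([pP][+-]?[0-9]*)?)re"
      R"re(|[0-9]*(\.[0-9]*)?([eE][+-]?[0-9]*)?))re");
  return spec == 'd' ? d : spec == 'i' ? i : g;
}

// Simplified Stage 2: accumulates characters while they are allowed.
// The discard (thousands separator) branch is omitted because the
// classic locale has no grouping.
std::string accumulate(const std::string& input, char spec)
{
  static const char src[] = "0123456789abcdefpxABCDEFPX+-";
  std::string field;
  for (char ct : input) {
    // atoms is identical to src here, so the lookup is done on src itself.
    char c = *std::find(src, src + sizeof(src) - 1, ct);  // '\0' on a miss
    if (ct == '.') c = '.';
    if (c == '\0' || !std::regex_match(field + c, prefix_pattern(spec)))
      break;  // c is not allowed next: Stage 2 terminates
    field += c;
  }
  return field;
}
```

Under these assumptions, accumulate("0x1a.bp+07p", 'd') returns "0", the 'i' case returns "0x1a", and the 'g' case returns "0x1a.bp+07".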
Add the following new subclause to C.6 [diff.cpp03]:

C.4.? [locale]: localization library [diff.cpp03.locale]
Affected subclause: 28.3.4.3.2.3 [facet.num.get.virtuals]
Change: The num_get facet recognizes hexadecimal floating point values.
Rationale: Required by new feature.
Effect on original feature: Valid C++2003 code may have different behavior in this revision of C++.
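
The kind of code affected can be sketched as follows. This is an illustrative probe, not part of the Annex C entry; Extraction and extract_double are hypothetical names, and what the probe reports for a hexadecimal input depends on whether the library implements this resolution.

```cpp
#include <sstream>
#include <string>

// Illustrative result type for one extraction attempt.
struct Extraction {
  double value;      // value stored by operator>>
  bool failed;       // whether failbit was set by the extraction
  std::string rest;  // characters left in the stream afterwards
};

// Extracts one double from text and reports what was left behind.
Extraction extract_double(const std::string& text)
{
  std::istringstream in(text);
  Extraction r{0.0, false, {}};
  in >> r.value;
  r.failed = in.fail();
  in.clear();                // allow reading the remainder after a failure
  std::getline(in, r.rest);
  return r;
}
```

For an input such as "0x1p4", a library without hexadecimal support would be expected to stop after the leading "0" and leave "x1p4" in the stream, whereas a library implementing this resolution would accumulate the whole field. Code that relied on the former behavior is what the entry above refers to.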