Work around g_utf8_get_char_validated() bug with embedded NUL bytes (#777973)

If PipeCapture reads a NUL byte in the middle of what is expected to be
a multi-byte UTF-8 character, then PipeCapture either returns only the
characters captured up to the previous update or loops forever,
depending on whether the end of the stream is encountered before the
read buffer is full.  This is equivalent to asking whether the NUL byte
occurs within the last 512 bytes of the output.

This is caused by a bug in g_utf8_get_char_validated(), which reports
that a partial UTF-8 character has been found when a NUL byte is
encountered in the middle of a multi-byte character, even though more
bytes are available in the length-specified buffer.
g_utf8_get_char_validated() always stops at the NUL byte, as if it were
working with a NUL-terminated string.

Work around this by checking for g_utf8_get_char_validated() claiming a
partial UTF-8 character has been found when in fact there are enough
bytes in the read buffer to determine that it is really an invalid
UTF-8 character.
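
The check described above can be sketched in isolation as follows.  This
is a minimal illustration and not the code from the commit: the helper
name partial_is_really_invalid() and the run_examples() function are
hypothetical, GLib is deliberately not used, and the bounds test is
written as a length comparison rather than the pointer comparison used
in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Expected length of a UTF-8 sequence given its first byte, or -1 if the
// byte cannot start a sequence.  Same classification as the
// utf8_char_length() added by this commit (up to 6 byte sequences).
static int utf8_char_length( unsigned char firstbyte )
{
	if ( ( firstbyte & 0x80 ) == 0x00 )  return 1;  // 0xxxxxxx
	if ( ( firstbyte & 0xE0 ) == 0xC0 )  return 2;  // 110xxxxx
	if ( ( firstbyte & 0xF0 ) == 0xE0 )  return 3;  // 1110xxxx
	if ( ( firstbyte & 0xF8 ) == 0xF0 )  return 4;  // 11110xxx
	if ( ( firstbyte & 0xFC ) == 0xF8 )  return 5;  // 111110xx
	if ( ( firstbyte & 0xFE ) == 0xFC )  return 6;  // 1111110x
	return -1;                                      // Continuation or invalid byte
}

// Hypothetical helper: given that the decoder reported "partial" for the
// bytes in [read_ptr, end_ptr), decide whether that report should be
// overridden as "invalid".  Assumes read_ptr < end_ptr.
static bool partial_is_really_invalid( const char * read_ptr, const char * end_ptr )
{
	int len = utf8_char_length( static_cast<unsigned char>( *read_ptr ) );
	if ( len == -1 )
		return true;   // First byte can never start a character
	// All the bytes the first byte calls for are already in the buffer,
	// so the character cannot be merely incomplete.
	return static_cast<std::ptrdiff_t>( len ) <= end_ptr - read_ptr;
}

// Hypothetical self-check of the three cases.
static void run_examples()
{
	const char nul_inside[] = { '\xE2', '\x00', '\xAC' };  // 3 byte lead, NUL in middle
	const char truncated[]  = { '\xE2', '\x82' };          // 3 byte lead, only 2 bytes read
	const char stray[]      = { '\x80', 'a' };             // Starts with a continuation byte

	assert(   partial_is_really_invalid( nul_inside, nul_inside + 3 ) );
	assert( ! partial_is_really_invalid( truncated,  truncated  + 2 ) );
	assert(   partial_is_really_invalid( stray,      stray      + 2 ) );
}
```

Only the truncated case remains a genuine partial character that has to
wait for more input; the other two are reported as invalid so that the
reader can skip past them.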

Reference:
    Bug 780095 - g_utf8_get_char_validated() stopping at nul byte even
                 for length specified buffers
    https://bugzilla.gnome.org/show_bug.cgi?id=780095

Bug 777973 - Segmentation fault on bad disk
Mike Fleetwood 2017-03-15 17:02:04 +00:00 committed by Curtis Gedak
parent 6b82616d2e
commit 8dbbb47ce2
2 changed files with 38 additions and 0 deletions

@@ -45,6 +45,7 @@ private:
gpointer data );
static void append_unichar_vector_to_utf8( std::string & str,
const std::vector<gunichar> & ucvec );
static int utf8_char_length( unsigned char firstbyte );
Glib::RefPtr<Glib::IOChannel> channel; // Wrapper around fd
char * readbuf; // Bytes read from IOChannel (fd)

@@ -126,6 +126,18 @@ bool PipeCapture::OnReadable( Glib::IOCondition condition )
const gunichar UTF8_PARTIAL = (gunichar)-2;
const gunichar UTF8_INVALID = (gunichar)-1;
gunichar uc = g_utf8_get_char_validated( read_ptr, end_ptr - read_ptr );
if ( uc == UTF8_PARTIAL )
{
// Workaround bug in g_utf8_get_char_validated() in which
// it reports a partial UTF-8 char when a NUL byte is
// encountered in the middle of a multi-byte character,
// yet there are more bytes available in the length
// specified buffer. Report as invalid character instead.
int len = utf8_char_length( *read_ptr );
if ( len == -1 || read_ptr + len <= end_ptr )
uc = UTF8_INVALID;
}
if ( uc == UTF8_PARTIAL )
{
// Partial UTF-8 character at end of read buffer. Copy to
@@ -231,6 +243,31 @@ void PipeCapture::append_unichar_vector_to_utf8( std::string & str, const std::v
}
}
int PipeCapture::utf8_char_length( unsigned char firstbyte )
{
// Recognise the size of FSS-UTF (1992) / UTF-8 (1993) characters given the first
// byte. Characters can be up to 6 bytes. (Later UTF-8 (2003) limited characters
// to 4 bytes and 21 bits of Unicode code-space).
// Reference:
// https://en.wikipedia.org/wiki/UTF-8
if ( ( firstbyte & 0x80 ) == 0x00 ) // 0xxxxxxx - 1 byte UTF-8 char
return 1;
else if ( ( firstbyte & 0xE0 ) == 0xC0 ) // 110xxxxx - First byte of a 2 byte UTF-8 char
return 2;
else if ( ( firstbyte & 0xF0 ) == 0xE0 ) // 1110xxxx - First byte of a 3 byte UTF-8 char
return 3;
else if ( ( firstbyte & 0xF8 ) == 0xF0 ) // 11110xxx - First byte of a 4 byte UTF-8 char
return 4;
else if ( ( firstbyte & 0xFC ) == 0xF8 ) // 111110xx - First byte of a 5 byte UTF-8 char
return 5;
else if ( ( firstbyte & 0xFE ) == 0xFC ) // 1111110x - First byte of a 6 byte UTF-8 char
return 6;
else if ( ( firstbyte & 0xC0 ) == 0x80 ) // 10xxxxxx - Continuation byte
return -1;
else // Invalid byte
return -1;
}
PipeCapture::~PipeCapture()
{
delete[] readbuf;