Effectively Avoid Problems When Consuming Legacy Character Encodings in Rust

There’s still a lot of old “extended” ASCII out there and you may need to deal with it. One source is the old-fashioned “fixed-width” data formats, but it also turns up in all sorts of old files, such as spreadsheets created on Windows. Whatever the case, you can’t just wish it away, sadly.

Introduction

The key to working with encodings in Rust is that Rust guarantees a String instance contains only valid UTF-8 byte sequences. Knowing this, you can tell from type signatures what errors to expect. A safe function that fills a String buffer you pass in (rather than returning the String wrapped in a Result) must still return an encoding error when the bytes are invalid; only unsafe code can skip the check. When a function combines string creation and I/O, string encoding errors are folded into the I/O errors. Therefore, even if you don’t expect real I/O errors, check the Err variant anyway, as it may contain a UTF-8 validation error. If you need to step outside the UTF-8 world, you need a different type from String. Sometimes Vec<u8> is sufficient, sometimes you need more.

While for most use cases Rust has you covered with String::from_utf8(), String::from_utf8_unchecked() and String::from_utf8_lossy(), there are some situations where Rust can seem to let you down silently, with no warning. Things appear to work, but you might have lost data without knowing it. I’ll go over when this happens and how to easily avoid it.
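As a quick illustration of the third of those functions, here is a minimal sketch of what String::from_utf8_lossy() does with bytes that are not valid UTF-8. The helper name to_lossy is made up for this example. Each invalid byte is replaced with U+FFFD, so the original bytes cannot be recovered from the result.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode bytes as UTF-8, replacing invalid sequences with U+FFFD.
fn to_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // 165 and 188 are stray continuation bytes, so this is not valid UTF-8.
    let bytes: [u8; 8] = [65, 67, 70, 165, 188, 67, 65, 66];
    let text = to_lossy(&bytes);
    println!("{}", text);

    // The Cow is Owned only when something had to be replaced.
    println!("replaced anything: {}", matches!(text, Cow::Owned(_)));
}
```

Checking for Cow::Owned is one way to notice that a replacement happened, which matters if you must not lose data.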

The standard Rust approach

Typically you can deal with non-UTF-8 encodings in Rust either by ignoring them (bypassing the UTF-8 check with unsafe) or by converting to UTF-8 with encoding_rs before working with the data. See the sample code at the end of this post. To consume encoding-agnostic data from buffered readers (normally files), be sure to use the BufRead functions that do not return or accept strings but instead deal in Vec<u8>. Which ones those are can be less obvious than you might think.
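A minimal sketch of the byte-oriented route, using read_until() from the standard library. The function name read_raw_lines is hypothetical, and an in-memory byte slice stands in for a file so the example is self-contained.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: collect every line as raw bytes, with no UTF-8 validation.
fn read_raw_lines<R: BufRead>(mut input: R) -> io::Result<Vec<Vec<u8>>> {
    let mut lines = Vec::new();
    let mut buffer = Vec::new();
    loop {
        buffer.clear(); // read_until() appends, so start each line fresh
        if input.read_until(b'\n', &mut buffer)? == 0 {
            break; // EOF
        }
        if buffer.last() == Some(&b'\n') {
            buffer.pop(); // drop the delimiter
        }
        lines.push(buffer.clone());
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // The first line contains bytes that are not valid UTF-8.
    let data: &[u8] = b"ACF\xA5\xBC\nCAB\n";
    for line in read_raw_lines(data)? {
        println!("{:?}", line);
    }
    Ok(())
}
```

With a real file you would pass a std::io::BufReader wrapping the File instead of the byte slice.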

You run into problems when:

1. You have unexpectedly consumed invalid UTF-8 and tried to put it into a String.
2. You know you have some non-ASCII, non-UTF-8 data (perhaps Windows-1252) and you don’t want to convert it to UTF-8, but you still need to do string-like operations on it. This could be because you need to maintain compatibility with some old software, or because you may be receiving a mix of encodings and so don’t know which conversion to apply with encoding_rs.
3. You need to maximize performance on string processing when encoding doesn’t matter. Maybe you need constant-time random access into an ASCII string, or you want to avoid the cost of checking for UTF-8 or converting to UTF-8.

Problem (1) can be prevented by understanding the exact behavior of the various BufRead functions and by rigorous error checking. (See the end of this post.) Depending on your requirements, problem (2) can be addressed by keeping data in Vec<u8> and, if you need string-like operations, using the bstr crate. For problem (3), see bstr as well: https://docs.rs/bstr/latest/bstr/
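For the simpler cases of (2) and (3), plain byte slices from the standard library already go a long way. The sketch below assumes a made-up fixed-width layout (a 4-byte id followed by an 8-byte, space-padded name in Windows-1252). The helper names field and trim_end_spaces are hypothetical and not part of any crate.

```rust
/// Returns the bytes of one fixed-width field, or None if the record is too short.
/// Slicing a byte slice is constant time; no decoding or validation happens.
fn field(record: &[u8], start: usize, end: usize) -> Option<&[u8]> {
    record.get(start..end)
}

/// Strips trailing ASCII spaces (the padding) from a field.
fn trim_end_spaces(mut bytes: &[u8]) -> &[u8] {
    while let [rest @ .., b' '] = bytes {
        bytes = rest;
    }
    bytes
}

fn main() {
    // 0xE9 is "é" in Windows-1252; on its own it is not valid UTF-8.
    let record: Vec<u8> = b"0042Jos\xE9    ".to_vec();

    let id = field(&record, 0, 4).expect("record too short");
    let name = trim_end_spaces(field(&record, 4, 12).expect("record too short"));

    println!("id bytes: {:?}", id);
    println!("name bytes: {:?}", name);
    println!("name is pure ASCII: {}", name.is_ascii());
    println!("name starts with Jos: {}", name.starts_with(b"Jos"));
}
```

Once you need searching, splitting on words, or case-insensitive comparison beyond ASCII, hand-rolled helpers like these stop being enough.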

Bstr

The bstr crate is great when you need to consume lots of text data not guaranteed to be in UTF-8 and you want or need to avoid converting to UTF-8.

You get byte slices and byte string traits to treat Vec<u8> like strings, plus a helper to convert string literals to byte slices (like str but non-UTF-8.)

Longer Notes

With just two of the from_utf8 functions on String you can deal with most byte-oriented text data, if you want to stay in the UTF-8 String world.

fn main() {
    // Interpreting these bytes as UTF-8 fails. Bytes over 127 are not valid ASCII, and in
    // UTF-8 they are only valid inside a correctly formed multi-byte sequence: a lead byte
    // (194..=244) followed by the right number of continuation bytes (128..=191).
    // 165 and 188 are continuation bytes with no lead byte, so this sequence doesn't qualify.
    let non_ascii_non_utf8_bytes = vec![65, 67, 70, 165, 188, 67, 65, 66];

    /* The encoding rules for multi-byte sequences are
        2 bytes: 110xxxxx 10xxxxxx
        3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    */

    // These are valid ASCII and therefore valid UTF-8
    let ascii_bytes = vec![67, 65, 66, 101, 102, 32, 88];

    // String::from_utf8() makes a String, but returns a Result. If the bytes can't form
    // a valid UTF-8 sequence you get an Err instead of an Ok variant.
    let valid_string = String::from_utf8(ascii_bytes.clone());

    // You can skip the UTF-8 check and the unwrapping with the unsafe _unchecked() function.
    // With invalid UTF-8 the result may look only partially printable (only the ASCII portion.)
    // The function's safety contract requires valid UTF-8, so passing invalid bytes is
    // undefined behavior even if it appears to work. Not recommended! Keep raw bytes
    // in a Vec<u8> instead.
    let invalid_string: String =
        unsafe { String::from_utf8_unchecked(non_ascii_non_utf8_bytes.clone()) };

    println!(
        "From ASCII bytes: {}.  From non-ASCII bytes: {}",
        valid_string.unwrap(),
        invalid_string
    );

    // If you try to make a string from invalid UTF-8 using the normal from_utf8 method you get an error:
    let result = String::from_utf8(non_ascii_non_utf8_bytes);
    println!("Error was: {:?}", result);

    // Error was: Err(FromUtf8Error { bytes: [65, 67, 70, 165, 188, 67, 65, 66],
    //     error: Utf8Error { valid_up_to: 3, error_len: Some(1) } })
}
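The valid_up_to field in that error is useful on its own. As a minimal sketch (the function name split_valid_prefix is made up for this example), you can use Utf8Error::valid_up_to() to keep the part that did validate and leave the rest as raw bytes:

```rust
/// Hypothetical helper: split bytes into the longest valid UTF-8 prefix and the
/// remaining, not yet decoded bytes.
fn split_valid_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    match std::str::from_utf8(bytes) {
        // Everything is valid; the remainder is empty.
        Ok(text) => (text, &bytes[bytes.len()..]),
        Err(error) => {
            let (valid, rest) = bytes.split_at(error.valid_up_to());
            // The prefix was already validated, so this second check cannot fail.
            let text = std::str::from_utf8(valid).expect("prefix is valid UTF-8");
            (text, rest)
        }
    }
}

fn main() {
    let bytes: [u8; 8] = [65, 67, 70, 165, 188, 67, 65, 66];
    let (text, rest) = split_valid_prefix(&bytes);
    println!("valid prefix: {:?}, undecoded tail: {:?}", text, rest);
}
```

This stays entirely in safe code, at the cost of validating the prefix twice.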

Subtleties of Rust String Encodings and File I/O

The BufRead trait offers a number of different ways to get at data from a file. The documentation isn’t as clear as it might be. (It is technically correct though.)

Carefully check errors

I ran into some mysterious behavior when reading fixed width data. Originally the code sensibly used the read_until() function to fill a byte buffer, breaking on newlines.

But then I refactored and switched to the Lines iterator. Similar to Split, it gives you a next() that returns an io::Result. It looks “more elegant” than split(b'\n'). The code did something like

while let Some(Ok(line)) = lines.next()

This seemed to make sense; not much point in checking for I/O errors on every line.

What I didn’t realize was that next() on Lines returns, in addition to I/O errors, any UTF-8 encoding errors. Even knowing that, there is something else I wouldn’t have expected: with the buffer-filling approach (read_line()), if the bytes read in contain invalid UTF-8, the new data doesn’t get added to the buffer!

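Rust won’t make strings that aren’t UTF-8 except through unsafe. With the while let Some(Ok(line)) pattern, the loop ends at the first non-Ok result, so that line and everything after it is silently dropped. With read_line(), the buffer just doesn’t grow, and the non-UTF-8 data is silently skipped. A correct variation would be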


    while input.read_line(&mut buffer)? != 0 {
        println!("{}", buffer.trim());
        buffer.clear(); // read_line() appends, so reset the buffer for the next line
    }

or, with an iterator:

    for line in input.lines() {
        println!("{}", line?);
    }

Notice the ? operator, which unwraps the Ok value or returns the Err variant from the enclosing function.
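To make the difference concrete, here is a small sketch that counts lines in the same in-memory data two ways. Both function names are hypothetical. The first function reproduces the while let Some(Ok(..)) pattern, the second uses split(b'\n') and propagates errors with ?.

```rust
use std::io::{self, BufRead};

/// Mimics the buggy loop: the first Err (I/O or UTF-8) silently ends the loop.
fn count_lines_ignoring_errors<R: BufRead>(input: R) -> usize {
    let mut lines = input.lines();
    let mut count = 0;
    while let Some(Ok(_line)) = lines.next() {
        count += 1;
    }
    count
}

/// Treats every line as raw bytes and reports real I/O errors to the caller.
fn count_lines_as_bytes<R: BufRead>(input: R) -> io::Result<usize> {
    let mut count = 0;
    for line in input.split(b'\n') {
        let _bytes: Vec<u8> = line?;
        count += 1;
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // The second of three lines contains a byte that is not valid UTF-8.
    let data: &[u8] = b"first\nsec\xA5nd\nthird\n";
    println!("lines() with errors ignored: {}", count_lines_ignoring_errors(data));
    println!("split() on raw bytes: {}", count_lines_as_bytes(data)?);
    Ok(())
}
```

The first count comes out lower than the second, and nothing in the first function's output tells you why.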

Remember to always handle the Err variant! Ignoring it is how you get silent failures. The more confusing variation is when you use read_line() and don’t check the Result for the Err variant. The Ok variant gives you the number of bytes read, and you keep calling read_line() until you hit EOF. You think you’re just skipping I/O errors, but no: if you have encoding errors, your buffer may not get filled at all even though you reached the end of the file. If you used read_until() in a similar way, you’d be fine.
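A minimal sketch of that difference, reading the same invalid line once with read_line() and once with read_until(). The two helper names are made up for the example, and an in-memory slice stands in for a file.

```rust
use std::io::{BufRead, ErrorKind};

/// Reads one line into a String; returns the error kind (if any) and the buffer afterwards.
fn read_line_as_string(mut input: &[u8]) -> (Option<ErrorKind>, String) {
    let mut buffer = String::new();
    let kind = input.read_line(&mut buffer).err().map(|error| error.kind());
    (kind, buffer)
}

/// Reads one line into a Vec<u8>; returns the byte count and the buffer afterwards.
fn read_line_as_bytes(mut input: &[u8]) -> (usize, Vec<u8>) {
    let mut buffer = Vec::new();
    let count = input
        .read_until(b'\n', &mut buffer)
        .expect("reading from an in-memory slice does not fail");
    (count, buffer)
}

fn main() {
    let data: &[u8] = b"sec\xA5nd\n";

    // read_line(): an error comes back and the String stays empty.
    let (kind, text) = read_line_as_string(data);
    println!("read_line error kind: {:?}, buffer: {:?}", kind, text);

    // read_until(): all the bytes arrive, valid UTF-8 or not.
    let (count, raw) = read_line_as_bytes(data);
    println!("read_until read {} bytes: {:?}", count, raw);
}
```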

The docs seem to indicate you’re just dealing with bytes on read_line() as well:

Read all bytes until a newline (the 0xA byte) is reached, and append them to the provided buffer.

This function will read bytes from the underlying stream until the newline delimiter (the 0xA byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf.

If successful, this function will return the total number of bytes read.

Because in both cases I never stored data as strings (with the second version I immediately called as_bytes() on the returned strings), I just thought of the data as bytes. In many other languages this would be fine, but Rust tries hard to protect you from invalid UTF-8 byte sequences when it makes String instances. Unlike in some dynamic languages, there’s no runtime error (Rust doesn’t panic). Instead you get encoding errors that you can choose not to check, and strings that would rather be empty than contain invalid UTF-8 data. Use that ? operator! In my case I didn’t think it was necessary to check for encoding problems, as I only wanted the original bytes and didn’t care about UTF-8 strings.

It seems like the std::io::BufRead trait interface could be clearer on when it’s returning raw unprocessed bytes and when it’s returning validated UTF-8 strings.

In the documentation for read_line() on io::BufRead, the section on errors at the end finally makes things clear:

This function has the same error semantics as [read_until] and will also return an error if the read bytes are not valid UTF-8. If an I/O error is encountered then buf may contain some bytes already read in the event that all data read so far was valid UTF-8.

This is a pretty significant difference, and really not the same as read_until(). From the compiler and the IDE assistance of rust-analyzer, it’s not obvious that an io::Error can represent a string encoding error.
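If you do want to stay with read_line(), one option is to inspect the error kind. The sketch below assumes that a UTF-8 failure from read_line() surfaces as ErrorKind::InvalidData, which is what the standard library currently does. Other sources can produce that kind too, so treat the check as a heuristic. The LineOutcome enum and next_line function are hypothetical names for this example.

```rust
use std::io::{self, BufRead, ErrorKind};

/// Hypothetical result type that separates encoding problems from real I/O errors.
#[derive(Debug, PartialEq)]
enum LineOutcome {
    Text(String),
    NotUtf8,
    Eof,
}

/// Reads one line, reporting invalid UTF-8 as a value instead of an io::Error.
fn next_line<R: BufRead>(input: &mut R) -> io::Result<LineOutcome> {
    let mut line = String::new();
    match input.read_line(&mut line) {
        Ok(0) => Ok(LineOutcome::Eof),
        Ok(_) => Ok(LineOutcome::Text(line)),
        // Assumption: InvalidData here means the line was not valid UTF-8.
        Err(error) if error.kind() == ErrorKind::InvalidData => Ok(LineOutcome::NotUtf8),
        Err(error) => Err(error),
    }
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = b"ok\n\xA5\nlast\n";
    loop {
        let outcome = next_line(&mut input)?;
        println!("{:?}", outcome);
        if outcome == LineOutcome::Eof {
            break;
        }
    }
    Ok(())
}
```

The bytes of the rejected line are consumed and gone by the time the error arrives, so this only tells you that a line was lost. If you need the bytes themselves, read_until() is still the better choice.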

In the end it seems like the type signatures provide nearly enough information, but the trait names and return types don’t communicate their behavior well. Something like “read_bytes_until()” and “split_as_bytes()” might be clearer, leaving String as the default otherwise.