comm: lossy UTF-8 conversion corrupts non-UTF-8 output #10192

@sylvestre

Component

comm

Description

uutils comm converts each output line to UTF-8 using String::from_utf8_lossy() before printing, which replaces invalid UTF-8 byte sequences with U+FFFD. GNU comm writes raw bytes directly to stdout without UTF-8 conversion, preserving byte-exact input.
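The difference between the two behaviours can be sketched in a few lines of Rust. This is an illustration only, not the actual uutils code: `write_lossy` and `write_raw` are hypothetical helper names, and the assumption is that each output line is available as a byte slice.

```rust
use std::io::{self, Write};

/// Hypothetical helper: mimics the reported behaviour by converting the
/// line with String::from_utf8_lossy() before writing it.
/// Invalid UTF-8 sequences become U+FFFD (bytes ef bf bd).
fn write_lossy<W: Write>(out: &mut W, line: &[u8]) -> io::Result<()> {
    let text = String::from_utf8_lossy(line);
    out.write_all(text.as_bytes())
}

/// Hypothetical helper: writes the bytes unchanged, which is the
/// byte-exact behaviour the issue describes for GNU comm.
fn write_raw<W: Write>(out: &mut W, line: &[u8]) -> io::Result<()> {
    out.write_all(line)
}

fn main() -> io::Result<()> {
    // A single line containing the invalid UTF-8 byte 0xfe.
    let line: &[u8] = b"\xfe\n";

    let mut lossy = Vec::new();
    write_lossy(&mut lossy, line)?;

    let mut raw = Vec::new();
    write_raw(&mut raw, line)?;

    // The lossy path grows from 2 bytes to 4; the raw path is unchanged.
    println!("lossy: {:02x?}", lossy);
    println!("raw:   {:02x?}", raw);
    Ok(())
}
```

Under these assumptions, keeping lines as `&[u8]` from input to output and calling `write_all` directly avoids the conversion step entirely, so no valid-UTF-8 requirement is placed on the data.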

Test / Reproduction Steps

echo -ne "\xfe\n\xff\n" > /tmp/a
echo -ne "\xff\n\xfe\n" > /tmp/b
comm /tmp/a /tmp/b | od -An -tx1

GNU output:

09 09 ff 0a 09 09 fe 0a

uutils output:

ef bf bd 0a 09 09 ef bf bd 0a 09 ef bf bd 0a

Impact

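Non-UTF-8 text is silently corrupted in the output: every invalid byte sequence is replaced with U+FFFD (ef bf bd), so the original bytes cannot be recovered.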
