
    Class CsvTokenizer

    Tokenizes a byte stream into CSV fields. The processing follows the guidelines set out in RFC 4180 unless the stream proves to be in an incompatible format, in which case a set of additional rules takes over to ensure that every stream can still be processed.

    The byte stream is tokenized according to the rules of the ASCII encoding. This makes it compatible with any encoding that encodes 0x0A, 0x0D, 0x22, and 0x2C the same way that ASCII encodes them. UTF-8 and Extended ASCII SBCS are notable examples of acceptable encodings. UTF-16 is a notable example of an unacceptable encoding; trying to use this class to process text encoded in an unacceptable encoding will yield undesirable results without any errors.

    All bytes that appear in the stream except 0x0A, 0x0D, 0x22, and 0x2C are unconditionally treated as data and passed through as-is. It is the consumer's responsibility to handle (or not handle) NUL bytes, invalid UTF-8, leading UTF-8 BOM, or any other quirks that come with the territory of text processing.

    Inheritance
    Object
    CsvTokenizer
    Inherited Members
    Object.Equals(Object)
    Object.Equals(Object, Object)
    Object.GetHashCode()
    Object.GetType()
    Object.MemberwiseClone()
    Object.ReferenceEquals(Object, Object)
    Object.ToString()
    Namespace: Cursively
    Assembly: Cursively.dll
    Syntax
    public class CsvTokenizer
    Remarks

    Each instance of this class expects to process all data from one stream, represented as zero or more ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase) followed by one ProcessEndOfStream(CsvReaderVisitorBase), before moving on to another stream. An instance may be reused after a stream has been fully processed, but each instance is also very lightweight, so it is recommended that callers simply create a new instance for each stream that needs to be processed.

    RFC 4180 leaves a lot of wiggle room for implementers. The following section explains how this implementation resolves ambiguities in the spec, explains where and why it deviates, and offers clarifying notes where the spec has "gotchas", in the order that the relevant items appear in the spec. The resolutions are primarily modeled on how Josh Close's CsvHelper library handles the same situations:

    • The spec says that separate lines are delimited by CRLF line breaks. This implementation accepts line breaks of any format (CRLF, LF, CR).
    • The spec says that there may or may not be a line break at the end of the last record in the stream. This implementation does not require a trailing line break, and adding one does no harm.
    • The spec refers to an optional header line at the beginning. This implementation does not include any special treatment for the first line of fields; if they need to be treated as headers, then the consumer needs to know that and respond accordingly.
    • The spec says each record may contain "one or more fields". This implementation interprets that strictly: any run of consecutive newline characters is treated as a single line break, so blank lines do not produce empty records.
    • Many implementations allow the delimiter character to be configured to be something other than a comma. This implementation supports single-byte delimiters via the CsvTokenizer(Byte) constructor; the delimiter may not be 0x0A, 0x0D, or 0x22.
    • Many implementations allow automatically trimming whitespace at the beginning and/or end of each field (sometimes optionally). The spec expressly advises against doing that, and this implementation follows suit. It is our opinion that consumers ought to be more than capable of trimming spaces at the beginning or end as part of their processing if this is desired.
    • The spec says that the last field in a record must not be followed by a comma. This implementation interprets that to mean that a comma followed immediately by a line ending character marks the data for an empty final field.
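    As an illustrative sketch of the line-break and trailing-comma rules above (MyVisitorSubclass here stands in for any CsvReaderVisitorBase subclass that reports fields; how it reports them is up to the consumer):

```csharp
using System;
using System.Text;
using Cursively;

var tokenizer = new CsvTokenizer();
var visitor = new MyVisitorSubclass();

// "a,b" ends with a bare LF, followed by a blank line, then "c," ends with
// CRLF. The bare LF is accepted as a line break, the blank line collapses
// (consecutive newline characters are treated as one), and the comma right
// before the CRLF marks an empty final field in the second record.
byte[] data = Encoding.ASCII.GetBytes("a,b\n\nc,\r\n");
tokenizer.ProcessNextChunk(data, visitor);
tokenizer.ProcessEndOfStream(visitor);
```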

    Finally, the spec has a lot to say about double quotes. This implementation follows the rules that it expressly lays out, but there are some "gotchas" that follow from the spec leaving it open-ended how implementations should deal with various streams that include double quotes which do not completely enclose fields, resolved as follows:

    If a double quote is encountered at the very beginning of a field, then all characters up until the next unescaped double quote or the end of the stream (whichever comes first) are considered to be part of the data for that field (escaped double quotes are translated for convenience). This includes line ending characters, even though Excel seems to do this only when the field counts match up. If parsing stopped at an unescaped double quote but more bytes follow that double quote before the next delimiter, then all of those bytes are treated verbatim as part of the field's data (double quotes are no longer special at all for the remainder of the field).

    Double quotes encountered at any other point are included verbatim as part of the field with no special processing.
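    A short sketch of the double-quote rules above, again using a hypothetical MyVisitorSubclass:

```csharp
using System;
using System.Text;
using Cursively;

var tokenizer = new CsvTokenizer();
var visitor = new MyVisitorSubclass();

// Field 1: the comma is enclosed in double quotes, so it is data.
// Field 2: the CRLF is enclosed too, so the record continues across lines.
// Field 3: the double quotes are not at the start of the field, so they are
//          passed through verbatim with no special processing.
byte[] data = Encoding.ASCII.GetBytes("\"a,b\",\"line1\r\nline2\",say \"hi\"\r\n");
tokenizer.ProcessNextChunk(data, visitor);
tokenizer.ProcessEndOfStream(visitor);
```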

    var visitor = new MyVisitorSubclass();
    var tokenizer = new CsvTokenizer();
    tokenizer.ProcessNextChunk(File.ReadAllBytes("..."), visitor);
    tokenizer.ProcessEndOfStream(visitor);
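    MyVisitorSubclass in these examples is not part of the library; a minimal sketch of one might look like the following. The override names (VisitPartialFieldContents, VisitEndOfField, VisitEndOfRecord) reflect our understanding of CsvReaderVisitorBase; consult its documentation for the authoritative surface.

```csharp
using System;
using System.Text;
using Cursively;

// A minimal CsvReaderVisitorBase sketch that prints each field in brackets.
// Note: decoding each chunk independently is a simplification; a multi-byte
// UTF-8 sequence could in principle be split across chunk boundaries.
public sealed class MyVisitorSubclass : CsvReaderVisitorBase
{
    private readonly StringBuilder _fieldData = new StringBuilder();

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) =>
        _fieldData.Append(Encoding.UTF8.GetString(chunk.ToArray()));

    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        _fieldData.Append(Encoding.UTF8.GetString(chunk.ToArray()));
        Console.Write("[{0}] ", _fieldData);
        _fieldData.Clear();
    }

    public override void VisitEndOfRecord() => Console.WriteLine();
}
```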

    using (var stream = File.OpenRead("..."))
    {
        var visitor = new MyVisitorSubclass();
        var tokenizer = new CsvTokenizer();
        var buffer = new byte[81920];
        int lastRead;
        while ((lastRead = stream.Read(buffer, 0, buffer.Length)) != 0)
        {
            tokenizer.ProcessNextChunk(new ReadOnlySpan<byte>(buffer, 0, lastRead), visitor);
        }

        tokenizer.ProcessEndOfStream(visitor);
    }

    Constructors


    CsvTokenizer()

    Initializes a new instance of the CsvTokenizer class.

    Declaration
    public CsvTokenizer()

    CsvTokenizer(Byte)

    Initializes a new instance of the CsvTokenizer class.

    Declaration
    public CsvTokenizer(byte delimiter)
    Parameters
    Type Name Description
    Byte delimiter

    The single byte to expect to see between fields of the same record. This may not be an end-of-line or double-quote character, as those have special meanings.

    Exceptions
    Type Condition
    ArgumentException

    Thrown when delimiter is 0x0A, 0x0D, or 0x22.

    Methods


    IsValidDelimiter(Byte)

    Checks whether a particular byte value is legal for CsvTokenizer(Byte), i.e., that it is not 0x0A, 0x0D, or 0x22.

    Declaration
    public static bool IsValidDelimiter(byte delimiter)
    Parameters
    Type Name Description
    Byte delimiter

    The single byte to expect to see between fields of the same record. This may not be an end-of-line or double-quote character, as those have special meanings.

    Returns
    Type Description
    Boolean

    true if the delimiter is legal for CsvTokenizer(Byte), false otherwise.
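    For example, a caller can validate a candidate delimiter before constructing a tokenizer (a hedged sketch; the tab byte here is just an arbitrary candidate):

```csharp
using Cursively;

byte candidate = (byte)'\t';
var tokenizer = CsvTokenizer.IsValidDelimiter(candidate)
    ? new CsvTokenizer(candidate)  // use the custom delimiter
    : new CsvTokenizer();          // fall back to the default comma
```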


    ProcessEndOfStream(CsvReaderVisitorBase)

    Informs this tokenizer that the last chunk of data in the stream has been read, and so we should make any final interactions with the CsvReaderVisitorBase and reset our state to prepare for the next stream.

    Declaration
    public void ProcessEndOfStream(CsvReaderVisitorBase visitor)
    Parameters
    Type Name Description
    CsvReaderVisitorBase visitor

    The CsvReaderVisitorBase to interact with, or null if we should simply advance the parser state.

    Remarks

    If ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase) has never been called (or has not been called since the last time that this method was called), then this method will do nothing.


    ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase)

    Accepts the next (or first) chunk of data in the CSV stream, and informs an instance of CsvReaderVisitorBase what it contains.

    Declaration
    public void ProcessNextChunk(ReadOnlySpan<byte> chunk, CsvReaderVisitorBase visitor)
    Parameters
    Type Name Description
    ReadOnlySpan<Byte> chunk

    A ReadOnlySpan<T> containing the next chunk of data.

    CsvReaderVisitorBase visitor

    The CsvReaderVisitorBase to interact with, or null if we should simply advance the parser state.

    Remarks

    If chunk is empty, this method will do nothing.
