Class CsvTokenizer
Tokenizes a byte stream into CSV fields. The processing follows the guidelines set out in RFC 4180 unless and until the stream proves to be in an incompatible format, at which point a set of additional rules kicks in to ensure that every stream can still be processed.
The byte stream is tokenized according to the rules of the ASCII encoding. This makes it compatible with any encoding that encodes 0x0A, 0x0D, 0x22, and 0x2C the same way that ASCII encodes them. UTF-8 and Extended ASCII SBCS are notable examples of acceptable encodings. UTF-16 is a notable example of an unacceptable encoding; trying to use this class to process text encoded in an unacceptable encoding will yield undesirable results without any errors.
All bytes that appear in the stream except 0x0A, 0x0D, 0x22, and 0x2C are unconditionally treated as data and passed through as-is. It is the consumer's responsibility to handle (or not handle) NUL bytes, invalid UTF-8, leading UTF-8 BOM, or any other quirks that come with the territory of text processing.
Namespace: Cursively
Assembly: Cursively.dll
Syntax
public class CsvTokenizer
Remarks
Each instance of this class expects to process all data from one stream, represented as zero or more ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase) followed by one ProcessEndOfStream(CsvReaderVisitorBase), before moving on to another stream. An instance may be reused after a stream has been fully processed, but each instance is also very lightweight, so it is recommended that callers simply create a new instance for each stream that needs to be processed.
RFC 4180 leaves a lot of wiggle room for implementers. The following list explains how this implementation resolves ambiguities in the spec, where and why it deviates, and clarifies places where the spec has "gotchas", in the order that the relevant items appear in the spec. The resolutions are primarily modeled on how Josh Close's CsvHelper library handles the same situations:
- The spec says that separate lines are delimited by CRLF line breaks. This implementation accepts line breaks of any format (CRLF, LF, CR).
- The spec says that there may or may not be a line break at the end of the last record in the stream. This implementation does not require there to be a line break, and it would not hurt to add one either.
- The spec refers to an optional header line at the beginning. This implementation does not include any special treatment for the first line of fields; if they need to be treated as headers, then the consumer needs to know that and respond accordingly.
- The spec says each record may contain "one or more fields". This implementation interprets that to mean, strictly, that any number of consecutive newline characters are treated as one.
- Many implementations allow the delimiter character to be configured to something other than a comma. This implementation does not currently offer that flexibility.
- Many implementations allow automatically trimming whitespace at the beginning and/or end of each field (sometimes optionally). The spec expressly advises against doing that, and this implementation follows suit. It is our opinion that consumers ought to be more than capable of trimming spaces at the beginning or end as part of their processing if this is desired.
- The spec says that the last field in a record must not be followed by a comma. This implementation interprets that to mean that if we do see a comma followed immediately by a line ending character, then that represents the data for an empty field.
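As a sketch of how these rules play out, the snippet below tokenizes a stream that mixes all three accepted line-break styles and omits the trailing line break. The visitor subclass is hypothetical; it assumes the three abstract methods that CsvReaderVisitorBase declares (VisitPartialFieldContents, VisitEndOfField, VisitEndOfRecord) and a runtime with Encoding.UTF8.GetString(ReadOnlySpan&lt;byte&gt;):

```csharp
using System;
using System.Text;
using Cursively;

// Hypothetical visitor that prints each field and each record boundary.
sealed class PrintingVisitor : CsvReaderVisitorBase
{
    private readonly StringBuilder _field = new StringBuilder();

    public override void VisitPartialFieldContents(ReadOnlySpan<byte> chunk) =>
        _field.Append(Encoding.UTF8.GetString(chunk));

    public override void VisitEndOfField(ReadOnlySpan<byte> chunk)
    {
        _field.Append(Encoding.UTF8.GetString(chunk));
        Console.WriteLine($"field: {_field}");
        _field.Clear();
    }

    public override void VisitEndOfRecord() => Console.WriteLine("(end of record)");
}

static class Demo
{
    static void Main()
    {
        // CRLF, LF, and CR all delimit records, and the final record
        // needs no trailing line break at all.
        byte[] data = Encoding.UTF8.GetBytes("a,b\r\nc,d\ne,f");
        var visitor = new PrintingVisitor();
        var tokenizer = new CsvTokenizer();
        tokenizer.ProcessNextChunk(data, visitor);
        tokenizer.ProcessEndOfStream(visitor); // flushes the final field and record
    }
}
```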
Finally, the spec has a lot to say about double quotes. This implementation follows the rules that the spec expressly lays out, but the spec leaves it open-ended how implementations should deal with streams that include double quotes which do not completely enclose fields. Those "gotchas" are resolved as follows:
If a double quote is encountered at the very beginning of a field, then all characters up until the next unescaped double quote or the end of the stream (whichever comes first) are considered to be part of the data for that field (we do translate escaped double quotes for convenience). This includes line ending characters, even though Excel seems to only make that happen if the field counts match up. If parsing stopped at an unescaped double quote, but there are still more bytes after that double quote before the next delimiter, then all those bytes will be treated verbatim as part of the field's data (double quotes are no longer special at all for the remainder of the field).
Double quotes encountered at any other point are included verbatim as part of the field with no special processing.
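A few worked inputs may help; the field values below follow mechanically from the rules just described (a sketch, not output captured from the library):

```csharp
// Quote-handling "gotchas", one input per line, with the fields each yields:
//
//   "a""b",c    ->  a"b   |  c    (escaped double quote translated to one quote)
//   "a<LF>b",c  ->  a<LF>b |  c   (line ending inside a quoted field is data)
//   "a"x,c      ->  ax    |  c    (bytes after the closing quote kept verbatim)
//   a"b,c       ->  a"b   |  c    (quote not at field start: no special meaning)
```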
Examples

var visitor = new MyVisitorSubclass();
var tokenizer = new CsvTokenizer();
tokenizer.ProcessNextChunk(File.ReadAllBytes("..."), visitor);
tokenizer.ProcessEndOfStream(visitor);

using (var stream = File.OpenRead("..."))
{
    var visitor = new MyVisitorSubclass();
    var tokenizer = new CsvTokenizer();
    var buffer = new byte[81920];
    int lastRead;
    while ((lastRead = stream.Read(buffer, 0, buffer.Length)) != 0)
    {
        tokenizer.ProcessNextChunk(new ReadOnlySpan<byte>(buffer, 0, lastRead), visitor);
    }

    tokenizer.ProcessEndOfStream(visitor);
}
Constructors
CsvTokenizer()
Initializes a new instance of the CsvTokenizer class.
Declaration
public CsvTokenizer()
CsvTokenizer(Byte)
Initializes a new instance of the CsvTokenizer class.
Declaration
public CsvTokenizer(byte delimiter)
Parameters
Type | Name | Description |
---|---|---|
Byte | delimiter | The single byte to expect to see between fields of the same record. This may not be an end-of-line or double-quote character, as those have special meanings. |
Exceptions
Type | Condition |
---|---|
ArgumentException | Thrown when delimiter is 0x0A, 0x0D, or 0x22 (see IsValidDelimiter(Byte)). |
Methods
IsValidDelimiter(Byte)
Checks if a particular byte value is legal for CsvTokenizer(Byte), i.e., that it is not 0x0A, 0x0D, or 0x22.
Declaration
public static bool IsValidDelimiter(byte delimiter)
Parameters
Type | Name | Description |
---|---|---|
Byte | delimiter | The single byte to expect to see between fields of the same record. This may not be an end-of-line or double-quote character, as those have special meanings. |
Returns
Type | Description |
---|---|
Boolean | true if the delimiter is legal for CsvTokenizer(Byte), false otherwise. |
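A minimal guard sketch (the tab byte 0x09 is just an arbitrary legal choice here):

```csharp
byte candidate = 0x09; // tab
// An illegal delimiter (0x0A, 0x0D, or 0x22) would make CsvTokenizer(Byte) throw,
// so check first and fall back to the default comma delimiter.
var tokenizer = CsvTokenizer.IsValidDelimiter(candidate)
    ? new CsvTokenizer(candidate)
    : new CsvTokenizer();
```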
ProcessEndOfStream(CsvReaderVisitorBase)
Informs this tokenizer that the last chunk of data in the stream has been read, and so we should make any final interactions with the CsvReaderVisitorBase and reset our state to prepare for the next stream.
Declaration
public void ProcessEndOfStream(CsvReaderVisitorBase visitor)
Parameters
Type | Name | Description |
---|---|---|
CsvReaderVisitorBase | visitor | The CsvReaderVisitorBase to interact with, or null if we should simply advance the parser state. |
Remarks
If ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase) has never been called (or has not been called since the last time that this method was called), then this method will do nothing.
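This reset is what makes the reuse pattern from the class-level Remarks work; a sketch, with hypothetical file names and a hypothetical MyVisitorSubclass:

```csharp
var visitor = new MyVisitorSubclass();
var tokenizer = new CsvTokenizer();

// First stream.
tokenizer.ProcessNextChunk(File.ReadAllBytes("first.csv"), visitor);
tokenizer.ProcessEndOfStream(visitor); // final interactions, then state resets

// The same instance may now process a second stream, though creating a
// fresh (lightweight) instance per stream is the recommended approach.
tokenizer.ProcessNextChunk(File.ReadAllBytes("second.csv"), visitor);
tokenizer.ProcessEndOfStream(visitor);
```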
ProcessNextChunk(ReadOnlySpan<Byte>, CsvReaderVisitorBase)
Accepts the next (or first) chunk of data in the CSV stream, and informs an instance of CsvReaderVisitorBase what it contains.
Declaration
public void ProcessNextChunk(ReadOnlySpan<byte> chunk, CsvReaderVisitorBase visitor)
Parameters
Type | Name | Description |
---|---|---|
ReadOnlySpan<Byte> | chunk | A ReadOnlySpan<T> containing the next chunk of data. |
CsvReaderVisitorBase | visitor | The CsvReaderVisitorBase to interact with, or null if we should simply advance the parser state. |
Remarks
If chunk is empty, this method will do nothing.