$Id: README.EXT,v 1.5 2005/05/19 04:51:31 why Exp $

This is the documentation for libsyck and describes how to extend it.

= Overview =

Syck is designed to take a YAML stream and a symbol table and move data
between the two.  Your job is to simply provide callback functions which
understand the symbol table you are keeping.

Syck also includes a simple symbol table implementation.

== About the Source ==

The Syck distribution is laid out as follows:

  lib/             libsyck source (core API)
    bytecode.re    lexer for YAML bytecode (re2c)
    emitter.c      emitter functions
    gram.y         grammar for YAML documents (bison)
    handler.c      internal handlers which glue the lexer and grammar
    implicit.re    lexer for builtin YAML types (re2c)
    node.c         node allocation and access
    syck.c         parser funcs, central funcs
    syck.h         libsyck definitions
    syck_st.c      symbol table functions
    syck_st.h      symbol table definitions
    token.re       lexer for YAML plaintext (re2c)
    yaml2byte.c    simple bytecode emitter
  ext/             ruby, python, php, cocoa extensions
  tests/           unit tests for libsyck
    YTS.c.rb       generates YAML Testing Suite unit test
                   (use: ruby YTS.c.rb > YTS.c)
    Basic.c        allocation and buffering tests
    Parse.c        parser sanity
    Emit.c         emitter sanity

== Using SyckNodes ==

The SyckNode is the structure which YAML data is loaded into while parsing.
It's also a good structure to use while emitting, however you may choose to
emit directly from your native types if your extension is very small.

SyckNodes are designed to be used in conjunction with a symbol table.  More
on that in a moment.  For now, think of a symbol table as a library which
stores nodes, assigning each node a unique identifier.

This identifier is called the SYMID in Syck.  Nodes refer to each other by
SYMIDs, rather than pointers.  This way, the nodes can be free'd as the
parser goes.

To be honest, SYMIDs are used because this is the way Ruby works.  And this
technique means Syck can use Ruby's symbol table directly.
But the included symbol table is lightweight, solves the problem of keeping
too much data in memory, and simply pairs SYMIDs with your native object
type (such as PyObject pointers).

Three kinds of SyckNodes are available:

1. scalar nodes (syck_str_kind):
   These nodes store a string, a length for the string and a style
   (indicating the format used in the YAML document).

2. sequence nodes (syck_seq_kind):
   Sequences are YAML's array or list type.  These nodes store a list of
   items, whose allocation is handled by syck functions.

3. mapping nodes (syck_map_kind):
   Mappings are YAML's dictionary or hashtable type.  These nodes store a
   list of pairs, whose allocation is handled by syck functions.

The syck_kind_tag enum specifies the above kinds, which can be tested
against the SyckNode.kind field.

PLEASE leave the SyckNode.shortcut field alone!!  It's used by the parser
to work around parser ambiguities!!

=== Node API ===

  SyckNode *
  syck_alloc_str()
  syck_alloc_seq()
  syck_alloc_map()

    Allocates a node of a given type and initializes its internal union to
    emptiness.  When left as-is, these nodes operate as a valid empty
    string, empty sequence and empty map.

    Remember that the node's id (SYMID) isn't set by the allocation
    functions OR any other node functions herein.  It's up to your handler
    function to do that.

  void
  syck_free_node( SyckNode *n )

    While the Syck parser will free nodes it creates, use this to free your
    own nodes.  This function will free all of the node's internals, its
    type_id and its anchor.  If you don't need those members freed, please
    be sure they are set to NULL.

  SyckNode *
  syck_new_str( char *str, enum scalar_style style )
  syck_new_str2( char *str, long len, enum scalar_style style )

    Creates scalar nodes from C strings.  The first function will call
    strlen() to determine length.
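
The practical difference between a strlen()-based constructor and an
explicit-length one shows up when the data contains embedded NUL bytes.  The
sketch below is a self-contained model of that distinction only.  It does not
use libsyck: the toy_str type and the toy_* functions are hypothetical
stand-ins invented for illustration, and copying the input bytes is an
assumption of the sketch, not a statement about how libsyck stores strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a scalar payload; NOT libsyck's own struct. */
typedef struct {
    char *ptr;  /* owned, NUL-terminated copy of the bytes */
    long  len;  /* number of bytes, excluding the terminator */
} toy_str;

/* Explicit-length constructor: copies exactly `len' bytes. */
toy_str *toy_new_str2(const char *str, long len)
{
    toy_str *s;

    assert(str != NULL && len >= 0);
    s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->ptr = malloc((size_t)len + 1);
    if (s->ptr == NULL) {
        free(s);
        return NULL;
    }
    memcpy(s->ptr, str, (size_t)len);
    s->ptr[len] = '\0';
    s->len = len;
    return s;
}

/* strlen()-based constructor: the length stops at the first NUL byte. */
toy_str *toy_new_str(const char *str)
{
    assert(str != NULL);
    return toy_new_str2(str, (long)strlen(str));
}

/* Releases both the copied bytes and the wrapper. */
void toy_free_str(toy_str *s)
{
    if (s != NULL) {
        free(s->ptr);
        free(s);
    }
}
```

Under this model, data with embedded NUL bytes is cut short by the
strlen()-based variant, so the explicit-length variant is the safer choice
whenever the length is already known.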

  void
  syck_replace_str( SyckNode *n, char *str, enum scalar_style style )
  syck_replace_str2( SyckNode *n, char *str, long len,
                     enum scalar_style style )

    Replaces the string content of a node `n', while keeping the node's
    type_id, anchor and id.

  char *
  syck_str_read( SyckNode *n )

    Returns a pointer to the null-terminated string inside scalar node `n'.
    Normally, you might just want to use:

      char *ptr = n->data.str->ptr;
      long len = n->data.str->len;

  SyckNode *
  syck_new_map( SYMID key, SYMID value )

    Allocates a new map with an initial pair of nodes.

  void
  syck_map_empty( SyckNode *n )

    Empties the set of pairs for a mapping node.

  void
  syck_map_add( SyckNode *n, SYMID key, SYMID value )

    Pushes a key-value pair on the mapping.  While the order in which pairs
    are added DOES affect the order of pairs on output, loaded nodes are
    deliberately out of order (since YAML mappings do not preserve
    ordering).

    See YAML's builtin !omap type for ordering in mapping nodes.

  SYMID
  syck_map_read( SyckNode *n, enum map_part, long index )

    Loads a specific key or value from position `index' within a mapping
    node.  Great for iteration:

      for ( i = 0; i < syck_map_count( n ); i++ ) {
        SYMID key = syck_map_read( n, map_key, i );
        SYMID val = syck_map_read( n, map_value, i );
      }

  void
  syck_map_assign( SyckNode *n, enum map_part, long index, SYMID id )

    Replaces a specific key or value at position `index' within a mapping
    node.  Useful for replacement only; it will not allocate more room when
    assigned beyond the end of the pair list.

  long
  syck_map_count( SyckNode *n )

    Returns a count of the pairs contained by the mapping node.

  void
  syck_map_update( SyckNode *n, SyckNode *n2 )

    Combines all pairs from mapping node `n2' into mapping node `n'.

  SyckNode *
  syck_new_seq( SYMID val )

    Allocates a new seq with an entry `val'.

  void
  syck_seq_empty( SyckNode *n )

    Empties a sequence node `n'.

  void
  syck_seq_add( SyckNode *n, SYMID val )

    Pushes a new item `val' onto the end of the sequence.
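
The split between "add" (which grows the collection) and "assign" (which only
replaces existing slots), noted above for mappings, can be modeled in a few
lines.  The sketch below is a self-contained illustration, not libsyck code:
toy_symid, toy_seq and the toy_seq_* functions are hypothetical names, and
returning 0 for an out-of-range assignment is a choice made for the sketch.
How libsyck itself reacts to an out-of-range index is not covered here.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption for this sketch: an identifier is a plain integer handle. */
typedef unsigned long toy_symid;

/* Hypothetical list of identifiers; NOT libsyck's sequence struct. */
typedef struct {
    toy_symid *items;
    long count;  /* slots in use */
    long capa;   /* slots allocated */
} toy_seq;

void toy_seq_init(toy_seq *s)
{
    assert(s != NULL);
    s->items = NULL;
    s->count = 0;
    s->capa = 0;
}

/* Push: grows storage when full.  Returns 1 on success, 0 on failure. */
int toy_seq_add(toy_seq *s, toy_symid id)
{
    assert(s != NULL);
    if (s->count == s->capa) {
        long ncapa = s->capa ? s->capa * 2 : 4;
        toy_symid *p = realloc(s->items, (size_t)ncapa * sizeof(toy_symid));
        if (p == NULL)
            return 0;
        s->items = p;
        s->capa = ncapa;
    }
    s->items[s->count++] = id;
    return 1;
}

/* Replace: touches existing slots only and never grows the list. */
int toy_seq_assign(toy_seq *s, long idx, toy_symid id)
{
    assert(s != NULL);
    if (idx < 0 || idx >= s->count)
        return 0;
    s->items[idx] = id;
    return 1;
}

/* Frees the storage and resets the list to empty. */
void toy_seq_free(toy_seq *s)
{
    assert(s != NULL);
    free(s->items);
    toy_seq_init(s);
}
```

In this model a caller appends with toy_seq_add() and uses toy_seq_assign()
only for indexes below the current count, which mirrors the "replacement
only" wording used for the assign functions.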

  void
  syck_seq_assign( SyckNode *n, long index, SYMID val )

    Replaces the item at position `index' in the sequence node with item
    `val'.  Useful for replacement only; it will not allocate more room when
    assigned beyond the end of the sequence.

  SYMID
  syck_seq_read( SyckNode *n, long index )

    Reads the item at position `index' in the sequence node.  Again, for
    iteration:

      for ( i = 0; i < syck_seq_count( n ); i++ ) {
        SYMID val = syck_seq_read( n, i );
      }

  long
  syck_seq_count( SyckNode *n )

    Returns a count of items contained by sequence node `n'.

== YAML Parser ==

Syck's YAML parser is extremely simple.  After setting up a SyckParser
struct, along with callback functions for loading node data, use
syck_parse() to start reading data.  Since syck_parse() only reads single
documents, the stream can be managed by calling syck_parse() repeatedly for
an IO source.

The parser has four callbacks: one for reading from the IO source, one for
handling errors that show up, one for handling nodes as they come in, and
one for handling bad anchors in the document.  Nodes are loaded in the order
they appear in the YAML document; however, nested nodes are loaded before
their parent.

=== How to Write a Node Handler ===

Inside the node handler, the normal process should be:

1. Convert the SyckNode data to a structure meaningful to your application.

2. Check for the bad anchor caveat described in the section on anchors
   below.

3. Add the new structure to the symbol table attached to the parser,
   found at parser->syms.

4. Return the SYMID reserved in the symbol table.

=== Nodes and Memory Allocation ===

One thing about SyckNodes passed into your handler: Syck WILL free the node
once your handler is done with it.  The node is temporary.  So, if you plan
on keeping a node around, you'll need to make yourself a new copy.

And you'll probably need to reassign all the items in a sequence and pairs
in a map.  You can do this with syck_seq_assign() and syck_map_assign().
But, before you do that, you might consider using your own node structure
that fits your application better.

=== A Note About Anchors in Parsing ===

YAML anchors can be recursive.  This means deeper alias nodes can be loaded
before the anchor.  This is the trickiest part of the loading process.

Assuming this YAML document:

  --- &a [*a]

The loading process is:

1. Load alias *a by calling parser->bad_anchor_handler, which reserves a
   SYMID in the symbol table.

2. The `a' anchor is added to Syck's own anchor table, referencing the
   SYMID above.

3. When the anchor &a is found, the SyckNode created is given the SYMID of
   the bad anchor node above.  (Usually nodes created at this stage have
   the `id' blank.)

4. The parser->handler function is called with that node.  Check for
   node->id in the handler and overwrite the bad anchor node with the new
   node.

=== Parser API ===

See <syck.h> for layouts of SyckParser and SyckNode.

  SyckParser *
  syck_new_parser()

    Creates a new Syck parser.

  void
  syck_free_parser( SyckParser *p )

    Frees the parser, as well as associated symbol tables and buffers.

  void
  syck_parser_implicit_typing( SyckParser *p, int on )

    Toggles implicit typing of builtin YAML types.  If this is passed a
    zero, YAML builtin types will be ignored (!int, !float, etc.)  The
    default is 1.

  void
  syck_parser_taguri_expansion( SyckParser *p, int on )

    Toggles expansion of types in full taguri.  This defaults to 1 and is
    recommended to stay as 1.  Turning this off removes a layer of
    abstraction that will cause incompatibilities between YAML documents of
    differing versions.

  void
  syck_parser_handler( SyckParser *p, SyckNodeHandler h )

    Assign a callback function as a node handler.  The SyckNodeHandler
    signature looks like this:

      SYMID node_handler( SyckParser *p, SyckNode *n )

  void
  syck_parser_error_handler( SyckParser *p, SyckErrorHandler h )

    Assign a callback function as an error handler.
    The SyckErrorHandler signature looks like this:

      void error_handler( SyckParser *p, char *str )

  void
  syck_parser_bad_anchor_handler( SyckParser *p, SyckBadAnchorHandler h )

    Assign a callback function as a bad anchor handler.  The
    SyckBadAnchorHandler signature looks like this:

      SyckNode *bad_anchor_handler( SyckParser *p, char *anchor )

  void
  syck_parser_file( SyckParser *p, FILE *f, SyckIoFileRead r )

    Assigns a FILE pointer as an IO source and a callback function which
    handles buffering of that IO source.

    The SyckIoFileRead signature looks like this:

      long SyckIoFileRead( char *buf, SyckIoFile *file, long max_size,
                           long skip );

    Syck comes with a default FILE handler named `syck_io_file_read'.  You
    can assign this default handler explicitly or by simply passing in NULL
    as the `r' parameter.

  void
  syck_parser_str( SyckParser *p, char *ptr, long len, SyckIoStrRead r )

    Assigns a string as the IO source with a callback function `r' which
    handles buffering of the string.

    The SyckIoStrRead signature looks like this:

      long SyckIoStrRead( char *buf, SyckIoStr *str, long max_size,
                          long skip );

    Syck comes with a default string handler named `syck_io_str_read'.  You
    can assign this default handler explicitly or by simply passing in NULL
    as the `r' parameter.

  void
  syck_parser_str_auto( SyckParser *p, char *ptr, SyckIoStrRead r )

    Same as the above, but uses strlen() to determine string size.

  SYMID
  syck_parse( SyckParser *p )

    Parses a single document from the YAML stream, returning the SYMID for
    the root node.

== YAML Emitter ==

Since the YAML 0.50 release, Syck has featured a new emitter API.  The idea
here is to let Syck figure out shortcuts that will clean up output, detect
builtin YAML types and -- especially -- determine the best way to format
outgoing strings.

The trick with the emitter is to learn its functions and let it do its job.
If you don't like the formatting Syck is producing, please get in contact
with the author and pitch your ideas!!

Like the YAML parser, the emitter has a couple of callbacks: namely, one
for IO output and one for handling nodes.  Nodes aren't necessarily
SyckNodes.  Since we're ultimately worried about creating a string,
SyckNodes become sort of unnecessary.

=== The Emitter Process ===

1. Traverse the structure you will be emitting, registering all nodes with
   the emitter using syck_emitter_mark_node().  This step will determine
   anchors and aliases in advance.

2. Call syck_emit() to begin emitting the root node.

3. Within your emitter handler, use the syck_emit_* convenience methods to
   build the document.

4. Call syck_emit_flush() to end the document and push the remaining
   document to the IO stream.  Or continue to add documents to the output
   stream with syck_emit().

=== Emitter API ===

See <syck.h> for the layout of SyckEmitter.

  SyckEmitter *
  syck_new_emitter()

    Creates a new Syck emitter.

  SYMID
  syck_emitter_mark_node( SyckEmitter *e, st_data_t node )

    Adds an outgoing node to the symbol table, allocating an anchor for it
    if it has repeated in the document and scanning the type tag for
    auto-shortcut.

  void
  syck_output_handler( SyckEmitter *e, SyckOutputHandler out )

    Assigns a callback as the output handler.

      void *out_handler( SyckEmitter *e, char * ptr, long len );

    Receives the emitter object, pointer to the buffer and a count of bytes
    which should be read from the buffer.

  void
  syck_emitter_handler( SyckEmitter *e, SyckEmitterHandler

  void
  syck_free_emitter