COD CIF parser

Usage

  • Perl:

      ( $datablocks, $error_count, $error_messages ) = parse_cif( $file, \%options );
    
  • Python:

      datablocks, error_count, error_messages = parse( file, options )
    

Options

The COD CIF parser is designed to detect and report the most common CIF syntax errors, which it recognises by means of an extended CIF grammar. Its behaviour is controlled by the following options:

  • fix_errors. Enable all syntax error correction functionality.
  • fix_data_header. Ignore stray CIF values before the first data block and missing data_ header.
  • fix_datablock_names. Append stray CIF values after the data block name to the data block name.
  • fix_duplicate_tags_with_same_values. Ignore repeated occurrences of a data item in the same data block when all of them have the same value.
  • fix_duplicate_tags_with_empty_values. If a data item occurs more than once in the same data block and the other occurrences have unknown values ('?' or '.'), retain the occurrence with the known value.
  • fix_string_quotes. Enclose in quotes two or more unquoted values that follow a non-loop data item.
  • allow_uqstring_brackets. Put unquoted strings starting with an opening square bracket ([) in single quotes.
  • fix_ctrl_z. Remove DOS EOF (^Z, Ctrl-Z) characters that are not part of a quoted value or a text field.
  • fix_non_ascii_symbols. Encode non-ASCII symbols using numeric character references.
  • fix_missing_closing_double_quote. Insert a missing closing double quote where appropriate.
  • fix_missing_closing_single_quote. Insert a missing closing single quote where appropriate.
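The two duplicate-tag options can be pictured with a small rule. The following is only a sketch of the behaviour described above, not the parser's actual code: the function name resolve_duplicate_values is hypothetical, it compares raw value strings, and it does not distinguish quoted from unquoted '?' or '.'.

```python
# Sketch of the duplicate-tag rules described in the option list.
# Hypothetical helper; NOT part of the COD CIF parser API.

UNKNOWN_VALUES = {"?", "."}


def resolve_duplicate_values(values):
    """Pick one value for a data item that occurs several times in a data block.

    Returns the value to keep, or None when the duplicates conflict
    and neither rule applies.
    """
    if not values:
        raise ValueError("at least one value is required")

    # Rule sketched after fix_duplicate_tags_with_same_values:
    # every occurrence carries the same value, so any of them can be kept.
    if len(set(values)) == 1:
        return values[0]

    # Rule sketched after fix_duplicate_tags_with_empty_values:
    # exactly one distinct known value, all other occurrences are '?' or '.'.
    known = {v for v in values if v not in UNKNOWN_VALUES}
    if len(known) == 1:
        return next(iter(known))

    # Genuinely conflicting duplicates are left unresolved.
    return None


same = resolve_duplicate_values(["1.0", "1.0"])
with_unknowns = resolve_duplicate_values(["?", "5.2", "."])
conflict = resolve_duplicate_values(["1", "2"])
```

In this sketch conflicting known values are returned as None, so a caller would still have to report them as an error.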

There are also several additional options that affect the way CIF files are parsed:

  • do_not_unfold_text. Parse files without applying the text field unfolding algorithm.
  • do_not_unprefix_text. Parse files without applying the text field unprefixing algorithm.
  • no_print. Return a data structure with error messages instead of outputting them directly to the stderr stream.

Usage

  • cif_filter and other scripts: CIF parser options can be used in cif_filter and some other scripts by passing them on the command line: each option is prefixed with '--' and its underscores ('_') are replaced with dashes ('-'), e.g. allow_uqstring_brackets becomes --allow-uqstring-brackets.

  • Parser APIs: the central functions of both the Perl and the Python bindings for the COD CIF parser accept two arguments, the first being the file name and the second an associative array of options. For example:

    • Perl:

      parse_cif( $file, { 'fix_data_header' => 1 } )
      
    • Python:

      parse( file, { 'fix_data_header' : 1 } )
      

All other options are turned on/off likewise.
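As a minimal sketch of both usages, the helpers below build an options dictionary from the option names listed on this page and derive the corresponding command-line flags using the naming rule given above. The names KNOWN_OPTIONS, make_options and to_command_line_flags are hypothetical and are not part of the parser bindings; whether a particular script accepts a particular flag still has to be checked against that script.

```python
# Hypothetical helpers; only the option names themselves come from this page.

KNOWN_OPTIONS = {
    "fix_errors",
    "fix_data_header",
    "fix_datablock_names",
    "fix_duplicate_tags_with_same_values",
    "fix_duplicate_tags_with_empty_values",
    "fix_string_quotes",
    "allow_uqstring_brackets",
    "fix_ctrl_z",
    "fix_non_ascii_symbols",
    "fix_missing_closing_double_quote",
    "fix_missing_closing_single_quote",
    "do_not_unfold_text",
    "do_not_unprefix_text",
    "no_print",
}


def make_options(*names):
    """Return {name: 1, ...}, rejecting names that are not documented here."""
    unknown = sorted(set(names) - KNOWN_OPTIONS)
    if unknown:
        raise ValueError("unknown parser option(s): " + ", ".join(unknown))
    return {name: 1 for name in names}


def to_command_line_flags(options):
    """Apply the naming rule: prefix with '--', replace '_' with '-'."""
    return ["--" + name.replace("_", "-")
            for name, enabled in options.items() if enabled]


options = make_options("fix_data_header", "allow_uqstring_brackets")
flags = to_command_line_flags(options)

# The dictionary is what would be passed as the second argument,
# e.g. parse( file, options ) in the Python bindings.
```

Validating names up front is a design choice of this sketch: it catches a misspelled option name before the options are handed to the parser.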

Data structure

Data blocks of a parsed CIF file are stored in associative arrays with the following keys:

  • name (string): name of a CIF data block;
  • tags (array): data names present in the CIF data block (in lowercase);
  • values (associative array): keys are the values of the tags array, values are arrays containing values of each data item;
  • types (associative array): keys are the values of the tags array, values are arrays containing lexically derived data types of each data value;
  • precisions (associative array): keys are the values of the tags array, values are arrays containing standard uncertainties for each data item;
  • loops (array of arrays): each inner array corresponds to a loop from the CIF data block and contains a list of data items present in the loop;
  • inloop (associative array): keys are the values of the tags array, values are indices into the outer loops array. It serves as an index that speeds up finding the loop a data item belongs to;
  • cifversion (associative array): has keys major and minor, corresponding to the major and minor versions of the CIF format, currently 1.1 or 2.0.
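To make the layout concrete, here is a hand-written Python dictionary shaped like the description above, together with a small helper that uses inloop to collect the rows of a loop. This is an illustration only, not actual parser output: the data block contents, the string and integer representations of the values, and the helper name loop_rows are assumptions, and the types and precisions keys are left out for brevity.

```python
# Hand-written illustration of one data block; NOT real parser output.
datablock = {
    "name": "example",
    "tags": ["_cell_length_a", "_atom_site_label", "_atom_site_fract_x"],
    "values": {
        "_cell_length_a": ["10.5"],
        "_atom_site_label": ["C1", "O1"],
        "_atom_site_fract_x": ["0.25", "0.75"],
    },
    # One inner list per loop, holding the data names of that loop.
    "loops": [["_atom_site_label", "_atom_site_fract_x"]],
    # Data name -> index of its loop in "loops".
    "inloop": {"_atom_site_label": 0, "_atom_site_fract_x": 0},
    "cifversion": {"major": 1, "minor": 1},
}


def loop_rows(block, tag):
    """Return the rows of the loop containing `tag` as a list of dicts.

    An empty list is returned when `tag` is not part of any loop.
    """
    if tag not in block["inloop"]:
        return []
    loop_tags = block["loops"][block["inloop"][tag]]
    columns = [block["values"][t] for t in loop_tags]
    # zip(*columns) turns the per-tag columns into per-row tuples.
    return [dict(zip(loop_tags, row)) for row in zip(*columns)]


rows = loop_rows(datablock, "_atom_site_fract_x")
```

Because inloop maps each looped data name straight to its loop index, the helper does not need to scan every loop to find the one that contains the tag.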

Example of Perl representation of a simple CIF file

Further reading

  • Merkys, A., Vaitkus, A., Butkus, J., Okulič-Kazarinas, M., Kairys, V. & Gražulis, S. (2016). COD::CIF::Parser: an error-correcting CIF parser for the Perl language. Journal of Applied Crystallography, 49(1), 292–301. https://doi.org/10.1107/S1600576715022396