Chemical Information Technology 2008-09

Lecture 1: What you need to know about the Scientific Literature, Computers, and Data

Objectives of these lectures:

To define Chemo-informatics as the collection, representation and organisation of chemical data to create chemical information, to which theories and models can be applied to create chemical knowledge^[1]
To introduce the background to the course and the skills to be acquired in the course laboratories: using computers, their software and the available network information resources; prioritising and organising the information obtained with these tools; and citing the chemical literature in your laboratory reports and essays.
To introduce the chemistry computer laboratory sessions and what you are expected to do during these sessions.
This course does not deal with data logging, analysis or mining (often called Chemometrics), e.g. using Excel spreadsheets, Mathematica or MatLab.

Online (and offline) Journals, books and other information sources

The world's scientific and chemical data, information and knowledge resides in the following resources:

Peer-reviewed primary scientific journals, as articles with identified authors
Peer-reviewed secondary scientific journals, as review articles/books with identified authors
Tertiary sources, such as abstracts gleaned from the above two sources, collaboratively authored Wikipedia-like entries (peer review by a different name), or curated/edited database collections
Web pages, lecture notes and other un-reviewed materials

The first two categories are identified using Journal/Book citations or the electronic equivalent, the DOI^[2]. Web sources use the URL (or TinyURL).

Golden rule: Always cite your sources, and if possible the primary ones.

Examples of (innovative) Chemistry Journals

American Chemical Society (ACS) (including Enhanced Web Objects)
The Royal Society of Chemistry (RSC) and its Project Prospect
Science Direct sites and Interactive Acrobat, DOI:10.1016/j.tibs.2008.06.004
The central library has many reference collections and a complete alphabetic list of Journals

Open vs Closed Data and Information

Traditionally, information derived from learned institutions was freely and openly available to all, but over the last 60 years a publishing model has developed in which publishers acquire the copyright. Their model is to sell the information back to anyone who will pay for it (including the learned institutions). Much chemistry information is now of this closed variety, but a movement to re-open it is increasingly under way. Examples of both types will be shown.

How computers help manage data, information and knowledge acquired from the scientific literature

This course is all about managing data/information/knowledge with the help of computers.

A (Whirlwind) introduction/Reminder of how computers handle data

They do so with the help of an:

Operating System (OS), examples of which include:
1. Windows XP
2. Mac OS X
3. Redhat Linux
4. mobile devices such as Phones, managed by e.g. Symbian, Windows Mobile, Android, OS X, etc.
Access to the OS is controlled by authentication against user names and passwords (Web pages use the same authentication), which also serves to identify the author/curator of the data and information so created.
Organisation: historically based on a metaphor of Files or Documents located in hierarchical Folders (Directories). Directories referred to as Home or My Documents have special status for each authenticated user.
- Files: names can use up to about 255 characters on most systems, but with some caveats:
1. Do not use characters such as space, $, /, :, ?.
2. If you are tempted to use a space, use the underscore _ instead!
3. On Linux (only), file names are case sensitive, which is often the cause of much confusion!
- File Content/Data type: is normally (approximately) indicated by adding a 2-4 character extension after a period (.docx) to the name.
1. This extension may or may not be visible.
2. Special types of file, used by the operating system, may be invisible by virtue of their name starting with a period.
3. The (free text) content of a file may have been indexed and hence may become searchable by the utilities provided by the operating system.
- File Metadata (Properties): creation/modification dates, sizes, access permissions, "ownership", content, etc. are also organised by the OS.
- File Location: each file sits at a position in the folder hierarchy and can be found by searches that use file metadata as criteria.
- File Size: measured in bytes (approximately 1 character = 1 byte, sometimes 2 bytes). 10⁶ bytes ≈ 1 Mbyte, 10⁹ bytes ≈ 1 Gbyte, 10¹² bytes ≈ 1 Tbyte. The maximum size of a single file depends on the file system: 4 Gbyte on FAT32 (common on memory sticks), and very much larger on NTFS (Windows), Linux and Mac OS X file systems.
- Archives: A collection of Folders and Files which preserves the hierarchy and file metadata (.zip, .sit, .tar).
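The naming and metadata points above can be sketched in Python. This is a minimal illustration rather than part of the course software: safe_name and describe are hypothetical helper names, and the file is created in a temporary folder purely for the demonstration.

```python
import re
import tempfile
from datetime import datetime
from pathlib import Path

# Characters the notes advise against in file names (space, $, /, :, ?).
UNSAFE = re.compile(r"[ $/:?]")

def safe_name(name: str) -> str:
    """Replace each discouraged character with an underscore."""
    return UNSAFE.sub("_", name)

def describe(path: Path) -> dict:
    """Collect some of the metadata the OS keeps for a file."""
    info = path.stat()
    return {
        "extension": path.suffix,             # e.g. ".txt"
        "hidden": path.name.startswith("."),  # leading period = hidden
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime),
    }

with tempfile.TemporaryDirectory() as folder:
    report = Path(folder) / safe_name("lab report: week 1?.txt")
    report.write_text("penicillin", encoding="utf-8")
    details = describe(report)
    print(report.name, details)
```

The ten characters written to the file occupy ten bytes, in line with the "1 character = 1 byte" rule of thumb for plain text.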
Storage in an enterprise environment
- Permanent Data Storage, as files on:
  - Local hard drives (capacity 40 Gbyte to 1 Tbyte)
  - Network Drives:
    1. Drive H: Home directory (Desktop icon Home, capacity 500 Mbytes per user)
    2. Drive L: (Your "Home" on Linux systems)
    3. Drive Z: (A data-silo)
    4. Drive R: (Where files from departmental NMR Spectrometers are placed)
  - Removable media (memory sticks, iPods, CD-RW/DVD, capacity 1-120 Gbyte)
- Temporary Data Storage, as
  - "clipboard" in "System Memory" (capacity not known by user, but probably < 10 Mbyte)
  - cache or temporary files, not normally seen by the user but can wreak havoc if corrupt!
File Usage: Data Files are created and exchanged using:
- Combinations of programs, typically a Word processor (Word), a chemical drawing program (Chemdraw) and Bibliographic database (EndNote).
- Data exchange between these programs using copy/paste via clipboards or via files (drag-n-drop, save/open or sync).
File Data Structures: Internal structure of files can be hidden or exposed.
- Hidden (binary or clipboard) formats are normally understood only by specific programs and are not meant for humans. Examples include .docx (Office), .GIF, .PNG, .JPEG (Graphics), .MPEG (audio, video), .PDF (Acrobat).
- Exposed structures include HTML (structured Hypertext Markup Language), SVG (Scalable Vector Graphics), TXT (unstructured or semi-structured text)
- Specific Chemical types include:
  - Molecule specifications, with atom connection+co-ordinate types such as PDB, Molfile
  - Spectral/analytical specifications such as JCAMP
- Data: Semantics (meaning) can be added to data structures to make them re-usable in different contexts: XML (eXtensible Markup Language) is the best known way of doing this.
  - Chemical Specifications include Chemical Markup Language
- MetaData: Data should have descriptions to add context. HTML can have exposed metadata (e.g. this document). Acrobat has a structure for metadata (XMP) but this is rarely used! RDF is used in the Semantic Web.
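To show what an "exposed" structure with added semantics buys you, here is a minimal sketch that reads a small XML description of water using Python's standard library. The tag and attribute names are modelled on Chemical Markup Language, but the fragment is an illustration only and has not been validated against the CML schema.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: names are modelled on Chemical Markup
# Language but have not been checked against its schema.
DOCUMENT = """
<molecule id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
""".strip()

root = ET.fromstring(DOCUMENT)

# Because every value is labelled, a program can ask for atoms and bonds
# by name instead of guessing at column positions in a text file.
elements = [atom.get("elementType") for atom in root.iter("atom")]
bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
print(root.get("id"), elements, bonds)
```

Any program that understands XML can extract the same atoms and bonds, which is what makes the data re-usable in contexts the original author did not anticipate.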
Data Security
- An important aspect of data is knowing when and how it was generated, and by whom. One way of doing this is using digital signatures and where necessary encryption, using a technology known as digital certificates. You will see these in action during the coursework.
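A digital signature is built on a message digest, a short fingerprint that changes if the data changes. The sketch below shows only that digest step, using SHA-256 from Python's standard library; it does not sign anything, since a real signature also requires the private key that belongs to a digital certificate. The two data strings are made up for the illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Made-up record and an altered copy with one value changed.
original = b"melting point 155-156, recorded by user abc"
altered = b"melting point 165-166, recorded by user abc"

print(fingerprint(original))
print(fingerprint(altered))
print(fingerprint(original) == fingerprint(altered))
```

Changing even a single character yields a completely different digest, which is how tampering with signed data is detected.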
Introduction to the lab sessions, using a combination of programs, illustrating many of the concepts noted above.

Lecture 2: Keyword-based General Bibliographic Searches (1D)

Objectives of this lecture: This part is centred on a search for the conversion of penicillin to cephalosporin and how to fine-tune it. The EndNote bibliographic software will be introduced, showing how it operates with Word, together with the use of bibliographic and library indices through Web-browser interfaces. The following concepts will be introduced:

Boolean logical operators: AND (and the slightly more specific SAME), OR, NOT, XOR.
Wildcard (Stemming) characters: ? vs * vs $, *SULPHUR vs SULPHU* vs SUL*UR
Grouping: A AND (B OR C) vs (A AND B) OR C
Metadata-driven searches (fielded searches): author, year-of-publication with the syntax author:Blogs or au=Blogs
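The effect of the Boolean operators and of grouping can be seen by treating each search term as the set of records that contain it. The sketch below uses Python sets; the index and its record numbers are hypothetical.

```python
# Hypothetical index: search term -> record numbers that contain it.
INDEX = {
    "reaction": {1, 2, 3},
    "penicillin": {2, 4},
    "cephalosporin": {3, 4, 5},
}
A, B, C = INDEX["reaction"], INDEX["penicillin"], INDEX["cephalosporin"]

narrow = A & (B | C)   # A AND (B OR C)
broad = (A & B) | C    # (A AND B) OR C
b_not_c = B - C        # B NOT C
b_xor_c = B ^ C        # B XOR C: either term, but not both

print(sorted(narrow))   # [2, 3]
print(sorted(broad))    # [2, 3, 4, 5]
print(sorted(b_not_c))  # [2]
print(sorted(b_xor_c))  # [2, 3, 5]
```

Moving the brackets changes the result: in the second form every record containing C is returned, whether or not it also contains A.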

Specific Search engines used to illustrate these concepts

Robot based Internet Indices:
Unicorn (IC Site specific index of local resources):
- AND, OR, NOT, XOR (exclusive OR, which retrieves either term, but not both terms),
- $ = 1 Wild character
WOS (Web-of-Science): Uses
- field tags
- Booleans: AND, OR, NOT, SAME = Proximity operator,
- ? = 1 wild character; * = 1 or more wild characters, allowed in the middle or on the right of a term (SUL*UR, BIOLOG*) but not on the left (*NATAL),
- (...) for grouped expressions, i.e. A NOT (B OR C). Examples:
- AU=Blogs BS AND PY=2008
- TI=Reaction AND (TI=penicillin OR TI=cephalosporin) (134)
- (TI=Reaction AND TI=Penicillin) OR cephalosporin (2559)
Beilstein Crossfire:
- AND, OR, NOT, PROXIMITY, NEAR, NEXT (first term always before second term),
- WildCards: ? = 1 character, ?? = 2 characters, * = any number. Wild cards can be used left, middle, right (unlike WOS).
- Try "penicillin and cephalosporin" as a text search, then
- MP 155-156 AND MF = C29H28N2O6S1 AND ORP 190-200 as a field search
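Wildcards and fielded searches can be imitated in a few lines of Python. This is a sketch under stated assumptions: shell-style patterns from the standard fnmatch module stand in for the search engines' own wildcard rules (which differ in detail, for example in whether * may match zero characters), and the word list and records are made up.

```python
from fnmatch import fnmatchcase

WORDS = ["SULPHUR", "SULFUR", "SULPHURIC", "DISULPHUR"]

def matching(pattern: str) -> list:
    """Return the words matched by a pattern (? = one character, * = any number)."""
    return [word for word in WORDS if fnmatchcase(word, pattern)]

right = matching("SULPHU*")   # wildcard on the right
left = matching("*SULPHUR")   # wildcard on the left
middle = matching("SUL*UR")   # wildcard in the middle
single = matching("SUL?UR")   # exactly one wild character

# Made-up bibliographic records with field tags as keys.
RECORDS = [
    {"AU": "Blogs BS", "PY": 2008, "TI": "Reaction of penicillin"},
    {"AU": "Blogs BS", "PY": 2007, "TI": "Cephalosporin synthesis"},
    {"AU": "Smith J", "PY": 2008, "TI": "Penicillin reaction kinetics"},
]

# Equivalent in spirit to the fielded query AU=Blogs BS AND PY=2008.
hits = [r for r in RECORDS if r["AU"] == "Blogs BS" and r["PY"] == 2008]

print(right, left, middle, single)
print([r["TI"] for r in hits])
```

A fielded search restricts each term to one piece of metadata, so the author's name is not matched against words in titles or abstracts.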

Lecture 3: Chemical Connectivity and Structure Searches (2D)

Objectives of this lecture: Searching for chemical structures, sub-structures and reactions using 2D molecule definitions, starting with text descriptors of molecular connectivity (SMILES strings) generated using ChemDraw, and moving to the use of proprietary programs for defining connectivity and searching for molecular properties and molecular reactions. Illustrated via the following databases:

Web-based Databases

- Searching the eMolecules database using a SMILES string generated from Chemdraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N) (7)
- Searching the PubChem database using a SMILES or InChI string generated from Chemdraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N or InChI=1/C8H12N2OS/c1-4-8(2,3)12-7-5(9)6(11)10(4)7/h4-5,7H,1-3H3/i1-12,2-12,3-12,4-12,5-12,6-12,7-12,8-12,9-14,10-14,11-16,12-32) for "80% similar" (146)
- Organic syntheses for specific molecule queries.
- The on-line Corina service to convert a (1D) SMILES string to 3D molecular coordinates is an example of an "added-value" service (in this case a 1D to 3D conversion!).
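Because a SMILES string is plain text, simple information can be pulled out of it without any chemistry software. The sketch below counts the non-hydrogen atoms in the SMILES string used above. It is a deliberately minimal tokenizer (the function name heavy_atom_counts is hypothetical) and not a full SMILES parser; it ignores bonds, ring closures and stereochemistry, and real work should use a cheminformatics toolkit.

```python
import re
from collections import Counter

# Bracket atoms such as [C@H], two-letter halogens, then the one-letter
# atoms of the SMILES "organic subset" (lower case marks aromatic atoms).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")

def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:
                continue  # unrecognised bracket content is skipped
            element = match.group(1)
        else:
            element = token
        if element == "H":
            continue  # explicit hydrogens are not heavy atoms
        counts[element.capitalize()] += 1
    return counts

counts = heavy_atom_counts("O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N")
print(dict(counts))
```

For this string the count is eight carbons, two nitrogens, one oxygen and one sulfur, matching the C8...N2OS part of the InChI formula quoted above.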

Application based Databases

- Beilstein Crossfire molecule sub-structure, and reaction searches and application of AUTONOM to naming drawn structures. Stereochemistry.
- SciFinder sub-structure search and 3D coordinate display/export.

Lecture 4: Chemical Structure, Property and Shape Based Searches (3D)

3D Database searches

Sub-structure searching of the Cambridge Structural Database (the Cambridge crystal database of organic and organometallic molecules) for specific molecules and intermolecular interactions (e.g. unusual π-H-O hydrogen bonds).
- Name based search: Helicene (107)
- 2D structure based search (hexahelicene, 61)
- 3D structure based search (CIYSIM)
A search of the Spectral Database for Organic Compounds (SDBS), matching observed spectral peaks against the database.
A search of the NIST Chemistry WebBook for thermodynamic and spectral data, including export of spectral data and searches by compound substructure.
Use of Jmol to display complex Protein Structures (also demo page). Brief overview of bio-informatics, Protein Databank (Keywords penicillin and tetrahedral) and Protein Explorer (direct entry).
Use of "added-value" sites such as ChemCalc for property calculations.
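As an example of the kind of property calculation such "added-value" sites perform, the sketch below computes a molar mass from a molecular formula. It is not how ChemCalc itself is implemented; molar_mass is a hypothetical helper, only five elements are tabulated, and brackets or charges in the formula are not handled.

```python
import re

# Standard atomic weights (rounded) for the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}

def molar_mass(formula: str) -> float:
    """Sum atomic weights for a simple formula such as C29H28N2O6S1."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count (e.g. the O in H2O) means one atom.
        total += ATOMIC_WEIGHT[element] * int(count or 1)
    return total

# The molecular formula used in the Beilstein field search of Lecture 2.
mass = molar_mass("C29H28N2O6S1")
print(round(mass, 2))
```

With these atomic weights the formula comes to about 532.6 g/mol.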

Wikis

The traditional stand-alone (=printable) document is being replaced by equivalent formats designed for an on-line existence. Here you will be introduced to the Wiki, which some lab courses are adopting.

[1] For an example of one bird's-eye view of chemistry, see A. H. Lipkus, Q. Yuan, K. A. Lucas, S. A. Funk, W. F. Bartelt, R. J. Schenck, and A. J. Trippe, J. Org. Chem., 2008, 73, 4443–4451. DOI:10.1021/jo8001276

[2] Search engines are increasingly using the Digital Object Identifier to link directly from the search to the original article reporting the results; N. Paskin, Digital Object Identifiers for scientific data, Data Science Journal, 2005, 4, 12–20. DOI:10.2481/dsj.4.12
