Lecture 1: What you need to know about the Scientific Literature, Computers, and Data
Objectives of these lectures:
To define Chemo-informatics as the collection, representation and organisation of chemical data to create chemical information, to which theories and models can be applied to create chemical knowledge[1]
To introduce the background to the course and the skills to be acquired in the course laboratories: using computers, their software and the available network information resources; prioritising and organising the information obtained with these tools; and citing the chemical literature in your laboratory reports and essays.
To introduce the chemistry computer laboratory sessions and what you are expected to do during these sessions.
This course does not deal with data logging, analysis or mining (often called Chemometrics), e.g. using Excel spreadsheets, Mathematica or MATLAB.
Online (and offline) Journals, books and other information sources
The world's scientific and chemical data, information and knowledge resides in the following resources:
Peer-reviewed primary scientific journals, as articles with identified authors
Peer-reviewed secondary scientific journals, as review articles/books with identified authors
Tertiary sources such as abstracts gleaned from the above two sources, collaboratively authored (peer review by a different name) Wikipedia-like entries, or curated/edited database collections
Web pages, lecture notes and other un-reviewed materials
The first two categories are identified using Journal/Book citations or the electronic equivalent, the DOI[2]. Web sources use the URL (or TinyURL).
Golden rule: Always cite your sources, and if possible the primary ones.
[1] For an example of one bird's-eye view of chemistry, see A. H. Lipkus, Q. Yuan, K. A. Lucas, S. A. Funk, W. F. Bartelt, R. J. Schenck, and A. J. Trippe, J. Org. Chem., 2008, 73, 4443–4451. DOI:10.1021/jo8001276
[2] Search engines are increasingly using the Digital Object Identifier to link directly from the search to the original article reporting the results; N. Paskin, Digital Object Identifiers for scientific data, Data Science Journal, 2005, 4, 12–20. DOI:10.2481/dsj.4.12
Traditionally, information derived from learned institutions was freely and openly available to all, but over the last 60 years a publishing model has developed in which publishers acquire the copyright and sell the information back to anyone who will pay for it (including the learned institutions). Much chemical information is now of this closed variety, but a movement to re-open it is increasingly under way. Examples of both types will be shown.
How computers help manage data, information and knowledge acquired from the scientific literature
Organisation of files
This course is all about managing data/information/knowledge with the help of computers.
A (Whirlwind) introduction/Reminder of how computers handle data
They do so with the help of an:
Operating System (OS), examples of which include:
Windows XP
Mac OS X
Redhat Linux
mobile devices such as Phones, managed by e.g. Symbian, Windows Mobile, Android, OS X, etc.
Access to the OS is controlled by authentication against user names/passwords (web pages use the same authentication), which also serves to identify the author/curator of the data and information so created.
Organisation: historically based on a metaphor of Files or Documents located in hierarchical Folders (Directories). Directories referred to as Home or My Documents have special status for each authenticated user.
Files: names can be up to about 255 characters long on most current file systems, but with some caveats:
Do not use characters such as space, $, /, : or ?.
If you are tempted to use a space, use the underscore _ instead!
On Linux (only), file names are case-sensitive, which is often the cause of much confusion!
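The naming caveats above can be sketched in a few lines of Python. This is only an illustration: the function name `safe_filename` is hypothetical, and the set of replaced characters is simply the one listed in these notes, not an exhaustive rule for any particular operating system.

```python
import re

# Characters these notes advise against in file names: space, $, /, :, ?
UNSAFE = re.compile(r"[ $/:?]")

def safe_filename(name: str) -> str:
    """Replace each discouraged character with an underscore."""
    return UNSAFE.sub("_", name)

def case_clashes(names):
    """Return names that differ only by case (distinct files on Linux)."""
    seen = {}
    for name in names:
        seen.setdefault(name.lower(), []).append(name)
    return [group for group in seen.values() if len(group) > 1]

print(safe_filename("lab report: week 1?.docx"))
print(case_clashes(["Report.docx", "report.docx", "notes.txt"]))
```

The second helper shows why case sensitivity causes confusion: two names that look identical to a casual reader can refer to different files.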
File Content/Data type: is normally (approximately) indicated by adding a 2-4 character extension after a period (.docx) to the name.
This extension may or may not be visible.
Special types of file, used by the operating system, may be invisible by virtue of their name starting with a period.
The (free text) content of a file may have been indexed and hence may become searchable by the utilities provided by the operating system.
File Metadata (Properties): Creation/Modification dates, sizes, access permissions, "ownership", content, etc. are also organised by the OS.
File Location: each file sits at a position in the folder hierarchy, and can be found by searches that use file metadata as criteria.
File Size: measured in "bytes" (approximately 1 character = 1 byte, sometimes 2 bytes). 10^6 bytes = ~1 Mbyte, 10^9 bytes = ~1 Gbyte, 10^12 bytes = ~1 Tbyte. The maximum size of any file is normally 2 Gbyte (Windows) or very much larger (Linux, Mac OS X).
Archives: A collection of Folders and Files which preserves the hierarchy and file metadata (.zip, .sit, .tar).
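The ideas of file metadata and archives described above can be tried out with Python's standard library. The following is a minimal sketch under stated assumptions: the folder and file names (`coursework`, `spectra`, `nmr_1.txt`) and the helper names `describe` and `archive_folder` are invented for illustration, and everything happens in a temporary directory that is deleted afterwards.

```python
import tempfile
import zipfile
from datetime import datetime
from pathlib import Path

def describe(path: Path) -> dict:
    """Return a few of the metadata fields the OS keeps for a file."""
    info = path.stat()
    return {
        "name": path.name,
        "extension": path.suffix,
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime),
    }

def archive_folder(folder: Path, archive: Path) -> list:
    """Zip a folder, storing relative paths so the hierarchy is preserved."""
    with zipfile.ZipFile(archive, "w") as zf:
        for file in sorted(folder.rglob("*")):
            if file.is_file():
                zf.write(file, file.relative_to(folder).as_posix())
        return zf.namelist()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "coursework"
    (root / "spectra").mkdir(parents=True)
    (root / "spectra" / "nmr_1.txt").write_text("0123456789")
    (root / "report.txt").write_text("draft")

    meta = describe(root / "spectra" / "nmr_1.txt")
    names = archive_folder(root, Path(tmp) / "coursework.zip")

print(meta["extension"], meta["size_bytes"], "bytes")
print(names)
```

The archive stores `spectra/nmr_1.txt` with its folder prefix, which is what allows the hierarchy to be rebuilt when the archive is unpacked.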
Storage in an enterprise environment
Permanent Data Storage, as files on:
Local hard drives (capacity 40 Gbyte to 1 Tbyte)
Drive R: (Where files from departmental NMR Spectrometers are placed)
Removable media (memory sticks, iPods, CD-RW/DVD, capacity 1-120 Gbyte)
Temporary Data Storage, as
"clipboard" in "System Memory" (capacity not known by user, but probably < 10 Mbyte)
cache or temporary files, not normally seen by the user but can wreak havoc if corrupt!
File Usage: Data Files are created and exchanged using:
Combinations of programs, typically a word processor (Word), a chemical drawing program (ChemDraw) and a bibliographic database (EndNote).
Data exchange between these programs using copy/paste via clipboards or via files (drag-n-drop, save/open or sync).
File Data Structures: Internal structure of files can be hidden or exposed.
Hidden (binary or clipboard) formats are normally understood only by specific programs and are not meant for humans. Examples include .docx (Office), .GIF, .PNG, .JPEG (Graphics), .MPEG (audio, video), .PDF (Acrobat).
Exposed structures include HTML (structured Hypertext markup language), SVG (Scalable Vector graphics), TXT (un or semi-structured text)
Specific Chemical types include:
Molecule specifications, with atom connection+co-ordinate types such as PDB, Molfile
Data: Semantics (meaning) can be added to data structures to make them re-usable in different contexts: XML (eXtensible Markup Language) is the best-known way of doing this.
MetaData: Data should have descriptions to add context. HTML can have exposed metadata (e.g. this document). Acrobat has a structure for metadata (XMP), but this is rarely used! RDF is used in the Semantic Web.
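To make the point about semantics concrete, the sketch below stores a melting point as XML and reads it back with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`compound`, `property`, `type`, `units`, `min`, `max`) are hypothetical, chosen only for illustration; they do not follow any published chemical markup vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the tags and attributes say what each number means.
record = """
<compound name="example">
  <property type="melting_point" units="degC" min="155" max="156"/>
</compound>
""".strip()

root = ET.fromstring(record)

# Because the value is labelled, a program can find it without guessing.
prop = root.find("property[@type='melting_point']")
low, high = float(prop.get("min")), float(prop.get("max"))
units = prop.get("units")

print(f"Melting point: {low}-{high} {units}")
```

In an unstructured text file the same numbers would appear as "155-156" with nothing to tell a program that they are a temperature range.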
Data Security
An important aspect of data is knowing when and how it was generated, and by whom. One way of doing this is using digital signatures and where necessary encryption, using a technology known as digital certificates. You will see these in action during the coursework.
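A full digital signature needs a key pair and a certificate, which are beyond a short sketch. The fragment below illustrates only one ingredient that signatures build on: a cryptographic digest, which changes if the data are altered. The function name `fingerprint` and the sample records are assumptions made for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical lab records: the second differs by a single digit.
original = b"mp 155-156 C, recorded by student A"
tampered = b"mp 165-166 C, recorded by student A"

print(fingerprint(original))
print(fingerprint(original) == fingerprint(original))  # same data, same digest
print(fingerprint(original) == fingerprint(tampered))  # altered data, different digest
```

A digest on its own says nothing about who produced the data; in a real signature the digest is additionally encrypted with the author's private key, and the certificate ties that key to an identity.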
Introduction to the lab sessions, using a combination of programs, illustrating many of the concepts noted above.
Lecture 2: Keyword-based General Bibliographic Searches (1D)
Eugene Garfield, Konrad Beilstein
Objectives of this lecture: This part is centred on a search for the conversion of penicillin to cephalosporin, and on how to fine-tune that search. The EndNote bibliographic software will be introduced, showing how it operates with Word, along with the use of bibliographic and library indices through Web-browser interfaces. The following concepts will be introduced:
Boolean logical operators: AND (and the slightly more specific SAME), OR, NOT, XOR.
Wildcard (Stemming) characters: ? vs * vs $, *SULPHUR vs SULPHU* vs SUL*UR
Grouping: A AND (B OR C) vs (A AND B) OR C
Metadata-driven searches (fielded searches): author, year-of-publication with the syntax author:Blogs or au=Blogs
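The Boolean operators and the effect of grouping listed above map directly onto set operations. The sketch below uses a toy index in which each keyword points to the set of record numbers that contain it; the keywords and record numbers are invented for illustration and do not come from any real database.

```python
# Toy index: keyword -> set of (hypothetical) record numbers containing it.
index = {
    "penicillin": {1, 2, 3},
    "cephalosporin": {2, 3, 4},
    "synthesis": {3, 5},
}

A = index["penicillin"]
B = index["cephalosporin"]
C = index["synthesis"]

a_and_b = A & B   # AND: records containing both terms
a_or_b = A | B    # OR: records containing either term
a_not_b = A - B   # NOT: records with the first term but not the second
a_xor_b = A ^ B   # XOR: records with exactly one of the two terms

# Grouping changes the answer:
grouped_1 = A & (B | C)   # A AND (B OR C)
grouped_2 = (A & B) | C   # (A AND B) OR C

print(sorted(a_and_b), sorted(a_or_b), sorted(a_not_b), sorted(a_xor_b))
print(sorted(grouped_1), sorted(grouped_2))
```

With this toy index the two groupings return different records, which is why brackets matter when a query mixes AND and OR.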
Specific Search engines used to illustrate these concepts
AND, OR, NOT, PROXIMITY, NEAR, NEXT (first term always before second term).
Wildcards: ? = 1 character, ?? = 2 characters, * = any number. Wildcards can be placed at the left, middle or right of a term (unlike WOS).
Try "penicillin and cephalosporin" as a text search, then
"MP 155-156 AND MF = C29H28N2O6S1 AND ORP 190-200" as a field search.
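The wildcard rules given above (? for exactly one character, * for any number of characters) can be imitated by translating a search term into a regular expression. This is a minimal sketch and not how any particular search engine is implemented; the helper names and the word list are assumptions for illustration.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ? (one character) and * (any number) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern: str, word: str) -> bool:
    """True if the whole word fits the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

# Hypothetical index terms to search.
words = ["SULPHUR", "SULFUR", "SULPHURIC", "DISULPHUR", "SULPHATE"]

for pattern in ["*SULPHUR", "SULPHU*", "SUL*UR", "SUL??UR"]:
    print(pattern, [w for w in words if matches(pattern, w)])
```

Note how SUL*UR finds both spellings (SULPHUR and SULFUR), whereas SUL??UR finds only the spelling with exactly two characters in the gap.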
Lecture 3. Chemical Connectivity and Structure Searches (2D)
Objectives of these lectures: Searching for chemical structures, sub-structures and reactions using 2D molecule definitions, starting with text descriptors of molecular connectivity (SMILES strings) generated using ChemDraw, and moving to the use of proprietary programs for defining connectivity and searching for molecular properties and molecular reactions. These searches are illustrated via the following databases:
Web-based Databases
Searching the eMolecules database using a SMILES string generated from ChemDraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N) (7)
Searching the PubChem database using a SMILES or InChI string generated from ChemDraw (O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N or InChI=1/C8H12N2OS/c1-4-8(2,3)12-7-5(9)6(11)10(4)7/h4-5,7H,1-3H3/i1-12,2-12,3-12,4-12,5-12,6-12,7-12,8-12,9-14,10-14,11-16,12-32) for "80% similar" (146)
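Because SMILES and InChI are plain text, simple consistency checks need no chemistry software. The sketch below counts the non-hydrogen atoms in the SMILES string above and compares them with the formula layer of the InChI string. It is a rough illustration only: the regular expression covers just the common organic-subset atoms and bracket atoms, the function names are hypothetical, and a real toolkit should be used for anything beyond a quick check.

```python
import re
from collections import Counter

# Bracket atoms such as [C@H], two-letter halogens, then one-letter atoms.
SMILES_ATOM = re.compile(r"\[([^\]]+)\]|Cl|Br|[BCNOPSFI]|[bcnops]")

def heavy_atoms_from_smiles(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a simple SMILES string."""
    counts = Counter()
    for m in SMILES_ATOM.finditer(smiles):
        if m.group(1) is not None:
            # Inside brackets: skip any isotope digits, take the element symbol.
            symbol = re.match(r"\d*([A-Za-z][a-z]?)", m.group(1)).group(1)
        else:
            symbol = m.group(0)
        symbol = symbol.capitalize()  # aromatic 'c' counts as carbon
        if symbol != "H":
            counts[symbol] += 1
    return counts

def heavy_atoms_from_inchi(inchi: str) -> Counter:
    """Count non-hydrogen atoms from the formula layer of an InChI string."""
    formula = inchi.split("/")[1]
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol != "H":
            counts[symbol] += int(number) if number else 1
    return counts

smiles = "O=C(N2C1SC(C)(C)[C@H]2C)[C@H]1N"
inchi = "InChI=1/C8H12N2OS/c1-4-8(2,3)12-7-5(9)6(11)10(4)7/h4-5,7H,1-3H3"

print(dict(heavy_atoms_from_smiles(smiles)))
print(heavy_atoms_from_smiles(smiles) == heavy_atoms_from_inchi(inchi))
```

The InChI here is truncated after the hydrogen layer, since only the formula layer is needed for the comparison.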
Lecture 4: Chemical Structure, Property and Shape Based Searches (3D)
3D Database searches
Sub-structure searching of the Cambridge Structural Database of organic and organometallic crystal structures for specific molecules and intermolecular interactions (e.g. unusual π-H-O hydrogen bonds).
Use of "added-value" sites such as ChemCalc for property calculations.
Wikis
The traditional stand-alone (=printable) document is being replaced by equivalent formats designed for an on-line existence. Here you will be introduced to the Wiki, which some lab courses are adopting.