
It:lectures

From ChemWiki

Chemical Information Technology 2010-11


IT Relevant to Lecture Courses, Tutorials and Set Projects

Objectives of these lectures:

  1. To define Chemo-informatics as the collection, representation and organisation of chemical data to create chemical information, to which theories and models can be applied to create chemical knowledge[1]
  2. To introduce the background to the course and the skills to be acquired during the course laboratories: using computers, their software and the available network information resources; prioritising and organising the information obtained with these tools; and citing the chemical literature in your laboratory reports and essays.
  3. To introduce the chemistry computer laboratory sessions and what you are expected to achieve during these sessions.
  4. This course does not deal with any aspects of data logging, analysis and mining (often called Chemometrics) e.g. Excel spreadsheets, Mathematica, MatLab etc.

Lecture 1: Managing your Computer Desktop

Organisation of files
This course is all about managing data/information/knowledge with the help of computers.

Computers do so with the help of an:

  1. Operating System (OS), examples of which include:
    1. Microsoft Windows: Windows 7.
    2. Unix: Mac OS X, Redhat Linux
    3. Mobile devices/SmartPhones: Symbian, Windows Mobile, Android, iOS (+ iTunes)
  2. Authentication: access to the OS is controlled by User names/passwords (and, via Web-pages, by the same authentication), which also serves to identify the author/curator of the data and information so created.
  3. Organisation: historically based on a metaphor of Files or Documents located in hierarchical Folders (Directories). Directories referred to as Home or My Documents have special status for each authenticated user.
    • Files: names can use up to 256 characters, but with some caveats:
    1. Do not use characters such as space, $, /, :, or ?.
    2. If you are tempted to use a space, use the underscore _ instead!
    3. On Linux (only), Filenames are case sensitive. Often the cause of much confusion!
    • File Content/Data type: is normally (approximately) indicated by adding a 2-4 character extension after a period (.docx) to the name.
    1. This extension may or may not be visible. Chemical files reserve ~8 different extensions, so you may end up with up to 8 files with apparently the same name!
    2. Special types of file, used by the operating system, may be invisible by virtue of their name starting with a period.
    3. The (free text) content of a file may have been indexed and hence may become searchable by the utilities provided by the operating system.
    • File Metadata (Properties): creation/modification dates, sizes, access permissions, "ownership", content, etc. are also organised by the OS.
    • File Location: each file sits in a hierarchy and can be located by searches using file metadata as criteria.
  4. File Size: In "bytes" (approximately, 1 character = 1 byte, sometimes 2 bytes). 10^6 bytes = ~1 Mbyte, 10^9 bytes = ~1 Gbyte, 10^12 bytes = ~1 Tbyte. Maximum size for any file normally 2 Gbyte (Windows) or very much larger (Linux, Mac OS X).
  5. File Archives: A collection of Folders and Files which preserves the hierarchy and file metadata (.zip, .tar). A .docx file is in fact a (zip) archive
  6. File Storage (enterprise environment)
    • Permanent Data Storage, as files on:
      • Network Drives:
        1. Drive H: Home directory (Desktop icon Home, capacity 1 Gbyte per user)
        2. Drive L: "Home" on Linux systems
        3. Drive M: Storage area for bibliographic libraries such as Mendeley.
        4. Drive R: Where files from departmental NMR Spectrometers are placed (read-only)
        5. Drive Z: Chemistry data-silo for temporary storage.
      • Local Hard Drive C: Not to be used except for temporary storage.
      • Removable media: memory sticks, iPods, CD-RW/DVD (capacity 1-120 Gbyte)
    • Temporary Data Storage, as
      • "clipboard" in "System Memory" (capacity not known by user, but probably < 10 Mbyte)
      • cache or temporary files, not normally seen by the user but can wreak havoc if corrupt!
  7. File Usage: Data Files are created and exchanged using:
    • Combinations of programs, typically a Word processor (Word), a chemical drawing program (Chemdraw) and Bibliographic database (EndNote/Mendeley).
    • Data exchange between these programs using copy/paste via clipboards or via files (drag-n-drop, save/open or sync).
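
The file-naming caveats, extensions and size units listed above can be tried out in a few lines of Python. This is only a minimal sketch: the function names (safe_name, extension, human_size) and the example file name are invented for illustration, and the set of "unsafe" characters is just the one quoted in the notes, not a complete list for any particular operating system.

```python
import re
from pathlib import PurePath

# Characters the notes advise against in file names: space, $, /, :, ?
UNSAFE = re.compile(r"[ $/:?]")


def safe_name(name: str) -> str:
    """Replace each discouraged character with an underscore."""
    return UNSAFE.sub("_", name)


def extension(name: str) -> str:
    """Return the extension after the last period, e.g. '.docx'.

    A name that merely starts with a period (an "invisible" file such
    as '.bashrc') has no extension in this sense.
    """
    return PurePath(name).suffix


def human_size(n_bytes: int) -> str:
    """Express a byte count in the decimal units used in the notes."""
    for unit, factor in (("Tbyte", 10**12), ("Gbyte", 10**9), ("Mbyte", 10**6)):
        if n_bytes >= factor:
            return f"{n_bytes / factor:.1f} {unit}"
    return f"{n_bytes} bytes"


print(safe_name("lab report: week 1?.docx"))  # lab_report__week_1_.docx
print(extension("lab_report.docx"))           # .docx
print(human_size(2 * 10**9))                  # 2.0 Gbyte
```

Note that every discouraged character becomes its own underscore, so "report: " turns into "report__"; collapsing repeated underscores would be an easy refinement.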

Managing your Location

All the above applies when you are connecting to your resources on campus (which by definition also includes South Kensington Halls of residence). If you are outside this catchment area, some IT services will not work unless you enter the campus virtually by switching on something called a VPN. These services include most Scientific journals and the important databases.

Accessing Lecture Notes and Scientific Journals

The world's scientific and chemical data, information and knowledge resides in the following types of reliable resources:
  1. Course Notes, written by experts in their field, which will often themselves cite:
  2. Primary Peer-reviewed scientific journals (1665 onwards), as articles with identified authors and hence provenance;
  3. Secondary Peer-reviewed scientific journals, as review articles/books with identified authors;
  4. Tertiary sources such as abstracts gleaned from the above two sources, collaboratively authored (peer review by a different name) Wikipedia-like entries, some blogs and curated/edited database collections.

Golden rule: Always cite your sources, and if possible the primary ones.

Course Notes and Other Parochial Materials

These can be found in two principal locations: the College Blackboard virtual learning environment and the chemistry department Wiki. They are mostly available in Acrobat (PDF) format, and rather less commonly as Powerpoint slide shows. You would normally download these to your computer and store them in a library (EndNote or Mendeley), view them on the computer, or print them off. The notes are periodically placed into Blackboard by the lecturers, although they may not be regularly updated. Some lecturers also use blogs and podcasts (but not yet Twitter!).

Scientific Journals (Primary and secondary sources)

Journals and Books are identified using a formal citation, embedded in course notes or in other journals. In chemistry, this traditionally takes the form of a numeric superscript [2]. As recently as ~2005, you would have had to visit a real library, track down the journal to a specific shelf and then read the printed pages. Since then, it has become almost universal to add to the citation something called a DOI[3], which allows you to visit the journal electronically. Thus we have instead[2] with the DOI appended. Clicking on the link takes you directly to the journal page, where you will be presented with an abstract. These links can be embedded in HTML pages (such as the one we are looking at now) or in PDF files. If you are given a DOI in unlinked form, you can resolve it by typing http://dx.doi.org/the-DOI-itself into a browser (or you can track it down here if you know the authors and title). You can then view the article itself in either HTML or PDF form.
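
Resolving an unlinked DOI is just a matter of joining it onto the resolver address quoted above. The sketch below shows one way this might be done; the function name doi_to_url and the very light sanity check are assumptions made for illustration, not part of any official DOI tooling.

```python
from urllib.parse import quote

# Resolver address as quoted in these notes.
RESOLVER = "http://dx.doi.org/"


def doi_to_url(doi: str, resolver: str = RESOLVER) -> str:
    """Turn a DOI such as 'DOI:10.1021/ol0611346' into a resolvable URL."""
    doi = doi.strip()
    # Citations in these notes write the DOI as 'DOI:10.xxxx/...'.
    if doi.upper().startswith("DOI:"):
        doi = doi[4:].strip()
    # Very light sanity check: DOIs start with '10.' and contain a '/'.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    # Percent-encode anything unusual, but keep the '/' separator.
    return resolver + quote(doi, safe="/")


print(doi_to_url("DOI:10.1021/ol0611346"))
# http://dx.doi.org/10.1021/ol0611346
```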

  1. HTML (Hypertext markup language) vs PDF (Portable document format)
    • PDF is the format preferred for producing printed copies, and is just starting to be deployed in new Bibliographic database systems such as Mendeley and in 3D forms[4].
    • HTML is nowadays viewed by an increasing number of publishers as the medium best suited for enhancing the journal article beyond the printable form. Many articles nowadays include rotatable molecules, and other interactive media.
  2. Journals themselves divide into those published by learned societies and by purely commercial organisations. Between them, the four below should cover perhaps 90% of the journals that you will need to access.
  3. The above represent the primary literature, and the articles there are designed primarily for researchers. An excellent journal which addresses the more pedagogic aspects of chemistry is the Journal of Chemical Education (abbreviated to J. Chem. Educ.), which not only covers aspects of lectures, but also describes new and interesting laboratory experiments (some of which materialise in our own labs!).
  4. The central library has a chemistry librarian (Katharine Thompson), many chemistry collections and a complete alphabetic list of Journals, together with an Inter-library loan (ILL) system for requesting reprints of journal articles and books not held on campus. A fully digital version of the ILL has recently been introduced, although (unlike most digital music) this has DRM (digital rights management).

When writing a laboratory report (and in later years literature reports, essays and perhaps even your own published article), you will be expected to cite your sources, in the manner shown below.


  1. For an example of one bird's-eye view of chemistry, see A. H. Lipkus, Q. Yuan, K. A. Lucas, S. A. Funk, W. F. Bartelt, R. J. Schenck, and A. J. Trippe, J. Org. Chem., 2008, 73, 4443–4451. DOI:10.1021/jo8001276
  2. S. D. Rychnovsky, Org. Lett., 2006, 8, 2895-2898. DOI:10.1021/ol0611346
  3. N. Paskin, Digital Object Identifiers for scientific data, Data Science Journal, 2005, 4, 12-20. DOI:10.2481/dsj.4.12
  4. P. Kumar, A. Ziegler, J. Ziegler, B. Uchanska-Ziegler and A. Ziegler, Trend. Biochem. Sci., 2008, 33, 408-412. DOI:10.1016/j.tibs.2008.06.004
  5. C. S. Wannere, H. S. Rzepa, B. C. Rinderspacher, A. Paul, H. F. Schaefer III, P. v. R. Schleyer and C. S. M. Allan, J. Phys. Chem., 2009, DOI:10.1021/jp902176a

Tertiary Sources (Wikipedia)

Wikipedia and scientific blogs can be used as a source of information. They are normally pretty good for chemistry, but do not assume they are always correct!


Lecture 2: Bibliographic Searches using Scientific Databases

[Images: Eugene Garfield, Konrad Beilstein, George Boole]

This part of the course is centred on how to search for information using search strings. To illustrate it, we will define the following search:

The conversion of penicillin to cephalosporin.

The following concepts will be introduced:

  1. Boolean logical operators: AND (and the slightly more specific SAME), OR, NOT, XOR.
  2. Wildcard (Stemming) characters: ? vs * vs $, *SULPHUR vs SULPHU* vs SUL*UR
  3. Grouping: A AND (B OR C) vs (A AND B) OR C
  4. Metadata-driven searches (fielded searches): author, year-of-publication with the syntax author:Blogs or au=Blogs
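
The effect of the Boolean operators, and of grouping in particular, can be illustrated with Python sets. The sketch below uses a tiny made-up collection of article titles (entirely hypothetical, invented for this example) and treats each search term as the set of records whose title contains that word.

```python
# Hypothetical mini-collection: record number -> article title.
titles = {
    1: "reaction of penicillin with acid",
    2: "cephalosporin biosynthesis",
    3: "reaction kinetics of cephalosporin esters",
    4: "reaction of benzene",
}


def hits(term: str) -> set:
    """Return the record numbers whose title contains the word `term`."""
    return {rid for rid, title in titles.items()
            if term.lower() in title.lower().split()}


A = hits("reaction")
B = hits("penicillin")
C = hits("cephalosporin")

narrow = A & (B | C)   # A AND (B OR C)
broad = (A & B) | C    # (A AND B) OR C

print(sorted(narrow))  # [1, 3]
print(sorted(broad))   # [1, 2, 3]
print(sorted(A - B))   # A NOT B -> [3, 4]
print(sorted(B ^ C))   # B XOR C -> [1, 2, 3]
```

The two grouped queries use the same three terms but return different hit lists, which is why the placement of the brackets matters in a real search.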

A summary of these features for the four main search engines can be found here

WOS

  1. WOS (Web-of-Science) uses:
    • field tags (such as title, author, publication name or organization)
    • Booleans: AND, OR, NOT, and SAME (a proximity operator).
    • Wildcards: ? = exactly one character; * = zero or more characters, as in SUL*UR and BIOLOG* (i.e. middle and right truncation, but not left truncation such as *NATAL).
    • (...) for grouped expressions, e.g. A NOT (B OR C). Examples:
      • au=Welton t* and og=imperial and py=2001-2010 and SO=(CHEMICAL COMMUNICATIONS)
      • TI=Reaction AND (TI=penicillin OR TI=cephalosporin) (141)
      • (TI=Reaction AND TI=Penicillin) OR Ti=cephalosporin (2683)
      • TI=Carbapenem AND TI=Penicillin and ti=synthesis (8)
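
To get a feel for what the wildcard characters match, the pattern can be translated into a regular expression. This is only a sketch of the idea, assuming ? stands for exactly one character and * for zero or more characters; the helper names are invented, and a real search engine may apply further rules (minimum stem lengths, for example) that are not modelled here.

```python
import re


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a search pattern with ? and * into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")       # exactly one character
        elif ch == "*":
            parts.append(".*")      # zero or more characters (assumed)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


print(matches("SUL*UR", "sulphur"))      # True
print(matches("SUL*UR", "sulfur"))       # True
print(matches("BIOLOG*", "biological"))  # True
print(matches("wom?n", "women"))         # True
print(matches("wom?n", "womn"))          # False
```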

Other Search engines

  1. Robot based Internet Indices:

Using Microsoft Office with EndNote: Bibliographic citation software

Many sources of bibliographic information allow the export of the hit list to citation management software. Here the use of just one combination, WOS and Word+EndNote, will be demonstrated, and you will have a chance to try it for yourselves in the lab sessions.

Using Mendeley as an organiser

Mendeley is a document organiser and knowledge mining system. The inputs to the program are citation lists obtained from bibliographic searches, and the associated Acrobat files for the documents themselves. Mendeley will index these, and allow you to search a collection of documents in much the same way as iTunes lets you search music tracks. It also has a feature similar in concept to iTunes Genius, whereby articles in your collection can be compared with related articles found by others. For example, you could add a reprint associated with a lab course, and find similar articles which may provide you with additional information.

Introduction to Lab courses

A quick overview of the lab, and what will be done in the first session.

IT Relevant to Laboratories and Reports

Objectives of these lectures: To demonstrate how to search for information relevant to laboratory courses, and lab. write-ups. This will include how to search for properties of chemicals (physical, spectroscopic), safety sheets, and 3D coordinates.

Lecture 3. MSDS Safety Sheets

  • The Aldrich catalogues can be searched for compounds and their MSDS safety sheets, which is useful for completing COSHH forms. You can also search by an Aldrich catalogue number (e.g. 254738) to acquire an MSDS data sheet, and insert this into your Mendeley library for future access (using e.g. a mobile device).

Property searches and Errors

  1. [Figure: Molecules 1-4]
    Reaxys (formerly Beilstein):
    • "penicillin and cephalosporin" as a "Text, Authors and more" search. Available Booleans: AND, NOT, OR, PROXIMITY, NEAR and NEXT with * as a wildcard anywhere in the query (unlike WOS). Grouping not supported.
    • MP.MP=155-156 and IDE.MF=C29H28N2O6S1 and ORP.ORP=190-200 as a field search using Properties (Advanced) from the Substances and Properties option and illustrating property ranges (which implies you have to be aware of the typical errors in many of the experimental measurements made on chemical instrumentation, such as melting points, optical rotations or as below NMR chemical shifts).
  2. A search of the Spectral Database for Organic Compounds SDBS for matching observed spectral peaks (with estimated errors) with the database.
    • 13C peaks: 163, 141, 133, 130, 129, 128, 98 (how big is the error?)
    • 1H peaks: 8.1,7.5,5.1,4.7 (how big is the error?)
    • IR peak: 1733 (how big is the error?)
  3. A Search of the NIST Chemistry WebBook for thermodynamic and spectral properties
  4. Use of "added-value" services such as ChemCalc for molecular mass calculations: either input a formula such as C47H51NO14 (Taxol) to predict how the mass spectrum (MS) may look, or input an (MS-derived) accurate mass (say 148.052±0.0005), which is converted to the most likely formula.
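
The arithmetic behind an accurate-mass calculation is simple enough to sketch by hand. The code below is not how ChemCalc itself works; it is a minimal illustration that sums monoisotopic atomic masses for a formula containing only C, H, N and O, ignores the mass of the electron lost or gained on ionisation, and then checks whether a candidate formula falls inside an error window. The candidate formulae C9H8O2 and C8H8N2O are chosen here purely for illustration.

```python
import re

# Monoisotopic masses (most abundant isotope) for a few elements.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}


def parse_formula(formula: str) -> dict:
    """Split a formula such as 'C47H51NO14' into element counts."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum the monoisotopic masses of all atoms in a neutral formula."""
    return sum(MONOISOTOPIC[s] * n for s, n in parse_formula(formula).items())


def within_window(formula: str, observed: float, tolerance: float) -> bool:
    """True if the calculated mass lies within observed ± tolerance."""
    return abs(monoisotopic_mass(formula) - observed) <= tolerance


print(f"{monoisotopic_mass('C47H51NO14'):.3f}")    # Taxol: 853.331
print(within_window("C9H8O2", 148.052, 0.0005))    # True
print(within_window("C8H8N2O", 148.052, 0.0005))   # False
```

The second candidate differs from the first by only about 0.011 mass units, which shows why the size of the experimental error decides how many formulae survive the search.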

Using Chemdraw and Structure based searches (2D)

    • Searching the ChemSpider database using a SMILES string generated from Chemdraw O=C1C(N)C2N1C(C(O)=O)C(C)(C)S2
    • Searching the PubChem database using a SMILES or InChI string generated from Chemdraw O=C1C(N)C2N1C(C(O)=O)C(C)(C)S2 or InChI=1S/C8H12N2O3S/c1-8(2)4(7(12)13)10-5(11)3(9)6(10)14-8/h3-4,6H,9H2,1-2H3,(H,12,13) for "95% similar" (49 hits)
    • ChemNetBase has compilations of Drugs, inorganic and organometallic and natural products which might prove useful to you for laboratories.
    • Organic syntheses for specific molecule queries.
    • Application of Reaxys for specific molecule queries: search for the melting point of aspirin.
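
A quick consistency check between the SMILES and InChI strings quoted above is to compare their heavy-atom counts. The sketch below is deliberately naive: it reads the molecular formula from the first layer of the InChI, and counts atoms in the SMILES with a regular expression that only understands bracket-free SMILES built from the common organic elements. It is not a SMILES parser, and the function names are invented for illustration; a cheminformatics toolkit would be the proper tool for real work.

```python
import re
from collections import Counter

SMILES = "O=C1C(N)C2N1C(C(O)=O)C(C)(C)S2"
INCHI = ("InChI=1S/C8H12N2O3S/c1-8(2)4(7(12)13)10-5(11)3(9)6(10)14-8"
         "/h3-4,6H,9H2,1-2H3,(H,12,13)")


def inchi_formula(inchi: str) -> str:
    """Return the formula layer, the first field after 'InChI=1S/'."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    return parts[1]


def formula_counts(formula: str) -> Counter:
    """Count the atoms of each element in a formula like 'C8H12N2O3S'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def smiles_heavy_atoms(smiles: str) -> Counter:
    """Naively count non-hydrogen atoms in a bracket-free SMILES."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled by this sketch")
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[bcnops]", smiles)
    return Counter(token.capitalize() for token in tokens)


from_inchi = formula_counts(inchi_formula(INCHI))
del from_inchi["H"]  # hydrogens are implicit in the SMILES

print(inchi_formula(INCHI))                      # C8H12N2O3S
print(smiles_heavy_atoms(SMILES) == from_inchi)  # True
```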

Lecture 4: Chemical 3D Structure and Shape Based Searches and Molecular Biology

  1. The on-line Corina service to convert a (1D) SMILES string to 3D molecular coordinates is an example of an "added-value" service (in this case a 1D to 3D conversion!).
  2. Sub-structure searching of the Cambridge crystal database (183/5E20C9) of organic and organometallic molecules for specific molecules, and intermolecular interactions (e.g. unusual π-H-O hydrogen bonds).
    • Name based search: penicillin, 34.
    • 2D structure based search: penicillin, 54 (SMILES string is NOT accepted by this program)
    • 2D structure based search for one of the four molecules shown above
    • 3D structure based search for hydrogen bonds is shown in the lab course pages.
  3. Use of Jmol to display complex Protein Structures (also demo page and Nanotech model). Brief overview of the Protein Databank (DOI:10.1107/S0108767307035623) (Keywords penicillin and tetrahedral should reveal any enzyme inhibited with an analogue of a transition state and relating to penicillin) and Protein Explorer (direct entry; try entering 1blh). Alternate search of biomolecules, including DNA.
    • The guidelines for demarcation between small molecule and bio- or macromolecule databases are found here.

Wikis

[Figure: Pentahelicene]

The traditional stand-alone (=printable) document is being replaced by equivalent formats designed for an on-line existence. You will here be introduced to the Wiki, a presentation system some lecture and lab courses have adopted.

