Where to find public information about GSK's proprietary structures?

2012-07-25T12:15:18Z

Jg1208: /* 1st Dataset: 135 SYK inhibitors */

===Background===

The Community Structure-Activity Resource (CSAR) is a cross-industry project aiming to provide large experimental datasets of crystal structures and binding affinities for protein-ligand complexes.

GSK is committed to supplying CSAR with sets of SAR data. Before these are handed over, it must be ensured that the data are not already publicly available. A growing number of free online databases can be used for this check, and the present project aims to compare the usefulness of a selection of them.

===Remit===

* To evaluate the usefulness of various chemical databases by conducting comparative searches for GSK compounds.

* Databases used: PubChem, Molport, ChEMBL, ZINC, ChemSpider, Aureus

===Introduction===

The last five years have seen the emergence of a number of popular databases allowing researchers to search for various molecules (fig. 1). This is of particular importance to medicinal chemists working in drug discovery, who often require bioactivity information in order to properly assess a compound’s suitability for a given target.

A decade ago, the ACS Chemical Abstracts Service held a monopoly on such registries, and chemists always had to pay for information obtained using its SciFinder service. Today the problem is the opposite: far from a lack of data sources, there is now a surplus of online registries, each purporting to be the definitive provider of free information to researchers, many of whom may be unclear as to the pros and cons of each one. This is a problem in itself, and as a result the field of online chemistry data has become a hot topic in recent months<sup>1,3,4</sup> as scientists try to make sense of this saturated market. These investigations will hopefully lead to greater productivity and consistency across the field as groups begin to base their work on more coherent data.

The present project builds upon the work of Millard Lambert, who quantitatively compared some of these databases by feeding in a set of GSK compounds known to be in the public domain and recording which databases contained which compounds. This project repeats the procedure for other databases, while focusing on how well each one can be searched for batches of chemicals simultaneously.

The structure of this report is as follows: a brief background and guide to use for each database is given, followed by some notes on their usability and examples of the exportable information available from them. Finally, the results of inputting compounds from three distinct datasets are displayed, along with a brief discussion of their implications.

===Guide to each database===

====1. PubChem====

Website: http://pubchem.ncbi.nlm.nih.gov/

=====Background=====

* A US government initiative started in 2004; now one of the two largest online chemistry databases, along with ChemSpider
* Provides information on the biological activities of small molecules
* Around 33 million chemical entries
* Detailed bioassay data

=====Instructions for use=====

1. Use Helium to get a list of SMILES in Excel, then convert this to a text document with each SMILES occupying one line and save it in a convenient location. This will be the structure file which serves as the input.

2. Navigate to http://pubchem.ncbi.nlm.nih.gov/ then click on ‘Chemical structure search’ underneath the search bar. Then click on the ‘Identity/Similarity’ tab next to ‘Search by’, then click on ‘Structure File’. Upload your structure file. Make sure that the Options tab below is set to ‘Identical Structures’ rather than e.g. ‘Similar Compounds 95%’. Click ‘Search’.

3. After a few seconds a list of matching results will appear. To download an SDF file containing information on all of the compounds found in the database, navigate to the right hand side of the page and click on ‘Structure Download’. Choose the SDF format, and compression type ‘none’. Then click ‘Download’.

4. Once the job is finished, a link to the downloadable file should appear on the screen. Right click and save as a *.sdf file to a convenient location.

5. Open Excel and click on the JChem button. Then click ‘Import from File’ and select your new sdf file. After a few seconds the structure of each recognised compound should appear, along with other information as detailed below.
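Step 1 can also be done with a short script instead of by hand. The following is a minimal sketch, assuming the SMILES strings have already been exported from Helium into a Python list; the function name and file name are hypothetical and not part of any tool used in this project.

```python
from pathlib import Path


def write_structure_file(smiles_list, path):
    """Write one SMILES per line, skipping blank entries and exact duplicates.

    Returns the cleaned list in its original order.
    """
    seen = set()
    cleaned = []
    for raw in smiles_list:
        smiles = raw.strip()
        if smiles and smiles not in seen:
            seen.add(smiles)
            cleaned.append(smiles)
    # One SMILES per line, as required for the structure file upload.
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="ascii")
    return cleaned
```

For example, `write_structure_file(my_smiles, "structures.txt")` would produce a file that can be uploaded under ‘Structure File’. Duplicates are only detected when the strings are identical; two different SMILES for the same molecule are both kept.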

=====Notes=====

* Easy to use
* No registration required
* Occasionally the ‘Structure File’ tab does not work. In this case the PubChem management suggest using a script or an application like Pipeline Pilot (info: Thiessen@ncbi.nlm.nih.gov) to batch query many compounds.
* The SDF file contains key information, e.g. structure, molecular weight, number of H-bonding groups and MMFF94 energy, but publication data and reference links cannot be obtained from it.
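As an illustration of the scripted route mentioned in the notes, the sketch below builds one lookup URL per SMILES for PubChem's PUG REST service. This is not the method used in this project. The endpoint form (`compound/smiles/cids/TXT` with the SMILES passed as a URL argument) is an assumption based on the public PUG REST documentation and should be checked there before use; the function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the PUG REST service (see the PubChem documentation).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(smiles):
    """Build a URL asking PubChem for the CIDs matching one SMILES.

    The SMILES is percent-encoded because characters such as '#', '/'
    and '=' have their own meaning inside a URL.
    """
    return f"{PUG_REST}/compound/smiles/cids/TXT?smiles={quote(smiles, safe='')}"


def lookup_cids(smiles, timeout=30):
    """Fetch the CIDs for one SMILES (network call, no retry or throttling)."""
    with urlopen(cid_lookup_url(smiles), timeout=timeout) as response:
        text = response.read().decode("ascii")
    # A CID of 0 is treated here as 'no match'; this is an assumption.
    return [int(token) for token in text.split() if token.isdigit() and token != "0"]
```

Looping `lookup_cids` over the lines of the structure file would give a per-compound hit list. Error handling and request rate limits are deliberately left out of this sketch.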

=====Data Available from SDF File=====

[[Image:Pubchem sdf output.jpg|centre|700px|thumb|Output SDF file from PubChem]]

====2. ChEMBL====

Website: https://www.ebi.ac.uk/chembl/

=====Background=====

* Approx. 7 million measurements for 1.1 million compounds and 8,900 protein targets (as of April 2012)
* Formed by the European Bioinformatics Institute
* Broad range of medicinal chemistry data, e.g. biological activities, cell-based assay data, protein-ligand affinities
* 40% of data comes from PubChem
* Compounds also linked to ChemSpider, DrugBank, PDBe, Wikipedia and ChEBI
* DrugEBIlity service uses structural data to determine whether a given protein can be targeted with small molecules

=====Instructions For Use=====

1. Use Helium to get a list of SMILES in Excel, then convert this to a text document with each SMILES occupying one line.

2. Navigate to https://www.ebi.ac.uk/chembl/ and click on ‘Compound Search’.

3. On the right hand side under ‘List search’, select the ‘SMILES search’ option and copy and paste your list of SMILES from earlier. Click ‘Fetch Compounds’.

4. After a few seconds the compound search results should come back, showing structures and other information, e.g. parent molecular weight and HBA/HBD. To download an SDF file, navigate to the ‘Please select...’ tab on the top right hand side and click on ‘Download (SDF)’.

5. Note that this SDF file will contain only the structure and ChEMBL ID. For more information, such as molecular weight and hydrogen bond accepting/donating ability, click on ‘Download (Tab-delimited)’.

For information specifically on the compounds’ bioactivities (e.g. assay description and references to the literature they came from), go back to the ‘Please select...’ tab and click on ‘Display Bioactivities’, then ‘Download all data (XLS)’. Open this file in Excel to get the data shown in the example below.

=====Notes=====

* Does not provide experimental conditions
* No login required
* Works in any browser
* For single-compound searches, offers a variety of compound sketchers (JME, Marvin, JDraw)
* SDF file contains little information: only the structure and ChEMBL ID
* Other information can be downloaded as an Excel spreadsheet rather than an SDF file, so it is not a problem if Helium and JChem aren’t available

=====Data Available from SDF File=====

i) By clicking on Download (Tab-delimited)

[[Image:Chembl sdf output1.jpg|centre|700px|thumb|Tab-delimited output file from ChEMBL]]

ii) By clicking on Display bioactivities and then Download all data (XLS)

[[Image:Chembl sdf output2.jpg|centre|900px|thumb|Bioactivity output file from ChEMBL (XLS)]]

====3. Aureus====

Website: http://aureus.gsk.com/ (requires registration by contacting Aureus at support@aureus-sciences.com and providing email address, first and last name, and preferred username & password.)

=====Instructions For Use=====

NB: this only works in Internet Explorer (IE).

1. Use Helium to get a list of SMILES in Excel, and convert these to structures using JChem by clicking on ‘From SMILES’ (note: Aureus cannot process a list of more than 100 at a time). Then go to JChem and click on ‘Export to file’ on the left hand side of the menu bar. Save it as a *.sdf file somewhere convenient.

2. Navigate to http://aureus.gsk.com/ and log in (note: GSK credentials cannot be used; an account must be set up with Aureus).

3. After logging in, you will be faced with a screen showing preferences and support & contacts. Ignore these and click on ‘GPS’ (short for Global Pharmacology Space) in the black bar at the top.

4. Then click on ‘Advanced’ in the top left hand corner, then tick ‘Molecule’, then select molecules by ‘Structure’, then click on ‘Multi-Structures’.

5. An applet will appear providing the opportunity to draw a molecule. Instead, navigate to the brown toolbar at the bottom of the screen that says ‘Upload SDF file’ and click ‘Browse’. Select the SDF file you saved earlier, then click ‘Load molecules’ twice, followed by ‘Get molecules & activities’.

6. Any matches found will be displayed. To save information (such as molecule name, InChI code, SMILES, publication link etc.) for these molecules, tick the relevant molecules and protocols on the top and left hand edges of the screen, then click ‘Export’ at the bottom of the screen and choose to export as a text, SDF, RDF or Excel file.
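Because Aureus accepts at most 100 molecules per upload, a larger SDF file has to be split before step 5. The following is a minimal sketch of one way to do this, assuming a standard SDF layout in which every record ends with a line containing only `$$$$`; the function names are hypothetical and this was not part of the original workflow.

```python
def split_sdf_records(sdf_text):
    """Split SDF text into records, each ending with its '$$$$' line.

    Trailing text that is not terminated by '$$$$' is ignored.
    """
    records = []
    current = []
    for line in sdf_text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def batch_records(records, batch_size=100):
    """Group records into SDF strings of at most batch_size molecules each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        "".join(records[start:start + batch_size])
        for start in range(0, len(records), batch_size)
    ]
```

Each string returned by `batch_records` can be written to its own *.sdf file and uploaded separately. Given that the interface sometimes rejects queries of slightly fewer than 100 compounds, a smaller `batch_size` may be a safer choice.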

=====Notes=====

* Only works in Internet Explorer
* Login details are needed in order to access www.aureus.gsk.com. It took two emails and a few days for Aureus to provide these.
* Lots of information can be obtained in the SDF/Excel file, e.g. polarisability, publication data, pKa, identifiers and SMILES (however, of the 96 parameter columns, on average only 45% contain data)
* Cannot search for more than 100 compounds at a time
* Probably the least comfortable interface to use: e.g. upload of molecules often requires more than two clicks; sometimes even a search for only 95 compounds is treated as exceeding 100 and the query is not permitted; sometimes the next step in the search process isn’t obvious

[[Image:Aureus sdf output.jpg|centre|900px|thumb|Output SDF file from Aureus]]
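The proportion of export columns that actually contain data can be measured from the text export rather than estimated by eye. The sketch below assumes the export is tab-delimited with a header row, which may not match the real Aureus layout; the function names are hypothetical.

```python
import csv
import io


def column_fill_rates(tsv_text):
    """Return {column name: fraction of rows with a non-empty value}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    rates = {}
    for name in reader.fieldnames or []:
        filled = sum(1 for row in rows if (row.get(name) or "").strip())
        rates[name] = filled / len(rows) if rows else 0.0
    return rates


def mean_fill_rate(tsv_text):
    """Average the per-column fill rates into one overall figure."""
    rates = column_fill_rates(tsv_text)
    return sum(rates.values()) / len(rates) if rates else 0.0
```

Calling `mean_fill_rate` on the contents of an exported file gives a single figure comparable to the 45% quoted in the notes, while `column_fill_rates` shows which individual parameters are most often empty.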

====4. Molport====

=====Background=====

* Not a research tool, rather a global trading platform for many suppliers
* 8 million chemicals available for purchase
* Integrated with ZINC, PubChem, ChemSpider <ref> http://www.molport.com/buy-chemicals/news </ref>

=====Notes=====

* Not easy to query for more than 1 compound at a time – need to request login details to use an internal sourcing wizard that Molport use. This wizard is still under development and only has access to the catalogues of 10 suppliers.

* However, even this wizard does not allow a simple database search. You have to give further information such as billing address and required quantities for purchase before you can search for multiple compounds.

* These drawbacks became known during the initial trial of 15 compounds (see below), and Molport was therefore disregarded as a database for later searches.

====5. ZINC====

====6. ChemSpider====

===Results & Discussion===

====Trial Dataset====

Initially four databases (ChEMBL, Molport, ZINC, PubChem) were tested for 15 compounds, with the following results:

[[Image:Trial_venn.jpg |centre|400px|thumb|Results from trial dataset]]

* Results suggest that Molport is in fact not yet fully integrated with PubChem, but is with ZINC.
* Molport and ZINC were discarded for the reasons discussed above.

====1st Dataset: 135 SYK inhibitors====

* ChEMBL, PubChem and Aureus were then tested for 135 compounds selected as potential leads for the target spleen tyrosine kinase (SYK), with the following results:

[[Image:syk.png |centre|300px|thumb|Venn diagram to show hits for SYK inhibitor compounds from ChEMBL, PubChem and Aureus.]]

====2nd Dataset: 327 ASP2 inhibitors====

* 327 compounds tested against the target ASP2 were then searched for in the same three databases, with the following results:

[[Image:asp2.png |centre|300px|thumb|Venn diagram to show hits for ASP2 inhibitor compounds from ChEMBL, PubChem and Aureus.]]

====4th Dataset: 343 PDE4 inhibitors====

[[Image:pde4.png |centre|300px|thumb|Venn diagram to show hits for PDE4 inhibitor compounds from ChEMBL, PubChem and Aureus.]]

The results clearly show that PubChem has the largest database and is the best source if one simply wants to find out whether a given molecule is in the public domain. However, if one wants more detailed information on the medicinal properties of a certain compound then databases like ChEMBL and Aureus offer more data which can be exported into Excel. Hence it can be said that the effectiveness of a database for CSAR research is dependent on the aim of the project being undertaken, which might be expected given the somewhat disorganized nature of the nascent world of online chemical data.

Looking ahead, it might be expected that in the near future one of the databases discussed in this report will benefit from an improvement in the efficiency of database integration, and provide end users with a combination of a broad set of searchable chemicals in addition to a deep well of knowledge pertaining to each one. This would empower chemists greatly, and would likely improve the quality of research for the benefit of the chemical community, and hence the scientific one as a whole.
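The region counts shown in the Venn diagrams of this section can be derived from per-database hit lists with plain set operations. The following is a minimal sketch; the function name is hypothetical and the compound identifiers in the usage note are placeholders, not results from this project.

```python
def venn_regions(hits):
    """Assign each compound to the exact combination of databases containing it.

    hits maps a database name to the set of compound identifiers found there.
    The result maps a frozenset of database names to the set of compounds
    found in exactly those databases and no others.
    """
    regions = {}
    all_ids = set().union(*hits.values()) if hits else set()
    for compound_id in all_ids:
        key = frozenset(name for name, found in hits.items() if compound_id in found)
        regions.setdefault(key, set()).add(compound_id)
    return regions
```

For instance, with the placeholder input `{"PubChem": {"a", "b", "c"}, "ChEMBL": {"b", "c"}, "Aureus": {"c", "d"}}`, compound `c` falls into the region shared by all three databases and `d` into the Aureus-only region. Compounds found in no database do not appear in the result; their number is the dataset size minus the total across all regions.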

===References===

<references />


Where to find public information about GSK's proprietary structures?

2012-07-25T12:14:53Z

Jg1208: /* 2nd Dataset: 327 ASP2 inhibitors */

===Background===

The Community Structure-Activity Resource (CSAR) is a cross-industry project aiming to provide large experimental datasets of crystal structures and binding affinities for protein-ligand complexes.

GSK is committed to supply CSAR with sets of SAR data. Before these are handed over, it must be ensured that the data is not already publically available. There are a growing number of free online databases where this can be done, and the present project aims to compare the usefulness of a selection of them.

===Remit===

* To evaluate the usefulness of various chemical databases by conducting comparative searches for GSK compounds.

* Databases used: PubChem, Molport, ChEMBL, ZINC, ChemSpider, Aureus

===Introduction===

The last 5 years has seen the emergence of a number of popular databases allowing researchers to search for various molecules. (fig.1). This is of particular importance to medicinal chemists working in drug discovery who often require bioactivity information in order to properly assess a compound’s suitability for a given target.

A decade ago, the ACS Chemical Abstracts Service held a monopoly on such registries, and chemists would always have to pay for information obtained using its SciFinder service. Today the problem is of an opposite nature – far from a lack of data sources, there is now a surplus of online registries purporting to be definitive provider of free information to researchers, many of whom may be unclear as to the pros and cons of each one. This is a problem in itself, and as a result the field of online chemistry data has become a hot topic in recent months1,3,4 as scientists try to get to the bottom of this saturated market. These investigations will hopefully lead to greater productivity and consistency across the field of research as groups begin to use more coherent data as the basis for their work.

The present project builds upon the work of Millard Lambert who quantitatively compared some of these databases by feeding in a set of GSK compounds known to be in the public domain, and seeing which databases contained which GSK compounds. This project repeats the procedure for other databases, while focusing on the ability of each one to be searched for batches of chemicals simultaneously.

The structure of this report is as follows: a brief background and guide to use for each database is given, followed by some notes on their usability and examples of the exportable information available from them. Finally, the results of inputting compounds from three distinct datasets are displayed, along with a brief discussion of their implications.

===Guide to each database===

====1. PubChem====

Website: http://pubchem.ncbi.nlm.nih.gov/

=====Background=====

• A US government initiative started in 2004; now one of the two largest online chemistry databases along with ChemSpider.

• Provides information on the biological activities of small molecules

• Around 33 million chemical entries

• Detailed bioassay data

=====Instructions for use=====

1. Use Helium to get a list of SMILES in excel – convert this to a text doc with each SMILES occupying 1 line and save it in a convenient location. This will be the structure file which serves as the input.

2. Navigate to http://pubchem.ncbi.nlm.nih.gov/ then click on ‘Chemical structure search’ underneath the search bar. Then click on the ‘Identity/Similarity’ tab next to ‘Search by’, then click on ‘Structure File’. Upload your structure file. Make sure that the Options tab below is set to ‘Identical Structures’ rather than e.g. ‘Similar Compounds 95%’. Click Search

3. After a few seconds a list of matching results will appear. To download an SDF file containing information on all of the compounds found in the database, navigate to the right hand side of the page and click on ‘Structure Download’. Choose the SDF format, and compression type ‘none’. Then click ‘Download’.

4. Once the job is finished, a link to the downloadable file should appear on the screen. Right click and save as a *.sdf file to a convenient location.

5. Open Excel and click on the JChem button. Then click ‘Import from File’ and select your new sdf file. After a few seconds the structure of each recognised compound should appear, along with other information as detailed below.

=====Notes=====

• Easy to use

• No registration required

• Occasionally the ‘structure file’ tab does not work. In this case the Pubchem management suggest using a script or application like Pipeline Pilot (info: Thiessen@ncbi.nlm.nih.gov) to batch query many compounds.

• SDF file contains key information e.g. structure, molecular weight, number of H-bonding groups and MMFF94 energy, but cannot obtain information such as publication
data/reference links from the SDF file.

=====Data Available from SDF File=====

[[Image:Pubchem sdf output.jpg|centre|700px|thumb|Output SDF file from PubChem]]

====2. ChEMBL====

Website: https://www.ebi.ac.uk/chembl/

=====Background=====

• Approx 7 million measurements for 1.1 million compounds and 8,900 protein targets (as of April 2012)

• Formed by the European Bioinformatics Institute

• Broad range of medicinal chemistry data e.g. biological activities, cell-based assay data, protein-ligand affinities

• 40% of data comes from PubChem

• Compounds also linked to Chemspider, Drugbank, PDBe, Wikipedia and ChEBI.

• DrugEBility service uses structural data to determine whether a given protein can be targeted with small molecules

=====Instructions For Use=====

1. Use Helium to get a list of SMILES in Excel, then convert this to a text document with one SMILES per line.

2. Navigate to https://www.ebi.ac.uk/chembl/ and click on ‘Compound Search’.

3. On the right-hand side under ‘List search’, select the ‘SMILES search’ option and paste in your list of SMILES from earlier. Click ‘Fetch Compounds’.

4. After a few seconds the compound search results should come back, showing structures and other information, e.g. parent molecular weight and HBA/HBD. To download an SDF file, go to the ‘Please select...’ tab at the top right-hand side and click on ‘Download (SDF)’.

5. Note that this SDF file will contain only the structure and ChEMBL ID. For more information, such as molecular weight and hydrogen bond accepting/donating ability, click on ‘Download (Tab-delimited)’.

For information specifically on the compounds’ bioactivities (e.g. assay descriptions and references to the literature they came from), go back to the ‘Please select...’ tab and click on ‘Display Bioactivities’, then ‘Download all data (XLS)’. Open this file in Excel to get the data shown in the example below.
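Step 1 of the instructions above (turning a spreadsheet column of SMILES into one SMILES per line) can also be scripted. The sketch below assumes the Helium output has been saved from Excel as CSV with a header row; the column name `SMILES` and the function name are assumptions for illustration.

```python
import csv
import io


def smiles_lines(csv_text: str, column: str = "SMILES") -> list[str]:
    """Return the non-empty, de-duplicated SMILES from one CSV column, in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    lines = []
    for row in reader:
        smiles = (row.get(column) or "").strip()
        if smiles and smiles not in seen:
            seen.add(smiles)
            lines.append(smiles)
    return lines
```

Joining the result with `"\n".join(...)` gives text that can be pasted straight into the ‘SMILES search’ box.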

=====Notes=====

• Does not provide experimental conditions

• No login information required

• Works in any browser

• For single-compound searches, a variety of compound sketchers is offered (JME, Marvin, Jdraw)

• SDF file does not have much information – only Structure and ChEMBL ID

• Other information can be downloaded as an Excel spreadsheet rather than an SDF file, so it is not a problem if Helium and JChem aren’t available.

=====Data Available from SDF File=====

i) By clicking on Download (Tab-delimited)

[[Image:Chembl sdf output1.jpg|centre|700px|thumb|Output file from ChEMBL (tab-delimited)]]

ii) By clicking on Display bioactivities and then Download all data (XLS)

[[Image:Chembl sdf output2.jpg|centre|900px|thumb|Output file from ChEMBL (XLS)]]

====3. Aureus====

Website: http://aureus.gsk.com/ (requires registration by contacting Aureus at support@aureus-sciences.com and providing email address, first and last name, and preferred username & password.)

=====Instructions For Use=====

NB. This only works in IE.

1. Use Helium to get a list of SMILES in Excel, and convert these to structures using JChem by clicking on ‘From SMILES’ (note: Aureus cannot process a list of more than 100 compounds at a time). Then, in JChem, click on ‘Export to file’ on the left-hand side of the menu bar and save the result as a *.sdf file somewhere convenient.

2. Navigate to http://aureus.gsk.com/ and log in (note: GSK credentials cannot be used; an account must be set up with Aureus).

3. After logging in, you will be faced with a screen showing preferences and support & contacts. Ignore these and click on ‘GPS’ (short for Global Pharmacology Space) in the black bar at the top.

4. Then click on ‘Advanced’ in the top left hand corner, then tick ‘Molecule’, then select molecules by ‘Structure’, then click on ‘Multi-Structures’.

5. An applet will appear providing the opportunity to draw a molecule. Instead, navigate to the brown toolbar at the bottom of the screen that says ‘Upload SDF file’ and click ‘Browse’. Select the SDF file you saved earlier, then click ‘Load molecules’ twice, followed by ‘Get molecules & activities’.

6. Any matches found will be displayed. To save information (such as molecule name, InChI code, SMILES, publication link etc.) for these molecules, tick the relevant molecules and protocols on the top and left-hand edges of the screen, then click ‘Export’ at the bottom of the screen and choose to export as a text, SDF, RDF or Excel file.

=====Notes=====

• Only works in Internet Explorer

• Login details are needed in order to access www.aureus.gsk.com. It took two emails and a few days for Aureus to provide these.

• Lots of information can be obtained in the SDF/Excel file, e.g. polarisability, publication data, pKa, identifiers, SMILES (however, of the 96 parameter columns, on average only 45% contain data)

• Cannot search for more than 100 compounds at a time.

• Probably the least comfortable interface to use: uploading molecules often requires more than 2 clicks; sometimes a search for only 95 compounds is rejected as containing more than 100; and sometimes the next step in the search process isn’t obvious.

[[Image:Aureus sdf output.jpg|centre|900px|thumb|Output SDF file from Aureus]]
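Because Aureus cannot search more than 100 compounds at a time (and, per the notes above, sometimes rejects lists slightly below that), a larger compound list has to be split before export. A minimal sketch follows; the function name is hypothetical and the default batch size of 90 is an arbitrary safety margin, not an Aureus setting.

```python
def batches(items: list, size: int = 90) -> list[list]:
    """Split a list into consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[start:start + size] for start in range(0, len(items), size)]
```

Each chunk can then be exported to its own *.sdf file and uploaded separately.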

====4. Molport====

=====Background=====

*Not a research tool, rather a global trading platform for many suppliers
*8 million chemicals available for purchase
*Integrated with ZINC, Pubchem, ChemSpider <ref> http://www.molport.com/buy-chemicals/news </ref>

=====Notes=====

* Not easy to query for more than 1 compound at a time – need to request login details to use an internal sourcing wizard that Molport use. This wizard is still under development and only has access to the catalogues of 10 suppliers.

* However, even this wizard does not allow a simple database search. You have to give further information such as billing address and required quantities for purchase before you can search for multiple compounds.

* These drawbacks became known during the initial trial of 15 compounds (see below), and Molport was therefore disregarded as a database for later searches.

====5. ZINC====

====6. ChemSpider====

===Results & Discussion===

====Trial Dataset====

Initially four databases (ChEMBL, Molport, ZINC, PubChem) were tested for 15 compounds, with the following results:

[[Image:Trial_venn.jpg |centre|400px|thumb|Results from trial dataset]]

* The results suggest that Molport is not yet fully integrated with PubChem, but is with ZINC.
* Molport and ZINC were discarded for the reasons discussed above.

====1st Dataset: 135 SYK inhibitors====

* ChEMBL, PubChem and Aureus were then tested for 135 compounds selected as potential leads for the target spleen tyrosine kinase (SYK), with the following results:

[[Image:syk.png |centre|300px|thumb|Venn diagram to show hits for SYK compounds from ChEMBL, PubChem and Aureus.]]
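The regions of a three-way Venn diagram like this one can be counted from the lists of hits returned by each database. The sketch below is an illustration only: the function name is hypothetical and any compound identifiers passed in are the user's own, not data from this study.

```python
from collections import Counter


def venn_counts(hits: dict[str, set[str]]) -> dict[frozenset, int]:
    """Count compounds per Venn region, keyed by the set of databases that hit."""
    counts: Counter = Counter()
    for compound in set().union(*hits.values()):
        # The region is defined by exactly which databases contain the compound.
        region = frozenset(name for name, found in hits.items() if compound in found)
        counts[region] += 1
    return dict(counts)
```

The key `frozenset({"ChEMBL", "PubChem"})`, for example, holds the number of compounds found in ChEMBL and PubChem but not in Aureus; compounds found nowhere are the dataset size minus the sum of all regions.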

====2nd Dataset: 327 ASP2 inhibitors====

* 327 compounds tested against the target ASP2 were then searched for in the same three databases, with the following results:

[[Image:asp2.png |centre|300px|thumb|Venn diagram to show hits for ASP2 inhibitor compounds from ChEMBL, PubChem and Aureus.]]

====3rd Dataset: 343 PDE4 inhibitors====

[[Image:pde4.png |centre|300px|thumb|Venn diagram to show hits for PDE4 compounds from ChEMBL, PubChem and Aureus.]]

The results clearly show that PubChem has the largest database and is the best source if one simply wants to find out whether a given molecule is in the public domain. However, if one wants more detailed information on the medicinal properties of a compound, then databases like ChEMBL and Aureus offer more data that can be exported into Excel. The effectiveness of a database for CSAR research therefore depends on the aim of the project being undertaken, which might be expected given the somewhat disorganized nature of the nascent world of online chemical data.

Looking ahead, one of the databases discussed in this report may soon benefit from more efficient database integration and offer end users both a broad set of searchable chemicals and a deep well of knowledge on each one. This would greatly empower chemists and would likely improve the quality of research, to the benefit of the chemical community and hence the scientific community as a whole.

===References===

<references />

2012-07-25T10:00:37Z

Jg1208:

File:Pde4.png

2012-07-25T10:00:36Z

Jg1208:

File:Syk.png

2012-07-25T10:00:36Z

Jg1208:

2012-07-24T15:01:19Z

Jg1208: /* Results & Discussion */

2012-07-20T09:57:19Z

Jg1208:

664787

2012-07-20T09:49:28Z

Jg1208:

664787

2012-07-20T09:42:32Z

Jg1208: /* =Notes */