
Primary Data for Chemistry


1 Primary Data for Chemistry
DataCite Summer Meeting 2010: Making Datasets Visible and Accessible, June 7/8, 2010
Primary Data for Chemistry
Susanne Haak (1), Guido F. Herrmann (1), Irina Sens (2), Jan Brase (2)
(1) Georg Thieme Verlag KG, Ruedigerstrasse 14, Stuttgart, Germany
(2) German National Library of Science and Technology (TIB), Welfengarten 1B, Hannover, Germany



4 PARTNERS
TIB
• TIB is the largest scientific library in the world for architecture, chemistry, computer science, mathematics, physics and engineering technology
• Financed by the Federal Government and all Federal States
• €8 million annual acquisition budget
• 18,500 journal subscriptions
• 7.0 million items
• Global supplier of scientific and technical information of all types: text, numeric data, audio, video, etc.
DataCite
• Global consortium carried by local institutions
• Focused on improving the scholarly infrastructure around datasets and other non-textual information
• Focused on working with data centers and organizations that hold data
• Provides standards, workflows and best practice
• Initially, but not exclusively, based on the DOI system
• Founded December 1st, 2009 in London

5 PARTNERS
Thieme Chemistry
• Part of the Thieme publishing group, based in Stuttgart (Germany)
• Has published highly evaluated information about synthetic and general chemistry for professional chemists and advanced students since 1909

6 PARTNERS This is one of the (far too few) close co-operations between libraries and publishers. Our journals have always been at the forefront of innovation, and we are proud that we can once again lead the way.


8 As a result, datasets are
BACKGROUND
Gap in the scientific record between published research and the underlying data:
• Published work is held by publishers and libraries
• Datasets are held by data centers
• No effective way to link between datasets and articles
• No widely used method to identify datasets
• No widely used method to cite datasets
As a result, datasets are:
• Difficult to discover
• Difficult to access

9 In chemistry research data are created:
BACKGROUND
In chemistry, research data are created:
• Using the vast array of chromatographic and other analytical methods (GC, HPLC, etc.)
• Employing spectroscopic methods (NMR, MS, UV/VIS, IR, X-ray, etc.)
• As a result of theoretical calculations (quantum mechanics, simulation of spectra, etc.)
• By using the various high-throughput technologies in medicinal chemistry
• As crystallographic data
Source:

10 Primary Data in Organic Chemistry
BACKGROUND
Estimation: Primary Data in Organic Chemistry to Datasets Per Year

Columbus, Ohio (September 8, 2009) - Chemical Abstracts Service (CAS), a division of the American Chemical Society, announced that on September 7 it recorded the 50 millionth substance in CAS REGISTRY(SM), the world's most comprehensive and high-quality compendium of publicly disclosed chemical information. The recently registered substance is a novel arylmethylidene heterocycle with analgesic properties. Reaching the 50 million mark so quickly is an indicator of the accelerating pace of scientific knowledge. CAS registered the 40 millionth substance just nine months ago; in contrast, it took 33 years for CAS to register the 10 millionth compound in 1990.

From: Neudert, Reinhard - Weinheim
Sent: Wednesday, 3 March [...]:31
To: Krimmer, Dr. Thomas
Subject: RE: Number of spectra in (organic) chemistry
Hello Mr. Krimmer, I have found the figures. They are based on four strong chemistry journals from Wiley-VCH. To cover all journals, one would have to extrapolate. Since Wiley-VCH is very strong in chemistry, the factor is probably around 3, which gives 15 million spectra in the last 25 years. Regards, Neudert

Primary data in chemistry
Chemical research generates large amounts of primary data every day, which in the academic setting ultimately lead to a scientific publication. In 2006, the largest chemistry journals of Wiley-VCH published the following numbers of articles:
• Angewandte Chemie: approx. [...] articles
• Chemistry - A European Journal: approx. 1000 articles
• EurJIC: approx. 610 articles
• EurJOC: approx. 650 articles
In these approx. [...] articles in total, about 40 spectra per article are described on average. These spectra, which are based on primary data, are mostly listed in heavily reduced form in the experimental sections, but are also often shown as figures in the supplementary material ("Supporting Information"). In both cases the data are not electronically searchable, nor can the original measurement data be accessed in any form. Based on the number of spectra given above, no fewer than 5 million spectra, that is, 5 million primary datasets, have been generated in connection with the publication process over the last 25 years.

Number of spectra in the last year, assuming 50 organic chemistry journals with 500 articles each and 80 spectra per article: 2,000,000. To put this number into perspective: MedLine currently contains some 40 million abstracts in total and adds fewer than 1 million a year; CAS currently contains some 42 million substances, again in total, adding roughly 1 million per year. The 2 million here refer to organic chemistry and one single year only!

Thus far, this vast amount of data lies scattered on the computers of the scientists who produced it. As no central repository exists, no accessible archival storage is possible at the moment. Because working up such data currently earns little credit, primary data are often poorly documented, difficult to access and not saved for the long term.

Researchers are retracting a highly cited 2004 Science paper describing a new way of adding sugars to proteins (a longstanding challenge in molecular biology), citing their inability to repeat the results and the absence of the original lab notebooks with the experiment details, they announced in Science last Thursday (November 26, 2009).
source:
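The two back-of-the-envelope figures in these notes (2,000,000 spectra per year, and the extrapolation by a factor of about 3 from 5 million to 15 million spectra over 25 years) can be reproduced in a few lines. This is only a sketch of the arithmetic; the function and variable names are illustrative and not part of the presentation.

```python
def spectra_per_year(journals: int, articles_per_journal: int, spectra_per_article: int) -> int:
    """Multiply out the slide's assumptions to get spectra published per year."""
    return journals * articles_per_journal * spectra_per_article


# Assumptions stated on the slide: 50 journals, 500 articles each, 80 spectra per article.
yearly_estimate = spectra_per_year(50, 500, 80)

# Assumptions stated in the quoted e-mail: 5 million spectra in 25 years for
# four Wiley-VCH journals, scaled by a factor of about 3 to cover all journals.
extrapolated_25_years = 3 * 5_000_000

print(f"{yearly_estimate:,} spectra per year")
print(f"{extrapolated_25_years:,} spectra in 25 years")
```

Changing any of the three inputs shows how sensitive the yearly figure is to the assumed number of spectra per article, which the notes put at 40 in one place and 80 in another.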


12 What is needed: Servers/Data Centers Metadata DOI
PROCESS
What is needed:
• Servers/Data Centers: creation of new and strengthening of existing data centers. They are responsible for quality assurance, storage of the content and accessibility, and creation of metadata.
• Metadata: global access to data sets and their metadata through existing catalogues. TIB stores the metadata and keeps it searchable.
• DOI: use of persistent identifiers, also for data (DOI = Digital Object Identifier). TIB registers research data worldwide from a scientific, technical or medical background.
Digital Object Identifier (DOI)
The Digital Object Identifier (DOI®) System is for identifying content objects in the digital environment. Information about a digital object may change over time, including where to find it, but its DOI name will not change. The DOI System provides a framework for persistent identification. The system is managed by the International DOI Foundation. Over 40 million DOI names have been assigned by DOI System Registration Agencies in the US, Australia, and Europe. You might have come across this when citing advance online articles.
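To illustrate how a DOI name stays fixed while the location behind it may change, the sketch below turns a DOI name into a resolver link. It assumes the public resolver at https://doi.org/ and uses a simplified syntax check (prefix starting with "10." followed by a slash and a suffix); the example DOI is a made-up placeholder, not one registered in this project.

```python
import re
from urllib.parse import quote

# Simplified shape of a DOI name: "10.<registrant>/<suffix>".
# This is an approximation for illustration, not the full DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str, resolver: str = "https://doi.org/") -> str:
    """Return the resolver URL for a DOI name, or raise ValueError if it looks malformed."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI name: {doi!r}")
    # Percent-encode unusual characters but keep the slash between prefix and suffix.
    return resolver + quote(doi, safe="/")


# Hypothetical DOI name, used only as an example.
print(doi_to_url("10.1234/example.dataset.1"))
```

The link is built purely from the identifier, so a citation that uses the DOI does not need to change if the data center later moves the files.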

13 Thieme hosts the research data in a data center (FIZ Karlsruhe).
PROCESS
• Together with the article, the author submits the research data to Thieme.
• Thieme hosts the research data in a data center (FIZ Karlsruhe).
• TIB assigns a DOI to the data.
• When the article is published, the primary data are published at the same time as an independent entity, but linked to the article.
• The article cites the research data as reference items with the assigned DOI.


15 RESULTS An abstract with primary data as supplementary information. Primary Data has its own DOI, different from the one of the paper – thus, PD can be cited independently. Clicking the link (or entering the DOI in a web browser) downloads a zip file.

16 RESULTS Primary data come neatly organized in a zip file. Numbering of the folders corresponds to numbering of the compounds in the corresponding article. The folder also contains a Read me file.
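A downloaded archive of this kind can be inspected programmatically. The sketch below groups the files in a zip archive by their top-level folder, which according to the slide corresponds to the compound number in the article. The folder and file names in the demo archive are invented for illustration; the actual layout of the Thieme archives is not specified in the transcript.

```python
import io
import zipfile
from collections import defaultdict


def files_by_compound(archive) -> dict[str, list[str]]:
    """Group archive members by top-level folder (assumed to be the compound number)."""
    groups: dict[str, list[str]] = defaultdict(list)
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            folder, sep, rest = name.partition("/")
            if sep:
                groups[folder].append(rest)
            else:
                groups["(top level)"].append(name)
    return dict(groups)


# Build a tiny in-memory archive with a hypothetical layout for demonstration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ReadMe.pdf", b"placeholder")
    zf.writestr("1/proton/fid", b"placeholder")
    zf.writestr("1/carbon/fid", b"placeholder")
    zf.writestr("2/proton/fid", b"placeholder")
buffer.seek(0)

contents = files_by_compound(buffer)
for folder, files in sorted(contents.items()):
    print(folder, files)
```

For a real download, the same function can be called with the path to the zip file instead of the in-memory buffer.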

17 RESULTS The Read Me PDF in the zip file describes the content and lists the programs that can be used to view it.

18 RESULTS From the Article: Carbon (top left) Proton (top right) COSY (bottom left) From Bruker: MA (bottom right) No PDFs or JPGs – these are actual raw & interactive data – which you can load into your system, zoom in, overlay with your own measurements, etc.


20 Benefits Citability of research data High visibility of the data
SUMMARY
Benefits
• Citability of research data
• High visibility of the data
• Easy re-use and verification of the data sets
• Avoiding duplications
• Motivation for new research

21 Benefits for authors More work Proof of quality
SUMMARY
Benefits for authors
• More work
• Proof of quality
• Documentation of validity
• More exposure for results
At first, it looks like just another burden. But the original data not only show how clean the products really were, they also add great value and trust to the methods used. In addition, the work gets a good deal more exposure, as it will show up not only when someone looks for an article, but also when someone looks for a substance, a method, a spectrum, etc. Imagine what that can mean for an author's rating, such as the h-factor (if the methodology is good, of course).

22 Benefits for users Quick evaluation of papers
SUMMARY
Benefits for users
• Quick evaluation of papers
• Find structures by spectra
• Find similar patterns
• Understand individual peaks
But fast forward a few years with me:
• Most articles come with primary data
• The data themselves are fully searchable
• Spectra are linked via InChIs to the structures
• Users will be able to search for patterns, or even single peaks!
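What a search for single peaks could look like is sketched below: each spectrum is stored with the InChI of its structure and a list of peak positions, and a query returns every spectrum with a peak within a tolerance of the requested chemical shift. The data model, the function name and the peak values are illustrative assumptions; the presentation describes this only as a future possibility and the shifts shown are placeholders, not measurements from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    inchi: str                     # structure identifier linking spectrum to compound
    peaks_ppm: tuple[float, ...]   # peak positions (chemical shifts in ppm)


def find_by_peak(spectra, shift_ppm: float, tolerance_ppm: float = 0.05):
    """Return all spectra that have at least one peak within the tolerance of shift_ppm."""
    return [
        s for s in spectra
        if any(abs(p - shift_ppm) <= tolerance_ppm for p in s.peaks_ppm)
    ]


# Placeholder library: real InChI strings, but only approximate, illustrative 1H shifts.
library = [
    Spectrum("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", (1.25, 3.72)),  # ethanol
    Spectrum("InChI=1S/C3H6O/c1-3(2)4/h1-2H3", (2.17,)),           # acetone
    Spectrum("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", (7.36,)),       # benzene
]

hits = find_by_peak(library, shift_ppm=2.15)
for hit in hits:
    print(hit.inchi)
```

Because every hit carries an InChI, a peak query leads directly back to the structure, which is the link between spectra and structures that the slide envisions.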

23 Open questions No specific regulations so far Copyright
SUMMARY
Open questions
• No specific regulations so far
• Copyright
• Centralized data hosting
• Data compatibility
To realize all this, there are some hurdles to clear.
Centralized data hosting: centralized hosting with clearly defined requirements is needed. See PANGAEA (Publishing Network for Geoscientific & Environmental Data), hosted by the Alfred Wegener Institute for Polar and Marine Research (Bremerhaven) and the Center for Marine Environmental Sciences (Bremen), and supported by the European Commission (Research), the Federal Ministry of Education and Research (BMBF), the Deutsche Forschungsgemeinschaft (DFG) and the International Ocean Drilling Program (IODP). The information system PANGAEA is operated as an Open Access library aimed at archiving, publishing and distributing georeferenced data from earth system research. The system guarantees long-term availability of its content through a commitment of the operating institutions.
Regulations: there are no regulations regarding format, copyright, use of data, definition of data as primary, etc. The project presented here is a start-up prototype and currently nothing more, which is also why the issues I just mentioned have not yet been fully addressed and solved.
Data quality: this does not mean the scientific quality of the data but their technical characteristics, such as compatibility with different "hardware" (e.g., there are currently two main suppliers of NMR spectrometers, and their data are cross-readable in one direction only).

24 SUMMARY So, the only thing I have left to say: go and share your primary chemical data with your fellow researchers, now. Details on how to do it can be found in our instructions for authors. If you also publish in journals other than SYNLETT and SYNTHESIS, please talk to your editor about primary data; they might already be working on it.

25 Thank You!
