Presentation on the topic "Primary Data for Chemistry". Presentation transcript:
1 Primary Data for Chemistry
DataCite Summer Meeting 2010: Making Datasets Visible and Accessible, June 7/8, 2010
Susanne Haak,1 Guido F. Herrmann,1 Irina Sens,2 Jan Brase2
1 Georg Thieme Verlag KG, Ruedigerstrasse 14, Stuttgart, Germany
2 German National Library of Science and Technology (TIB), Welfengarten 1B, Hannover, Germany
2 PARTNERS, BACKGROUND, PROCESS, RESULTS, SUMMARY
4 PARTNERS
TIB (German National Library of Science and Technology)
TIB is the largest scientific library in the world.
Subjects: architecture, chemistry, computer science, mathematics, physics, engineering technology.
Financed by the Federal Government and all Federal States.
€8 million annual acquisition budget.
18,500 journal subscriptions.
7.0 million items.
Global supplier of scientific and technical information of all types: text, numeric data, audio, video, etc.
DataCite
Global consortium carried by local institutions.
Focused on improving the scholarly infrastructure around datasets and other non-textual information.
Focused on working with data centers and organizations that hold data.
Provides standards, workflows and best practice.
Initially, but not exclusively, based on the DOI system.
Founded December 1st, 2009 in London.
5 PARTNERS
Thieme Chemistry
Part of the Thieme publishing group, based in Stuttgart (Germany).
Has published highly evaluated information about synthetic and general chemistry for professional chemists and advanced students since 1909.
6 PARTNERS
This is one of the far too few intensive co-operations between libraries and publishers.
Our journals have always been at the forefront of innovation, and we are proud that once again we can lead the way.
8 BACKGROUND
Gap in the scientific record between published research and the underlying data:
Published work is held by publishers and libraries.
Datasets are held by data centers.
No effective way to link between datasets and articles.
No widely used method to identify datasets.
No widely used method to cite datasets.
As a result, datasets are:
Difficult to discover.
Difficult to access.
9 BACKGROUND
In chemistry, research data are created:
Using the vast array of chromatographic and other analytical methods (GC, HPLC etc.).
Employing spectroscopic methods (NMR, MS, UV/VIS, IR, X-ray etc.).
As a result of theoretical calculations (quantum mechanics, simulation of spectra etc.).
Or by using the various high-throughput technologies in medicinal chemistry.
Crystallographic data.
10 BACKGROUND
Primary Data in Organic Chemistry
Estimation: [...] to [...] datasets per year.

Columbus, Ohio (September 8, 2009): Chemical Abstracts Service (CAS), a division of the American Chemical Society, announced that on September 7 it recorded the 50 millionth substance in CAS REGISTRY(SM), the world's most comprehensive and high-quality compendium of publicly disclosed chemical information. The recently registered substance is a novel arylmethylidene heterocycle with analgesic properties. Reaching the 50 million mark so quickly is an indicator of the accelerating pace of scientific knowledge. CAS registered the 40 millionth substance just nine months ago; in contrast, it took 33 years for CAS to register the 10 millionth compound in 1990.

From: Neudert, Reinhard - Weinheim
Sent: Wednesday, March 3 :31
To: Krimmer, Dr. Thomas
Subject: RE: Number of spectra in (organic) chemistry
Hello Mr. Krimmer,
I have found the figures. They are based on four strong chemistry journals from Wiley-VCH. To cover all journals, one would have to extrapolate. Since Wiley-VCH is very strong in chemistry, the factor will probably be around 3, i.e. 15 million spectra in the last 25 years.
Regards, Neudert

Primary data in chemistry
Chemical research generates large amounts of primary data every day, which in the academic environment ultimately feed into a scientific publication.
For example, the largest chemistry journals of the publisher Wiley-VCH published the following numbers of articles in 2006:
Angewandte Chemie: approx. [...] articles
Chemistry - A European Journal: approx. 1000 articles
EurJIC: approx. 610 articles
EurJOC: approx. 650 articles
In these approx. [...] articles in total, about 40 spectra are described per article on average. These spectra, which are based on primary data, are mostly listed in heavily reduced form in the experimental sections, but are also frequently shown as figures in the supplementary material ("Supporting Information").
In both cases the data cannot be searched electronically, nor can the original measurement data be accessed in any form. Taking the number of spectra given above, no fewer than 5 million spectra, i.e. 5 million primary datasets, have been generated in connection with the publication process over the last 25 years.

Number of spectra in the last year, assuming 50 organic chemistry journals with 500 articles each and 80 spectra per article: 2,000,000.
To put this number into perspective: MedLine currently contains some 40 million abstracts in total and adds fewer than 1 million a year; CAS currently contains some 42 million substances, again in total, adding roughly 1 million per year. The 2 million here refer to organic chemistry and one single year only!
Thus far, this vast amount of data lies scattered on the computers of the scientists who produced it.
As no central repository exists, no accessible archival storage is possible at the moment.
Because preparing such data currently earns little credit, primary data are often poorly documented, difficult to access and not saved for the long term.
Researchers are retracting a highly cited 2004 Science paper describing a new way of adding sugars to proteins, a longstanding challenge in molecular biology, citing their inability to repeat the results and the absence of the original lab notebooks with the experiment details, they announced in Science last Thursday (November 26, 2009).
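The two estimates on this slide are simple products, and a short Python sketch makes the arithmetic easy to re-run with other assumptions. The variable names are illustrative only; the figures are the ones quoted on the slide and in the e-mail (50 journals, 500 articles each, 80 spectra per article; 5 million spectra over 25 years from four Wiley-VCH journals, extrapolated by a factor of about 3).

```python
# Estimate 1: spectra published in one year in organic chemistry journals.
# All three inputs are the rough assumptions stated on the slide.
journals = 50
articles_per_journal = 500
spectra_per_article = 80

spectra_per_year = journals * articles_per_journal * spectra_per_article

# Estimate 2: extrapolation from the quoted e-mail.
# 5 million spectra in 25 years from four Wiley-VCH journals,
# scaled by a factor of roughly 3 to cover all chemistry journals.
wiley_spectra_25_years = 5_000_000
extrapolation_factor = 3

all_journals_25_years = wiley_spectra_25_years * extrapolation_factor

print(f"Spectra per year (organic chemistry): {spectra_per_year:,}")
print(f"Spectra in 25 years (extrapolated):   {all_journals_25_years:,}")
```

Both results are order-of-magnitude estimates; changing any single input scales the result proportionally.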
12 PROCESS
What is needed: servers/data centers, metadata, DOI.
Servers/data centers: creation of new and strengthening of existing data centers. They are responsible for quality assurance, storage of the content and its accessibility, and creation of metadata.
Metadata: global access to datasets and their metadata through existing catalogues. TIB stores the metadata and keeps it searchable.
DOI: use of persistent identifiers, also for data (DOI = Digital Object Identifier). TIB registers research data worldwide from a scientific, technical or medical background.
Digital Object Identifier (DOI)
The Digital Object Identifier (DOI®) System is for identifying content objects in the digital environment.
Information about a digital object may change over time, including where to find it, but its DOI name will not change.
The DOI System provides a framework for persistent identification.
The system is managed by the International DOI Foundation.
Over 40 million DOI names have been assigned by DOI System Registration Agencies in the US, Australia, and Europe.
You might have come across DOIs when citing advance online articles.
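To illustrate what persistent identification means in practice: a DOI name stays fixed, and a resolver redirects it to wherever the object currently lives. The sketch below only builds the resolver link and makes no network request. It assumes the public proxy at https://doi.org/; the function name doi_to_url and the DOI used in the example are hypothetical placeholders, not identifiers from this project.

```python
from urllib.parse import quote

# Public DOI proxy; a DOI name appended to it redirects to the
# current location of the identified object.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI name such as '10.xxxx/suffix' into a resolver URL."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix used in reference lists.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    prefix, separator, suffix = doi.partition("/")
    # Every DOI name has a '10.' prefix, a slash, and a non-empty suffix.
    if not (prefix.startswith("10.") and separator and suffix):
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


# Hypothetical DOI, for illustration only.
example_url = doi_to_url("doi:10.5555/example-primary-data")
print(example_url)
```

If the dataset moves to another server, only the registered target changes; the citation containing the DOI stays valid.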
13 PROCESS
The author submits the research data to Thieme together with the article.
Thieme hosts the research data in a data center (FIZ Karlsruhe).
TIB assigns a DOI to the data.
When the article is published, the primary data are published at the same time as an independent entity, but linked to the article.
The article cites the research data as reference items with the assigned DOI.
15 RESULTS
An abstract with primary data as supplementary information.
The primary data have their own DOI, different from that of the paper; thus, the primary data can be cited independently.
Clicking the link (or entering the DOI in a web browser) downloads a zip file.
16 RESULTS
Primary data come neatly organized in a zip file.
Numbering of the folders corresponds to numbering of the compounds in the corresponding article.
The zip file also contains a Read Me file.
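The folder-per-compound layout described on this slide is easy to inspect programmatically. The following Python sketch groups the entries of such an archive by their top-level folder, using only the standard library. The file names in the demo archive (ReadMe.pdf, 1/1H/fid and so on) are invented for illustration; the actual layout of a Thieme primary data package may differ.

```python
import io
import zipfile
from collections import defaultdict


def files_by_compound(archive: zipfile.ZipFile) -> dict:
    """Group archive members by top-level folder (the compound number).

    Files that sit directly in the archive root, such as a Read Me
    file, are collected under the empty string key.
    """
    grouped = defaultdict(list)
    for name in archive.namelist():
        if name.endswith("/"):
            continue  # skip explicit directory entries
        folder, separator, rest = name.partition("/")
        if separator:
            grouped[folder].append(rest)
        else:
            grouped[""].append(name)
    return dict(grouped)


# Build a small in-memory archive with a hypothetical layout.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("ReadMe.pdf", b"placeholder")
    zf.writestr("1/1H/fid", b"placeholder")
    zf.writestr("1/13C/fid", b"placeholder")
    zf.writestr("2/1H/fid", b"placeholder")

with zipfile.ZipFile(buffer) as zf:
    layout = files_by_compound(zf)

for compound, members in layout.items():
    print(compound or "(root)", members)
```

For a downloaded package, the same function can be applied to zipfile.ZipFile("path/to/package.zip"), where the path is whatever the download was saved as.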
17 RESULTS
The Read Me PDF in the zip file describes the content and which programs can be used to view it.
18 RESULTS
From the article: carbon (top left), proton (top right), COSY (bottom left).
From Bruker: MA (bottom right).
No PDFs or JPGs: these are actual raw and interactive data, which you can load into your system, zoom in on, overlay with your own measurements, etc.
20 SUMMARY
Benefits
Citability of research data.
High visibility of the data.
Easy re-use and verification of the datasets.
Avoiding duplication.
Motivation for new research.
21 SUMMARY
Benefits for authors
More work.
Proof of quality.
Documentation of validity.
More exposure for results.
At first, it looks like just another burden.
But the original data not only show how clean the products really were, they also add great value and trust to the methods used.
In addition, the work gets a good deal more exposure, as it will show up not only when someone looks for an article, but also when someone looks for a substance, a method, a spectrum, etc.
Imagine what that can mean for the rating of an author (h-factor, provided the methodology is good, of course).
22 SUMMARY
Benefits for users
Quick evaluation of papers.
Find structures by spectra.
Find similar patterns.
Understand individual peaks.
But fast-forward a few years with me:
Most articles come with primary data.
The data themselves are fully searchable.
Spectra are linked via InChIs to the structures.
Users will be able to search for patterns, or even single peaks!
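The peak search envisioned on this slide could look roughly like the following Python sketch: spectra are stored with the InChI of the structure they belong to, and a query returns every structure with a peak near a given chemical shift. This is an assumption about how such an index might work, not a description of an existing Thieme or TIB service. The class and function names are hypothetical, and the peak positions are toy values, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    inchi: str        # identifies the structure the spectrum belongs to
    method: str       # e.g. "1H NMR"
    peaks_ppm: tuple  # peak positions in ppm


def find_by_peak(spectra, shift_ppm, tolerance_ppm=0.05):
    """Return the InChIs of all spectra with a peak near shift_ppm."""
    return [
        s.inchi
        for s in spectra
        if any(abs(p - shift_ppm) <= tolerance_ppm for p in s.peaks_ppm)
    ]


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
ACETONE = "InChI=1S/C3H6O/c1-3(2)4/h1-2H3"

# Toy index; peak values are illustrative placeholders only.
index = [
    Spectrum(ETHANOL, "1H NMR", (1.25, 3.72)),
    Spectrum(ACETONE, "1H NMR", (2.17,)),
]

print(find_by_peak(index, 2.17))
print(find_by_peak(index, 3.70))
```

A linear scan is enough for a sketch; a real service over millions of spectra would need a sorted or binned index of peak positions.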
23 SUMMARY
Open questions
No specific regulations so far.
Copyright.
Centralized data hosting.
Data compatibility.
To realize all this, there are some hurdles to be cleared:
Centralized data hosting, with clear definitions of the requirements.
See PANGAEA (Publishing Network for Geoscientific & Environmental Data), hosted by the Alfred Wegener Institute for Polar and Marine Research (Bremerhaven) and the Center for Marine Environmental Sciences (Bremen). Supported by the European Commission (Research), the Federal Ministry of Education and Research (BMBF), the Deutsche Forschungsgemeinschaft (DFG) and the Integrated Ocean Drilling Program (IODP).
The information system PANGAEA is operated as an Open Access library aimed at archiving, publishing and distributing georeferenced data from earth system research. The system guarantees long-term availability of its content through a commitment of the operating institutions.
No regulations regarding format, copyright, use of data, definition of data as primary, etc.
The project presented here is a start-up prototype and currently nothing more, which is also why the issues just mentioned have not yet been fully addressed and solved.
Data quality
This does not refer to the scientific quality of the data but to their technical characteristics and their compatibility with different "hardware" (e.g. there are currently two main suppliers of NMR spectrometers, and their data are cross-readable in one direction only).
24 SUMMARY
So, the only thing I have left to say: go and share your primary chemical data with your fellow researchers, now. Details on how to do it can be found in our instructions for authors.
If you also publish in journals other than SYNLETT and SYNTHESIS, please talk to your editor about primary data; they may already be working on it.