Bettina Berendt, Frank Havemann: Accelerating Scholarly Communication through OA
Questions & Approaches: What should citation indicators take into account?

What should citation indicators take into account?
1. Older publications have had a greater chance of being cited
2. The recency of a citation is of interest
3. A citation from an article that is itself highly cited should count for more than one from a rarely cited or uncited article
4. Problems:
   - The meaning of aggregated indicators is not immediately apparent
   - Citation habits differ
5. Solution:
   - Compare articles that are close in subject matter
   - Interactivity instead of indicators
   - Neighbourhood / similarity measures

Similarity measures for determining neighbourhoods are based on
- Links (citations)
- Text
- Usage

Link-based similarity measures: basic forms
- Direct citation
- Bibliographic coupling
- Co-citation
[Diagram: documents A, B and C illustrating the three forms]
& similarity measure, e.g. Jaccard coefficient for co-citation:
(number of sources cited in both documents) / (number of sources cited in at least one document)
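The quotient above is a Jaccard coefficient over two sets. A minimal sketch, assuming each document is represented by the set of sources it cites; the document and source IDs are made up for illustration. The same function can be applied to the sets of documents that cite two given sources, which gives the inverted view of the same citation data.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical reference lists: document -> set of sources it cites.
cites = {
    "A": {"s1", "s2", "s3"},
    "B": {"s2", "s3", "s4"},
    "C": {"s5"},
}

# Sources cited in both documents / sources cited in at least one document.
sim_ab = jaccard(cites["A"], cites["B"])  # 2 shared out of 4 distinct sources
sim_ac = jaccard(cites["A"], cites["C"])  # nothing shared

# Inverted view: source -> set of documents citing it.
cited_by = {}
for doc, sources in cites.items():
    for src in sources:
        cited_by.setdefault(src, set()).add(doc)

# The same coefficient over the sets of citing documents.
sim_s2_s3 = jaccard(cited_by["s2"], cited_by["s3"])
```

Returning 0.0 for two empty sets is a design choice of this sketch; it avoids a division by zero for documents without any references.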

Link-based similarity: citing documents

Link-based similarity: cited documents

Link-based similarity: local co-citation neighbourhood


Link-based similarity: local bibliographic-coupling neighbourhood
- Active Bibliography: sources that are also cited by others
- Active Bibliography Score = Common Citation Inverse Document Frequency

Excursus: link-based ranking
Citations from highly cited articles should be weighted more heavily ...
→ use of PageRank + X
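To illustrate why PageRank fits this requirement, here is a minimal power-iteration sketch of plain PageRank on a toy citation graph. The graph, the damping factor of 0.85, and the iteration count are assumptions for illustration; the "+ X" extensions mentioned on the slide are not specified there and are not modelled.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each document to the list of documents it cites."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Pass this document's rank on to the documents it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document (cites nothing): spread its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: P and Q each receive exactly one citation,
# P from the well-cited H and Q from the uncited L.
citations = {
    "H": ["P"],
    "L": ["Q"],
    "X1": ["H"],
    "X2": ["H"],
    "X3": ["H"],
}
ranks = pagerank(citations)
```

Although P and Q have the same raw citation count, P ends up with the higher rank because its single citation comes from a document that is itself well cited.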

Excursus: further link analyses
Social network analysis (?)


Text-based similarity (I)
Similarity at the sentence level:
- respects sentence structure (sequence, minus some data cleaning)
- usually revisions of the document under consideration
Similarity at the text level:
- based on bag-of-words and TF.IDF
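As an illustration of text-level similarity, the following sketch uses the textbook combination of bag-of-words vectors, TF-IDF weights, and cosine similarity. This is a stand-in under stated assumptions (whitespace tokenisation, logarithmic IDF, made-up sample texts), not the scoring formula of any particular search engine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: name -> text. Returns name -> {word: tf * idf}."""
    bags = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())  # count each word once per document
    return {
        name: {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in bag.items()}
        for name, bag in bags.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini corpus.
docs = {
    "d1": "citation analysis of open access articles",
    "d2": "open access citation indicators",
    "d3": "protein folding simulation",
}
vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors["d1"], vectors["d2"])  # shares three words
sim_13 = cosine(vectors["d1"], vectors["d3"])  # shares no words
```

Words that occur in every document get an IDF of zero here, so only comparatively rare shared words raise the similarity.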

Text-based similarity (II)

Usage-based similarity (here: community-based)

Interactive citation analysis tools: advantages and disadvantages
+ Immediate availability of the sources (provided they are OA)
+ Relationships become visible, especially when inspecting the full text
- Not available in the same quality for all disciplines
- Incomplete and error-prone document indexing
- Frustration when full texts are not available as OA
- Algorithms are only rarely transparent
  o Google Scholar: proprietary
  o Citeseer: open source
- Lack of overview I: only local search, starting from one document
- Lack of overview II: no or unclear ranking within result sets

Problem: lack of overview due to local-only search
- Only local search of the neighbourhood of one document at a time
- No "top-down" grouping of documents
Why are groups useful?
- Citation indicators must be formed with respect to such groups
- Understanding a field involves forming groups of concepts
→ Assumption: concepts are represented by groups

Basic idea: find groups starting from a search term
Interactive tool:
1. Restrict the search space by content using a search term or phrase
2. Retrieve all documents that match this search term
3. Build a similarity matrix of all these documents
4. Cluster to form groups of documents
5. Edit the groups manually (regroup, delete, label)
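Steps 1 to 4 above can be sketched as follows. This is a toy version under stated assumptions: the query is matched against titles by substring, similarity is the Jaccard coefficient over cited sources, and grouping is simple single-link clustering with a threshold (connected components), standing in for whatever clustering method the actual tool uses. All names and data are hypothetical; step 5 stays with the user.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_groups(corpus, query, threshold=0.3):
    """corpus: doc id -> (title, set of cited sources)."""
    # Steps 1-2: restrict the search space and retrieve matching documents.
    hits = sorted(d for d, (title, _) in corpus.items()
                  if query.lower() in title.lower())
    # Step 3: pairwise similarity matrix over the retrieved documents.
    sim = {(a, b): jaccard(corpus[a][1], corpus[b][1])
           for a in hits for b in hits}
    # Step 4: merge documents whose similarity reaches the threshold.
    parent = {d: d for d in hits}

    def find(d):
        while parent[d] != d:
            d = parent[d]
        return d

    for (a, b), s in sim.items():
        if a != b and s >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for d in hits:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Hypothetical corpus.
corpus = {
    "p1": ("Web usage mining survey", {"r1", "r2", "r3"}),
    "p2": ("Mining web logs", {"r2", "r3", "r4"}),
    "p3": ("Text mining for biology", {"r7", "r8"}),
    "p4": ("Mining gene literature", {"r8", "r9"}),
    "p5": ("Citation indicators", {"r1"}),
}
groups = find_groups(corpus, "mining")
```

With this data the query "mining" retrieves p1 to p4, which fall into two groups; p5 never enters the similarity matrix because it does not match the search term.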

System architecture
[Diagram: Web services; Text mining / Information Extraction tools; Databases (local a/o mirrored); other WS and info. sources; VBA macro]
Berendt, Dingel, & Hanser (Proc. ECDL 2006): http://www.wiwi.hu-berlin.de/~berendt/DL

Literature search and bibliography construction

Publishing and sharing the results

Outlook: searching for and evaluating literature in context

And once the "right" neighbourhood has been found ...
... how can the documents in this group be evaluated / ranked?


Backup slides

Active Bibliography
The active bibliography is a notion relative to a given document. Consider a document D: the active bibliography for D is the set of documents cited by D that are also cited by other documents (in our case these "other documents" are in SMEALSearch). In other words, the active bibliography of a document is the subset of its bibliography that has a non-empty intersection with the bibliography of any other document (in SMEALSearch).
SMEALSearch attributes a score to each document sharing all or part of its bibliography with document D, which defines the active bibliography score for that document with respect to document D. The score for each such document is computed as follows:
[formula missing in source]
In the formula, Citing(Y) is the set of documents citing the document Y. This score identifies the documents that share their bibliography with D: the higher the score for a document X, the more X cites the same documents as D. Note that the active bibliography score can be greater than 1.
In the literature, active bibliography (scoring) is also referred to as CCIDF (Common Citation Inverse Document Frequency), by analogy to TFIDF.
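The scoring formula itself did not survive in the text above, so the following is only one plausible reading, stated as an assumption: each reference Y shared by D and X contributes 1 / |Citing(Y)|, so that rarely cited shared references count for more, in the spirit of inverse document frequency. The function name and the sample data are hypothetical.

```python
def active_bibliography_scores(d, bibliographies):
    """Score every other document X against document d.

    bibliographies: doc id -> set of cited documents.
    ASSUMPTION: each reference Y shared by d and X contributes
    1 / |Citing(Y)|; the original formula is not available.
    """
    # Citing(Y): the set of documents that cite Y.
    citing = {}
    for doc, refs in bibliographies.items():
        for y in refs:
            citing.setdefault(y, set()).add(doc)

    scores = {}
    for x, refs in bibliographies.items():
        if x == d:
            continue
        shared = bibliographies[d] & refs
        if shared:
            scores[x] = sum(1.0 / len(citing[y]) for y in shared)
    return scores


# Hypothetical bibliographies.
bibs = {
    "D": {"y1", "y2", "y3"},
    "X1": {"y1", "y2", "y3"},
    "X2": {"y1", "y9"},
}
scores = active_bibliography_scores("D", bibs)
```

Under this assumed weighting, X1 shares three references with D and reaches a score above 1, which is at least consistent with the remark that the score can exceed 1.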

What is the text similarity based on text, and how does SMEALSearch compute the associated similarity score?
For any document in its database, SMEALSearch maintains a set of similar documents (also in the database), based on textual information. The documents in this set are those featuring words that are also found in the document under consideration. The similarity score between two documents emphasizes words that occur in both documents while having an overall low occurrence frequency in the database: the more infrequent words both documents have in common, the higher their relative similarity score based on text. More details can be found in the glossary entry for text similarity at the text level.

Similarity @ Text Level
For a given document D, a set of similar documents {Xi} can be generated by similarity at the text level. The similarity at the text level is an implementation of the TFIDF ranking scheme, which, for any document Xi, emphasizes the weight of words that (1) are common to both D and Xi and (2) have an overall low occurrence frequency in the database. The more such words are found for D and Xi, the higher their similarity score. The actual scoring formula for the similarity based on text is:
[formula missing in source]
where freq(wi, X) gives the frequency (number of occurrences) of the word wi in the body of document X, and freq(wi) gives the frequency of the word wi in the entire database (the body of all the documents in the database).
Note finally that, for performance reasons, only the first 10,000 characters / 5000 words (whichever comes first) of the body of the current document (D) are considered for similarity analysis.

What is the text similarity at the sentence level, and how does it differ from the text similarity based on text?
Documents that are similar at the sentence level have a high ratio of sentences in common; that is, many sentences occurring in one document also occur in the other. While the text similarity is a notion based only on the occurrence of words, the similarity at the sentence level exploits the full sentence structure for comparison purposes. Documents listed as similar at the sentence level are usually revisions of the document under consideration. Note also that SMEALSearch filters out, at submission time, documents with a very high ratio of common sentences with a document already in the database, to avoid the creation of duplicates. More details can be found in the glossary entry for text similarity at the sentence level.

Similarity @ Sentence Level
In order to detect the overlap between documents, SMEALSearch maintains a database of all the sentences occurring in all the documents in the database. SMEALSearch uses punctuation (periods, exclamation marks and question marks) to delineate sentences (although this is not always correct, e.g. consider abbreviations). Sentence identification is non-trivial; however, the problem is simplified by considering a more compact form of each sentence: non-alphabetical characters and vowels are removed, and the resulting string is truncated to a maximum length if necessary.
The database of sentences is a hash table that associates each sentence (unique string, the key) with a list of documents (the value) in which this sentence occurs at least once. By considering all the sentences occurring in the current document, SMEALSearch can compute the ratio of those sentences that also occur in any other document in the database: this is the percentage provided for each document listed in the corresponding section of the document pages.
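The sentence-level procedure described above can be sketched roughly as follows. The maximum key length of 40 and the lower-casing step are assumptions of this sketch (the text states neither), and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict

MAX_KEY_LENGTH = 40  # assumed value; the text does not state the actual limit


def sentence_keys(text):
    """Split on . ! ? and reduce each sentence to its compact form."""
    keys = set()
    for sentence in re.split(r"[.!?]", text):
        # Drop non-alphabetical characters and vowels (after lower-casing,
        # which is an assumption of this sketch), then truncate.
        compact = re.sub(r"[^a-z]|[aeiou]", "", sentence.lower())
        if compact:
            keys.add(compact[:MAX_KEY_LENGTH])
    return keys


def build_index(docs):
    """Hash table: compact sentence -> set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for key in sentence_keys(text):
            index[key].add(name)
    return index


def sentence_overlap(current, docs, index):
    """For each other document: share of current's sentences it also contains."""
    keys = sentence_keys(docs[current])
    shared = defaultdict(int)
    for key in keys:
        for other in index.get(key, ()):
            if other != current:
                shared[other] += 1
    return {other: count / len(keys) for other, count in shared.items()}


# Hypothetical documents: v2 is a light revision of v1.
docs = {
    "v1": "Open access speeds up science. Citations matter! We study both.",
    "v2": "Open access speeds up science. Citations matter! We add new data.",
    "other": "Protein folding is hard.",
}
index = build_index(docs)
overlap = sentence_overlap("v1", docs, index)
```

Looking sentences up in the index means only documents that share at least one sentence with the current document are ever touched, which is why unrelated documents do not appear in the result at all.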

The same in other archives (here: citebase)

Citations and downloads (citebase)


Link-based similarity: local co-citation neighbourhood
Co-citation score (?)
