Bettina Berendt, Humboldt-Universität zu Berlin – www.berendt.de * with many co-authors ** with Roberto Navigli, Università “La Sapienza”, Roma


1 Semantic Web Mining* Today: Semantics for and from blogs**
Bettina Berendt, Humboldt-Universität zu Berlin
* with many co-authors ** with Roberto Navigli, Università “La Sapienza”, Roma, Italy

2 Agenda
1. Motivation and overview
   - Why the Web? Why blogs?
   - Semantic Web Mining
2. Finding your way through blogspace: Using semantics for cross-domain blog analysis

3 Agenda
1. Motivation and overview
   - Why the Web? Why blogs?
   - Semantic Web Mining
2. Finding your way through blogspace: Using semantics for cross-domain blog analysis

4 The goal

5 Make humankind's knowledge effectively accessible to as many people as possible.

6 “Macrocosm World Wide Web”

7 “Microcosm Blogosphere”

8 Concrete goals (example for Part 2 of this talk)
Classification: “This blog covers topics in nutrition and gastronomy.”
→ Suggestions of meta-tags for the blog
→ Support for blog search engines
Recommendations with explanation: “If you found this blog interesting, you may also be interested in blog ..., because ...”

9 The potential

10 A great deal of knowledge, accessible to humans.

11 The problems

12 A great deal of knowledge, accessible to humans.

13 Web Mining

14 Forms
Knowledge discovery (aka data mining): “the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” [1]
Web Mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
Web mining areas:
- Web content mining
- Web structure mining
- Web usage mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press.

15 Web Mining: examples
Web mining areas:
- Web content mining
- Web structure mining
- Web usage mining

16 The main problem of Web Mining

17 Syntax in, Syntax out.

18

19 Semi-automatic tagging: tag recommendation based on syntax + existing labels

20 Tagyu also works (with limitations) for resources in other languages

21 Does this really work? (1)

22 Does this really work? (2)

23 The Wikipedia 300 Component Model, generated with discrete PCA
cosco.hiit.fi/search/H300.html/topic_list - common phrases of selected components
1. process; water; air; pressure; gas; body of water; natural gas; high pressure; hot water; fresh water;
2. Mark; Gospel; Matthew; Luke; Rose; Virgin; Virgin Mary; Gospel of John; Gospel of Mark; Gospel of Luke;
3. part; text; Britannica; entry; Encyclopedia Britannica; Encyclopædia Britannica; Encyclopaedia Britannica; domain Encyclopædia Britannica; public domain Encyclopædia Britannica; public domain text;
4. property; theorem; elements; proof; subset; axioms; proposition; natural numbers; fundamental theorem; mathematical logic;
5. Dove; AMD; Dove Streptopelia; imperial crown; Imperial army; imperial court; imperial family; Collared Dove Streptopelia; Imperial Russia;
6. side; feet; long time; long period; right side; left side; long distances; different types; short distance; opposite side;
7. David; bill; Bob; Jim; Allen; Dave; Current stars; former members; Bill Clinton; former President;
8. magazine; newspaper; political parties; public domain text; public opinion; political career; public schools; own right; political life; public service;
9. way; things; boy; cat; long time; same way; same thing; only way; different ways; good thing;
10. problems; zero; sum; digits; ~~; natural numbers; positive integer; mathematical analysis; decimal digits; natural logarithm;
11. population density; couples; races; total area; makeup; Demographics; median age; income; density; housing units;
175. Torres; Iraqi KASUMI KHAZAD Khufu; Granada; Spa; Fra; General information; General Public License; General Bernardo; New Granada; Torres Strait;
176. love; Me; Rolling Stones; love songs; Rolling Stone magazine; Love Me; Fall in Love; Meet Me; love story; professional wrestler;
In summary, weaknesses of purely statistical approaches: Interpretation of the results? Existence of results? Correctness? Inferences?

24 Semantic Web

25 The Semantic Web
“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in co-operation.” [1]
“The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.” [2]
[1] Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Sci. American, May.
[2] Berners-Lee, T. (2000). Semantic Web XML

26 Semantic Web: example
Category structure: <RDF xmlns:r=" xmlns:d=" xmlns=" Top.... Arts... <symbolic r:resource="Typography:Top/Computers/Fonts"/>....
Resources: <RDF xmlns:r=" xmlns:d=" xmlns=" Arts John phillips Blown glass A small display of glass by John Phillips Computers

27 Why Semantic Web? Example: structured search – metadata according to Dublin Core (DC)
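
To make the idea of structured search over Dublin Core metadata concrete, here is a minimal sketch in Python. It stores metadata as plain (subject, predicate, object) triples and needs no RDF library. The document URIs and the two records are hypothetical and only illustrate the principle; the namespace is the Dublin Core element set.

```python
# Dublin Core element namespace; the example.org URIs below are hypothetical.
DC = "http://purl.org/dc/elements/1.1/"

# Metadata as (subject, predicate, object) triples.
triples = [
    ("http://example.org/doc/1", DC + "title", "Pride and Prejudice"),
    ("http://example.org/doc/1", DC + "creator", "Jane Austen"),
    ("http://example.org/doc/2", DC + "title", "A Life of Jane Austen"),
    ("http://example.org/doc/2", DC + "subject", "Jane Austen"),
]

def structured_search(triples, element, value):
    """Return the resources whose given DC element has exactly this value."""
    return sorted({s for s, p, o in triples if p == DC + element and o == value})

# A keyword search for "Jane Austen" would hit both documents;
# the structured query separates works by her from works about her.
by_austen = structured_search(triples, "creator", "Jane Austen")
about_austen = structured_search(triples, "subject", "Jane Austen")
```

The point of the sketch is the distinction a purely syntactic search cannot make: the same string plays a different role depending on the metadata element it is attached to.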

28 Semantic search: example 2 – metadata according to DC + domain ontology

29 The main problem of the Semantic Web

30 Who is supposed to do all this?

31 The approach

32 Web Mining: machine learning extracts knowledge from data.
The Semantic Web makes knowledge machine-understandable.
Semantic Web Mining:
- use semantics to improve mining
- use mining to generate semantics

33

34 Web Mining, Semantic Web: understand
...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03: ] "GET /search.html?t=jane%20austen&SID= &ord=asc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05: ] "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0"
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06: ] "GET /view.asp?id=3456&SID= HTTP/1.0"
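
A first step towards "understanding" such log entries is to turn the request string into named fields. The following Python sketch does this with the standard library only. The function name `parse_request` is an assumption made for illustration, and the sample request is the second log entry from the slide.

```python
from urllib.parse import urlsplit, parse_qs

def parse_request(request_line):
    """Split 'METHOD target PROTOCOL' into the path and a dict of query parameters."""
    method, target, protocol = request_line.split()
    parts = urlsplit(target)
    # parse_qs decodes percent-escapes (%20 becomes a space) and returns lists of values;
    # this sketch keeps only the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params

path, params = parse_request(
    "GET /search.html?t=jane%20austen&m=video&SID=023785&ord=desc HTTP/1.0"
)
# path is "/search.html"; params["t"] is "jane austen", params["m"] is "video"
```

Parsing yields only syntax. Knowing that `t` is a search term and `m` a media type is exactly the kind of semantics that has to be added from outside the log.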

35 Web Mining, Semantic Web: contribute
Bibliography: [1] Agarwal, R.; Krueger, B. P.; Scholes, G. D.; Yang, M.; Yom, J.; Mets, L.; Fleming, G. R. Ultrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements, J. Phys. Chem. B, 2000, 104, 2908, ...

36 Web Mining, Semantic Web: organize and explain; contribute

37 Agenda
1. Motivation and overview
   - Why the Web? Why blogs?
   - Semantic Web Mining
2. Finding your way through blogspace: Using semantics for cross-domain blog analysis

38 Context
- Semi-automatic tagging
- Blog recommendation
- Semantics-enhanced text mining, word sense disambiguation
- Exploratory analyses of blog contents
- Computational Approaches to Analyzing Weblogs, AAAI 2006 Spring Symposium
- Read more in the paper:

39 Blog recommendation: collaborative + content-based filtering (

40 An example of exploratory blog analysis (in which a syntax-based approach is sufficient): the run-up to the 2004 US presidential election (Adamic & Glance, 2005)

41 Our procedure
1. Take a set of blog corpora (blog corpus = collection of blogs manually labelled as belonging to one topic)
2. In all of the following analyses, ask:
   - What is a blog corpus about?
   - To which other blog corpora is it related, and why?
3. Syntactic analysis: keyphrases
4. Semantic analysis I: domain labels
5. Semantic analysis II: structural semantic interconnections

42 Data

43 Sample data: 4 blog corpora
- Food and drink
- Health and medicine
- Law
- Weblogs about blogging
- Randomly sampled from the Yahoo! blog directory, K words each
- Available at

44 Syntactic analysis

45 What is a blog about? Term extraction
Domain relevance (DR) and domain consensus (DC)
Keyphrases: DR ≥ 0.35, DC ≥ 0.23 (values from previous experiments)
t = term, C = corpus (here: blog corpus), b = a blog (here: as an element of a corpus C_k)
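
The formulas for domain relevance and domain consensus are not part of this transcript. The Python sketch below therefore rests on assumptions: DR is taken as the term's relative frequency in one corpus divided by its highest relative frequency in any corpus, and DC as the entropy of the term's distribution over the blogs of a corpus, normalized to [0, 1]. Both follow the spirit of the term-extraction literature but are not recovered from the slide. The function names and the toy corpora are hypothetical; only the two thresholds come from the slide.

```python
import math
from collections import Counter

def term_probability(term, blogs):
    """Relative frequency of a term in a corpus (a list of blogs, each a list of terms)."""
    counts = Counter(t for blog in blogs for t in blog)
    total = sum(counts.values())
    return counts[term] / total if total else 0.0

def domain_relevance(term, corpora, name):
    """Assumed DR: P(t | corpus) divided by the maximum of P(t | any corpus)."""
    probs = {n: term_probability(term, blogs) for n, blogs in corpora.items()}
    best = max(probs.values())
    return probs[name] / best if best else 0.0

def domain_consensus(term, blogs):
    """Assumed DC: entropy of the term's spread over blogs, normalized by log(#blogs)."""
    freqs = [blog.count(term) for blog in blogs]
    total = sum(freqs)
    if total == 0 or len(blogs) < 2:
        return 0.0
    entropy = -sum((f / total) * math.log(f / total) for f in freqs if f)
    return entropy / math.log(len(blogs))

def is_keyphrase(term, corpora, name, min_dr=0.35, min_dc=0.23):
    """Apply the thresholds given on the slide."""
    return (domain_relevance(term, corpora, name) >= min_dr
            and domain_consensus(term, corpora[name]) >= min_dc)

# Hypothetical toy corpora: two blogs each, already reduced to candidate terms.
corpora = {
    "food": [["recipe", "cream cheese", "recipe"], ["recipe", "law firm"]],
    "law": [["law firm", "court"], ["law firm", "court", "court"]],
}
```

Under these assumptions, "recipe" qualifies as a keyphrase of the food corpus because it is specific to that corpus and used in both of its blogs, whereas "law firm" does not because it occurs in only one food blog.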

46 What is shared by two blogs?
Syntactic similarity: Jaccard coefficient
T(C) = keyphrases / “terminology” of corpus C
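
The Jaccard coefficient of two terminologies is the size of their intersection divided by the size of their union. A minimal Python sketch follows; the two keyphrase sets are hypothetical examples.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two keyphrase collections."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    # Two empty terminologies share nothing; define their similarity as 0.
    return len(a & b) / len(union) if union else 0.0

# Hypothetical terminologies T(C1) and T(C2)
t_food = {"cream cheese", "health food", "recipe"}
t_health = {"health food", "recipe", "disease"}
similarity = jaccard(t_food, t_health)  # 2 shared of 4 distinct keyphrases
```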

47 Semantic analysis I: WordNet and WordNet domains

48 WordNet

49 Hierarchical knowledge: Domain labels

50 Domain label statistics show that the blog corpora have clear thematic foci.
Frequency of domain D in corpus C = no. of keyphrases in C with a sense that maps to D
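
The frequency definition can be sketched in a few lines of Python. The lookup table `sense_domains` (keyphrase to a list of senses, each sense a set of domain labels) is a hypothetical stand-in for a WordNet Domains lookup, and its entries are invented for illustration.

```python
from collections import Counter

def domain_frequencies(keyphrases, sense_domains):
    """Count, for each domain D, the keyphrases that have at least one sense mapping to D."""
    counts = Counter()
    for phrase in keyphrases:
        # Collect the domains of all senses; a keyphrase counts once per domain.
        domains = {d for sense in sense_domains.get(phrase, []) for d in sense}
        counts.update(domains)
    return counts

# Hypothetical sense inventory
sense_domains = {
    "cream cheese": [{"gastronomy"}],
    "law firm": [{"law"}, {"economy"}],
    "bank": [{"economy"}, {"economy", "geography"}],
}
freq = domain_frequencies(["cream cheese", "law firm", "bank"], sense_domains)
top = freq.most_common(1)  # the dominant domain of this toy corpus
```

Ranking the domains by this count gives the kind of top-domain list used to characterize a corpus.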

51 Blog foci: Top 5 domains
Columns: Food | Health | Law | Meta-blogs
1: Gastronomy | Medicine | Law | Telecommunications
2: Alimentation | Time period | Quality | Time period
3: Quality | Politics | Person
4: Botany | Biology | Administration | Publishing
5: Person | Physics | Economy

52 Top-10 intersections
- Law – meta-blogs: Law, politics, economy (+ 3 factotum)
- Law – health: Law, psychology (+ 2 factotum)
- Health – meta-blogs: Law (+ 2 factotum)
- Food – law: Sociology (+ 2 factotum)
- No overlap food – health, health – law

53 Semantic analysis II: Hierarchical and non-hierarchical knowledge: WordNet and SSI (Structural semantic interconnections)

54 The need for word sense disambiguation
“She sat by the bank and looked sentimentally at the last fish.”
“She sat by the bank and looked sentimentally at the last coins.”

55 WordNet semantic relations

56 Structural semantic interconnections: bank – fish
Details of SSI's enhanced lexical database (extending WordNet) and of SSI's word sense disambiguation are described in R. Navigli & P. Velardi. Structural Semantic Interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(7), July 2005.

57 Structural semantic interconnections: bank – coin

58 Knowledge-based similarity between blogs
Example:
- connection between two terms from the domain computer science
- path weights: 0.33; 0.25; 0.25 (= 1 / path length in no. of edges)
Procedure: for each blog pair
1. Find all SSI paths between all pairs of a term (keyphrase) from blog 1 and a term from blog 2 (in all conditions but the baseline: choose only terms that map to senses in the top domain(s), and choose only those senses).
2. Measure of blog pair similarity = sum over the weights of all these paths.
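
The weighting scheme can be sketched in Python as follows. Finding the SSI paths themselves is out of scope here, so the sketch starts from paths that are already given as node sequences. The node names are hypothetical placeholders; the three path lengths (3, 4, and 4 edges) reproduce the weights 0.33, 0.25, and 0.25 from the example.

```python
def path_weight(path):
    """Weight of a path = 1 / its length in edges (a path with n nodes has n - 1 edges)."""
    return 1.0 / (len(path) - 1)

def blog_pair_similarity(paths):
    """Similarity of a blog pair = sum of the weights of all paths found between their terms."""
    return sum(path_weight(p) for p in paths)

# Hypothetical paths between a term of blog 1 ("t1") and a term of blog 2 ("t2")
paths = [
    ["t1", "a", "b", "t2"],       # 3 edges, weight 0.33
    ["t1", "c", "d", "e", "t2"],  # 4 edges, weight 0.25
    ["t1", "f", "g", "h", "t2"],  # 4 edges, weight 0.25
]
similarity = blog_pair_similarity(paths)
```

Short paths dominate the sum, so many long and generic connections are needed to outweigh a few direct ones.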

59 Experimental settings

60 Results (Quantitative view)

61 Results: Qualitative view
- Baseline: spurious connections between law – metablogs via computer science terms → filtered out in domain-label conditions
- Correct connections throughout: food – health: greasy food (cream cheese, chocolate sauce, ...) – other fats, or health food
- 1/3-relatedness reveals important connections:
   - Expected: law – metablogs: enterprise (related to law) – computer science (related to telecommunications), publishing, politics: law firms, news organizations, news story, political party
   - Unexpected: law – food: local government – town planning (including parking lots, the main drag)
- Single-term expressions particularly visible in food – health (eggs, onions, ... – health food; disease – beef) → lexicalization effect, depends on domains (also related domains in law – metablogs)
- 3-relatedness: many highly generic single-word terms (activity, life, computer, area, food) establish many generic paths to a 2nd corpus (these terms are “related to” nearly everything else) → topic drift

62 Restricting path grammar to find valid interconnections
- Starting from 3-relatedness
- ≤ 1 related-to link → filters out 88.8% of the paths
- ≤ 2 types of links → filters out 53.4% of the paths
- Results:
   - Mostly, “meaningful” paths were retained.
   - But further research is needed.
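
The two path-grammar restrictions can be written as simple predicates over the sequence of link types along a path. This Python sketch is only an illustration: apart from "related-to", the link names are assumed WordNet-style relation labels, and the three sample paths are invented.

```python
def at_most_one_related_to(links):
    """Restriction 1: a path may contain at most one related-to link."""
    return links.count("related-to") <= 1

def at_most_two_link_types(links):
    """Restriction 2: a path may use at most two different types of links."""
    return len(set(links)) <= 2

# Hypothetical paths, each given as the list of link types along its edges
paths = [
    ["hypernym", "hypernym", "hyponym"],
    ["related-to", "hypernym", "related-to"],
    ["hypernym", "related-to", "meronym"],
]
kept_by_related_to = [p for p in paths if at_most_one_related_to(p)]
kept_by_link_types = [p for p in paths if at_most_two_link_types(p)]
```

Each predicate removes a different path in this toy set, which mirrors the observation that the two restrictions filter out different shares of the paths.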

63 Questions / future work
- Evaluation
   - Standard datasets (“senseval for blogs”): try the following?!
      - 10 M posts from 1 M weblogs from three weeks in July
      - This data set has been selected as it spans a period of time during which an event of global significance occurred, namely the London bombings.
   - Compare syntax- and semantics-based approaches
      - Assuming that the semi-automatic approaches of Semantic Web Mining give qualitatively better results: How can the quality gains be weighed against the additional costs of manual post-processing?
- Improve path grammars
- Ontology learning

64 Thank you for your attention!