Text Classification: The Scirus Classifier

Published by: Jürgen Kathol. Modified more than 10 years ago.

1 Text classification: the Scirus classifier

2 Overview
A complex program:
– Pornography filter
– Extraction of names
– Classification based on text
– Classification by URL/title
– Fixed classification based on a URL list
– Extraction of title/author/abstract etc. for articles
– Output of refinement terms
Of interest here: only the classification based on textual content.

3 Text classification
Lexicon-based:
– Phrases or words
– Each receives a weight for every category
– Strong indicators
Classification by computing a score:
– For each occurrence, a counter is incremented for every category
– Normalization by document length
– Threshold

4 Configuration

5 Configuration file
//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
SUBJ=comp all 2 0
SUBJ=eng all 3 0
SUBJ=env all 4 0
SUBJ=geo all 5 0
SUBJ=astro all 6 0
SUBJ=life all 7 0
SUBJ=math all 8 0
SUBJ=mat all 9 0
SUBJ=med all 10 0
….
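A minimal sketch of how a file in this shape could be read, assuming one KEY=VALUE pair per line, `//` for comment lines, and keys such as SUBJ that may repeat. The function name `parse_config` is hypothetical. The slide does not explain the four fields of a SUBJ line, so they are kept as raw strings here.

```python
from collections import defaultdict

def parse_config(text):
    """Collect KEY=VALUE lines; values of repeated keys are kept in order."""
    settings = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, '//' comments, and anything without '=' (assumption).
        if not line or line.startswith("//") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()].append(value.strip())
    return dict(settings)

sample = """//Number of words to process for subject identification
NWDS=2000000
MINWORDS=100
THRESHOLD=1
SUBJ=gen all 0 0
SUBJ=chem all 1 0
"""

config = parse_config(sample)
# Each SUBJ value is split on whitespace; field meanings are not given on the slide.
subjects = [entry.split() for entry in config["SUBJ"]]
print(config["NWDS"], subjects)
```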

6 Invocation
CIS Subject Identifier and Content Extractor Version 5.0
USAGE: classifier [-h[elp]] [-os|l[A]] [-it|f|h] [-s[ilent]] [-c CONFIG_FILE] [-nout] [-uat] [-URL ] [-smd ] [-ps] [-t FILES_TO_IDENTIFY]
-h: print help
-c CONFIG_FILE: Name of the configuration file. Default is ././config.txt
-os|l[A]: Output format
  -os: Short: only print well identified subjects (default)
  -ol: Long: print all subjects
  -ot: Topics only are output; one line
       Format: filename:WORDCOUNT#GENERALSCIENCESCORE#TOPICSWITHSCORE
  -oA: Store and print all phrases for a topic
  -oT: Print all phrases found in the dictionary
       (Used for dictionary testing only)
-T[t][i][o]: Tasks to carry out and to output (default: all are set)
  t: Topic identification
  i: Information from content extractor
  o: Offensive content filter
-it|h|f: Input format
  -it: Plain text
  -ih: HTML-file
  -if: HTML-file preceded by header
-nINTEGER: Minimum number of words in a document
-MINTEGER: Maximum number of words to be processed in a document
           tokenizer stops after INTEGER words
           Documents with less words will get tag 'not_enough_data'
-mINTEGER: Minimum score for accepted documents
-rINTEGER: maximum relative count for phrase form/thousand
           In thousand phrases one phrase form will only be counted INTEGER times.
-NINTEGER: Maximum number of phrases to output in results for topics
-t FILES_TO_IDENTIFY: List of files for which subject should be identified. Default: stdin.
-D[r] D1|D2[:F1|F2[:FB1|FB2]]: process all files in directory and recurse
  Dr: descend recursively into subdirectories
  D1: name of directory to list or recurse
  F1...: filename patterns (may contain *)
  FB1: Patterns for forbidden directories (not recursed)
-s: print only some important messages, not all.
-nout: Turn off URL/Title classifier.
-uat: Use all titles for classification (not just those enclosed in ).
-URL : Filename of the URL list (format: ).
-smd : Maximum number of words for small documents (default see config file).
-ps: Print title and url scores
-xml: Print XML output

7 Procedure
– Read the text up to the specified number of words
– Match it against the lexicon
– Compute the score
– Output the result depending on the threshold
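The first two steps (reading up to a word limit and matching against the lexicon) might look roughly like the sketch below. The slides do not describe the tokenizer or the matching strategy, so lowercasing, splitting on non-alphanumeric characters, and greedy longest-match lookup are all assumptions, as are the names `find_phrases` and `max_phrase_len`.

```python
import re

def find_phrases(text, lexicon, max_words, max_phrase_len=4):
    """Return (phrase occurrences, number of words processed).

    Only the first max_words tokens are considered; at each position the
    longest phrase present in the lexicon wins (assumed strategy).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())[:max_words]
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_phrase_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                found.append(phrase)
                i += n  # continue after the matched phrase
                break
        else:
            i += 1  # no phrase starts here
    return found, len(tokens)

lexicon = {"a d converter", "a bomb survivors"}
hits, n_words = find_phrases("The A-D converter samples the signal.", lexicon, max_words=2000)
print(hits, n_words)
```

The returned occurrences and word count are what a scoring step would consume next.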

8 Scoring formula
Let:
– d be a document,
– c a category,
– t a term,
– l(t) the length of t,
– wn(t) the number of words in t,
– q(t,c) the weight of t for c,
– s(t,c) the strong-indicator value of t for c,
– T(c) the classification threshold for c,
– W = min(words in the document, maximum number of words processed).

$$\mathrm{Score}(d,c) = \frac{1}{W} \sum_{t \in d} \left( \frac{l(t)}{2} + (wn(t) - 1) \cdot 2 \right) q(t,c)$$

$$\text{Si-score}(d,c) = \sum_{t \in d} s(t,c)$$

Both sums run over the lexicon terms t found in d.
d is classified as c if and only if Si-score(d,c) > 1 and Score(d,c) > T(c).
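The formula can be sketched in a few lines of Python. Several points are assumptions rather than facts from the slides: l(t) is taken as the character length of the phrase, s(t,c) as 1 for a strong indicator and 0 otherwise, and every occurrence of a term contributes to the sums (in line with the per-occurrence counter on slide 3). The `Term` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    text: str                                    # phrase as matched in the document
    weights: dict = field(default_factory=dict)  # q(t, c): category -> weight
    strong: dict = field(default_factory=dict)   # s(t, c): category -> 0 or 1 (assumed)

def score(occurrences, category, n_words, max_words):
    """Score(d,c): weighted sum over term occurrences, divided by W."""
    w = min(n_words, max_words)
    if w == 0:
        return 0.0
    total = 0.0
    for t in occurrences:
        length = len(t.text)      # assumption: l(t) counts characters
        wn = len(t.text.split())  # wn(t): number of words in the term
        total += (length / 2 + (wn - 1) * 2) * t.weights.get(category, 0)
    return total / w

def si_score(occurrences, category):
    """Si-score(d,c): sum of strong-indicator values."""
    return sum(t.strong.get(category, 0) for t in occurrences)

def is_classified(occurrences, category, n_words, max_words, threshold):
    """d is classified as c iff Si-score > 1 and Score > T(c)."""
    return (si_score(occurrences, category) > 1
            and score(occurrences, category, n_words, max_words) > threshold)

term = Term("a d converter", weights={"eng": 1}, strong={"eng": 1})
doc_terms = [term, term]  # two occurrences in a 10-word document
result = score(doc_terms, "eng", n_words=10, max_words=2000)
print(result, is_classified(doc_terms, "eng", 10, 2000, threshold=1))
```

In this toy input each occurrence contributes 13/2 + (3 - 1) x 2 = 10.5, so two occurrences in a 10-word document give a score of 2.1.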

9 Classification lexicon
Format: TERM.INFO1/INFO2/...
INFO: TOPICS#FREQUENCY#QUALITY#LENGTH#TYPE#ALONE#OUTPUT
– TOPICS: MAIN:SUB
– FREQUENCY: 1 (not used)
– QUALITY: 0...9
– LENGTH (number of words)
– TYPE: 0..3
  0: genuine topic-subtopic indicator
  1: only to distinguish between subtopics, not indicating topic itself
  2: as 0, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
  3: as 1, but word is to be counted only if there are other phrases for same subtopic, with TYPE 0
– ALONE: 0/1 : strong indicator
– OUTPUT: Ø,$, PHRASE

10 Classification lexicon: example
– a vinculo matrimonii.18:0#1#0#3#0#0#$
– a-37 aircraft.14:0#1#1#3#0#1#a 37 aircraft
– a-address register.2:0#1#1#3#0#1#a address register
– a-bomb survivors.7:0#1#8#3#0#1#a bomb survivors
– a-c substitutions.15:0#1#8#3#0#1#a c substitutions/7:0#1#8#3#0#1#a c substitutions
– a-calcium-calmodulin kinase.11:0#1#8#4#0#1#a calcium-calmodulin kinase
– a-chromanoxyl radical.7:0#1#8#3#0#1#a chromanoxyl radical
– a-crystallin gene.15:0#1#8#3#0#1#a crystallin gene/7:0#1#8#3#0#1#a crystallin gene
– a-d conversion.3:0#1#1#3#0#1#a d conversion
– a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter
– a-d converter.13:0#1#1#3#0#1#a d converter/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter
– a-deficient mice.11:0#1#7#3#0#1#a deficient mice/15:0#1#8#3#0#1#a deficient mice
– a-delta activity.11:0#1#8#3#0#1#a delta activity
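Entries in this format can be taken apart as sketched below. The sketch assumes that the term itself contains no period and that OUTPUT contains no `/`; both hold for the entries shown, but neither is stated as a rule. `LexiconInfo` and `parse_entry` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class LexiconInfo:
    main_topic: int
    sub_topic: int
    frequency: int   # always 1 in the examples ("not used")
    quality: int     # 0...9
    length: int      # number of words
    kind: int        # TYPE field, 0..3
    alone: bool      # True if the term is a strong indicator
    output: str      # Ø, $ or a phrase

def parse_entry(line):
    """Split 'TERM.INFO1/INFO2/...' into the term and its INFO records."""
    term, _, rest = line.partition(".")  # assumption: no '.' inside TERM
    infos = []
    for chunk in rest.split("/"):        # assumption: no '/' inside OUTPUT
        topics, freq, quality, length, kind, alone, output = chunk.split("#", 6)
        main, sub = topics.split(":")
        infos.append(LexiconInfo(int(main), int(sub), int(freq), int(quality),
                                 int(length), int(kind), alone == "1", output))
    return term, infos

term, infos = parse_entry(
    "a-d converter.13:0#1#1#3#0#1#a d converter"
    "/3:0#1#1#3#0#1#a d converter/9:0#1#1#3#0#1#a d converter"
)
print(term, [info.main_topic for info in infos])
```

For this entry the term carries three INFO records, one per main topic (13, 3 and 9), all marked as strong indicators.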
