Thursday, October 12, 2017

CoreNLP


Step 1: tokenize the raw text (PTBTokenizer writes one token per line)

  java -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.process.PTBTokenizer 2-malware.txt > 2-malware.tok

  less 2-malware.tok

Step 2: mark every token with the background label O (token, tab, label):
  perl -ne 'chomp; print "$_\tO\n"' 2-malware.tok > 2-malware.tsv

Step 3: generate the training TSV file using a customized KEYWORD file

Relabel the tokens in 2-malware.tsv that match the keyword file and save the result as final.tsv.
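
The notes do not record how final.tsv was produced, so the following is only a minimal sketch of one way to do it. It assumes the keyword file holds one single-token keyword per line, that matching is case-insensitive, and that the entity label is MALWARE; the label and the function names are hypothetical, not taken from the original workflow.

```python
def load_keywords(lines):
    """Return a set of lower-cased keywords, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}


def label_tokens(tsv_lines, keywords, label="MALWARE"):
    """Relabel tokens found in `keywords`; leave all other lines unchanged."""
    out = []
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            out.append(line)
            continue
        # Each line from step 2 has the form "token<TAB>O".
        token, _, tag = line.partition("\t")
        if token.lower() in keywords:
            tag = label
        out.append(f"{token}\t{tag}")
    return out


def relabel_file(tsv_path, keyword_path, out_path, label="MALWARE"):
    """Read the step 2 TSV and a keyword file, then write the relabeled TSV."""
    with open(keyword_path, encoding="utf-8") as f:
        keywords = load_keywords(f)
    with open(tsv_path, encoding="utf-8") as f:
        labeled = label_tokens(f, keywords, label)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(labeled) + "\n")
```

With a keyword file named keywords.txt (a placeholder name), relabel_file("2-malware.tsv", "keywords.txt", "final.tsv") would write the training file. Multi-token keywords would need a different matching strategy, since this sketch compares one token at a time.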

Step 4: train the customized NER model using malware.prop (this creates malware-ner-model.ser.gz)

  java -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop malware.prop
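
The contents of malware.prop are not recorded in these notes. As a rough sketch, a properties file modeled on the example in the Stanford NER FAQ might look like the following; trainFile = final.tsv is an assumption based on step 3, and the feature flags are the FAQ's example settings rather than values known to have been used here.

```properties
# training file: token in column 0, label in column 1 (assumed to be final.tsv)
trainFile = final.tsv
# where to save the trained classifier; the .gz suffix gzips the file
serializeTo = malware-ner-model.ser.gz
map = word=0,answer=1

# order of the CRF
maxLeft=1

# features, as in the Stanford NER FAQ example
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
```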

Step 5: use the custom NER model to tag text

  java -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier malware-ner-model.ser.gz -testFile sample.tsv.ini
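
To pull the detected entities out of the step 5 output, something like the sketch below could help. It assumes the classifier output was redirected to a file and that each line is tab-separated with the token first and the predicted label last (with -testFile, a gold label column usually sits in between). The function name is hypothetical.

```python
def extract_entities(lines, background="O"):
    """Group consecutive tokens that share a predicted non-background label."""
    entities = []
    current_tokens, current_label = [], None
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Blank or malformed lines count as background and end any open entity.
        label = cols[-1] if len(cols) >= 2 and cols[0] else background
        if label != current_label and current_tokens:
            entities.append((" ".join(current_tokens), current_label))
            current_tokens = []
        if label != background:
            current_tokens.append(cols[0])
            current_label = label
        else:
            current_label = None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities
```

Adjacent tokens with the same label are merged into one entity, which is a simplification: two distinct entities that happen to be neighbors would be joined.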


--------------------------------------
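Other runs, on a second input file (jane-austen-emma-ch1.txt).

Tokenize:

  java -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.process.PTBTokenizer jane-austen-emma-ch1.txt > jane-austen-emma-ch1.tok

Run the full CoreNLP pipeline with JSON output (-mx15g allows a 15 GB Java heap):

  java -mx15g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file jane-austen-emma-ch1.txt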
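
The JSON written by the pipeline run can be read back with a few lines of Python. This is a sketch that assumes the usual CoreNLP JSON layout (a top-level "sentences" list whose items each carry a "tokens" list with "word" and "ner" fields) and that the output lands in jane-austen-emma-ch1.txt.json in the working directory; the function names are hypothetical.

```python
import json


def tokens_with_ner(doc):
    """Yield (word, ner) pairs from a parsed CoreNLP JSON document."""
    for sentence in doc.get("sentences", []):
        for token in sentence.get("tokens", []):
            # Tokens without an "ner" field are treated as background.
            yield token.get("word"), token.get("ner", "O")


def load_corenlp_json(path):
    """Load a CoreNLP JSON output file into a Python dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, list(tokens_with_ner(load_corenlp_json("jane-austen-emma-ch1.txt.json"))) would give every token paired with its NER tag.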