From IMC wiki

All files in PPAT repository:


Okay. So pipeline of stuff.

  1. First do typos/spelling for a clean(er) input file:
    • ppat/POLdata/POLTyposAnonymise.pl
    • The input file is set in the first line of the script: QMData_1123rec.xml.utf8.anon (needs changing to the correct file)
    • Produces the output file QMdata_Anon_Final.xml
  2. Overview data checked and manipulated in excel: AppointmentsNew.csv; CaseStatusAll.csv (Add fields AGENTID, APPORDER, ATTENDORDER; CaseData Calculate data)
  3. Extract data in formats useful for multiple analyses:
    • ppat/POLdata/ExtractPOL.pl
    • takes as input QMdata_Anon_Final.xml; AppointmentsNew.csv; CaseStatusAll.csv (stupidly format specific; should've made these more generalisable, but hey!)
    • Creates a bunch of outputs, not all of which actually get used: POL_WEKA_Data.csv, TOPIC/AllWords.txt, TOPIC/AgentWords.txt, TOPIC/ClientWords.txt, TOPIC/MatchedTranscript.txt, CaseDataWeka.csv, also a lot of raw data files by single transcript needed for topic stuff
  4. Topic:
    • First create the mallet file ppat/DataAnalysis/tools/mallet/createMalletPOLfile.bat (we've used AllBoW, when prompted)
    • Then train the topic model ppat/DataAnalysis/tools/mallet/TrainPOLwDiagnostics.bat (Num Topics: 20; AllBoW)
    • This spits out the mallet outputs to the ppat/POLdata/TOPIC/AllBoW folder
    • Then I do some reverse-engineering to generate the html files: ppat/POLdata/TOPIC/BackwardsEngineerTopicWordsToCreateHTML.pl. This is a bit ad hoc and has a crucial variable, $which = "AllBoW_20";, which must match the type and number of topics specified in the mallet step. (We also tried client and agent separately, and all words with indications as to who was speaking; AllBoW, i.e. All Bag of Words, is the one we used.)
    • ppat/POLdata/TOPIC/TopicProportion.pl creates a csv file of topic proportions by transcript and an html index file
  5. Sentiment: Matt ran this on the full transcripts; I did the average calculations by person and transcript using excel, I think, as I only had to do it the once... (sorry!)
  6. Combine data for analyses: Create a file with each transcript and associated values (inc topic proportions; sentiment averages; data from POL_WEKA_Data.csv). I probably did this in excel, too - end up with a .csv file that can be imported into weka or spss.
    • Horrible excel calculations file here: 2014-5-30_POL_WEKA_Data_COMBINED.xlsx (why didn't I just do a script...? Too late now!)
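
For anyone redoing steps 5 and 6 without excel, something like the following might do it. This is only a sketch, not what was actually run: the column names (TRANSCRIPT, SPEAKER, SENTIMENT, PHQ, TOPIC0) and the function names are made up for illustration, and the real files will need their own keys. It averages sentiment by transcript and person, then joins per-transcript tables on a shared transcript key.

```python
import csv
import io
from collections import defaultdict


def average_by(rows, key_fields, value_field):
    """Mean of value_field for each distinct combination of key_fields."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        totals[key][0] += float(row[value_field])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def merge_on_key(tables, key):
    """Inner-join lists of dict rows on `key`; later tables add their columns."""
    merged = {row[key]: dict(row) for row in tables[0]}
    for table in tables[1:]:
        lookup = {row[key]: row for row in table}
        # Keep only the transcripts that appear in every table so far.
        merged = {k: {**v, **lookup[k]} for k, v in merged.items() if k in lookup}
    return list(merged.values())


# Tiny inline stand-ins for the real files (all column names are hypothetical).
sentiment_rows = list(csv.DictReader(io.StringIO(
    "TRANSCRIPT,SPEAKER,SENTIMENT\n"
    "T1,agent,0.5\n"
    "T1,agent,1.5\n"
    "T1,client,-1.0\n"
)))
weka_rows = [{"TRANSCRIPT": "T1", "PHQ": "12"}, {"TRANSCRIPT": "T2", "PHQ": "7"}]
topic_rows = [{"TRANSCRIPT": "T1", "TOPIC0": "0.3"}]

averages = average_by(sentiment_rows, ("TRANSCRIPT", "SPEAKER"), "SENTIMENT")
combined = merge_on_key([weka_rows, topic_rows], "TRANSCRIPT")
```

The join is an inner join, so a transcript missing from any one table is dropped; that seemed the safer default for building a WEKA input, but it is a choice rather than something the original spreadsheet necessarily did.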


Open constructed csv file in WEKA - save as arff (all treatment sessions)

  • remove cases if status is not completed treatment or in treatment (removeDroppedOutEtc.settings)
  • remove attributes (depends on how the data file was constructed, but all GAD and WSAS measures for now; also agentid/caseref/apptref) 1-4,10-13,34-57
  • remove cases with missing values (for relevant outcome measure... - settings need adjusting for attribute index to e.g. PHQbinary)
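
The three cleaning steps above were done with WEKA filters and saved settings files. As a rough illustration of what they amount to, here is a plain Python sketch. It is not the WEKA implementation: the function names and the status column name are assumptions, and treating "" and "?" as missing is a guess at how the csv and arff mark missing values. The index list uses the same 1-based range notation as the attribute list in the bullet above.

```python
def parse_index_ranges(spec):
    """Turn a 1-based range list such as '1-4,10-13' into a set of 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        indices.update(range(start - 1, end))
    return indices


def keep_statuses(header, rows, status_column, allowed):
    """Keep only rows whose status (case-insensitive) is in `allowed`."""
    idx = header.index(status_column)
    return [row for row in rows if row[idx].strip().lower() in allowed]


def remove_attributes(header, rows, spec):
    """Drop the columns named by the 1-based range list `spec`."""
    drop = parse_index_ranges(spec)
    keep = [i for i in range(len(header)) if i not in drop]
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


def drop_missing(header, rows, column, missing=("", "?")):
    """Drop rows where `column` holds a missing-value marker."""
    idx = header.index(column)
    return [row for row in rows if row[idx].strip() not in missing]


# Hypothetical miniature data set; the real column names will differ.
header = ["AGENTID", "STATUS", "GAD", "PHQbinary"]
rows = [
    ["a1", "Completed treatment", "5", "yes"],
    ["a2", "Dropped out", "7", "no"],
    ["a3", "In treatment", "3", "?"],
]
rows = keep_statuses(header, rows, "STATUS", {"completed treatment", "in treatment"})
header, rows = remove_attributes(header, rows, "1,3")
rows = drop_missing(header, rows, "PHQbinary")
```

The order matters in the same way it does in WEKA: the index ranges refer to column positions before any attributes are removed, so removing attributes in two passes would need the second set of indices recalculated.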

What happens next depends a bit, as there are different arff files for different feature sets

  • Words/ngrams from Allwords (NominalToString first, then StringToWordVector using StringToWordVector.settings or StringToNGramVector.settings)
  • reorder (put PHQ whatever at end)
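
To give a feel for what the word/ngram step and the reorder produce, here is a small Python sketch. It is an approximation only: WEKA's own tokenising, stop-word handling and the options in the .settings files are not reproduced, the whitespace tokeniser and function names are assumptions, and PHQbinary is used as an example outcome column.

```python
from collections import Counter


def ngram_counts(text, n_min=1, n_max=1):
    """Count word n-grams of length n_min..n_max in lower-cased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def move_column_to_end(header, rows, name):
    """Reorder columns so that `name` (e.g. the outcome measure) comes last."""
    idx = header.index(name)
    order = [i for i in range(len(header)) if i != idx] + [idx]
    return [header[i] for i in order], [[row[i] for i in order] for row in rows]


# Example: unigrams and bigrams for one made-up utterance.
counts = ngram_counts("I feel low I feel", n_min=1, n_max=2)

# Example: put the outcome column last, as WEKA classifiers usually expect.
new_header, new_rows = move_column_to_end(
    ["PHQbinary", "w1", "w2"], [["yes", "1", "0"]], "PHQbinary"
)
```

Putting the outcome measure last matters because WEKA treats the last attribute as the class by default when nothing else is specified.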

Have done these...