Transcription Issues

Transferring text transcripts to ELAN

See this file: http://blogs.usyd.edu.au/elac/2008/11/how_to_import_a_basic_transcri.html

The files to import are tab-delimited text files, stored in the PPAT repository under /Transcripts/Amended/txt/Depression (see Mary or Chris).

  1. Open ELAN
  2. Open text file
    1. Choose File --> Import --> CSV/Tab-delimited Text file
    2. Find File, select Open
    3. In import options:
      • Column 1 = Tier
      • Column 2 = Annotation
      • Tick "Specify First row of data" and set it to 2
    4. Open
  3. IMPORTANT: Go to: Options --> Propagate Time Changes --> Bulldozer Mode
  4. Select Linked Media Files
    1. Edit --> Linked Files
    2. Add (find video file)
    3. Add (find audio file)
  5. Align annotations to audio
    1. Select the first annotation in the tier (the line under the text turns blue)
    2. Listen until you find the matching audio
    3. Select the segment of the sound wave that corresponds to the text
    4. Press ctrl (or cmd on Mac) + enter to move the annotation to the selection
      • Don't worry about getting it exactly right straight away; you can always move it again once everything is roughly where it is supposed to be.
  6. Repeat for each annotation
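
The import options in step 2 imply a two-column file: tier name in column 1, annotation text in column 2, with data starting on row 2 (so row 1 is presumably a header). A minimal sketch of producing such a file from numbered transcript lines is given below. It assumes each line has the form "<number> <speaker> <text>", uses the speaker identifier as the tier name, and the names `to_elan_rows` and `write_tab_delimited` and the header labels are hypothetical, not part of ELAN or the PPAT repository.

```python
import csv
import io
import re

# Assumed line shape: "<line number> <speaker id> <text>", e.g. "264 P um (.) my GP".
# An optional colon after the speaker id is tolerated.
LINE_RE = re.compile(r"^\s*(\d+)\s+([A-Za-z]+):?\s+(.*\S)\s*$")


def to_elan_rows(lines):
    """Return (tier, annotation) pairs; the speaker id is used as the tier name."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank lines and unnumbered lines such as pauses
        _number, speaker, text = match.groups()
        rows.append((speaker, text))
    return rows


def write_tab_delimited(rows, handle):
    """Write a header row, then one tab-separated row per annotation."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Tier", "Annotation"])  # row 1 is the header, so data starts at row 2
    writer.writerows(rows)


sample = [
    "264 P um (.) my [my GP]=",
    "265 C [optician]",
    "266 P =found it out",
]
buffer = io.StringIO()
write_tab_delimited(to_elan_rows(sample), buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`. Check the result against the files already in the repository before importing, since their exact column layout may differ from this sketch.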

Transcription into ELAN

See this file: http://fave.ling.upenn.edu/downloads/ELAN_Introduction.pdf

  • Overlap should also be explicitly marked in the text with square brackets [ ] surrounding the word(s) that overlap in all turns that contain the overlapping material
    • This should also be linear, as shown in the example in the Word transcripts section below

Key points for transcription in Word

NB: THESE POINTS ONLY APPLY TO THE WORD TRANSCRIPTS TO BE USED FOR EXTRACTING DATA (obviously you might need extra pointers for CA level transcription that doesn't affect these basic transcriptions)

The following is based on inconsistencies in existing transcriptions

  1. Numbering
    • Use Word's automatic numbering (i.e. do not type each line number separately)
  2. Line breaks
    • Do not press the return key just because the end of the line has been reached - a single numbered line may span more than one line in the Word file, but where it wraps depends arbitrarily on font size and margin width, so it shouldn't be a factor in our analysis
      • Use the return key before a change of speaker
      • or when the same speaker starts a new sentence (or sentence-like unit...)
  3. Naming conventions
    • Some transcripts use DRR/PPP/OOO, some use DR/PP/OO, some use C/P/A - all of these are fine BUT there have been some typos with the first two (e.g. DDR; 000 (zeros instead of Os)), so C/P/A might be easier.
    • The identifier (e.g. C/P/A) should be followed consistently by either a colon or a tab character, not '...' (which causes slight problems because of the way it is treated in Word)
    • All contributions (i.e. new sentence-like units) should start with the speaker's identifier - even if it is the same speaker as in the previous contribution (some files do this; some don't)
  4. Overlap
    • Overlap is marked by square brackets [] aligning the overlapping material, but this hasn't been done consistently - the overlap needs to be marked in both turns containing the overlapping material
    • Multiple overlaps in one line should not happen! In some transcripts they do, which means the turns are not transcribed in time-linear order (see the example below)

As is (non-linear):

264 P um (.) my [my GP] found it out Dr. (name) found it [out] looked into my eye
265 C [optician] [yeah]

As it should be (linear):

264 P um (.) my [my GP]=
265 C [optician]
266 P =found it out Dr. (name) found it [out]=
267 C [yeah]
268 P =looked into my eye

  5. Miscellaneous
    • Pauses between turns should go on a separate line, but that line should not be NUMBERED
    • There should be no headers or footers
    • Titles should be consistent - many are in the format: Video transcript AP016 (AC001): 23/06/2006 - this is good!
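
The naming and overlap conventions above lend themselves to a simple automated check once a transcript has been saved as plain text. The following is a minimal sketch, not an existing project tool: the function name `check_line`, the set of accepted identifiers, and the assumed line shape "<number> <speaker> <text>" are all illustrative assumptions. It flags unknown speaker identifiers (such as the DDR and 000 typos), unbalanced square brackets, and lines with more than one overlap.

```python
import re

# Assumed identifier sets, taken from the naming conventions listed above.
VALID_SPEAKERS = {"C", "P", "A", "DR", "PP", "OO", "DRR", "PPP", "OOO"}

# Assumed line shape: "<number> <speaker><optional colon><whitespace><text>".
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+?):?\s+(.*)$")


def check_line(line):
    """Return a list of problems found in one numbered transcript line."""
    match = LINE_RE.match(line)
    if match is None:
        return ["not in '<number> <speaker> <text>' form"]
    _number, speaker, text = match.groups()
    problems = []
    if speaker not in VALID_SPEAKERS:
        problems.append(f"unknown speaker identifier {speaker!r}")
    opens, closes = text.count("["), text.count("]")
    if opens != closes:
        problems.append("unbalanced square brackets")
    elif opens > 1:
        problems.append("more than one overlap in a line; split into linear turns")
    return problems


transcript = [
    "264 P um (.) my [my GP] found it out Dr. (name) found it [out] looked into my eye",
    "265 C [optician] [yeah]",
    "266 000 [yeah]",
]
for line in transcript:
    for problem in check_line(line):
        print(f"{line[:20]!r}: {problem}")
```

This only looks at one line at a time. Checking that an overlap is marked in both turns would need a comparison of adjacent lines, which is left out here.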