COMP 4446 / 5046 Assignment 2

Due: Monday, May 15th (i.e., the start of week 12 of semester)

In this assignment, you will learn about a key part of NLP: data annotation. This is often the
most critical part of work on a project. If you do not create accurate datasets for training and
evaluation then it doesn’t matter how good your model is, you will not be able to build an
effective system.

The assignment has a series of stages. Note that after stage 1 you need to wait for us to
send you a file before you can do stage 2 (and the rest of the assignment). We will respond
within 3 business days.

0 - Forming Groups of 1-3 students

You may work on your own or in a group of 2 or 3 students. You need to do two things in this
stage:

  1. Form groups on Canvas. See instructions here.
  2. Once you have your group, please write it down in this spreadsheet.

    Please record both the members in your group and the group ID you have in Canvas.

1 - Initial Annotation

In this section of the assignment, you will annotate data (~2,500 tokens of text) and develop
annotation guidelines that describe your annotation process. This annotation should be done
by you, not by an AI model.

  1. Download the data from here.
  2. Find the file that matches your group number (from the spreadsheet in stage 0).
  3. Read the initial annotation guide (Google Doc or Word Doc). Note, the guide has
    been updated to include that nested named entities should be annotated. The guide
    has an example.
  4. Each student in your group should independently, without discussion, annotate the
    file and keep notes on (a) examples of each category, and (b) explanations of what
    you chose to do in unusual cases, along with examples.

    Note: you may discuss technical decisions about what tool to use for annotation, how
    to set it up, debugging running it, etc.
  5. Store your annotations as a “.txt” file with one line per annotation, in this format:

    ((line start, token start), (line end, token end)) - label

    For example, if a Person entity spanned the first two tokens in the third line, you
    would have:

    ((2, 0), (2, 1)) - PER

    For items that span a single token you may save it in one of two ways (either is
    acceptable):

    ((0, 5), (0, 5)) - PER
    (0, 5) - PER

    Note:
    ○ The numbering starts from 0.
    ○ Tokens are specified by splitting the file on whitespace.
    ○ Blank lines count when determining the line number.
  6. Meet as a group and create a new version of the annotation guide that adds:
    ○ New examples of each category that come from your data.
    ○ Discussion of unusual cases, with the decisions each of you made and what
      your group has decided is the best approach in future.
    

To do the annotation, you may use any tool you like. We recommend SLATE, which follows
the file format described above. Some other free options are doccano and INCEpTION. If
you use a tool that has an auto-annotation mode or semi-automatic annotation mode (e.g.
brat and prodi.gy have such modes), please do not use it in this assignment. All annotations
should be done by you.
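
As an illustration of the stage 1 file format, the sketch below reads one annotation line (in either the full or the single-token form) and looks up the tokens it covers. It is a minimal sketch, not an official tool: the function names `parse_annotation`, `tokenize`, and `span_tokens` are hypothetical, and the sample text is made up. It assumes that indices are 0-based and inclusive, that tokens come from splitting each line on whitespace, and that blank lines keep their line number.

```python
import re

# Full form, e.g. "((2, 0), (2, 1)) - PER"
SPAN_RE = re.compile(
    r"^\(\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)\)\s*-\s*(\S+)$"
)
# Single-token short form, e.g. "(0, 5) - PER"
SINGLE_RE = re.compile(r"^\((\d+),\s*(\d+)\)\s*-\s*(\S+)$")


def parse_annotation(line):
    """Return ((line_start, tok_start), (line_end, tok_end), label)."""
    line = line.strip()
    m = SPAN_RE.match(line)
    if m:
        ls, ts, le, te = (int(g) for g in m.groups()[:4])
        return (ls, ts), (le, te), m.group(5)
    m = SINGLE_RE.match(line)
    if m:
        ln, tok = int(m.group(1)), int(m.group(2))
        return (ln, tok), (ln, tok), m.group(3)
    raise ValueError(f"Unrecognised annotation line: {line!r}")


def tokenize(text):
    """Split each line on whitespace; a blank line becomes an empty list."""
    return [ln.split() for ln in text.split("\n")]


def span_tokens(tokens, start, end):
    """Collect the tokens covered by an inclusive (line, token) span."""
    (ls, ts), (le, te) = start, end
    out = []
    for i in range(ls, le + 1):
        first = ts if i == ls else 0
        last = te if i == le else len(tokens[i]) - 1
        out.extend(tokens[i][first:last + 1])
    return out


# Made-up sample: the third line (index 2) follows a blank line.
text = "First line here\n\nAda Lovelace wrote notes"
start, end, label = parse_annotation("((2, 0), (2, 1)) - PER")
print(label, span_tokens(tokenize(text), start, end))  # PER ['Ada', 'Lovelace']
```

Normalising the single-token form to a full span at parse time means later comparisons only have to deal with one representation.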

Submit - https://canvas.sydney.edu.au/courses/48399/assignments/446897

(a) PDF of your annotation guide
(b) Text files containing annotations, one text file for each person in your group

2 - Adjudication and Refinement

In this section, you will adjudicate disagreements in the annotations of your file. If your group
has N members then you will be comparing N+1 annotation files (the extra one is the one we
provide to you). This adjudication should be done by you, not by an AI model.

  1. Download the annotations we provide and find the file for your group.
  2. Go through the annotations as a group and resolve every case where the
    annotations do not match. After doing this you should have a single file that is the
    agreed annotations.
  3. At the same time, add and remove examples and explanations from your annotation
    guide so that it explains your decisions.
    a. For content you want to remove from the annotation guide, draw a line
       through the text (i.e., a strikethrough).
    b. For content you want to add, include it in blue text so it is clearly different from
       the original text (which should be in black).
    

Note - you should not add or remove entire label types here. Always use the 6 types we
specified in the initial annotation guide. You are only changing the guide to clarify how to do
annotation for cases that might be ambiguous or tricky.
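
To find the cases that step 2 asks the group to resolve, it can help to list every annotation that is not shared by all N+1 files. The sketch below only locates the disagreements; deciding between them remains a manual task for the group. The function name `find_disagreements` and the annotator names are hypothetical, and each annotation is assumed to have already been read into a `((line_start, tok_start), (line_end, tok_end), label)` tuple.

```python
from collections import Counter


def find_disagreements(annotation_sets):
    """Map each non-unanimous annotation to the annotators who made it.

    annotation_sets maps an annotator name to a set of
    ((line_start, tok_start), (line_end, tok_end), label) tuples.
    """
    counts = Counter()
    for anns in annotation_sets.values():
        counts.update(set(anns))
    n = len(annotation_sets)
    disputed = {}
    for ann, count in counts.items():
        if count < n:  # at least one annotator is missing this annotation
            disputed[ann] = sorted(
                name for name, anns in annotation_sets.items() if ann in anns
            )
    return disputed


# Made-up example: everyone agrees on the first entity, not on the second.
agreed = ((0, 5), (0, 5), "PER")
long_span = ((2, 0), (2, 1), "PER")
short_span = ((2, 0), (2, 0), "PER")
sets = {
    "person_a": {agreed, long_span},
    "person_b": {agreed, long_span},
    "provided": {agreed, short_span},
}
for ann, who in sorted(find_disagreements(sets).items()):
    print(ann, who)
```

Because a match requires the same span and the same label, a boundary difference such as the one above shows up as two separate disputed entries.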

Submit - https://canvas.sydney.edu.au/courses/48399/assignments/452047

(a) PDF of the revised annotation guide with strikethroughs and blue text
(b) Text file of the final annotations

3 - Improved Annotation

Now, you will annotate another piece of text, using your revised guidelines. This annotation
should be done by you, not by an AI model.

  1. Download the data from here.
  2. Find the file that matches your group.
  3. Independently, without discussion, each student in your group should annotate the
    file.

Submit - https://canvas.sydney.edu.au/courses/48399/assignments/452048

(a) Text files of your annotations, one text file for each person

4 - Evaluation Metrics

In this section, you will implement a metric to see how consistent your annotations are.

  1. Implement F-Score (see lecture 8 or this Wikipedia article). Note, to be considered a
    match, an annotation must have the same span and the same label.
  2. Calculate F-Score for each pair of annotations in stage (1) of the assignment,
    including the annotations we provide. If you are working on your own this means
    calculating the F-Score between your annotations and the ones we provided. If you
    are in a group of 2 you will calculate three values (person A - person B), (person A -
    provided), (person B - provided). If you are in a group of 3 you will calculate six
    values.

    If you are working in a group with 2+ people then also calculate the average of the
    values.
  3. Repeat the previous step using the data from stage (3) of the assignment. If you are
    working on your own, you should compare with the output in this file.
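
The steps above can be sketched as follows. This is one possible reading rather than the reference solution: it assumes the balanced F1 variant (the harmonic mean of precision and recall), whereas lecture 8 may define the metric differently. The names `f_score` and `pairwise_f_scores` are hypothetical, and annotations are assumed to be `((line_start, tok_start), (line_end, tok_end), label)` tuples so that set intersection enforces the same-span, same-label rule.

```python
from itertools import combinations


def f_score(predicted, reference):
    """Balanced F1 between two collections of (start, end, label) tuples."""
    predicted, reference = set(predicted), set(reference)
    matches = len(predicted & reference)  # same span AND same label
    if matches == 0:
        # Convention chosen here: no overlap (including two empty sets) scores 0.
        return 0.0
    precision = matches / len(predicted)
    recall = matches / len(reference)
    return 2 * precision * recall / (precision + recall)


def pairwise_f_scores(annotation_sets):
    """F1 for every unordered pair of annotators, plus the average."""
    scores = {
        (a, b): f_score(annotation_sets[a], annotation_sets[b])
        for a, b in combinations(sorted(annotation_sets), 2)
    }
    average = sum(scores.values()) / len(scores) if scores else 0.0
    return scores, average


# Made-up example with two group members plus the provided annotations.
x = ((0, 5), (0, 5), "PER")
y = ((2, 0), (2, 1), "PER")
z = ((2, 0), (2, 0), "PER")
scores, average = pairwise_f_scores(
    {"person_a": {x, y}, "person_b": {x, z}, "provided": {x}}
)
for pair, value in scores.items():
    print(pair, round(value, 3))
print("average", round(average, 3))
```

With F1, swapping the two annotation sets swaps precision and recall but leaves the score unchanged, so each pair needs to be scored only once. A group of 3 plus the provided file gives 4 annotators and therefore 6 pairs.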

Submit - https://canvas.sydney.edu.au/courses/48399/assignments/452051

An ipynb file containing (a) your code for calculating the metrics and (b) the results of
your calculations in (2) and (3).

5 - Model Evaluation

In this section, you will measure the accuracy on your data of three widely-used models.

  1. Run Flair, SpaCy, and Stanza on your data. Note, you should use their 18-class NER
    models.

    The 18 classes those models produce include 5 of the ones we consider here. You
    should post-process the output of the models to remove cases where they use a
    label we are not using (e.g. TIME).
  2. Evaluate on your data from stage 2 (i.e., the adjudicated data).

    Note: You will need to map their output to our format. Sometimes tokenisation will not
    match up exactly. That’s okay - it will impact scores, but you will still be able to
    compare the three.
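
One way to do the post-processing and format mapping described above is sketched below. It deliberately avoids calling any of the three libraries and assumes only that each model's entities have already been extracted as `(char_start, char_end, label)` triples, with the end offset exclusive. The function names, the `keep_labels` set, and the `label_map` entry are hypothetical placeholders; the real label set is the one in the initial annotation guide.

```python
import re


def whitespace_token_offsets(text):
    """Yield (line_no, tok_no, char_start, char_end) for each whitespace token."""
    pos = 0  # character offset of the start of the current line
    for line_no, line in enumerate(text.split("\n")):
        for tok_no, m in enumerate(re.finditer(r"\S+", line)):
            yield line_no, tok_no, pos + m.start(), pos + m.end()
        pos += len(line) + 1  # +1 for the newline removed by split


def to_assignment_format(text, entities, keep_labels, label_map=None):
    """Convert (char_start, char_end, label) entities to stage 1 annotation lines."""
    label_map = label_map or {}
    offsets = list(whitespace_token_offsets(text))
    lines = []
    for start, end, label in entities:
        label = label_map.get(label, label)
        if label not in keep_labels:
            continue  # drop labels outside the scheme, e.g. TIME
        # Every whitespace token that overlaps the entity's character range.
        covered = [(ln, tk) for ln, tk, s, e in offsets if s < end and e > start]
        if not covered:
            continue
        (ls, ts), (le, te) = covered[0], covered[-1]
        lines.append(f"(({ls}, {ts}), ({le}, {te})) - {label}")
    return lines


# Made-up example; the label mapping and keep set are assumptions.
text = "Hello there\n\nAda Lovelace wrote at 5pm"
entities = [(13, 25, "PERSON"), (35, 38, "TIME")]
print(to_assignment_format(text, entities, {"PER"}, {"PERSON": "PER"}))
# ['((2, 0), (2, 1)) - PER']
```

Snapping each entity outward to whole whitespace tokens is a design choice: when a model splits off punctuation that the whitespace tokenisation keeps attached, the mapped span may differ from the adjudicated one, which is the tokenisation mismatch the note above anticipates.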

Submit - https://canvas.sydney.edu.au/courses/48399/assignments/452052

(a) Text files containing the output of the three models on your adjudicated data
(stage 2), in the format specified in stage 1
(b) A text file containing:

Flair-SCORE
SpaCy-SCORE
Stanza-SCORE

Where ‘SCORE’ is replaced by the F-Score for comparing the model’s output to your
adjudicated annotations.

6 - [Bonus] Competition

This is an optional section where you train models for this task. We will provide all of the
data students have submitted in stages 1, 2, and 3, which you can use for training. You will
be tested on a separate dataset annotated by the tutors.

More details of the competition will be released later. It will also have a deadline 1 week after
the main assignment deadline (May 22nd). There will be NO EXTENSIONS for the
competition.

The competition can either be completed in the same group as for the rest of the
assignment, or on your own.

If you receive bonus marks and they take your overall mark for the assignment over 20 (i.e.
100%) then the bonus can count to your overall non-exam course mark.

Mark Allocation

The table below shows the value of each section, broken down across the items you submit.

Section                            Value   Breakdown
1 - Initial Annotation             5       2 - Annotations
                                           2 - Annotation guide examples
                                           1 - Annotation guide explanations
2 - Adjudication and Refinement    5       2 - Adjudicated annotations
                                           3 - Annotation guide updates
3 - Improved Annotation            2       2 - Annotations
4 - Evaluation Metrics             4       2 - Code
                                           2 - Results for the two calculations using the code
5 - Model Evaluation               4       1 - Flair output
                                           1 - SpaCy output
                                           1 - Stanza output
                                           1 - Scores for the three models

Note: Your annotations MUST match the file format we specified in stage 1. If they do not,
you will score 0 for them.

Bonus points in the competition are awarded as follows:

  • Top 25% of entrants, +1 point
  • Top 10% of entrants, +2 points
  • Top 2 entrants, +3 points