Correcting some unknown Part of Speech tags #21
Merged
Some words are tagged with the unknown part-of-speech tag: "---------".
Many of these unknowns can be reasonably resolved and corrected.
This PR aims to improve POS tag quality for the Latin corpus.
People processing these tag sets will likely want to handle or drop the remaining unknowns,
such as unknowns between punctuation (i.e. alternate readings), as indicated below.
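For anyone who wants to drop the remaining unknowns, a minimal sketch in Python follows. It assumes the `<word>` elements sit inside a `<sentence>` wrapper (the wrapper and the helper name `split_known_unknown` are illustrative and not part of this PR); the sample words are the ones quoted in this description.

```python
import xml.etree.ElementTree as ET

# The nine-dash tag that marks an unknown part of speech.
UNKNOWN = "---------"

# Sample words from this PR, wrapped in an assumed <sentence> element.
SAMPLE = """<sentence id="1">
<word id="5" form="L" lemma="L1" postag="---------" relation="ATR" head="7"/>
<word id="6" form="." lemma="PERIOD1" postag="u--------" relation="AuxG" head="5"/>
<word id="7" form="Sullae" lemma="Sulla1" postag="n-s---mg-" relation="ATR" head="4"/>
</sentence>"""


def split_known_unknown(sentence):
    """Return (known_forms, unknown_forms) for one sentence element."""
    known, unknown = [], []
    for word in sentence.iter("word"):
        target = unknown if word.get("postag") == UNKNOWN else known
        target.append(word.get("form"))
    return known, unknown


known, unknown = split_known_unknown(ET.fromstring(SAMPLE))
print("kept:", known)
print("dropped:", unknown)
```

Note that dropping a word this way leaves any `head` attributes that point at it dangling, so a dependency-aware pipeline may prefer to relabel unknowns instead of removing them.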
Corrections:
1. Abbreviations before a period, e.g.:
<word id="5" form="L" lemma="L1" postag="---------" relation="ATR" head="7"/>
<word id="6" form="." lemma="PERIOD1" postag="u--------" relation="AuxG" head="5"/>
<word id="7" form="Sullae" lemma="Sulla1" postag="n-s---mg-" relation="ATR" head="4"/>
2. Obvious corruptions in the sequence of tags in the text, e.g.:
<word id="5" form="po" lemma="other" postag="---------" relation="XSEG" head="6"/>
<word id="6" form="puli" lemma="populus" postag="n-s---mg-" relation="ATR" head="4"/>
<word id="7" form="Romani" lemma="Romanus1" postag="a-s---mg-" relation="ATR" head="6"/>
or:
<word id="4" form="c" lemma="que1" postag="---------" relation="COORD" head="1"/>
<word id="5" form="ne" lemma="ne1" postag="d--------" relation="AuxZ" head="6"/>
...etc.
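The two kinds of correction above could be located with a rough triage pass like the one sketched here. This is not the method used for this PR, only an illustration: the heuristics (an unknown word followed by a "." token is a likely abbreviation, an unknown word with relation "XSEG" is a likely word fragment), the `<treebank>`/`<sentence>` wrappers, and the name `classify_unknowns` are all assumptions.

```python
import xml.etree.ElementTree as ET

UNKNOWN = "---------"

# Sample words from this PR; the wrapper elements are assumed.
SAMPLE = """<treebank>
<sentence id="a">
<word id="5" form="L" lemma="L1" postag="---------" relation="ATR" head="7"/>
<word id="6" form="." lemma="PERIOD1" postag="u--------" relation="AuxG" head="5"/>
<word id="7" form="Sullae" lemma="Sulla1" postag="n-s---mg-" relation="ATR" head="4"/>
</sentence>
<sentence id="b">
<word id="5" form="po" lemma="other" postag="---------" relation="XSEG" head="6"/>
<word id="6" form="puli" lemma="populus" postag="n-s---mg-" relation="ATR" head="4"/>
<word id="7" form="Romani" lemma="Romanus1" postag="a-s---mg-" relation="ATR" head="6"/>
</sentence>
<sentence id="c">
<word id="4" form="c" lemma="que1" postag="---------" relation="COORD" head="1"/>
<word id="5" form="ne" lemma="ne1" postag="d--------" relation="AuxZ" head="6"/>
</sentence>
</treebank>"""


def classify_unknowns(sentence):
    """Label each unknown-tagged word as a candidate for manual review."""
    words = sentence.findall("word")
    labels = []
    for i, word in enumerate(words):
        if word.get("postag") != UNKNOWN:
            continue
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word is not None and next_word.get("form") == ".":
            kind = "abbreviation"  # e.g. "L" before "."
        elif word.get("relation") == "XSEG":
            kind = "fragment"      # e.g. "po" before "puli"
        else:
            kind = "other"
        labels.append((word.get("form"), kind))
    return labels


root = ET.fromstring(SAMPLE)
labels = [pair for s in root.iter("sentence") for pair in classify_unknowns(s)]
print(labels)
```

The labels only narrow down what to look at; the replacement tag for each word still has to be decided by reading the passage.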
Items not corrected:
<word id="1" form="ma" lemma="other" postag="---------" relation="AuxZ" head="3"/>
<word id="2" form="/" lemma="SLASH1" postag="u--------" relation="AuxG" head="1"/>
<word id="3" form="deia" lemma="other" postag="---------" relation="AuxY_ExD_OBJ" head="0"/>
<word id="4" form="perima" lemma="other" postag="---------" relation="AuxZ" head="6"/>
<word id="21" form="[" lemma="bracket1" postag="u--------" relation="AuxG" head="22"/>
<word id="22" form="dict" lemma="other" postag="---------" relation="XSEG" head="24"/>
<word id="23" form="]" lemma="bracket1" postag="u--------" relation="AuxG" head="22"/>
<word id="24" form="ita" lemma="ita1" postag="d--------" relation="PRED_CO" head="19"/>
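Uncorrected items like these can be flagged before further processing. The sketch below marks unknown-tagged words that sit directly next to a punctuation token; it assumes that punctuation is any word whose postag starts with "u" (as in the samples here), that the two groups above come from separate sentences, and that words live inside `<sentence>` elements. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

UNKNOWN = "---------"

# Sample words from this PR; wrappers and sentence split are assumed.
SAMPLE = """<treebank>
<sentence id="a">
<word id="1" form="ma" lemma="other" postag="---------" relation="AuxZ" head="3"/>
<word id="2" form="/" lemma="SLASH1" postag="u--------" relation="AuxG" head="1"/>
<word id="3" form="deia" lemma="other" postag="---------" relation="AuxY_ExD_OBJ" head="0"/>
<word id="4" form="perima" lemma="other" postag="---------" relation="AuxZ" head="6"/>
</sentence>
<sentence id="b">
<word id="21" form="[" lemma="bracket1" postag="u--------" relation="AuxG" head="22"/>
<word id="22" form="dict" lemma="other" postag="---------" relation="XSEG" head="24"/>
<word id="23" form="]" lemma="bracket1" postag="u--------" relation="AuxG" head="22"/>
<word id="24" form="ita" lemma="ita1" postag="d--------" relation="PRED_CO" head="19"/>
</sentence>
</treebank>"""


def is_punct(word):
    """True if the neighbour exists and carries a punctuation postag."""
    return word is not None and word.get("postag", "").startswith("u")


def unknowns_near_punctuation(sentence):
    """Forms of unknown-tagged words adjacent to a punctuation token."""
    words = sentence.findall("word")
    flagged = []
    for i, word in enumerate(words):
        if word.get("postag") != UNKNOWN:
            continue
        prev_word = words[i - 1] if i > 0 else None
        next_word = words[i + 1] if i + 1 < len(words) else None
        if is_punct(prev_word) or is_punct(next_word):
            flagged.append(word.get("form"))
    return flagged


root = ET.fromstring(SAMPLE)
flagged = {s.get("id"): unknowns_near_punctuation(s) for s in root.iter("sentence")}
print(flagged)
```

Adjacency is a deliberately loose test: "perima" is not flagged because neither neighbour is punctuation, so a stricter or wider window may suit other passages better.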
Texts used:
Source citations; corrections against Perseus texts
J. T. Ramsey, Sallust’s Bellum Catilinae, 2nd ed., 2007
Geoffrey Steadman, Sallust’s Bellum Catilinae, 2018
Geoffrey Steadman, Petronius’ Cena Trimalchionis, 2018