Textmining literature resources, target text processing patterns

Jen Hammock edited this page Jul 30, 2021 · 1 revision

Species section target text (everything between the start and stop patterns, or appended by an append pattern) should be mined with the Habitat, Geography, Growth Form and Life History ontologies, and with the Size and Associates target patterns. List target text should be mined with the Habitat, Geography and Growth Form ontologies. Unlike the other target data, Size and Associates data cannot be detected by the presence of a single term, because the terms of interest are not sufficiently distinctive. Instead, records are detected when terms of interest appear in their own stereotyped patterns.

Size patterns

  • [Body Part term] [up to 10 intervening words and no sentence break] [number or number range] [units term] [dimension term]
  • [Body Part term] [dimension term (noun form)] [up to three words and/or a colon and/or a dash] [number or number range] [units term]
  • newline [number or number range] [units term] [dimension term] (whole organism measurement is implied for this pattern)

Combinations producing records (units and dimension terms must agree; pattern 1 is shown, but the same combinations apply to all patterns):

  • "plant"..."high"
  • ["leaf", "rhizome", "body", "leg", "wing", "tarsus", "antenna"]... "long"
  • ["leaf", "body", "wing"]... "wide"
  • ["body", "cell", "bell", "colony"]... "in diameter"
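As a rough illustration of pattern 1, the sketch below matches a body part term, allows up to 10 intervening words with no sentence break, and then requires a number or number range, a units term and a dimension term. It keeps a record only when the body part and dimension form one of the combinations listed above. The function name `find_sizes`, the small units vocabulary, the number syntax and the optional plural "s" are assumptions made for the example, not part of the specification. It also treats any period as a sentence break, so abbreviations such as "ca." would block a match.

```python
import re

# Body part / dimension combinations, taken from the list above.
ALLOWED = {
    "high": {"plant"},
    "long": {"leaf", "rhizome", "body", "leg", "wing", "tarsus", "antenna"},
    "wide": {"leaf", "body", "wing"},
    "in diameter": {"body", "cell", "bell", "colony"},
}

# Assumption: an illustrative units vocabulary; the real term list is not given here.
UNITS = r"(?:mm|cm|m|µm)"
NUMBER = r"\d+(?:\.\d+)?"
NUM_OR_RANGE = rf"{NUMBER}(?:\s*(?:-|–|to)\s*{NUMBER})?"

BODY_RE = "|".join(sorted({p for parts in ALLOWED.values() for p in parts}))
DIM_RE = "|".join(re.escape(d) for d in ALLOWED)

# Body part term (optional plural "s" is an assumption).
PART = re.compile(rf"\b(?P<part>{BODY_RE})s?\b", re.IGNORECASE)

# Up to 10 intervening words without ., ! or ?, then number, units, dimension.
TAIL = re.compile(
    rf"[,;:]?(?:\s+[^\s.!?]+){{0,10}}?"
    rf"\s+(?P<value>{NUM_OR_RANGE})\s*(?P<units>{UNITS})\s+(?P<dim>{DIM_RE})\b",
    re.IGNORECASE,
)

def find_sizes(text):
    """Return (body part, value, units, dimension) tuples for pattern 1."""
    records = []
    for m in PART.finditer(text):
        tail = TAIL.match(text, m.end())  # anchor the rest right after the body part
        if tail is None:
            continue
        part = m.group("part").lower()
        dim = tail.group("dim").lower()
        if part in ALLOWED[dim]:  # keep only listed combinations
            records.append((part, tail.group("value"), tail.group("units"), dim))
    return records

if __name__ == "__main__":
    print(find_sizes("Plant to 30 cm high. Leaf blade usually 3-5 cm long, 1 cm wide."))
```

Each body part occurrence is tried separately, so a non-agreeing body part earlier in the sentence does not hide a valid one later. Only the first measurement after each body part is captured in this sketch.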

Associates patterns

  • newline "Associates" [up to four punctuation marks, and/or up to two words] [Taxon name]
  • newline ["Host", "Hosts", "Host plant", "Host plants"] [up to four punctuation marks, and/or up to two words] [Taxon name]
  • newline ["On", "Found on"] [up to four punctuation marks, and/or up to two words] [Taxon name] (use this pattern selectively; it is off by default. It is appropriate for documents known to describe obligate parasites or other symbionts, sessile microbes or fungi)
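
The three Associates patterns can be sketched in a similar way. The example below anchors each label at the start of a line, allows a gap of up to four punctuation marks and up to two words, and then expects a taxon name. Taxon name recognition is not attempted here: as a loudly labeled assumption, a simple Latin binomial (capitalized genus followed by a lowercase epithet) stands in for it, so a capitalized ordinary word followed by a lowercase word would be misread as a name. The function name `find_associates` and the `include_on` flag are hypothetical; the flag mirrors the "default off" note on the third pattern.

```python
import re

# Labels from the patterns above; longer labels first so they win the alternation.
DEFAULT_LABELS = ["Associates", "Host plants", "Host plant", "Hosts", "Host"]
OPTIONAL_LABELS = ["Found on", "On"]  # selective pattern, off by default

# Assumption: a Latin binomial stands in for real taxon name recognition.
TAXON = r"\b(?P<taxon>[A-Z][a-z]+ [a-z]+)\b"

# A gap token is either one punctuation mark or one whole word.
TOKEN = r"(?:[^\w\s]|\b[A-Za-z]+\b)"
# At most 4 + 2 = 6 tokens; the exact per-type limits are checked in Python.
GAP = rf"(?P<gap>(?:[ \t]*{TOKEN}){{0,6}}?)"

def find_associates(text, include_on=False):
    """Return (label, taxon) tuples for lines matching the Associates patterns."""
    labels = DEFAULT_LABELS + (OPTIONAL_LABELS if include_on else [])
    alternation = "|".join(re.escape(label) for label in labels)
    line_re = re.compile(
        rf"^(?P<label>{alternation})\b{GAP}[ \t]*{TAXON}", re.MULTILINE
    )
    records = []
    for m in line_re.finditer(text):
        gap = m.group("gap")
        n_words = len(re.findall(r"[A-Za-z]+", gap))
        n_punct = len(re.findall(r"[^\w\s]", gap))
        if n_words <= 2 and n_punct <= 4:  # enforce the stated gap limits
            records.append((m.group("label"), m.group("taxon")))
    return records

if __name__ == "__main__":
    sample = (
        "Description\n"
        "Associates: Quercus alba\n"
        "Host plants - usually Pinus strobus.\n"
        "On Fagus grandifolia\n"
        "Hosts unknown"
    )
    print(find_associates(sample))
    print(find_associates(sample, include_on=True))
```

Keeping the "On" and "Found on" labels behind a flag lets the caller switch them on only for documents known to describe obligate parasites or other symbionts, sessile microbes or fungi.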