Named entity recognition (NER) is a task in natural language processing: pull named things out of text. It sounds trivial at first. Just create a giant list of named things and compare against that.
But suppose, for example, University of Texas is on your list. If Texas is also on your list, do you report that you have a named entity inside a named entity? And how do you handle The University of Texas? Do you put it on your list as well? What about UT? Can you tell from context whether UT stands for University of Texas, University of Toronto, or the state of Utah?
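To make the nesting problem concrete, here is a minimal sketch of the giant-list approach in plain Python. The three-entry list and the helper names `naive_matches` and `longest_only` are made up for illustration; keeping only the longest match is one common workaround, not a claim about how any particular NER library resolves overlaps.

```python
import re

# Hypothetical gazetteer; a real one would hold many thousands of names.
GAZETTEER = ["University of Texas", "Texas", "Rice University"]

def naive_matches(text, names=GAZETTEER):
    """Return (start, end, name) for every occurrence of every listed name."""
    hits = []
    for name in names:
        for m in re.finditer(re.escape(name), text):
            hits.append((m.start(), m.end(), name))
    return sorted(hits)

def longest_only(hits):
    """Drop any hit whose span lies inside another hit's span."""
    return [
        h for h in hits
        if not any(o != h and o[0] <= h[0] and h[1] <= o[1] for o in hits)
    ]

sentence = ("Researchers from the University of Texas at Austin "
            "met colleagues from Rice University.")

hits = naive_matches(sentence)
print([name for _, _, name in hits])
# ['University of Texas', 'Texas', 'Rice University']

print([name for _, _, name in longest_only(hits)])
# ['University of Texas', 'Rice University']
```

The first result reports Texas as an entity inside University of Texas. The second hides it, but nothing here helps with UT, with a leading The, or with rice the grain.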
Searching for Rice University would be even more fun. The original name of the school was The William Marsh Rice Institute for the Advancement of Letters, Science, and Art. I don’t know whether the name was ever officially changed. A friend who went to Rice told me they had a ridiculous cheer that spelled out every letter in the full name. And of course rice could refer to a grain.
Let’s see what happens when we run the following sentence through spaCy looking for named entities.
Researchers from the University of Texas at Austin organized a pickleball game with their colleagues from Rice University on Tuesday.
I deliberately did not capitalize the definite article in front of University of Texas because I suspected spaCy might include the article if it were capitalized but not otherwise. It included the article in either case.
The results depend on the language model used. When I used the en_core_web_trf model, it included “at Austin” as part of the university name.
When I used the smaller en_core_web_sm model, it pulled out Austin as a separate entity.
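A comparison like this can be run with a few lines of Python. The following is a sketch rather than the exact code behind the results above. It assumes spaCy is installed and that the models were downloaded separately (for example with `python -m spacy download en_core_web_sm`). The helper name `entities_by_model` is made up, and models that are not installed are skipped.

```python
import spacy
from spacy.util import is_package

# Model names taken from the text; each must be downloaded separately.
MODELS = ["en_core_web_trf", "en_core_web_sm"]

def entities_by_model(text, models=MODELS):
    """Map each installed model name to its list of (entity text, label) pairs."""
    results = {}
    for name in models:
        if not is_package(name):
            continue  # model not installed, so skip it rather than fail
        nlp = spacy.load(name)
        results[name] = [(ent.text, ent.label_) for ent in nlp(text).ents]
    return results

sentence = ("Researchers from the University of Texas at Austin organized "
            "a pickleball game with their colleagues from Rice University "
            "on Tuesday.")

for name, ents in entities_by_model(sentence).items():
    print(name)
    for text, label in ents:
        print(f"  {label:6} {text}")
```

No expected output is shown because the entities found depend on the model and its version.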
The tag ORG stands for organization and DATE obviously stands for date. GPE is a little less obvious, standing for geopolitical entity.
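If a tag is unfamiliar, spaCy can describe it. The short sketch below uses `spacy.explain`, which looks a label up in spaCy's built-in glossary and needs no model download. PERSON is included on the assumption that it is the label the English models use for people.

```python
import spacy

# ORG, DATE and GPE come from the text; PERSON is assumed to be the
# label the English models use for people.
labels = ["ORG", "DATE", "GPE", "PERSON"]

for label in labels:
    # spacy.explain returns a short description, or None for unknown labels
    print(f"{label:7} {spacy.explain(label)}")
```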
When I changed Rice University to simply Rice, spaCy still recognized Rice as an organization. When I changed it to rice with no capitalization, it did not recognize it as an organization.
The other day I stress tested spaCy by giving it some text from Chaucer’s Canterbury Tales. Even though spaCy is trained on Modern English, it did better than I would have expected on Middle English.
With the en_core_web_trf model, spaCy recognized Engelond and Caunterbury as cities.
When I switched to en_core_web_sm, it still recognized Caunterbury as a city, but tagged Engelond as a person.
The post Named entity recognition first appeared on John D. Cook.