Person name cleaning

23 05 2008

In this article, David Loshin explains how complex a person name can be. In order to clean a database of person names, a first step is to standardize these names. The article starts with a simple pattern like “first name, last name” and ends up with more than 850 patterns used in one of D. Loshin’s projects.

The most common types of token appearing in a name are:

  • a first name (John)
  • a last name (Neumann)
  • an initial (L.)
  • a generational token (I, II, Jr, …)
  • a title (Mr, Prof., Dr…)
  • a prefix (von, da, di…)
  • a suffix (PhD, ESQ,…)

These tokens can appear several times in a string identifying a person. For example, a person can have several first names, or a last name made of several words. Sometimes several persons are identified together in one record, for example in bank accounts. There are also cultural differences in name composition to handle. Standardization consists of identifying the pattern that matches a given record and then extracting each token. Once this is done, we can start thinking about data quality…
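To make the idea concrete, here is a minimal Python sketch of the two steps described above: classify each token, then derive the record's pattern and extract the tokens into fields. It is not the approach from D. Loshin's article. The lookup tables, the function names `classify` and `standardize`, and the rule that the last plain name token is the last name are all assumptions made for illustration, and they only cover simple "first name, last name" ordering for a single person.

```python
# Illustrative lookup tables (assumptions, far from exhaustive).
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
GENERATIONAL = {"i", "ii", "iii", "iv", "jr", "sr"}
PREFIXES = {"von", "van", "da", "di", "de"}
SUFFIXES = {"phd", "esq", "md"}


def classify(token: str) -> str:
    """Return a coarse token type for one whitespace-separated token."""
    key = token.rstrip(".,").lower()
    if key in TITLES:
        return "TITLE"
    if key in PREFIXES:
        return "PREFIX"
    if key in SUFFIXES:
        return "SUFFIX"
    # A single letter followed by a period is treated as an initial,
    # so "I." is an initial while a bare "I" is generational.
    if len(key) == 1 and key.isalpha() and token.endswith("."):
        return "INITIAL"
    if key in GENERATIONAL:
        return "GENERATIONAL"
    # Anything else is assumed to be a first or last name.
    return "NAME"


def standardize(raw: str) -> dict:
    """Identify the pattern of a name string and extract its tokens."""
    tokens = [t.strip(",") for t in raw.split()]
    tokens = [t for t in tokens if t]
    types = [classify(t) for t in tokens]

    fields = {
        "pattern": " ".join(types),
        "title": [], "first": [], "initial": [], "prefix": [],
        "last": [], "generational": [], "suffix": [],
    }
    # Assumption: the last NAME token is the last name, earlier ones are first names.
    name_positions = [i for i, t in enumerate(types) if t == "NAME"]
    last_pos = name_positions[-1] if name_positions else None

    for i, (tok, typ) in enumerate(zip(tokens, types)):
        if typ == "NAME":
            fields["last" if i == last_pos else "first"].append(tok)
        else:
            fields[typ.lower()].append(tok)
    return fields


result = standardize("Dr. John L. von Neumann Jr PhD")
print(result["pattern"])
# prints: TITLE NAME INITIAL PREFIX NAME GENERATIONAL SUFFIX
```

Each distinct pattern string produced this way corresponds to one name layout. Multi-word last names, several persons in one record and culture-specific orderings would each need extra rules, which gives an idea of how a real project ends up with hundreds of patterns.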

3 responses

23 05 2008
Pierrick LE GALL

I don’t know if you’ve discussed this with Richard, but he has coded a tParseName Perl component, which will be available in the upcoming Talend Open Studio 2.4.0. The component is based on the Lingua::EN::NameParse Perl CPAN module:

http://search.cpan.org/perldoc?Lingua::EN::NameParse

28 05 2008
scorreia

Hi Pierrick,
Thanks for the information, I did not know about this new component. I will speak with Richard about this.
