How to detect random text in a free text field?

13 03 2009

As suggested by Pierrick 😉 , I start this post with a picture:

Here is a tip for detecting random text with regular expressions. But before I give the expression, let me explain when and why it is useful.

Nowadays, we find web forms almost everywhere on the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other checks can also be done, such as validating the address, the country, the phone number format and so on.
Sometimes these checks are not done, either because the field is a free text field or simply because the developer of the web site did not implement them. In this case, the user can type anything in the text area, and then anything can end up in the database.

If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away and does not fill in the form, or he types something like “lmqkjfdmklgj”.

If you are doing data quality work, you may need to identify this kind of bad data. The question is: how do you detect something that can be anything but clearly has no meaning?
There are several possible solutions. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary, and it should be rather complete in order not to miss some words. Then how do you handle proper names?
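To make the dictionary approach and its weakness concrete, here is a minimal sketch in Python. The word list `KNOWN_WORDS` and the function `unknown_words` are hypothetical names invented for this illustration; a real check would need a far more complete dictionary.

```python
import re

# Hypothetical, deliberately tiny word list (an assumption for illustration).
# A real check would load a complete dictionary for the language of the data.
KNOWN_WORDS = {"data", "quality", "open", "profiler"}

def unknown_words(text, dictionary=KNOWN_WORDS):
    """Return the lowercased words of `text` that are not in the dictionary."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [w for w in words if w not in dictionary]

print(unknown_words("Open data quality"))  # every word is known
print(unknown_words("lmqkjfdmklgj"))       # random text is reported
print(unknown_words("Pierrick"))           # but so is a proper name
```

The last call shows the problem raised above: a proper name is reported as unknown just like random text.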

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an AZERTY keyboard as in the picture above) I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is with the left index finger on the “F” key and the right index finger on the “J” key. These keys are on the third row of keys (called the home row). This means that when you type something randomly, there is a good chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.

Given these considerations, random text can be treated as a string of characters without any vowel. The regular expression to match it can be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}

This expression matches any run of 4 or more consecutive consonants. That may not be strict enough, because some real words will be matched by this expression. For example, it matches the word “length” (because of “ngth”).
Either you can require at least 5 consecutive consonants or you can restrict the expression to the letters of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}
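
As a quick way to try both expressions outside of a profiling tool, here is a minimal sketch in Python. The names `CONSONANT_RUN`, `HOME_ROW_RUN` and `looks_random` are made up for this example; only the two patterns come from the post.

```python
import re

# Runs of 4 or more consonants (the broader pattern from the post).
CONSONANT_RUN = re.compile(r"[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}")
# Runs of 3 or more letters from the AZERTY home row (the narrower pattern).
HOME_ROW_RUN = re.compile(r"[qsdfghjklmQSDFGHJKLM]{3,}")

def looks_random(text, pattern=HOME_ROW_RUN):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None

for sample in ["length", "lmqkjfdmklgj", "Talend"]:
    print(sample,
          looks_random(sample, CONSONANT_RUN),
          looks_random(sample, HOME_ROW_RUN))
```

With these samples, “length” is caught by the consonant pattern but not by the home row pattern, “lmqkjfdmklgj” is caught by both, and “Talend” by neither.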

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it from Talend Exchange.

On English keyboards, the vowel “a” appears in the home row. This adds some difficulty because there are probably several words that can be formed with “a” and the other letters of the home row. I leave it to you to adapt the regular expression to your needs and keyboard…
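
One possible adaptation, sketched in Python, is to drop the “a” from the character class and require a slightly longer run. Both choices are assumptions made for this illustration, not something tested on real data, and the names `QWERTY_HOME_ROW_RUN` and `looks_random_qwerty` are hypothetical.

```python
import re

# Assumption: QWERTY home row letters without the vowel "a",
# with a minimum run of 4 to limit matches on real words.
QWERTY_HOME_ROW_RUN = re.compile(r"[sdfghjkl]{4,}", re.IGNORECASE)

def looks_random_qwerty(text):
    """Return True if text contains 4+ consecutive QWERTY home row consonants."""
    return QWERTY_HOME_ROW_RUN.search(text) is not None

for sample in ["asdfgh", "sdfsdf", "glass", "hall"]:
    print(sample, looks_random_qwerty(sample))
```

Here “asdfgh” and “sdfsdf” are flagged, while “glass” and “hall” are not, because the “a” breaks the run of home row consonants.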


3 responses

13 03 2009
Checking data quality in text fields with RegExp - dijit

[…] approach for searching text input for deliberately wrong entries was published by my colleague Sebastiao on his blog. In doing so, he takes advantage of a very interesting fact: that […]

1 07 2009
bronius

Ha! Clever idea. The general concept is “detecting a number of characters entered whose keys are in physical proximity to one another”.

I like it. How is it working out for you in practice?

2 07 2009
scorreia

Yes, the physical proximity of the keys is what gave me this idea. But it’s also because on the French keyboard, there is no vowel on the home row. Hence words created with only the keys of the home row have a low probability of being valid words.

I have run some tests on real data with emails and company names. I found that a few percent of the data was invalid.

For example, I have found the following invalid emails:
szdfsdf@dfsdf.sdfsd
gddggh@nfngn.fhgfh
bnmbnm@gffdg.grgtrg
cdcdc@vde.dcdcw
fdghfdgh@fghfg.dfghfd
ff@fgfh.gbvgg
adsdc@dscsdcdsc.cddcsc
adsdc@dscsdcdsc.cddcsc
rhhtyhtyhtyh@rthgrthr.frtrgtr
vbcvb@cvbcvb.cvbcv
sdf@adf.pldff
ggugiug@ghghg.ckcjf
hkl@domain.sdfgsdfg

and for the companies:
fghfg
ssdfgfsdgvgx
SDCDSC
dslck
dscsd
sdcfscds
fdvdfv
xcvxcvx
dfsdf
sqsqsq
xvxvxv
sdfsdf
sddssdsdsd
cqwxd
ANPCyT
gdfgdf
fffhk
drgdrg
drgdrg
hjhkjhkj
hbhbjhb
hbhbjhb
nttcw
sgdfg
BCBSF
asdgggs
ytyrtytttttttttttttttttttttt
qwdqw
dddqd
sdfsdfsdfsdf
gfjhfg

In fact, I had to change my regular expression and search for 5 consecutive consonants instead of 4. The reason is that some company names have 4 consecutive consonants. For example: SNCF, and many German company names contain the abbreviation GmbH.

But when I use my smaller regular expression limited to the home row keys, these terms do not match.
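
The effect of these two choices can be checked with a short Python sketch. The names `FOUR_CONSONANTS`, `FIVE_CONSONANTS`, `HOME_ROW_RUN` and `matches` are made up for this example; the character classes are the ones from the post, with the minimum run length changed from 4 to 5 in the second pattern.

```python
import re

CONSONANTS = "zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN"
FOUR_CONSONANTS = re.compile("[" + CONSONANTS + "]{4,}")
FIVE_CONSONANTS = re.compile("[" + CONSONANTS + "]{5,}")
HOME_ROW_RUN = re.compile(r"[qsdfghjklmQSDFGHJKLM]{3,}")

def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None

# Two legitimate terms, then two invalid company values from the list above.
for term in ["SNCF", "GmbH", "fghfg", "sdfsdf"]:
    print(term,
          matches(FOUR_CONSONANTS, term),
          matches(FIVE_CONSONANTS, term),
          matches(HOME_ROW_RUN, term))
```

SNCF and GmbH match only the 4-consonant pattern, while the two invalid values match all three patterns.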
