As suggested by Pierrick, I start this post with a picture:

Here is a tip for detecting random text with regular expressions. But before I give the expression, let me explain when it is useful and what for.
Nowadays, we find web forms almost everywhere on the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other checks can also be performed, for example on the address, the country, the phone number format and so on.
Sometimes these checks are not done, either because the field is a free-text field or simply because the developer of the website did not implement them. In this case, the user can type anything in the text area, and anything can then end up in the database.
If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away without filling in the form, or they type something like “lmqkjfdmklgj”.
If you are doing data quality work, you may need to identify this kind of bad data. The question is: how do you detect something that can be anything but clearly has no meaning?
There are several possible solutions to this question. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary, and it should be fairly complete so that it does not miss words. But then how do you handle proper names?
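To make the dictionary idea concrete, here is a minimal Python sketch. The word list `KNOWN_WORDS` and the function `share_of_known_words` are hypothetical names used only for illustration, and the list is deliberately tiny; a real check would need a far more complete dictionary.

```python
import re

# Hypothetical, deliberately tiny word list (an assumption for this sketch).
# A real check would load a complete dictionary for the target language.
KNOWN_WORDS = {"main", "street", "paris", "john"}

def share_of_known_words(text: str) -> float:
    """Return the fraction of words in `text` that appear in KNOWN_WORDS."""
    # Extract runs of letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    return sum(word in KNOWN_WORDS for word in words) / len(words)

print(share_of_known_words("Main Street"))   # every word is known
print(share_of_known_words("lmqkjfdmklgj"))  # no word is known
print(share_of_known_words("Main Dupont"))   # the proper name is not in the list
```

The last call shows the weakness mentioned above: a perfectly valid proper name lowers the score just as random typing would.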
Another solution is to use regular expressions. But what would be the regular expression that matches random text?
When I looked at my keyboard (an AZERTY keyboard, as in the picture above), I saw that all the vowels are on the second row (counting the number row as the first). Moreover, the default starting position of the hands on the keyboard is with the left index finger on the “F” key and the right index finger on the “J” key. These keys are on the third row of keys (called the home row). This means that when you want to type something randomly, there is a good chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.
Given these considerations, a random text can be approximated as a string of characters without any vowel. The regular expression to match it can be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}
This expression matches any run of four or more consecutive consonants. That may not be strict enough, since some real words will also be matched. For example, it matches the word “length” (through “ngth”).
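As a quick way to try this expression outside a profiling tool, here is a small Python sketch. The function name `looks_random` and the sample strings are my own choices for illustration; only the pattern itself comes from this post.

```python
import re

# Pattern from the post: four or more consecutive consonants.
CONSONANT_RUN = re.compile(r"[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}")

def looks_random(text: str) -> bool:
    """Return True if `text` contains a run of at least four consonants."""
    return CONSONANT_RUN.search(text) is not None

for sample in ["lmqkjfdmklgj", "length", "Paris", "hello world"]:
    print(sample, looks_random(sample))
```

The random string is flagged, but so is “length”, while “Paris” and “hello world” are not.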
To reduce such false positives, you can either require at least 5 consecutive consonants or restrict the expression to the letters of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}
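The two stricter variants can be compared side by side with a short Python sketch. The helper `flag` and the test words are assumptions made for this example, not part of any tool.

```python
import re

# Variant 1: require five or more consecutive consonants.
FIVE_CONSONANTS = re.compile(r"[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{5,}")
# Variant 2 (from the post): three or more letters of the AZERTY home row.
HOME_ROW = re.compile(r"[qsdfghjklmQSDFGHJKLM]{3,}")

def flag(text: str) -> dict:
    """Report which of the two stricter patterns matches `text`."""
    return {
        "five_consonants": FIVE_CONSONANTS.search(text) is not None,
        "home_row": HOME_ROW.search(text) is not None,
    }

for sample in ["lmqkjfdmklgj", "length", "goldfish"]:
    print(sample, flag(sample))
```

Both variants flag the random string and let “length” through. Neither is perfect, though: “goldfish” contains “ldf”, three home-row letters in a row, so the home-row variant still flags it.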
Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it from Talend Exchange.
On English (QWERTY) keyboards, the vowel “a” sits on the home row. This adds some difficulty, because there are probably several words that can be formed with “a” and the other letters of the home row. I will let you adapt the regular expression to your needs and keyboard…
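As one possible starting point for a QWERTY adaptation, here is a Python sketch that simply leaves “a” out of the home-row class and makes the minimum run length adjustable. Both choices are assumptions to be tuned on real data, and the names `QWERTY_HOME_CONSONANTS` and `build_pattern` are hypothetical.

```python
import re

# QWERTY home-row letters without the vowel "a" (an assumption of this sketch).
QWERTY_HOME_CONSONANTS = "sdfghjkl"

def build_pattern(min_run: int = 4) -> re.Pattern:
    """Build a case-insensitive pattern matching `min_run` or more home-row consonants."""
    return re.compile(f"[{QWERTY_HOME_CONSONANTS}]{{{min_run},}}", re.IGNORECASE)

pattern = build_pattern()
for sample in ["sdkfjhsdf", "ASDFGHJKL", "flash", "goldfish"]:
    print(sample, pattern.search(sample) is not None)
```

With a minimum run of four, the two keyboard-mashing strings are flagged and the two real words are not. Lowering `min_run` to 3 would flag “goldfish” again.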