Using regular expressions when profiling SQL Server with Talend Open Profiler

28 08 2009
Email column analysis

Email column analysis

By default, SQL Server does not provide integrated functions to do pattern matching with regular expressions. If you want to use Talend Open Profiler to analyze the matching of your data against some regular expressions (for example to analyze a column that contains emails, dates,…) you need to install a CLR User-Defined Function (UDF) on your SQL Server.

Amine who has done an internship at Talend has tried it. He provided us a step-by-step user guide about how to create the UDF, install it on your SQL Server and call it from Talend Open Profiler. This will allow you to use all predefined patterns to ensure the quality of your data.

You can also find new regular expressions for Talend Open Profiler in Talend Exchange.





How to detect random text in a free text field?

13 03 2009

As suggested by Pierrick 😉 , I start this post with a picture:

I give here a tip to detect random text by using regular expressions. But before I give this expression, let me explain when and what for it is useful.

Nowadays, we find webforms almost everywhere in the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other controls can also be done, such as to check the address, the country, the phone number format and so on.
Sometimes these controls are not done, either because the field is a free text field or simply because the developper of the web site did not develop them. In this case, the user can type anything in the text area and then anything enter in the database.

If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away and does not fill the form, or he types something like “lmqkjfdmklgj

if you are doing data quality, you may need to identify this kind of bad data. The question is: how to detect something that can be anything but has clearly no meaning?
There can exist several solutions to this question. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary. And it should be rather complete in order not to miss some words. Then how do you handle proper names?

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an azerty keyboard as in the picture above) I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is to put the left index finger onto the “F” key and the right index finger onto the “J” key. These keys are on the third row of keys (called the home row). This means that when you want to type something randomly, there’s a great chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.

Given these considerations, a random text is a string of characters without any vowel. The regular expression to match it can be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}

This expression matches any 4 consecutive consonants. Maybe, it’s not enough and some real words will be matched by this expression. For example, it matches the word “length“.
Either you can require at least 5 consecutive consonants or you can restrict the expression to the letter of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it on Talend Exchange.

For the English keyboards, the “a” vowel appears in the home row. This adds some difficulties because there are probably several words that could be formed with “a” and the letters of the home row. I let you adapt the regular expression to your needs and keyboard…





Share your regular expressions on Talend Exchange

12 02 2009

A new service is offered on the Talend Forge website. It’s called Talend Exchange and is aimed at providing a place where you can share TOS components, TOS templates, TOS jobs and TOP patterns. It is an enhancement of the previously called “Ecosystem” service.

I have added a few new regular expressions for Talend Open Profiler which may be useful to French users. You may download them and import them in Talend Open Profiler-1.1.x via the “import patterns” menu available on the “Libraries/Patterns” folder.

You may add your own regular expressions too. For the Talend Open Profiler-1.1.x platform, you need to create a simple xls file with the following columns
* C1: Label: the label of the pattern
* C2: Purpose: the purpose of the pattern
* C3: Description: the description of the pattern
* C4: Regular expression: the regular expression applicable to all databases
* C5: MySQL Regexp: the regular expression applicable to MySQL databases
* C6: Oracle Regexp: the regular expression applicable to Oracle databases
* C7: Author: the author of the regular expression
Then compress it in the zip format and you’re ready to upload it on Talend Exchange.

For the current development version of Talend Open Profiler-1.2.0xx, the task is even easier because there is a menu to export your regular patterns and/or your SQL patterns in csv format. For this version, it is suggested to export your patterns individually. Then compress your csv file(s) and upload them on Talend Exchange to share them with other users. The “import” menu will be able to read these files (once unzipped).

We’ll work later on a more integrated mechanism to share your patterns in one click.





How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
SELECT MEDIAN(salary) FROM employee

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
SELECT AVG(salary) FROM (
SELECT salary FROM employee
WHERE salary IS NOT NULL
ORDER by salary ASC
LIMIT p OFFSET n) T

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
FROM employee

For DB2, the following query is used in Talend Open Profiler:
SELECT AVG(salary) FROM (
SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee
) x
WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )
OR
( MOD(total,2) = 1 AND rn = next )

For Microsoft SQL Server, we used the TOP clause as follows
SELECT AVG(CAST(salary AS NUMERIC)) FROM (
SELECT TOP n salary FROM (
SELECT TOP m salary FROM employee
WHERE salary IS NOT NULL ORDER BY salary ASC
) AS FOO
ORDER BY salary DESC
) AS BAR

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT AVG(CAST (salary AS NUMERIC)) FROM (
SELECT DISTINCT salary FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) STT
WHERE
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SOU
WHERE SOU.salary <= STT.salary)
AND
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
UNION ALL
SELECT salary FROM employee
) AS SUR
WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.