Using regular expressions when profiling SQL Server with Talend Open Profiler

28 08 2009
Email column analysis

Email column analysis

By default, SQL Server does not provide integrated functions to do pattern matching with regular expressions. If you want to use Talend Open Profiler to analyze the matching of your data against some regular expressions (for example to analyze a column that contains emails, dates,…) you need to install a CLR User-Defined Function (UDF) on your SQL Server.

Amine who has done an internship at Talend has tried it. He provided us a step-by-step user guide about how to create the UDF, install it on your SQL Server and call it from Talend Open Profiler. This will allow you to use all predefined patterns to ensure the quality of your data.

You can also find new regular expressions for Talend Open Profiler in Talend Exchange.

Advertisements




Manage your data quality projects

9 03 2009

Here is an interesting screencast about how to create in 16 minutes a web repository to manage your data quality projects.

This application allows you to define data quality tasks and assign them to people. If you are starting a new data quality project, this example application can really help you to structure and organize your project.

In the last milestone version of Talend Open Profiler, it’s possible to define tasks attached to your analyses or to your other items such as a database table or column… This feature should help you to manage your data quality projects at the user level.
Try it.

Source: How to create an online data quality rules repository.





Share your regular expressions on Talend Exchange

12 02 2009

A new service is offered on the Talend Forge website. It’s called Talend Exchange and is aimed at providing a place where you can share TOS components, TOS templates, TOS jobs and TOP patterns. It is an enhancement of the previously called “Ecosystem” service.

I have added a few new regular expressions for Talend Open Profiler which may be useful to French users. You may download them and import them in Talend Open Profiler-1.1.x via the “import patterns” menu available on the “Libraries/Patterns” folder.

You may add your own regular expressions too. For the Talend Open Profiler-1.1.x platform, you need to create a simple xls file with the following columns
* C1: Label: the label of the pattern
* C2: Purpose: the purpose of the pattern
* C3: Description: the description of the pattern
* C4: Regular expression: the regular expression applicable to all databases
* C5: MySQL Regexp: the regular expression applicable to MySQL databases
* C6: Oracle Regexp: the regular expression applicable to Oracle databases
* C7: Author: the author of the regular expression
Then compress it in the zip format and you’re ready to upload it on Talend Exchange.

For the current development version of Talend Open Profiler-1.2.0xx, the task is even easier because there is a menu to export your regular patterns and/or your SQL patterns in csv format. For this version, it is suggested to export your patterns individually. Then compress your csv file(s) and upload them on Talend Exchange to share them with other users. The “import” menu will be able to read these files (once unzipped).

We’ll work later on a more integrated mechanism to share your patterns in one click.





Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kind of data. These data mining types helps Talend Open Profiler to choose the appropriate metrics for the column.

Nominal (also called “categorical” sometimes) means that the data can serve as label. For example, the type of a column called “WEATHER” with values: “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to nominal data mining type.

But it happens that data such as “52200”, “75014” are nominal data too although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear for the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data is a non sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way to automatically guess the correct type in Talend Open Profiler in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

“Interval” data mining type is used for numerical data and time data. Difference between two values, averages can be computed on this kind of data. In databases, sometimes numerical quantities are stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. Currently, it’s not yet implemented because there is not yet an interface which allows the user to specify the format of the data. But this feature is planned for a future release.

The other two data mining types are not usual data mining types. In data mining we find sometimes the types “ordinal” and “ratio”.

The reason is that for the indicators currently available in Talend Open Profiler, these two types are not needed. Instead we added two other types to handle textual data. For example, a column “COMMENT” which contains text is not a nominal data, but still we could be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other” which design a data which Talend Open Profiler does not know how to handle yet.





What’s in your databases?

2 10 2008

Often you only know approximately what’s in your databases. Data profiling tools can help you to get a better idea of your database content. The goal of a data profiler is not to analyze your data in depth but to give you at a glance the main features of your data. Especially, data profilers can give you information about missing data, duplicates, badly formatted data, invalid data (out of range, incorrect business pattern…)

Talend Open Profiler (TOP) can help you to explore your data. The latest version is the 1.1.0. Its official documentation is available here. A lot of other informations can be found on the Data Quality Pro website which also made a 21 page tutorial for addressing your data quality issues with Talend Open Profiler and their free DQ Pattern analyser.

This tutorial shows you how to use TOP to explore your data and gives a lot of tips about how to interpret your profiling results. And this is really important, because you can profile easily your data and produce nice graphics with TOP, but if you don’t know what to do once you obtained the results, then profiling your data did not really help you to enhance your data quality. The tutorial also presents a very useful function called “DQ Pattern analyser” that lists the patterns existing in the data. It helps you to quickly see what’s wrong with your data and permits to identify rare occurences.
This function does not exists yet in TOP, but it will be implemented for the next version along with other new features.

By the way, if you are missing a feature, it’s time to tell Talend’s team which new feature would be great to be in TOP.





TOP announcements

11 09 2008

As I don’t want to announce here every version of Talend Open Profiler, you can find the announcements at the freshmeat page of TOP. The latest release is the 1.1.0RC1 release.





TOP 1.1.0 milestone 2

20 08 2008

The second milestone version of Talend Open Profiler is out. Try it now!

Among the new features, the support for Microsoft SQL Server has been added.