Quantum Data Quality?

31 05 2009

What’s the link between quantum physics and data quality? In a recent discussion, Jim Harris suggests that data quality projects are subject to a measurement problem, just as in quantum physics. He asks:

“When does a data quality project stop existing as potential success or failure and become one or the other?”

The question makes sense, but I don’t think there is any paradox in it. In my opinion, as long as we do not measure the data quality, nothing can be said about the success or failure of the data quality project. It’s like any probabilistic question: until the throw of a die is read, we don’t know the outcome, but there is an outcome. In other words, the status of the data quality project is in an undefined state, but it is certainly not in both states at the same time like Schrödinger’s cat.

Now what could be common to quantum physics and data quality?

I think that something common to both fields is that a context is required in order to perform the measurement. Data quality is contextual. We cannot speak about quality in an absolute manner (see this discussion). In some contexts the data will be of good quality, whereas in other contexts they could be of bad quality. It depends on the intended use or purpose, as Henrik Liliendahl Sørensen says. But that alone is not yet sufficient to be quantum. Other features are required before we can speak about quantum quality. I suppose that we would need to identify non-commuting contexts in data quality projects. This would mean that the order of the measurements of data quality plays a crucial role. I am not sure that we can exhibit this kind of behavior, and we would probably need to define all these concepts more precisely before we could show that data quality has anything quantum about it.





How to detect random text in a free text field?

13 03 2009

As suggested by Pierrick 😉 , I start this post with a picture:

Here is a tip for detecting random text by using regular expressions. But before I give this expression, let me explain when and why it is useful.

Nowadays, we find web forms almost everywhere on the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other controls can also be applied, such as checking the address, the country, the phone number format, and so on.
Sometimes these controls are missing, either because the field is a free text field or simply because the developer of the web site did not implement them. In this case, the user can type anything in the text area, and then anything can enter the database.

If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away and does not fill in the form, or he types something like “lmqkjfdmklgj”.

If you are doing data quality, you may need to identify this kind of bad data. The question is: how do you detect something that can be anything but clearly has no meaning?
Several solutions exist for this question. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary, and it should be rather complete in order not to miss any words. And then, how do you handle proper names?
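As a rough illustration of this first approach (the tiny word list and the whitespace tokenization here are my own simplifications, not part of any tool):

```python
# A minimal sketch of the dictionary approach. The WORDS set is only an
# illustration; a real check would load a complete dictionary and would
# still need special handling for proper names.
WORDS = {"the", "quick", "brown", "fox", "hello", "world", "length"}

def all_words_unknown(text: str) -> bool:
    """Return True when no token of the text appears in the word list."""
    tokens = [t.lower() for t in text.split()]
    return bool(tokens) and not any(t in WORDS for t in tokens)

print(all_words_unknown("lmqkjfdmklgj"))  # True: no token is a known word
print(all_words_unknown("hello world"))   # False: real words pass the check
```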

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an AZERTY keyboard, as in the picture above), I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is to put the left index finger on the “F” key and the right index finger on the “J” key. These keys are on the third row of keys (called the home row). This means that when you type something randomly, there is a great chance you will type only letters from the home row. And on a French keyboard, this means there will be no vowel in the entered text.

Given these considerations, a random text is a string of characters without any vowel. A regular expression to match it can be something like:
[zrtypqsdfghjklmwxcvbnZRTYPQSDFGHJKLMWXCVBN]{4,}

This expression matches any run of 4 or more consecutive consonants. Maybe it’s not strict enough, and some real words will be matched by this expression. For example, it matches the word “length”.
You can either require at least 5 consecutive consonants or restrict the expression to the letters of the home row:
[qsdfghjklmQSDFGHJKLM]{3,}
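To try this heuristic outside of a profiling tool, here is a small sketch in Python (my own wrapper around the pattern above, not Talend code):

```python
import re

# Runs of 3 or more home-row letters (French AZERTY layout), as in the
# pattern above. Real words rarely contain such a run.
HOME_ROW_RUN = re.compile(r"[qsdfghjklm]{3,}", re.IGNORECASE)

def looks_random(text: str) -> bool:
    """Flag free-text values containing a suspicious home-row run."""
    return HOME_ROW_RUN.search(text) is not None

print(looks_random("lmqkjfdmklgj"))  # True: typical keyboard mashing
print(looks_random("length"))        # False: no home-row run of 3+
```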

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it from Talend Exchange.

On English keyboards, the vowel “a” appears in the home row. This adds some difficulty because there are probably several words that can be formed with “a” and the other letters of the home row. I’ll let you adapt the regular expression to your needs and keyboard…
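For example, one possible adaptation for a standard QWERTY keyboard (an assumption on my part; tune the run length to your own data):

```python
import re

# On QWERTY the home row is "asdfghjkl", so "a" is excluded to keep the
# no-vowel heuristic, and a longer run (4+) limits false positives.
QWERTY_RUN = re.compile(r"[sdfghjkl]{4,}", re.IGNORECASE)

print(bool(QWERTY_RUN.search("asdfjkl;")))   # True: home-row mashing
print(bool(QWERTY_RUN.search("dashboard")))  # False: no 4-letter run
```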





Manage your data quality projects

9 03 2009

Here is an interesting screencast showing how to create, in 16 minutes, a web repository to manage your data quality projects.

This application allows you to define data quality tasks and assign them to people. If you are starting a new data quality project, this example application can really help you to structure and organize your project.

In the latest milestone version of Talend Open Profiler, it’s possible to define tasks attached to your analyses or to other items such as a database table or column… This feature should help you manage your data quality projects at the user level.
Try it.

Source: How to create an online data quality rules repository.





Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box next to each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kinds of data. These data mining types help Talend Open Profiler choose the appropriate metrics for each column.

Nominal (also sometimes called “categorical”) means that the data can serve as labels. For example, the type of a column called “WEATHER” with values “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to the nominal data mining type.

But it happens that data such as “52200” or “75014” are nominal data too, although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear to the user that these data are of nominal type because they identify postal codes in France. Computing mathematical quantities such as the average on these data makes no sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way for Talend Open Profiler to guess the correct type automatically in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

The “interval” data mining type is used for numerical data and time data. Differences between two values and averages can be computed on this kind of data. In databases, numerical quantities are sometimes stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. This is not implemented yet, because there is no interface yet that allows the user to specify the format of the data, but the feature is planned for a future release.
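To make the distinction concrete, here is a small illustration in plain Python (my own example, not Talend code): frequencies suit nominal data, while averages only make sense once interval data stored as text is parsed into numbers.

```python
import statistics

# Nominal data represented by numbers: frequencies are meaningful...
postal_codes = ["75014", "52200", "75014", "69001"]
freq = {code: postal_codes.count(code) for code in set(postal_codes)}

# ...but the "average postal code" is computable and yet meaningless.
meaningless_mean = statistics.mean(int(c) for c in postal_codes)

# Interval data stored in a VARCHAR column: parse first, then summarize.
amounts = ["12.5", "7.25", "30.0"]
values = [float(a) for a in amounts]
print(min(values), max(values), statistics.mean(values))
```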

The remaining two data mining types are not usual data mining types. In data mining, one sometimes finds the types “ordinal” and “ratio” instead.

The reason is that the indicators currently available in Talend Open Profiler do not need these two types. Instead, we added two other types to handle textual data. For example, a column “COMMENT” which contains free text is not nominal data, but we could still be interested in seeing the duplicate values of this column, or in implementing metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other”, which designates data that Talend Open Profiler does not yet know how to handle.





Data quality blogs and resources

7 10 2008

The best data quality blogs are now listed in one place: Data Quality Pro Blog Finder

This post explains how the blogs are evaluated and shows that independent bloggers have more interactions with the community than vendor bloggers.

Data Quality Tools vendors are listed here.

The latest news is aggregated here, and data quality events are listed here.





What’s in your databases?

2 10 2008

Often you only know approximately what’s in your databases. Data profiling tools can help you get a better idea of your database content. The goal of a data profiler is not to analyze your data in depth but to give you, at a glance, the main features of your data. In particular, data profilers can give you information about missing data, duplicates, badly formatted data, and invalid data (out of range, incorrect business pattern…).

Talend Open Profiler (TOP) can help you explore your data. The latest version is 1.1.0. Its official documentation is available here. A lot of other information can be found on the Data Quality Pro website, which also offers a 21-page tutorial on addressing your data quality issues with Talend Open Profiler and their free DQ Pattern analyser.

This tutorial shows you how to use TOP to explore your data and gives a lot of tips about how to interpret your profiling results. This is really important, because you can easily profile your data and produce nice graphics with TOP, but if you don’t know what to do once you have obtained the results, then profiling your data did not really help you to enhance your data quality. The tutorial also presents a very useful function called “DQ Pattern analyser” that lists the patterns existing in the data. It helps you quickly see what’s wrong with your data and identify rare occurrences.
This function does not exist yet in TOP, but it will be implemented in the next version along with other new features.

By the way, if you are missing a feature, it’s time to tell Talend’s team which new features would be great to have in TOP.





TOP 1.1.0 milestone 1

6 08 2008

The first milestone release of the next version of Talend Open Profiler is out!!

About the new features:

  • A “Result” tab has been added to the analysis editor in which result values are in tables.
  • In the indicator selector pop-up, a full row can be checked with one click.
  • You can start doing data quality monitoring by setting thresholds on indicators: when a threshold is not respected, the result is highlighted in red in the Result tab of the analysis editor.
  • A new kind of analysis is provided: the connection analysis. But beware that the filters do not work yet, which means all tables are scanned. Don’t use it on big databases yet.
  • Regular Patterns can be imported from an Excel file.
  • A new type of Indicator has been created for SQL patterns. This allows you to create your own patterns to put in “LIKE” clause.
  • A menu “Column analysis” has been added on Table elements to profile all columns of one or several tables with a few clicks.
  • A new view outputs some details on the selected objects.
  • You can now see which objects are analyzed without having to open the analysis editor.

You are welcome to suggest new features or report bugs in Talend’s bugtracker.