How do I connect my ipod to linux?

6 05 2012

Here is what I use in my .alias file:

alias ipodmount=’ifuse ~/ipod’
alias ipodumount=’fusermount -u ~/ipod’
alias ipod=’ipodmount; gtkpod;ipodumount’

This launches gtkpod and connects automatically to my ipod.

Using regular expressions when profiling SQL Server with Talend Open Profiler

28 08 2009
Email column analysis

Email column analysis

By default, SQL Server does not provide integrated functions to do pattern matching with regular expressions. If you want to use Talend Open Profiler to analyze the matching of your data against some regular expressions (for example to analyze a column that contains emails, dates,…) you need to install a CLR User-Defined Function (UDF) on your SQL Server.

Amine who has done an internship at Talend has tried it. He provided us a step-by-step user guide about how to create the UDF, install it on your SQL Server and call it from Talend Open Profiler. This will allow you to use all predefined patterns to ensure the quality of your data.

You can also find new regular expressions for Talend Open Profiler in Talend Exchange.


11 06 2009

How to avoid that svn+ssh always asks for the password:

1. create a ssh key: ssh-keygen (without a passphrase or it will be asked each time you do an update)

2. connect to the remote server via ssh

3. edit or create a file “.ssh/authorized_keys” and put the content of your local “.ssh/” file (if the “.ssh” folder does not exist, execute “ssh-keygen” in order to generate it. Then remove the generated key files: “id_rsa” and “” which are not required)

Quantum Data Quality?

31 05 2009

What’s the link between quantum physics and data quality? In a recent discussion, Jim Harris suggests that data quality projects yield to a measurement problem as there is in quantum physics. He asks:

“When does a data quality project stop existing as potential success or failure and become one or the other?”

The question makes sense, but I don’t think there is any paradox in it. In my opinion, as long as we do not measure the data quality, nothing can be said about the success or failure of the data quality project. It’s like with any probabilistic issue: as long as the throw of a die is not read, we don’t know the outcome but there is an outcome. In other words, the status of the data quality project is in an undefined state, but it is certainly not in both states at the same time as the Schrödinger’s cat.

Now what could be common to quantum physics and data quality?

I think that something which could be common to both fields is that a context is required in order to perform the measurement. Data quality is contextual. We cannot speak about the quality in an absolute manner (see this discussion). In some context the data will be of good quality whereas in other contexts they could be of bad quality. It depends on the intended use or purpose as says Henrik Liliendahl Sørensen. But it’s not sufficient yet to be quantum. Other features are required before we can speak about quantum quality. I suppose that we would need to identify non-commuting contexts in data quality projects. This would mean the order of the measurements of data quality would play a crucial role. I am not sure that we can exhibit this kind of behavior. And we need probably to define more precisely all the concepts before we can show that data quality has something quantum.

How to detect random text in a free text field?

13 03 2009

As suggested by Pierrick 😉 , I start this post with a picture:

I give here a tip to detect random text by using regular expressions. But before I give this expression, let me explain when and what for it is useful.

Nowadays, we find webforms almost everywhere in the Internet. These forms are used to collect data entered by users. If the form is well designed, the required fields cannot be left empty by the user. Other controls can also be done, such as to check the address, the country, the phone number format and so on.
Sometimes these controls are not done, either because the field is a free text field or simply because the developper of the web site did not develop them. In this case, the user can type anything in the text area and then anything enter in the database.

If the user does not want to fill in a field and this field is required, what is going to happen?
Either the user goes away and does not fill the form, or he types something like “lmqkjfdmklgj

if you are doing data quality, you may need to identify this kind of bad data. The question is: how to detect something that can be anything but has clearly no meaning?
There can exist several solutions to this question. The first one could be to check that the data is composed of real words. For this solution, you need a dictionary. And it should be rather complete in order not to miss some words. Then how do you handle proper names?

Another solution is to use regular expressions. But what would be the regular expression that matches random text?

When I looked at my keyboard (an azerty keyboard as in the picture above) I saw that all the vowels are on the second row. Moreover, the default starting position of the hands on the keyboard is to put the left index finger onto the “F” key and the right index finger onto the “J” key. These keys are on the third row of keys (called the home row). This means that when you want to type something randomly, there’s a great chance that you will type only letters from the home row. And on a French keyboard, this means that there will be no vowel in the entered text.

Given these considerations, a random text is a string of characters without any vowel. The regular expression to match it can be something like:

This expression matches any 4 consecutive consonants. Maybe, it’s not enough and some real words will be matched by this expression. For example, it matches the word “length“.
Either you can require at least 5 consecutive consonants or you can restrict the expression to the letter of the home row:

Try it on your data with Talend Open Profiler. You can either create your own “pattern” or download it on Talend Exchange.

For the English keyboards, the “a” vowel appears in the home row. This adds some difficulties because there are probably several words that could be formed with “a” and the letters of the home row. I let you adapt the regular expression to your needs and keyboard…

Manage your data quality projects

9 03 2009

Here is an interesting screencast about how to create in 16 minutes a web repository to manage your data quality projects.

This application allows you to define data quality tasks and assign them to people. If you are starting a new data quality project, this example application can really help you to structure and organize your project.

In the last milestone version of Talend Open Profiler, it’s possible to define tasks attached to your analyses or to your other items such as a database table or column… This feature should help you to manage your data quality projects at the user level.
Try it.

Source: How to create an online data quality rules repository.

Share your regular expressions on Talend Exchange

12 02 2009

A new service is offered on the Talend Forge website. It’s called Talend Exchange and is aimed at providing a place where you can share TOS components, TOS templates, TOS jobs and TOP patterns. It is an enhancement of the previously called “Ecosystem” service.

I have added a few new regular expressions for Talend Open Profiler which may be useful to French users. You may download them and import them in Talend Open Profiler-1.1.x via the “import patterns” menu available on the “Libraries/Patterns” folder.

You may add your own regular expressions too. For the Talend Open Profiler-1.1.x platform, you need to create a simple xls file with the following columns
* C1: Label: the label of the pattern
* C2: Purpose: the purpose of the pattern
* C3: Description: the description of the pattern
* C4: Regular expression: the regular expression applicable to all databases
* C5: MySQL Regexp: the regular expression applicable to MySQL databases
* C6: Oracle Regexp: the regular expression applicable to Oracle databases
* C7: Author: the author of the regular expression
Then compress it in the zip format and you’re ready to upload it on Talend Exchange.

For the current development version of Talend Open Profiler-1.2.0xx, the task is even easier because there is a menu to export your regular patterns and/or your SQL patterns in csv format. For this version, it is suggested to export your patterns individually. Then compress your csv file(s) and upload them on Talend Exchange to share them with other users. The “import” menu will be able to read these files (once unzipped).

We’ll work later on a more integrated mechanism to share your patterns in one click.

Datamining type

4 11 2008

In Talend Open Profiler, when you create a column analysis, you can see a combo box near each column in the editor which represents the data mining type of the column. What is it? And what is it useful for?

The available data mining types are

  1. nominal
  2. interval
  3. unstructured text
  4. other

Not all indicators (or metrics) can be computed on all kind of data. These data mining types helps Talend Open Profiler to choose the appropriate metrics for the column.

Nominal (also called “categorical” sometimes) means that the data can serve as label. For example, the type of a column called “WEATHER” with values: “sun”, “cloud”, “rain” would be nominal. In Talend Open Profiler, textual data are set to nominal data mining type.

But it happens that data such as “52200”, “75014” are nominal data too although they are represented by numbers. In fact, a column called “POSTAL_CODE” could have these values. It is clear for the user that these data are of nominal type because they identify a postal code in France. Computing mathematical quantities such as the average on these data is a non sense. In that case, the user should set the data mining type of this column to “nominal”, because there is currently no way to automatically guess the correct type in Talend Open Profiler in such a case.
The same is true for primary or foreign key data. Keys are most of the time numerical data, but their data mining type is “nominal”.

“Interval” data mining type is used for numerical data and time data. Difference between two values, averages can be computed on this kind of data. In databases, sometimes numerical quantities are stored in textual fields. With Talend Open Profiler, it’s possible to declare a textual column (e.g. a column of type VARCHAR) as an interval. In that case, the data should be treated as numerical data and summary statistics should be available. Currently, it’s not yet implemented because there is not yet an interface which allows the user to specify the format of the data. But this feature is planned for a future release.

The other two data mining types are not usual data mining types. In data mining we find sometimes the types “ordinal” and “ratio”.

The reason is that for the indicators currently available in Talend Open Profiler, these two types are not needed. Instead we added two other types to handle textual data. For example, a column “COMMENT” which contains text is not a nominal data, but still we could be interested in seeing the duplicate values of this column. Or we could implement metrics specific to text mining (but this is not for the current release…).

And finally, we also have the type “other” which design a data which Talend Open Profiler does not know how to handle yet.

How to compute a median in SQL

28 10 2008

In Talend Open Profiler, we generate SQL queries to get statistical informations. Among the currently available indicators, the median is one of the most difficult to compute. Nevertheless this indicator is worth computing because it is more stable than the mean indicator (average). By stable, I mean that it is less influenced by extremal values. This is not the case with the average which can vary a lot when extremal values exist.

I found several ways to compute the median depending on the database type. The most simple is for example with Oracle 10g which provides a MEDIAN function, so that your query writes
SELECT MEDIAN(salary) FROM employee

But for other databases, things begin to be more tricky. Let’s take MySQL first. One way to compute the median is the following:
SELECT salary FROM employee
ORDER by salary ASC
LIMIT n,p) T

where p=1 and n=N/2-1 when the number of non null rows N is even, or p=2 and n=(N-1)/2 when N is odd.

For Postgresql, the query is similar to the MySQL query and uses LIMIT too.
SELECT salary FROM employee
ORDER by salary ASC

This query can also be used on MySQL but not on old versions of MySQL (before 5.0).
For Oracle 9i, the MEDIAN function does not exists and we must use the PERCENTILE_CONT function:
FROM employee

For DB2, the following query is used in Talend Open Profiler:
SELECT salary, COUNT(*) OVER( ) total, CAST(COUNT(*) OVER( ) AS DECIMAL)/2 mid, CEIL(CAST(COUNT(*) OVER( ) AS DECIMAL)/2) next, ROW_NUMBER() OVER ( ORDER BY salary) rn FROM employee
) x
WHERE ( MOD(total,2) = 0 AND rn IN ( mid, mid+1 ) )
( MOD(total,2) = 1 AND rn = next )

For Microsoft SQL Server, we used the TOP clause as follows
SELECT TOP n salary FROM (
SELECT TOP m salary FROM employee

where n is given as in the MySQL case and m=n+p (p being given above for the MySQL case).

Up to now, the only way I found for computing the median on Sybase ASE is the following:
SELECT salary FROM employee
SELECT salary FROM employee
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
SELECT salary FROM employee
WHERE SOU.salary <= STT.salary)
(SELECT COUNT(salary) FROM employee) <= (SELECT COUNT(salary) FROM (
SELECT salary FROM employee
SELECT salary FROM employee
WHERE SUR.salary >= STT.salary) ) T

This query makes heavy use of correlated subqueries and I hope to find a more efficient way to compute a median on this database.

Median can be computed by other approaches. Temporary tables could be used or cursors. But Talend Open Profiler must only use SELECT statements because a data profiler could not have the permissions to create a table on a database and the use of cursors is too complex for this tool.

krunner history

18 10 2008

krunner is the small linux application launcher that pops-up when you hit Alt+F2. Its behavior in kde4 changed from the behavior of minicli of kde-3.5. In order to get the previous behavior back, right-click on the text field of krunner and change the text completion mode to “short automatic”.

I never knew this menu before today. That’s why I give this tip here.