Data quality blogs and resources

7 10 2008

The best data quality blogs are now listed in one place: Data Quality Pro Blog Finder

This post explains how the blogs are evaluated and shows that independent bloggers have more interactions with the community than vendor bloggers.

Data quality tool vendors are listed here.

The latest news is aggregated here, and data quality events are listed here.





What’s in your databases?

2 10 2008

Often you only know approximately what’s in your databases. Data profiling tools can help you get a better idea of your database content. The goal of a data profiler is not to analyze your data in depth but to show you its main features at a glance. In particular, data profilers can give you information about missing data, duplicates, badly formatted data and invalid data (out of range, incorrect business pattern…).
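To make these indicators concrete, here is a minimal Python sketch of what a profiler computes for a single column. It is only an illustration and not how any particular tool is implemented; the function name profile_column, the sample data and the valid range are assumptions made up for this example.

```python
from collections import Counter


def profile_column(values, valid_range=None):
    """Return a few at-a-glance indicators for one column (hypothetical helper)."""
    # Treat None and empty strings as missing data.
    non_null = [v for v in values if v is not None and v != ""]
    counts = Counter(non_null)
    profile = {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        # Number of distinct values that appear more than once.
        "duplicate_count": sum(1 for c in counts.values() if c > 1),
    }
    if valid_range is not None:
        low, high = valid_range
        # Values outside the accepted business range.
        profile["out_of_range_count"] = sum(
            1 for v in non_null if not (low <= v <= high)
        )
    return profile


# Made-up sample column: ages with missing and invalid entries.
ages = [34, 27, None, 34, 151, 42, None, -3]
print(profile_column(ages, valid_range=(0, 120)))
```

On a real database these counts would normally be computed by SQL queries rather than in memory, but the indicators are the same.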

Talend Open Profiler (TOP) can help you explore your data. The latest version is 1.1.0. Its official documentation is available here. A lot of other information can be found on the Data Quality Pro website, which has also published a 21-page tutorial on addressing your data quality issues with Talend Open Profiler and their free DQ Pattern analyser.

This tutorial shows you how to use TOP to explore your data and gives a lot of tips on how to interpret your profiling results. This is really important: you can easily profile your data and produce nice graphics with TOP, but if you don’t know what to do once you have the results, profiling has not really helped you to improve your data quality. The tutorial also presents a very useful function called “DQ Pattern analyser” that lists the patterns existing in the data. It helps you to quickly see what’s wrong with your data and to identify rare occurrences.
This function does not exist yet in TOP, but it will be implemented in the next version along with other new features.
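The idea of listing the patterns that exist in a column can be sketched in a few lines of Python. This is not the code of the DQ Pattern analyser, only an illustration of the principle; the symbols chosen here ("A", "a", "9"), the function names and the sample phone numbers are assumptions.

```python
import re
from collections import Counter


def to_pattern(value):
    """Replace ASCII letters and digits by symbols; keep everything else as-is."""
    pattern = re.sub(r"[A-Z]", "A", value)
    pattern = re.sub(r"[a-z]", "a", pattern)
    return re.sub(r"[0-9]", "9", pattern)


def pattern_frequencies(values):
    """Return (pattern, count) pairs, most frequent pattern first."""
    return Counter(to_pattern(v) for v in values).most_common()


# Made-up phone numbers: most follow one format, two do not.
phones = [
    "01 45 67 89 10",
    "01 45 67 89 11",
    "0145678912",
    "01-45-67-89-13",
    "01 45 67 89 14",
]
for pattern, count in pattern_frequencies(phones):
    print(f"{count:3d}  {pattern}")
```

The patterns with the lowest counts are the rare occurrences worth a closer look.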

By the way, if you are missing a feature, now is the time to tell Talend’s team which new feature you would like to see in TOP.





TOP announcements

11 09 2008

Since I don’t want to announce every version of Talend Open Profiler here, you can find the announcements on the freshmeat page of TOP. The latest release is 1.1.0RC1.





TOP 1.1.0 milestone 2

20 08 2008

The second milestone version of Talend Open Profiler is out. Try it now!

Among the new features is support for Microsoft SQL Server.





TOP 1.1.0 milestone 1

6 08 2008

The first milestone release of the next version of Talend Open Profiler is out!!

About the new features:

  • A “Result” tab has been added to the analysis editor, in which result values are shown in tables.
  • In the indicator selector pop-up, a full row can be checked with one click.
  • You can start doing data quality monitoring by setting thresholds on indicators: when a threshold is not respected, the result is highlighted in red in the Result tab of the analysis editor.
  • A new kind of analysis is provided: the connection analysis. But beware that the filters do not work yet. This means all tables are scanned. Don’t use it on big databases yet.
  • Regular patterns can be imported from an Excel file.
  • A new type of indicator has been created for SQL patterns. This allows you to create your own patterns to put in a “LIKE” clause.
  • A “Column analysis” menu has been added on Table elements to profile all columns of one or several tables with a few clicks.
  • A new view outputs some details on the selected objects.
  • You can now see which objects are analyzed without having to open the analysis editor.
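To illustrate how an SQL pattern indicator and a threshold fit together, here is a small Python sketch using the standard sqlite3 module. It is not TOP’s implementation: the table, the column, the LIKE pattern, the 90% threshold and the helper names are all assumptions chosen for the example.

```python
import sqlite3


def like_match_ratio(conn, table, column, like_pattern):
    """Fraction of rows whose column matches the SQL LIKE pattern."""
    # table and column are inserted directly into the query text,
    # so only trusted identifiers must be passed here.
    query = (
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} LIKE ? THEN 1 ELSE 0 END) FROM {table}"
    )
    total, matching = conn.execute(query, (like_pattern,)).fetchone()
    return (matching or 0) / total if total else 0.0


def check_threshold(value, minimum):
    """True when the indicator respects its lower threshold."""
    return value >= minimum


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (email TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("a@example.com",), ("b@example.org",), ("not-an-email",), (None,)],
)

# Very rough e-mail shape: something, "@", something, ".", at least 2 characters.
ratio = like_match_ratio(conn, "customer", "email", "%_@_%.__%")
status = "ok" if check_threshold(ratio, 0.9) else "below threshold"
print(f"matching ratio: {ratio:.2f} ({status})")
```

Rows with a NULL value never match the LIKE clause, so they count against the ratio in this sketch.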

You are welcome to suggest new features or report bugs in Talend’s bugtracker.





Talend Open Profiler video

6 08 2008

I found this video about Talend Open Profiler 1.0.0 on a French website dedicated to Business Intelligence.

The first video shows the installation of TOP on a Windows system and presents the layout of the application.

The second video is more interesting because it shows the functionalities of TOP. The demo shows how to create your own analyses and what you can tell about the quality of your data with a few clicks. It shows the use of pattern indicators to check the validity of email addresses, phone numbers…

With this video, you can judge the power of TOP in terms of speed. In this example, profiling around 7000 rows with all indicators selected and a few patterns defined takes less than 2 seconds.

If you want to test it yourself, go to the Talend download page. You can even try the latest milestone release, 1.1.0M1.





How to launch TOP with a specified JVM?

1 08 2008

If you need to specify the JVM path to be used by Talend Open Profiler, simply edit the TalendOpenProfiler-XXX.ini file corresponding to your system and add the following two lines at the beginning of the file:
-vm
C:/usr/bin/java.exe

Be sure to write them on two separate lines, not on a single line; otherwise it will not work.

The same configuration setting applies to Talend Open Studio.

Source: Eclipse FAQ.





TOP: New version

4 07 2008

Some new features have been added. Here is a list:

  • A toolbar has been added with buttons for running analyses, previewing graphics and saving files.
  • You can now drag & drop columns into the analysis editor.
  • Some predefined analyses are now available via a right click on the columns.
  • The pattern editor opens when you create a new pattern so that you can easily modify your patterns.
  • A button has been added for adding a pattern indicator to a column in the analysis editor.

Some bugs have been fixed. Among them, the most important are:

  • The frequency table now works.
  • The cheat sheet opens at startup.
  • The number of elements is displayed correctly in the DQ repository view.

Go to the download page. Also check the “Getting started guide”, which has a new section with a short introduction to the usage of patterns.





Talend Open Profiler

20 06 2008

I have been working on this project for a few months now and I am pleased to announce the first public release candidate of Talend Open Profiler.

This tool helps you to browse and explore your databases and analyze your data. For each column that you want to analyze, you have several indicators at your disposal: row counts, null counts, duplicate counts… field length, frequency table, summary statistics… There are also indicators based on regular expressions. These indicators help you to discover the percentage of data of bad quality.

You can also create your own expressions to check your data against whatever pattern you want.
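As a rough illustration of a regular-expression indicator, the following Python sketch computes the percentage of values that match a pattern. It does not reflect how Talend Open Profiler evaluates patterns internally; the function name, the deliberately simple e-mail expression and the sample values are assumptions.

```python
import re


def regex_match_percentage(values, pattern):
    """Percentage of non-null values that fully match the regular expression."""
    regex = re.compile(pattern)
    checked = [v for v in values if v is not None]
    if not checked:
        return 0.0
    matching = sum(1 for v in checked if regex.fullmatch(v))
    return 100.0 * matching / len(checked)


# A simplified e-mail pattern, good enough for a sketch only.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(\.[\w-]+)+"

emails = [
    "alice@example.com",
    "bob@example",
    "carol@example.org",
    "dave at example.com",
]
valid = regex_match_percentage(emails, EMAIL_PATTERN)
print(f"valid: {valid:.1f}%, of bad quality: {100.0 - valid:.1f}%")
```

Swapping EMAIL_PATTERN for another expression is all it takes to check a different business rule.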

The analyses that you create are automatically saved so that you can run them several times and see how your data quality evolves.


You can download this profiling tool from the Talend site. The installation guide is on the Talend community wiki, and a short getting-started guide is available in the documentation section.

If you find bugs or want to see new features, just file a report in Talend’s bugtracker.

If you want to discuss Talend Open Profiler, data profiling or data quality, simply go to the forum. We have opened a section dedicated to this new tool.

If you want to know whether your data are as clean as a lotus leaf, try it.





Open Source Business Intelligence

28 05 2008

French readers can find an introduction to open source business intelligence here. It is a short introduction for people new to BI, with some useful links.

Still for French readers, my other blog offers some hints on dimensional modeling taken from the great book by Ralph Kimball and Margy Ross: “Entrepôts de données, guide pratique de modélisation dimensionnelle”.