16 thoughts on “Pentaho metadata”

  1. Hello
    I think that metadata is not only for BI; BI is simply the first-class consumer of metadata functionality. Since I come more from a business process and performance management background, I will look at Pentaho Metadata from a systems integration point of view.
    Here are some ideas that come to my mind:
    Allow categorization of data in databases or actually any source
    Some kind of a query language
    The benefits are two-fold:
    Systems are integrated to exchange data or processes in a certain business context. Whether you build a CRM warehouse or do business activity monitoring, it would be nice to grasp the meaning of the data faster (categories?).
    If you have a query language at hand, you can quickly find, for example, data belonging to a certain category across various sources: a consolidation scenario, or creating rules for master data management.

    Greetings and keep on good work,
    Pawel Urbanski

  2. Hello Matt,

    When you say cost-based query generation, do you mean ad-hoc generation of the SQL query? I am unable to comprehend the possibility of such a thing. I would imagine that only the selection predicate would change, keeping the join condition the same for different user selections for the same given template/report.

    Is that not the case, or does the join condition between the tables involved in generating the data for the report change? If the join condition changes dynamically, how would the report designer, which in this case would be the business user designing the report on the fly, do data validation or know what to expect, unless it is for very simple cases where the business user can only change the columns to display for a given condition?

    Any clarifications would be greatly appreciated.


  3. Hi “Pentaho”,

    If the metadata includes the relative size of the database tables (in rows) you can calculate the “cheapest” path if multiple SQL query solutions exist.
    Personally, I see these kinds of tricks as a “last resort”. Typically these things are confusing for the end user and should therefore be avoided at all cost.

    To answer your specific question, yes it is possible that the where clause changes based on conditions, but I don’t think we have support for that just yet. (could be wrong, it’s an old post :-))
    Again, being the project lead for Pentaho Data Integration I know for a fact that the complex questions should be answered in the ETL framework, NOT in the reporting layer.
    Keep the reporting as simple as possible. Your users will thank you for it.



  4. Hi Matt,

    I completely understand the need to keep the reporting as simple as possible. My question regarding the where clause changing dynamically arose because of Pentaho's ad-hoc report claim; maybe I am unable to comprehend the scope of the ad-hoc-ness. What exactly can be changed from one report to another to make it ad-hoc? From what I infer from your reply, the where clause does not change as of yet, which I suppose means it is not possible to change the where clause so as to make the reports completely ad-hoc in nature.

    By ad-hoc reports I think it is implied that the formatting and the selection predicate, i.e. the columns selected, can be changed, but not the actual join condition which qualifies the data to be retrieved. Please clarify whether what I comprehend from your reply is in fact correct.

    Thanks for your time!

  5. Ad-hoc means what it means: a user can select any column from a model/view he/she likes.
    You obviously can apply filters, select columns, and do all sorts of things. I do not hide the fact that there are certain limits (outer joins for example), but if you stick to sane data warehouse modeling and a classic star schema you should not see these as problematic.

    The SQL generation behind it is not seen by the users and doesn’t really enter the discussion.
    However, it must be obvious to you that in order to fulfill the wishes of the user, there is a need to change every part of the SQL statements that we generate, including obviously the “where” clause.


  6. Matt,

    My intention is not to try to highlight the limitations; I am curious to know IF THE ACTIONS OF THE USER CAN LEAD TO THE SELECTION (JOIN) PREDICATE CHANGING. If so, will the automated query generation subsystem define what is to be displayed to the user?

    But if being ad-hoc is limited to, as you said, applying filters, doing computations, etc., then it makes sense to restrict the ad-hoc nature to those operations. So it does not make sense for an ad-hoc report generation system IN GENERAL, and Pentaho in particular, to try to change the selection (join) predicate at run time.

  7. Actually, the user selection is the driving force behind the SQL generation. As such, SQL generation is always done at runtime, never ahead of time.
    Adding a single business column to the selection can require a few tables extra to be joined, changing the complete SQL.

    I think it can go way beyond that. The architecture I designed way back when this blog post appeared allowed not only for SQL to be generated, but complete transformations. You could do anything in there.

    What is also interesting in the architecture is that not only the business column selection can modify the SQL generation, but also the locale. It’s not uncommon to have columns like DESC_EN, DESC_ES, DESC_FR, etc.
    It could make sense to allow properties to pick the correct locale based on the locale of the user for which the report was meant.
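    Purely as an illustration (the DESC_EN/DESC_ES/DESC_FR column names come from the example above; the fallback-to-English rule is my own assumption, not Pentaho's documented behavior), locale-driven column resolution could be sketched as:

    ```python
    # Hypothetical sketch: map a user's locale to the matching localized
    # physical column, falling back to a default locale when no match exists.
    LOCALIZED_COLUMNS = {"EN": "DESC_EN", "ES": "DESC_ES", "FR": "DESC_FR"}

    def resolve_description_column(locale, default="EN"):
        """Return the physical column for the given locale, e.g. 'es' -> DESC_ES."""
        return LOCALIZED_COLUMNS.get(locale.upper(), LOCALIZED_COLUMNS[default])
    ```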

    There is nothing blocking us from doing all these cool things, except for a chronic and severe lack of time 😉


    Well Matt, it is interesting to know that the selection predicate, i.e. the join condition, changes dynamically in the architecture you designed, which could lead to the addition and deletion of tables in the query. I guess all that information comes from the metadata, and instead of a report writer we need a metadata writer who carefully designs the metadata, knowing how the query is built at runtime.

    Is it possible for you to give some insight into the query generation process: any resources online, or anything else that could be shared?


  9. The query generation is pretty simple actually (for large values of simple).
    The user selects a number of business columns. We generate a unique list of tables from that selection.
    We know the tables used because we know the business tables.
    Each business table maps to a physical (RDBMS) table. As such the same physical table can appear more than once.
    We also “fill in the blanks” when the tables do not have a direct link: we choose the shortest path between 2 business tables. This is where the cost-based optimization comes in. For example, if you have 2 fact tables linked to the same 2 dimensions you selected columns from, we will take the one with the lowest cost (based on the relative size property).

    When that is done, we have selected and renamed the user-selected columns and tables. Then we will be able to add the conditions that the user specified by simply doing an “AND” to the joins.

    Architecturally, the metadata and the generation of the SQL query are separate, but obviously the second leans heavily on the first.
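    The path selection described above can be sketched minimally, assuming a join graph keyed by business table and a relative-size cost per table. The table names, sizes, and the use of Dijkstra's algorithm here are my own illustration, not the actual Pentaho implementation:

    ```python
    import heapq

    def cheapest_join_path(joins, sizes, start, goal):
        """Dijkstra over the join graph: the cost of joining in a table is
        its relative size, so the cheaper of two candidate fact tables wins."""
        queue = [(0, start, [start])]
        visited = set()
        while queue:
            cost, table, path = heapq.heappop(queue)
            if table == goal:
                return path, cost
            if table in visited:
                continue
            visited.add(table)
            for neighbor in joins.get(table, ()):
                if neighbor not in visited:
                    heapq.heappush(queue, (cost + sizes[neighbor], neighbor, path + [neighbor]))
        return None, float("inf")

    # Two fact tables link the same two dimensions; the smaller one is chosen.
    joins = {"dim_customer": {"fact_small", "fact_big"},
             "dim_product": {"fact_small", "fact_big"},
             "fact_small": {"dim_customer", "dim_product"},
             "fact_big": {"dim_customer", "dim_product"}}
    sizes = {"dim_customer": 100, "dim_product": 50,
             "fact_small": 1_000, "fact_big": 1_000_000}
    path, _ = cheapest_join_path(joins, sizes, "dim_customer", "dim_product")
    # path routes through fact_small, never fact_big
    ```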

    All the best,


  10. Wow.. that is one heck of an easter egg! I just re-read this post and discovered it today.
    Will it soon become a supported feature? How does it relate to the “Cube Designer” product?

    I have enhancement requests and bug reports for it, but I won’t bother you with those until it is actually a feature. 🙂

  11. Hello Matt,

    Can you please clarify what you mean by: we “fill in the blanks” when the tables do not have a direct link? I suppose this direct link is part of the metadata, and if you dynamically find the join between two tables, can it be possible that the returned row count of the query varies with the columns selected by the user?


  12. The number of rows returned by the query can ONLY vary if the data warehouse (star schema in this case) is set up incorrectly.
    In a well-conceived data warehouse, the relationship between the dimensions and the fact table(s) is always 1:N, not 0/1:N.
    That is the main reason for having an “unknown row” (typically key value 0, 1 or even -1) in the slowly changing dimensions for those rows that can’t be mapped to a valid key.
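    As a toy sketch of that convention during the ETL key lookup (the names and data are illustrative; as noted above, 0, 1 or -1 are all used as the reserved key in practice):

    ```python
    UNKNOWN_KEY = 0  # reserved surrogate key for the "unknown row"

    def lookup_dimension_key(natural_key, dimension_index):
        """Return the surrogate key for a fact's natural key; fall back to the
        unknown-row key so the dimension-to-fact relationship stays 1:N."""
        return dimension_index.get(natural_key, UNKNOWN_KEY)

    # Illustrative dimension index: natural key -> surrogate key.
    customer_index = {"CUST-1": 101, "CUST-2": 102}
    ```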


  13. Thanks Matt,

    I am not using a star schema, but am trying to implement an ad-hoc reporting system without modifying the underlying database. This reporting system is ad-hoc in the sense that the join condition is determined by the selection predicate: the join condition changes with changes in the selection predicate.

    I suppose this is not the case in Pentaho. Meaning: if columns A.B, B.C and D.E are selected and the join condition is A.B = B.C AND B.C = D.E, it will remain the same even if only the two columns A.B and B.C are selected. Is it the case in Pentaho that the join condition remains the same?

    If so what do you mean by the following statement…

    [quote]We also “fill in the blanks” when the tables do not have a direct link. We chose the shortest path between 2 business tables. This is where the cost based optimization comes in. For example, if you have 2 fact tables linked to the same 2 dimensions you selected columns from, we will take the one with the lowest cost. (based on the relative size property)

  14. Hi Matt,

    I used the Metadata Editor to create a domain which is quite huge, with 10 business models. I am now facing a problem while saving it. Is this an issue with the size of the metadata model? Can you please throw some light on this?

