Dimensional Analysis of Data Flow Programs

Date

2022-05-24

Authors

Shennat, Abdulmonem Ibrahim

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Our main objective is to design Dimensional Analysis (DA) algorithms for the multidimensional dialect PyLucid of Lucid, the equational data flow language. The significance is that the DA is indispensable for an efficient implementation of multidimensional Lucid and should aid the implementation of other data flow systems, such as Google’s TensorFlow. Data flow is a form of computation in which components of multidimensional datasets (MDDs) travel on communication lines in a network of processing stations. Each processing station incrementally transforms its input MDDs to its output, another (possibly very different) MDD. MDDs are very common in Health Information Systems and data science in general. An important concept is that of relevant dimension. A dimension is relevant if the coordinate of that dimension is required to extract a value. It is very important that in calculating with MDDs we avoid non-relevant dimensions, otherwise we duplicate entries (say, in a cache) and waste time and space. Suppose, for example, that we are measuring rainfall in a region. Each individual measurement (say, of an hour’s worth of rain) is determined by location (one dimension), day, (a second dimension) and time of day (a third dimension). All three dimensions are a priori relevant. Now suppose we want the total rainfall for each day. In this MDD (call it N) the relevant dimensions are location and day, but time of day is no longer relevant and must be removed. Normally this is done manually. However, can this process be automated? We answer this question affirmatively by devising and testing algorithms that produce useful and reliable approximations (specifically, upper bounds) for the dimensionalities of the variables in a program. By dimensionality we mean the set of relevant dimensions. For example, if M is the MDD of raw rain measurements, its dimensionality is {location, day, hour}, and that of N is {location, day}. Note that the dimensionality is more than just the rank, which is simply the number of dimensions. Previously, there’s extensive research on dataflow itself, which we summarize. However, an exhaustive literature search uncovered no relevant previous DA work other than that of the GLU (Granular Lucid) project in the 90s. Unfortunately the GLU project was funded privately and remains proprietary – not even the author has access to it. Our methodology is that we proceeded incrementally, solving increasingly difficult instances of DA corresponding to increasingly sophisticated language features. We solved the case of one dimension (time), two dimensions (time and space), and multiple dimensions. We also solved the difficult problem (which the GLU team never solved) of determining the dimensionality of programs that include user defined functions, including recursively defined functions. We do this by adapting the PyLucid interpreter (to produce the DAM interpreter) to evaluating the entire program over the (finite) domain of dimensionalities. As a result, the experimentally validated algorithms in our dissertation can produce useful upper bounds for the dimensionalities of the variables in multidimensional PyLucid programs. That also includes those with user-defined functions

Description

Keywords

Dataflow, Dimensional Analysis, PyLucid Interpreter, Multipledimensional, DAM Interpreter, User Defined Functions, Lucid

Citation