Tag: research

Interactive toxicogenomics

May 4th, 2017 — 10:14am

If you work in toxicology or drug discovery, you might be familiar with Open TG-GATEs, a large transcriptomics database that catalogues gene expression responses to well-known drugs and toxins. This database was developed by Japan’s Toxicogenomics Project over many years, as a public-private partnership, and remains a very valuable resource. As with many large datasets, despite the openness, accessing and working with this data can require considerable work. Data must always be placed in a context, and these contexts must be continually renewed. One user-friendly interface that simplifies access to this data is Toxygates, which I began developing as a postdoc at NIBIOHN in the Mizuguchi Lab in 2012 (and of which I am still the lead developer). As a web application, Toxygates lets you look at data of interest in context, together with annotations such as gene ontology terms and metabolic pathways, as well as visualisation tools.

We are now releasing a new major version of Toxygates, which, among many other new features, allows you to perform and visualise gene set clustering analyses directly in the web browser. Gene sets can also be easily characterised through an enrichment function, which is supported by the TargetMine data warehouse. Last but not least, users can now upload their own data and cluster and analyse it in context, together with the Open TG-GATEs data.

Our new paper in Scientific Reports documents the new version of Toxygates and illustrates the use of the new functions through a case study performed on the hepatotoxic drug WY-14643. If you are curious, give it a try.

When I began the development as a quick prototype, I had no idea that the project would still be evolving many years later. Toxygates represents considerable work and many learning experiences for me as a researcher and software engineer, and I’m very grateful to everybody who has collaborated with us, supported the project, and made our journey possible.


Comment » | Bioinformatics

Equipmental visibility and barriers to understanding

July 12th, 2013 — 9:28pm

The following is an excerpt from a text I am currently in the process of writing, which may or may not be published in this form. The text is concerned with the role of software in the scientific research process, and what happens when researchers must interact with software instead of hardware equipment, and finally the constraints that this places on the software development process.

Technological development since the industrial revolution has made equipment more intricate. Where we originally had gears, levers and pistons, we progressed via tape, vacuum tubes and punch cards to solid state memory, CPUs and wireless networks. The process of the elaboration of technology has also been the process of its hiding from public view. An increasing amount of complexity is packed into compact volumes and literally sealed into “black boxes”. This does not render the equipment inaccessible, but it does make it harder to understand and manipulate as soon as one wants to go outside of the operating constraints that the designers foresaw. As we have already noted, this poses problems to the scientific method. Scientists are human, and they engage with their equipment through the use of their five senses. Let us suggest a simple rule of thumb: the more difficult equipment is to see, touch, hear etc., the more difficult it becomes to understand it and modify its function. The evolution of technology has happened at the expense of its visibility. The user-friendly interface that provides a simple means of interacting with a complex piece of machinery, which initially is very valuable, can often become a local maximum that is difficult to escape if one wants to put the equipment to new and unforeseen uses. We may note two distinct kinds of user-friendly interfaces: interfaces where the simplified view closely approximates the genuine internals of the machinery, and interfaces where the simplified view uses concepts and metaphors that have no similarity to those internals. The former kind of interface we will call an authentic simplification, the latter an inauthentic simplification.

Of course, software represents a very late stage in the progression from simple and visible to complex and hidden machinery. Again we see how software can both accelerate and retard scientific studies. Software can perform complex information processing, but it is much harder to interrogate than physical equipment: the workings are hidden, unseen. The inner workings of software, which reside in source code, are notoriously hard to communicate. A programmer watching another programmer at work for hours may not fully be able to understand what kind of work is being done, even if both are highly skilled, unless a disciplined coding style and development methodology is being used. Software is by its very nature something hidden away from human eyes: from the very beginning it is written in artificial languages, which are then gradually compiled into even more artificial languages for the benefit of the processor that is to interpret them. Irreversible, one-way transformations are essential to the process of developing and executing software. This leads to what might be called a nonlinearity when software equipment is being used as part of an experimental setup. Whereas visible, tangible equipment generally yields more information about itself when inspected, and whereas investigators generally have a clear idea how hard it is to inspect or modify such equipment, software equipment often requires an unknown expenditure of effort to inspect or modify – unknown to all except those programmers who have experience working with the relevant source code, and even they will sometimes have a limited ability to judge how hard it would be to make a certain change (software projects often finish over time and over budget, but almost never under time or under budget). This becomes a severe handicap for investigators. 
A linear amount of time, effort and resources spent understanding or modifying ordinary equipment will generally have clear payoffs, but the inspection and modification of software equipment will be a dark area that investigators, unless they are able to collaborate well with programmers, will instinctively avoid.

To some degree these problems are inescapable, but we suggest the maximal use of authentic simplification in interfaces as a remedy. In addition, it is desirable to have access to multiple levels of detail in the interface, so that each level is an authentic simplification of the level below. In such interface strata, layers have the same structure and only differ in the level of detail. Thus, investigators are given, as far as possible, the possibility of smooth progression from minimal understanding to full understanding of the software. The bottom level interface should in its conceptual structure be very close to the source code itself.

Comment » | Bioinformatics, Computer science, Philosophy, Software development

The “Friedrich principles” for bioinformatics software

September 13th, 2012 — 12:51am

I’ve just come back from Biohackathon 2012 in Toyama, an annual event, traditionally hosted in Japan, where users of semantic web technologies (such as RDF and SPARQL) in biology and bioinformatics come together to work on projects. This was a nice event with an open and productive atmosphere, and I got a lot out of attending. I participated in a little project that is not quite ready to be released to the wider public yet. More on that in the future.

Recently I’ve also had a paper accepted at the PRIB (Pattern Recognition in Bioinformatics) conference, jointly with Gabriel Keeble-Gagnère. The paper is a slight mismatch for the conference, as it is really focussing on software engineering more than pattern recognition as such. In this paper, titled “An Open Framework for Extensible Multi-Stage Bioinformatics Software” (arxiv) we make a case for a new set of software development principles for experimental software in bioinformatics, and for big data sciences in general. We provide a software framework that supports application development with these principles – Friedrich – and illustrate its application by describing a de novo genome assembler we have developed.

The actual gestation of this paper in fact occurred in the reverse order from the above. In 2010, we began development on the genome assembler, at the time a toy project. As it grew, it became a software framework, and eventually something of a design philosophy. We hope to keep building on these ideas and demonstrate their potential more thoroughly in the near future.

For the time being, these are the “Friedrich principles” in no particular order.

  • Expose internal structure.
  • Conserve dimensionality maximally. (“Preserve intermediate data”)
  • Multi-stage applications. (Experimental and “production”, and moving between the two)
  • Flexibility with performance.
  • Minimal finality.
  • Ease of use.

Particularly striking here is (I think) the idea that internal structure should be exposed. This is the opposite of encapsulation, an important principle in software engineering. We believe that when the users are researchers, they are better served by transparent software, since the workflows are almost never final but subject to constant revision. But of course, the real trick is knowing what to make transparent and what to hide – an economy is still needed.
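
As a rough illustration of the first two principles, here is a minimal Java sketch of a pipeline that keeps every intermediate result available for inspection instead of hiding it behind a single final output. The class `TransparentPipeline`, its methods and the toy "reads" are all hypothetical names invented for this example; this is not the Friedrich API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical sketch, not the Friedrich API: a pipeline whose stages
// and intermediate results stay visible to the researcher.
class TransparentPipeline {
    // Stages in insertion order, keyed by a human-readable name.
    private final Map<String, UnaryOperator<List<String>>> stages = new LinkedHashMap<>();
    // Output of every stage from the most recent run, keyed by stage name.
    private final Map<String, List<String>> intermediates = new LinkedHashMap<>();

    TransparentPipeline stage(String name, UnaryOperator<List<String>> step) {
        stages.put(name, step);
        return this;
    }

    List<String> run(List<String> input) {
        intermediates.clear();
        List<String> current = input;
        for (Map.Entry<String, UnaryOperator<List<String>>> e : stages.entrySet()) {
            // Keep an immutable snapshot of each stage's output.
            current = List.copyOf(e.getValue().apply(current));
            intermediates.put(e.getKey(), current);
        }
        return current;
    }

    // Read-only view of the intermediate data ("expose internal structure").
    Map<String, List<String>> intermediates() {
        return Collections.unmodifiableMap(intermediates);
    }

    public static void main(String[] args) {
        TransparentPipeline p = new TransparentPipeline()
            .stage("filter", reads -> reads.stream()
                .filter(r -> r.length() >= 4)
                .collect(Collectors.toList()))
            .stage("reverse", reads -> reads.stream()
                .map(r -> new StringBuilder(r).reverse().toString())
                .collect(Collectors.toList()));

        System.out.println(p.run(List.of("ACGT", "AC", "GGTA")));
        // The output of the earlier stage is still there to be examined.
        System.out.println(p.intermediates().get("filter"));
    }
}
```

A fully encapsulated design would return only the final list. Here the cost is extra memory for the snapshots, which is one place where the "economy" mentioned above comes in.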

Comment » | Bioinformatics, Computer science, Software development

My Ph.D. Thesis: “Extending the Java Programming Language for Evolvable Component Integration”

March 26th, 2012 — 2:22pm

After three very hectic first months of 2012, the final version of my Ph.D. thesis has been submitted and I’ve gone through the graduation ceremonies. From the 1st of April I will be a postdoctoral associate in bioinformatics at the National Institute of Biomedical Innovation in Osaka, Japan. I will comment further on my Ph.D. experience and my entry into bioinformatics when I can.

Being a Ph.D. student in the Honiden laboratory has been a great experience, and I am very grateful to professor Honiden and to the other lab members for their support.

My thesis and the associated slides are available. The abstract is as follows.

In the last few decades, software systems have become less and less atomic, and increasingly built according to the component-based software development paradigm: applications and libraries are increasingly created by combining existing libraries, components and modules. Object-oriented programming languages have been especially important in enabling this development through their essential feature of encapsulation: separation of interface and implementation. Another enabling technology has been the explosive spread of the Internet, which facilitates simple and rapid acquisition of software components. As a consequence, now, more than ever, different parts of software systems are maintained and developed by different people and organisations, making integration and reintegration of software components a very challenging problem in practice. 

One of the most popular and widespread object-oriented programming languages today is the Java language, which through features such as platform independence, dynamic class loading, interfaces, absence of pointer arithmetic, and bytecode verification, has simplified component-based development greatly. However, we argue that Java encapsulation, in the form supported by its interfaces, has several shortcomings with respect to the need for integration. API clients depend on the concrete forms of interfaces, which are collections of fields and methods that are identified by names and type signatures. But these interfaces do not capture essential information about how classes are to be used, such as usage protocols (sequential constraints), the meaning and results of invoking a method, or useful ways for different classes to be used together. Such constraints must be communicated as human-readable documentation, which means that the compiler cannot by itself perform tasks such as integrating components and checking the validity of an integration following an upgrade. In addition, many trivial interface changes, such as the ones that may be caused by common refactorings, do not lead to complex semantic changes, but they may still lead to compilation errors, necessitating a tedious manual upgrade process. These problems stem from the fact that client components depend on exact syntactic forms of interfaces they are making use of. In short, Java interfaces and integration dependencies are too rigid and capture both insufficient and excessive information with respect to the integration concern. 

We propose a Java extension, Poplar, which enriches interfaces with a semantic label system, which describes functional properties of variables, as well as an effect system. This additional information enables us to describe integration requests declaratively using integration queries. Queries are satisfied by integration solutions, which are fragments of Java code. Such solutions can be found by a variety of search algorithms; we evaluate the use of the well-known partial order planning algorithm with certain heuristics for this purpose. A solution is guaranteed to have at least the useful effects requested by the programmer, and no destructive effects that are not permitted. In this way, we generate integration links (solutions) from descriptions of intent, instead of making programmers write integration code manually. When components are upgraded, the integration links can be verified and accepted as still valid, or regenerated to conform to the new components, if possible. The design of Poplar is such that verification and reintegration can be carried out in a modular fashion. Poplar aims to provide a sound must-analysis for the establishment of labels, and a sound may-analysis for the deletion of labels. We describe the semantics of Poplar informally using examples, and provide a formal specification of Poplar, which is based on Middleweight Java (MJ). We describe an implementation of a Poplar integration checker and generator, called Jardine, which compiles Poplar code to pure Java. We evaluate the practical applicability of Jardine through a case study, which is carried out by refactoring the JFreeChart library. We also discuss the applicability of Poplar to Martin Fowler’s well known collection of refactorings. Our results show that Poplar is highly applicable to a wide range of refactorings and that the evolution of integrated components becomes considerably simpler.
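
To make the abstract's point about usage protocols concrete, here is a small plain-Java sketch (it does not use Poplar syntax). The `Recorder` interface and `ListRecorder` class are hypothetical, invented for this illustration. The interface states method names and type signatures, but the rule "call open() before write()" lives only in documentation, so the compiler accepts a call sequence that fails at run time.

```java
import java.util.ArrayList;
import java.util.List;

class ProtocolDemo {
    public static void main(String[] args) {
        Recorder r = new ListRecorder();
        try {
            // Type-correct, so it compiles, but it violates the usage protocol.
            r.write("too early");
        } catch (IllegalStateException e) {
            System.out.println("Detected only at run time: " + e.getMessage());
        }
        // The intended sequence: open, write, close.
        r.open();
        r.write("ok");
        r.close();
    }
}

// Hypothetical interface: names and signatures only.
// Nothing here says that open() must precede write().
interface Recorder {
    void open();
    void write(String entry);
    void close();
}

class ListRecorder implements Recorder {
    private final List<String> entries = new ArrayList<>();
    private boolean open = false;

    public void open() { open = true; }

    public void write(String entry) {
        // The sequential constraint can only be checked dynamically.
        if (!open) throw new IllegalStateException("write() called before open()");
        entries.add(entry);
    }

    public void close() { open = false; }

    List<String> entries() { return entries; }
}
```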

Comment » | Computer science, Life

Pointers in programming languages

August 26th, 2011 — 12:21am

It is likely that few features cause as many problems as pointers and references in statement-oriented languages, such as C, C++ and Java. They are powerful, yes, and they allow us to control quite precisely how a program is to represent something. We can use them to conveniently compose objects and data without the redundancy of replicating information massively. In languages like C they are even more powerful than in Java, since just about any part of memory can be viewed as if it were just about anything through the use of pointer arithmetic, which is indeed frightening.

But they also complicate reasoning about programs enormously, both human reasoning and automated reasoning. Pointers allow any part of the program to have side effects in any other part of the program (if we have a reference to an object that originated there), and they make it very hard to reason about the properties that an object might have at a given point in time (since we generally have no idea who might hold a reference to it – it is amazing that programmers are forced to track this in their heads, more or less). In my effort to design my own language, multiple pointers to the same objects – aliases – have come back from time to time to bite me and block elegant, attractive designs. I believe that this is a very hard problem to design around. Aliased pointers set up communication channels between arbitrary parts of a program.
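
A minimal Java sketch of such a communication channel follows. The class names are made up for the example. `Report` keeps the reference it is handed rather than copying the list, so code that later touches the original list silently changes what the report sees.

```java
import java.util.ArrayList;
import java.util.List;

class AliasingDemo {
    // Stores the caller's list directly, which creates an alias.
    static class Report {
        private final List<String> lines;

        Report(List<String> lines) {
            this.lines = lines; // no defensive copy
        }

        int length() {
            return lines.size();
        }
    }

    public static void main(String[] args) {
        List<String> buffer = new ArrayList<>();
        buffer.add("header");

        Report report = new Report(buffer);
        System.out.println(report.length()); // prints 1

        // Elsewhere in the program, someone reuses the buffer...
        buffer.clear();

        // ...and the report has changed without ever being mentioned.
        System.out.println(report.length()); // prints 0
    }
}
```

Copying in the constructor (for example with `new ArrayList<>(lines)`) removes this particular alias, but it has to be done by convention: nothing in the language tracks who else holds the reference.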

Nevertheless, attempts have been made, in academia and in research labs, to solve this problem. Fraction-based permissions track how many aliases exist and endow each alias with specific permissions to access the object that is referred to. Ownership analysis forces access to certain objects to go through special, “owning” objects. Unique or “unshared” pointers in some language extensions restrict whether aliases may be created or not. But so far no solution has been extremely attractive and convenient, and none has made it into mainstream languages. (I know that Philipp Haller made a uniqueness plugin for the Scala compiler, but I believe it is not in wide use.)
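
To give a flavour of the uniqueness idea only, here is a hedged Java sketch that approximates it with a run-time check. The `Unique` class is hypothetical and is not taken from any of the systems mentioned above, which enforce their rules statically, at compile time. In this sketch a handle becomes unusable once its value has been moved to another handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class UniqueDemo {
    // Hypothetical holder: at most one handle is valid at any time.
    static final class Unique<T> {
        private T value;

        Unique(T value) {
            this.value = Objects.requireNonNull(value);
        }

        // Transfers the value to a fresh handle and invalidates this one.
        Unique<T> move() {
            T v = get();
            value = null;
            return new Unique<>(v);
        }

        // Fails if this handle has already been moved from.
        T get() {
            if (value == null) throw new IllegalStateException("handle was moved");
            return value;
        }
    }

    public static void main(String[] args) {
        Unique<List<String>> first = new Unique<>(new ArrayList<>());
        Unique<List<String>> second = first.move();
        second.get().add("only reachable through 'second'");

        try {
            first.get(); // the stale handle is rejected, but only at run time
        } catch (IllegalStateException e) {
            System.out.println("Stale handle: " + e.getMessage());
        }
    }
}
```

The sketch also shows why a library alone is not enough: nothing stops a caller from storing the result of `get()` in a field and thereby creating an ordinary alias. Ruling that out requires support from the type system.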

If we are to attempt further incremental evolution of the C-family languages, aliased pointers are, in my opinion, one of the most important issues we can attack.

2 comments » | Computer science, Software development
