Testing questions about testing!

Today’s post comes courtesy of Mike Jackson, also from the Software Sustainability Institute. If the Institute was the Dukes of Hazzard television show, with Steve as Bo Duke, then Mike Jackson would surely be Luke Duke.  In this post, Mike answers a testing question about testing frameworks in Python.

Software testing is a vital part of software development. It not only allows us to demonstrate that our software satisfies its requirements but to ensure that our software is both correct and robust. Automated software testing provides us with a safety net during development, allowing us to fix bugs, make enhancements and extensions secure in the knowledge that if we break anything then the tests will catch this. After all, there are few things worse than fixing a bug to discover later that, in doing so, we’ve introduced a new one.

Philip Maechling of the Southern California Earthquake Center (SCEC), at USC, recently contacted the Institute with questions about software testing. Philip and his colleagues develop scientific software that outputs computational results into files. These files are typically simple ASCII text files but contain series of floating point numbers, e.g. time series. Their acceptance testing involves comparing these files to existing reference result files.

Philip posed two questions:

  • Many unit test frameworks (e.g. JUnit and PyUnit) are focused around instantiating an object, or other software module, within a test class, calling methods on that module, then checking the values returned against expected values. While file comparisons can be done with such frameworks, they are complicated due to the need for floating point compares (which is tricky at the best of times), and differences in header information, or non-significant file contents. So, are you aware of any testing tools designed to support tests that are based on file-based comparisons?
  • In our file-based comparison tests, we often use the same reference files in multiple tests. In some testing circles, a directory of tests and expected test results are collected into a datastore called an “oracle”. When you want to know the correct results, you look up your test and find the expected result in this oracle. Are you aware of any software unit or acceptance testing tools that support the idea of a test oracle? The concept is simple, and we have implemented a couple of our own oracle datastores, but we seem to re-invent this each project. If there is a standard solution, I am interested in trying it out.

Question 2 is a generalisation of 1, using a set of reference files across multiple tests. As Philip comments, these reference files can be termed an “oracle”. More generally, “oracle” can be used to refer to anything which validates the outputs of a test, i.e. checks the outputs of the software during the test against the expected outputs. So, for example, in a PyUnit test that compares the outputs of a function, for some specific inputs, to some hard-coded values, the comparison code and the hard-coded values serve as the oracle. If a developer tests a GUI and assesses the correctness of its behaviour then they are serving as the oracle. Douglas Hoffman’s paper A Taxonomy for Test Oracles from Quality Week, 1998, gives an overview and taxonomy of oracles.

For question 1, an internet search did not reveal any Python frameworks that explicitly support tests that involve comparing floating point data files for equality. Even if a framework were available, there would still be work required by the developer to customise it towards the structure and content of their specific files. Two frameworks that come close to Philip’s requirements are Cram and TextTest. Cram is a framework for testing command-line applications. It runs commands and compares their outputs to expected outputs. The outputs are compared using pattern matching and regular expressions. TextTest is similar but also has support for GUI testing. Outputs are compared directly, but filters are provided to handle run-dependent content and to compare floating point values against user-defined tolerances.

One can envisage at least two general approaches to comparing output files of floating point values to reference files. The first is to:

  • Write a convertor that can be used to convert the output file data format into a simpler format containing just the floating point data.
  • Write a validator that takes in two floating point data sets and compares these, applying rounding or allowing for equality within defined tolerances.
  • Write each test to load the expected results from the reference files, the actual results from the output files, apply the convertor to both sets of results, then use the validator to compare the two.
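
As a rough illustration of this first approach, here is a minimal sketch in Python. It assumes a made-up file layout in which header lines start with "#" and every other line holds whitespace-separated numbers; the function names convert and validate are mine, not part of any framework.

```python
import math


def convert(lines, comment_prefix="#"):
    """Convertor: reduce the lines of a result file to a flat list of floats.

    Assumes header lines start with comment_prefix and that every other
    non-blank line contains only whitespace-separated numbers.
    """
    values = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue  # skip blank lines and run-dependent headers
        values.extend(float(token) for token in stripped.split())
    return values


def validate(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Validator: return (index, actual, expected) for each value out of tolerance."""
    if len(actual) != len(expected):
        raise ValueError("data sets contain different numbers of values")
    return [
        (index, a, e)
        for index, (a, e) in enumerate(zip(actual, expected))
        if not math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Each test loads both files, converts them, then validates.
reference = convert(["# run on host A", "0.0 1.000000", "0.1 1.105171"])
output = convert(["# run on host B", "0.0 1.0000001", "0.1 1.105170"])
mismatches = validate(output, reference, rel_tol=1e-5)
```

In a real test the lists of lines would come from open(path).readlines(), and the tolerances would be chosen to suit the data.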

The second is to:

  • Manually convert the reference files into template files. Regular expressions can be used to both handle parts of the files that might vary across test runs (e.g. headers) as well as for specifying expected floating point values.
  • Write a validator which compares an output file to a reference file, applying the regular expressions in the reference file to assess whether the output file matches.
  • Write each test to apply the validator, comparing the output files to the reference files.
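
A minimal sketch of this second approach might look like the following. The template content and the function name template_failures are hypothetical; each template line is simply a regular expression that the corresponding output line must match in full.

```python
import re


def template_failures(output_lines, template_lines):
    """Validator: return the 1-based numbers of output lines that do not match.

    Each template line is a regular expression that must match the whole
    of the corresponding output line.
    """
    if len(output_lines) != len(template_lines):
        raise ValueError("output and template have different numbers of lines")
    failures = []
    pairs = zip(output_lines, template_lines)
    for number, (line, pattern) in enumerate(pairs, start=1):
        if re.fullmatch(pattern, line.rstrip("\n")) is None:
            failures.append(number)
    return failures


# Hypothetical template: the date in the header may vary between runs, and
# each value must agree with the reference in its first three decimal places.
TEMPLATE = [
    r"# generated on \d{4}-\d{2}-\d{2}",
    r"0\.000\d*\s+1\.000\d*",
    r"0\.100\d*\s+1\.105\d*",
]

output = ["# generated on 2011-07-15", "0.0000 1.00000", "0.1000 1.10517"]
failures = template_failures(output, TEMPLATE)
```

Note that a pattern such as 0\.100\d* compares digits, not values: an output of 0.0999999 would be rejected even though it is numerically very close to 0.1.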

Personally, I prefer the former solution as it avoids messing around with regular expressions.

For either approach, there are a number of Python libraries and other resources that can be used to construct a solution. These include:

  • PyUnit, Python’s unit test library. This has test assertion commands (assertAlmostEqual and assertNotAlmostEqual) for comparing floating point values to a specific number of decimal places or within a specific tolerance.
  • Python difflib library. This provides functions to compare two files and return the lines for which they differ. This is similar to the output from CVS and SVN “diff” commands. Cram uses difflib.
  • Python re regular expression library. Cram uses re.
  • Python filecmp file comparison library.
  • An introduction to writing regular expressions for floating point numbers.
  • TextTest (source code) and Cram (source code) are both open source products and it might be possible to reuse their functionality for comparing script files.
  • Hamcrest library for building “matchers” which are useful for expressing custom comparisons. It has been ported to many languages including Python.
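
To give a flavour of the first two of these, here is a small sketch using only the standard library. The test class name, the numbers and the helper describe_differences are invented for illustration.

```python
import difflib
import unittest


class CompareResultsTest(unittest.TestCase):
    def test_decimal_places(self):
        # Passes because the difference rounds to zero at 5 decimal places.
        self.assertAlmostEqual(1.105171, 1.105170, places=5)

    def test_absolute_tolerance(self):
        # Passes because the absolute difference is no greater than delta.
        self.assertAlmostEqual(1.105171, 1.105170, delta=1e-5)

    def test_not_almost_equal(self):
        # Passes because the values differ in the third decimal place.
        self.assertNotAlmostEqual(1.105171, 1.106171, places=5)


def describe_differences(reference_lines, output_lines):
    """Return a unified diff of two lists of lines, similar to svn diff output."""
    return list(
        difflib.unified_diff(
            reference_lines,
            output_lines,
            fromfile="reference",
            tofile="output",
            lineterm="",
        )
    )
```

Saved in a file, the tests can be run with python -m unittest. The diff helper is useful for reporting which lines differ, but it compares text exactly, so it needs to be combined with some form of tolerance handling for floating point data.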

Mike

Choosing suitable open-source software

At the SeIUCCR Summer School in September I was asked a blinder of a question:

“How do I choose sustainable software for my project?”

Assuming an open-source context for this question, there are many things worth considering. It could be that the functionality of your software needs to be extended. Not wanting to re-invent the wheel, you’re looking for an appropriate library to provide that functionality. Or perhaps you have an analysis tool that outputs a certain data format that you need to post-process into an image. What should you look for in software?

It’s easy to reach for the first software package you come across that seems to do what you want. Perhaps it’s already installed on your target platform, or it’s the first thing you found on Google. But picking the wrong software can have expensive consequences if it doesn’t do everything you want or, even worse, development and support come to a stop!

Taking a little time to make an informed choice is time well spent. So what questions can you ask about the software to find out if it’s suitable?

First off, and most obvious: does it do what you want? Be sure you know your requirements, not only what you need now, but what you need in the future. What are the goals? If it’s for a wider community, think about the goals of the user community too. If the software doesn’t meet your needs, you should check to see if the functionality can be extended, or look for more suitable software.

Have a look at the software’s support, and check for an active user and developer community. Check the forums, issue tracker and mailing lists (they should have them!) for activity and responsiveness. If you run into problems, support is your first port of call, so it must be good.

Most importantly, check that the software has a future! If the software’s development and support were to stop, you could find yourself looking around for a replacement. This is ultimately what you are trying to avoid! Some positive indicators for sustainability – in addition to a well established community – are a roadmap, a solid track record of previous releases and an actively maintained website. If you are aware of appropriate open standards that are commonly used within your research field, does the software use them?

How is the software provided? Check that documentation is available, and whether the pre-requisites for the software are appropriate for your needs. If you plan to extend the software then access to appropriate source code is very important – is it provided via a source code repository, and is the code in an understandable and maintainable state that you can extend?

Perhaps the most important of all, check the licence conditions of the software. If you intend to distribute the software, check that the licence allows this, and check that you can distribute any modifications or extensions you make.

Of course, with such a complex question, there’s always more to know. Check out the Software Sustainability Institute’s guide on Choosing the right open source software for your project, which goes into more detail.

Lastly, don’t be afraid to ask the software developers if you have any questions. If you get prompt and helpful responses, that’s a good indication you’ll be able to get the right support should you need it. If not, it might be time to look elsewhere. Now where did you put that list of alternatives…

Security Decay: Enter the Dragon!

Security in complex systems is always a tricky business. Consider production Grid infrastructures as an example. Establishing working trust relationships between the users and the infrastructure, and between the systems themselves, is a mammoth task. Solving problems with such systems is also very tricky, as I’ve previously found when developing EU-wide Grid interoperability demonstrators of open standards. These problems appear like dragons: huge, daunting, and difficult to defeat.

The UK National Grid Service asked Steve (well, the Institute really) to help them out with their SARoNGS system. Our arrangement was very effective. The Software Sustainability Institute provided development effort for the investigation, whilst the NGS fixed issues and offered the in-depth systems knowledge that only they could provide.

So what is SARoNGS all about? The Shibboleth Access to Resources on the NGS service greatly simplifies authentication to NGS resources by accepting institutional Shibboleth credentials. It’s great for users, because they don’t need to apply for, own and use an X509 certificate. However, it appeared that the automatically generated SARoNGS certificates were being rejected by the NGS’s Workload Management Service (WMS). In short, you could no longer use SARoNGS certificates to submit jobs through the WMS without seeing a rather ominous error light up the screen:

Connection failed: CA certificate verification failed

We were warned here be dragons, but we ploughed on heedless of the danger.

You may have heard of software decay. This can occur when the environment around a piece of software changes, which leads to failures in the system as a whole. For example, an update to a dependent library or to the operating system could cause a problem. Updating one of the ubiquitous jar files in Java, only to find some of the API functions have become deprecated, can also cause grief. The good news is that there are things you can do to avoid this problem, some of which I’ve looked at already.

Security problems are often esoteric and difficult to solve. The problem could be a software dependency issue, say a newly updated security library with a bug that incorrectly interprets certificate attributes. It could also be a problem with the way in which trust relationships are defined. Sophisticated production Grid systems often trust a veritable legion of Certificate Authorities (CAs). Each CA has its own CA trust certificates, Certificate Revocation Lists (CRLs – a list of certificates not to trust) and signing policies. (I won’t get into how VOMS fits into the picture in this post, but if you’re interested, let me know.) Sorting out certificate problems can be like looking for a needle in a haystack… in a tornado. However, once identified, these issues are often easy to fix.

Systems can also fail when you haven’t changed anything at all, and this was the case with the first problem we found with SARoNGS.

Time is an important concept in security. For example, the NGS proxy certificates have a limited lifetime to reduce their vulnerability if the proxy is compromised. CRLs must also be kept up to date. The problem with expired CRLs is that they can cause the entire authentication step to fail, and this is what had happened with SARoNGS: a CRL in a critical location had expired. We updated the CRL and the first dragon lay dead on our screens!
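
An expired CRL is easy to check for once you know to look. The sketch below is not what we ran on the NGS; it assumes you have obtained the CRL's expiry line with a command such as openssl crl -in <file> -noout -nextupdate, which I would expect to print something like nextUpdate=Aug  8 12:00:00 2011 GMT. The function names and the dates are made up for illustration.

```python
from datetime import datetime, timezone


def parse_next_update(line):
    """Parse a line such as 'nextUpdate=Aug  8 12:00:00 2011 GMT' (assumed format)."""
    label, _, value = line.strip().partition("=")
    if label != "nextUpdate":
        raise ValueError(f"unexpected line: {line!r}")
    fields = value.split()  # also collapses the double space before single-digit days
    if fields and fields[-1] == "GMT":
        fields.pop()
    parsed = datetime.strptime(" ".join(fields), "%b %d %H:%M:%S %Y")
    return parsed.replace(tzinfo=timezone.utc)


def crl_has_expired(line, now=None):
    """Return True if the CRL's next-update time is already in the past."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_next_update(line) < now


# Hypothetical example: a CRL that should have been refreshed by 8th August.
expired = crl_has_expired(
    "nextUpdate=Aug  8 12:00:00 2011 GMT",
    now=datetime(2011, 9, 1, tzinfo=timezone.utc),
)
```

Running a check like this regularly against every CRL a service relies on gives a warning before authentication starts failing.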

When establishing trust in Grid systems, you need to decide which certificates to trust and where in the system to trust them. The second problem with SARoNGS was caused by two different signing policies being used simultaneously. Some sites were intentionally configured to trust SARoNGS, and others were not. However, the installation of an update using the International Grid Trust Federation (IGTF) bundle meant that the UK e-Science signing policy reverted to the IGTF default: do not trust SARoNGS certificates. Again, an easy problem to fix, but a difficult one to identify. Once SARoNGS trust was reinstated in the signing policy (we used a modified NGS IGTF+ bundle) the problem was resolved and the last dragon soundly defeated.

And so the legend goes, the dragons of SARoNGS were slain. If you ever find yourself developing Grid software and run into a security brick wall, why not take a look at those conspicuous looking CRL and signing policy files? They could be dragons. And dragons need slaying!

“Programming is 10% science, 20% ingenuity, and 70% getting the ingenuity to work with the science.” – Anon

Software developers working on an academic research project will, unsurprisingly, do just that: develop software. But to what extent should a developer care about the research field in which they work? This and other related questions have popped up a few times recently, and this theme also permeated a few sessions at our Collaborations Workshop back in March.

The first thing to realise is that researchers and software developers both use long words, but they are generally different long words. Each group comprises experts (in their own discipline), but each group has its own language. For example (with elementary translations in [ ]) …

Researcher: Can you imagine what we could do with a more timely lattice energy landscape minimisation for these carboxylate salts!? [If only we could do our calculations faster!]

Developer: Nope – does it have something to do with matrices?


Developer: Did you see that amazing multi-threaded C++ app for distributed problem solving for Android appear on the appstore last night? [Just imagine how many calculations you could do in a shorter space of time - on a smartphone!]

Researcher: Sorry, I don’t watch Star Trek

Ok, so this is a little oversimplified. But such a disparity in communication means that real opportunities are missed. How do developers make sure that they understand the importance of what researchers are saying, and vice versa?

Developers and researchers are waking up to what can be achieved if they work together, and find ways to communicate effectively. As developers, we shouldn’t expect researchers to automatically believe in our religious arguments about proper software design. But if we understand the science a little bit better, then we increase our chance of identifying ways to improve the science through what we know of software – and there lies the true benefit for researchers. You only have to look at the successes of Taverna, OGSA-DAI, GridSAM and the ENGAGE projects to see what is possible when researchers and software developers come together in the right way. In the CPOSS ENGAGE project, the improved coupling of a number of software technologies enabled them to perform science that wasn’t even possible before.

Of course, taking the time to learn more about the science takes effort. But if you go that extra mile every once in a while, digging a little deeper into the science behind the software you’re writing, who knows? Software and software infrastructures have a lot to offer research, and you may just be the one to help the scientists make that important discovery.

One of the big problems with research, particularly in these austere times, is finding the money to travel to all those great conferences. You miss opportunities to present your work, and you can miss out on discovering first hand what is afoot in your research field.

The great news is that if you use software in your research and you have a good understanding of what’s happening in your field, funding is available from the Software Sustainability Institute. Regardless of discipline, the Institute will pay a number of researchers £3000 a year – in return for keeping the Institute up to date on the latest developments in the researcher’s field. This helps you with disseminating your greatness and keeping up to date, while allowing the Institute to build a network of Agents to understand which fields most need help.

There are some compelling benefits to consider:

  • Up to £3000 a year to attend the conferences and events that you want to attend
  • Your advocacy will ensure that your field benefits from the best support for software development
  • Add world-leading researchers to your professional network
  • Free attendance at training events for new tools and technologies
  • If you develop code, improve your knowledge of effective techniques for developing sustainable software
  • A great addition to your CV

Not bad eh? And you don’t have to be a professor or Principal Investigator to qualify – you just need to be in the know. The Institute is looking for applicants from all disciplines, especially from those fields flagged as strategically important to UK research: the ageing population, environment and climate change, the digital economy, energy and food security.

The closing date for applications is 8th August 2011. If you’re interested, or would like to nominate someone else, why not find out more and apply at http://software.ac.uk/join-our-agents-network?

Plus, saying ‘I’m an Agent for the SSI’ sounds cool.

A unit test framework in MATLAB?

You may recall a while back I looked at test-driven development, and covered unit testing. Well, I received a related question asking whether there was a unit test framework for MATLAB, so let’s have a quick look at a few of these…

Arguably the most popular is the xUnit Test Framework, which is compatible with MATLAB 7.6 (R2008a) or later. You can write unit tests using the standard MATLAB function files, or xUnit-style subclasses (like Java JUnit) and it has comprehensive documentation. There is also a very good technical article on Automated Software Testing for MATLAB which is aimed at researchers wanting to do unit testing in xUnit, complete with examples and advice. Greg Wilson from the excellent Software Carpentry project contributed to the article, and the Software Carpentry site has many general tutorials on using MATLAB.

There are others, including mlunit_2008a, which may also be worth a look, as well as the interesting Doctest for MATLAB, which works like doctest in Python – you embed simple tests into the function’s help description in the code. For example, you could specify in a MATLAB function:

function sum = subtract2(value)
% subtracts 2 from a number
%
% subtract2(value)
% returns (value - 2)
%
% Examples:
% >> subtract2(3)
% ans =
% 1
%
% >> subtract2([8 5])
% ans =
% 6 3
if ~ isnumeric(value) 
 error('subtract2(value) requires value to be a number'); 
end
sum = value - 2;

Then you can run doctest subtract2 and have those embedded tests returned:

TAP version 1.3
1.3
ok 1 - "subtract2(3)"
ok 2 - "subtract2([8 5])"

Whether this approach is suitable depends on the nature and complexity of the tests you wish to write.
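
For comparison, here is roughly what the same idea looks like with Python's own doctest module. This is a sketch, not a translation of the MATLAB tool: the helper run_embedded_tests is my own wrapper around the standard doctest classes, and the list example stands in for MATLAB's vector input.

```python
import doctest


def subtract2(value):
    """Subtract 2 from a number.

    >>> subtract2(3)
    1
    >>> [subtract2(v) for v in (8, 5)]
    [6, 3]
    """
    if not isinstance(value, (int, float)):
        raise TypeError("subtract2(value) requires value to be a number")
    return value - 2


def run_embedded_tests(function):
    """Run the examples in the function's docstring; return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(
        function.__doc__,
        {function.__name__: function},  # names visible to the examples
        function.__name__,
        None,  # no source file name
        0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


failed, attempted = run_embedded_tests(subtract2)
```

For code kept in a module file, python -m doctest yourfile.py does the same job without the wrapper.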

’til next time!

Developing software in an open way – Part II

So, continuing on from Part 1 of Developing software in an open way, let’s answer the last two aspects of the question from Alex Voss…

“Should I make all my source code available from the start to attract potential collaborators and to solicit contributions or should I keep it close initially to avoid getting locked into early solutions that have been taken up by others?”

You will have project goals to fulfil, so I think it’s always a good idea to start development internally. This means you can put your software firmly on the path to meeting your own project’s goals. When and how external involvement comes into play will depend on the project.

Firstly, it’s important to consider how you want to govern your open-source project. If you are worried about becoming locked-in to a solution chosen by others, you should stay away from the democratic/meritocratic systems, because you lose some control with these approaches (you can be outvoted). A benevolent dictator approach means that you alone have the final say on which contributions to include in your software – and which ones to exclude. You gain control, but benevolent dictators have to put in a lot of work (they’re doing all the work of the committee in a meritocracy) and require good diplomacy and community skills. Whichever approach you choose, you should always spell out the governance and contribution policy clearly so that your contributors know where they stand before they work on your project.

If your project’s success is dependent on collaborating with other people, you will want to start publicising your project early on. But before you start with publicity and getting people on board, it’s a good idea to familiarise yourself with an open-source infrastructure, such as SourceForge or Google Code. Once you have looked around and selected an infrastructure that meets your needs, you will know how to operate it and will be accustomed to uploading your developing code on a regular basis. With this groundwork in place, you can start seeking out and engaging people to participate in your project.

It’s also a good idea to architect your software so that it is readily extensible by other people, whilst not precluding the development of other alternative solutions. Software designed as a framework allows you to develop an initial release that extends the framework to accomplish a task needed for your project. Contributors can then plug in their code, which might extend the framework to accomplish the same task in a different way, or a new task altogether. The trade-off here with extensibility is between the time taken for initial development and the ease of long-term maintenance effort. You need to decide the right level of extensibility for your software.
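
As a toy illustration of what "designed as a framework" can mean, here is a minimal plug-in sketch in Python. Everything in it is hypothetical (the Exporter extension point, the register decorator, the two formats); the point is only that the core code never needs to change when a contributor adds a new plug-in.

```python
from abc import ABC, abstractmethod

_REGISTRY = {}  # maps a format name to the plug-in class that handles it


class Exporter(ABC):
    """Extension point: a plug-in turns a list of values into text."""

    name = ""

    @abstractmethod
    def export(self, values):
        """Return the values rendered as a string."""


def register(exporter_class):
    """Class decorator that makes a plug-in available under its name."""
    _REGISTRY[exporter_class.name] = exporter_class
    return exporter_class


def export(values, fmt):
    """Framework entry point: look up the plug-in for fmt and run it."""
    try:
        exporter = _REGISTRY[fmt]()
    except KeyError:
        raise ValueError(f"no exporter registered for {fmt!r}") from None
    return exporter.export(values)


@register
class CsvExporter(Exporter):
    """Shipped with the initial release."""

    name = "csv"

    def export(self, values):
        return ",".join(str(v) for v in values)


@register
class TsvExporter(Exporter):
    """A later contribution: same task, done a different way."""

    name = "tsv"

    def export(self, values):
        return "\t".join(str(v) for v in values)
```

The extra abstract class and registry are exactly the up-front cost mentioned above; whether they pay off depends on how many contributions you expect.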

In short, it’s about the effort you put in.

“Are there examples of how people have gone about this that I might learn from?”

OSS Watch has some great case studies and examples of how this can be done.  Moodle is often held up as a good example of what can be achieved. OSS Watch also has an interesting case study on Wookie that you might find useful. The key is to pick and choose the right open source features for your software.

Developing software in an open way – Part I

So I received this question from Alex Voss the other day:

“As I am just embarking on a software development project, I would like to know from Steve what the benefits and risks are of developing a piece of software in a completely open way? Should I make all my source code available from the start to attract potential collaborators and to solicit contributions or should I keep it close initially to avoid getting locked into early solutions that have been taken up by others? Are there examples of how people have gone about this that I might learn from?”

Certainly a challenging and quite wide-ranging question, but one that applies to many people who are considering open sourcing their software, so definitely worth a look. I’ll be answering this in two posts, so let’s take a look at each question in turn…

“What are the benefits and risks of developing a piece of software in a completely open way?”

By developing your software in an open way, you can build a community that will help to sustain your software beyond the original project. Open sourcing your software can raise its profile which can increase uptake, and it provides an organisational structure for developer contributions and feedback, which helps to improve it. A user community can also offer perspective and steer on your decisions. Focusing on the needs of an open-source community will allow you to develop software that can be used and developed by others.

But of course this approach has its risks. There can be a lot of effort involved in managing contributions, ensuring the code repository is usable and up to date, building public releases, dissemination, and continuing to engage the community as it develops. And of course, as this occurs, software support can become an issue. You risk spending more time handling the open-source and support aspects of the software than you spend on core development to meet the goals of your project. You must select an appropriate licence that permits open development in the way you want, whilst protecting the software rights you wish to retain. A licence that is too restrictive may stifle community development, and one that is too permissive may not protect your interests – OSS Watch provides a very useful discussion on this. If your software uses other open-source software you wish to distribute (such as libraries), you must also ensure your licence is compatible with those licences. OSS Watch can directly advise on selecting a licence.

So you can get a lot from open sourcing your project, but these benefits are not without overheads and risks which must be considered. Next week I’ll be answering the second part of this – looking at how, and when, to open source your software and some good examples of how this has been done.

’Til next week!

This post is inspired by a forum question from one of the SeIUCCR Community Champions, regarding software versioning issues on NGS clusters…

Developing code that depends on other code is common – we do it all the time, even with the simplest program. Software reuse is good practice, but a problem can occur when you take your software from its well-known and tested environment, such as the one within your research group, to a new environment. You can find that the dependent software (the software you reused) isn’t available on the platform, or a specific version of that software isn’t available. If you’re unlucky, this dependency failure isn’t discovered until an infrequently used (and ill-tested) code path is executed sometime after deployment.

The good news is that a little bit of thought and preparation early in development can avoid a world of hurt later. It’s like not preparing a parachute properly: if you’re lucky it’ll deploy, but if you’re unlucky you’re going to be very disappointed. (And in both cases, you’ll be left with a nasty mess to clean up.)

Sometimes you can include the dependent software within your own deployment (e.g. Java JAR files, sometimes even Perl libraries), but very often this is neither a good idea nor possible for technical or licensing reasons. This is especially the case when working with entire software packages hooked together at a higher level. Packaging the dependent software can leave you with more to maintain, more internally complex software, and issues and conflicts if the packaging is not done correctly.

The answer to dependency problems is to ask lots of questions, and remove false assumptions about where your software will be used and any dependent software it uses.

One trick is to learn about the target environment for your software: operating systems, environments, deployed software, and suchlike. Does your software need to work across platforms? If the environment is a production grid infrastructure, what are the constraints for deploying dependent software or packages? The key is to know what the end-user community or system administrators are expecting from your software – and the best way to find out is to ask them.

You can also ask a few questions of the dependent software you are using: has your community converged on using a particular package? Does it have all the features you (will) need? Does it have a sustainable future, support, frequent releases?

For a possible candidate version of your chosen software, you will need to ask whether there are any known issues, bugs or plans that could cause problems. Will any soon-to-be deprecated features affect your software? Also, consider if a planned future environment upgrade will cause compatibility problems.

Finally, develop your code defensively and take a changing software environment into account. Don’t use deprecated interfaces, try to keep the development and deployment environments as similar as possible, test often and well, loosely couple your code and always adopt good software maintenance practices (because you can’t go wrong with that one!).
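
One defensive habit is to check your dependencies when the software starts, rather than waiting for a rarely used code path to fail. Here is a minimal sketch in Python, assuming the dependencies are installed Python packages; the package names and minimum versions are invented, and the version comparison is deliberately simplistic (a dedicated library such as packaging handles the awkward cases properly).

```python
import re
from importlib import metadata


def version_tuple(text):
    """Turn '1.10.2' into (1, 10, 2), stopping at the first part with no leading digits."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def dependency_problems(minimum_versions, lookup=metadata.version):
    """Return a message for each dependency that is missing or too old.

    minimum_versions maps a package name to the oldest acceptable version.
    lookup returns the installed version of a package, or raises
    metadata.PackageNotFoundError if it is not installed.
    """
    problems = []
    for name, minimum in sorted(minimum_versions.items()):
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name}: found {installed}, need >= {minimum}")
    return problems
```

Calling dependency_problems({"somepackage": "1.10"}) at start-up, and refusing to run if the list is not empty, turns a mysterious failure in the field into a clear message on day one.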

Remember: assumptions can kill (your software)!

’Til next time!

How to Specify and Implement Data Movement?

You may recall back in February I talked about the importance of data formats when choosing a programming language for sustainable development.  Since then, I’ve received the following question…

“We’re currently putting together a machine for data intensive research. The machine will have a data-staging node, and 120 other nodes (configured using ROCKS) which each have several large local disks (>6TB/node).  We want to try out different ways of staging the data to the different nodes, and keep a record of what we’ve done.  The main things we’d like to record are: the number and size of files, the pattern (one to many, many to many, etc), and the locations to which the data are sent.

We’d like to use some kind of standard way of specifying how we want the data to move, and allow for different data transfer implementations to be plugged in behind this.  I’ve heard that OGSA-DMI might be able to help us here. Do you think that could be helpful?  Do you have any other suggestions or advice as to how we could provide a standard interface to the up-loader of the data which could also potentially record what movement has taken place?”

Instead of the format of data, we’re now talking about the format of requirements for moving data.  Essentially, you have data stored in one or more ‘source’ locations and you want to transfer it to one or more ‘sink’ locations – how do you specify this?

The OGSA Data Movement Interface (DMI) from the Open Grid Forum (OGF) is an XML specification aimed at doing just that.  The good news is that the OGSA-DMI Plain Web Service Rendering Specification v1.0, soon to be ratified as an OGF full Proposed Recommendation, seems to meet your requirements.  You can readily specify multiple transfer sources and sinks at a high level (one-to-many, many-to-many), and the specification does not mandate which transfer protocols are supported by the service (e.g. ftp, scp, etc.), so you are free to add those you wish to support.  For example, when specifying requirements for moving data, you could specify as a Source:

      <dmi-plain:SourceDEPR>
        <wsa:Address>
          http://www.ogf.org/ogsa/2007/08/addressing/none
        </wsa:Address>
        ...
        <wsa:Metadata>
          <dmi:DataLocations>
            <dmi:Data
              DataUrl="ftp://ftp.siteA.com/source/example.zip"
              ProtocolUri="http://www.ogf.org/ogsa-dmi/2006/03/im/protocol/ftp">
              <dmi:Credentials>
                <ws-sec:UsernameToken>
                  <Username>foo</Username>
                  <Password>bar</Password>
                </ws-sec:UsernameToken>
              </dmi:Credentials>
            </dmi:Data>
          </dmi:DataLocations>
          ...
        </wsa:Metadata>
        ...
      </dmi-plain:SourceDEPR>

You would then specify a corresponding Sink in a very similar way.  Importantly, the specification contains definitions for specifying data movement requirements as well as for the web service interface itself.  Perhaps worth investigating!

Of course, you have to implement the service’s back-end to perform the actual transfers.  You could consider the Commons Virtual File System (VFS) data transfer library, which provides a single Java API for transferring files using a number of protocols (e.g. FTP, SFTP, HTTP).  There are a couple of variants of this – the original Apache Commons-VFS, and commons-vfs-grid on SourceForge, which includes some fixes to the original and more advanced features.  Bear in mind that the nodes in your cluster would have to support the protocol as a service to act as a Sink.  In addition, if the demands on the service are expected to be high, you may have to consider a scalable solution that farms out the transfer ‘jobs’ to nodes on your network.
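To illustrate the idea of plugging different transfer implementations in behind one interface, here is a minimal Python sketch. It is not the OGSA-DMI interface and not the Commons VFS API: the class `TransferService`, its method names, and the choice to select a back-end from the sink URL’s scheme are all assumptions made for the example, and the host names are placeholders.

```python
from urllib.parse import urlparse


class TransferService:
    """Dispatches each transfer to a back-end chosen by the sink URL's scheme.

    Hypothetical illustration only; a real service would also need
    credentials, error handling and retries.
    """

    def __init__(self):
        self._backends = {}

    def register(self, scheme, backend):
        # 'backend' is any callable taking (source_url, sink_url).
        self._backends[scheme.lower()] = backend

    def move(self, source_url, sink_url):
        scheme = urlparse(sink_url).scheme.lower()
        if scheme not in self._backends:
            raise ValueError(f"no back-end registered for '{scheme}'")
        return self._backends[scheme](source_url, sink_url)

    def move_one_to_many(self, source_url, sink_urls):
        # One source staged out to several sinks, one transfer per sink.
        return [self.move(source_url, sink) for sink in sink_urls]


def dry_run_backend(source_url, sink_url):
    """Stand-in back-end that only reports what it would transfer."""
    return f"would copy {source_url} -> {sink_url}"


if __name__ == "__main__":
    service = TransferService()
    service.register("ftp", dry_run_backend)
    for line in service.move_one_to_many(
        "ftp://staging.example.org/source/example.zip",
        ["ftp://node001.example.org/data/", "ftp://node002.example.org/data/"],
    ):
        print(line)
```

Because the caller only ever sees `move` and `move_one_to_many`, you could swap the dry-run function for a real FTP or SCP implementation and compare staging strategies without changing the code that requests the transfers.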

As for recording the transfers, much of the information you need is embedded in the requests, so you could add a simple ‘recording’ feature into your implementation.  You would have to think about how to get the file sizes (perhaps only known at transfer time) into the recording log though!
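As a rough illustration of such a ‘recording’ feature, the Python sketch below builds one log entry per request, covering the items mentioned in the question: the number and size of files, the pattern, and the locations. The function names, the field names and the JSON-lines log format are all assumptions for the example; the file sizes are passed in separately because, as noted, they may only be known at transfer time.

```python
import json
from datetime import datetime, timezone


def classify_pattern(sources, sinks):
    """Describe the movement as one-to-one, one-to-many, and so on."""
    left = "one" if len(sources) == 1 else "many"
    right = "one" if len(sinks) == 1 else "many"
    return f"{left}-to-{right}"


def make_record(sources, sinks, sizes_in_bytes):
    """Build a log entry for one movement request.

    'sizes_in_bytes' holds one size per transferred file, filled in by
    whatever performs the transfer once the sizes are known.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pattern": classify_pattern(sources, sinks),
        "file_count": len(sizes_in_bytes),
        "total_bytes": sum(sizes_in_bytes),
        "sources": list(sources),
        "sinks": list(sinks),
    }


def append_record(log_path, record):
    """Append the record to a plain-text log, one JSON object per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
```

A one-JSON-object-per-line log is easy to append to from several workers and easy to load later when you want to compare different staging strategies.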

I’m aware of two implementations of this specification.  The first is the UNICORE Grid middleware, and there is also an implementation from the DataMINX project which is an open-source project on Google Code.  DataMINX in particular offers a scalable architecture with worker nodes pluggable into a Java Message Service (JMS)-compliant queueing system, and is modular to the extent that you could make use of just the transfer worker client, for example.  Perhaps you’d like to investigate to see whether it is appropriate, and maybe contribute to its development?

Lastly, the OGSA-DMI Working Group within OGF is working towards an OGSA-DMI ‘Common’ specification, which is designed to specify only the requirements for transfers and not the service interface.  This would mean you can use this ‘Common’ specification within your own service in any way you choose.  Perhaps you’d like to join the group, contribute your use case, and help us work towards the next generation of an OGSA-DMI specification! :)

I hope this helps!
