Tuesday, November 20, 2007

Harvesting Data for Conservation
Part 2: The Solution

In 2002, The Heinz Center released its report on ecosystem conditions in the United States, “The State of the Nation’s Ecosystems.” The authors declared the assessment incomplete due to the lack of data collection, reporting, and systems infrastructure needed to sufficiently assess ecosystem condition. Yet the reality is that, between the efforts of academic researchers, local and national governments, and conservation organizations, an enormous amount of information deeply relevant to conservation is being collected, much of it in digital form. However, variation in syntax (e.g., file format) and semantics (e.g., terminology) prevents practical aggregation and analysis of the collected data.

In my last post, I described this fundamental problem in conservation information systems: data model variability. I described how variation in the schema of common conservation entities such as observations, protected areas, conservation projects, and conservation activities frustrates our ability to provide rich data entry/management applications as well as aggregation and analysis capabilities, capabilities that would significantly inform assessments like “The State of the Nation’s Ecosystems.”

The solution is to develop a system that supports rich data entry/management/reporting and yet is independent of any specific data model. The system would treat the definition of entities such as observations or protected areas as data itself. Each of these entities has a core schema, the set of attributes that makes it what it is. A species observation, for instance, consists of an observer (person), location, date and time, and a species identifier. This core can then be augmented with observation attributes from a library (e.g., egg count, nest height, stratum, life stage). When needed, more advanced users can build their own attributes (solarization, acidity) and submit them to the common attribute library. Finally, the entity core schema and a selected set of extended attributes can be combined into an extended entity schema. Extended entity schemas are intended for repeated use, potentially reflecting and enforcing a standard or protocol. The library, shared across the natural resources management community, can thus contain core and extended entity schemas along with their component entity attributes.
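To make the idea concrete, here is a minimal sketch in Python of how a core schema, a shared attribute library, and an extended entity schema might be represented as data. All names here (AttributeDef, EntitySchema, the specific attributes) are hypothetical illustrations, not the actual NatureServe implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeDef:
    """A reusable attribute definition from the shared library."""
    name: str
    data_type: type   # e.g. int, float, str
    description: str = ""

# Core schema: the attributes that make a species observation what it is.
OBSERVATION_CORE = [
    AttributeDef("observer", str, "Person making the observation"),
    AttributeDef("species_id", str, "Identifier of the observed species"),
    AttributeDef("date_time", str, "When the observation was made"),
    AttributeDef("location", tuple, "Latitude/longitude pair"),
]

# Library attributes that can augment the core.
EGG_COUNT = AttributeDef("egg_count", int, "Eggs counted in the nest")
NEST_HEIGHT = AttributeDef("nest_height", float, "Height of the nest in meters")

@dataclass
class EntitySchema:
    """Core schema plus selected library attributes = extended schema."""
    name: str
    attributes: list

    def extend(self, new_name, extras):
        """Combine this schema with extra attributes into an extended schema."""
        return EntitySchema(new_name, self.attributes + list(extras))

# A reusable extended schema for nest surveys, built from shared parts.
observation = EntitySchema("observation", OBSERVATION_CORE)
nest_survey = observation.extend("nest_survey", [EGG_COUNT, NEST_HEIGHT])
```

Because the extended schema is itself just data, it can be published to the shared library and reused across surveys, exactly the kind of community standard described above.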


For instance, a standard field protocol for surveying invasive species can be captured in an invasive-species observation schema and applied in numerous invasive species surveys. The schema can be extended to support surveys of specific invasives. (For instance, researchers determined that the length of the hind legs of the invasive Cane Toad was correlated with the geographical front edge of the invasion. “Hind Leg Length” would then be a key attribute of a survey of Cane Toads.)

The data management application parses the entity schema and component attributes as the definition of data types and behaviors. It then provides rich data entry, management, mapping, reporting, and spatial and statistical analysis on the entered data. We can afford to invest heavily in the development of this system because the functionality is not specific to a given conservation data model. Attribute definitions, including labels, help text, and error messages, are localizable into other languages, a feature critical for global conservation. Thus ends the tyranny of the software engineer. No longer are users beholden to software developers to create custom applications with rich functionality to support their data models, data models that can evolve with the needs of conservation and basic scientific understanding.
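The key move is that the application interprets the schema at runtime rather than having it compiled in. A minimal sketch, again with hypothetical names and a deliberately simple dict-based schema representation, of how schema-driven validation might work:

```python
def validate_record(schema, record):
    """Validate a record against a schema treated as data: every declared
    attribute must be present and of the declared type. Returns a list of
    error messages (empty if the record conforms)."""
    errors = []
    for attr in schema["attributes"]:
        value = record.get(attr["name"])
        if value is None:
            errors.append(f"missing: {attr['name']}")
        elif not isinstance(value, attr["type"]):
            errors.append(f"wrong type for {attr['name']}")
    return errors

# A schema loaded as data; the application code knows nothing about nests.
nest_schema = {
    "name": "nest_observation",
    "attributes": [
        {"name": "observer", "type": str},
        {"name": "egg_count", "type": int},
    ],
}

ok = validate_record(nest_schema, {"observer": "K. Smith", "egg_count": 4})
bad = validate_record(nest_schema, {"observer": "K. Smith", "egg_count": "four"})
```

The same generic code could drive data entry forms, reports, and error messages for any entity schema in the library, which is why the investment in the application is not tied to any one conservation data model.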

The other win for conservation is the ability to aggregate and analyze the resulting datasets arising from conservation organizations, the academic research community, and even state agencies. When users populate datasets based on shared entity schemas and extended attribute definitions, these datasets are inherently standardized and available for rich analysis. For instance, species observation data can be harvested and mapped across all species surveys (using secure web services) for their common core (observer, observed species, date/time, location). Even this basic map would constitute a major breakthrough for conservation. Analysis of invasive species observations, based on an invasive species observation schema, would similarly bring insights into patterns in invasions. Population reductions or migrations over time associated with climate change can be mapped and analyzed based on surveys where climate change, per se, was not the primary focus. Again, while this approach would enable conservation to make use of an enormous wealth of basic observation data, these concepts apply equally well to information about lands managed for conservation (protected areas), stewardship activities, conservation projects themselves, etc.
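The harvesting step amounts to projecting heterogeneous datasets down to the shared core they all guarantee. A sketch, with invented sample records, of how surveys with different extended attributes could still be merged on their common core:

```python
def harvest_core(datasets):
    """Project heterogeneous observation datasets down to the shared core
    fields so they can be mapped and analyzed together. Extended,
    survey-specific attributes are simply left behind."""
    CORE = ("observer", "species_id", "date_time", "location")
    merged = []
    for records in datasets:
        for rec in records:
            merged.append({key: rec[key] for key in CORE})
    return merged

# Two surveys with different extensions but the same guaranteed core.
bird_survey = [{"observer": "A", "species_id": "Falco peregrinus",
                "date_time": "2007-05-01", "location": (49.2, -123.1),
                "egg_count": 3}]
toad_survey = [{"observer": "B", "species_id": "Bufo marinus",
                "date_time": "2007-06-10", "location": (-16.9, 145.8),
                "hind_leg_length": 11.2}]

core_map = harvest_core([bird_survey, toad_survey])
```

Every record in the merged result carries exactly the core fields, so a single map or analysis can span surveys that were designed for entirely different purposes.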

We developed such a system with a small team at NatureServe to support observations. The data entry/management/reporting tool is rich in functionality and yet supports any data model we could throw at it. Parks Canada is the first customer and is already excited about the ability to support users conducting specialized surveys within parks as well as those carrying out high-level analysis of observation data across parks.

There is nothing specific to conservation in this technology. Indeed, I see examples of related systems existing and emerging on the web. Freebase is the closest I've seen yet to supporting what we need. But I'm not sure Metaweb is going where we need to go.

For instance, the support for combinations of attribute definitions into entity schema, corresponding to basic entities in conservation like observations and protected areas, is critical to support user-driven standards and protocols. We must have the ability to search and browse a community repository for core entity schema and their associated attributes. This open-source style resource would allow schema authors to post their submissions for use by the conservation community, solicit feedback, post modifications, and report on usage. In this way, subject-matter experts in various areas of conservation and biodiversity can directly share their expertise with the community in the form of widely-used entity and attribute definitions.

By separating the data model from the data management application functionality, we can provide conservation practitioners on the ground with powerful and usable tools to capture and manage their information. This same approach, to the extent we succeed in building a rich, common library of data model components, will enable unprecedented aggregation and analysis of similar, though not identical, data sets. The efficacy and efficiency of conservation at the project level can thus be improved as well as our overall understanding of the status and dynamics of nature.



At 2:52 PM, Blogger Jamie said...

Kristin -

I've read your past few posts on data technologies for ecology and conservation and find your analysis very insightful.

Providing flexible, collaborative data structures while maintaining some semblance of shared "semantics" is a real challenge.

In an area like conservation, sharing distributed observations is even more critical than in other domains. As an economist, I look at this as a design question: how do you develop a collaborative repository that makes the utility of the shared repository, for any individual researcher, greater than that of a private "spreadsheet" collection method?

I work at Metaweb and think these are the important questions that will determine whether shared information systems will be able to improve our world in meaningful ways.

I have been working with a small, but growing community of biologists who have been developing schemas on Freebase for taxonomy, ecological models and bioinformatics. I would welcome the opportunity to work with you on the problems you outline here and talk with you about the direction Metaweb is taking.


(my email is my first name at metaweb.com)

At 11:42 AM, Blogger Kristin said...

Hi Jamie,

Thanks for your comment.

Utility to the data producers is indeed the critical component. To achieve our goals, we simply have to beat Excel, and that includes usability, performance, and powerful functionality for this problem space. The very good news is that, when focused on the specific problem domain of managing conservation datasets, we definitely can beat Excel.

If we beat Excel and systems like Excel, the conservation data producer not only wins with a system better suited to his/her problem, the conservation data consumer (aggregator/analyzer) also wins. Standards conformance, like armies in a Trojan horse, is embedded in the data management system. The data producer produces conforming datasets without paying any additional cost. His/her data is available for downstream usage not because he/she has succumbed to altruistic arguments about data sharing and then taken extra time and effort to cross-walk his/her data to standards, but because sharing amounts to checking a box, literally, checking a box.

It's possible and exciting. I very much believe that, as you put it, shared information systems can improve our world in meaningful ways.

I welcome the opportunity to collaborate. I'll be emailing you shortly.


At 6:07 PM, Anonymous Paul A said...

Kristin, we've already discussed these ideas in some detail and I'm totally on board. I just wanted to make one comment in response to yours: there is still some cost to the data producer, in that there must still be compliance with the attribute library for the full goals to be achieved. So maybe consider it a watered-down conformance to standards. In the best-case scenario, everything you need is already in place because somebody else did the work. In the worst case, you may have to research the attributes and semantics already in the library and establish your extended schema. No argument about the benefits to both producers and consumers, and still a dramatically lower cost than developing a completely new system to handle the different context.

At 7:14 PM, Blogger Kristin said...

Excellent point, Paul. You are quite right that if a required attribute is truly missing from the library (or, worst of all, hard to find), the data producer either abandons the system (reverting to Excel) or pays the non-trivial cost of describing a new attribute and template.

The hope is that the cost to each individual data producer converges to zero over time, thanks to the contributions of the data producers who came before him or her.

In practical terms, we know what's required here to make this work for conservation: organizations like The Nature Conservancy, NatureServe, Cornell Lab of Ornithology and others can lend their expertise and capacity to the "seeding" effort. We take our existing conservation data standards and describe them in the library, thus at least reducing the costs for data producers who follow by giving them a good head start.

Thanks again for your comment!


At 7:27 PM, Blogger Kristin said...

Three more thoughts on the costs to data producers.

First, there will always be a cost when the data producer is collecting truly novel information. For instance, if "soil acidity" were only recently measurable, the associated attribute would have to be described, potentially by the first researcher to measure it in the field.

Second, we can mitigate the costs by providing powerful and user-friendly functionality for defining, even localizing, new attributes and submitting them back to the community repository.

Third, besides the "seeding" idea mentioned in my previous comment, the open source approach to the shared repository might help us as well by enlisting the power of ego. Is it hard to imagine egomaniac biologists investing themselves in the task of creating attributes and templates based on their expertise? We'll create attribute and entity template usage reports (sharing the usage counts without sharing the data) that have the effect of esteeming their authors.

- Kristin


Thursday, November 15, 2007

Harvesting Data for Conservation
Part 1: The Problem

Biodiversity conservation, as an "industry," has not only a common interest in but a critical need for improving a) the productivity and assessment of conservation activities and b) downstream aggregation and analysis across conservation activities. This need has been formally expressed (see conservationcommons.org), yet its practical realization has thus far eluded us.

As I see it, the fundamental barrier to the development of rich applications to manage conservation projects, as well as aggregation and analysis tools, is data model variability. For instance, whereas a field observation consists of a fundamental core of information (observer, observed species, date/time, and location), meaningful observations almost always describe more than just this core in order to serve the purpose of the observation activity. For example, if we're trying to understand reproduction rates among migrating bird species, an observation record will document not only the date/time, location, observer, and bird species, but also the fact that this is a nest observation, how many total eggs are in the nest, and how many of those appear to be intact. An observation management system that only allows users to capture the core attributes would be useless for almost all specific observation activities.

The same is true for tracking and managing information on protected areas, stewardship activities (e.g. prescribed burns, reforestation) and other datasets critical to conservation. While there is a common core of attributes to describe these entities in conservation, users must be able to extend beyond this core in practical application.

Because of this data model variability, users end up pursuing one of two approaches to capture and manage their conservation datasets. The first and most prevalent option is to employ completely generic technologies (e.g., spreadsheets or simple databases). These systems meet immediate needs fairly well. However, the resulting datasets are completely nonstandard and therefore unavailable for aggregation with similar datasets. In this case, the needs of the data producers may be met, but data consumers are frustrated (see Mismatched Incentives).

Where aggregation of large datasets and/or specialized functionality is required, users pursue the other option: procuring the development of custom systems. Custom systems, of course, are developed at considerable cost and, because they are hard-bound to a static data model, are suitable only for a single application or, at best, a similar class of applications. How unfortunate that our investments in conservation data management systems must be repeated for each new dataset. We can ill afford to enrich the functionality (e.g., usability, mapping, reporting, feeds, import/export, wizards) or performance of any given system because the investment is specific to the users of a single dataset. It is as if each dissertation, because of its unique content, required the development of a new word processor.

Neither the completely generic nor the custom approach supports our need for leveraged investment in rich data entry/management applications at the conservation activity level, or in aggregation and analysis tools operating on standardized datasets.



At 2:03 PM, Blogger frank said...

Hey Kristin. I couldn't agree more. We need minimum data models for the core conservation entities, including conservation projects, protected areas, and species and ecosystem occurrences. How do we get there?


Thursday, November 01, 2007

Changing Jerseys

As many of you know, I was enjoying my work at NatureServe. At the same time, starting last June, I began to see through various connections signs of positive change in The Nature Conservancy's approach to information systems development, including improved governance, deeper integration with science, accountability, and better software development practices. When my friend Dennis Fuze, who is in charge of all systems development there, told me he was hiring a Director of Conservation Systems Development, it seemed like just the right role. So I jumped into the interview process (intense!) and was offered the job, reporting to same-friend Dennis.

My technology background suited the systems development folks, and my Nicholas School credentials turned out to be instrumental in swaying the conservation scientists and practitioners… those who will be the customers of the systems development I oversee.

So I got the job and now I’m terrified. Okay, not completely, but I am “excited.” The Conservancy is a BIG organization with all of the accompanying challenges plus a few more. It’s also an organization I respect immensely, one with wonderful people and unique opportunities. I am thrilled to have the chance to contribute.

Tree frog, Canaima National Park, Venezuela. © Ana Garcia/The Nature Conservancy

I have a great team (spread all over the country!) and a challenging portfolio of existing and new projects. I'll continue to work with folks from NatureServe, now as a tough customer. I'm still in the Conservation Information Technology league; I've just switched jerseys.
