
Tuesday 4 December 2018

Goodbye developer blog, hello data-blog!





GBIF has a new blog!



What is it?

A place for GBIF staff and guest bloggers to contribute:
  • Statistics 
  • Graphs 
  • Tutorials 
  • Ideas 
  • Opinions 

Who can contribute?

If you would like to contribute you can contact [email protected]. Guest blogs are very welcome.

How can I write a post?

There is a short tutorial on the blog's GitHub.

What about the developer blog?

The developer blog will remain up as an archive, but there are no plans to actively post new content here.


Friday 27 July 2018

How popular is your favorite species?






How to use

Use the box to the left to type in the species you are interested in.
Make sure to use a scientific name:
  • Aves instead of birds
  • Plantae instead of plants
  • Anura  instead of frogs

Explanation of tool

This tool plots the downloads through time for species or other taxonomic groups with more than 25 downloads at GBIF. Downloads at GBIF most often occur through the web interface. In a previous post, we saw that most users are downloading data from GBIF via filtering by scientific name (aka Taxon Key). Since the GBIF index currently sits at over 1 billion records (a 400+GB csv), most users will simply filter by their taxonomic group of interest and then generate a download.

How to bookmark a result?

If you would like to bookmark a result or graph to share with others, you can visit the app page directly: app link. On this page the state of the app will be saved inside the url. You can also save a jpg by clicking on the little hamburger menu in the top right.

What counts as a download?

For the graphs above, I decided that it would be more meaningful to roll up downloads below the queried taxonomic level.
  • If a user downloaded 5 different bird species at once, this would count as 1 download for Aves and 1 download for each of the species downloaded.
  • If a user only typed Aves into the occurrence download interface and no individual species, this would count as 1 download for Aves and 0 downloads for all bird species.
  • Similarly, if a user only typed the order Passeriformes into the search, this would count as 1 download for Passeriformes and 1 download for Aves (and 1 download for Animalia etc.) but 0 downloads for all the species, families, and genera within Passeriformes.
It is possible, though not as easy, to get data from GBIF without generating a download: users can stream data using the GBIF occurrence API, currently retrieving occurrence data in 200k-long chunks without a download ever being registered. If someone got their data using the API in this way, we currently have no way to track it. Presumably, the vast majority of users are getting their data directly through the web interface.
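As a minimal sketch of what such API paging looks like (assuming the public occurrence search endpoint at api.gbif.org, a maximum page size of 300 and taxonKey 212 for Aves - treat these details as assumptions rather than a recipe), a small Java client could fetch chunks like this:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StreamOccurrences {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    int limit = 300; // assumed maximum page size of the occurrence search API
    for (int offset = 0; offset < 900; offset += limit) { // first three pages only
      // taxonKey=212 is assumed to be the backbone key for Aves
      URI uri = URI.create("https://api.gbif.org/v1/occurrence/search"
          + "?taxonKey=212&limit=" + limit + "&offset=" + offset);
      HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      System.out.println("offset " + offset + ": " + response.body().length() + " bytes of JSON");
    }
  }
}

Paging like this never produces a download event, which is exactly why we cannot track it.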

For more technical details on this tool, you can visit my personal blog:
http://www.johnwalleranalytics.org/2018/07/06/gbif-download-trends/




Thursday 28 June 2018

Occurrence Downloads

Occurrences at GBIF are often downloaded through the web interface, or through the API (via rgbif etc.). Users can place various filters on the data in order to limit the number of records returned. As the occurrence index is currently a 447 GB csv, most users want to use a filter.

Total monthly downloads

Here I plot the total monthly downloads for various popular filters. For the past few years, GBIF has been averaging around 10k downloads per month.

Two peaks in total downloads stand out:
  • Mar 2014
  • Sep 2016
The Sep 2016 peak seems to be explained by high DATASET_KEY downloads. Both the Mar 2014 and Sep 2016 peaks are well explained by the top users. The Top users series in this graph covers all downloads generated by the 3 most active users on GBIF. These users generate downloads in the 1000s, most likely automated downloads generated internally.

One interesting detail is that while the No Filter Used option is not used very often, it accounts for more than 500 billion occurrence records downloaded.

Finally, if we look at the number of unique users (un-select everything else to see it in isolation), we see that the number of individuals making downloads on GBIF has been increasing steadily, with some perhaps interesting cyclical patterns. The graph below is interactive: you can see different data views by clicking on the names.


Popular filters explained

There are many ways that a user can filter data. The types and combinations of filters are almost limitless. Below I describe some of the most common filters:

1. TAXON_KEY

This is one of the most common filters users place on the GBIF occurrence index. Users can choose one or many taxon names to filter the data, and they can choose any taxon rank they want (species, genus, family, kingdom etc.).

2. COUNTRY

Here users can return records only from a certain country. This is the country the user searched for, not the country the user is searching from.

3. HAS_GEOSPATIAL_ISSUE

Here users can specify that they want occurrence records free of interpreted geospatial errors.

4. HAS_COORDINATE

Here users can say that they want occurrence records that have coordinates.

5. No Filter

Finally, a surprising number of users never put any filter and instead request to download the entire occurrence index. In the overwhelming majority of cases, we have to assume these users have done this by mistake.
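For illustration, several of these filters can be combined into a single download request expressed as a JSON predicate, roughly along these lines (the field and key names follow the documented occurrence download API, but treat the exact payload as a sketch); the request is POSTed with authentication to https://api.gbif.org/v1/occurrence/download/request:

{
  "creator": "your_gbif_username",
  "sendNotification": true,
  "notificationAddresses": ["you@example.org"],
  "predicate": {
    "type": "and",
    "predicates": [
      {"type": "equals", "key": "TAXON_KEY", "value": "212"},
      {"type": "equals", "key": "COUNTRY", "value": "DK"},
      {"type": "equals", "key": "HAS_COORDINATE", "value": "true"}
    ]
  }
}

This is essentially the kind of request the web interface builds behind the scenes when a user ticks those filters and starts a download.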

You can read more about downloads at GBIF here:
http://www.johnwalleranalytics.org/2018/05/30/gbif-download-statistics/

Thursday 22 June 2017

GBIF Name Parser

The GBIF name parser has been a fundamental library for GBIF to parse a scientific name string into a structured representation of a name. It has been refined over many years based on actual name strings encountered in the GBIF occurrence and checklist indices. Over the years the major design goals have not changed much and can be summarised as follows:
  • extract canonical, code relevant name parts
    • populate only the ParsedName class of the GBIF API
    • ignore any superfluous name parts irrelevant to the code, e.g. species authorships in infraspecific names, infrageneric placements of species or superfluous infraspecific parts in quadrinomials
  • deal with a wide variety of names that the ParsedName class can represent
    • cultivar names
    • bacterial strains & candidate names
    • virus names
    • named hybrids
    • taxon concept references, sensu lato/stricto or aggregates
    • legacy ranks
  • extract notes often found in names:
    • nomenclatural remarks
    • determination notes like aff. 
    • partially determined species, e.g. only down to the genus: Abies spec.
  • in case author parsing is impossible, fall back to parsing just the canonical name without authors
  • allow slightly imperfect names not strictly well formed according to the rules
  • classify names according to our NameType enumeration
Compared to gnparser these goals are slightly different, which explains some of the behavior discussed in the recent paper by Dmitry Mozzherin (2017). As that paper explains, the GBIF name parser is based on regular expressions, some of them even recursive. This is not the reason why we do not support hybrid formulas, though. Hybrid formulas (e.g. Quercus robur x Q. macrocarpa), as opposed to named hybrids (e.g. Quercus x turneri), are a variable combination of names and thus very different from the Linnean names represented by a ParsedName. For name matching, backbone building and many other problems hybrid formulas are incompatible, so we decided to deal with hybrid formulas just as with other unparsable names such as viruses or OTU names that do not follow the neat structure of Linnean names. We simply keep the entire string as it was, classify it with a NameType and do not parse it further.

GBIF exposes the name parser through the GBIF JSON API; here are some examples for illustration:
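For instance, assuming the current api.gbif.org layout, a single name can be parsed with a plain GET request such as:

https://api.gbif.org/v1/parser/name?name=Azalea%20schlippenbachii%20(Maxim.)%20Kuntze

The response is a JSON array of ParsedName objects with fields such as authorship and bracketAuthorship, which are the fields quoted in the comparisons below.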
Authorships are not (yet) parsed into a list of individual authors. This has been done internally already and it is something we are likely to expose in the future. Currently the authorship is parsed into four pieces: the authorship and year for the combination and for the basionym.

gnparser in GBIF

The GNA name parser is a great parser for well-formed names. It has slightly different goals, but since it is available for the JVM we have wrapped it to support the GBIF NameParser interface, producing ParsedName instances. Wrapping the Scala-based gnparser was not as trivial as we had thought due to its different parsing output and the mismatch of Scala and Java collections, but working against the NameParser interface you can finally select your parser of choice.

The authorship semantics for original names are also slightly different between the two parsers. Again some examples to illustrate the difference:
Azalea schlippenbachii (Maxim.) Kuntze
Both parsers show the same semantics:
GBIF:
"authorship": "Kuntze",
"bracketAuthorship": "Maxim.",

GNA:
"value": "(Maxim.) Kuntze",
"basionym_authorship": {
  "authors": ["Maxim."]
},
"combination_authorship": {
  "authors": ["Kuntze"]
}
Rhododendron schlippenbachii Maxim.
The GBIF parser places the author into “authorship” as the author of the very combination.
The gnparser places the author into the basionym authorship instead even though it is not surrounded by brackets.
As the parser cannot know whether the name actually is a basionym, i.e. whether a subsequent recombination indeed exists, this was slightly unexpected, and the GBIF NameParser wrapper had to swap the authorship to populate a ParsedName in such cases:
GBIF:
"authorship": "Maxim.",

GNA:
"basionym_authorship": {
  "authors": ["Maxim."]
}
Puma concolor (Linnaeus, 1771)
Both parsers show the same semantics:
GBIF:
"bracketAuthorship": "Linnaeus",
"bracketYear": "1771",

GNA:
"basionym_authorship": {
  "authors": ["Linnaeus"],
  "year": {
    "value": "1771"
  }
}
Ex authors are not parsed in GBIF (yet). They do cause a little issue as the sequence of authors varies between botany and zoology. In botany, the author of the earlier name precedes the later, valid one while in zoology this is reversed. GNA follows the zoological model, even though far more usages of ex-authors can be found in botanical names.

Uninomials are also treated differently. GBIF uses a single property genusOrAbove for the genus part of a binomial, a standalone genus or a uninomial of a rank above genus. GNA places genera from a binomial into the genus field, but uses uninomial for a standalone genus name.

Performance

We are still comparing gnparser with the GBIF name parser, but initial tests using gnparser-0.4.0 to parse 1380 names from our unit tests suggest the GBIF parser is up to twice as fast running in a Java environment. There is an overhead of wrapping the gnparser for the ParsedName result class, but even if we just parse the names and do not convert the Scala result into a ParsedName it still takes 75% more time:
  
Total time parsing 1380 names
MacBookPro 2017, Java8, single thread:

  GBIF: 1331ms
  GNA : 2596ms
  GNA-: 2323ms # without wrapper

This contradicts the results presented in the gnparser paper, but might be related to the selection of names or running the parser in different environments.

Future

We are working with GNA to improve both parsers and align them more. With slightly different goals it might be hard to fully merge the two projects, but we will try to unify the efforts as much as we can. For the GBIF name parser we will be adding parsed author and ex author teams in the near future. This is needed to do author comparisons for better name matching in the GBIF backbone building (where it already exists) and the Catalogue of Life.

Monday 27 February 2017

GBIF Backbone - February 2017 Update

We are happy to announce that a new GBIF Backbone just went live, also available as an improved Darwin Core Archive for download. Here are some facts highlighting the important changes.

New source datasets

Apart from continuously updated sources like the Catalogue of Life or WoRMS, here are the new datasets we used as sources to build the backbone.




The 43 sources used in this backbone build

Code changes




All other fixed issues in the source code that generates the backbone can be found in our Jira epic
and github milestone.

Backbone impact

The new backbone has a total of 5,887,500 names of which it treats 2,818,534 species names as accepted (up from 5,307,978 and 2,525,274 respectively).
More backbone metrics are available through our portal and in more detail through our API.


  • 105,296 deleted names, many of them previous erroneous duplicates
  • 685,853 new names
    • Animalia: 164 families; 6,616 genera; 257,196 species; 87,660 infraspecific
    • Archaea: 2 families; 6 genera; 48 species
    • Bacteria: 27 families; 225 genera; 2,470 species; 615 infraspecific
    • Chromista: 2 phyla; 13 classes; 58 orders; 54 families; 767 genera; 12,124 species; 2,953 infraspecific
    • Fungi:  2 families; 269 genera; 8,703 species; 2,993 infraspecific
    • Plantae: 3 families; 795 genera; 63,617 species; 33,282 infraspecific
    • Protozoa: 4 families; 65 genera; 1,412 species; 280 infraspecific
    • Viruses: 8 families; 1,227 genera; 8,488 species
    • Unknown: 4 families; 2,708 genera; 13,076 species; 2,237 infraspecific

A very large and detailed log of the backbone build is also available.

The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:


All contributors to the backbone, arranged by the number of names for which the source serves as the primary reference:


Occurrence impact

With a new backbone we have reprocessed all of our 712 million occurrences.

The distribution of the major taxonomic groups exceeding 3%, i.e. those with a minimum of 36,800 species, is shown in this last diagram:


The 1,226,520 accepted species in GBIF occurrences (140 fewer than before) represent 44% of all accepted backbone species.

Wednesday 25 January 2017

Sampling-event standard takes flight on the wings of butterflies


Data collected from systematic monitoring schemes is highly valuable. That's because harvesting species data from a given set of sites repeatedly over time using a well-defined sampling effort opens the door to key ecological analyses including phenology, population trends, changes in community structure and other metrics related to a range of Essential Biodiversity Variables (EBVs).

A couple of years ago there was no faithful way to universally standardize data from systematic monitoring schemes. This meant that researchers using this kind of data would need to spend a lot of time deciphering it first. Their job would get even more complicated when trying to integrate data from various heterogeneous sources, each storing their data in different formats, units, etc.

Today, the situation looks much better thanks to a massive collaboration between GBIF, EU BON partners and the wider biodiversity community whose aim was to enable sharing of "sampling-event datasets".  

Indeed, one of the most successful outcomes from this collaboration has been the development of a standardized format for systematic butterfly monitoring schemes.

The format has been developed in close collaboration with the EU BON partners Israel Pe'er (GlueCAD- Biodiversity IT) and his son, Dr. Guy Pe'er, (UFZ), who works with systematic monitoring data.  The format can be adapted to many other types of systematic monitoring, for many taxonomic groups, as it ensures the following important conditions for researchers are met:
  • all visits to a given site are known, including those with no sightings, as this allows for analyses of species phenology, etc.
  • the range of species being recorded during sampling is explicit, as this allows for true absence to be determined.
  • the location hierarchies can be specified (e.g. the location is a fixed transect or subsection of a transect), as this allows users to group observations by location.
  • enough detailed information about the sampling effort and sampling area (e.g. units of measurement) are captured, as this allows users to calculate density or convert between units of abundance.
The Israeli Butterfly Systematic Monitoring Scheme (BMS-IL) dataset has already been published openly using this format. I'd like to invite everyone to explore this exemplar dataset from either the EU BON IPT or via GBIF.org.

In the future, I hope that GEO BON's Guidelines for Standardized Global Butterfly Monitoring will incorporate a new recommendation that all monitoring programs use this standardized format for sharing their data. Without a doubt this will make researchers' jobs easier when integrating data from several butterfly monitoring programs for their analyses. It will also enable integrating the data with standardized sampling-event data from other disciplines as well.

Ideally, making the data openly available in a standardized format also leads to new collaboration. So far, BMS-IL data has been used to assess trends in the abundance and phenology of Israel's butterflies for the benefit of conservation or climate change research for example. I would like to encourage you to reach out to Israel and Guy Pe'er if you have any novel ideas on how to reuse their newly standardized data in order to help unlock its full potential.

Thursday 12 January 2017

IPT v2.3.3 - Your repository for standardized biodiversity data


GBIF is pleased to announce the release of IPT v2.3.3, now available for download from the IPT website.

This version looks and feels the same as 2.3.2 but is much more robust and secure. I'd like to recommend that all existing IPT installations be upgraded as soon as possible following the instructions listed in the release notes.

Additionally, a couple new strategic features have been added to the tool to enhance its potential. A description of these new features follows below.

Improved dataset homepage


Compared with general-purpose repositories such as Dryad or Figshare, the IPT ensures that uploaded biodiversity data gets disseminated in a standardized format (Darwin Core Archive - DwC-A), facilitating wider reuse and enabling the data to be indexed by aggregators such as GBIF.org.

Interoperability comes at a small cost though, as depositors choosing to use the IPT must overcome a learning curve in understanding how to map their data to the Darwin Core standard.

To make this easier for depositors, a new set of Darwin Core Excel templates have recently been released. These new templates provide a simpler solution for capturing, formatting and uploading data to the IPT.

Similarly, users of the standardized data need to understand how to unpack a DwC-A and make sense of the data inside.  

Data Records section - RLS Global Reef Fish Dataset
doi:10.15468/qjgwba
To make this process easier for users, a new Data Records section has been added to the dataset homepage that provides an explanation of what the DwC-A format is with a graphic illustration showing the number of records in each file contained within it.

Overall this advancement will strengthen the IPT as a data repository, which is already capable of assigning DOIs to datasets to make them discoverable and citable. 

Translation into Russian 


Map of IPT installations in Russia - January 2017
Although the IPT is installed in 52 countries around the world, its use is heavily underrepresented across Russian-speaking countries. Therefore, to extend the IPT's reach in these areas, the user interface has been fully translated into Russian by a team of volunteer translators, with the largest contribution made by Ivan Chadin from the Komi Science Centre of the Ural Branch of the Russian Academy of Sciences.

Map of data published by Russia - January 2017
At the time of writing there were already 18 datasets from Russia published by 5 IPTs installed across Pushchino, Moscow, St Petersburg and the Komi Republic. It will be exciting to watch this number grow over time in part thanks to this enormous volunteer contribution.



Acknowledgements


Once again I'd like to recognize all the volunteer translators that contributed their time and expertise to making this new version available in seven different languages:
  • Sophie Pamerlon (GBIF France) - Updating French translation
  • Yukiko Yamazaki (GBIF Japan (JBIF)) - Updating Japanese translation
  • Daniel Lins (Universidade de São Paulo, Research Center on Biodiversity and Computing - BioComp) - Updating Portuguese translation
  • Néstor Beltrán (Colombian Biodiversity Information System (SiB Colombia)) - Updating Spanish translation
  • Ivan Chadin (Institute of Biology of Komi Scientific Centre of the Ural Branch of the Russian Academy of Sciences), Max Shashkov (Institute of Physicochemical and Biological Problems in Soil Science, Russian Academy of Science) and Artyom Leostrin (Komarov Botanical Institute of the Russian Academy of Sciences (Saint-Petersburg)) - Adding Russian translation 
I'd also like to recognize a few volunteers that helped make significant improvements to the IPT codebase:
  • Bruno P. Kinoshita (National Institute of Water and Atmospheric Research (NIWA)) - Fixed issue #1241, ensuring the IPT can be installed on a server behind a proxy
  • Pieter Provoost (UNESCO) - Fixed issue #1248, improving the IPT's RSS feed
  • Tadj Youssouf (Security researcher, fb.com/oc3f.dz) - Helped address a cross site scripting issue
Although the core development of the IPT happens at the GBIF Secretariat, the coding, documentation, and internationalization are a community effort and everyone is welcome to join in.

I look forward to seeing the IPT's community of volunteers and users continue to grow and hope you can unlock the full potential of this publishing tool and repository.

Monday 8 August 2016

GBIF Backbone - August 2016 Update

GBIF has just put a new backbone taxonomy into production! Since our last update of the GBIF Backbone we have received various feedback and gained insight into potential code improvements. Here is a quick summary of what has changed in this August 2016 version.

Important code changes:

  • much less eager basionym detection, resulting in fewer algorithmically assigned synonyms and removing many false synonyms, especially in plants
  • detection and merging of orthographic variants of species by stemming gender endings, allowing for doubled consonants, dealing with author transliterations and merging hybrid names

All fixed issues in the source code that generates a new backbone can be found here, and many of them go back to actual reported user feedback: http://dev.gbif.org/issues/browse/POR-3029

New sources

The following new sources have been incorporated into the August backbone:
  • a major new version of The Paleobiology Database, contributing 2,315 new families, 11,390 genera and 131,958 species names to the backbone. It feeds many isExtinct and livingPeriod values into the backbone for fossil taxa
  • thousands of new Plazi articles with 1,883 genera, 28,725 species and 1,935 infraspecific names. We only use genus names and below from Plazi, excluding any synonyms until we are confident they are all correctly marked up
  • added Artsnavnebasen source, contributing 3,640 new genera and 29,751 species names to the backbone
  • added International Cichorieae Network source, contributing 190 new Asteraceae genera; 1,415 species and 3,427 infraspecies names to the backbone
The 39 sources used in this backbone build

Backbone impact

The new backbone has a total of 5,307,978 names of which it treats 2,525,274 species names as accepted (previously 2,420,842 out of 5,208,172). More backbone metrics are available through our portal and in more detail through our API.
  • 187,854 deleted names, mostly due to the removal of orthographic variants
  • 279,404 new names 
    • Unknown: 165 families; 743 genera; 785 species; 14 infraspecific
    • Animalia: 13 orders; 1,649 families; 10,171 genera; 125,478 species; 4,398 infraspecific
    • Archaea: 2 genera; 3 species
    • Bacteria: 1 family; 33 genera; 544 species; 36 infraspecific
    • Chromista: 38 families; 412 genera; 5,594 species; 295 infraspecific
    • Fungi: 1 family; 691 genera; 11,127 species; 2,039 infraspecific
    • Plantae: 50 families; 666 genera; 82,672 species; 14,725 infraspecific
    • Protozoa: 1 class; 1 order; 4 families; 38 genera; 349 species; 24 infraspecific
    • Viruses: 1 family; 982 genera; 6,311 species
A very large and detailed log of the backbone build is also available.

The largest taxonomic groups in the backbone, each exceeding 3% of all accepted species, are shown in the following diagram:



The Catalogue of Life, the largest single primary source, contributes 59.8% of all names (previously 60.9%). A breakdown by backbone constituents is now also available as a species search facet. For example, this shows the breakdown for all accepted plant species in the backbone:


Occurrence impact

With a new backbone we have reprocessed all of our 642 million occurrences. The larger changes were:
  • Fixed various old/new world distributions of incorrectly synonymized species
  • Reduced the number of virus records from 157,492 down to just 5,348 records. Most occurrences were Lepidoptera, e.g. the common peacock butterfly that had formerly been mismatched because there was no classification given with the name.
Some more metrics of backbone names in our occurrences. There are:
  • 216,699 distinct genera in GBIF occurrences. That is 55% of all 396,990 genera in the backbone
  • 1,226,668 accepted species in GBIF occurrences. That is 50% of all 2,420,842 backbone species
  • 2,059,961 distinct names in GBIF occurrences. That is 39% of all 5,208,172 names in the backbone
The distribution of the major taxonomic groups exceeding 3%, i.e. those with a minimum of 36,800 species, is shown in this last diagram:



Wednesday 20 July 2016

Probably Turboveg's best-kept secret

Turboveg is one of the most widely used software programs for managing vegetation data. Probably its best-kept secret is that it can export vegetation data in Darwin Core Archive (DwC-A) format, a standard format that enables quick and easy integration with other resources on GBIF.org. Turboveg v2 converts vegetation data into species occurrence data packaged as a DwC-A. Now, thanks to an 8-month-long collaboration between GBIF and Stephan Hennekens (Turboveg's developer), v3 will convert vegetation data into sampling event data packaged as a DwC-A - a much more faithful and useful representation of the data.

Turboveg

Screenshot of Turboveg v3 prototype
Turboveg is an easy-to-install and easy-to-use Windows program for storing, managing, visualizing and exporting vegetation data (relevés). A relevé is a list of the plants in a delimited plot of vegetation, with information on species cover and on substrate and other abiotic features, making as complete a description as possible of the plant community's composition and structure.

Today there are about 1,500 users of the software worldwide, managing more than 1.5 million relevés. Turboveg can export relevés in various file formats, which is useful for further analysis. Support for exporting relevés as species occurrence data packaged as a Darwin Core Archive (DwC-A) was added to v2 in 2011. Guidance on how to use this feature can be found in the Turboveg User Manual.

Version 3, due to be released in 2017, will export relevés as sampling event data packaged as DwC-A - a format that more accurately reflects the original data.

Sampling event data

Sampling event data derive from environmental, ecological, and natural resource investigations that follow standardized protocols for measuring and observing biodiversity. This is in contrast to opportunistic observation and collection data, which today form a significant proportion of openly accessible biodiversity data. A good example of sampling data is data coming from vegetation sampling events using the Braun-Blanquet protocol. Because the sampling methodology and sampling units are precisely described the resulting data is comparable and thus better suited for measuring trends in habitat change and climate change.

Sampling event data model

A data model provides the details of the structure of the data. Previously sampling event data couldn't be modelled in a standardized way in Darwin Core due to the complexity of encoding the underlying protocols. Over the past two years, however, GBIF has been working with EU BON and the wider bioinformatics community to develop a data model for sharing sampling event data. In March 2015 TDWG, the international body responsible for maintaining standards for the exchange of biological data, ratified changes that enabled support for modelling sampling event data.

In summary, the de facto data model for sampling event data in Darwin Core consists of three tables: Sampling event, Measurements or Facts and Species occurrences. 

A Sampling event can be associated with many Species occurrences, while a Species occurrence can only be associated with one Sampling event. Similarly, a Sampling event can be associated with many Measurements or Facts. In this way a Sampling event has a one-to-many relationship to both Species occurrences and Measurements or Facts.

Note additional tables of information can also be added to a Sampling event, such as Multimedia (e.g. to record images of the plot). More information about this preferred data model for sampling event data can be found in the IPT Sampling Event Data Primer.
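As a much simplified illustration (the rows below are invented, not actual data), the tables are linked through the Darwin Core eventID term, so one Sampling event can carry many Species occurrences and many Measurements or Facts:

 event.txt
   eventID  | eventDate  | samplingProtocol | sampleSizeValue | sampleSizeUnit
   plot-42  | 2014-06-01 | Braun-Blanquet   | 4               | square metre

 occurrence.txt
   occurrenceID | eventID | scientificName    | organismQuantity | organismQuantityType
   occ-1        | plot-42 | Calluna vulgaris  | 25               | % cover
   occ-2        | plot-42 | Molinia caerulea  | 60               | % cover

 measurementorfact.txt
   eventID | measurementType | measurementValue | measurementUnit
   plot-42 | slope           | 5                | degree

Each occurrence and each measurement points back to the sampling event it was recorded in, which is what produces the one-to-many relationships described above.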


Sampling event data model for vegetation plot data

Vegetation surveys or relevés produce a wealth of information on species cover and on substrate and other abiotic features in the plot. Species cover can be measured using dozens of different vegetation abundance scales such as the Braun-Blanquet scale or Londo decimal scale to name a couple. To standardize how this information is stored, a custom Relevé table is used instead of the Measurements or Facts table.

This data model for vegetation plot data in Darwin Core consists of three tables: Sampling event, Relevé and Species Occurrence.

A Sampling event can be associated with only one Relevé. The Relevé consists of the most common relevé measurements covering all vegetation layers. Note that for each measurement the unit and precision are explicitly defined. A Sampling event can also be associated with many Species occurrences; however, each Species occurrence should specify the vegetation layer where it was found, hence the same species can appear within multiple vegetation layers. In this way the vegetation composition can be described for each layer within the plot.

Note that at the time of writing the Darwin Core standard doesn't have the terminology for storing vegetation layers. Therefore a formal proposal has been made to add the new term "layer" to Darwin Core. To standardise how this new term is populated, a custom vocabulary for vegetation layers has also been produced.


Example DwC-A export by Turboveg: Dutch Vegetation Database (LVD)

Fortunately, the Dutch Vegetation Database (LVD) has recently been republished using the new sampling event format and can thus serve as an exemplar dataset. LVD is a substantial dataset published by Alterra (a major Dutch research institute) that covers all plant communities in the Netherlands with more than 85 years of vegetation recording for some habitats. The latest version of this dataset has more than 650 thousand relevés associated with almost 12 million species occurrences.

Alterra uses Turboveg v3 to manage this dataset and export it in the standardized DwC-A format. It is important to note that the software takes special care to protect sensitive species: the locations of plots in which red list species have been observed are obfuscated to a level of 5x5 km squares. Furthermore, the software converts all coverage values to the same unit (e.g. species coverage values are converted into percentage coverage) in order to make the data easy to use and integrate with other sources.

Sampling event data on GBIF.org: Dutch Vegetation Database (LVD)

GBIF.org map of LVD georeferenced data
All versions of LVD are imported to the EU BON IPT where they get archived and published through GBIF.org

The 8 month long collaboration between GBIF and Stephan Hennekens culminated in the latest version of LVD being indexed into GBIF.org here. A special and grateful thanks is owed to Stephan for all his hard work to make this happen.

Over the next couple of years GBIF will continue working on enhancing the indexing and discovery of sampling event datasets (e.g. showing events' plots/transects on a map, filtering events by sampling protocol, indexing Relevés, etc.). In the meantime, once Turboveg v3 is released in 2017, users will already be able to export their relevés into this new standardized format that represents their data much more faithfully.

Wednesday 6 April 2016

Updating the GBIF Backbone

The taxonomy employed by GBIF for organising all occurrences into a consistent view has remained unchanged since 2013. We have been working on a replacement for some time and are pleased to introduce a preview in this post. The work is rather complex and tries to establish an automated process to build a new backbone which we aim to run on a regular, probably quarterly basis. We would like to release the new taxonomy rather soon and improve the backbone iteratively. Large regressions should be avoided initially, but it is quite hard to evaluate all the changes between 2 large taxonomies with 4 - 5 million names each. We are therefore seeking feedback and help to discover oddities of the new backbone.

Relevance & Challenges

Every occurrence record in GBIF is matched to a taxon in the backbone. Because occurrence records in GBIF cover the whole tree of life and names may come from all possible, often outdated, taxonomies, it is important to have the broadest coverage of names possible. We also deal with fossil names, extinct taxa and (due to advanced digital publishing) even names that have just been described a week before the data is indexed at GBIF.
The Taxonomic Backbone provides a single classification and a synonymy that we use to inform our systems when creating maps, providing metrics or even when you do a plain occurrence search. It is also used to crosslink names between different checklist datasets.

The Origins

The very first taxonomy that GBIF used was based on the Catalogue of Life. As this only included around half the names we found in GBIF occurrences, all other cleaned occurrence names were merged into the GBIF backbone. As the backbone grew we never deleted names and increasingly faced more and more redundant names with slightly different classifications. It was time for a different procedure.

The Current Backbone

The current version of the backbone was built in July 2013. It is largely based on the Catalogue of Life from 2012 and has folded in names from 39 further taxonomic sources. It was built using an automated process that made use of selected checklists from the GBIF ChecklistBank in a prioritised order. The Catalogue of Life was still the starting point and provided the higher classification down to orders. The Interim Register of Marine and Nonmarine Genera was used as the single reference list for generic homonyms. Otherwise only a single version of any name was allowed to exist in the backbone, even where the authorship differed.

Current issues

We kept track of nearly 150 reported issues. Some of the main issues showing up regularly that we wanted to address were:
  • Enable an automated build process so we can use the latest Catalogue of Life and other sources to capture newly described or currently missing names
  • It was impossible to have synonyms using the same canonical name but with different authors. This means Poa pubescens was always considered a synonym of Poa pratensis L. when in fact Poa pubescens R.Br. is considered a synonym of Eragrostis pubescens (R.Br.) Steud.
  • Some families contain far too many accepted species and hardly any synonyms. Especially for plants the Catalogue of Life was surprisingly sparsely populated and we relied heavily on IPNI names. For example the family Cactaceae has 12,062 accepted species in GBIF while The Plant List recognizes just 2,233.
  • Many accepted names are based on the same basionym. For example the current backbone considers both Sulcorebutia breviflora Backeb. and Weingartia breviflora (Backeb.) Hentzschel & K.Augustin as accepted taxa.
  • Relying purely on IRMNG for homonyms meant that homonyms which were not found in IRMNG were conflated. On the other hand there are many genera in IRMNG - and thus in the backbone - that are hardly used anywhere, creating confusion and many empty genera without any species in our backbone.

The New Backbone

The new backbone is available for preview in our test environment. In order to review the new backbone and compare it to the previous version we provide a few tools with a different focus:
  • Stable ID report: We have joined the old and new backbone names to each other and compared their identifiers. When joining on the full scientific name there is still an issue with changing identifiers which we are still investigating.
  • Tree Diffs: For comparing the higher classification we used a tool from Rod Page to diff the tree down to families. There are surprisingly many changes, but all of them stem from evolution in the Catalogue of Life or the changed Algae classification.
  • Nub Browser: For comparing actual species and also reviewing the impact of the changed taxonomy on the GBIF occurrences, we developed a new Backbone Browser sitting on top of our existing API (Google Chrome only). Our test environment has a complete copy of the current GBIF occurrence index which we have reprocessed to use the new backbone. This also includes all maps and metrics which we show in the new browser.
Family Asparagaceae as seen in the nub browser:
Red numbers next to names indicate taxa that have fewer occurrences using the new backbone, while green numbers indicate an increase. This is also seen in the tree maps of the children by occurrences. The genus Campylandra J.G. Baker, 1875 is dark red with zero occurrences because the species in that genus were moved into the genus Rhodea in the latest Catalog of Life.

Species Asparagus asparagoides as seen in the nub browser:
The details view shows all synonyms, the basionym and also a list of homonyms from the new backbone.

Sources

We manually curate a list of priority ordered checklist datasets that we use to build the taxonomy. Three datasets are treated in a slightly special way:
  1. GBIF Backbone Patch: a small dataset we manually curate at GBIF to override any other list. We mainly use the dataset to add missing names reported by users.
  2. Catalogue of Life: The Catalogue of Life provides the entire higher classification above families, with the exception of algae.
  3. GBIF Algae Classification: With the withdrawal of Algaebase the current Catalogue of Life is lacking any algae taxonomy. To allow other sources to at least provide genus and species names for algae we have created a new dataset that just provides an algae classification down to families. This classification fits right into the empty phyla of the Catalogue of Life.
The GBIF portal now also lists the source datasets that contributed to the GBIF Backbone and the number of names that were used as primary references.

Other Improvements

As well as fixing the main issues listed above, there is another frequently occurring situation that we have improved. Many occurrences could not be matched to a backbone species because the name existed multiple times as an accepted taxon. In the new backbone, only one version of a name is ever considered to be accepted. All others now are flagged as doubtful. That resolves many issues which prevented a species match because of name ambiguity. For example there are many occurrences of Hyacinthoides hispanica in Britain which only show up in the new backbone (old / new occurrence, old / new match). This is best seen in the map comparison of the nub browser, try to swipe the map!

Known problems

We are aware of some problems with the new backbone which we would like to address in the next stage. Two of these issues we consider candidates for blocking the release of the new backbone:
Species matching service ignores authorship
As we now keep different authors better apart, the backbone contains a lot more species names which differ only by their authorship. The current algorithm only keeps one of these names as the accepted name, taken from the most trusted source (e.g. CoL), and treats the others as doubtful if they are not already treated as synonyms.
The problem currently is that the species matching service we use to align occurrences to the backbone does not deal with authorship. Therefore we have some cases where occurrences are attached to a doubtful name or even split across some of the “homonyms”.
There are 166,832 species names with different authorship in the new backbone, accounting for 98,977,961 occurrences.
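For illustration, the matching service in question is exposed through the public API (assuming the current api.gbif.org layout), e.g.:

https://api.gbif.org/v1/species/match?name=Poa%20pubescens%20R.Br.&kingdom=Plantae

Because the authorship in the name string is not taken into account, Poa pubescens R.Br. and Poa pubescens L. (see the example further up) currently cannot be told apart by this lookup.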
Too eager basionym merging
The same epithet is sometimes used by the same author for different names in the same family. This currently leads to an overly eager basionym grouping with fewer accepted names.
As these names are still in the backbone and occurrences can be matched to them this is currently not considered a blocker.

Thursday 25 February 2016

Reprojecting coordinates according to their geodetic datum

For a long time Darwin Core has had a term to declare the exact geodetic datum used for a given coordinate. Quite a few data publishers in GBIF have used dwc:geodeticDatum for some time to publish the datum of their location coordinates.

Until now GBIF has treated all coordinates as if they were in WGS84, the widespread global standard datum used by the Global Positioning System (GPS). Accordingly, locations given in a different datum, for example NAD27 or AGD66, were displaced a little on GBIF maps. This so-called “datum shift” is not dramatic, but can be up to a few hundred metres depending on the location and datum. The University of Colorado has a nice visualization of the impact.

At GBIF we now interpret the geodeticDatum and reproject all coordinates as well as we can into the single datum WGS84. There are two main steps: parse and interpret the given verbatim geodetic datum, and then do the actual transformation based on the known geodetic parameters.

Parsing geodeticDatum

As usual GBIF receives a lot of noise when reading dwc:geodeticDatum. After removing the obvious bad values, e.g. those introduced by bad mappings done by the publisher, we still ended up with over 300 different datum values. Most commonly simple names or abbreviations are used, e.g. NAD27, WGS72, ED50, TOKYO. In some cases we also see proper EPSG (http://www.epsg.org/) codes coming in, e.g. EPSG:4326, which is the EPSG code for WGS84. As EPSG is a widespread and complete reference dataset of geodetic parameters, supported by many Java libraries, we decided to add a new DatumParser to our parser library that directly returns EPSG integer codes for datum values. That way we can look up geodetic parameters easily in the following transformation step. In addition to parsing any given EPSG:xyz code directly, it also understands most datums found in the GBIF network, based on a simple dictionary file which we manually curate.

Even though EPSG codes are well maintained, very complete and supported by most software, opaque integer codes have a harder time getting adopted than meaningful short names. Maybe a lesson we should keep in mind when debating identifiers elsewhere.

Our recommendation to publishers is to use the EPSG codes if you know them, otherwise stick to the simple well known names. A good place to search for EPSG codes is http://epsg.io/.

Transformation

Once we have a decimal coordinate and a well known geodetic source datum, the transformation itself is rather straightforward. We use geotools to do the work. The first step in the transformation is to instantiate a CoordinateReferenceSystem (CRS) using the parsed EPSG code of the geodeticDatum. A CRS combines a datum with a coordinate system; in our case this is always a 2-dimensional system with the prime meridian in Greenwich, longitude values increasing East and latitude values North.

As EPSG codes can refer to either just a plain datum or a complete spatial reference system, we need to take this into account when building the CRS like this:

 private CoordinateReferenceSystem parseCRS(String datum) {
    CoordinateReferenceSystem crs = null;
    // the GBIF DatumParser in use
    ParseResult<Integer> epsgCode = PARSER.parse(datum);
    if (epsgCode.isSuccessful()) {
      final String code = "EPSG:" + epsgCode.getPayload();
      // first try to create a full fledged CRS from the given code
      try {
        crs = CRS.decode(code);

      } catch (FactoryException e) {
        // that didn't work, maybe it is *just* a datum
        try {
          GeodeticDatum dat = DATUM_FACTORY.createGeodeticDatum(code);
      // build a CRS using the standard 2-dim Greenwich coordinate system
          crs = new DefaultGeographicCRS(dat, DefaultEllipsoidalCS.GEODETIC_2D);

        } catch (FactoryException e1) {
          // also not a datum, no further ideas, log error
          LOG.info("No CRS or DATUM for given datum code >>{}<<: {}", datum, e1.getMessage());
        }
      }
    }
    return crs;
  }

Once we have a CRS instance we can create a specific WGS84 transformation and apply it to our coordinate:

public ParseResult<LatLng> reproject(double lat, double lon, String datum) {
   CoordinateReferenceSystem crs = parseCRS(datum);
   MathTransform transform = CRS.findMathTransform(crs, DefaultGeographicCRS.WGS84, true);
   // different CRS may swap the x/y axis for lat lon, so check first:
   double[] srcPt;
   double[] dstPt = new double[3];
   if (CRS.getAxisOrder(crs) == CRS.AxisOrder.NORTH_EAST) {
     // lat lon
     srcPt = new double[] {lat, lon, 0};
   } else {
     // lon lat
     srcPt = new double[] {lon, lat, 0};
   }

   transform.transform(srcPt, 0, dstPt, 0, 1);
   return ParseResult.success(ParseResult.CONFIDENCE.DEFINITE, new LatLng(dstPt[1], dstPt[0]), issues);
  }

The actual projection code does a bit more null and exception handling, which I have removed here for simplicity.

As you can see above we also have to watch out for spatial reference systems that use a different axis ordering. Luckily geotools knows all about that and provides a very simple way to test for it.

Issue flags

As with most of our processing we flag records when problems or assumed behavior occurs. In the case of the geodetic datum processing we keep track of 5 distinct issues which are available as GBIF portal occurrence search filters:

  • COORDINATE_REPROJECTION_FAILED: A CRS was instantiated, but the transformation failed for some reason.
  • GEODETIC_DATUM_INVALID: The datum parser was unable to return an EPSG code for the given datum string.
  • COORDINATE_REPROJECTION_SUSPICIOUS: The reprojection resulted in a datum shift larger than 0.1 degrees.
  • GEODETIC_DATUM_ASSUMED_WGS84: No datum was given or the given datum was not understood. In that case the original coordinates remain untouched.
  • COORDINATE_REPROJECTED: The coordinate was successfully transformed and differs now from the verbatim one given.
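For example, all records whose coordinates were transformed can be listed through the occurrence search API with the issue filter (assuming the current api.gbif.org layout):

https://api.gbif.org/v1/occurrence/search?issue=COORDINATE_REPROJECTED&limit=20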

Thursday 11 June 2015

Simplified Downloads

Since its re-launch in 2013, gbif.org has supported downloading occurrence data using an arbitrary query, with the download provided as a Darwin Core Archive (DwC-A) whose internal content is described here. This format contains comprehensive and self-explanatory information, which makes it suitable for referencing in external resources. However, in cases where people only need the occurrence data in its simplest form, the DwC-A format presents an additional complexity that can make it hard to use the data. Because of that we now support a new download format: a zip file that contains only a single file with the most common fields/terms used, where each column is separated by the TAB character. This makes things much easier when it comes to importing the data into tools such as Microsoft Excel, geographic information systems and relational databases. The current download functionality was extended to allow the selection of the desired format:

From this point the functionality remains the same: eventually you will receive an email containing a hyperlink where the file can be downloaded.
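For those scripting against the API, the same choice should be available in the occurrence download request itself: the JSON payload gains a format field next to the usual predicate and is POSTed with authentication to https://api.gbif.org/v1/occurrence/download/request. A rough sketch (the SIMPLE_CSV identifier and field names follow the current API documentation and should be treated as assumptions):

{
  "creator": "your_gbif_username",
  "sendNotification": true,
  "notificationAddresses": ["you@example.org"],
  "format": "SIMPLE_CSV",
  "predicate": {"type": "equals", "key": "COUNTRY", "value": "DK"}
}

Leaving out the format field should keep the default Darwin Core Archive.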

Technical Architecture

The simplified download format was implemented following the technical requirement that new formats should be easy to support in the future, with minimal impact on the formats already supported. In general, occurrence downloads are implemented using two different sets of technologies depending on the estimated size of the download in number of records; a threshold of 200,000 records defines whether a download is small (< 200k) or big (> 200k), and history shows a vast majority of "small" downloads. The following chart summarizes the key technologies that enable occurrence downloads:

Download workflow

Occurrence downloads are automated using a workflow engine called Oozie, which coordinates the required steps to produce a single download file. In summary the workflow proceeds as follows:
  1. Initially, Apache Solr is contacted to determine the number of records that the download file will contain.
  2. Big or small?
    1. If the number of records is less than 200,000 (a small download), Apache Solr is queried to iterate over the results; the detail of each occurrence record is fetched from HBase since it is the official storage of occurrence records. Individual downloads are produced by a multi-threaded application implemented using the Akka framework; the Apache Zookeeper and Curator frameworks are used to limit the number of threads that can run at the same time (this avoids a thread explosion on the machines that run the download workflow).
    2. If the number of records is greater than 200,000 (a big download), Apache Hive is used to retrieve the occurrence data from an HDFS table. To avoid overloading HBase, we create that HDFS table as a daily snapshot of the occurrence data stored in HBase.
  3. Finally the occurrence records are collected and organized in the requested output format (DwC-A or Simple).
Note: the details of how this is implemented can be consulted in the Github project: https://github.com/gbif/occurrence/tree/master/occurrence-download.

Conclusion

Reducing both the number of columns and the size (number of bytes) in our downloads has been one of our most requested features, and we hope this makes using the GBIF data easier for everyone.


Friday 29 May 2015

Don't fill your HDFS disks (upgrading to CDH 5.4.2)

Just a short post on the dangers of filling your HDFS disks. It's a warning you'll hear at conferences and in best practices blog posts like this one, but usually with only a vague consequence of "bad things will happen". We upgraded from CDH 5.2.0 to CDH 5.4.2 this past weekend and learned the hard way: bad things will happen.

The Machine Configuration

The upgrade went fine in our dev cluster (which has almost no data in HDFS) so we weren't expecting problems in production. Our production cluster is of course slightly different than our (much smaller) dev cluster. In production we have 3 masters, where one holds the NameNode and another holds the SecondaryNameNode (we're not yet using a High Availability setup, but it's in the plan). We have 12 DataNodes where each one has 13 disks dedicated to HDFS storage: 12 are 1TB and one is 512GB. They are formatted with 0% reserved blocks for root. The machines are evenly split into two racks.

Pre Upgrade Status

We were at about 75% total HDFS usage with only a few percent difference between machines. We were configured to use Round Robin block placement (dfs.datanode.fsdataset.volume.choosing.policy) with 10GB reserved for non-hdfs use (dfs.datanode.du.reserved), which are the defaults in CDH manager. Each of the 1TB disks was around 700GB used (of 932GB usable), and the 512 GB disks were all at their limit: 456GB used (of 466GB usable). That left only the configured 10GB free for non-hdfs use on the small disks. Our disks are mounted in the pattern /mnt/disk_a, /mnt/disk_b and so on, with /mnt/disk_m as the small disk. We're using the free version of CDHM so we can't do rolling upgrades, meaning this upgrade would bring everything down. And because our cluster is getting full (> 80% usage is another rumoured "bad things" threshold) we have reduced one class of data (users' occurrence downloads) to a replication factor of 2 (from the default of 3). This is considered somewhere between naughty and criminal, and you'll see why below.

Upgrade Time

We followed the recommended procedure and did the oozie, hive, and CDH manager backups, downloaded the latest parcels, and pressed the big Update button. Everything appeared to be going fine until HDFS tried to start up again, where the symptom was that it was taking a really long time (several minutes, after which the CDHM upgrade process finally gave up saying the DataNodes weren't making contact). Looking at the NameNode logs we see that it was performing a "Block Pool Upgrade", which took between 90 and 120 seconds for each of our ~700GB disks. Here's an excerpt of where it worked without problems:


2015-05-23 20:18:53,715 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_a/dfs/dn/in_use.lock acquired by nodename [email protected]
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:18:53,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:18:53,823 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.
   old LV = -56; old CTime = 1416737045694.
   new LV = -56; new CTime = 1432405112136
2015-05-23 20:20:33,565 INFO org.apache.hadoop.hdfs.server.common.Storage: HardLinkStats: 59768 Directories, including 53157 Empty Directories, 0 single Link operations, 6611 multi-Link operations, linking 22536 files, total 22536 linkable files.  Also physically copied 0 other files.
2015-05-23 20:20:33,609 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrade of block pool BP-2033573672-130.226.238.178-1367832131535 at /mnt/disk_a/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535 is complete

That upgrade time happens sequentially for each disk, so even though the machines were upgrading in parallel, we were still looking at ~30 minutes of downtime for the whole cluster. As if that wasn't sufficiently worrying, then we finally get to disk_m, our nearly full 512G disk:


2015-05-23 20:53:05,814 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /mnt/disk_m/dfs/dn/in_use.lock acquired by nodename [email protected]
2015-05-23 20:53:05,869 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:53:05,870 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled for /mnt/disk_m/
dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535
2015-05-23 20:53:05,886 INFO org.apache.hadoop.hdfs.server.common.Storage: Upgrading block pool storage directory /mnt/disk_m/
dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535.
   old LV = -56; old CTime = 1416737045694.
   new LV = -56; new CTime = 1432405112136
2015-05-23 20:54:12,469 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-2033573672-130.226.238.178-1367832131535
java.io.IOException: Cannot create directory /mnt/disk_m/
dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1259)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocksHelper(DataStorage.java:1296)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(DataStorage.java:1023)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.linkAllBlocks(BlockPoolSliceStorage.java:647)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doUpgrade(BlockPoolSliceStorage.java:456)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.doTransition(BlockPoolSliceStorage.java:390)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadStorageDirectory(BlockPoolSliceStorage.java:171)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:214)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1397)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1362)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:227)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:839)
        at java.lang.Thread.run(Thread.java:745)
2015-05-23 20:54:12,476 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-2033573672-130.226.238.178-1367832131535 : Cannot create directory /mnt/disk_m/dfs/dn/current/BP-2033573672-130.226.238.178-1367832131535/current/finalized/subdir91/subdir168

The somewhat misleading "Cannot create directory" is not a file permission problem but rather a disk full problem. During this block pool upgrade some temporary space is needed for rewriting metadata, and that space is apparently more than the 10G that was available to "non-HDFS" (which we've concluded means "not HDFS storage files, but everything else is fair game"). Because some space is available to start the upgrade, it begins, but then when it exhausts the disk it fails, and This Kills The DataNode. It does clean up after itself, but prevents the DataNode from starting, meaning our cluster was on its knees and in no danger of standing up.

So the problem was lack of free space, which on 10 of our 12 machines we were able to solve by wiping temporary files from the colocated yarn directory. Those 10 machines were then able to upgrade their disk_m and started up. We still had two nodes down and unfortunately they were in different racks, so that meant we had a big pile of our replication factor 2 files missing blocks (the default HDFS block replication policy places the second and subsequent copies on a different rack from the first copy).

While digging around in the different properties that we thought could affect our disks and HDFS behaviour we were also restarting the failing DataNodes regularly. At some point the log message changed to:

WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.FileNotFoundException: /mnt/disk_m/dfs/dn/in_use.lock (No space left on device)

After that message the DataNode started, but with disk_m marked as a failed volume. We're not sure why this happened but presume that after one of our failures it didn't clean up its temp files on disk_m and then on subsequent restarts found the disk completely full and (rightly) considered it unusable and tried to carry on. With the final two DataNodes up we had almost all of our cluster, minus the two failed volumes. There were only 35 corrupted files (missing blocks) left after they came up. These were files set to replication factor 2, and by bad luck had both copies of some of their blocks on the failed disk_m (one from rack1, one from rack2).

It would not have been the end of the world to just delete the corrupted user downloads (they were all over a year old) but on principle, it would not be The Right Thing To Do.

On inodes and hardlinks

The normal directory structure of the dfs dir in a DataNode is /dfs/dn/current/<blockpool name>/current/finalized and within finalized are a whole series of directories to fan out the various blocks that the volume contains. During the block pool upgrade a copy of 'finalized' is made called previous.tmp. It's not a normal copy however - it uses hardlinks in order to avoid duplicating all of the data (which obviously wouldn't work). The copy is needed during the upgrade and is removed afterwards. Since our upgrade failed halfway through we had both directories and had no choice but to move the entire /dfs directory off of /disk_m to a temporary disk and complete the upgrade there. We first tried a copy (use cp -a to preserve hardlinks) to a mounted NFS share. The copy looked fine but on startup the DataNode didn't understand the mounted drive ("drive not formatted"). Then we tried copying to a USB drive plugged into the machine and that ultimately worked (despite feeling decidedly un-Yahoo). Once the USB drive was upgraded and online in the cluster, replication took over and copied all of its blocks to new homes on /rack2. We then unmounted the USB drive, wiped both /disk_m's and then let replication balance out again. Final result: no lost blocks.

Mitigation

With the cluster happy again we made a few changes to hopefully ensure this doesn't happen again:
  • dfs.datanode.du.reserved:25GB this guarantees 25GB free on each volume (up from 10GB) and should be enough to allow a future upgrade to happen
  • dfs.datanode.fsdataset.volume.choosing.policy:AvailableSpace 
  • dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction:1.0 together these two direct new blocks to disks that have more free space, thereby leaving our now full /disk_m alone

Conclusion

This was one small taste of what can go wrong with filling heterogeneous disks in an HDFS cluster. We're sure there are worse dangers lurking on the full-disk horizon, so hopefully you've learned from our pain and will give yourself some breathing room when things start to fill up. Also, don't use a replication factor of less than 3 if there's any way you can help it.