DataDepsGenerators.jl is a tool written to help users of the Julia programming language to observe best practices when making use of published datasets. Using the metadata present in published datasets, it generates the code for the data dependency registration blocks required by DataDeps.jl [@2018arXiv180801091W]. These registration blocks are effectively executable metadata, which can be resolved by DataDeps.jl to download the dataset. They include a message that is displayed to the user whenever the data set is automatically downloaded. This message should include provenance information on the dataset, so that downstream users know its original source and details on its processing.
This package should not be used as a direct dependency. Instead, its interactive features should be used from the Julia REPL to generate a good registration code block, which can then be added to your package.
The registration block can be immediately evaluated using eval(Meta.parse(generate(...))),
which is handy for interactive prototyping.
However, it is not a good idea to put this code in a package,
as it involves triggering web requests every time the package is loaded.
Not to mention that DataDepsGenerators.jl has a pretty heavy set of dependencies,
which you really don't want weighing down your package.
Note: DataDepsGenerators does its best to generate the correct registration code block, but it is up to you to make sure it is right. The code it generates isn't always the cleanest. It may capture too much or too little information. It might get things wrong (particularly when the metadata retrieved is wrong). You should take a few moments to check that you are happy with the generated registration block. Make a few tweaks, and it should be good to go.
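As a rough aid to that manual check, the sketch below runs two quick sanity tests on a generated block using only Base Julia. The registration string here is a hypothetical stand-in for what generate might return, not real output, and the two checks are illustrative assumptions rather than anything DataDepsGenerators does itself.

```julia
# Hypothetical stand-in for a string returned by generate(...).
registration = """
register(DataDep(
    "UCI Air",
    "Dataset: Air Quality",
    String[],
))
"""

# Meta.parse throws if the text is not a single well-formed Julia expression.
expr = Meta.parse(registration)
is_register_call = expr isa Expr && expr.head == :call && expr.args[1] == :register

# A block without any http(s) URL probably lacks its download links.
has_url = occursin(r"https?://", registration)

is_register_call || println("Unexpected top-level expression; inspect the block by hand.")
has_url || println("No URL found; the download links likely need adding manually.")
```

Checks like these only catch structural problems; the message text and the list of URLs still need reading by eye.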
All usage revolves around the generate function.
generate() is an overloaded function that can be used in two ways:
The normal way to use the package is the one- or two-argument form.
id_or_url: an identifier for the dataset, which is used to look up its metadata.
datadep_name: an optional argument giving the name to use for the datadep. For example, if the name is "Foo", then when you use the datadep in your code you'll write datadep"Foo".
Returns a String containing the generated DataDeps registration block.
Displaying it directly in the REPL will show it as full of escape characters; use print or println to see it as formatted code.
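The difference between displaying and printing the returned string can be seen with Base Julia alone. In this sketch the registration string is a hypothetical stand-in for the value generate returns (the name, message, and URL are made up for illustration).

```julia
# Hypothetical stand-in for the String returned by generate(...).
registration = """
register(DataDep(
    "Foo",
    "A hypothetical message",
    "https://example.com/foo.csv",
))
"""

# repr gives the same escaped, quoted form the REPL shows for a String value.
escaped = repr(registration)
println(escaped)

# print writes the raw text, which is the form to copy into a package.
print(registration)
```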
Example of use:
generate("https://datadryad.org/resource/doi:10.5061/dryad.74699", "Wettberg 2018, important crop's wild relatives")
Using this non-repository-specific generate command causes DataDepsGenerators
to query all repositories and metadata sources about this dataset.
Often more than one succeeds, since, for example, the data may have a record in a DOI registry,
be on a platform whose API we support,
and be on a page that has an inline JSON-LD fragment.
generate(datarepo::DataRepo, id_or_url, [datadep_name])::String
An extra argument needs to be provided to specify the data repository.
datarepo is a data repository.
To write the dependency block to a file, you just need to open the file ("data.jl" in this example) and write to it.
using DataDepsGenerators
open("data.jl", "w") do fh
    registration = generate("https://archive.ics.uci.edu/ml/datasets/Air+quality", "UCI Air")
    print(fh, registration)
end
Then in your project to load the registration you can do:
using DataDeps
function __init__()
    include("data.jl")
end
This is pretty easy:
println(generate(UCI(), "https://archive.ics.uci.edu/ml/datasets/Air+quality", "UCI Air"))
then copy and paste into your project.
While this isn't advised for use in packages, since it throws away many of the benefits of using DataDeps, it can be done. It is probably most useful in the REPL or IJulia.
using DataDeps
using DataDepsGenerators
eval(Meta.parse(generate(UCI(), "https://archive.ics.uci.edu/ml/datasets/Air+quality", "UCI Air")))
Then just use datadep"UCI Air" anywhere in your code (later in the REPL session, for example),
as if it were the name of a directory holding that data
(which is indeed what that string macro expands into, even if it has to download the data first).
CKAN() - API Based
CKAN is primarily used by government organizations.
Data Repositories and examples of use:
DataCite() - API Based
DataCite is the largest provider of DOIs for things other than papers, especially for data.
Example of use:
generate(DataCite(), "10.5063/F1HT2M7Q")
generate(DataCite(), "https://search.datacite.org/works/10.15148/0e999ffc-e220-41ac-ac85-76e92ecd0320")
Either URL or DOI can be provided as arguments.
DataCite cannot generate complete and usable registration blocks on its own, as it does not include the download URLs.
DataDryad() - Web Based
DataDryad is one of the bigger research data stores. Almost all the data in it is directly linked to one paper or another.
Example of use:
generate(DataDryad(), "https://datadryad.org/resource/doi:10.5061/dryad.74699", "Wild Crop Genomics")
Supports DataOne API version 2. The API structure differs between repositories, so each is accounted for separately:
GitHub() - Web Based
Note that storing data in GitHub is generally not great, particularly for large binary data. However, a fair few datasets are stored there anyway. A lot of these are small plain-text files, so it works out OK.
The generator for GitHub works on whole repositories, or on folders within repositories.
When downloading whole repositories, another option is to download the
tarball that GitHub provides, rather than generating a datadep with DataDepsGenerators, which results in downloading each file separately.
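A hand-written registration for the tarball route might look like the sketch below. The owner, repository, and datadep name are hypothetical placeholders; the URL follows GitHub's archive link pattern, and unpack is the post-fetch helper exported by DataDeps. This is an illustration written by hand, not output of DataDepsGenerators.

```julia
using DataDeps

# Hypothetical owner, repository, and datadep name, for illustration only.
url = "https://github.com/someuser/somerepo/archive/master.tar.gz"

register(DataDep(
    "SomeRepo",
    """
    Dataset: contents of the (hypothetical) someuser/somerepo repository
    Website: https://github.com/someuser/somerepo
    """,
    url;
    # No checksum is given here, so DataDeps will warn and print one to add.
    post_fetch_method = unpack,
))
```

Note that an archive of a branch changes whenever the branch does, so pinning the URL to a commit SHA instead of master keeps the download reproducible.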
You could even manipulate DataDeps into doing a
Note that GitHub does not like being used as a CDN. For this reason, DataDepsGenerators generates URLs to http://cdn.rawgit.com, which is a CDN wrapper over GitHub, so you won't thrash GitHub's servers. Also note that DataDepsGenerators will produce URLs pointing to the current commit, so if the repository is updated, the DataDep will still download the old data. (This is a feature.)
At present, we do not support generating for any branch other than master, though it is a simple matter to do a find and replace on the commit SHAs in the generated code so as to point at any commit.
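The find and replace on commit SHAs can be done on the generated string itself before saving it. In this sketch the registration text, the repository path, and both SHAs are made-up placeholders standing in for real generated output.

```julia
# Hypothetical generated block; the repository path and SHAs are placeholders.
registration = """
register(DataDep(
    "Some Data",
    "A hypothetical message",
    [
        "http://cdn.rawgit.com/someuser/somerepo/0123456789abcdef0123456789abcdef01234567/data/a.csv",
        "http://cdn.rawgit.com/someuser/somerepo/0123456789abcdef0123456789abcdef01234567/data/b.csv",
    ],
))
"""

old_sha = "0123456789abcdef0123456789abcdef01234567"  # commit the generator pinned
new_sha = "89abcdef0123456789abcdef0123456789abcdef"  # commit you want instead

# replace on a String substitutes every occurrence of the old SHA.
updated = replace(registration, old_sha => new_sha)
print(updated)
```

If the files differ between the two commits, any checksums in the block will also need updating.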
Figshare() - API Based
Figshare is a popular website for sharing figures and data.
Example of use:
generate(Figshare(), "10.1371/journal.pbio.2001414")
generate(Figshare(), "https://figshare.com/articles/Youth_Activism_in_Chile_from_urban_educational_inequalities_to_experiences_of_living_together_and_solidarity/6504206")
A URL, DOI or Figshare ID can be provided as arguments.
JSONLD_DOI() - API Based
This uses a DataCite JSON-LD service to retrieve the metadata for DOIs issued by CrossRef or DataCite. Like the DataCite generator, this usually cannot generate complete registration blocks, as the API does not include the download URLs.
Example of use:
JSONLD_Web() - Web Based
A lot of data hosting websites, like Kaggle, Zenodo, and Dataverse (including several with their own generators),
store information in the form of JSON-LD
<script> fragments embedded in the HTML webpages.
This is used by the Google Dataset Search engine too,
so any result from Google Dataset Search should work out of the box with DataDepsGenerators.
The completeness of the information in the JSON-LD fragment varies depending on the site.
So this may generate incomplete registration blocks, e.g. with the download URL missing.
Example of use:
UCI() - Web Based
A fairly classic repository for (mostly) small machine learning datasets. This uses web scraping, and since it is a handwritten website, it is not perfectly consistently written or formatted. Thus the registrations can be a bit choppy and may, e.g., contain links that should have been removed.