.. _datasets-ref:

========
Datasets
========

Datasets in Shoelace can be loaded either from the RankSVM .txt format or from a custom binary format. When you are just getting started, you will want to load data sets in RankSVM format, as most data sets are distributed that way. Once a data set is loaded, however, it is much faster to save and reload it in the custom binary format.

RankSVM Format
==============

The most common format for a Learning to Rank data set is the RankSVM format, and many publicly available annotated data sets use it: each line contains a relevance label, a query identifier (``qid``) and a sparse list of ``feature:value`` pairs. Given a Learning to Rank data set called `dataset.txt`, you can load it as follows:

.. code-block:: python

    from shoelace.dataset import LtrDataset

    with open('./dataset.txt', 'r') as file:
        dataset = LtrDataset.load_txt(file)

Some data sets in RankSVM format do not come with query-level normalization. For most loss functions that depend on exponentials, query-level normalization is highly recommended to prevent overflow errors. Fortunately, Shoelace has normalization built into its data set loading facilities. You can load a data set with query-level normalization by setting the `normalize` parameter to `True`:

.. code-block:: python

    with open('./dataset.txt', 'r') as file:
        dataset = LtrDataset.load_txt(file, normalize=True)

Binary Format
=============

Shoelace provides a custom binary format that is much faster to save and load than the RankSVM text format. Once you have loaded a data set, you can save it to a binary file, so that you can load it much faster in future experiments:

.. code-block:: python

    with open('./dataset.bin', 'wb') as file:
        dataset.save(file)

Once saved, you can load the file again. This is several orders of magnitude faster than loading the corresponding .txt file:

.. code-block:: python

    with open('./dataset.bin', 'rb') as file:
        dataset = LtrDataset.load(file)

Iterators
=========

Chainer uses iterators to feed a neural network with batches of data. We provide an iterator specifically designed for Learning to Rank tasks. This iterator treats a single query (and all of its associated query-document pairs) as a single minibatch, which allows list-wise Learning to Rank methods to take the entire ranked list for a given query into account. As a consequence, not every minibatch has the same size, which is fortunately no problem for Chainer's run-time architecture. You can load the data set into an iterator as follows:

.. code-block:: python

    from shoelace.dataset import LtrIterator

    iterator = LtrIterator(dataset)

You can additionally specify whether the iterator should repeat forever (e.g. for training data, but not for test data) and whether the order of the data should be shuffled on every epoch:

.. code-block:: python

    iterator = LtrIterator(dataset, repeat=True, shuffle=True)
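
As a quick sanity check, you can pull a single minibatch from the iterator yourself. The sketch below assumes that ``LtrIterator`` follows Chainer's standard iterator interface (as ``chainer.iterators.SerialIterator`` does), so that calling ``next()`` returns the query-document pairs of one query:

.. code-block:: python

    # A minimal sketch, assuming LtrIterator follows Chainer's
    # standard iterator interface.
    batch = iterator.next()  # all query-document pairs for one query
    print(len(batch))        # minibatch size varies per query

In a typical experiment you would not consume the iterator by hand like this; instead you pass it to Chainer's training machinery, which repeatedly draws these per-query minibatches for you.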