Adding data to IYP

Before you start hacking away, the first, and very important, step is to model your data as a graph. Take a look at the existing node and relationship types to see where your data could attach to the existing graph and whether you can reuse existing relationship types. Also refer back to the IYP data modeling section. If you need help, feel free to start a discussion on GitHub.

Once you have modeled your data, you can start writing a crawler. The main tasks of a crawler are to fetch data, parse it, model it with the IYP ontology, and push it to the IYP database. Most of these tasks are assisted by the IYP Python library (described next).
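To make these tasks concrete, the sketch below shows what a minimal crawler could look like. It is illustrative only: the dataset URL and the assumed input format are made up, and the helper names (BaseCrawler, batch_get_nodes_by_single_prop, batch_add_links, self.reference) are our recollection of the IYP library, so check iyp/__init__.py and the example crawler for the authoritative API.

# Minimal crawler sketch. The helper names below are assumptions based on
# the example crawler; check iyp/__init__.py for the authoritative API.
import requests

from iyp import BaseCrawler

ORG = 'Example Org'                       # organization publishing the data
URL = 'https://example.org/dataset.csv'   # hypothetical dataset URL
NAME = 'example.crawler1'                 # name of this crawler


class Crawler(BaseCrawler):
    def run(self):
        # 1. Fetch the data.
        req = requests.get(URL)
        req.raise_for_status()

        # 2. Parse it; here we assume one "<ASN>,<prefix>" pair per line.
        pairs = [line.split(',') for line in req.text.splitlines() if line]

        # 3. Model it with the IYP ontology: map values to node IDs,
        #    creating missing nodes along the way (assumed helper).
        asn_id = self.iyp.batch_get_nodes_by_single_prop(
            'AS', 'asn', {int(asn) for asn, _ in pairs})
        prefix_id = self.iyp.batch_get_nodes_by_single_prop(
            'Prefix', 'prefix', {prefix for _, prefix in pairs})

        # 4. Push the relationships to the IYP database, annotated with
        #    the crawler's reference metadata (assumed helper).
        links = [{'src_id': asn_id[int(asn)],
                  'dst_id': prefix_id[prefix],
                  'props': [self.reference]}
                 for asn, prefix in pairs]
        self.iyp.batch_add_links('ORIGINATE', links)

The underlying design idea is to first fetch or create all nodes in a batch, then push all relationships in a second batch, which keeps the number of database round trips small.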

IYP code structure

The repository and code are structured like this:

internet-yellow-pages/
├─ iyp/
│  ├─ __init__.py <- contains IYP module
│  ├─ crawlers/
│  │  ├─ org/ <- name of the organization
│  │  │  ├─ README.md <- README describing datasets and modeling
│  │  │  ├─ crawler1.py <- one crawler per dataset
│  │  │  ├─ crawler2.py
│  ├─ post/ <- for post-processing scripts

The canonical way to execute a crawler is:

python3 -m iyp.crawlers.org.crawler1

Writing an IYP crawler

A full explanation of how to write a crawler from scratch is outside the scope of this tutorial. To get you started, we point you to the existing documentation, the example crawler that you can use as a template, and the best practices for writing crawlers. You can also look at other existing crawlers and, of course, always contact us for help.
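As a point of reference, a crawler module typically ends with a small entry point roughly like the sketch below (continuing the sketch above), which is what makes the python3 -m invocation shown earlier work. The real template in the repository also sets up logging and command-line arguments, so copy the example crawler rather than this sketch.

# Sketch of the module-level entry point; details such as logging and
# argument parsing are omitted here but present in the example crawler.
import sys


def main() -> None:
    crawler = Crawler(ORG, URL, NAME)   # Crawler, ORG, URL, NAME as defined above
    crawler.run()                       # fetch, parse, model, push
    crawler.close()                     # assumed to finalize the database connection


if __name__ == '__main__':
    main()
    sys.exit(0)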

Making data publicly available

If you want to add private data to your own instance, feel free to do so. However, we welcome crawler contributions that add data to IYP!

The workflow for this is usually as follows:

  1. Propose a new dataset by opening a discussion. The point of the discussion is to decide whether a dataset should be included and how to model it. Please add a short description of why the dataset would be useful for you or others; this is just to prevent adding datasets for the sake of it (“because we can”), which inflates the database size. You also do not have to provide a perfect model at the start; we can figure this out together.
  2. Once it is decided that we want to integrate the dataset and how to model it, the discussion will be converted into an issue. Then you (or someone else) can implement it.
  3. Open a pull request with the crawler implementation.
  4. We will merge it and the next dump will contain your dataset!