Dataset:

Vipitis/Shadertoys-fine

Tasks:

Text generation

Languages:

code

Size:

100K<n<1M

Language creators:

machine-generated

Annotations creators:

no-annotation

Other:

code

License:

cc-by-nc-sa-3.0

Dataset Card for Shadertoys-fine

Dataset Summary

The fine variant of the Shadertoys dataset (still a work in progress), in which individual functions are available as data points.

Supported Tasks and Leaderboards

language-modeling: The dataset can be used to train language models for programming languages.

Languages

English (names, comments)
Shadercode programming language

Dataset Structure

Data Instances

A data point consists of the function string, its name, and some metadata such as the author and source URL. (In the future there might also be a function string without comments.)

{
  'name': '<type> <name>',
  'code': '<type> <name>(<inputs>) { <body> return <outputs>; }\n',
  'source': 'https://shadertoy.com/view/<shaderID>',
  'author': '<username>'
}

A data point in the return_completion subset for the return-completion task in ShaderEval includes just two features:

{
  'body': '<type> <name> <type> <name>(<inputs>) { <body> return',
  'return_statment': ' <outputs>; }\n',
}
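The relationship between the two features can be illustrated with a minimal sketch that cuts a function string just after its last `return` keyword. This is not the script used to build the subset: the helper name `split_at_last_return` is hypothetical, comments are not stripped here, and functions with no `return` are simply rejected.

```python
import re


def split_at_last_return(code: str) -> dict:
    """Split a function string just after its last `return` keyword.

    Hypothetical helper for illustration; it does not remove comments.
    """
    matches = list(re.finditer(r"\breturn\b", code))
    if not matches:
        raise ValueError("no return statement found")
    cut = matches[-1].end()
    # Field names follow the dataset card, including its spelling.
    return {"body": code[:cut], "return_statment": code[cut:]}


example = "float sq(float x) { float y = x * x; return y; }\n"
parts = split_at_last_return(example)
print(repr(parts["body"]))
print(repr(parts["return_statment"]))
```

Concatenating `body` and `return_statment` gives back the original function string, which is what makes the pair usable as a prompt and a reference completion.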

Data Fields

'name' function identifier composed of the type and the name of the function
'code' the raw code (including comments) of the function
'source' URL to the shader. The function might be on a different renderpass.
'author' username of the shader author
'body' the body of the function without the return statement (no comments)
'return_statment' the return statement of the function. Everything in front of the semicolon is kept and white spaces are stripped in the custom Evaluator.
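The normalization described for 'return_statment' can be sketched as follows. This is an assumption about what the Evaluator does, not its actual code: the function name `normalize_return` is hypothetical, and "stripped" is read here as removing leading and trailing whitespace only.

```python
def normalize_return(statement: str) -> str:
    """Keep everything in front of the first semicolon and strip whitespace.

    Hypothetical helper; the real Evaluator may normalize differently.
    """
    return statement.split(";", 1)[0].strip()


reference = normalize_return(" y; }\n")
prediction = normalize_return("  y ;\n}\n// trailing text")
print(reference == prediction)
```

Comparing normalized strings this way ignores anything a model generates after the semicolon, such as the closing brace or extra lines.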

Data Splits

Currently available (shuffled):

train (85.0%)
test (15.0%)

These splits should be indexed the same across both subsets, so if you are fine-tuning on the fine subset you won't be exposed to the return_completion test split. However, there are many duplicates across both subsets and splits.
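Because of these duplicates, it can be worth checking how much of the test split also appears in the training data before reporting results. The sketch below works on plain lists of dictionaries; the rows are made-up toy data and the helper name `cross_split_duplicates` is hypothetical.

```python
def cross_split_duplicates(train_rows, test_rows, key="code"):
    """Return the sorted values of `key` that occur in both splits."""
    train_values = {row[key] for row in train_rows}
    test_values = {row[key] for row in test_rows}
    return sorted(train_values & test_values)


# Toy rows for illustration only, not taken from the dataset.
train = [
    {"code": "float sq(float x) { return x * x; }\n"},
    {"code": "float halve(float x) { return 0.5 * x; }\n"},
]
test = [{"code": "float sq(float x) { return x * x; }\n"}]

overlap = cross_split_duplicates(train, test)
print(len(overlap))
```

This only catches exact string matches; near-duplicates that differ in whitespace or comments would need a looser comparison.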

Dataset Creation

Data retrieved starting 2022-07-20

Source Data

Initial Data Collection and Normalization

All data was collected via the Shadertoy.com API. Functions were then extracted by looking for keywords and counting curly brackets to determine what is part of a function and what isn't.
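The keyword and bracket counting approach can be sketched roughly as below. This is an illustration under stated assumptions, not the original extraction script: the keyword list `TYPE_KEYWORDS`, the header pattern, and the function name `extract_functions` are all hypothetical, and braces inside comments are not handled.

```python
import re

# Hypothetical keyword list; the keywords actually used are not documented here.
TYPE_KEYWORDS = ("void", "float", "int", "bool", "vec2", "vec3", "vec4",
                 "mat2", "mat3", "mat4")
HEADER = re.compile(
    r"\b(" + "|".join(TYPE_KEYWORDS) + r")\s+(\w+)\s*\([^)]*\)\s*\{"
)


def extract_functions(shader_code: str) -> list:
    """Find function headers, then count curly brackets to find each end."""
    functions = []
    pos = 0
    while True:
        match = HEADER.search(shader_code, pos)
        if match is None:
            break
        depth = 1  # the header pattern already consumed the opening bracket
        i = match.end()
        while i < len(shader_code) and depth > 0:
            if shader_code[i] == "{":
                depth += 1
            elif shader_code[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced brackets: stop instead of guessing
        functions.append({
            "name": f"{match.group(1)} {match.group(2)}",
            "code": shader_code[match.start():i],
        })
        pos = i
    return functions


shader = """
float sq(float x) { return x * x; }
void mainImage(out vec4 c, in vec2 p) {
    if (p.x > 0.5) { c = vec4(1.0); }
    else { c = vec4(sq(p.y)); }
}
"""
functions = extract_functions(shader)
print([f["name"] for f in functions])
```

Tracking the nesting depth is what lets the inner `if` and `else` blocks stay inside `mainImage` rather than ending the function at the first closing bracket.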

Who are the source language producers?

Shadertoy.com contributors who publish shaders as 'public+API'

Licensing Information

The default license for each shader is CC BY-NC-SA 3.0. However, some shaders might have a different license attached. The dataset currently does not filter by license.

Author:

Vipitis

Dataset size:

189.77 MB