bigcode/the-stack-smol-xs | ATYUN.COM Official Site: A Full-Service Platform for AI Tutorials and News

Dataset:

bigcode/the-stack-smol-xs

Tasks:

Text Generation

Sub-tasks:

language-modeling

Languages:

code

Multilinguality:

multilingual

Size:

size_categories:unknown

Language creators:

crowdsourced

Dataset Description

A small subset of the-stack dataset covering 87 programming languages, with 100 random samples per language drawn from the original dataset, intended for visualization.

Languages

The dataset contains 87 programming languages:

'ada', 'agda', 'alloy', 'antlr', 'applescript', 'assembly', 'augeas', 'awk', 'batchfile', 'bison', 'bluespec', 'c',
'c++', 'c-sharp', 'clojure', 'cmake', 'coffeescript', 'common-lisp', 'css', 'cuda', 'dart', 'dockerfile', 'elixir',
'elm', 'emacs-lisp', 'erlang', 'f-sharp', 'fortran', 'glsl', 'go', 'groovy', 'haskell', 'html', 'idris', 'isabelle', 'java',
'java-server-pages', 'javascript', 'julia', 'kotlin', 'lean', 'literate-agda', 'literate-coffeescript', 'literate-haskell',
'lua', 'makefile', 'maple', 'markdown', 'mathematica', 'matlab', 'ocaml', 'pascal', 'perl', 'php', 'powershell', 'prolog',
'protocol-buffer', 'python', 'r', 'racket', 'restructuredtext', 'rmarkdown', 'ruby', 'rust', 'sas', 'scala', 'scheme',
'shell', 'smalltalk', 'solidity', 'sparql', 'sql', 'stan', 'standard-ml', 'stata', 'systemverilog', 'tcl', 'tcsh', 'tex',
'thrift', 'typescript', 'verilog', 'vhdl', 'visual-basic', 'xslt', 'yacc', 'zig'
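
To pull several of these languages at once, one option is a small helper that calls a loader once per name. This is only a sketch: it assumes each name in the list above is accepted as the configuration name (the second argument to `load_dataset`), and `load_languages`, its `loader` parameter, and `fake_loader` are hypothetical names introduced here, not part of the `datasets` library.

```python
from typing import Any, Callable, Dict, Iterable

DATASET_NAME = "bigcode/the-stack-smol-xs"


def load_languages(
    languages: Iterable[str],
    loader: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Call loader(DATASET_NAME, lang) for each language and collect the results.

    `loader` is injected so the helper can be exercised without network access.
    In real use it would be `datasets.load_dataset` (assumption: every name in
    the language list is a valid configuration name).
    """
    return {lang: loader(DATASET_NAME, lang) for lang in languages}


# Hypothetical stand-in loader that only records what it was asked for.
def fake_loader(name: str, config: str) -> str:
    return f"{name}:{config}"


subsets = load_languages(["go", "rust", "zig"], fake_loader)
```

With the real library, `load_languages(["go", "rust"], load_dataset)` should return one `DatasetDict` per language; at 100 rows each, the result stays small.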

Dataset Structure

You can specify which language to load; python is loaded by default:

# to load go:
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-smol-xs", "go")
print(ds)  # prints the structure shown below

DatasetDict({
    train: Dataset({
        features: ['content', 'lang', 'size', 'ext', 'max_stars_count', 'avg_line_length', 'max_line_length', 'alphanum_fraction'],
        num_rows: 100
    })
})
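
The per-row features listed above can be used to pick out samples worth looking at. The following is a minimal sketch, assuming `max_line_length` is an integer and `alphanum_fraction` is a float between 0 and 1; the helper `keep_readable`, its thresholds, and the two sample rows are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def keep_readable(
    rows: Iterable[Row],
    max_line_length: int = 200,
    min_alphanum_fraction: float = 0.3,
) -> List[Row]:
    """Keep rows whose longest line and alphanumeric share pass both thresholds."""
    return [
        row
        for row in rows
        if row["max_line_length"] <= max_line_length
        and row["alphanum_fraction"] >= min_alphanum_fraction
    ]


# Hypothetical rows using the feature names shown above; the values are made up.
sample_rows: List[Row] = [
    {"content": "package main\n", "lang": "go", "size": 13, "ext": "go",
     "max_stars_count": 2, "avg_line_length": 13.0, "max_line_length": 13,
     "alphanum_fraction": 0.85},
    {"content": "x" * 500, "lang": "go", "size": 500, "ext": "go",
     "max_stars_count": None, "avg_line_length": 500.0, "max_line_length": 500,
     "alphanum_fraction": 1.0},
]

kept = keep_readable(sample_rows)
print(len(kept))  # 1: the second row exceeds the line-length threshold
```

On the real data, passing `load_dataset("bigcode/the-stack-smol-xs", "go")["train"]` as `rows` should work the same way, since iterating a `Dataset` yields one dict per row.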

Author:

bigcode

Dataset size:

59.54 MB