模型:
ctrl
The CTRL model was proposed in CTRL: A Conditional Transformer Language Model for Controllable Generation by Nitish Shirish Keskar*, Bryan McCann*, Lav R. Varshney, Caiming Xiong and Richard Socher. It's a causal (unidirectional) transformer pre-trained using language modeling on a very large corpus of ~140 GB of text data with the first token reserved as a control code (such as Links, Books, Wikipedia etc.). The model developers released a model card for CTRL, available here .
In their model card , the developers write:
The CTRL Language Model analyzed in this card generates text conditioned on control codes that specify domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
The model is a language model. The model can be used for text generation.
In their model card , the developers write that the primary intended users are general audiences and NLP Researchers, and that the primary intended uses are:
In their model card , the developers write:
Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021) ). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
In their model card , the developers write:
We recognize the potential for misuse or abuse, including use by bad actors who could manipulate the system to act maliciously and generate text to influence decision-making in political, economic, and social settings. False attribution could also harm individuals, organizations, or other entities. To address these concerns, the model was evaluated internally as well as externally by third parties, including the Partnership on AI, prior to release.
To mitigate potential misuse to the extent possible, we stripped out all detectable training data from undesirable sources. We then redteamed the model and found that negative utterances were often placed in contexts that made them identifiable as such. For example, when using the ‘News’ control code, hate speech could be embedded as part of an apology (e.g. “the politician apologized for saying [insert hateful statement]”), implying that this type of speech was negative. By pre-selecting the available control codes (omitting, for example, Instagram and Twitter from the available domains), we are able to limit the potential for misuse.
In releasing our model, we hope to put it into the hands of researchers and prosocial actors so that they can work to control, understand, and potentially combat the negative consequences of such models. We hope that research into detecting fake news and model-generated content of all kinds will be pushed forward by CTRL. It is our belief that these models should become a common tool so researchers can design methods to guard against malicious use and so the public becomes familiar with their existence and patterns of behavior.
See the associated paper for further discussions about the ethics of LLMs.
In their model card , the developers write:
See the CTRL-detector GitHub repo for more on the detector model.
In their model card , the developers write:
This model is trained on 140 GB of text drawn from a variety of domains: Wikipedia (English, German, Spanish, and French), Project Gutenberg, submissions from 45 subreddits, OpenWebText, a large collection of news data, Amazon Reviews, Europarl and UN data from WMT (En-De, En-Es, En-Fr), question-answer pairs (no context documents) from ELI5, and the MRQA shared task, which includes Stanford Question Answering Dataset, NewsQA, TriviaQA, SearchQA, HotpotQA, and Natural Questions. See the paper for the full list of training data.
In the associated paper the developers write:
We learn BPE (Sennrich et al., 2015) codes and tokenize the data using fastBPE4, but we use a large vocabulary of roughly 250K tokens. This includes the sub-word tokens necessary to mitigate problems with rare words, but it also reduces the average number of tokens required to generate long text by including most common words. We use English Wikipedia and a 5% split of our collected OpenWebText data for learning BPE codes. We also introduce an unknown token so that during preprocessing we can filter out sequences that contain more than 2 unknown tokens. This, along with the compressed storage for efficient training (TFRecords) (Abadi et al., 2016), reduces our training data to 140 GB from the total 180 GB collected.
See the paper for links, references, and further details.
In the associated paper the developers write:
CTRL has model dimension d = 1280, inner dimension f = 8192, 48 layers, and 16 heads per layer. Dropout with probability 0.1 follows the residual connections in each layer. Token embeddings were tied with the final output embedding layer (Inan et al., 2016; Press & Wolf, 2016).
See the paper for links, references, and further details.
In their model card , the developers write that model performance measures are:
Performance evaluated on qualitative judgments by humans as to whether the control codes lead to text generated in the desired domain
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019) . Details are pulled from the associated paper .
In the associated paper the developers write:
CTRL was implemented in TensorFlow (Abadi et al., 2016) and trained with a global batch size of 1024 distributed across 256 cores of a Cloud TPU v3 Pod for 800k iterations. Training took approximately 2 weeks using Adagrad (Duchi et al., 2011) with a linear warmup from 0 to 0.05 over 25k steps. The norm of gradients were clipped to 0.25 as in (Merity et al., 2017). Learning rate decay was not necessary due to the monotonic nature of the Adagrad accumulator. We compared to the Adam optimizer (Kingma & Ba, 2014) while training smaller models, but we noticed comparable convergence rates and significant memory savings with Adagrad. We also experimented with explicit memory-saving optimizers including SM3 (Anil et al., 2019), Adafactor (Shazeer & Stern, 2018), and NovoGrad (Ginsburg et al., 2019) with mixed results.
See the paper for links, references, and further details.
BibTeX:
@article{keskarCTRL2019, title={{CTRL - A Conditional Transformer Language Model for Controllable Generation}}, author={Keskar, Nitish Shirish and McCann, Bryan and Varshney, Lav and Xiong, Caiming and Socher, Richard}, journal={arXiv preprint arXiv:1909.05858}, year={2019} }
APA:
This model card was written by the team at Hugging Face, referencing the model card released by the developers.
Use the code below to get started with the model. See the Hugging Face ctrl docs for more information.
Click to expand>>> from transformers import CTRLTokenizer, CTRLModel >>> import torch >>> tokenizer = CTRLTokenizer.from_pretrained("ctrl") >>> model = CTRLModel.from_pretrained("ctrl") >>> # CTRL was trained with control codes as the first token >>> inputs = tokenizer("Opinion My dog is cute", return_tensors="pt") >>> assert inputs["input_ids"][0, 0].item() in tokenizer.control_codes.values() >>> outputs = model(**inputs) >>> last_hidden_states = outputs.last_hidden_state >>> list(last_hidden_states.shape)