
LM Memorization

A repository dedicated to using simple techniques to tease out what language models actually represent as learned knowledge and what they merely memorize from disparate training sources.

Setup and Installation

Python Interpreter

All included source code is written and tested in a Python 3.11 environment, though older versions of Python 3 may still function.

Packages

All packages imported by the various scripts are listed in requirements.txt and can be installed with pip3 install -r requirements.txt

Language Models

This repository is designed to work with locally run LMs via the Ollama Python library. Switching to the OpenAI API/Python library (which Ollama also implements, so it can still interface with your local models) is a consideration for the future.
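As a minimal sketch of how the Ollama Python library is driven (the model name and prompt below are illustrative assumptions; substitute any model you have already pulled locally):

```python
import ollama

# Ask a locally served model a simple arithmetic question.
# "llama3" is an assumed model name, not one mandated by this repository;
# replace it with any model you have pulled via `ollama pull <model>`.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print(response["message"]["content"])
```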

Tokenization

For tokenization with GPT-2's tokenizer (common to OpenAI models, LLaMA models, and several others; however, this may not be the tokenizer of your model, so update as needed), we use the HuggingFace Transformers library.
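A minimal sketch of loading that tokenizer through Transformers (the prompt text is illustrative):

```python
from transformers import AutoTokenizer

# Load the GPT-2 tokenizer from the HuggingFace hub.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Count the tokens a prompt would consume; the prompt is illustrative.
prompt = "What is 17 * 23?"
token_ids = tokenizer.encode(prompt)
print(len(token_ids), token_ids)
```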

In the future, finer tokenization control (including the option to opt out of running tokenizers entirely) will be implemented.

Executable Scripts

All scripts are written with command line argument parsing, so their specific execution details can be read via python3 <script.py> --help (a sketch of the argument-parsing pattern follows the list below).

  • lm_math.py performs the primary investigation using locally run language models
  • plots.py produces figures using the outputs generated by lm_math.py
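As a hedged sketch of the argparse pattern these scripts follow (the flags shown are assumptions for illustration, not the scripts' actual options; run --help for the authoritative list):

```python
import argparse

# Illustrative only: the flags below are assumptions, not the real
# options of lm_math.py or plots.py -- consult each script's --help
# output for its actual arguments.
parser = argparse.ArgumentParser(description="Probe an LM for memorization")
parser.add_argument("--model", default="llama3",
                    help="Ollama model name (assumed default)")
parser.add_argument("--n-trials", type=int, default=100,
                    help="Number of prompts to issue")
args = parser.parse_args()
print(args.model, args.n_trials)
```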

Supporting Scripts

  • special_operation.py defines an arbitrary function for lm_math.py to use as an un-memorizable input to the LM (a hypothetical sketch of such a function follows this list).
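A hypothetical sketch of what such a function could look like; this is not the repository's actual special_operation.py, only an illustration of a deterministic operation whose outputs a model is unlikely to have memorized:

```python
# Hypothetical stand-in for special_operation.py, NOT the repository's
# actual definition: an arbitrary, deterministic binary operation whose
# results are unlikely to appear verbatim in any training corpus.
def special_operation(a: int, b: int) -> int:
    return (a * 7 + b * 3) % 101

if __name__ == "__main__":
    # The model must compute, not recall, results like this one.
    print(special_operation(17, 23))  # 87
```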
