- Metadata-Version: 2.4
- Name: tokenizers
- Version: 0.22.2
- Classifier: Development Status :: 5 - Production/Stable
- Classifier: Intended Audience :: Developers
- Classifier: Intended Audience :: Education
- Classifier: Intended Audience :: Science/Research
- Classifier: License :: OSI Approved :: Apache Software License
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.9
- Classifier: Programming Language :: Python :: 3.10
- Classifier: Programming Language :: Python :: 3.11
- Classifier: Programming Language :: Python :: 3.12
- Classifier: Programming Language :: Python :: 3.13
- Classifier: Programming Language :: Python :: 3 :: Only
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Dist: huggingface-hub>=0.16.4,<2.0
- Requires-Dist: pytest ; extra == 'testing'
- Requires-Dist: pytest-asyncio ; extra == 'testing'
- Requires-Dist: requests ; extra == 'testing'
- Requires-Dist: numpy ; extra == 'testing'
- Requires-Dist: datasets ; extra == 'testing'
- Requires-Dist: ruff ; extra == 'testing'
- Requires-Dist: ty ; extra == 'testing'
- Requires-Dist: sphinx ; extra == 'docs'
- Requires-Dist: sphinx-rtd-theme ; extra == 'docs'
- Requires-Dist: setuptools-rust ; extra == 'docs'
- Requires-Dist: tokenizers[testing] ; extra == 'dev'
- Provides-Extra: testing
- Provides-Extra: docs
- Provides-Extra: dev
- Keywords: NLP,tokenizer,BPE,transformer,deep learning
- Author-email: Nicolas Patry <patry.nicolas@protonmail.com>, Anthony Moi <anthony@huggingface.co>
- Requires-Python: >=3.9
- Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
- Project-URL: Homepage, https://github.com/huggingface/tokenizers
- Project-URL: Source, https://github.com/huggingface/tokenizers
- <p align="center">
- <br>
- <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
- <br>
- </p>
- <p align="center">
- <a href="https://badge.fury.io/py/tokenizers">
- <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
- </a>
- <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
- <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
- </a>
- </p>
- <br>
- # Tokenizers
- Provides an implementation of today's most used tokenizers, with a focus on performance and
- versatility.
- Bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
- If you are interested in the high-level design, you can find it there.
- Otherwise, let's dive in!
- ## Main features:
- Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3
- most common BPE versions).
- - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
- less than 20 seconds to tokenize a GB of text on a server's CPU.
- - Easy to use, but also extremely versatile.
- - Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the
- original sentence that corresponds to a given token.
- - Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
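The alignment tracking and pre-processing features listed above can be illustrated with a minimal sketch. It assumes a tiny, hypothetical in-memory `WordLevel` vocabulary (not one of the provided tokenizers) so that it runs without any files. The vocabulary, token ids, and lengths are made up for illustration.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers

# Hypothetical toy vocabulary, built in memory so that no files are needed
vocab = {"[UNK]": 0, "[PAD]": 1, "hello": 2, "world": 3}
tokenizer = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Alignment tracking: offsets index into the original, un-normalized text
text = "Hello World"
encoded = tokenizer.encode(text)
original_spans = [text[start:end] for start, end in encoded.offsets]
print(encoded.tokens)   # lowercased tokens
print(original_spans)   # matching slices of the original text

# Pre-processing: truncate and pad every encoding to exactly 3 tokens
tokenizer.enable_truncation(max_length=3)
tokenizer.enable_padding(pad_id=1, pad_token="[PAD]", length=3)
short = tokenizer.encode("hello")
long = tokenizer.encode("hello world hello world")
print(short.tokens, short.attention_mask)
print(long.tokens)
```

Even though the normalizer lowercases the input, each offset pair still points at the original casing, and the attention mask marks which positions are padding.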
- ### Installation
- #### With pip:
- ```bash
- pip install tokenizers
- ```
- #### From sources:
- To use this method, you need to have Rust installed:
- ```bash
- # Install with:
- curl https://sh.rustup.rs -sSf | sh -s -- -y
- export PATH="$HOME/.cargo/bin:$PATH"
- ```
- Once Rust is installed, you can compile and install the bindings with the following commands:
- ```bash
- git clone https://github.com/huggingface/tokenizers
- cd tokenizers/bindings/python
- # Create a virtual env (you can use yours as well)
- python -m venv .env
- source .env/bin/activate
- # Install `tokenizers` in the current virtual env
- pip install -e .
- ```
- ### Load a pretrained tokenizer from the Hub
- ```python
- from tokenizers import Tokenizer
- tokenizer = Tokenizer.from_pretrained("bert-base-cased")
- ```
- ### Using the provided Tokenizers
- We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
- these using some `vocab.json` and `merges.txt` files:
- ```python
- from tokenizers import CharBPETokenizer
- # Initialize a tokenizer
- vocab = "./path/to/vocab.json"
- merges = "./path/to/merges.txt"
- tokenizer = CharBPETokenizer(vocab, merges)
- # And then encode:
- encoded = tokenizer.encode("I can feel the magic, can you?")
- print(encoded.ids)
- print(encoded.tokens)
- ```
- And you can train them just as simply:
- ```python
- from tokenizers import CharBPETokenizer
- # Initialize a tokenizer
- tokenizer = CharBPETokenizer()
- # Then train it!
- tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])
- # Now, let's use it:
- encoded = tokenizer.encode("I can feel the magic, can you?")
- # And finally save it somewhere
- tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
- ```
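When the training data is not stored in files, a provided tokenizer can also be trained from an in-memory iterator. The following is a minimal sketch, assuming a made-up toy corpus and an arbitrarily small `vocab_size`. Real training would use far more text and a larger vocabulary.

```python
from tokenizers import CharBPETokenizer

# Made-up toy corpus, for illustration only
corpus = ["low lower lowest", "new newer newest", "low new"] * 10

# Initialize an untrained tokenizer and train it from the iterator
tokenizer = CharBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=100,       # arbitrary small value for this toy corpus
    min_frequency=2,
    show_progress=False,
)

# Encode, then decode back to text
encoded = tokenizer.encode("low lower")
decoded = tokenizer.decode(encoded.ids)
print(encoded.tokens)
print(decoded)
```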
- #### Provided Tokenizers
- - `CharBPETokenizer`: The original BPE
- - `ByteLevelBPETokenizer`: The byte level version of the BPE
- - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
- `BertWordPieceTokenizer`: The famous BERT tokenizer, using WordPiece
- All of these can be used and trained as explained above!
- ### Build your own
- Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
- by putting together all the different parts you need.
- You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.
- #### Building a byte-level BPE
- Here is an example showing how to build your own byte-level BPE by putting all the different pieces
- together, and then saving it to a single file:
- ```python
- from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors
- # Initialize a tokenizer
- tokenizer = Tokenizer(models.BPE())
- # Customize pre-tokenization and decoding
- tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
- tokenizer.decoder = decoders.ByteLevel()
- tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)
- # And then train
- trainer = trainers.BpeTrainer(
-     vocab_size=20000,
-     min_frequency=2,
-     initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
- )
- tokenizer.train([
-     "./path/to/dataset/1.txt",
-     "./path/to/dataset/2.txt",
-     "./path/to/dataset/3.txt"
- ], trainer=trainer)
- # And save it
- tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
- ```
- Now, when you want to use this tokenizer, it is as simple as:
- ```python
- from tokenizers import Tokenizer
- tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")
- encoded = tokenizer.encode("I can feel the magic, can you?")
- ```
- ### Typing support and `stub.py`
- The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as `Any`. The `stub.py` helper walks the loaded extension modules, renders `.pyi` stub files (plus minimal forwarding `__init__.py` shims), and formats them so that tools like mypy/pyright can understand the public API. Run `python stub.py` whenever you change the Python-visible surface to keep the generated stubs in sync.