
Metadata-Version: 2.4
Name: tokenizers
Version: 0.22.2
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: huggingface-hub>=0.16.4,<2.0
Requires-Dist: pytest ; extra == 'testing'
Requires-Dist: pytest-asyncio ; extra == 'testing'
Requires-Dist: requests ; extra == 'testing'
Requires-Dist: numpy ; extra == 'testing'
Requires-Dist: datasets ; extra == 'testing'
Requires-Dist: ruff ; extra == 'testing'
Requires-Dist: ty ; extra == 'testing'
Requires-Dist: sphinx ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme ; extra == 'docs'
Requires-Dist: setuptools-rust ; extra == 'docs'
Requires-Dist: tokenizers[testing] ; extra == 'dev'
Provides-Extra: testing
Provides-Extra: docs
Provides-Extra: dev
Keywords: NLP,tokenizer,BPE,transformer,deep learning
Author-email: Nicolas Patry <patry.nicolas@protonmail.com>, Anthony Moi <anthony@huggingface.co>
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/huggingface/tokenizers
Project-URL: Source, https://github.com/huggingface/tokenizers

<p align="center">
    <br>
    <img src="https://huggingface.co/landing/assets/tokenizers/tokenizers-logo.png" width="600"/>
    <br>
</p>
<p align="center">
    <a href="https://badge.fury.io/py/tokenizers">
        <img alt="Build" src="https://badge.fury.io/py/tokenizers.svg">
    </a>
    <a href="https://github.com/huggingface/tokenizers/blob/master/LICENSE">
        <img alt="GitHub" src="https://img.shields.io/github/license/huggingface/tokenizers.svg?color=blue">
    </a>
</p>
<br>

# Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and
versatility.

These are Python bindings over the [Rust](https://github.com/huggingface/tokenizers/tree/master/tokenizers) implementation.
If you are interested in the high-level design, you can find it there.

Otherwise, let's dive in!

## Main features:

 - Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3
   most common BPE versions).
 - Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes
   less than 20 seconds to tokenize a GB of text on a server's CPU.
 - Easy to use, but also extremely versatile.
 - Designed for research and production.
 - Normalization comes with alignment tracking. It's always possible to get the part of the
   original sentence that corresponds to a given token.
 - Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
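
As a rough illustration of the last two points, here is a plain-Python sketch of offset tracking combined with truncation and padding. It is a toy, not the library's implementation: the `encode_with_offsets` helper, its regex splitting rule, and the `(0, 0)` placeholder span used for padding are all assumptions made for this example.

```python
import re


def encode_with_offsets(text, max_length, pad_token="[PAD]"):
    """Toy encoder: normalizes tokens but keeps spans into the original text."""
    tokens, offsets = [], []
    # Split into words and single punctuation marks (hypothetical rule).
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group().lower())          # normalization step
        offsets.append((match.start(), match.end()))  # span in the ORIGINAL text

    # Truncate to max_length, then pad up to max_length.
    tokens, offsets = tokens[:max_length], offsets[:max_length]
    missing = max_length - len(tokens)
    tokens += [pad_token] * missing
    offsets += [(0, 0)] * missing  # placeholder span for padding (assumption)
    return tokens, offsets


text = "I can feel the MAGIC, can you?"
tokens, offsets = encode_with_offsets(text, max_length=10)

# Recover the original text behind the fifth token.
start, end = offsets[4]
original = text[start:end]
print(tokens[4], "<-", original)
```

Because each token keeps its `(start, end)` span, the lowercased token `magic` still maps back to `MAGIC` in the input.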

### Installation

#### With pip:

```bash
pip install tokenizers
```

#### From sources:

To use this method, you need to have Rust installed:

```bash
# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"
```

Once Rust is installed, you can compile and install the bindings as follows:

```bash
git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
```
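
After either installation method, a quick way to confirm that the package is visible to the current interpreter is to query its installed metadata. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of `tokenizers`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("tokenizers")
print("tokenizers:", found if found else "not installed in this environment")
```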

### Load a pretrained tokenizer from the Hub

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
```

### Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of
these using some `vocab.json` and `merges.txt` files:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)
```

And you can train them just as simply:

```python
from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train(["./path/to/files/1.txt", "./path/to/files/2.txt"])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
```

#### Provided Tokenizers

 - `CharBPETokenizer`: The original BPE
 - `ByteLevelBPETokenizer`: The byte level version of the BPE
 - `SentencePieceBPETokenizer`: A BPE implementation compatible with the one used by SentencePiece
 - `BertWordPieceTokenizer`: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
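
To give an intuition for what the BPE-based tokenizers learn during training, here is a minimal plain-Python sketch of the classic merge loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It is not how the Rust implementation works internally; `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions chosen for this toy example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} mapping."""
    # Represent each word as a tuple of symbols, ending with an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair by one symbol.
        merged_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + freq
        vocab = merged_vocab
    return merges


merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
print(merges)
```

On this tiny corpus the first three merges build up the frequent suffix `est</w>` one pair at a time.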

### Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer
by putting all the different parts you need together.
You can check how we implemented the [provided tokenizers](https://github.com/huggingface/tokenizers/tree/master/bindings/python/py_src/tokenizers/implementations) and adapt them easily to your own needs.

#### Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces
together, and then saving it to a single file:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)
```
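
The byte-level setup above starts from an alphabet that covers every possible byte, which is why no input character can fall outside the vocabulary. The following plain-Python sketch shows only that underlying idea. It does not use the library, and it skips the step where a byte-level pre-tokenizer maps bytes to printable characters.

```python
# Sample input with non-ASCII characters (a sparkle emoji and an accented letter).
text = "magic \u2728 caf\u00e9"

# Every string becomes a sequence of UTF-8 bytes, each in the range 0..255.
byte_ids = list(text.encode("utf-8"))

# A base alphabet of 256 symbols is therefore enough to represent any input.
covered = all(0 <= b <= 255 for b in byte_ids)

# The mapping is lossless: decoding the bytes gives back the original text.
round_trip = bytes(byte_ids).decode("utf-8")

print(len(text), "characters ->", len(byte_ids), "bytes")
```

The trade-off is that non-ASCII characters expand into several base symbols, which the learned merges then compress back into longer tokens.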

Now, when you want to use this tokenizer, it is as simple as:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
```
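
If you do not have a trained file at hand, the same save/load round trip can be tried on a tiny in-memory corpus. This sketch assumes `tokenizers` is installed; the corpus, the vocabulary size, the `[UNK]` token, and the file name are made up for the example.

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# A toy corpus, only meant to make the example self-contained.
corpus = ["I can feel the magic, can you?", "Can you feel it?"]

# Train a small BPE tokenizer directly from the in-memory sentences.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=100, special_tokens=["[UNK]"], show_progress=False
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Save to a single JSON file, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.tokenizer.json")
    tokenizer.save(path)
    reloaded = Tokenizer.from_file(path)

# Both tokenizers should produce the same ids for the same input.
sentence = "I can feel the magic, can you?"
same_ids = tokenizer.encode(sentence).ids == reloaded.encode(sentence).ids
print(same_ids)
```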

### Typing support and `stub.py`

The compiled PyO3 extension does not expose type annotations, so editors and type checkers would otherwise see most objects as `Any`.
The `stub.py` helper walks the loaded extension modules, renders `.pyi` stub files (plus minimal forwarding `__init__.py` shims), and formats them so that tools like mypy/pyright can understand the public API.
Run `python stub.py` whenever you change the Python-visible surface to keep the generated stubs in sync.
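
To make the idea concrete, here is a much-simplified sketch of stub rendering with the standard `inspect` module. It is not the actual `stub.py`: `render_stub` and the `demo` module are hypothetical, and real stubs typically carry more detail than the bare signatures shown here.

```python
import inspect
import types


def render_stub(module):
    """Render a minimal `.pyi`-style text for the public names of `module`."""
    lines = []
    for name, obj in sorted(vars(module).items()):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        if inspect.isclass(obj):
            lines.append(f"class {name}: ...")
        elif callable(obj):
            try:
                signature = str(inspect.signature(obj))
            except (TypeError, ValueError):
                # Some compiled callables expose no signature; fall back to a generic one.
                signature = "(*args, **kwargs)"
            lines.append(f"def {name}{signature}: ...")
    return "\n".join(lines) + "\n"


# Build a stand-in module to play the role of a compiled extension.
demo = types.ModuleType("demo")


def encode(text, add_special_tokens=True):
    return []


class Encoding:
    pass


demo.encode = encode
demo.Encoding = Encoding

stub = render_stub(demo)
print(stub)
```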