METADATA 119 KB

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283848586878889909192939495969798991001011021031041051061071081091101111121131141151161171181191201211221231241251261271281291301311321331341351361371381391401411421431441451461471481491501511521531541551561571581591601611621631641651661671681691701711721731741751761771781791801811821831841851861871881891901911921931941951961971981992002012022032042052062072082092102112122132142152162172182192202212222232242252262272282292302312322332342352362372382392402412422432442452462472482492502512522532542552562572582592602612622632642652662672682692702712722732742752762772782792802812822832842852862872882892902912922932942952962972982993003013023033043053063073083093103113123133143153163173183193203213223233243253263273283293303313323333343353363373383393403413423433443453463473483493503513523533543553563573583593603613623633643653663673683693703713723733743753763773783793803813823833843853863873883893903913923933943953963973983994004014024034044054064074084094104114124134144154164174184194204214224234244254264274284294304314324334344354364374384394404414424434444454464474484494504514524534544554564574584594604614624634644654664674684694704714724734744754764774784794804814824834844854864874884894904914924934944954964974984995005015025035045055065075085095105115125135145155165175185195205215225235245255265275285295305315325335345355365375385395405415425435445455465475485495505515525535545555565575585595605615625635645655665675685695705715725735745755765775785795805815825835845855865875885895905915925935945955965975985996006016026036046056066076086096106116126136146156166176186196206216226236246256266276286296306316326336346356366376386396406416426436446456466476486496506516526536546556566576586596606616626636646656666676686696706716726736746756766776786796806816826836846856866876886896906916926936946956966976986997007017027037047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156115711581159116011611162116311641165116611671168116911701171117211731174117511761177117811791180118111821183118411851186118711881189119011911192119311941195119611971198119912001201120212031204120512061207120812091210121112121213121412151216121712181219122012211222122312241225122612271228122912301231123212331234123512361237123812391240124112421243124412451246124712481249125012511252125312541255125612571258125912601261126212631264126512661267126812691270127112721273127412751276127712781279128012811282128312841285128612871288128912901291129212931294129512961297129812991300130113021303130413051306130713081309131013111312131313141315131613171318131913201321132213231324132513261327132813291330133113321333133413351336133713381339134013411342134313441345134613471348134913501351135213531354135513561357135813591360136113621363136413651366136713681369137013711372137313741375137613771378137913801381138213831384138513861387138813891390139113921393139413951396139713981399140014011402140314041405140614071408140914101411141214131414141514161417141814191420142114221423142414251426142714281429143014311432143314341435143614371438143914401441144214431444144514461447144814491450145114521453145414551456145714581459146014611462146314641465146614671468146914701471147214731474147514761477147814791480148114821483148414851486148714881489149014911492149314941495149614971498149915001501150215031504150515061507150815091510151115121513151415151516151715181519152015211522152315241525152615271528152915301531153215331534153515361537153815391540154115421543154415451546154715481549155015511552155315541555155615571558155915601561156215631564156515661567156815691570157115721573157415751576157715781579158015811582158315841585158615871588158915901591159215931594159515961597159815991600160116021603160416051606160716081609161016111612161316141615161616171618161916201621162216231624162516261627162816291630163116321633163416351636163716381639164016411642164316441645164616471648164916501651165216531654165516561657165816591660166116621663166416651666166716681669167016711672167316741675167616771678167916801681168216831684168516861687168816891690169116921693169416951696169716981699170017011702170317041705170617071708170917101711171217131714171517161717171817191720172117221723172417251726172717281729173017311732173317341735173617371738173917401741174217431744174517461747174817491750175117521753175417551756175717581759176017611762176317641765176617671768176917701771177217731774177517761777177817791780178117821783178417851786178717881789179017911792179317941795179617971798179918001801180218031804180518061807180818091810181118121813181418151816181718181819182018211822182318241825182618271828182918301831183218331834183518361837183818391840184118421843184418451846184718481849185018511852185318541855185618571858185918601861186218631864186518661867186818691870187118721873187418751876187718781879188018811882188318841885188618871888188918901891189218931894189518961897189818991900190119021903190419051906190719081909191019111912191319141915191619171918191919201921192219231924192519261927192819291930193119321933193419351936193719381939194019411942194319441945194619471948194919501951195219531954195519561957195819591960196119621963196419651966196719681969197019711972197319741975197619771978197919801981198219831984198519861987198819891990199119921993199419951996199719981999200020012002200320042005200620072008200920102011201220132014201520162017201820192020202120222023202420252026202720282029203020312032203320342035203620372038203920402041204220432044204520462047204820492050205120522053205420552056205720582059206020612062206320642065206620672068206920702071207220732074207520762077207820792080208120822083208420852086208720882089209020912092209320942095209620972098209921002101210221032104210521062107210821092110211121122113211421152116211721182119212021212122212321242125212621272128212921302131213221332134213521362137213821392140214121422143214421452146214721482149215021512152215321542155215621572158215921602161216221632164216521662167216821692170217121722173217421752176217721782179218021812182218321842185218621872188218921902191219221932194219521962197219821992200220122022203220422052206220722082209221022112212221322142215221622172218221922202221222222232224222522262227222822292230223122322233223422352236223722382239224022412242224322442245224622472248224922502251225222532254225522562257225822592260226122622263226422652266226722682269227022712272227322742275227622772278227922802281228222832284228522862287228822892290229122922293229422952296229722982299230023012302230323042305230623072308230923102311231223132314231523162317231823192320232123222323232423252326232723282329233023312332233323342335233623372338233923402341234223432344234523462347234823492350235123522353235423552356235723582359236023612362236323642365236623672368236923702371237223732374237523762377237823792380238123822383238423852386238723882389239023912392239323942395239623972398239924002401240224032404240524062407240824092410241124122413241424152416241724182419242024212422242324242425242624272428242924302431243224332434243524362437243824392440244124422443244424452446244724482449245024512452245324542455245624572458245924602461246224632464246524662467246824692470247124722473247424752476247724782479248024812482248324842485248624872488248924902491249224932494249524962497249824992500250125022503250425052506250725082509251025112512251325142515251625172518251925202521252225232524252525262527252825292530253125322533253425352536253725382539254025412542254325442545254625472548254925502551255225532554255525562557255825592560256125622563256425652566256725682569257025712572257325742575257625772578257925802581258225832584258525862587258825892590259125922593259425952596259725982599260026012602260326042605260626072608260926102611261226132614261526162617261826192620262126222623262426252626262726282629263026312632263326342635263626372638263926402641264226432644264526462647264826492650265126522653265426552656265726582659266026612662266326642665266626672668266926702671267226732674267526762677267826792680268126822683268426852686268726882689269026912692269326942695269626972698
  1. Metadata-Version: 2.4
  2. Name: stringzilla
  3. Version: 4.6.0
  4. Summary: Search, hash, sort, and process strings faster via SWAR and SIMD
  5. Home-page: https://github.com/ashvardanian/StringZilla
  6. Author: Ash Vardanian
  7. Author-email: 1983160+ashvardanian@users.noreply.github.com
  8. License: Apache-2.0
  9. Classifier: Development Status :: 5 - Production/Stable
  10. Classifier: Natural Language :: English
  11. Classifier: Intended Audience :: Developers
  12. Classifier: Intended Audience :: Information Technology
  13. Classifier: Programming Language :: C++
  14. Classifier: Programming Language :: Python :: 3 :: Only
  15. Classifier: Programming Language :: Python :: 3.8
  16. Classifier: Programming Language :: Python :: 3.9
  17. Classifier: Programming Language :: Python :: 3.10
  18. Classifier: Programming Language :: Python :: 3.11
  19. Classifier: Programming Language :: Python :: 3.12
  20. Classifier: Programming Language :: Python :: 3.13
  21. Classifier: Programming Language :: Python :: Implementation :: CPython
  22. Classifier: Programming Language :: Python :: Implementation :: PyPy
  23. Classifier: Operating System :: OS Independent
  24. Classifier: Topic :: File Formats
  25. Classifier: Topic :: Internet :: Log Analysis
  26. Classifier: Topic :: Scientific/Engineering :: Information Analysis
  27. Classifier: Topic :: System :: Logging
  28. Classifier: Topic :: Text Processing :: General
  29. Classifier: Topic :: Text Processing :: Indexing
  30. Requires-Python: >=3.8
  31. Description-Content-Type: text/markdown
  32. License-File: LICENSE
  33. Dynamic: author
  34. Dynamic: author-email
  35. Dynamic: classifier
  36. Dynamic: description
  37. Dynamic: description-content-type
  38. Dynamic: home-page
  39. Dynamic: license
  40. Dynamic: license-file
  41. Dynamic: requires-python
  42. Dynamic: summary
  43. # StringZilla 🦖
  44. ![StringZilla banner](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/StringZilla-v4.jpg?raw=true)
  45. Strings are the first fundamental data type every programming language implements in software rather than hardware, so dedicated CPU instructions are rare - and the few that exist are hardly ideal.
  46. That's why most languages lean on the C standard library (libc) for their string operations, which, despite its name, ships its hottest code in hand-tuned assembly.
  47. It does exploit SIMD, but it isn't perfect.
  48. 1️⃣ Even on ubiquitous hardware - over a billion 64-bit ARM CPUs - routines such as `strstr` and `memmem` top out at roughly one-third of available throughput.
  49. 2️⃣ SIMD coverage is uneven: fast forward scans don't guarantee speedy reverse searches, hashing and case-mapping is not even part of the standard.
  50. 3️⃣ Many higher-level languages can't rely on libc at all because their strings aren't NUL-terminated - or may even contain embedded zeroes.
  51. That's why StringZilla exists: predictable, high performance on every modern platform, OS, and programming language.
  52. [![StringZilla Python installs](https://static.pepy.tech/personalized-badge/stringzilla?period=total&units=abbreviation&left_color=black&right_color=blue&left_text=StringZilla%20Python%20installs)](https://github.com/ashvardanian/stringzilla)
  53. [![StringZilla Rust installs](https://img.shields.io/crates/d/stringzilla?logo=rust&label=Rust%20installs)](https://crates.io/crates/stringzilla)
  54. ![StringZilla code size](https://img.shields.io/github/languages/code-size/ashvardanian/stringzilla)
  55. <!-- Those badges often stay in stale state - greyed out. Consider enabling them later.
  56. [![Ubuntu status](https://img.shields.io/github/checks-status/ashvardanian/StringZilla/main?checkName=Linux%20CI&label=Ubuntu)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
  57. [![Windows status](https://img.shields.io/github/checks-status/ashvardanian/StringZilla/main?checkName=Windows%20CI&label=Windows)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
  58. [![macOS status](https://img.shields.io/github/checks-status/ashvardanian/StringZilla/main?checkName=macOS%20CI&label=macOS)](https://github.com/ashvardanian/StringZilla/actions/workflows/release.yml)
  59. -->
  60. StringZilla is the GodZilla of string libraries, using [SIMD][faq-simd] and [SWAR][faq-swar] to accelerate binary and UTF-8 string operations on modern CPUs and GPUs.
  61. It delivers up to __10x higher CPU throughput in C, C++, Rust, Python__, and other languages, and can be __100x faster than existing GPU kernels__, covering a broad range of functionality.
  62. It __accelerates exact and fuzzy string matching, hashing, edit distance computations, sorting, provides allocation-free lazily-evaluated smart-iterators, and even random-string generators__.
  63. [faq-simd]: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
  64. [faq-swar]: https://en.wikipedia.org/wiki/SWAR
  65. - 🐂 __[C](#basic-usage-with-c-99-and-newer):__ Upgrade LibC's `<string.h>` to `<stringzilla/stringzilla.h>` in C 99
  66. - 🐉 __[C++](#basic-usage-with-c-11-and-newer):__ Upgrade STL's `<string>` to `<stringzilla/stringzilla.hpp>` in C++ 11
  67. - 🧮 __[CUDA](#cuda):__ Process in-bulk with `<stringzillas/stringzillas.cuh>` in CUDA C++ 17
  68. - 🐍 __[Python](#quick-start-python):__ Upgrade your `str` to faster `Str`
  69. - 🦀 __[Rust](#quick-start-rust):__ Use the `StringZilla` traits crate
  70. - 🦫 __[Go](#quick-start-golang):__ Use the `StringZilla` cGo module
  71. - 🍎 __[Swift](#quick-start-swift):__ Use the `String+StringZilla` extension
  72. - 🟨 __[JavaScript](#quick-start-javascript):__ Use the `StringZilla` library
  73. - 🐚 __[Shell][faq-shell]__: Accelerate common CLI tools with `sz-` prefix
  74. - 📚 Researcher? Jump to [Algorithms & Design Decisions](#algorithms--design-decisions)
  75. - 💡 Thinking to contribute? Look for ["good first issues"][first-issues]
  76. - 🤝 And check the [guide](https://github.com/ashvardanian/StringZilla/blob/main/CONTRIBUTING.md) to set up the environment
  77. - Want more bindings or features? Let [me](https://github.com/ashvardanian) know!
  78. [faq-shell]: https://github.com/ashvardanian/StringZilla-CLI
  79. [first-issues]: https://github.com/ashvardanian/StringZilla/issues
  80. __Who is this for?__
  81. - For data-engineers parsing large datasets, like the [CommonCrawl](https://commoncrawl.org/), [RedPajama](https://github.com/togethercomputer/RedPajama-Data), or [LAION](https://laion.ai/blog/laion-5b/).
  82. - For software engineers optimizing strings in their apps and services.
  83. - For bioinformaticians and search engineers looking for edit-distances for [USearch](https://github.com/unum-cloud/usearch).
  84. - For [DBMS][faq-dbms] devs, optimizing `LIKE`, `ORDER BY`, and `GROUP BY` operations.
  85. - For hardware designers, needing a SWAR baseline for string-processing functionality.
  86. - For students studying SIMD/SWAR applications to non-data-parallel operations.
  87. [faq-dbms]: https://en.wikipedia.org/wiki/Database
  88. ## Performance
  89. <table>
  90. <tr>
  91. <th align="center" width="25%">C</th>
  92. <th align="center" width="25%">C++</th>
  93. <th align="center" width="25%">Python</th>
  94. <th align="center" width="25%">StringZilla</th>
  95. </tr>
  96. <!-- Unicode case-folding -->
  97. <tr>
  98. <td colspan="4" align="center">Unicode case-folding, expanding characters like <code>ß</code> → <code>ss</code></td>
  99. </tr>
  100. <tr>
  101. <td align="center">⚪</td>
  102. <td align="center">⚪</td>
  103. <td align="center">
  104. <code>.casefold</code><br/>
  105. <span style="color:#ABABAB;">x86:</span> <b>0.4</b> GB/s
  106. </td>
  107. <td align="center">
  108. <code>sz.utf8_case_fold</code><br/>
  109. <span style="color:#ABABAB;">x86:</span> <b>1.3</b> GB/s
  110. </td>
  111. </tr>
  112. <!-- Unicode case-insensitive search -->
  113. <tr>
  114. <td colspan="4" align="center">Unicode case-insensitive substring search</td>
  115. </tr>
  116. <tr>
  117. <td align="center">⚪</td>
  118. <td align="center">⚪</td>
  119. <td align="center">
  120. <code>icu.StringSearch</code><br/>
  121. <span style="color:#ABABAB;">x86:</span> <b>0.02</b> GB/s
  122. </td>
  123. <td align="center">
  124. <code>utf8_case_insensitive_find</code><br/>
  125. <span style="color:#ABABAB;">x86:</span> <b>3.0</b> GB/s
  126. </td>
  127. </tr>
  128. <!-- Substrings, normal order -->
  129. <tr>
  130. <td colspan="4" align="center">find the first occurrence of a random word from text, ≅ 5 bytes long</td>
  131. </tr>
  132. <tr>
  133. <td align="center">
  134. <code>strstr</code> <sup>1</sup><br/>
  135. <span style="color:#ABABAB;">x86:</span> <b>7.4</b> &centerdot;
  136. <span style="color:#ABABAB;">arm:</span> <b>2.0</b> GB/s
  137. </td>
  138. <td align="center">
  139. <code>.find</code><br/>
  140. <span style="color:#ABABAB;">x86:</span> <b>2.9</b> &centerdot;
  141. <span style="color:#ABABAB;">arm:</span> <b>1.6</b> GB/s
  142. </td>
  143. <td align="center">
  144. <code>.find</code><br/>
  145. <span style="color:#ABABAB;">x86:</span> <b>1.1</b> &centerdot;
  146. <span style="color:#ABABAB;">arm:</span> <b>0.6</b> GB/s
  147. </td>
  148. <td align="center">
  149. <code>sz_find</code><br/>
  150. <span style="color:#ABABAB;">x86:</span> <b>10.6</b> &centerdot;
  151. <span style="color:#ABABAB;">arm:</span> <b>7.1</b> GB/s
  152. </td>
  153. </tr>
  154. <!-- Substrings, reverse order -->
  155. <tr>
  156. <td colspan="4" align="center">find the last occurrence of a random word from text, ≅ 5 bytes long</td>
  157. </tr>
  158. <tr>
  159. <td align="center">⚪</td>
  160. <td align="center">
  161. <code>.rfind</code><br/>
  162. <span style="color:#ABABAB;">x86:</span> <b>0.5</b> &centerdot;
  163. <span style="color:#ABABAB;">arm:</span> <b>0.4</b> GB/s
  164. </td>
  165. <td align="center">
  166. <code>.rfind</code><br/>
  167. <span style="color:#ABABAB;">x86:</span> <b>0.9</b> &centerdot;
  168. <span style="color:#ABABAB;">arm:</span> <b>0.5</b> GB/s
  169. </td>
  170. <td align="center">
  171. <code>sz_rfind</code><br/>
  172. <span style="color:#ABABAB;">x86:</span> <b>10.8</b> &centerdot;
  173. <span style="color:#ABABAB;">arm:</span> <b>6.7</b> GB/s
  174. </td>
  175. </tr>
  176. <!-- Characters, normal order -->
  177. <tr>
  178. <td colspan="4" align="center">split lines separated by <code>\n</code> or <code>\r</code> <sup>2</sup></td>
  179. </tr>
  180. <tr>
  181. <td align="center">
  182. <code>strcspn</code> <sup>1</sup><br/>
  183. <span style="color:#ABABAB;">x86:</span> <b>5.42</b> &centerdot;
  184. <span style="color:#ABABAB;">arm:</span> <b>2.19</b> GB/s
  185. </td>
  186. <td align="center">
  187. <code>.find_first_of</code><br/>
  188. <span style="color:#ABABAB;">x86:</span> <b>0.59</b> &centerdot;
  189. <span style="color:#ABABAB;">arm:</span> <b>0.46</b> GB/s
  190. </td>
  191. <td align="center">
  192. <code>re.finditer</code><br/>
  193. <span style="color:#ABABAB;">x86:</span> <b>0.06</b> &centerdot;
  194. <span style="color:#ABABAB;">arm:</span> <b>0.02</b> GB/s
  195. </td>
  196. <td align="center">
  197. <code>sz_find_byteset</code><br/>
  198. <span style="color:#ABABAB;">x86:</span> <b>4.08</b> &centerdot;
  199. <span style="color:#ABABAB;">arm:</span> <b>3.22</b> GB/s
  200. </td>
  201. </tr>
  202. <!-- Characters, reverse order -->
  203. <tr>
  204. <td colspan="4" align="center">find the last occurrence of any of 6 whitespaces <sup>2</sup></td>
  205. </tr>
  206. <tr>
  207. <td align="center">⚪</td>
  208. <td align="center">
  209. <code>.find_last_of</code><br/>
  210. <span style="color:#ABABAB;">x86:</span> <b>0.25</b> &centerdot;
  211. <span style="color:#ABABAB;">arm:</span> <b>0.25</b> GB/s
  212. </td>
  213. <td align="center">⚪</td>
  214. <td align="center">
  215. <code>sz_rfind_byteset</code><br/>
  216. <span style="color:#ABABAB;">x86:</span> <b>0.43</b> &centerdot;
  217. <span style="color:#ABABAB;">arm:</span> <b>0.23</b> GB/s
  218. </td>
  219. </tr>
  220. <!-- Random Generation -->
  221. <tr>
  222. <td colspan="4" align="center">Random string from a given alphabet, 20 bytes long <sup>3</sup></td>
  223. </tr>
  224. <tr>
  225. <td align="center">
  226. <code>rand() % n</code><br/>
  227. <span style="color:#ABABAB;">x86:</span> <b>18.0</b> &centerdot;
  228. <span style="color:#ABABAB;">arm:</span> <b>9.4</b> MB/s
  229. </td>
  230. <td align="center">
  231. <code>uniform_int_distribution</code><br/>
  232. <span style="color:#ABABAB;">x86:</span> <b>47.2</b> &centerdot;
  233. <span style="color:#ABABAB;">arm:</span> <b>20.4</b> MB/s
  234. </td>
  235. <td align="center">
  236. <code>join(random.choices(x))</code><br/>
  237. <span style="color:#ABABAB;">x86:</span> <b>13.3</b> &centerdot;
  238. <span style="color:#ABABAB;">arm:</span> <b>5.9</b> MB/s
  239. </td>
  240. <td align="center">
  241. <code>sz_fill_random</code><br/>
  242. <span style="color:#ABABAB;">x86:</span> <b>56.2</b> &centerdot;
  243. <span style="color:#ABABAB;">arm:</span> <b>25.8</b> MB/s
  244. </td>
  245. </tr>
  246. <!-- Mapping characters with lookup table transforms -->
  247. <tr>
  248. <td colspan="4" align="center">Mapping characters with lookup table transforms</td>
  249. </tr>
  250. <tr>
  251. <td align="center">⚪</td>
  252. <td align="center">
  253. <code>std::transform</code><br/>
  254. <span style="color:#ABABAB;">x86:</span> <b>3.81</b> &centerdot;
  255. <span style="color:#ABABAB;">arm:</span> <b>2.65</b> GB/s
  256. </td>
  257. <td align="center">
  258. <code>str.translate</code><br/>
  259. <span style="color:#ABABAB;">x86:</span> <b>260.0</b> &centerdot;
  260. <span style="color:#ABABAB;">arm:</span> <b>140.0</b> MB/s
  261. </td>
  262. <td align="center">
  263. <code>sz_lookup</code><br/>
  264. <span style="color:#ABABAB;">x86:</span> <b>21.2</b> &centerdot;
  265. <span style="color:#ABABAB;">arm:</span> <b>8.5</b> GB/s
  266. </td>
  267. </tr>
  268. <!-- Sorting -->
  269. <tr>
  270. <td colspan="4" align="center">Get sorted order, ≅ 8 million English words <sup>4</sup></td>
  271. </tr>
  272. <tr>
  273. <td align="center">
  274. <code>qsort_r</code><br/>
  275. <span style="color:#ABABAB;">x86:</span> <b>3.55</b> &centerdot;
  276. <span style="color:#ABABAB;">arm:</span> <b>5.77</b> s
  277. </td>
  278. <td align="center">
  279. <code>std::sort</code><br/>
  280. <span style="color:#ABABAB;">x86:</span> <b>2.79</b> &centerdot;
  281. <span style="color:#ABABAB;">arm:</span> <b>4.02</b> s
  282. </td>
  283. <td align="center">
  284. <code>numpy.argsort</code><br/>
  285. <span style="color:#ABABAB;">x86:</span> <b>7.58</b> &centerdot;
  286. <span style="color:#ABABAB;">arm:</span> <b>13.00</b> s
  287. </td>
  288. <td align="center">
  289. <code>sz_sequence_argsort</code><br/>
  290. <span style="color:#ABABAB;">x86:</span> <b>1.91</b> &centerdot;
  291. <span style="color:#ABABAB;">arm:</span> <b>2.37</b> s
  292. </td>
  293. </tr>
  294. <!-- Edit Distance -->
  295. <tr>
  296. <td colspan="4" align="center">Levenshtein edit distance, text lines ≅ 100 bytes long</td>
  297. </tr>
  298. <tr>
  299. <td align="center">⚪</td>
  300. <td align="center">⚪</td>
  301. <td align="center">
  302. via <code>NLTK</code> <sup>5</sup> and <code>CuDF</code><br/>
  303. <span style="color:#ABABAB;">x86:</span> <b>1,615,306</b> &centerdot;
  304. <span style="color:#ABABAB;">arm:</span> <b>1,349,980</b> &centerdot;
  305. <span style="color:#ABABAB;">cuda:</span> <b>6,532,411,354</b> CUPS
  306. </td>
  307. <td align="center">
  308. <code>szs_levenshtein_distances_t</code><br/>
  309. <span style="color:#ABABAB;">x86:</span> <b>3,434,427,548</b> &centerdot;
  310. <span style="color:#ABABAB;">arm:</span> <b>1,605,340,403</b> &centerdot;
  311. <span style="color:#ABABAB;">cuda:</span> <b>93,662,026,653</b> CUPS
  312. </td>
  313. </tr>
  314. <!-- Alignment Score -->
  315. <tr>
  316. <td colspan="4" align="center">Needleman-Wunsch alignment scores, proteins ≅ 1 K amino acids long</td>
  317. </tr>
  318. <tr>
  319. <td align="center">⚪</td>
  320. <td align="center">⚪</td>
  321. <td align="center">
  322. via <code>biopython</code> <sup>6</sup><br/>
  323. <span style="color:#ABABAB;">x86:</span> <b>575,981,513</b> &centerdot;
  324. <span style="color:#ABABAB;">arm:</span> <b>436,350,732</b> CUPS
  325. </td>
  326. <td align="center">
  327. <code>szs_needleman_wunsch_scores_t</code><br/>
  328. <span style="color:#ABABAB;">x86:</span> <b>452,629,942</b> &centerdot;
  329. <span style="color:#ABABAB;">arm:</span> <b>520,170,239</b> &centerdot;
  330. <span style="color:#ABABAB;">cuda:</span> <b>9,017,327,818</b> CUPS
  331. </td>
  332. </tr>
  333. </table>
  334. Most StringZilla modules ship ready-to-run benchmarks for C, C++, Python, and more.
  335. Grab them from `./scripts`, and see [`CONTRIBUTING.md`](CONTRIBUTING.md) for instructions.
  336. On CPUs that permit misaligned loads, even the 64-bit SWAR baseline outruns both libc and the STL.
  337. For wider head-to-heads against Rust and Python favorites, browse the __[StringWars][stringwars]__ repository.
  338. To inspect collision resistance and distribution shapes for our hashers, see __[HashEvals][hashevals]__.
  339. [stringwars]: https://github.com/ashvardanian/StringWars
  340. [hashevals]: https://github.com/ashvardanian/HashEvals
  341. > Most benchmarks were conducted on a 1 GB English text corpus, with an average word length of 6 characters.
  342. > The code was compiled with GCC 12, using `glibc` v2.35.
  343. > The benchmarks were performed on Arm-based Graviton3 AWS `c7g` instances and `r7iz` Intel Sapphire Rapids.
  344. > Most modern Arm-based 64-bit CPUs will have similar relative speedups.
  345. > Variance within x86 CPUs will be larger.
  346. > For CUDA benchmarks, the Nvidia H100 GPUs were used.
  347. > <sup>1</sup> Unlike other libraries, LibC requires strings to be NULL-terminated.
  348. > <sup>2</sup> Six whitespaces in the ASCII set are: ` \t\n\v\f\r`. Python's and other standard libraries have specialized functions for those.
  349. > <sup>3</sup> All modulo operations were conducted with `uint8_t` to allow compilers more optimization opportunities.
  350. > The C++ STL and StringZilla benchmarks used a 64-bit [Mersenne Twister][faq-mersenne-twister] as the generator.
  351. > For C, C++, and StringZilla, an in-place update of the string was used.
  352. > In Python every string had to be allocated as a new object, which makes it less fair.
  353. > <sup>4</sup> Contrary to the popular opinion, Python's default `sorted` function works faster than the C and C++ standard libraries.
  354. > That holds for large lists or tuples of strings, but fails as soon as you need more complex logic, like sorting dictionaries by a string key, or producing the "sorted order" permutation.
  355. > The latter is very common in database engines and is most similar to `numpy.argsort`.
  356. > The current StringZilla solution can be at least 4x faster without loss of generality.
  357. > <sup>5</sup> Most Python libraries for strings are also implemented in C.
  358. > <sup>6</sup> Unlike the rest of BioPython, the alignment score computation is [implemented in C](https://github.com/biopython/biopython/blob/master/Bio/Align/_pairwisealigner.c).
  359. [faq-mersenne-twister]: https://en.wikipedia.org/wiki/Mersenne_Twister
  360. ## Functionality
  361. StringZilla is compatible with most modern CPUs, and provides a broad range of functionality.
  362. It's split into 2 layers:
  363. 1. StringZilla: single-header C library and C++ wrapper for high-performance string operations.
  364. 2. StringZillas: parallel CPU/GPU backends used for large-batch operations and accelerators.
  365. Having a second C++/CUDA layer greatly simplifies the implementation of similarity scoring and fingerprinting functions, which would otherwise require too much error-prone boilerplate code in pure C.
  366. Both layers are designed to be extremely portable:
  367. - [x] across both little-endian and big-endian architectures.
  368. - [x] across 32-bit and 64-bit hardware architectures.
  369. - [x] across operating systems and compilers.
  370. - [x] across ASCII and UTF-8 encoded inputs.
  371. Not all features are available across all bindings.
  372. Consider contributing if you need a feature that's not yet implemented.
  373. | | Maturity | C | C++ | Python | Rust | JS | Swift | Go |
  374. | :----------------------------- | :------: | :---: | :---: | :----: | :---: | :---: | :---: | :---: |
  375. | Substring Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
  376. | Character Set Search | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
  377. | Sorting & Sequence Operations | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
  378. | Lazy Ranges, Compressed Arrays | 🌳 | ❌ | ✅ | ✅ | ✅ | ❌ | ⚪ | ⚪ |
  379. | One-Shot & Streaming Hashes | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
  380. | Cryptographic Hashes | 🌳 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
  381. | Small String Class | 🧐 | ✅ | ✅ | ❌ | ⚪ | ❌ | ❌ | ❌ |
  382. | Random String Generation | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
  383. | | | | | | | | | |
  384. | Unicode Case Folding | 🧐 | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ | ⚪ |
  385. | Case-Insensitive UTF-8 Search | 🚧 | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ | ⚪ |
  386. | TR29 Word Boundary Detection | 🚧 | ✅ | ✅ | ⚪ | ⚪ | ⚪ | ⚪ | ⚪ |
  387. | | | | | | | | | |
  388. | Parallel Similarity Scoring | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
  389. | Parallel Rolling Fingerprints | 🌳 | ✅ | ✅ | ✅ | ✅ | ⚪ | ⚪ | ⚪ |
  390. > 🌳 parts are used in production.
  391. > 🧐 parts are in beta.
  392. > 🚧 parts are under active development, and are likely to break in subsequent releases.
  393. > ✅ are implemented.
  394. > ⚪ are considered.
  395. > ❌ are not intended.
  396. ## Quick Start: Python
  397. Python bindings are available on PyPI for Python 3.8+, and can be installed with `pip`.
  398. ```bash
  399. pip install stringzilla # for serial algorithms
  400. pip install stringzillas-cpus # for parallel multi-CPU backends
  401. pip install stringzillas-cuda # for parallel Nvidia GPU backend
  402. ```
  403. You can immediately check the installed version and the used hardware capabilities with following commands:
  404. ```bash
  405. python -c "import stringzilla; print(stringzilla.__version__)"
  406. python -c "import stringzillas; print(stringzillas.__version__)"
  407. python -c "import stringzilla; print(stringzilla.__capabilities__)" # for serial algorithms
  408. python -c "import stringzillas; print(stringzillas.__capabilities__)" # for parallel algorithms
  409. ```
  410. ### Basic Usage
  411. If you've ever used the Python `str`, `bytes`, `bytearray`, or `memoryview` classes, you'll know what to expect.
  412. StringZilla's `Str` class is a hybrid of the above, providing a `str`-like interface to byte arrays.
  413. ```python
  414. from stringzilla import Str, File
  415. text_from_str = Str('some-string') # no copies, just a view
  416. text_from_bytes = Str(b'some-array') # no copies, just a view
  417. text_from_file = Str(File('some-file.txt')) # memory-mapped file
  418. import numpy as np
  419. alphabet_array = np.arange(ord("a"), ord("z"), dtype=np.uint8)
  420. text_from_array = Str(memoryview(alphabet_array))
  421. ```
  422. The `File` class memory-maps a file from persistent storage without loading its copy into RAM.
  423. The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
  424. A standard dataset pre-processing use case would be to map a sizable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
  425. ### Basic Operations
  426. - Length: `len(text) -> int`
  427. - Indexing: `text[42] -> str`
  428. - Slicing: `text[42:46] -> Str`
  429. - Substring check: `'substring' in text -> bool`
  430. - Hashing: `hash(text) -> int`
  431. - String conversion: `str(text) -> str`
  432. ### Advanced Operations
  433. ```py
  434. import sys
  435. x: bool = text.contains('substring', start=0, end=sys.maxsize)
  436. x: int = text.find('substring', start=0, end=sys.maxsize)
  437. x: int = text.count('substring', start=0, end=sys.maxsize, allowoverlap=False)
  438. x: str = text.decode(encoding='utf-8', errors='strict')
  439. x: Strs = text.split(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
  440. x: Strs = text.rsplit(separator=' ', maxsplit=sys.maxsize, keepseparator=False)
  441. x: Strs = text.splitlines(keeplinebreaks=False, maxsplit=sys.maxsize)
  442. ```
  443. It's important to note that the last function's behavior is slightly different from Python's `str.splitlines`.
  444. The [native version][faq-splitlines] matches `\n`, `\r`, `\v` or `\x0b`, `\f` or `\x0c`, `\x1c`, `\x1d`, `\x1e`, `\x85`, `\r\n`, `\u2028`, `\u2029`, including 3x two-byte-long runes.
  445. The StringZilla version matches only `\n`, `\v`, `\f`, `\r`, `\x1c`, `\x1d`, `\x1e`, `\x85`, avoiding two-byte-long runes.
  446. [faq-splitlines]: https://docs.python.org/3/library/stdtypes.html#str.splitlines
  447. ### Character Set Operations
  448. Python strings don't natively support character set operations.
  449. This forces people to use regular expressions, which are slow and hard to read.
  450. To avoid the need for `re.finditer`, StringZilla provides the following interfaces:
  451. ```py
  452. x: int = text.find_first_of('chars', start=0, end=sys.maxsize)
  453. x: int = text.find_last_of('chars', start=0, end=sys.maxsize)
  454. x: int = text.find_first_not_of('chars', start=0, end=sys.maxsize)
  455. x: int = text.find_last_not_of('chars', start=0, end=sys.maxsize)
  456. x: Strs = text.split_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
  457. x: Strs = text.rsplit_byteset(separator='chars', maxsplit=sys.maxsize, keepseparator=False)
  458. ```
  459. StringZilla also provides string trimming functions and random string generation:
  460. ```py
  461. x: str = text.lstrip('chars') # Strip leading characters
  462. x: str = text.rstrip('chars') # Strip trailing characters
  463. x: str = text.strip('chars') # Strip both ends
  464. x: bytes = sz.random(length=100, seed=42, alphabet='ACGT') # Random string generation
  465. sz.fill_random(buffer, seed=42, alphabet=None) # Fill mutable buffer with random bytes
  466. ```
  467. You can also transform the string using Look-Up Tables (LUTs), mapping it to a different character set.
  468. This would result in a copy - `str` for `str` inputs and `bytes` for other types.
  469. ```py
  470. x: str = text.translate('chars', {}, start=0, end=sys.maxsize, inplace=False)
  471. x: bytes = text.translate(b'chars', {}, start=0, end=sys.maxsize, inplace=False)
  472. ```
  473. For efficiency reasons, pass the LUT as a string or bytes object, not as a dictionary.
  474. This can be useful in high-throughput applications dealing with binary data, including bioinformatics and image processing.
  475. Here is an example:
  476. ```py
  477. import stringzilla as sz
  478. look_up_table = bytes(range(256)) # Identity LUT
  479. image = open("/image/path.jpeg", "rb").read()
  480. sz.translate(image, look_up_table, inplace=True)
  481. ```
  482. ### Hash
  483. Single-shot and incremental hashing are both supported:
  484. ```py
  485. import stringzilla as sz
  486. # One-shot - stable 64-bit output across all platforms!
  487. one = sz.hash(b"Hello, world!", seed=42)
  488. # Incremental updates return itself; digest does not consume state
  489. hasher = sz.Hasher(seed=42)
  490. hasher.update(b"Hello, ").update(b"world!")
  491. streamed = hasher.digest() # or `hexdigest()` for a string
  492. assert one == streamed
  493. ```
  494. ### SHA-256 Checksums
  495. SHA-256 cryptographic checksums are also available for single-shot and incremental hashing:
  496. ```py
  497. import stringzilla as sz
  498. # One-shot SHA-256
  499. digest_bytes = sz.sha256(b"Hello, world!")
  500. assert len(digest_bytes) == 32
  501. # Incremental SHA-256
  502. hasher = sz.Sha256()
  503. hasher.update(b"Hello, ").update(b"world!")
  504. digest_bytes = hasher.digest()
  505. digest_hex = hasher.hexdigest() # 64 character lowercase hex string
  506. # HMAC-SHA256 for message authentication
  507. mac = sz.hmac_sha256(key=b"secret", message=b"Hello, world!")
  508. ```
  509. StringZilla integrates seamlessly with memory-mapped files for efficient large file processing.
  510. The traditional approach with `hashlib`:
  511. ```python
  512. import hashlib
  513. with open("xlsum.csv", "rb") as streamed_file:
  514. hasher = hashlib.sha256()
  515. while chunk := streamed_file.read(4096):
  516. hasher.update(chunk)
  517. checksum = hasher.hexdigest()
  518. ```
  519. Can be simplified with StringZilla:
  520. ```python
  521. from stringzilla import Sha256, File
  522. mapped_file = File("xlsum.csv")
  523. checksum = Sha256().update(mapped_file).hexdigest()
  524. ```
  525. Both output the same digest: `7278165ce01a4ac1e8806c97f32feae908036ca3d910f5177d2cf375e20aeae1`.
  526. OpenSSL (powering `hashlib`) has faster Assembly kernels, but StringZilla avoids file I/O overhead with memory mapping and skips Python's abstraction layers:
  527. - OpenSSL-backed `hashlib.sha256`: 12.6s
  528. - StringZilla end-to-end: 4.0s — __3× faster!__
  529. ### Unicode Case-Folding and Case-Insensitive Search
  530. StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
  531. Unlike most libraries only capable of lower-casing ASCII-represented English alphabet, StringZilla covers over 1M+ codepoints.
  532. The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate for the worst-case character expansions scenarios.
  533. ```python
  534. import stringzilla as sz
  535. sz.utf8_case_fold('HELLO') # b'hello'
  536. sz.utf8_case_fold('Straße') # b'strasse' — ß (1 char) expands to "ss" (2 chars)
  537. sz.utf8_case_fold('efficient') # b'efficient' — ffi ligature (1 char) expands to "ffi" (3 chars)
  538. ```
  539. The case-insensitive search returns the byte offset of the match, handling expansions correctly.
  540. ```python
  541. import stringzilla as sz
  542. sz.utf8_case_insensitive_find('Der große Hund', 'GROSSE') # 4 — finds "große" at codepoint 4
  543. sz.utf8_case_insensitive_find('Straße', 'STRASSE') # 0 — ß matches "SS"
  544. sz.utf8_case_insensitive_find('efficient', 'EFFICIENT') # 0 — ffi ligature matches "FFI"
  545. # Iterator for finding ALL matches
  546. haystack = 'Straße STRASSE strasse'
  547. for match in sz.utf8_case_insensitive_find_iter(haystack, 'strasse'):
  548. print(match, match.offset_within(haystack)) # Yields: 'Straße', 'STRASSE', 'strasse'
  549. # With overlapping matches
  550. list(sz.utf8_case_insensitive_find_iter('aaaa', 'aa')) # ['aa', 'aa'] — 2 non-overlapping
  551. list(sz.utf8_case_insensitive_find_iter('aaaa', 'aa', include_overlapping=True)) # 3 matches
  552. ```
  553. ### Collection-Level Operations
  554. Once split into a `Strs` object, you can sort, shuffle, and reorganize the slices with minimal memory footprint.
  555. If all the chunks are located in consecutive memory regions, the memory overhead can be as low as 4 bytes per chunk.
  556. ```python
  557. lines: Strs = text.split(separator='\n') # 4 bytes per line overhead for under 4 GB of text
  558. batch: Strs = lines.sample(seed=42) # 10x faster than `random.choices`
  559. lines_shuffled: Strs = lines.shuffled(seed=42) # or shuffle all lines and shard with slices
  560. lines_sorted: Strs = lines.sorted() # returns a new Strs in sorted order
  561. order: tuple = lines.argsort() # similar to `numpy.argsort`
  562. ```
  563. Working on [RedPajama][redpajama], addressing 20 billion annotated English documents, one will need only 160 GB of RAM instead of terabytes.
  564. Once loaded, the data will be memory-mapped, and can be reused between multiple Python processes without copies.
  565. And of course, you can use slices to navigate the dataset and shard it between multiple workers.
  566. ```python
  567. lines[::3] # every third line
  568. lines[1::1] # every odd line
  569. lines[:-100:-1] # last 100 lines in reverse order
  570. ```
  571. [redpajama]: https://github.com/togethercomputer/RedPajama-Data
  572. ### Iterators and Memory Efficiency
  573. Python's operations like `split()` and `readlines()` immediately materialize a `list` of copied parts.
  574. This can be very memory-inefficient for large datasets.
  575. StringZilla saves a lot of memory by viewing existing memory regions as substrings, but even more memory can be saved by using lazily evaluated iterators.
  576. ```py
  577. x: SplitIterator[Str] = text.split_iter(separator=' ', keepseparator=False)
  578. x: SplitIterator[Str] = text.rsplit_iter(separator=' ', keepseparator=False)
  579. x: SplitIterator[Str] = text.split_byteset_iter(separator='chars', keepseparator=False)
  580. x: SplitIterator[Str] = text.rsplit_byteset_iter(separator='chars', keepseparator=False)
  581. ```
  582. StringZilla can easily be 10x more memory efficient than native Python classes for tokenization.
  583. With lazy operations, it practically becomes free.
  584. ```py
  585. import stringzilla as sz
  586. %load_ext memory_profiler
  587. text = open("enwik9.txt", "r").read() # 1 GB, mean word length 7.73 bytes
  588. %memit text.split() # increment: 8670.12 MiB (152 ms)
  589. %memit sz.split(text) # increment: 530.75 MiB (25 ms)
  590. %memit sum(1 for _ in sz.split_iter(text)) # increment: 0.00 MiB
  591. ```
  592. ### Low-Level Python API
  593. Aside from calling the methods on the `Str` and `Strs` classes, you can also call the global functions directly on `str` and `bytes` instances.
  594. Assuming StringZilla CPython bindings are implemented [without any intermediate tools like SWIG or PyBind](https://ashvardanian.com/posts/pybind11-cpython-tutorial/), the call latency should be similar to native classes.
  595. ```py
  596. import stringzilla as sz
  597. contains: bool = sz.contains("haystack", "needle", start=0, end=sys.maxsize)
  598. offset: int = sz.find("haystack", "needle", start=0, end=sys.maxsize)
  599. count: int = sz.count("haystack", "needle", start=0, end=sys.maxsize, allowoverlap=False)
  600. ```
  601. ### Similarity Scores
  602. StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas` module.
  603. Use `DeviceScope` to pick hardware and optionally limit capabilities per engine.
  604. ```py
  605. import stringzilla as sz
  606. import stringzillas as szs
  607. cpu_scope = szs.DeviceScope(cpu_cores=4) # force CPU-only
  608. gpu_scope = szs.DeviceScope(gpu_device=0) # pick GPU 0 if available
  609. strings_a = sz.Strs(["kitten", "flaw"])
  610. strings_b = sz.Strs(["sitting", "lawn"])
  611. strings_a = szs.to_device(strings_a) # optional ahead of time transfer
  612. strings_b = szs.to_device(strings_b) # optional ahead of time transfer
  613. engine = szs.LevenshteinDistances(
  614. match=0, mismatch=2, # costs don't have to be 1
  615. open=3, extend=1, # may be different in Bio
  616. capabilities=("serial",) # avoid SIMD 🤭
  617. )
  618. distances = engine(strings_a, strings_b, device=cpu_scope)
  619. assert int(distances[0]) == 3 and int(distances[1]) == 2
  620. ```
  621. Note, that this computes byte-level distances.
  622. For UTF-8 codepoints, use a different engine class:
  623. ```py
  624. strings_a = sz.Strs(["café", "αβγδ"])
  625. strings_b = sz.Strs(["cafe", "αγδ"])
  626. engine = szs.LevenshteinDistancesUTF8(capabilities=("serial",))
  627. distances = engine(strings_a, strings_b, device=cpu_scope)
  628. assert int(distances[0]) == 1 and int(distances[1]) == 1
  629. ```
  630. For alignment scoring provide a 256×256 substitution matrix using NumPy:
  631. ```py
  632. import numpy as np
  633. import stringzilla as sz
  634. import stringzillas as szs
  635. substitution_matrix = np.zeros((256, 256), dtype=np.int8)
  636. substitution_matrix.fill(-1) # mismatch score
  637. np.fill_diagonal(substitution_matrix, 0) # match score
  638. engine = szs.NeedlemanWunsch(substitution_matrix=substitution_matrix, open=1, extend=1)
  639. scores = engine(strings_a, strings_b, device=cpu_scope)
  640. ```
  641. Several Python libraries provide edit distance computation.
  642. Most are implemented in C but may be slower than StringZilla on large inputs.
  643. For proteins ~10k chars, 100 pairs:
  644. - [JellyFish](https://github.com/jamesturk/jellyfish): 62.3s
  645. - [EditDistance](https://github.com/roy-ht/editdistance): 32.9s
  646. - StringZilla: __0.8s__
  647. Using the same proteins for Needleman-Wunsch alignment scores:
  648. - [BioPython](https://github.com/biopython/biopython): 25.8s
  649. - StringZilla: __7.8s__
  650. <details>
  651. <summary><b>§ Example converting from BioPython to StringZilla.</b></summary>
  652. ```py
  653. import numpy as np
  654. from Bio import Align
  655. from Bio.Align import substitution_matrices
  656. aligner = Align.PairwiseAligner()
  657. aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
  658. aligner.open_gap_score = 1
  659. aligner.extend_gap_score = 1
  660. # Convert the matrix to NumPy
  661. subs_packed = np.array(aligner.substitution_matrix).astype(np.int8)
  662. subs_reconstructed = np.zeros((256, 256), dtype=np.int8)
  663. # Initialize all banned characters to a the largest possible penalty
  664. subs_reconstructed.fill(127)
  665. for packed_row, packed_row_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
  666. for packed_column, packed_column_aminoacid in enumerate(aligner.substitution_matrix.alphabet):
  667. reconstructed_row = ord(packed_row_aminoacid)
  668. reconstructed_column = ord(packed_column_aminoacid)
  669. subs_reconstructed[reconstructed_row, reconstructed_column] = subs_packed[packed_row, packed_column]
  670. # Let's pick two examples of tripeptides (made of 3 amino acids)
  671. glutathione = "ECG" # Need to rebuild human tissue?
  672. thyrotropin_releasing_hormone = "QHP" # Or to regulate your metabolism?
  673. import stringzillas as szs
  674. engine = szs.NeedlemanWunsch(substitution_matrix=subs_reconstructed, open=1, extend=1)
  675. score = int(engine(sz.Strs([glutathione]), sz.Strs([thyrotropin_releasing_hormone]))[0])
  676. assert score == aligner.score(glutathione, thyrotropin_releasing_hormone) # Equal to 6
  677. ```
  678. </details>
  679. ### Rolling Fingerprints
  680. MinHashing is a common technique for Information Retrieval, producing compact representations of large documents.
  681. For $D$ hash-functions and a text of length $L$, in the worst case it involves computing $O(D \cdot L)$ hashes.
  682. ```py
  683. import numpy as np
  684. import stringzilla as sz
  685. import stringzillas as szs
  686. texts = sz.Strs([
  687. "quick brown fox jumps over the lazy dog",
  688. "quick brown fox jumped over a very lazy dog",
  689. ])
  690. cpu = szs.DeviceScope(cpu_cores=4)
  691. ndim = 1024
  692. window_widths = np.array([4, 6, 8, 10], dtype=np.uint64)
  693. engine = szs.Fingerprints(
  694. ndim=ndim,
  695. window_widths=window_widths, # optional
  696. alphabet_size=256, # default for byte strings
  697. capabilities=("serial",), # defaults to all, can also pass a `DeviceScope`
  698. )
  699. hashes, counts = engine(texts, device=cpu)
  700. assert hashes.shape == (len(texts), ndim)
  701. assert counts.shape == (len(texts), ndim)
  702. assert hashes.dtype == np.uint32 and counts.dtype == np.uint32
  703. ```
  704. ### Serialization
  705. #### Filesystem
  706. Similar to how `File` can be used to read a large file, other interfaces can be used to dump strings to disk faster.
  707. The `Str` class has `write_to` to write the string to a file, and `offset_within` to obtain integer offsets of substring view in larger string for navigation.
  708. ```py
  709. web_archive = Str("<html>...</html><html>...</html>")
  710. _, end_tag, next_doc = web_archive.partition("</html>") # or use `find`
  711. next_doc_offset = next_doc.offset_within(web_archive)
  712. web_archive.write_to("next_doc.html") # no GIL, no copies, just a view
  713. ```
  714. #### PyArrow
  715. A `Str` is easy to cast to [PyArrow](https://arrow.apache.org/docs/python/arrays.html#string-and-binary-types) buffers.
  716. ```py
  717. from pyarrow import foreign_buffer
  718. from stringzilla import Strs
  719. strs = Strs(["alpha", "beta", "gamma"])
  720. arrow = foreign_buffer(strs.address, strs.nbytes, strs)
  721. ```
  722. And only slightly harder to convert in reverse direction:
  723. ```py
  724. arr = pa.Array.from_buffers(
  725. pa.large_string() if strs.offsets_are_large else pa.string(),
  726. len(strs),
  727. [None,
  728. pa.foreign_buffer(strs.offsets_address, strs.offsets_nbytes, strs),
  729. pa.foreign_buffer(strs.tape_address, strs.tape_nbytes, strs)],
  730. )
  731. ```
  732. That means you can convert `Str` to `pyarrow.Buffer` and `Strs` to `pyarrow.Array` without extra copies.
  733. For more details on the tape-like layouts, refer to the [StringTape](https://github.com/ashvardanian/StringTape) repository.
  734. ## Quick Start: C/C++
  735. The C library is header-only, so you can just copy the `stringzilla.h` header into your project.
  736. Same applies to C++, where you would copy the `stringzilla.hpp` header.
  737. Alternatively, add it as a submodule, and include it in your build system.
  738. ```sh
  739. git submodule add https://github.com/ashvardanian/StringZilla.git external/stringzilla
  740. git submodule update --init --recursive
  741. ```
  742. Or using a pure CMake approach:
  743. ```cmake
  744. FetchContent_Declare(
  745. stringzilla
  746. GIT_REPOSITORY https://github.com/ashvardanian/StringZilla.git
  747. GIT_TAG main # or specify a version tag
  748. )
  749. FetchContent_MakeAvailable(stringzilla)
  750. ```
  751. Last, but not the least, you can also install it as a library, and link against it.
  752. This approach is worse for inlining, but brings [dynamic runtime dispatch](#dynamic-dispatch) for the most advanced CPU features.
  753. ### Basic Usage with C 99 and Newer
  754. There is a stable C 99 interface, where all function names are prefixed with `sz_`.
  755. Most interfaces are well documented, and come with self-explanatory names and examples.
  756. In some cases, hardware specific overloads are available, like `sz_find_skylake` or `sz_find_neon`.
  757. Both are companions of the `sz_find`, first for x86 CPUs with AVX-512 support, and second for Arm NEON-capable CPUs.
  758. ```c
  759. #include <stringzilla/stringzilla.h>
  760. // Initialize your haystack and needle
  761. sz_string_view_t haystack = {your_text, your_text_length};
  762. sz_string_view_t needle = {your_subtext, your_subtext_length};
  763. // Perform string-level operations auto-picking the backend or dispatching manually
  764. sz_cptr_t ptr = sz_find(haystack.start, haystack.length, needle.start, needle.length);
  765. sz_size_t substring_position = ptr ? (sz_size_t)(ptr - haystack.start) : SZ_SIZE_MAX; // SZ_SIZE_MAX if not found
  766. // Backend-specific variants return pointers as well
  767. sz_cptr_t ptr = sz_find_skylake(haystack.start, haystack.length, needle.start, needle.length);
  768. sz_cptr_t ptr = sz_find_haswell(haystack.start, haystack.length, needle.start, needle.length);
  769. sz_cptr_t ptr = sz_find_westmere(haystack.start, haystack.length, needle.start, needle.length);
  770. sz_cptr_t ptr = sz_find_neon(haystack.start, haystack.length, needle.start, needle.length);
  771. // Hash strings at once
  772. sz_u64_t hash = sz_hash(haystack.start, haystack.length, 42); // 42 is the seed
  773. sz_u64_t checksum = sz_bytesum(haystack.start, haystack.length); // or accumulate byte values
  774. // Hash strings incrementally with "init", "update", and "digest":
  775. sz_hash_state_t state;
  776. sz_hash_state_init(&state, 42);
  777. sz_hash_state_update(&state, haystack.start, 1); // first char
  778. sz_hash_state_update(&state, haystack.start + 1, haystack.length - 1); // rest of the string
  779. sz_u64_t streamed_hash = sz_hash_state_digest(&state);
  780. // SHA-256 cryptographic checksums
  781. sz_u8_t digest[32];
  782. sz_sha256_state_t sha_state;
  783. sz_sha256_state_init(&sha_state);
  784. sz_sha256_state_update(&sha_state, haystack.start, haystack.length);
  785. sz_sha256_state_digest(&sha_state, digest);
  786. // Perform collection level operations
  787. sz_sequence_t array = {your_handle, your_count, your_get_start, your_get_length};
  788. sz_sorted_idx_t order[your_count];
  789. sz_sequence_argsort(&array, NULL, order); // NULL allocator uses default
  790. ```
  791. <details>
  792. <summary><b>§ Mapping from LibC to StringZilla.</b></summary>
  793. By design, StringZilla has a couple of notable differences from LibC:
  794. 1. all strings are expected to have a length, and are not necessarily null-terminated.
  795. 2. every operations has a reverse order counterpart.
  796. That way `sz_find` and `sz_rfind` are similar to `strstr` and `strrstr` in LibC.
  797. Similarly, `sz_find_byte` and `sz_rfind_byte` replace `memchr` and `memrchr`.
  798. The `sz_find_byteset` maps to `strspn` and `strcspn`, while `sz_rfind_byteset` has no sibling in LibC.
  799. <table>
  800. <tr>
  801. <th>LibC Functionality</th>
  802. <th>StringZilla Equivalents</th>
  803. </tr>
  804. <tr>
  805. <td><code>memchr(haystack, needle, haystack_length)</code>, <code>strchr</code></td>
  806. <td><code>sz_find_byte(haystack, haystack_length, needle)</code></td>
  807. </tr>
  808. <tr>
  809. <td><code>memrchr(haystack, needle, haystack_length)</code></td>
  810. <td><code>sz_rfind_byte(haystack, haystack_length, needle)</code></td>
  811. </tr>
  812. <tr>
  813. <td><code>memcmp</code>, <code>strcmp</code></td>
  814. <td><code>sz_order</code>, <code>sz_equal</code></td>
  815. </tr>
  816. <tr>
  817. <td><code>strlen(haystack)</code></td>
  818. <td><code>sz_find_byte(haystack, haystack_length, needle)</code></td>
  819. </tr>
  820. <tr>
  821. <td><code>strcspn(haystack, reject)</code></td>
  822. <td><code>sz_find_byteset(haystack, haystack_length, reject_bitset)</code></td>
  823. </tr>
  824. <tr>
  825. <td><code>strspn(haystack, accept)</code></td>
  826. <td><code>sz_find_byte_not_from(haystack, haystack_length, accept, accept_length)</code></td>
  827. </tr>
  828. <tr>
  829. <td><code>memmem(haystack, haystack_length, needle, needle_length)</code>, <code>strstr</code></td>
  830. <td><code>sz_find(haystack, haystack_length, needle, needle_length)</code></td>
  831. </tr>
  832. <tr>
  833. <td><code>memcpy(destination, source, destination_length)</code></td>
  834. <td><code>sz_copy(destination, source, destination_length)</code></td>
  835. </tr>
  836. <tr>
  837. <td><code>memmove(destination, source, destination_length)</code></td>
  838. <td><code>sz_move(destination, source, destination_length)</code></td>
  839. </tr>
  840. <tr>
  841. <td><code>memset(destination, value, destination_length)</code></td>
  842. <td><code>sz_fill(destination, destination_length, value)</code></td>
  843. </tr>
  844. </table>
  845. </details>
  846. ### Basic Usage with C++ 11 and Newer
  847. There is a stable C++ 11 interface available in the `ashvardanian::stringzilla` namespace.
  848. It comes with two STL-like classes: `string_view` and `string`.
  849. The first is a non-owning view of a string, and the second is a mutable string with a [Small String Optimization][faq-sso].
  850. ```cpp
  851. #include <stringzilla/stringzilla.hpp>
  852. namespace sz = ashvardanian::stringzilla;
  853. sz::string haystack = "some string";
  854. sz::string_view needle = sz::string_view(haystack).substr(0, 4);
  855. auto substring_position = haystack.find(needle); // Or `rfind`
  856. auto hash = std::hash<sz::string_view>{}(haystack); // Compatible with STL's `std::hash`
  857. haystack.end() - haystack.begin() == haystack.size(); // Or `rbegin`, `rend`
  858. haystack.find_first_of(" \v\t") == 4; // Or `find_last_of`, `find_first_not_of`, `find_last_not_of`
  859. haystack.starts_with(needle) == true; // Or `ends_with`
  860. haystack.remove_prefix(needle.size()); // Why is this operation in-place?!
  861. haystack.contains(needle) == true; // STL has this only from C++ 23 onwards
  862. haystack.compare(needle) == 1; // Or `haystack <=> needle` in C++ 20 and beyond
  863. ```
  864. StringZilla also provides string literals for automatic type resolution, [similar to STL][stl-literal]:
  865. ```cpp
  866. using sz::literals::operator""_sv;
  867. using std::literals::operator""sv;
  868. auto a = "some string"; // char const *
  869. auto b = "some string"sv; // std::string_view
  870. auto b = "some string"_sv; // sz::string_view
  871. ```
  872. [stl-literal]: https://en.cppreference.com/w/cpp/string/basic_string_view/operator%22%22sv
  873. ### Unicode Case-Folding and Case-Insensitive Search
  874. StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
  875. Unlike most libraries only capable of lower-casing ASCII-represented English alphabet, StringZilla covers over 1M+ codepoints.
  876. The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate for the worst-case character expansions scenarios.
  877. ```c
  878. char source[] = "Straße"; // German: "Street"
  879. char destination[64]; // Must be at least 3x source length
  880. sz_size_t result_len = sz_utf8_case_fold(source, strlen(source), destination);
  881. // destination now contains "strasse" (7 bytes), result_len = 7
  882. ```
  883. The case-insensitive search API returns a pointer to the start of the first relevant glyph in the haystack, or `NULL` if not found.
  884. It outputs the length of the matched haystack substring in bytes, and accepts a metadata structure to speed up repeated searches for the same needle.
  885. ```c
  886. sz_utf8_case_insensitive_needle_metadata_t metadata = {};
  887. sz_size_t match_length;
  888. sz_cptr_t match = sz_utf8_case_insensitive_find(
  889. haystack, haystack_len,
  890. needle, needle_len,
  891. &metadata, // Reuse for queries with the same needle
  892. &match_length // Output: bytes consumed in haystack
  893. );
  894. ```
  895. Same functionality is available in C++:
  896. ```cpp
  897. namespace sz = ashvardanian::stringzilla;
  898. sz::string_view text = "Hello World"; // Single search
  899. auto [offset, length] = text.utf8_case_insensitive_find("HELLO");
  900. sz::utf8_case_insensitive_needle pattern("hello"); // Repeated searches with pre-compiled pattern
  901. for (auto const& haystack : haystacks)
  902. auto match = haystack.utf8_case_insensitive_find(pattern);
  903. ```
  904. ### Similarity Scores
  905. StringZilla exposes high-performance, batch-oriented similarity via the `stringzillas/stringzillas.h` header.
  906. Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
  907. ```cpp
  908. #include <stringzillas/stringzillas.h>
  909. szs_device_scope_t device = NULL;
  910. szs_device_scope_init_default(&device);
  911. szs_levenshtein_distances_t engine = NULL;
  912. szs_levenshtein_distances_init(0, 1, 1, 1, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
  913. sz_sequence_u32tape_t strings_a {data_a, offsets_a, count}; // or `sz_sequence_u64tape_t` for large inputs
  914. sz_sequence_u32tape_t strings_b {data_b, offsets_b, count}; // or `sz_sequence_t` to pass generic containers
  915. sz_size_t distances[count];
  916. szs_levenshtein_distances_u32tape(engine, device, &strings_a, &strings_b, distances, sizeof(distances[0]));
  917. szs_levenshtein_distances_free(engine);
  918. szs_device_scope_free(device);
  919. ```
  920. To target a different device, use the appropriate `szs_device_scope_init_{cpu_cores,gpu_device}` function.
  921. When dealing with GPU backends, make sure to use the "unified memory" allocators exposed as `szs_unified_{alloc,free}`.
  922. Similar stable C ABIs are exposed for other workloads as well.
  923. - UTF-8: `szs_levenshtein_distances_utf8_{sequence,u32tape,u64tape}`
  924. - Needleman-Wunsch: `szs_needleman_wunsch_scores_{sequence,u32tape,u64tape}`
  925. - Smith-Waterman: `szs_smith_waterman_scores_{sequence,u32tape,u64tape}`
  926. Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
  927. For that include `stringzillas/similarities.hpp` for C++ and `stringzillas/similarities.cuh` for CUDA.
  928. ```cpp
  929. #include <stringzillas/similarities.hpp>
  930. #include <stringzilla/types.hpp> // tape of strings
  931. #include <fork_union.hpp> // optional thread pool
  932. namespace sz = ashvardanian::stringzilla;
  933. namespace szs = ashvardanian::stringzillas;
  934. // Pack strings into an Arrow-like tape
  935. std::vector<std::string> left = {"kitten", "flaw"};
  936. std::vector<std::string> right = {"sitting", "lawn"};
  937. sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape_a, tape_b;
  938. auto _ = tape_a.try_assign(left.begin(), left.end());
  939. auto _ = tape_b.try_assign(right.begin(), right.end());
  940. // Run on the current thread
  941. using levenshtein_t = szs::levenshtein_distances<char, szs::linear_gap_costs_t, std::allocator<char>, sz_cap_serial_k>;
  942. levenshtein_t engine {szs::uniform_substitution_costs_t{0,1}, szs::linear_gap_costs_t{1}};
  943. std::size_t distances[2];
  944. auto _ = engine(tape_a, tape_b, distances);
  945. // Or run in parallel with a pool
  946. fork_union::basic_pool_t pool;
  947. auto _ = pool.try_spawn(std::thread::hardware_concurrency());
  948. auto _ = engine(tape_a, tape_b, distances, pool);
  949. ```
  950. All of the potentially failing StringZillas' interfaces return error codes, and none raise C++ exceptions.
  951. Parallelism is enabled at both collection-level and within individual pairs of large inputs.
  952. ### Rolling Fingerprints
  953. StringZilla exposes parallel fingerprinting (Min-Hashes or Count-Min-Sketches) via the `stringzillas/stringzillas.h` header.
  954. Use `szs_device_scope_t` to pick hardware and optionally limit capabilities per engine.
  955. ```c
  956. #include <stringzillas/stringzillas.h>
  957. szs_device_scope_t device = NULL;
  958. szs_device_scope_init_default(&device);
  959. szs_fingerprints_t engine = NULL;
  960. sz_size_t const dims = 1024; sz_size_t const window_widths[] = {4, 6, 8, 10};
  961. szs_fingerprints_init(dims, /*alphabet*/ 256, window_widths, 4, /*alloc*/ NULL, /*caps*/ sz_cap_serial_k, &engine);
  962. sz_sequence_u32tape_t texts = {data, offsets, count};
  963. sz_u32_t *min_hashes = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_hashes));
  964. sz_u32_t *min_counts = (sz_u32_t*)szs_unified_alloc(count * dims * sizeof(*min_counts));
  965. szs_fingerprints_u32tape(engine, device, &texts,
  966. min_hashes, dims * sizeof(*min_hashes), // support strided matrices
  967. min_counts, dims * sizeof(*min_counts)); // for both output arguments
  968. szs_fingerprints_free(engine);
  969. szs_device_scope_free(device);
  970. ```
  971. Moreover, in C++ codebases one can tap into the raw templates implementing that functionality, customizing them with custom executors, SIMD plugins, etc.
  972. For that include `stringzillas/fingerprints.hpp` for C++ and `stringzillas/fingerprints.cuh` for CUDA.
  973. ```cpp
  974. #include <stringzillas/fingerprints.hpp>
  975. #include <stringzilla/types.hpp> // tape of strings
  976. #include <fork_union.hpp> // optional thread pool
  977. namespace sz = ashvardanian::stringzilla;
  978. namespace szs = ashvardanian::stringzillas;
  979. // Pack strings into an Arrow-like tape
  980. std::vector<std::string> docs = {"alpha beta", "alpha betta"};
  981. sz::arrow_strings_tape<char, sz::size_t, std::allocator<char>> tape;
  982. auto _ = tape.try_assign(docs.begin(), docs.end());
  983. // Run on the current thread with a Rabin-Karp family hasher
  984. constexpr std::size_t dimensions_k = 256;
  985. constexpr std::size_t window_width_k = 7;
  986. using row_t = std::array<sz_u32_t, 256>;
  987. using fingerprinter_t = szs::floating_rolling_hashers<sz_cap_serial_k, dimensions_k>;
  988. fingerprinter_t engine;
  989. auto _ = engine.try_extend(window_width_k, dimensions_k);
  990. std::vector<row_t> hashes(docs.size()), counts(docs.size());
  991. auto _ = engine(tape, hashes, counts);
  992. // Or run in parallel with a pool
  993. fork_union::basic_pool_t pool;
  994. auto _ = pool.try_spawn(std::thread::hardware_concurrency());
  995. auto _ = engine(tape, hashes, counts, pool);
  996. ```
  997. ### CUDA
  998. StringZilla provides CUDA C++ templates for composable string batch-processing operations.
  999. Different GPUs have varying warp sizes, shared memory capacities, and register counts, affecting algorithm selection, so it's important to query the `gpu_specs_t` via `gpu_specs_fetch`.
  1000. For memory management, ensure that you use GPU-visible' unified memory` exposed in an STL-compatible manner as a `unified_alloc` template class.
  1001. For error handling, `cuda_status_t` extends the traditional `status_t` with GPU-specific information.
  1002. It's implicitly convertible to `status_t`, so you can use it in places expecting a `status_t`.
  1003. Most algorithms can load-balance both a large number of small strings and a small number of large strings.
  1004. Still, with large H100-scale GPUs, it's best to submit thousands of inputs at once.
  1005. ### Memory Ownership and Small String Optimization
  1006. Most operations in StringZilla don't assume any memory ownership.
  1007. But in addition to the read-only search-like operations StringZilla provides a minimalistic C and C++ implementations for a memory owning string "class".
  1008. Like other efficient string implementations, it uses the [Small String Optimization][faq-sso] (SSO) to avoid heap allocations for short strings.
  1009. [faq-sso]: https://cpp-optimizations.netlify.app/small_strings/
  1010. ```c
  1011. typedef union sz_string_t {
  1012. struct internal {
  1013. sz_ptr_t start;
  1014. sz_u8_t length;
  1015. char chars[SZ_STRING_INTERNAL_SPACE]; /// Ends with a null-terminator.
  1016. } internal;
  1017. struct external {
  1018. sz_ptr_t start;
  1019. sz_size_t length;
  1020. sz_size_t space; /// The length of the heap-allocated buffer.
  1021. sz_size_t padding;
  1022. } external;
  1023. } sz_string_t;
  1024. ```
  1025. As one can see, a short string can be kept on the stack, if it fits within `internal.chars` array.
  1026. Before 2015 GCC string implementation was just 8 bytes, and could only fit 7 characters.
  1027. Different STL implementations today have different thresholds for the Small String Optimization.
  1028. Similar to GCC, StringZilla is 32 bytes in size, and similar to Clang it can fit 22 characters on stack.
  1029. Our layout might be preferential, if you want to avoid branches.
  1030. If you use a different compiler, you may want to check its SSO buffer size with a [simple Gist](https://gist.github.com/ashvardanian/c197f15732d9855c4e070797adf17b21).
  1031. | | `libstdc++` in GCC 13 | `libc++` in Clang 17 | StringZilla |
  1032. | :-------------- | ---------------------: | -------------------: | ----------: |
  1033. | String `sizeof` | 32 | 24 | 32 |
  1034. | Inner Capacity | 15 | __22__ | __22__ |
  1035. This design has been since ported to many high-level programming languages.
  1036. Swift, for example, [can store 15 bytes](https://developer.apple.com/documentation/swift/substring/withutf8(_:)#discussion) in the `String` instance itself.
  1037. StringZilla implements SSO at the C level, providing the `sz_string_t` union and a simple API for primary operations.
  1038. ```c
  1039. sz_memory_allocator_t allocator;
  1040. sz_string_t string;
  1041. // Init and make sure we are on stack
  1042. sz_string_init(&string);
  1043. sz_string_is_on_stack(&string); // == sz_true_k
  1044. // Optionally pre-allocate space on the heap for future insertions.
  1045. sz_string_grow(&string, 100, &allocator); // == sz_true_k
  1046. // Append, erase, insert into the string.
  1047. sz_string_expand(&string, 0, "_Hello_", 7, &allocator); // == sz_true_k
  1048. sz_string_expand(&string, SZ_SIZE_MAX, "world", 5, &allocator); // == sz_true_k
  1049. sz_string_erase(&string, 0, 1);
  1050. // Unpacking & introspection.
  1051. sz_ptr_t string_start;
  1052. sz_size_t string_length;
  1053. sz_size_t string_space;
  1054. sz_bool_t string_is_external;
  1055. sz_string_unpack(string, &string_start, &string_length, &string_space, &string_is_external);
  1056. sz_equal(string_start, "Hello_world", 11); // == sz_true_k
  1057. // Reclaim some memory.
  1058. sz_string_shrink_to_fit(&string, &allocator); // == sz_true_k
  1059. sz_string_free(&string, &allocator);
  1060. ```
  1061. Unlike the conventional C strings, the `sz_string_t` is allowed to contain null characters.
  1062. To safely print those, pass the `string_length` to `printf` as well.
  1063. ```c
  1064. printf("%.*s\n", (int)string_length, string_start);
  1065. ```
  1066. ### What's Wrong with the C Standard Library?
  1067. StringZilla is not a drop-in replacement for the C Standard Library.
  1068. It's designed to be a safer and more modern alternative.
  1069. Conceptually:
  1070. 1. LibC strings are expected to be null-terminated, so to use the efficient LibC implementations on slices of larger strings, you'd have to copy them, which is more expensive than the original string operation.
  1071. 2. LibC functionality is asymmetric - you can find the first and the last occurrence of a character within a string, but you can't find the last occurrence of a substring.
  1072. 3. LibC function names are typically very short and cryptic.
  1073. 4. LibC lacks crucial functionality like hashing and doesn't provide primitives for less critical but relevant operations like fuzzy matching.
  1074. Something has to be said about its support for UTF-8.
  1075. Aside from a single-byte `char` type, LibC provides `wchar_t`:
  1076. - The size of `wchar_t` is not consistent across platforms. On Windows, it's typically 16 bits (suitable for UTF-16), while on Unix-like systems, it's usually 32 bits (suitable for UTF-32). This inconsistency can lead to portability issues when writing cross-platform code.
  1077. - `wchar_t` is designed to represent wide characters in a fixed-width format (UTF-16 or UTF-32). In contrast, UTF-8 is a variable-length encoding, where each character can take from 1 to 4 bytes. This fundamental difference means that `wchar_t` and UTF-8 are incompatible.
  1078. StringZilla [partially addresses those issues](#unicode-utf-8-and-wide-characters).
  1079. ### What's Wrong with the C++ Standard Library?
  1080. | C++ Code | Evaluation Result | Invoked Signature |
  1081. | :----------------------------------- | :---------------- | :----------------------------- |
  1082. | `"Loose"s.replace(2, 2, "vath"s, 1)` | `"Loathe"` 🤢 | `(pos1, count1, str2, pos2)` |
  1083. | `"Loose"s.replace(2, 2, "vath", 1)` | `"Love"` 🥰 | `(pos1, count1, str2, count2)` |
  1084. StringZilla is designed to be a drop-in replacement for the C++ Standard Templates Library.
  1085. That said, some of the design decisions of STL strings are highly controversial, error-prone, and expensive.
  1086. Most notably:
  1087. 1. Argument order for `replace`, `insert`, `erase` and similar functions is impossible to guess.
  1088. 2. Bounds-checking exceptions for `substr`-like functions are only thrown for one side of the range.
  1089. 3. Returning string copies in `substr`-like functions results in absurd volume of allocations.
  1090. 4. Incremental construction via `push_back`-like functions goes through too many branches.
  1091. 5. Inconsistency between `string` and `string_view` methods, like the lack of `remove_prefix` and `remove_suffix`.
  1092. Check the following set of asserts validating the `std::string` specification.
  1093. It's not realistic to expect the average developer to remember the [14 overloads of `std::string::replace`][stl-replace].
  1094. [stl-replace]: https://en.cppreference.com/w/cpp/string/basic_string/replace
  1095. ```cpp
  1096. using str = std::string;
  1097. assert(str("hello world").substr(6) == "world");
  1098. assert(str("hello world").substr(6, 100) == "world"); // 106 is beyond the length of the string, but its OK
  1099. assert_throws(str("hello world").substr(100), std::out_of_range); // 100 is beyond the length of the string
  1100. assert_throws(str("hello world").substr(20, 5), std::out_of_range); // 20 is beyond the length of the string
  1101. assert_throws(str("hello world").substr(-1, 5), std::out_of_range); // -1 casts to unsigned without any warnings...
  1102. assert(str("hello world").substr(0, -1) == "hello world"); // -1 casts to unsigned without any warnings...
  1103. assert(str("hello").replace(1, 2, "123") == "h123lo");
  1104. assert(str("hello").replace(1, 2, str("123"), 1) == "h23lo");
  1105. assert(str("hello").replace(1, 2, "123", 1) == "h1lo");
  1106. assert(str("hello").replace(1, 2, "123", 1, 1) == "h2lo");
  1107. assert(str("hello").replace(1, 2, str("123"), 1, 1) == "h2lo");
  1108. assert(str("hello").replace(1, 2, 3, 'a') == "haaalo");
  1109. assert(str("hello").replace(1, 2, {'a', 'b'}) == "hablo");
  1110. ```
  1111. To avoid those issues, StringZilla provides an alternative consistent interface.
  1112. It supports signed arguments, and doesn't have more than 3 arguments per function or
  1113. The standard API and our alternative can be conditionally disabled with `SZ_SAFETY_OVER_COMPATIBILITY=1`.
  1114. When it's enabled, the _~~subjectively~~_ risky overloads from the Standard will be disabled.
  1115. ```cpp
  1116. using str = sz::string;
  1117. str("a:b").front(1) == "a"; // no checks, unlike `substr`
  1118. str("a:b").front(2) == "2"; // take first 2 characters
  1119. str("a:b").back(-1) == "b"; // accepting negative indices
  1120. str("a:b").back(-2) == ":b"; // similar to Python's `"a:b"[-2:]`
  1121. str("a:b").sub(1, -1) == ":"; // similar to Python's `"a:b"[1:-1]`
  1122. str("a:b").sub(-2, -1) == ":"; // similar to Python's `"a:b"[-2:-1]`
  1123. str("a:b").sub(-2, 1) == ""; // similar to Python's `"a:b"[-2:1]`
  1124. "a:b"_sv[{-2, -1}] == ":"; // works on views and overloads `operator[]`
  1125. ```
  1126. Assuming StringZilla is a header-only library you can use the full API in some translation units and gradually transition to safer restricted API in others.
  1127. Bonus - all the bound checking is branchless, so it has a constant cost and won't hurt your branch predictor.
  1128. ### Beyond the C++ Standard Library - Learning from Python
  1129. Python is arguably the most popular programming language for data science.
  1130. In part, that's due to the simplicity of its standard interfaces.
  1131. StringZilla brings some of that functionality to C++.
  1132. - Content checks: `isalnum`, `isalpha`, `isascii`, `isdigit`, `islower`, `isspace`, `isupper`.
  1133. - Trimming character sets: `lstrip`, `rstrip`, `strip`.
  1134. - Trimming string matches: `remove_prefix`, `remove_suffix`.
  1135. - Ranges of search results: `splitlines`, `split`, `rsplit`.
  1136. - Number of non-overlapping substring matches: `count`.
  1137. - Partitioning: `partition`, `rpartition`.
  1138. For example, when parsing documents, it is often useful to split it into substrings.
  1139. Most often, after that, you would compute the length of the skipped part, the offset and the length of the remaining part.
  1140. This results in a lot of pointer arithmetic and is error-prone.
  1141. StringZilla provides a convenient `partition` function, which returns a tuple of three string views, making the code cleaner.
  1142. ```cpp
  1143. auto parts = haystack.partition(':'); // Matching a character
  1144. auto [before, match, after] = haystack.partition(':'); // Structure unpacking
  1145. auto [before, match, after] = haystack.partition(sz::byteset(":;")); // Character-set argument
  1146. auto [before, match, after] = haystack.partition(" : "); // String argument
  1147. auto [before, match, after] = haystack.rpartition(sz::whitespaces_set()); // Split around the last whitespace
  1148. ```
  1149. Combining those with the `split` function, one can easily parse a CSV file or HTTP headers.
  1150. ```cpp
  1151. for (auto line : haystack.split("\r\n")) {
  1152. auto [key, _, value] = line.partition(':');
  1153. headers[key.strip()] = value.strip();
  1154. }
  1155. ```
  1156. Some other extensions are not present in the Python standard library either.
  1157. Let's go through the C++ functionality category by category.
  1158. - [Splits and Ranges](#splits-and-ranges).
  1159. - [Concatenating Strings without Allocations](#concatenating-strings-without-allocations).
  1160. - [Random Generation](#random-generation).
  1161. - [Edit Distances and Fuzzy Search](#levenshtein-edit-distance-and-alignment-scores).
  1162. Some of the StringZilla interfaces are not available even Python's native `str` class.
  1163. Here is a sneak peek of the most useful ones.
  1164. ```cpp
  1165. text.hash(); // -> 64 bit unsigned integer
  1166. text.ssize(); // -> 64 bit signed length to avoid `static_cast<std::ssize_t>(text.size())`
  1167. text.contains_only(" \w\t"); // == text.find_first_not_of(sz::byteset(" \w\t")) == npos;
  1168. text.contains(sz::whitespaces_set()); // == text.find(sz::byteset(sz::whitespaces_set())) != npos;
  1169. // Simpler slicing than `substr`
  1170. text.front(10); // -> sz::string_view
  1171. text.back(10); // -> sz::string_view
  1172. // Safe variants, which clamp the range into the string bounds
  1173. using sz::string::cap;
  1174. text.front(10, cap) == text.front(std::min(10, text.size()));
  1175. text.back(10, cap) == text.back(std::min(10, text.size()));
  1176. // Character set filtering
  1177. text.lstrip(sz::whitespaces_set()).rstrip(sz::newlines_set()); // like Python
  1178. text.front(sz::whitespaces_set()); // all leading whitespaces
  1179. text.back(sz::digits_set()); // all numerical symbols forming the suffix
  1180. // Incremental construction
  1181. using sz::string::unchecked;
  1182. text.push_back('x'); // no surprises here
  1183. text.push_back('x', unchecked); // no bounds checking, Rust style
  1184. text.try_push_back('x'); // returns `false` if the string is full and the allocation failed
  1185. sz::concatenate(text, "@", domain, ".", tld); // No allocations
  1186. ```
  1187. ### Splits and Ranges
  1188. One of the most common use cases is to split a string into a collection of substrings.
  1189. Which would often result in [StackOverflow lookups][so-split] and snippets like the one below.
  1190. [so-split]: https://stackoverflow.com/questions/14265581/parse-split-a-string-in-c-using-string-delimiter-standard-c
  1191. ```cpp
  1192. std::vector<std::string> lines = split(haystack, "\r\n"); // string delimiter
  1193. std::vector<std::string> words = split(lines, ' '); // character delimiter
  1194. ```
  1195. Those allocate memory for each string and the temporary vectors.
  1196. Each allocation can be orders of magnitude more expensive, than even serial `for`-loop over characters.
  1197. To avoid those, StringZilla provides lazily-evaluated ranges, compatible with the [Range-v3][range-v3] library.
  1198. [range-v3]: https://github.com/ericniebler/range-v3
  1199. ```cpp
  1200. for (auto line : haystack.split("\r\n"))
  1201. for (auto word : line.split(sz::byteset(" \w\t.,;:!?")))
  1202. std::cout << word << std::endl;
  1203. ```
  1204. Each of those is available in reverse order as well.
  1205. It also allows interleaving matches, if you want both inclusions of `xx` in `xxx`.
  1206. Debugging pointer offsets is not a pleasant exercise, so keep the following functions in mind.
  1207. - `haystack.[r]find_all(needle, interleaving)`
  1208. - `haystack.[r]find_all(sz::byteset(""))`
  1209. - `haystack.[r]split(needle)`
  1210. - `haystack.[r]split(sz::byteset(""))`
  1211. For $N$ matches the split functions will report $N+1$ matches, potentially including empty strings.
  1212. Ranges have a few convenience methods as well:
  1213. ```cpp
  1214. range.size(); // -> std::size_t
  1215. range.empty(); // -> bool
  1216. range.template to<std::set<std::sting>>();
  1217. range.template to<std::vector<std::sting_view>>();
  1218. ```
  1219. ### Concatenating Strings without Allocations
  1220. Another common string operation is concatenation.
  1221. The STL provides `std::string::operator+` and `std::string::append`, but those are not very efficient, if multiple invocations are performed.
  1222. ```cpp
  1223. std::string name, domain, tld;
  1224. auto email = name + "@" + domain + "." + tld; // 4 allocations
  1225. ```
  1226. The efficient approach would be to pre-allocate the memory and copy the strings into it.
  1227. ```cpp
  1228. std::string email;
  1229. email.reserve(name.size() + domain.size() + tld.size() + 2);
  1230. email.append(name), email.append("@"), email.append(domain), email.append("."), email.append(tld);
  1231. ```
  1232. That's mouthful and error-prone.
  1233. StringZilla provides a more convenient `concatenate` function, which takes a variadic number of arguments.
  1234. It also overrides the `operator|` to concatenate strings lazily, without any allocations.
  1235. ```cpp
  1236. auto email = sz::concatenate(name, "@", domain, ".", tld); // 0 allocations
  1237. auto email = name | "@" | domain | "." | tld; // 0 allocations
  1238. sz::string email = name | "@" | domain | "." | tld; // 1 allocations
  1239. ```
  1240. ### Random Generation
  1241. Software developers often need to generate random strings for testing purposes.
  1242. The STL provides `std::generate` and `std::random_device`, that can be used with StringZilla.
  1243. ```cpp
  1244. sz::string random_string(std::size_t length, char const *alphabet, std::size_t cardinality) {
  1245. sz::string result(length, '\0');
  1246. static std::random_device seed_source; // Expensive to construct - due to system calls
  1247. static std::mt19937 generator(seed_source()); // Also expensive - due to the state size
  1248. std::uniform_int_distribution<std::size_t> distribution(0, cardinality);
  1249. std::generate(result.begin(), result.end(), [&]() { return alphabet[distribution(generator)]; });
  1250. return result;
  1251. }
  1252. ```
  1253. Mouthful and slow.
  1254. StringZilla provides a C native method - `sz_fill_random` and a convenient C++ wrapper - `sz::generate`.
  1255. Similar to Python it also defines the commonly used character sets.
  1256. ```cpp
  1257. auto protein = sz::string::random(300, "ARNDCQEGHILKMFPSTWYV"); // static method
  1258. auto dna = sz::basic_string<custom_allocator>::random(3_000_000_000, "ACGT");
  1259. dna.fill_random("ACGT"); // `noexcept` pre-allocated version
  1260. dna.fill_random(&std::rand, "ACGT"); // pass any generator, like `std::mt19937`
  1261. char uuid[36];
  1262. sz::fill_random(sz::string_span(uuid, 36), "0123456789abcdef-"); // Overwrite any buffer
  1263. ```
  1264. ### Bulk Replacements
  1265. In text processing, it's often necessary to replace all occurrences of a specific substring or set of characters within a string.
  1266. Standard library functions may not offer the most efficient or convenient methods for performing bulk replacements, especially when dealing with large strings or performance-critical applications.
  1267. - `haystack.replace_all(needle_string, replacement_string)`
  1268. - `haystack.replace_all(sz::byteset(""), replacement_string)`
  1269. - `haystack.try_replace_all(needle_string, replacement_string)`
  1270. - `haystack.try_replace_all(sz::byteset(""), replacement_string)`
  1271. - `haystack.lookup(sz::look_up_table::identity())`
  1272. - `haystack.lookup(sz::look_up_table::identity(), haystack.data())`
  1273. ### Sorting in C and C++
  1274. LibC provides `qsort` and STL provides `std::sort`.
  1275. Both have their quirks.
  1276. The LibC standard has no way to pass a context to the comparison function, that's only possible with platform-specific extensions.
  1277. Those have [different arguments order](https://stackoverflow.com/a/39561369) on every OS.
  1278. ```c
  1279. // Linux: https://linux.die.net/man/3/qsort_r
  1280. void qsort_r(void *elements, size_t count, size_t element_width,
  1281. int (*compare)(void const *left, void const *right, void *context),
  1282. void *context);
  1283. // macOS and FreeBSD: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/qsort_r.3.html
  1284. void qsort_r(void *elements, size_t count, size_t element_width,
  1285. void *context,
  1286. int (*compare)(void *context, void const *left, void const *right));
  1287. // Windows conflicts with ISO `qsort_s`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/qsort-s?view=msvc-170
  1288. void qsort_s(id *elements, size_t count, size_t element_width,
  1289. int (*compare)(void *context, void const *left, void const *right),
  1290. void *context);
  1291. ```
  1292. C++ generic algorithm is not perfect either.
  1293. There is no guarantee in the standard that `std::sort` won't allocate any memory.
  1294. If you are running on embedded, in real-time or on 100+ CPU cores per node, you may want to avoid that.
  1295. StringZilla doesn't solve the general case, but hopes to improve the performance for strings.
  1296. Use `sz_sequence_argsort`, or the high-level `sz::argsort`, which can be used sort any collection of elements convertible to `sz::string_view`.
  1297. ```cpp
  1298. std::vector<std::string> data({"c", "b", "a"});
  1299. std::vector<std::size_t> order = sz::argsort(data); //< Simple shortcut
  1300. // Or, taking care of memory allocation:
  1301. sz::argsort(data.begin(), data.end(), order.data(), [](auto const &x) -> sz::string_view { return x; });
  1302. ```
  1303. ### Standard C++ Containers with String Keys
  1304. The C++ Standard Templates Library provides several associative containers, often used with string keys.
  1305. ```cpp
  1306. std::map<std::string, int, std::less<std::string>> sorted_words;
  1307. std::unordered_map<std::string, int, std::hash<std::string>, std::equal_to<std::string>> words;
  1308. ```
  1309. The performance of those containers is often limited by the performance of the string keys, especially on reads.
  1310. StringZilla can be used to accelerate containers with `std::string` keys, by overriding the default comparator and hash functions.
  1311. ```cpp
  1312. std::map<std::string, int, sz::less> sorted_words;
  1313. std::unordered_map<std::string, int, sz::hash, sz::equal_to> words;
  1314. ```
  1315. Alternatively, a better approach would be to use the `sz::string` class as a key.
  1316. The right hash function and comparator would be automatically selected and the performance gains would be more noticeable if the keys are short.
  1317. ```cpp
  1318. std::map<sz::string, int> sorted_words;
  1319. std::unordered_map<sz::string, int> words;
  1320. ```
  1321. ### Compilation Settings and Debugging
  1322. __`SZ_DEBUG`__:
  1323. > For maximal performance, the C library does not perform any bounds checking in Release builds.
  1324. > In C++, bounds checking happens only in places where the STL `std::string` would do it.
  1325. > If you want to enable more aggressive bounds-checking, define `SZ_DEBUG` before including the header.
  1326. > If not explicitly set, it will be inferred from the build type.
  1327. __`SZ_USE_GOLDMONT`, `SZ_USE_WESTMERE`, `SZ_USE_HASWELL`, `SZ_USE_SKYLAKE`, `SZ_USE_ICE`, `SZ_USE_NEON`, `SZ_USE_NEON_AES`, `SZ_USE_NEON_SHA`, `SZ_USE_SVE`, `SZ_USE_SVE2`, `SZ_USE_SVE2_AES`__:
  1328. > One can explicitly disable certain families of SIMD instructions for compatibility purposes.
  1329. > Default values are inferred at compile time depending on compiler support (for dynamic dispatch) and the target architecture (for static dispatch).
  1330. __`SZ_USE_CUDA`, `SZ_USE_KEPLER`, `SZ_USE_HOPPER`__:
  1331. > One can explicitly disable certain families of PTX instructions for compatibility purposes.
  1332. > Default values are inferred at compile time depending on compiler support (for dynamic dispatch) and the target architecture (for static dispatch).
  1333. __`SZ_ENFORCE_SVE_OVER_NEON`__:
  1334. > SVE and SVE2 are expected to supersede NEON on ARM architectures.
  1335. > Still, oftentimes the equivalent SVE kernels are slower due to equally small register files and higher complexity of the instructions.
  1336. > By default, when both SVE and NEON are available, SVE is used selectively only for the algorithms that benefit from it.
  1337. > If you want to enforce SVE usage everywhere, define this flag.
  1338. __`SZ_DYNAMIC_DISPATCH`__:
  1339. > By default, StringZilla is a header-only library.
  1340. > But if you are running on different generations of devices, it makes sense to pre-compile the library for all supported generations at once, and dispatch at runtime.
  1341. > This flag does just that and is used to produce the `stringzilla.so` shared library, as well as the Python bindings.
  1342. __`SZ_USE_MISALIGNED_LOADS`__:
  1343. > Default is platform-dependent: enabled on x86 (where unaligned accesses are fast), disabled on others by default.
  1344. > When enabled, many byte-level operations use word-sized loads, which can significantly accelerate the serial (SWAR) backend.
  1345. > Consider enabling it explicitly if you are targeting platforms that support fast unaligned loads.
  1346. __`SZ_AVOID_LIBC`__ and __`SZ_OVERRIDE_LIBC`__:
  1347. > When using the C header-only library one can disable the use of LibC.
  1348. > This may affect the type resolution system on obscure hardware platforms.
  1349. > Moreover, one may let `stringzilla` override the common symbols like the `memcpy` and `memset` with its own implementations.
  1350. > In that case you can use the [`LD_PRELOAD` trick][ld-preload-trick] to prioritize its symbols over the ones from the LibC and accelerate existing string-heavy applications without recompiling them.
  1351. > It also adds a layer of security, as the `stringzilla` isn't [undefined for NULL inputs][redhat-memcpy-ub] like `memcpy(NULL, NULL, 0)`.
  1352. [ld-preload-trick]: https://ashvardanian.com/posts/ld-preload-libsee
  1353. [redhat-memcpy-ub]: https://developers.redhat.com/articles/2024/12/11/making-memcpynull-null-0-well-defined
  1354. __`SZ_AVOID_STL`__ and __`SZ_SAFETY_OVER_COMPATIBILITY`__:
  1355. > When using the C++ interface one can disable implicit conversions from `std::string` to `sz::string` and back.
  1356. > If not needed, the `<string>` and `<string_view>` headers will be excluded, reducing compilation time.
  1357. > Moreover, if STL compatibility is a low priority, one can make the API safer by disabling the overloads, which are subjectively error prone.
  1358. __`STRINGZILLA_BUILD_SHARED`, `STRINGZILLA_BUILD_TEST`, `STRINGZILLA_BUILD_BENCHMARK`, `STRINGZILLA_TARGET_ARCH`__ for CMake users:
  1359. > When compiling the tests and benchmarks, you can explicitly set the target hardware architecture.
  1360. > It's synonymous to GCC's `-march` flag and is used to enable/disable the appropriate instruction sets.
  1361. > You can also disable the shared library build, if you don't need it.
  1362. ## Quick Start: Rust
  1363. StringZilla is available as a Rust crate, with documentation available on [docs.rs/stringzilla](https://docs.rs/stringzilla).
  1364. You can immediately check the installed version and the used hardware capabilities with following commands:
  1365. ```bash
  1366. cargo add stringzilla
  1367. cargo run --example version
  1368. ```
  1369. To use the latest crate release in your project, add the following to your `Cargo.toml`:
  1370. ```toml
  1371. [dependencies]
  1372. stringzilla = ">=3" # for serial algorithms
  1373. stringzilla = { version = ">=3", features = ["cpus"] } # for parallel multi-CPU backends
  1374. stringzilla = { version = ">=3", features = ["cuda"] } # for parallel Nvidia GPU backend
  1375. ```
  1376. Or if you want to use the latest pre-release version from the repository:
  1377. ```toml
  1378. [dependencies]
  1379. stringzilla = { git = "https://github.com/ashvardanian/stringzilla", branch = "main-dev" }
  1380. ```
  1381. Once installed, all of the functionality is available through the `stringzilla` namespace.
  1382. Many interfaces will look familiar to the users of the `memchr` Rust crate.
  1383. ```rust
  1384. use stringzilla::sz;
  1385. // Identical to `memchr::memmem::find` and `memchr::memmem::rfind` functions
  1386. sz::find("Hello, world!", "world") // 7
  1387. sz::rfind("Hello, world!", "world") // 7
  1388. // Generalizations of `memchr::memrchr[123]`
  1389. sz::find_byte_from("Hello, world!", "world") // 2
  1390. sz::rfind_byte_from("Hello, world!", "world") // 11
  1391. ```
  1392. It also provides no constraints on the size of the character set, while `memchr` allows only 1, 2, or 3 characters.
  1393. In addition to global functions, `stringzilla` provides a `StringZilla` extension trait:
  1394. ```rust
  1395. use stringzilla::StringZilla;
  1396. let my_string: String = String::from("Hello, world!");
  1397. let my_str = my_string.as_str();
  1398. let my_cow_str = Cow::from(&my_string);
  1399. // Use the generic function with a String
  1400. assert_eq!(my_string.sz_find("world"), Some(7));
  1401. assert_eq!(my_string.sz_rfind("world"), Some(7));
  1402. assert_eq!(my_string.sz_find_byte_from("world"), Some(2));
  1403. assert_eq!(my_string.sz_rfind_byte_from("world"), Some(11));
  1404. assert_eq!(my_string.sz_find_byte_not_from("world"), Some(0));
  1405. assert_eq!(my_string.sz_rfind_byte_not_from("world"), Some(12));
  1406. // Same works for &str and Cow<'_, str>
  1407. assert_eq!(my_str.sz_find("world"), Some(7));
  1408. assert_eq!(my_cow_str.as_ref().sz_find("world"), Some(7));
  1409. ```
  1410. ### Hash
  1411. Single-shot and incremental hashing are both supported:
  1412. ```rs
  1413. let mut hasher = sz::Hasher::new(42);
  1414. hasher.write(b"Hello, ");
  1415. hasher.write(b"world!");
  1416. let streamed = hasher.finish();
  1417. let mut hasher = sz::Hasher::new(42);
  1418. hasher.write(b"Hello, world!");
  1419. assert_eq!(streamed, hasher.finish());
  1420. ```
  1421. To use StringZilla with `std::collections`:
  1422. ```rs
  1423. use std::collections::HashMap;
  1424. let mut map: HashMap<&str, i32, sz::BuildSzHasher> =
  1425. HashMap::with_hasher(sz::BuildSzHasher::with_seed(42));
  1426. map.insert("a", 1);
  1427. assert_eq!(map.get("a"), Some(&1));
  1428. ```
  1429. ### SHA-256 Checksums
  1430. SHA-256 cryptographic checksums are available:
  1431. ```rs
  1432. use stringzilla::sz;
  1433. // One-shot SHA-256
  1434. let digest = sz::Sha256::hash(b"Hello, world!");
  1435. assert_eq!(digest.len(), 32);
  1436. // Incremental SHA-256
  1437. let mut hasher = sz::Sha256::new();
  1438. hasher.update(b"Hello, ");
  1439. hasher.update(b"world!");
  1440. let digest = hasher.digest();
  1441. // HMAC-SHA256 for message authentication
  1442. let mac = sz::hmac_sha256(b"secret", b"Hello, world!");
  1443. ```
  1444. ### Unicode Case-Folding and Case-Insensitive Search
  1445. StringZilla implements both Unicode Case Folding and Case-Insensitive UTF-8 Search.
  1446. Unlike most libraries only capable of lower-casing ASCII-represented English alphabet, StringZilla covers over 1M+ codepoints.
  1447. The case-folding API expects the output buffer to be at least 3× larger than the input, to accommodate for the worst-case character expansions scenarios.
  1448. ```rust
  1449. use stringzilla::stringzilla as sz;
  1450. let source = "Straße"; // German: "Street"
  1451. let mut dest = [0u8; 64]; // Must be at least 3x source length
  1452. let len = sz::utf8_case_fold(source, &mut dest);
  1453. assert_eq!(&dest[..len], b"strasse"); // ß (2 bytes) → "ss" (2 bytes)
  1454. ```
  1455. The case-insensitive search returns `Some((offset, matched_length))` or `None`.
  1456. The `matched_length` may differ from needle length due to expansions.
  1457. ```rust
  1458. use stringzilla::stringzilla::{utf8_case_insensitive_find, Utf8CaseInsensitiveNeedle};
  1459. // Single search — ß (C3 9F) matches "SS"
  1460. if let Some((offset, len)) = utf8_case_insensitive_find("Straße", "STRASSE") {
  1461. assert_eq!(offset, 0);
  1462. assert_eq!(len, 7); // "Straße" is 7 bytes
  1463. }
  1464. // Repeated searches with pre-compiled needle metadata
  1465. let needle = Utf8CaseInsensitiveNeedle::new(b"STRASSE");
  1466. for haystack in &["Straße", "STRASSE", "strasse"] {
  1467. if let Some((offset, len)) = utf8_case_insensitive_find(haystack, &needle) {
  1468. println!("Found at byte {} with length {}", offset, len);
  1469. }
  1470. }
  1471. ```
  1472. ### Similarity Scores
  1473. StringZilla exposes high-performance, batch-oriented similarity via the `szs` module.
  1474. Use `DeviceScope` to pick hardware and optionally limit capabilities per engine.
  1475. ```rust
  1476. use stringzilla::szs; // re-exported as `szs`
  1477. let cpu_scope = szs::DeviceScope::cpu_cores(4).unwrap(); // force CPU-only
  1478. let gpu_scope = szs::DeviceScope::gpu_device(0).unwrap(); // pick GPU 0 if available
  1479. let strings_a = vec!["kitten", "flaw"];
  1480. let strings_b = vec!["sitting", "lawn"];
  1481. let engine = szs::LevenshteinDistances::new(
  1482. &cpu_scope,
  1483. 0, // match cost
  1484. 2, // mismatch cost - costs don't have to be 1
  1485. 3, // open cost - may be different in Bio
  1486. 1, // extend cost
  1487. ).unwrap();
  1488. let distances = engine.compute(&cpu_scope, &strings_a, &strings_b).unwrap();
  1489. assert_eq!(distances[0], 3);
  1490. assert_eq!(distances[1], 2);
  1491. ```
  1492. Note, that this computes byte-level distances.
  1493. For UTF-8 codepoints, use a different engine class:
  1494. ```rust
  1495. let strings_a = vec!["café", "αβγδ"];
  1496. let strings_b = vec!["cafe", "αγδ"];
  1497. let engine = szs::LevenshteinDistancesUtf8::new(&cpu_scope, 0, 1, 1, 1).unwrap();
  1498. let distances = engine.compute(&cpu_scope, &strings_a, &strings_b).unwrap();
  1499. assert_eq!(distances, vec![1, 1]);
  1500. ```
  1501. Similarly, for variable substitution costs, also pass in a a weights matrix:
  1502. ```rust
  1503. let mut substitution_matrix = [-1i8; 256 * 256];
  1504. for i in 0..256 { substitution_matrix[i * 256 + i] = 0; }
  1505. let engine = szs::NeedlemanWunschScores::new(&cpu_scope, &substitution_matrix, -3, -1).unwrap();
  1506. let scores = engine.compute(&cpu_scope, &strings_a, &strings_b).unwrap();
  1507. ```
  1508. Or for local alignment scores:
  1509. ```rust
  1510. let engine = szs::SmithWatermanScores::new(&cpu_scope, &substitution_matrix, -3, -1).unwrap();
  1511. let local_scores = engine.compute(&cpu_scope, &strings_a, &strings_b).unwrap();
  1512. ```
  1513. For high-performance applications, use the [StringTape](https://github.com/ashvardanian/StringTape) crate to pass strings to `compute_into` methods without extra memory allocations:
  1514. ```rust
  1515. use stringzilla::{szs, StringTape};
  1516. // Create StringTape from data (zero-copy compatible format)
  1517. let tape_a = StringTape::from_strings(&["kitten", "sitting", "flaw"]);
  1518. let tape_b = StringTape::from_strings(&["sitting", "kitten", "lawn"]);
  1519. // Use unified memory vector for GPU compatibility, initialized with max values
  1520. let mut distances = szs::UnifiedVec::<u32>::from_elem(u32::MAX, tape_a.len());
  1521. let engine = szs::LevenshteinDistances::new(&gpu_scope, 0, 1, 1, 1).unwrap();
  1522. engine.compute_into(&gpu_scope, &tape_a, &tape_b, &mut distances).unwrap();
  1523. // Results computed directly into unified memory, accessible from both CPU/GPU
  1524. assert_eq!(distances[0], 3); // kitten -> sitting
  1525. assert_eq!(distances[1], 3); // sitting -> kitten
  1526. assert_eq!(distances[2], 2); // flaw -> lawn
  1527. ```
  1528. ### Rolling Fingerprints
  1529. MinHashing is a common technique for Information Retrieval, producing compact representations of large documents.
  1530. For $D$ hash-functions and a text of length $L$, in the worst case it involves computing $O(D \cdot L)$ hashes.
  1531. ```rust
  1532. use stringzilla::szs;
  1533. let texts = vec![
  1534. "quick brown fox jumps over the lazy dog",
  1535. "quick brown fox jumped over a very lazy dog",
  1536. ];
  1537. let cpu = szs::DeviceScope::cpu_cores(4).unwrap();
  1538. let ndim = 1024;
  1539. let window_widths = vec![4u64, 6, 8, 10];
  1540. let engine = szs::Fingerprints::new(
  1541. ndim, // number of hash functions & dimensions
  1542. &window_widths, // optional predefined window widths
  1543. 256, // default alphabet size for byte strings
  1544. &cpu // device scope
  1545. ).unwrap();
  1546. let (hashes, counts) = engine.compute(&cpu, &texts).unwrap();
  1547. assert_eq!(hashes.len(), texts.len() * ndim);
  1548. assert_eq!(counts.len(), texts.len() * ndim);
  1549. ```
  1550. For zero-copy processing with [StringTape](https://github.com/ashvardanian/StringTape) format and unified memory:
  1551. ```rust
  1552. use stringzilla::{szs, StringTape};
  1553. let tape = StringTape::from_strings(&[
  1554. "quick brown fox jumps over the lazy dog",
  1555. "quick brown fox jumped over a very lazy dog",
  1556. ]);
  1557. // Pre-allocate unified memory buffers
  1558. let mut hashes = szs::UnifiedVec::<u32>::from_elem(u32::MAX, tape.len() * ndim);
  1559. let mut counts = szs::UnifiedVec::<u32>::from_elem(u32::MAX, tape.len() * ndim);
  1560. let engine = szs::Fingerprints::new(ndim, &window_widths, 256, &cpu).unwrap();
  1561. engine.compute_into(&cpu, &tape, &mut hashes, &mut counts).unwrap();
  1562. // Results computed directly into unified memory buffers
  1563. assert!(hashes.iter().any(|&h| h != u32::MAX)); // Verify computation occurred
  1564. assert!(counts.iter().any(|&c| c != u32::MAX));
  1565. ```
  1566. ## Quick Start: JavaScript
  1567. Install the Node.js package and use zero-copy `Buffer` APIs.
  1568. ```bash
  1569. npm install stringzilla
  1570. node -p "require('stringzilla').default.capabilities" # for CommonJS
  1571. node -e "import('stringzilla').then(m=>console.log(m.default.capabilities)).catch(console.error)" # for ESM
  1572. ```
  1573. ```js
  1574. import sz from 'stringzilla';
  1575. const haystack = Buffer.from('Hello, world!');
  1576. const needle = Buffer.from('world');
  1577. // Substring search (BigInt offsets)
  1578. const firstIndex = sz.find(haystack, needle); // 7n
  1579. const lastIndex = sz.findLast(haystack, needle); // 7n
  1580. // Character / charset search
  1581. const firstOIndex = sz.findByte(haystack, 'o'.charCodeAt(0)); // 4n
  1582. const firstVowelIndex = sz.findByteFrom(haystack, Buffer.from('aeiou')); // 1n
  1583. const lastVowelIndex = sz.findLastByteFrom(haystack, Buffer.from('aeiou')); // 8n
  1584. // Counting (optionally overlapping)
  1585. const lCount = sz.count(haystack, Buffer.from('l')); // 3n
  1586. const llOverlapCount = sz.count(haystack, Buffer.from('ll'), true); // 1n
  1587. // Equality/ordering utilities
  1588. const isEqual = sz.equal(Buffer.from('a'), Buffer.from('a'));
  1589. const order = sz.compare(Buffer.from('a'), Buffer.from('b')); // -1, 0, or 1
  1590. // Other helpers
  1591. const byteSum = sz.byteSum(haystack); // sum of bytes as BigInt
  1592. ```
  1593. ### Unicode Case-Folding and Case-Insensitive Search
  1594. StringZilla provides full Unicode case folding (including expansions like `ß → ss`, ligatures like `fi → fi`, and special folds like `µ → μ`, `K → k`)
  1595. and a case-insensitive substring search that accounts for those expansions.
  1596. ```js
  1597. import sz from "stringzilla";
  1598. // Case folding (returns a UTF-8 Buffer)
  1599. console.log(sz.utf8CaseFold(Buffer.from("Straße")).toString("utf8")); // "strasse"
  1600. console.log(sz.utf8CaseFold(Buffer.from("office")).toString("utf8")); // "office" (U+FB01 ligature)
  1601. // Case-insensitive substring search (full Unicode case folding)
  1602. const text = Buffer.from(
  1603. "Die Temperaturschwankungen im kosmischen Mikrowellenhintergrund sind ein Maß von etwa 20 µK.\n" +
  1604. "Typografisch sieht man auch: ein Maß von etwa 20 μK."
  1605. );
  1606. const patternBytes = Buffer.from("EIN MASS VON ETWA 20 μK");
  1607. const first = sz.utf8CaseInsensitiveFind(text, patternBytes);
  1608. console.log(first); // { index: 69n, length: ... } (byte offsets)
  1609. // Reuse the same needle efficiently
  1610. const pattern = new sz.Utf8CaseInsensitiveNeedle(patternBytes);
  1611. const again = pattern.findIn(text);
  1612. console.log(again.index === first.index);
  1613. ```
  1614. ### Hash
  1615. Single-shot and incremental hashing are both supported:
  1616. ```js
  1617. import sz from 'stringzilla';
  1618. // One-shot - stable 64-bit output across all platforms!
  1619. const hash = sz.hash(Buffer.from('Hello, world!'), 42); // returns BigInt
  1620. // Incremental updates - hasher maintains state
  1621. const hasher = new sz.Hasher(42); // seed: 42
  1622. hasher.update(Buffer.from('Hello, '));
  1623. hasher.update(Buffer.from('world!'));
  1624. const streamedHash = hasher.digest(); // returns BigInt
  1625. console.assert(hash === streamedHash);
  1626. ```
  1627. ### SHA-256 Checksums
  1628. SHA-256 cryptographic checksums are available:
  1629. ```js
  1630. import sz from 'stringzilla';
  1631. // One-shot SHA-256
  1632. const digest = sz.sha256(Buffer.from('Hello, world!')); // returns Buffer (32 bytes)
  1633. // Incremental SHA-256
  1634. const hasher = new sz.Sha256();
  1635. hasher.update(Buffer.from('Hello, '));
  1636. hasher.update(Buffer.from('world!'));
  1637. const digestBuffer = hasher.digest(); // returns Buffer (32 bytes)
  1638. const digestHex = hasher.hexdigest(); // returns string (64 hex chars)
  1639. ```
  1640. ## Quick Start: Swift
  1641. StringZilla can be added as a dependency in the Swift Package Manager.
  1642. In your `Package.swift` file, add the following:
  1643. ```swift
  1644. dependencies: [
  1645. .package(url: "https://github.com/ashvardanian/stringzilla")
  1646. ]
  1647. ```
  1648. The package currently covers only the most basic functionality, but is planned to be extended to cover the full C++ API.
  1649. ```swift
  1650. var s = "Hello, world! Welcome to StringZilla. 👋"
  1651. s[s.findFirst(substring: "world")!...] // "world! Welcome to StringZilla. 👋"
  1652. s[s.findLast(substring: "o")!...] // "o StringZilla. 👋"
  1653. s[s.findFirst(characterFrom: "aeiou")!...] // "ello, world! Welcome to StringZilla. 👋"
  1654. s[s.findLast(characterFrom: "aeiou")!...] // "a. 👋")
  1655. s[s.findFirst(characterNotFrom: "aeiou")!...] // "Hello, world! Welcome to StringZilla. 👋"
  1656. ```
  1657. ### Unicode Case-Folding and Case-Insensitive Search
  1658. ```swift
  1659. import StringZilla
  1660. let folded = "Straße".utf8CaseFoldedBytes()
  1661. print(String(decoding: folded, as: UTF8.self)) // "strasse"
  1662. let haystack =
  1663. "Die Temperaturschwankungen im kosmischen Mikrowellenhintergrund sind ein Maß von etwa 20 µK.\n"
  1664. + "Typografisch sieht man auch: ein Maß von etwa 20 μK."
  1665. let needle = "EIN MASS VON ETWA 20 μK"
  1666. if let range = haystack.utf8CaseInsensitiveFind(substring: needle) {
  1667. print(haystack[range]) // "ein Maß von etwa 20 µK"
  1668. }
  1669. // Reuse the same needle efficiently
  1670. let compiledNeedle = Utf8CaseInsensitiveNeedle(needle)
  1671. if let range = compiledNeedle.findFirst(in: haystack) {
  1672. print(haystack[range])
  1673. }
  1674. ```
  1675. ### Hash
  1676. StringZilla provides high-performance hashing for Swift strings:
  1677. ```swift
  1678. import StringZilla
  1679. // One-shot hashing - stable 64-bit output across all platforms!
  1680. let hash = "Hello, world!".hash(seed: 42)
  1681. // Incremental hashing for streaming data
  1682. var hasher = StringZillaHasher(seed: 42)
  1683. hasher.update("Hello, ")
  1684. hasher.update("world!")
  1685. let streamedHash = hasher.digest()
  1686. assert(hash == streamedHash)
  1687. ```
  1688. ### SHA-256 Checksums
  1689. SHA-256 cryptographic checksums are available:
  1690. ```swift
  1691. import StringZilla
  1692. // One-shot SHA-256
  1693. let digest = "Hello, world!".sha256() // returns [UInt8] (32 bytes)
  1694. // Incremental SHA-256
  1695. var hasher = StringZillaSha256()
  1696. hasher.update("Hello, ")
  1697. hasher.update("world!")
  1698. let digestBytes = hasher.digest() // [UInt8] (32 bytes)
  1699. let digestHex = hasher.hexdigest() // String (64 hex chars)
  1700. ```
  1701. ## Quick Start: GoLang
  1702. Add the Go binding as a module dependency:
  1703. ```bash
  1704. go get github.com/ashvardanian/stringzilla/golang@latest
  1705. ```
  1706. Build the shared C library once, then ensure your runtime can locate it (Linux shown):
  1707. ```bash
  1708. cmake -B build_shared -D STRINGZILLA_BUILD_SHARED=1 -D CMAKE_BUILD_TYPE=Release
  1709. cmake --build build_shared --target stringzilla_shared --config Release
  1710. export LD_LIBRARY_PATH="$PWD/build_shared:$LD_LIBRARY_PATH"
  1711. ```
  1712. Use finders (substring, bytes, and sets):
  1713. ```go
  1714. package main
  1715. import (
  1716. "fmt"
  1717. sz "github.com/ashvardanian/stringzilla/golang"
  1718. )
  1719. func main() {
  1720. s := "the quick brown fox jumps over the lazy dog"
  1721. // Substrings
  1722. fmt.Println(sz.Contains(s, "brown")) // true
  1723. fmt.Println(sz.Index(s, "the")) // 0
  1724. fmt.Println(sz.LastIndex(s, "the")) // 35
  1725. // Single bytes
  1726. fmt.Println(sz.IndexByte(s, 'o')) // 12
  1727. fmt.Println(sz.LastIndexByte(s, 'o')) // 41
  1728. // Byte sets
  1729. fmt.Println(sz.IndexAny(s, "aeiou")) // 2 (first vowel)
  1730. fmt.Println(sz.LastIndexAny(s, "aeiou")) // 43 (last vowel)
  1731. // Counting with/without overlaps
  1732. fmt.Println(sz.Count("aaaaa", "aa", false)) // 2
  1733. fmt.Println(sz.Count("aaaaa", "aa", true)) // 4
  1734. fmt.Println(sz.Count("abc", "", false)) // 4
  1735. fmt.Println(sz.Bytesum("ABC"), sz.Bytesum("ABCD"))
  1736. }
  1737. ```
  1738. ### Unicode Case-Folding and Case-Insensitive Search
  1739. ```go
  1740. package main
  1741. import (
  1742. "fmt"
  1743. sz "github.com/ashvardanian/stringzilla/golang"
  1744. )
  1745. func main() {
  1746. folded, _ := sz.Utf8CaseFold("Straße", true)
  1747. fmt.Println(folded) // "strasse"
  1748. haystack := "Die Temperaturschwankungen im kosmischen Mikrowellenhintergrund sind ein Maß von etwa 20 µK.\n" +
  1749. "Typografisch sieht man auch: ein Maß von etwa 20 μK."
  1750. needle := "EIN MASS VON ETWA 20 μK"
  1751. start64, len64, _ := sz.Utf8CaseInsensitiveFind(haystack, needle, true)
  1752. start, end := int(start64), int(start64+len64)
  1753. fmt.Println(haystack[start:end]) // "ein Maß von etwa 20 µK"
  1754. // Reuse the same needle efficiently
  1755. compiled, _ := sz.NewUtf8CaseInsensitiveNeedle(needle, true)
  1756. start64, len64, _ = compiled.FindIn(haystack, true)
  1757. start, end = int(start64), int(start64+len64)
  1758. fmt.Println(haystack[start:end])
  1759. }
  1760. ```
  1761. ### Hash
  1762. Single-shot and incremental hashing are both supported.
  1763. The `Hasher` type implements Go's standard `hash.Hash64` and `io.Writer` interfaces:
  1764. ```go
  1765. import (
  1766. "io"
  1767. sz "github.com/ashvardanian/stringzilla/golang"
  1768. )
  1769. // One-shot hashing
  1770. one := sz.Hash("Hello, world!", 42)
  1771. // Streaming hasher (implements hash.Hash64 and io.Writer)
  1772. hasher := sz.NewHasher(42)
  1773. hasher.Write([]byte("Hello, "))
  1774. hasher.Write([]byte("world!"))
  1775. streamed := hasher.Digest() // or hasher.Sum64()
  1776. fmt.Println(one == streamed) // true
  1777. // Works with io.Copy and any io.Reader
  1778. file, _ := os.Open("data.txt")
  1779. hasher.Reset()
  1780. io.Copy(hasher, file)
  1781. fileHash := hasher.Sum64()
  1782. ```
  1783. ### SHA-256 Checksums
  1784. SHA-256 cryptographic checksums are available.
  1785. The `Sha256` type implements Go's standard `hash.Hash` and `io.Writer` interfaces:
  1786. ```go
  1787. import (
  1788. "io"
  1789. sz "github.com/ashvardanian/stringzilla/golang"
  1790. )
  1791. // One-shot SHA-256
  1792. digest := sz.HashSha256([]byte("Hello, world!"))
  1793. fmt.Printf("%x\n", digest) // prints 32-byte hash in hex
  1794. // Streaming SHA-256 (implements hash.Hash and io.Writer)
  1795. hasher := sz.NewSha256()
  1796. hasher.Write([]byte("Hello, "))
  1797. hasher.Write([]byte("world!"))
  1798. digestBytes := hasher.Digest() // [32]byte
  1799. digestHex := hasher.Hexdigest() // string (64 hex chars)
  1800. // Works with io.Copy and any io.Reader
  1801. file, _ := os.Open("data.bin")
  1802. hasher.Reset()
  1803. io.Copy(hasher, file)
  1804. fileDigest := hasher.Digest()
  1805. // Standard hash.Hash interface methods
  1806. sum := hasher.Sum(nil) // []byte with 32 bytes
  1807. size := hasher.Size() // 32
  1808. blockSize := hasher.BlockSize() // 64
  1809. ```
  1810. ## Algorithms & Design Decisions
  1811. StringZilla aims to optimize some of the slowest string operations.
  1812. Some popular operations, however, like equality comparisons and relative order checking, almost always complete on some of the very first bytes in either string.
  1813. In such operations vectorization is almost useless, unless huge and very similar strings are considered.
  1814. StringZilla implements those operations as well, but won't result in substantial speedups.
  1815. Where vectorization stops being effective, parallelism takes over with the new layered cake architecture:
  1816. - StringZilla C library w/out dependencies
  1817. - StringZillas parallel extensions:
  1818. - Parallel C++ algorithms built with [Fork Union](https://github.com/ashvardanian/fork_union)
  1819. - Parallel CUDA algorithms for Nvidia GPUs
  1820. - Parallel ROCm algorithms for AMD GPUs 🔜
  1821. ### Exact Substring Search
  1822. Substring search algorithms are generally divided into: comparison-based, automaton-based, and bit-parallel.
  1823. Different families are effective for different alphabet sizes and needle lengths.
  1824. The more operations are needed per-character - the more effective SIMD would be.
  1825. The longer the needle - the more effective the skip-tables are.
  1826. StringZilla uses different exact substring search algorithms for different needle lengths and backends:
  1827. - When no SIMD is available - SWAR (SIMD Within A Register) algorithms are used on 64-bit words.
  1828. - Boyer-Moore-Horspool (BMH) algorithm with Raita heuristic variation for longer needles.
  1829. - SIMD backends compare characters at multiple strategically chosen offsets within the needle to reduce degeneracy.
  1830. On very short needles, especially 1-4 characters long, brute force with SIMD is the fastest solution.
  1831. On mid-length needles, bit-parallel algorithms are effective, as the character masks fit into 32-bit or 64-bit words.
  1832. Either way, if the needle is under 64-bytes long, on haystack traversal we will still fetch every CPU cache line.
  1833. So the only way to improve performance is to reduce the number of comparisons.
  1834. > For 2-byte needles, see `sz_find_2byte_serial_` in `include/stringzilla/find.h`:
  1835. https://github.com/ashvardanian/StringZilla/blob/e1966de91600298d3c5cf4fe7be40d434f0f405e/include/stringzilla/find.h#L422-L463
  1836. Going beyond that, to long needles, Boyer-Moore (BM) and its variants are often the best choice.
  1837. It has two tables: the good-suffix shift and the bad-character shift.
  1838. Common choice is to use the simplified BMH algorithm, which only uses the bad-character shift table, reducing the pre-processing time.
  1839. We do the same for mid-length needles up to 256 bytes long.
  1840. That way the stack-allocated shift table remains small.
  1841. > For mid-length needles (≤256 bytes), see `sz_find_horspool_upto_256bytes_serial_` in `include/stringzilla/find.h`:
  1842. https://github.com/ashvardanian/StringZilla/blob/e1966de91600298d3c5cf4fe7be40d434f0f405e/include/stringzilla/find.h#L620-L667
  1843. In the C++ Standards Library, the `std::string::find` function uses the BMH algorithm with Raita's heuristic.
  1844. Before comparing the entire string, it matches the first, last, and the middle character.
  1845. Very practical, but can be slow for repetitive characters.
  1846. Both SWAR and SIMD backends of StringZilla have a cheap pre-processing step, where we locate unique characters.
  1847. This makes the library a lot more practical when dealing with non-English corpora.
  1848. > The offset selection heuristic is implemented in `sz_locate_needle_anomalies_` in `include/stringzilla/find.h`:
  1849. https://github.com/ashvardanian/StringZilla/blob/e1966de91600298d3c5cf4fe7be40d434f0f405e/include/stringzilla/find.h#L244-L305
  1850. All those, still, have $O(hn)$ worst case complexity.
  1851. To guarantee $O(h)$ worst case time complexity, the Apostolico-Giancarlo (AG) algorithm adds an additional skip-table.
  1852. Preprocessing phase is $O(n+sigma)$ in time and space.
  1853. On traversal, performs from $(h/n)$ to $(3h/2)$ comparisons.
  1854. It however, isn't practical on modern CPUs.
  1855. A simpler idea, the Galil-rule might be a more relevant optimizations, if many matches must be found.
  1856. Other algorithms previously considered and deprecated:
  1857. - Apostolico-Giancarlo algorithm for longer needles. _Control-flow is too complex for efficient vectorization._
  1858. - Shift-Or-based Bitap algorithm for short needles. _Slower than SWAR._
  1859. - Horspool-style bad-character check in SIMD backends. _Effective only for very long needles, and very uneven character distributions between the needle and the haystack. Faster "character-in-set" check needed to generalize._
  1860. > § Reading materials.
  1861. > [Exact String Matching Algorithms in Java](https://www-igm.univ-mlv.fr/~lecroq/string).
  1862. > [SIMD-friendly algorithms for substring searching](http://0x80.pl/articles/simd-strfind.html).
  1863. ### Exact Multiple Substring Search
  1864. Few algorithms for multiple substring search are known.
  1865. Most are based on the Aho-Corasick automaton, which is a generalization of the KMP algorithm.
  1866. The naive implementation, however:
  1867. - Allocates disjoint memory for each Trie node and Automaton state.
  1868. - Requires a lot of pointer chasing, limiting speculative execution.
  1869. - Has a lot of branches and conditional moves, which are hard to predict.
  1870. - Matches text a character at a time, which is slow on modern CPUs.
  1871. There are several ways to improve the original algorithm.
  1872. One is to use sparse DFA representation, which is more cache-friendly, but would require extra processing to navigate state transitions.
  1873. ### Levenshtein Edit Distance
  1874. Levenshtein distance is the best known edit-distance for strings, that checks, how many insertions, deletions, and substitutions are needed to transform one string to another.
  1875. It's extensively used in approximate string-matching, spell-checking, and bioinformatics.
  1876. The computational cost of the Levenshtein distance is $O(n * m)$, where $n$ and $m$ are the lengths of the string arguments.
  1877. To compute that, the naive approach requires $O(n * m)$ space to store the "Levenshtein matrix", the bottom-right corner of which will contain the Levenshtein distance.
  1878. The algorithm producing the matrix has been simultaneously studied/discovered by the Soviet mathematicians Vladimir Levenshtein in 1965, Taras Vintsyuk in 1968, and American computer scientists - Robert Wagner, David Sankoff, Michael J. Fischer in the following years.
  1879. Several optimizations are known:
  1880. 1. __Space Optimization__: The matrix can be computed in $O(min(n,m))$ space, by only storing the last two rows of the matrix.
  1881. 2. __Divide and Conquer__: Hirschberg's algorithm can be applied to decompose the computation into subtasks.
  1882. 3. __Automata__: Levenshtein automata can be effective, if one of the strings doesn't change, and is a subject to many comparisons.
  1883. 4. __Shift-Or__: Bit-parallel algorithms transpose the matrix into a bit-matrix, and perform bitwise operations on it.
  1884. The last approach is quite powerful and performant, and is used by the great [RapidFuzz][rapidfuzz] library.
  1885. It's less known, than the others, derived from the Baeza-Yates-Gonnet algorithm, extended to bounded edit-distance search by Manber and Wu in 1990s, and further extended by Gene Myers in 1999 and Heikki Hyyro between 2002 and 2004.
  1886. StringZilla focuses on a different approach, extensively used in Unum's internal combinatorial optimization libraries.
  1887. It doesn't change the number of trivial operations, but performs them in a different order, removing the data dependency, that occurs when computing the insertion costs.
  1888. StringZilla __evaluates diagonals instead of rows__, exploiting the fact that all cells within a diagonal are independent, and can be computed in parallel.
  1889. We'll store 3 diagonals instead of the 2 rows, and each consecutive diagonal will be computed from the previous two.
  1890. Substitution costs will come from the sooner diagonal, while insertion and deletion costs will come from the later diagonal.
  1891. <table>
  1892. <tr>
  1893. <td>
  1894. <strong>Row-by-Row Algorithm</strong><br>
  1895. Computing row 4:
  1896. <pre>
  1897. ∅ A B C D E
  1898. ∅ 0 1 2 3 4 5
  1899. P 1 ░ ░ ░ ░ ░
  1900. Q 2 ■ ■ ■ ■ ■
  1901. R 3 ■ ■ □ → .
  1902. S 4 . . . . .
  1903. T 5 . . . . .
  1904. </pre>
  1905. </td>
  1906. <td>
  1907. <strong>Anti-Diagonal Algorithm</strong><br>
  1908. Computing diagonal 5:
  1909. <pre>
  1910. ∅ A B C D E
  1911. ∅ 0 1 2 3 4 5
  1912. P 1 ░ ░ ■ ■ □
  1913. Q 2 ░ ■ ■ □ ↘
  1914. R 3 ■ ■ □ ↘ .
  1915. S 4 ■ □ ↘ . .
  1916. T 5 □ ↘ . . .
  1917. </pre>
  1918. </td>
  1919. </tr>
  1920. <tr>
  1921. <td colspan="2">
  1922. <strong>Legend:</strong><br>
  1923. <code>0,1,2,3...</code> = initialization constants &nbsp;&nbsp;
  1924. <code>░</code> = cells processed and forgotten &nbsp;&nbsp;
  1925. <code>■</code> = stored cells &nbsp;&nbsp;
  1926. <code>□</code> = computing in parallel &nbsp;&nbsp;
  1927. <code>→ ↘</code> = movement direction &nbsp;&nbsp;
  1928. <code>.</code> = cells to compute later
  1929. </td>
  1930. </tr>
  1931. </table>
  1932. This results in much better vectorization for intra-core parallelism and potentially multi-core evaluation of a single request.
  1933. Moreover, it's easy to generalize to weighted edit-distances, where the cost of a substitution between two characters may not be the same for all pairs, often used in bioinformatics.
  1934. > § Reading materials.
  1935. > [Faster Levenshtein Distances with a SIMD-friendly Traversal Order](https://ashvardanian.com/posts/levenshtein-diagonal).
  1936. [rapidfuzz]: https://github.com/rapidfuzz/RapidFuzz
  1937. ### Needleman-Wunsch and Smith-Waterman Scores for Bioinformatics
  1938. The field of bioinformatics studies various representations of biological structures.
  1939. The "primary" representations are generally strings over sparse alphabets:
  1940. - [DNA][faq-dna] sequences, where the alphabet is {A, C, G, T}, ranging from ~100 characters for short reads to 3 billion for the human genome.
  1941. - [RNA][faq-rna] sequences, where the alphabet is {A, C, G, U}, ranging from ~50 characters for tRNA to thousands for mRNA.
  1942. - [Proteins][faq-protein], where the alphabet is made of 22 amino acids, ranging from 2 characters for [dipeptide][faq-dipeptide] to 35,000 for [Titin][faq-titin], the longest protein.
  1943. The shorter the representation, the more often researchers may want to use custom substitution matrices.
  1944. Meaning that the cost of a substitution between two characters may not be the same for all pairs.
  1945. In the general case the serial algorithm is supposed to work for arbitrary substitution costs for each of 256×256 possible character pairs.
  1946. That lookup table, however, is too large to fit into CPU registers, so instead, the upcoming design focuses on 32×32 substitution matrices, which fit into 1 KB with single-byte "error costs".
  1947. That said, most [BLOSUM][faq-blosum] and [PAM][faq-pam] substitution matrices only contain 4-bit values, so they can be packed even further.
  1948. [faq-dna]: https://en.wikipedia.org/wiki/DNA
  1949. [faq-rna]: https://en.wikipedia.org/wiki/RNA
  1950. [faq-protein]: https://en.wikipedia.org/wiki/Protein
  1951. [faq-blosum]: https://en.wikipedia.org/wiki/BLOSUM
  1952. [faq-pam]: https://en.wikipedia.org/wiki/Point_accepted_mutation
  1953. [faq-dipeptide]: https://en.wikipedia.org/wiki/Dipeptide
  1954. [faq-titin]: https://en.wikipedia.org/wiki/Titin
  1955. Next design goals:
  1956. - [ ] Needleman-Wunsch Automata
  1957. ### Memory Copying, Fills, and Moves
  1958. A lot has been written about the time computers spend copying memory and how that operation is implemented in LibC.
  1959. Interestingly, the operation can still be improved, as most Assembly implementations use outdated instructions.
  1960. Even performance-oriented STL replacements, like Meta's [Folly v2024.09.23 focus on AVX2](https://github.com/facebook/folly/blob/main/folly/memset.S), and don't take advantage of the new masked instructions in AVX-512 or SVE.
  1961. In AVX-512, StringZilla uses non-temporal stores to avoid cache pollution, when dealing with very large strings.
  1962. Moreover, it handles the unaligned head and the tails of the `target` buffer separately, ensuring that writes in big copies are always aligned to cache-line boundaries.
  1963. That's true for both AVX2 and AVX-512 backends.
  1964. StringZilla also contains "drafts" of smarter, but less efficient algorithms, that minimize the number of unaligned loads, performing shuffles and permutations.
  1965. That's a topic for future research, as the performance gains are not yet satisfactory.
  1966. > § Reading materials.
  1967. > [`memset` benchmarks](https://github.com/nadavrot/memset_benchmark?tab=readme-ov-file) by Nadav Rotem.
  1968. > [Cache Associativity](https://en.algorithmica.org/hpc/cpu-cache/associativity/) by Sergey Slotin.
  1969. ### Hashing
  1970. StringZilla implements a high-performance 64-bit hash function inspired by the "AquaHash", "aHash", and "GxHash" design and optimized for modern CPU architectures.
  1971. The algorithm utilizes AES encryption rounds combined with shuffle-and-add operations to achieve exceptional mixing properties while maintaining consistent output across platforms.
  1972. It passes the rigorous SMHasher test suite, including the `--extra` flag with no collisions.
  1973. The core algorithm operates on a dual-state design:
  1974. - __AES State__: Initialized with seed `XOR`-ed against π constants.
  1975. - __Sum State__: Accumulates shuffled input data with a permutation.
  1976. For strings ≤64 bytes, a minimal state processes data in 16-byte blocks.
  1977. Longer strings employ a 4× wider state (512 bits) that processes 64-byte chunks, maximizing throughput on modern superscalar CPUs.
  1978. The algorithm can be expressed in pseudocode as:
  1979. ```
  1980. function sz_hash(text: u8[], length: usize, seed: u64) -> u64:
  1981. # 1024 bits worth of π constants
  1982. pi: u64[16] = [
  1983. 0x243F6A8885A308D3, 0x13198A2E03707344, 0xA4093822299F31D0, 0x082EFA98EC4E6C89,
  1984. 0x452821E638D01377, 0xBE5466CF34E90C6C, 0xC0AC29B7C97C50DD, 0x3F84D5B5B5470917,
  1985. 0x9216D5D98979FB1B, 0xD1310BA698DFB5AC, 0x2FFD72DBD01ADFB7, 0xB8E1AFED6A267E96,
  1986. 0xBA7C9045F12C7F99, 0x24A19947B3916CF7, 0x0801F2E2858EFC16, 0x636920D871574E69]
  1987. # Permutation order for the sum state
  1988. shuffle_pattern: u8[16] = [
  1989. 0x04, 0x0b, 0x09, 0x06, 0x08, 0x0d, 0x0f, 0x05,
  1990. 0x0e, 0x03, 0x01, 0x0c, 0x00, 0x07, 0x0a, 0x02]
  1991. # Initialize key and states
  1992. keys_u64s: u64[2] = [seed, seed]
  1993. aes_u64s: u64[2] = [seed ⊕ pi[0], seed ⊕ pi[1]]
  1994. sum_u64s: u64[2] = [seed ⊕ pi[8], seed ⊕ pi[9]]
  1995. if length ≤ 64:
  1996. # Small input: process 1-4 zero-padded blocks of 16 bytes each
  1997. blocks_u8s: u8[16][] = split_into_blocks(text, length, 16)
  1998. for each block_u8s: u8[16] in blocks_u8s:
  1999. aes_u64s = AESENC(aes_u64s, block_u8s)
  2000. sum_u64s = SHUFFLE(sum_u64s, shuffle_pattern) + block_u8s
  2001. else:
  2002. # Large input: use 4× wider 512-bits states
  2003. aes_u64s: u64[8] = [
  2004. seed ⊕ pi[0], seed ⊕ pi[1], seed ⊕ pi[2], seed ⊕ pi[3],
  2005. seed ⊕ pi[4], seed ⊕ pi[5], seed ⊕ pi[6], seed ⊕ pi[7]]
  2006. sum_u64s: u64[8] = [
  2007. seed ⊕ pi[8], seed ⊕ pi[9], seed ⊕ pi[10], seed ⊕ pi[11],
  2008. seed ⊕ pi[12], seed ⊕ pi[13], seed ⊕ pi[14], seed ⊕ pi[15]]
  2009. # Process 64-byte chunks (4×16-byte blocks)
  2010. for each chunk_u8s: u8[64] in text:
  2011. blocks_u8s: u8[16][4] = split_chunk_into_4_blocks(chunk_u8s)
  2012. for i in 0..3:
  2013. offset: usize = i * 2 # Each lane stores two u64s
  2014. aes_u64s[offset:offset+1] = AESENC(aes_u64s[offset:offset+1], blocks_u8s[i])
  2015. sum_u64s[offset:offset+1] = SHUFFLE(sum_u64s[offset:offset+1], shuffle_pattern) + blocks_u8s[i]
  2016. # Fold 8×u64 state back to 2×u64 for finalization
  2017. aes_u64s: u64[2] = fold_to_2u64(aes_u64s)
  2018. sum_u64s: u64[2] = fold_to_2u64(sum_u64s)
  2019. # Finalization: mix length into key
  2020. key_with_length: u64[2] = [keys_u64s[0] + length, keys_u64s[1]]
  2021. # Multiple AES rounds for SMHasher compliance
  2022. mixed_u64s: u64[2] = AESENC(sum_u64s, aes_u64s)
  2023. result_u64s: u64[2] = AESENC(AESENC(mixed_u64s, key_with_length), mixed_u64s)
  2024. return result_u64s[0] # Extract low 64 bits
  2025. ```
  2026. This allows us to balance several design trade-offs.
  2027. First, it allows us to achieve a high port-level parallelism.
  2028. Looking at AVX-512 capable CPUs and their ZMM instructions, on each cycle, we'll have at least 2 ports busy when dealing with long strings:
  2029. - `VAESENC`: 5 cycles on port 0 on Intel Ice Lake, 4 cycles on ports 0/1 on AMD Zen4.
  2030. - `VPSHUFB_Z`: 3 cycles on port 5 on Intel Ice Lake, 2 cycles on ports 1/2 on AMD Zen4.
  2031. - `VPADDQ`: 1 cycle on ports 0/5 on Intel Ice Lake, 1 cycle on ports 0/1/2/3 on AMD Zen4.
  2032. When dealing with smaller strings, we design our approach to avoid large registers and maintain the CPU at the same energy state, thereby avoiding downclocking and expensive power-state transitions.
  2033. Unlike some AES-accelerated alternatives, the length of the input is not mixed into the AES block at the start to allow incremental construction, when the final length is not known in advance.
  2034. Also, unlike some alternatives, with "masked" AVX-512 and "predicated" SVE loads, we avoid expensive block-shuffling procedures on non-divisible-by-16 lengths.
  2035. > § Reading materials.
  2036. > [Stress-testing hash functions for avalance behaviour, collision bias, and distribution](https://github.com/ashvardanian/HashEvals).
  2037. ### SHA-256 Checksums
  2038. In addition to the fast AES-based hash, StringZilla implements hardware-accelerated SHA-256 cryptographic checksums.
  2039. The implementation follows the [FIPS 180-4 specification](https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf) and provides multiple backends.
  2040. ### Random Generation
  2041. StringZilla implements a fast [Pseudorandom Number Generator][faq-prng] inspired by the "AES-CTR-128" algorithm, reusing the same AES primitives as the hash function.
  2042. Unlike "NIST SP 800-90A" which uses multiple AES rounds, StringZilla uses only one round of AES mixing for performance while maintaining reproducible output across platforms.
  2043. The generator operates in counter mode with `AESENC(nonce + lane_index, nonce ⊕ pi_constants)`, rotating through the first 512 bits of π for each 16-byte block.
  2044. The only state required to reproduce an output is a 64-bit `nonce`, which is much cheaper than a Mersenne Twister.
  2045. [faq-prng]: https://en.wikipedia.org/wiki/Pseudorandom_number_generator
  2046. ### Sorting
  2047. For lexicographic sorting of string collections, StringZilla exports pointer-sized n‑grams ("pgrams") into a contiguous buffer to improve locality, then recursively QuickSorts those pgrams with a 3‑way partition and dives into equal pgrams to compare deeper characters.
  2048. Very small inputs fall back to insertion sort.
  2049. - Average time complexity: O(n log n)
  2050. - Worst-case time complexity: quadratic (due to QuickSort), mitigated in practice by 3‑way partitioning and the n‑gram staging
  2051. ### Unicode 17, UTF-8, and Wide Characters
  2052. Most StringZilla operations are byte-level, so they work well with ASCII and UTF-8 content out of the box.
  2053. In some cases, like edit-distance computation, the result of byte-level evaluation and character-level evaluation may differ.
  2054. - `szs_levenshtein_distances_utf8("αβγδ", "αγδ") == 1` — one unicode symbol.
  2055. - `szs_levenshtein_distances("αβγδ", "αγδ") == 2` — one unicode symbol is two bytes long.
  2056. Java, JavaScript, Python 2, C#, and Objective-C, however, use wide characters (`wchar`) - two byte long codes, instead of the more reasonable fixed-length UTF-32 or variable-length UTF-8.
  2057. This leads [to all kinds of offset-counting issues][wide-char-offsets] when facing four-byte long Unicode characters.
  2058. StringZilla uses proper 32-bit "runes" to represent unpacked Unicode codepoints, ensuring correct results in all operations.
  2059. Moreover, it implements the Unicode 17.0 standard, being practically the only library besides ICU and PCRE2 to do so, but with order(s) of magnitude better performance.
  2060. [wide-char-offsets]: https://josephg.com/blog/string-length-lies/
  2061. ### Case-Folding and Case-Insensitive Search
  2062. StringZilla provides Unicode-aware case-insensitive substring search that handles the full complexity of Unicode case folding.
  2063. This includes multi-character expansions:
  2064. | Character | Codepoint | UTF-8 Bytes | Case-Folds To | Result Bytes |
  2065. | --------- | --------- | ----------- | ------------- | ------------ |
  2066. | `ß` | U+00DF | C3 9F | `ss` | 73 73 |
  2067. | `ffi` | U+FB03 | EF AC 83 | `ffi` | 66 66 69 |
  2068. | `İ` | U+0130 | C4 B0 | `i` + `◌̇` | 69 CC 87 |
  2069. The search returns byte offsets and lengths in the original haystack, correctly handling length differences.
  2070. For example, searching for `"STRASSE"` (7 bytes) in `"Straße"` (7 bytes: 53 74 72 61 C3 9F 65) succeeds because both case-fold to `"strasse"`.
  2071. Note that Turkish `İ` and ASCII `I` are distinct: `İstanbul` case-folds to `i̇stanbul` (with combining dot), while `ISTANBUL` case-folds to `istanbul` (without).
  2072. They will not match each other — this is correct Unicode behavior for Turkish locale handling.
  2073. For wide-character environments (Java, JavaScript, Python 2, C#), consider transcoding with [simdutf](https://github.com/simdutf/simdutf).
  2074. ## Dynamic Dispatch
  2075. Due to the high-level of fragmentation of SIMD support in different CPUs, StringZilla uses the names of select Intel and ARM CPU generations for its backends.
  2076. You can query supported backends and use them manually.
  2077. Use it to guarantee constant performance, or to explore how different algorithms scale on your hardware.
  2078. ```c
  2079. sz_find(text, length, pattern, 3); // Auto-dispatch
  2080. sz_find_westmere(text, length, pattern, 3); // Intel Westmere+ SSE4.2
  2081. sz_find_haswell(text, length, pattern, 3); // Intel Haswell+ AVX2
  2082. sz_find_skylake(text, length, pattern, 3); // Intel Skylake+ AVX-512
  2083. sz_find_neon(text, length, pattern, 3); // Arm NEON 128-bit
  2084. sz_find_sve(text, length, pattern, 3); // Arm SVE 128/256/512/1024/2048-bit
  2085. ```
  2086. StringZilla automatically picks the most advanced backend for the given CPU.
  2087. Similarly, in Python, you can log the auto-detected capabilities:
  2088. ```python
  2089. python -c "import stringzilla; print(stringzilla.__capabilities__)" # ('serial', 'westmere', 'haswell', 'skylake', 'ice', 'neon', 'sve', 'sve2+aes')
  2090. python -c "import stringzilla; print(stringzilla.__capabilities_str__)" # "haswell, skylake, ice, neon, sve, sve2+aes"
  2091. ```
  2092. You can also explicitly set the backend to use, or scope the backend to a specific function.
  2093. ```python
  2094. import stringzilla as sz
  2095. sz.reset_capabilities(('serial',)) # Force SWAR backend
  2096. sz.reset_capabilities(('haswell',)) # Force AVX2 backend
  2097. sz.reset_capabilities(('neon',)) # Force NEON backend
  2098. sz.reset_capabilities(sz.__capabilities__) # Reset to auto-dispatch
  2099. ```
  2100. ## Contributing 👾
  2101. Please check out the [contributing guide](https://github.com/ashvardanian/StringZilla/blob/main/CONTRIBUTING.md) for more details on how to set up the development environment and contribute to this project.
  2102. If you like this project, you may also enjoy [USearch][usearch], [UCall][ucall], [UForm][uform], and [SimSIMD][simsimd]. 🤗
  2103. [usearch]: https://github.com/unum-cloud/usearch
  2104. [ucall]: https://github.com/unum-cloud/ucall
  2105. [uform]: https://github.com/unum-cloud/uform
  2106. [simsimd]: https://github.com/ashvardanian/simsimd
  2107. If you like strings and value efficiency, you may also enjoy the following projects:
  2108. - [simdutf](https://github.com/simdutf/simdutf) - transcoding UTF-8, UTF-16, and UTF-32 LE and BE.
  2109. - [hyperscan](https://github.com/intel/hyperscan) - regular expressions with SIMD acceleration.
  2110. - [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Aho-Corasick algorithm in Python.
  2111. - [rapidfuzz](https://github.com/rapidfuzz/RapidFuzz) - fast string matching in C++ and Python.
  2112. - [memchr](https://github.com/BurntSushi/memchr) - fast string search in Rust.
  2113. If you are looking for more reading materials on this topic, consider the following:
  2114. - [5x faster strings with SIMD & SWAR](https://ashvardanian.com/posts/stringzilla/).
  2115. - [The Painful Pitfalls of C++ STL Strings](https://ashvardanian.com/posts/painful-strings/).
  2116. - [Processing Strings 109x Faster than Nvidia on H100](https://ashvardanian.com/posts/stringwars-on-gpus/).
  2117. - [How a String Library Beat OpenCV at Image Processing by 4x](https://ashvardanian.com/posts/image-processing-with-strings/).
  2118. - [2x Faster Hashes on AWS Graviton: NEON → SVE2](https://ashvardanian.com/posts/aws-graviton-checksums-on-neon-vs-sve/).
  2119. ## License 📜
  2120. Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.