Metadata-Version: 2.4
Name: rfc3987-syntax
Version: 1.1.0
Summary: Helper functions to syntactically validate strings according to RFC 3987.
Project-URL: Homepage, https://github.com/willynilly/rfc3987-syntax
Project-URL: Documentation, https://github.com/willynilly/rfc3987-syntax#readme
Project-URL: Issues, https://github.com/willynilly/rfc3987-syntax/issues
Project-URL: Source, https://github.com/willynilly/rfc3987-syntax
Author: Jan Kowalleck
Author-email: Will Riley <wanderingwill@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: RFC 3987,RFC3987,parser,syntax,validator
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Requires-Dist: lark>=1.2.2
Provides-Extra: testing
Requires-Dist: pytest>=8.3.5; extra == 'testing'
Description-Content-Type: text/markdown

# rfc3987-syntax

Helper functions to parse and validate the **syntax** of terms defined in **[RFC 3987](https://www.rfc-editor.org/info/rfc3987)**, the IETF standard for Internationalized Resource Identifiers (IRIs).

## 🎯 Purpose

The goal of `rfc3987-syntax` is to provide a **lightweight, permissively licensed Python module** for validating that strings conform to the **ABNF grammar defined in RFC 3987**. These helpers are:

- ✅ Strictly aligned with the **syntax rules of RFC 3987**
- ✅ Built using a **permissive MIT license**
- ✅ Designed for both **open source and proprietary use**
- ✅ Powered by [Lark](https://github.com/lark-parser/lark), a fast, EBNF-based parser

> 🧠 **Note:** This project focuses on **syntax validation only**. RFC 3987 specifies **additional semantic rules** (e.g., Unicode normalization, BiDi constraints, percent-encoding requirements) that must be enforced separately.

## 📄 License, Attribution, and Citation

**`rfc3987-syntax`** is licensed under the [MIT License](LICENSE), which allows reuse in both open source and commercial software.

This project:

- ❌ Does **not** depend on the `rfc3987` Python package (GPL-licensed)
- ✅ Uses [`lark`](https://github.com/lark-parser/lark), licensed under MIT
- ✅ Implements grammar from **[RFC 3987](https://datatracker.ietf.org/doc/html/rfc3987)**, using **[RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986)** where RFC 3987 delegates syntax

> ⚠️ This project is **not affiliated with or endorsed by** the authors of RFC 3987 or the `rfc3987` Python package.

Please cite this software in accordance with the enclosed CITATION.cff file.

## ⚠️ Limitations

The grammar and parser enforce **only the ABNF syntax** defined in RFC 3987. The following are **not validated** and must be handled separately for full compliance:

- ❌ Unicode **Normalization Form C (NFC)**
- ❌ Bidirectional text (**BiDi**) constraints (RFC 3987 §4.1)
- ❌ **Port number ranges** (must be 0–65535)
- ❌ Valid **IPv6 compression** (only one `::`, max segments)
- ❌ Context-aware **percent-encoding** requirements

ChatGPT 4o was used during the original development process. Errors may exist due to this assistance. Additional review, testing, and bug fixes by human experts are welcome.
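
Two of the checks listed in this section, NFC normalization and the port range, could be layered on top of syntax validation. The following is a minimal sketch using only the Python standard library. The helper names `is_nfc` and `port_in_range` are hypothetical and are not part of `rfc3987-syntax`. Note also that `urllib.parse.urlsplit` is a lenient URI splitter, not an RFC 3987 parser, so it is used here only to extract the port.

```python
import unicodedata
from urllib.parse import urlsplit


def is_nfc(value: str) -> bool:
    """Return True if the string is already in Unicode Normalization Form C."""
    # Hypothetical helper; unicodedata.is_normalized requires Python 3.8+.
    return unicodedata.is_normalized("NFC", value)


def port_in_range(value: str) -> bool:
    """Return True if the port is absent or lies within 0-65535."""
    # Hypothetical helper; relies on urlsplit's own port checking.
    try:
        # Reading .port raises ValueError for non-numeric or out-of-range ports.
        urlsplit(value).port
    except ValueError:
        return False
    return True
```

A caller might first run `is_valid_syntax_iri(value)` and apply these helpers only to strings that pass. BiDi constraints, IPv6 compression, and context-aware percent-encoding would still need separate handling.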

## 📦 Installation

```bash
pip install rfc3987-syntax
```

## 🛠 Usage

### List all supported "terms" (i.e., non-terminals and terminals within ABNF production rules) used to validate the syntax of an IRI according to RFC 3987

```python
from rfc3987_syntax import RFC3987_SYNTAX_TERMS

print("Supported terms:")
for term in RFC3987_SYNTAX_TERMS:
    print(term)
```

### Syntactically validate a string using the general-purpose validator

```python
from rfc3987_syntax import is_valid_syntax

# An absolute IRI with a scheme is valid for the 'iri' term.
if is_valid_syntax(term='iri', value='http://github.com'):
    print("✓ Valid IRI syntax")

# 'bob' has no scheme, so it is not a valid 'iri'.
if not is_valid_syntax(term='iri', value='bob'):
    print("✗ Invalid IRI syntax")

# 'bob' is, however, a valid relative reference.
if is_valid_syntax(term='iri_reference', value='bob'):
    print("✓ Valid IRI-reference syntax")
```

### Alternatively, use term-specific helpers to validate RFC 3987 syntax.

```python
from rfc3987_syntax import is_valid_syntax_iri
from rfc3987_syntax import is_valid_syntax_iri_reference

if is_valid_syntax_iri('http://github.com'):
    print("✓ Valid IRI syntax")

if not is_valid_syntax_iri('bob'):
    print("✗ Invalid IRI syntax")

if is_valid_syntax_iri_reference('bob'):
    print("✓ Valid IRI-reference syntax")
```

### Get the Lark parse tree for a syntax validation (useful for additional semantic validation)

```python
from lark import ParseTree  # type used in the annotation below
from rfc3987_syntax import parse

ptree: ParseTree = parse(term="iri", value="http://github.com")
print(ptree)
```

## 📚 Sources

This grammar was derived from:

- **[RFC 3987 – Internationalized Resource Identifiers (IRIs)]**
  → Defines IRI syntax and extensions to URI (e.g. Unicode characters, `ucschar`)
  → https://datatracker.ietf.org/doc/html/rfc3987

- **[RFC 3986 – Uniform Resource Identifier (URI): Generic Syntax]**
  → Provides reusable components like `scheme`, `authority`, `ipv4address`, etc.
  → https://datatracker.ietf.org/doc/html/rfc3986

> 📝 When `RFC 3986` is listed as the source, it is **used in accordance with RFC 3987**, which explicitly references it for foundational elements.

### Rule-to-Source Mapping

| Rule/Component             | Source   | Notes                           |
|----------------------------|----------|---------------------------------|
| `iri`                      | RFC 3987 | Top-level IRI rule              |
| `iri_reference`            | RFC 3987 | Top-level IRI Reference rule    |
| `absolute_iri`             | RFC 3987 | Top-level Absolute IRI rule     |
| `scheme`                   | RFC 3986 | Referenced by RFC 3987 §2.2     |
| `ihier_part`               | RFC 3987 | IRI-specific hierarchy          |
| `irelative_ref`            | RFC 3987 | IRI-specific relative ref       |
| `irelative_part`           | RFC 3987 | IRI-specific relative part      |
| `iauthority`               | RFC 3986 | Standard URI authority          |
| `ipath_abempty`            | RFC 3986 | Path format variant             |
| `ipath_absolute`           | RFC 3986 | Absolute path                   |
| `ipath_noscheme`           | RFC 3986 | Path disallowing scheme prefix  |
| `ipath_rootless`           | RFC 3986 | Used in non-scheme contexts     |
| `iquery`                   | RFC 3987 | Query extension to URI          |
| `ifragment`                | RFC 3987 | Fragment extension to URI       |
| `ipchar`, `isegment`       | RFC 3986 | Path characters and segments    |
| `isegment_nz_nc`           | RFC 3987 | IRI-specific path constraint    |
| `iunreserved`              | RFC 3987 | Includes `ucschar`              |
| `ucschar`, `iprivate`      | RFC 3987 | Unicode support                 |
| `sub_delims`               | RFC 3986 | Reserved characters             |
| `ip_literal`               | RFC 3986 | IPv6 or IPvFuture in `[]`       |
| `ipv6address`              | RFC 3986 | Expanded forms only             |
| `ipvfuture`                | RFC 3986 | Forward-compatible              |
| `ipv4address`              | RFC 3986 | Dotted-decimal IPv4             |
| `ls32`                     | RFC 3986 | Final 32 bits of IPv6           |
| `h16`, `dec_octet`         | RFC 3986 | Hex and decimal chunks          |
| `port`                     | RFC 3986 | Optional numeric                |
| `pct_encoded`              | RFC 3986 | Percent encoding (e.g. `%20`)   |
| `alpha`, `digit`, `hexdig` | RFC 3986 | Character classes               |