Marketplace Model: Named Entity Recognition (NER) Standard
From Minibase
This model is a lightweight instruction-tuned Named Entity Recognition (NER) system that extracts entities from English text and returns results in a structured JSON format.
It was trained on a curated dataset of synthetic and human-verified examples, covering four common entity categories:
• PER: Persons
• ORG: Organizations
• LOC: Locations
• MISC: Miscellaneous entities (events, works of art, awards, treaties, products, etc.)
The model is optimized for instruction-following use cases, where the input is a natural language task instruction plus a passage of text, and the output is a machine-parseable JSON object containing entity lists.
⸻
Intended Uses
• Use cases:
  • Quick prototyping of NER applications.
  • Building pipelines for entity extraction in compliance, analytics, or content management.
  • Demonstrating instruction-following NER behavior for small LLMs.
  • Teaching or experimenting with instruction-tuned models.
• Users: Developers, researchers, educators, and students exploring NER systems.
• Input: English text, paired with an instruction.
• Output: JSON-formatted entity annotations with four keys: PER, ORG, LOC, MISC.
Example
Input
{
"Instruction": "Parse the following text to identify the named entities and return a parseable JSON with the keyed named entities.",
"Input": "In July 1969, Neil Armstrong and Buzz Aldrin became the first humans to walk on the Moon as part of NASA's Apollo 11 mission."
}
Output
{
"PER": ["Neil Armstrong", "Buzz Aldrin"],
"ORG": ["NASA"],
"LOC": ["Moon"],
"MISC": ["Apollo 11"]
}
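Because downstream code depends on the four-key schema, it helps to validate the model's reply before use. The following is a minimal sketch; `parse_ner_output` is a hypothetical helper, not part of the model's tooling.

```python
import json

EXPECTED_KEYS = {"PER", "ORG", "LOC", "MISC"}

def parse_ner_output(raw: str) -> dict[str, list[str]]:
    """Parse the model's JSON reply and enforce the four-key schema.

    Missing keys are filled with empty lists so downstream code can rely
    on a fixed shape; unexpected keys raise an error.
    """
    data = json.loads(raw)
    extra = set(data) - EXPECTED_KEYS
    if extra:
        raise ValueError(f"unexpected keys in model output: {extra}")
    return {key: list(data.get(key, [])) for key in sorted(EXPECTED_KEYS)}

# The reply from the example above, as a raw string.
raw_reply = (
    '{"PER": ["Neil Armstrong", "Buzz Aldrin"], "ORG": ["NASA"], '
    '"LOC": ["Moon"], "MISC": ["Apollo 11"]}'
)
entities = parse_ner_output(raw_reply)
```

Filling absent keys with empty lists keeps pipelines simple: consumers can iterate over all four categories without per-key existence checks.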
⸻
Training Data
• Dataset: Synthetic Named Entity Recognition (NER) Examples
• Size: 30,000 instruction-based records (extendable by synthetic generation).
• Languages: English only.
• Annotation schema: Four classes — PER, ORG, LOC, MISC.
• Data type: Instruction-Input-Response triplets with machine-parseable JSON outputs.
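The Instruction-Input-Response layout can be illustrated with a single record. Field names below mirror the example in this card; the passage and the exact on-disk serialization are assumptions for illustration, not actual dataset contents.

```python
import json

# One hypothetical SFT record in the Instruction-Input-Response layout.
# The passage and serialization are illustrative assumptions.
record = {
    "Instruction": "Parse the following text to identify the named entities "
                   "and return a parseable JSON with the keyed named entities.",
    "Input": "Marie Curie received the Nobel Prize in Physics in 1903.",
    "Response": json.dumps({
        "PER": ["Marie Curie"],
        "ORG": [],
        "LOC": [],
        "MISC": ["Nobel Prize in Physics"],
    }),
}
```

Storing the response as a serialized JSON string (rather than a nested object) matches how a causal language model emits it: as plain text the trainer can compare token-by-token.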
⸻
Model Details
• Architecture: LlamaForCausalLM.
• Objective: Instruction-based NER with structured JSON responses.
• Release: v1.0 (2025).
⸻
Performance
• Strengths:
  • Consistent JSON formatting for easy downstream use.
  • Handles both short and long-form passages.
  • Balanced examples across PER, ORG, LOC, and MISC.
• Limitations:
  • Trained on a 30,000-example synthetic dataset.
  • English-only; not suitable for multilingual use.
  • Entity boundaries may differ from those in standard datasets (e.g., CoNLL, OntoNotes).
  • Not benchmarked against large-scale evaluation suites.
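Since entity boundaries may differ from CoNLL-style references, a quick entity-level comparison can quantify the gap on your own data. This is a minimal sketch using exact (type, span) string matching, not an official evaluation script.

```python
def entity_prf(gold: dict, pred: dict) -> tuple[float, float, float]:
    """Entity-level precision, recall, and F1 with exact (type, span) matching."""
    gold_set = {(t, e) for t, ents in gold.items() for e in ents}
    pred_set = {(t, e) for t, ents in pred.items() for e in ents}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold annotations vs. a hypothetical model reply that misses "Moon".
gold = {"PER": ["Neil Armstrong", "Buzz Aldrin"], "ORG": ["NASA"],
        "LOC": ["Moon"], "MISC": ["Apollo 11"]}
pred = {"PER": ["Neil Armstrong", "Buzz Aldrin"], "ORG": ["NASA"],
        "LOC": [], "MISC": ["Apollo 11"]}
precision, recall, f1 = entity_prf(gold, pred)
```

Exact matching is strict: a boundary difference such as "the Moon" vs. "Moon" counts as both a false positive and a false negative, which is exactly the kind of discrepancy the limitation above describes.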
⸻
Ethical Considerations
• Data: Fully synthetic, no private or sensitive information included.
• Bias: The synthetic data lowers exposure to real-world bias, but coverage is limited to the curated examples.
• Usage: Should not be used in high-stakes contexts (legal, medical, compliance-critical systems) without further training and validation.
Basic Information
• Base Model: Standard Base
• Created by: Michaelminibase
• Times imported: 1,115
• Released: Oct 1, 2025
• Model Size: 368 MB
• Model Type: Causal Language Model
• Format: HIGH
Technical Details
• Hidden Size: 960
• Hidden Layers: 32
• Attention Heads: 15
• Vocabulary Size: 49,152
• Max Context Length: 8,192 tokens
• Precision: BFloat16 (BF16)
• Learning Rate: 0.000050 (5e-5)
• Training Epochs: 3
• Effective Batch Size: 16
• Optimizer: AdamW
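An effective batch size of 16 is typically the product of the per-device micro-batch size and the number of gradient-accumulation steps; the 4 × 4 split below is an assumption for illustration, as the card does not state it.

```python
# Hypothetical split of the reported effective batch size of 16.
per_device_batch = 4   # assumed micro-batch per optimizer step
grad_accum_steps = 4   # assumed gradient-accumulation steps
effective_batch = per_device_batch * grad_accum_steps

# Rough optimizer-step count for 3 epochs over the 30,000 training examples.
examples, epochs = 30_000, 3
steps_per_epoch = examples // effective_batch
total_steps = steps_per_epoch * epochs
```

At 30,000 examples, this works out to 1,875 optimizer steps per epoch and 5,625 steps over the full 3-epoch run, which is useful when reasoning about learning-rate schedules.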
Training Datasets
| Name | Type | Examples | Size |
|---|---|---|---|
| Named Entity Recognition (NER) | SFT | 10,000 | 4.6 MB |
| Named Entity Recognition (NER) (part 2) | SFT | 10,000 | 4.5 MB |
| Named Entity Recognition (NER) (part 3) | SFT | 10,000 | 5.5 MB |