Marketplace Model: Named Entity Recognition (NER) Standard


This model is a lightweight instruction-tuned Named Entity Recognition (NER) system that extracts entities from English text and returns results in a structured JSON format.
It was trained on a curated dataset of synthetic and human-verified examples, covering four common entity categories:
• PER: Persons
• ORG: Organizations
• LOC: Locations
• MISC: Miscellaneous entities (events, works of art, awards, treaties, products, etc.)

The model is optimized for instruction-following use cases, where the input is a natural language task instruction plus a passage of text, and the output is a machine-parseable JSON object containing entity lists.



Intended Uses
• Use cases:
  • Quick prototyping of NER applications.
  • Building pipelines for entity extraction in compliance, analytics, or content management.
  • Demonstrating instruction-following NER behavior for small LLMs.
  • Teaching or experimenting with instruction-tuned models.
• Users: Developers, researchers, educators, and students exploring NER systems.
• Input: English text, paired with an instruction.
• Output: JSON-formatted entity annotations with four keys: PER, ORG, LOC, MISC.
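Because the model expects an instruction paired with a passage, the request payload can be assembled programmatically. The sketch below is illustrative, not part of the model's API; the `build_request` helper name and the default instruction string (taken from the example further down) are assumptions.

```python
import json

DEFAULT_INSTRUCTION = (
    "Parse the following text to identify the named entities and "
    "return a parseable JSON with the keyed named entities."
)

def build_request(text, instruction=DEFAULT_INSTRUCTION):
    """Assemble the Instruction/Input JSON payload the model expects."""
    return json.dumps({"Instruction": instruction, "Input": text})

payload = build_request("Marie Curie worked at the University of Paris.")
```

The resulting string can then be sent to the model as its prompt.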

Example

Input

{
"Instruction": "Parse the following text to identify the named entities and return a parseable JSON with the keyed named entities.",
"Input": "In July 1969, Neil Armstrong and Buzz Aldrin became the first humans to walk on the Moon as part of NASA's Apollo 11 mission."
}

Output

{
"PER": ["Neil Armstrong", "Buzz Aldrin"],
"ORG": ["NASA"],
"LOC": ["Moon"],
"MISC": ["Apollo 11"]
}
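On the consuming side, the model's reply can be parsed back into Python lists. The helper below is a minimal sketch, not a published utility: it assumes the four documented keys and, since small causal LMs occasionally emit trailing text after the JSON object, uses `raw_decode` to stop at the end of the first complete object and fills any missing keys with empty lists.

```python
import json

ENTITY_KEYS = ("PER", "ORG", "LOC", "MISC")

def parse_entities(raw):
    """Extract the first JSON object from raw model output and normalise it."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object in model output")
    # raw_decode tolerates trailing tokens after the closing brace.
    obj, _ = json.JSONDecoder().raw_decode(raw[start:])
    # Guarantee all four entity keys, defaulting to empty lists.
    return {key: list(obj.get(key, [])) for key in ENTITY_KEYS}

reply = ('{"PER": ["Neil Armstrong", "Buzz Aldrin"], "ORG": ["NASA"], '
         '"LOC": ["Moon"], "MISC": ["Apollo 11"]}')
entities = parse_entities(reply)
```

Normalising the output this way keeps downstream code simple even when the model omits a category.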



Training Data
• Dataset: Synthetic Named Entity Recognition (NER) Examples
• Size: 30,000 instruction-based records (extendable by synthetic generation).
• Languages: English only.
• Annotation schema: Four classes — PER, ORG, LOC, MISC.
• Data type: Instruction-Input-Response triplets with machine-parseable JSON outputs.
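A single training record in the Instruction-Input-Response shape described above might be constructed as follows. This is a sketch under assumptions: the field names mirror the example earlier in this card, the `make_record` helper is hypothetical, and JSONL is assumed as the storage format since it is common for SFT datasets.

```python
import json

def make_record(instruction, passage, entities):
    """One Instruction-Input-Response triplet; the response is itself
    a machine-parseable JSON string, as described in the schema."""
    return {
        "Instruction": instruction,
        "Input": passage,
        "Response": json.dumps(entities),
    }

record = make_record(
    "Parse the following text to identify the named entities and "
    "return a parseable JSON with the keyed named entities.",
    "Ada Lovelace collaborated with Charles Babbage in London.",
    {"PER": ["Ada Lovelace", "Charles Babbage"], "ORG": [],
     "LOC": ["London"], "MISC": []},
)
line = json.dumps(record)  # one line of a JSONL training file
```

Extending the dataset by synthetic generation, as the card mentions, amounts to emitting more such lines.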



Model Details
• Architecture: LlamaForCausalLM.
• Objective: Instruction-based NER with structured JSON responses.
• Release: v1.0 (2025).



Performance
• Strengths:
  • Consistent JSON formatting for easy downstream use.
  • Handles both short and long-form passages.
  • Balanced examples across PER, ORG, LOC, and MISC.
• Limitations:
  • Trained on a 30,000-example synthetic dataset.
  • English-only; not suitable for multilingual use.
  • Entity boundaries may differ from those in standard datasets (e.g., CoNLL, OntoNotes).
  • Not benchmarked against large-scale evaluation suites.



Ethical Considerations
• Data: Fully synthetic, no private or sensitive information included.
• Bias: Minimal risk due to synthetic nature, but coverage is limited to curated examples.
• Usage: Should not be used in high-stakes contexts (legal, medical, compliance-critical systems) without further training and validation.

Basic Information

Base Model: Standard Base
Created by: Michaelminibase
Times imported: 1,115
Released: Oct 1, 2025
Model Size: 368 MB
Model Type: Causal Language Model
Format: HIGH

Technical Details

Hidden Size: 960
Hidden Layers: 32
Attention Heads: 15
Vocabulary Size: 49,152
Max Context Length: 8,192 tokens
Precision: BFloat16 (BF16)
Learning Rate: 0.000050
Training Epochs: 3
Effective Batch Size: 16
Optimizer: AdamW

Training Datasets

Name                                     Type  Examples  Size
Named Entity Recognition (NER)           SFT   10,000    4.6 MB
Named Entity Recognition (NER) (part 2)  SFT   10,000    4.5 MB
Named Entity Recognition (NER) (part 3)  SFT   10,000    5.5 MB