glossagen.pipelines package¶
Submodules¶
glossagen.pipelines.generate_glossary module¶
Module for generating a glossary based on a research document.
- class glossagen.pipelines.generate_glossary.GlossaryGenerator(research_doc: ResearchDoc, chunk_size: int = 20000)[source]¶
Bases:
object
A class that generates a glossary based on a research document.
Attributes¶
research_doc (ResearchDoc): The research document to generate the glossary from.
glossary_predictor (dspy.Predict): The predictor used to generate the glossary.
Methods¶
generate_glossary: Generates the glossary based on the research document.
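The constructor's chunk_size parameter (default 20000) suggests the document text is split into pieces before prediction. The following is only a minimal sketch of such a split, assuming chunk_size counts characters (the unit is not stated here); split_into_chunks is a hypothetical helper and not part of glossagen.

```python
def split_into_chunks(text: str, chunk_size: int = 20000) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = split_into_chunks("a" * 45000)
print([len(chunk) for chunk in chunks])  # [20000, 20000, 5000]
```

A fixed-width split like this can cut a term in half at a chunk boundary; whether the actual pipeline handles that case is not documented on this page.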
- deduplicate_entries(glossary: list[TerminusTechnicus]) list[TerminusTechnicus] [source]¶
Deduplicate the glossary entries by considering plurals and similar-sounding terms.
- format_nicely(glossary: list[TerminusTechnicus]) str [source]¶
Format the glossary nicely.
- Args:
glossary (list[TerminusTechnicus]): The glossary to format.
Returns¶
str: The nicely formatted glossary.
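To illustrate what deduplicate_entries and format_nicely might do, here is a rough sketch under stated assumptions. Term is a standard-library stand-in for TerminusTechnicus, the plural handling is a naive trailing-"s" rule, similar-sounding terms are not handled at all, and the "term: definition" output layout is a guess. None of this is the package's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """Stand-in for TerminusTechnicus: a term and its definition."""
    term: str
    definition: str


def normalize(term: str) -> str:
    """Lowercase the term and strip a simple trailing plural 's'."""
    key = term.strip().lower()
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def deduplicate(glossary: list[Term]) -> list[Term]:
    """Keep the first entry for each normalized term, preserving order."""
    seen: set[str] = set()
    unique: list[Term] = []
    for entry in glossary:
        key = normalize(entry.term)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


def format_nicely(glossary: list[Term]) -> str:
    """Render one 'term: definition' line per entry (assumed layout)."""
    return "\n".join(f"{e.term}: {e.definition}" for e in glossary)


entries = [
    Term("Zeolite", "A microporous aluminosilicate."),
    Term("zeolites", "Microporous aluminosilicate minerals."),
    Term("MOF", "Metal-organic framework."),
]
print(format_nicely(deduplicate(entries)))
```

Keeping the first occurrence is one possible policy; keeping the longest definition would be an equally plausible choice.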
- class glossagen.pipelines.generate_glossary.KeepImportantTerms(*, termini_technici: list[TerminusTechnicus], important_terms: list[TerminusTechnicus])[source]¶
Bases:
Signature
Keep only the important terms from a list of termini technici.
- important_terms: list[TerminusTechnicus]¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'important_terms': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of important terms extracted from the termini technici.\n NEEDS to be abbreviations or very important terms.', '__dspy_field_type': 'output', 'prefix': 'Important Terms:'}), 'termini_technici': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.', '__dspy_field_type': 'input', 'prefix': 'Termini Technici:'})}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- termini_technici: list[TerminusTechnicus]¶
- class glossagen.pipelines.generate_glossary.TerminusTechnicus(*, term: str, definition: str)[source]¶
Bases:
BaseModel
A terminus technicus, i.e. a technical term in materials science and chemistry.
- definition: str¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'definition': FieldInfo(annotation=str, required=True, title='The definition of the technical term.'), 'term': FieldInfo(annotation=str, required=True, title='The technical term. Can also be an abbreviation.')}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- term: str¶
- class glossagen.pipelines.generate_glossary.Text2GlossarySignature(*, text: str, glossary: list[TerminusTechnicus])[source]¶
Bases:
Signature
Generating a list of termini technici from a text in materials science and chemistry.
- glossary: list[TerminusTechnicus]¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'glossary': FieldInfo(annotation=list[TerminusTechnicus], required=True, json_schema_extra={'desc': 'The list of termini technici extracted from the text.\n ONLY TAKE VERY INPORTANT TERMS, no general terms like Chemistry.', '__dspy_field_type': 'output', 'prefix': 'Glossary:'}), 'text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'The text to extract the termini technici from.', '__dspy_field_type': 'input', 'prefix': 'Text:'})}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- text: str¶
- glossagen.pipelines.generate_glossary.generate_glossary(document_directory: str, log_to_wandb_flag: bool = True) Any [source]¶
Generate a glossary based on a research document.
- Args:
document_directory (str): The directory where the research document is stored.
log_to_wandb_flag (bool): Whether to log the generated glossary to wandb. Defaults to True.
Returns¶
str: The generated glossary.
- glossagen.pipelines.generate_glossary.log_to_wandb(glossary: list[TerminusTechnicus], chunk_size: int, project_name: str = 'GlossaGen', config: Dict[Any, Any] | None = None) None [source]¶
Initialize wandb and log the generated glossary as a wandb.Table.
- Args:
glossary (list[TerminusTechnicus]): The list of glossary terms to log.
chunk_size (int): The size of the chunks the research document was split into.
project_name (str): The name of the wandb project.
config (dict): Configuration parameters for the wandb run.
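Logging a glossary as a wandb.Table requires turning the entries into column names plus rows. The sketch below shows only that conversion with the standard library; Term is a stand-in for TerminusTechnicus, and the helper name and the column names "Term" and "Definition" are assumptions rather than what log_to_wandb actually uses.

```python
from dataclasses import dataclass


@dataclass
class Term:
    """Stand-in for TerminusTechnicus: a term and its definition."""
    term: str
    definition: str


def glossary_to_table(glossary: list[Term]) -> tuple[list[str], list[list[str]]]:
    """Return (columns, rows) in the shape a two-column table expects."""
    columns = ["Term", "Definition"]  # assumed column names
    rows = [[entry.term, entry.definition] for entry in glossary]
    return columns, rows


columns, rows = glossary_to_table(
    [Term("MOF", "Metal-organic framework."), Term("XRD", "X-ray diffraction.")]
)
print(columns)
print(rows)

# With wandb installed, the result could be wrapped as
# wandb.Table(columns=columns, data=rows) and logged from a run
# started with wandb.init(project="GlossaGen", config=...).
```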
glossagen.pipelines.glossary_to_ontology module¶
Module to generate an ontology from a glossary.
- class glossagen.pipelines.glossary_to_ontology.Glossary2Labels(*, input_text: str, labels: list[OntologyEntityLabels])[source]¶
Bases:
Signature
Generating a list of entity labels (general overarching classes, e.g. Material, Property, …, Other) from glossary term and description pairs.
- input_text: str¶
- labels: list[OntologyEntityLabels]¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'labels': FieldInfo(annotation=list[OntologyEntityLabels], required=True, json_schema_extra={'desc': 'List of general labels categorizing the terms.', '__dspy_field_type': 'output', 'prefix': 'Labels:'})}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- class glossagen.pipelines.glossary_to_ontology.Glossary2Relations(*, input_text: str, relations: list[OntologyRelation])[source]¶
Bases:
Signature
Generating a list of relations (single verb) between entities from glossary term and description pairs.
- input_text: str¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'input_text': FieldInfo(annotation=str, required=True, json_schema_extra={'desc': 'Glossary term and description pairs, one by line.', '__dspy_field_type': 'input', 'prefix': 'Input Text:'}), 'relations': FieldInfo(annotation=list[OntologyRelation], required=True, json_schema_extra={'desc': '\n List of general relations between glossary terms for an ontology.\n Short terms or verbs describing the relation.', '__dspy_field_type': 'output', 'prefix': 'Relations:'})}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- relations: list[OntologyRelation]¶
- class glossagen.pipelines.glossary_to_ontology.Ontology(*, labels: Dict[str, str], relationships: List[str])[source]¶
Bases:
BaseModel
Represent an ontology with labels and relationships.
- labels: Dict[str, str]¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'labels': FieldInfo(annotation=Dict[str, str], required=True), 'relationships': FieldInfo(annotation=List[str], required=True)}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- relationships: List[str]¶
- class glossagen.pipelines.glossary_to_ontology.OntologyEntityLabels(*, label: str)[source]¶
Bases:
BaseModel
An ontology label, i.e. an entity label in materials science and chemistry.
- label: str¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'label': FieldInfo(annotation=str, required=True, title="A string 'label: description' describing a entity class.")}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- class glossagen.pipelines.glossary_to_ontology.OntologyGenerator(glossary: Dict[str, str])[source]¶
Bases:
object
A class that generates an ontology based on a glossary.
Attributes¶
research_doc (ResearchDoc): The research document to generate the glossary from. glossary_predictor (dspy.Predict): The predictor used to generate the glossary.
Methods¶
generate_glossary: Generates the glossary based on the research document.
- class glossagen.pipelines.glossary_to_ontology.OntologyRelation(*, relation: str)[source]¶
Bases:
BaseModel
Ontology relation, i.e. relation between two entities.
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[dict[str, FieldInfo]] = {'relation': FieldInfo(annotation=str, required=True, title='A relation type of the ontology.')}¶
Metadata about the fields defined on the model, as a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- relation: str¶
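Glossary2Labels and Glossary2Relations both take input_text described as glossary term and description pairs, one per line, while OntologyGenerator receives the glossary as a Dict[str, str]. A plausible way to bridge the two is sketched below; the "term: description" separator and the helper name glossary_to_input_text are assumptions, not taken from the module.

```python
def glossary_to_input_text(glossary: dict[str, str]) -> str:
    """Render each glossary entry as 'term: description' on its own line."""
    return "\n".join(
        f"{term}: {description}" for term, description in glossary.items()
    )


glossary = {
    "MOF": "Metal-organic framework.",
    "XRD": "X-ray diffraction.",
}
print(glossary_to_input_text(glossary))
```

Since dictionaries preserve insertion order, the lines appear in the order the terms were added to the glossary.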
glossagen.pipelines.knowledge_graph module¶
Pipeline to generate a knowledge graph from research documents.
glossagen.pipelines.latex_glossary module¶
Extract glossary from LaTeX document.